\chapter{Queries with SQL} \section{Overview of SQL Querying} The Structured Query Language (SQL) serves as the primary interface for interacting with relational database management systems. At its core, SQL is a declarative language, meaning that a programmer specifies what data should be retrieved rather than providing the step-by-step procedure for how to find it. This approach allows the database's internal query optimizer to determine the most efficient path to the data, leveraging indexes and specific storage structures. SQL is fundamentally set-based, manipulating entire relations as units rather than processing individual records in a procedural loop. This aligns closely with the mathematical foundations of relational algebra, where operations like selection, projection, and join act on sets or bags of tuples. The language is typically divided into two main components: the Data Definition Language (DDL) and the Data Manipulation Language (DML). DDL is concerned with the metadata and the structural level of the database, involving the creation, modification, and deletion of tables and columns. DML, which is the focus of this summary, operates at the record level. It allows for the insertion of new data, the updating of existing records, the deletion of information, and most importantly, the querying of the stored data. \dfn{SQL}{Structured Query Language, a declarative and set-based language used to define and manipulate data within a relational database management system.} \thm{Declarative Programming}{The principle where the programmer describes the desired result of a computation without specifying the control flow or algorithmic steps to achieve it.} \section{Fundamental Selection and Projection} The most basic form of a SQL query follows the select-from-where structure. This construct allows a user to extract specific attributes from one or more tables based on certain criteria. 
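As a minimal sketch of the select-from-where structure, the following Python snippet runs one such query against an in-memory SQLite database. The `Movies` table, its columns, and its rows are hypothetical and chosen only for illustration; they are not taken from the text.

```python
import sqlite3

# Hypothetical schema and data, used only to illustrate select-from-where.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT, year INTEGER, length INTEGER)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?)",
    [("Alpha", 1995, 120), ("Beta", 2001, 95), ("Gamma", 2010, 150)],
)

# SELECT names the output columns, FROM names the source relation,
# and WHERE keeps only the rows that satisfy the predicate.
# ORDER BY is added only to make the output order deterministic.
rows = conn.execute(
    "SELECT title, year FROM Movies WHERE length > 100 ORDER BY year"
).fetchall()
print(rows)
```

With this sample data the snippet prints `[('Alpha', 1995), ('Gamma', 2010)]`: the row for Beta is filtered out by the condition, and the length column is dropped by the select list.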
To understand this in the context of relational algebra, we can view the three main clauses as distinct algebraic operators. The FROM clause identifies the source relations; if multiple relations are listed, it effectively represents a Cartesian product. The WHERE clause functions as a selection operator ($\sigma$), filtering the tuples produced by the product based on a logical predicate. Finally, the SELECT clause acts as a projection operator ($\pi$), narrowing the result set to only the desired columns. In its simplest manifestation, a query can use the asterisk symbol (*) to denote that all columns from the source table should be included in the output. While this is useful for exploration, explicit projection is preferred in production environments to minimize data transfer and clarify the schema of the result set. Within the SELECT clause, we are not limited to just listing attributes; we can also include arithmetic expressions, such as calculating a value based on existing columns, or constants to provide context in the output.

\dfn{Query}{A formal request for information from a database, typically expressed in SQL to retrieve specific data matching a set of conditions.}

\thm{SQL to Relational Algebra Mapping}{A simple select-from-where query is equivalent to the relational algebra expression $\pi_L(\sigma_C(R_1 \times R_2 \times \cdots \times R_n))$, where $L$ is the select list, $C$ is the where condition, and $R_1, \ldots, R_n$ are the relations in the from list.}

\section{Renaming and Aliasing}
When a query is executed, the resulting relation has column headers that default to the names of the attributes in the source tables. However, there are many cases where these names may be ambiguous or uninformative, particularly when calculations are involved. SQL provides the `AS` keyword to assign an alias to a column or an expression. This allows the programmer to rename the output for better readability or to comply with the requirements of an application.
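The effect of `AS` on the column headers can be sketched with Python and an in-memory SQLite database. The `Movies` table and the alias names are hypothetical, introduced only for this illustration.

```python
import sqlite3

# Hypothetical one-row table, used only to show how AS changes headers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT, length INTEGER)")
conn.execute("INSERT INTO Movies VALUES ('Alpha', 120)")

# AS renames a plain column and gives a readable name to an expression.
cursor = conn.execute(
    "SELECT title AS name, length / 60.0 AS lengthInHours FROM Movies"
)

# cursor.description holds one entry per output column; item 0 is its name.
headers = [column[0] for column in cursor.description]
rows = cursor.fetchall()
print(headers)
print(rows)
```

The snippet prints `['name', 'lengthInHours']` followed by `[('Alpha', 2.0)]`, so both the renamed attribute and the computed expression carry the aliases rather than their default headers.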
Aliases can also be applied to relations in the FROM clause. These are referred to as tuple variables or correlation names. Aliasing a relation is essential when a query must compare different rows within the same table, a process known as a self-join. By assigning different aliases to the same table, the query can treat them as two distinct sources of data, enabling comparisons between tuples.

\dfn{Alias}{A temporary name assigned to a table or a column within the scope of a single SQL query to improve clarity or disambiguate references.}

\section{String Patterns and Comparison Operators}
SQL provides a robust set of comparison operators to filter data within the WHERE clause. These include equality (\texttt{=}), inequality (\texttt{<>}), and various ordering comparisons (\texttt{<}, \texttt{>}, \texttt{<=}, \texttt{>=}). While numeric comparisons are straightforward, string comparisons follow lexicographical order. For more flexible string matching, the `LIKE` operator is used with patterns. This operator allows for partial matches using two special wildcard characters: the percent sign (\%) and the underscore (\_). The percent sign matches any sequence of zero or more characters, while the underscore matches exactly one character. This is particularly useful for finding records where only a portion of a string is known or for identifying specific substrings.

\dfn{Predicate}{A logical expression that evaluates to true, false, or unknown, used in the WHERE clause to determine which tuples satisfy the query criteria.}

\thm{Lexicographical Comparison}{The method of ordering strings based on the alphabetical order of their component characters, where a string is "less than" another if it appears earlier in a dictionary.}

\section{Handling Incomplete Information with Null Values}
In real-world databases, it is common for certain pieces of information to be missing or inapplicable. SQL represents this missing data with a special marker called `NULL`.
It is important to recognize that `NULL` is not a value in the same way 0 or an empty string is; it is a placeholder indicating the absence of a value. Because `NULL` represents unknown data, comparisons involving `NULL` cannot result in a standard true or false. Instead, SQL employs a three-valued logic system that includes `UNKNOWN`. For example, if we compare a column containing a `NULL` to a constant, the result is `UNKNOWN`. To explicitly check for these placeholders, SQL provides the `IS NULL` and `IS NOT NULL` operators. Standard equality comparisons like `= NULL` will always evaluate to `UNKNOWN` and therefore fail to filter the desired records. \dfn{NULL}{A special marker in SQL used to indicate that a data value does not exist in the database, either because it is unknown or not applicable.} \thm{Three-Valued Logic}{A system of logic where expressions can evaluate to TRUE, FALSE, or UNKNOWN, requiring specialized truth tables for AND, OR, and NOT operations.} \section{Logic and Truth Tables in SQL} The presence of `UNKNOWN` values necessitates a clear understanding of how logical operators behave. When combining conditions with `AND`, the result is the minimum of the truth values, where TRUE is 1, UNKNOWN is 0.5, and FALSE is 0. Conversely, `OR` takes the maximum of the truth values. The `NOT` operator subtracts the truth value from 1. In the context of a WHERE clause, a tuple is only included in the final result set if the entire condition evaluates to `TRUE`. Tuples for which the condition is `FALSE` or `UNKNOWN` are excluded. This behavior can lead to unintuitive results, such as a query for "all records where X is 10 OR X is not 10" failing to return records where X is `NULL`, because the result of that OR operation would be `UNKNOWN`. 
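The numeric encoding of the truth values and the "X is 10 OR X is not 10" example can be checked with a short Python sketch. The helper functions and the table `T` are hypothetical names introduced here; the queries run against an in-memory SQLite database.

```python
import sqlite3

# Numeric encoding from the text: TRUE = 1, UNKNOWN = 0.5, FALSE = 0.
TRUE, UNKNOWN, FALSE = 1.0, 0.5, 0.0

def and3(a, b):
    """AND is the minimum of the two truth values."""
    return min(a, b)

def or3(a, b):
    """OR is the maximum of the two truth values."""
    return max(a, b)

def not3(a):
    """NOT subtracts the truth value from 1."""
    return 1.0 - a

# Hypothetical table with one NULL entry.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (x INTEGER)")
conn.executemany("INSERT INTO T VALUES (?)", [(10,), (20,), (None,)])

# For the NULL row both comparisons are UNKNOWN, so the OR is UNKNOWN
# and the row is excluded from the result.
tautology = conn.execute(
    "SELECT x FROM T WHERE x = 10 OR x <> 10 ORDER BY x"
).fetchall()

# '= NULL' is never TRUE; 'IS NULL' is the correct test.
equals_null = conn.execute("SELECT x FROM T WHERE x = NULL").fetchall()
is_null = conn.execute("SELECT x FROM T WHERE x IS NULL").fetchall()

print(tautology, equals_null, is_null)
```

The snippet prints `[(10,), (20,)] [] [(None,)]`. In the numeric encoding, the condition for the `NULL` row is the OR of two UNKNOWN values, which is 0.5 rather than 1, and that is why the row does not appear.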
\dfn{Truth Table}{A mathematical table used to determine the result of logical operations given all possible combinations of input truth values.}

\section{Multi-Relation Queries and the Cartesian Product}
When a query involves data spread across multiple tables, the FROM clause lists all the relevant relations. The logical starting point for such a query is the Cartesian product, which pairs every tuple from the first relation with every tuple from the second, and so on. This produces a potentially very large intermediate relation where each row represents a potential combination of the source data. To make this product useful, the WHERE clause must contain join conditions that link the relations based on common attributes. For instance, if we are joining a 'Movies' table with a 'Producers' table, we might equate the 'producerID' column in both. This filtering process discards the vast majority of the Cartesian product, leaving only the rows where the related data actually matches. When attributes in different tables share the same name, we use the dot notation (e.g., TableName.AttributeName) to disambiguate the references.

\thm{The Join-Selection Equivalence}{The principle that an equijoin can be logically expressed as a selection performed on a Cartesian product of relations; a natural join additionally requires a projection that removes the duplicate copies of the shared attributes.}

\section{Interpretation of Multi-Relation Queries}
There are multiple ways to interpret the execution of a query involving several relations. One helpful mental model is the "nested loops" approach. In this model, we imagine a loop for each relation in the FROM clause. The outermost loop iterates through every tuple of the first relation, and for each of those, the next loop iterates through the second relation, and so on. Inside the innermost loop, the WHERE condition is tested against the current combination of tuples. If the condition is met, the SELECT clause produces an output row. Another interpretation is based on parallel assignment.
In this view, we consider all possible assignments of tuples to the variables representing the relations. We then filter for those assignments that satisfy the condition. While the nested loop model is more algorithmic, the parallel assignment model highlights the declarative nature of the query, emphasizing that the order of the relations in the FROM clause should not, in theory, affect the result.

\dfn{Tuple Variable}{A variable that ranges over the tuples of a relation, often implicitly created for each table in the FROM clause or explicitly defined as an alias.}

\section{Set Operators and Bag Semantics}
SQL provides operators for the traditional set-theoretic actions: `UNION`, `INTERSECT`, and `EXCEPT`. These allow the results of two queries to be combined, provided they have the same schema (compatible attribute types and order). By default, these operators follow set semantics, meaning that they automatically eliminate duplicate tuples from the result. However, since SQL is fundamentally based on bags (multisets), it also provides versions of these operators that preserve duplicates using the `ALL` keyword. `UNION ALL` simply concatenates the two result sets. `INTERSECT ALL` produces a tuple as many times as it appears in both inputs (taking the minimum of the two counts). `EXCEPT ALL` produces a tuple as many times as it appears in the first input minus the number of times it appears in the second (taking the difference of the counts, or zero if that difference would be negative). Using bag semantics is often more efficient because the system does not need to perform the expensive work of sorting or hashing the data to find and remove duplicates.
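The counting rules for the `ALL` variants can be sketched with Python's `collections.Counter`, which models a bag as a mapping from elements to multiplicities. A `Counter` is used here instead of a database engine because not every SQL engine implements `INTERSECT ALL` and `EXCEPT ALL`; the two bags are hypothetical single-attribute relations.

```python
from collections import Counter

# Two hypothetical bags of single-attribute tuples.
r = Counter({"a": 3, "b": 1})
s = Counter({"a": 1, "b": 2, "c": 1})

# Bag semantics (the ALL variants).
union_all = r + s        # multiplicities add up
intersect_all = r & s    # minimum of the two multiplicities
except_all = r - s       # difference, never below zero

# Set semantics (the default variants): duplicates are eliminated.
union_set = set(r) | set(s)
intersect_set = set(r) & set(s)
except_set = set(r) - set(s)

print(union_all, intersect_all, except_all)
print(union_set, intersect_set, except_set)
```

With these inputs the bag difference keeps two copies of `a` (three minus one) and drops `b` entirely, whereas the set difference is empty because every element of the first input also occurs in the second.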
\dfn{Bag}{A collection of elements that allows for multiple occurrences of the same element, where the order of elements remains immaterial.}

\thm{Closure of Bag Operations}{The property that the result of any operation on bags is also a bag, ensuring that the relational model remains consistent through complex sequences of operations.}

\section{Nested Queries and Scalar Subqueries}
A subquery is a query nested within another query. Subqueries can appear in various parts of a SQL statement, including the WHERE, FROM, and HAVING clauses. A scalar subquery is one that returns exactly one row and one column, that is, a single value. Because it evaluates to a scalar, it can be used anywhere a constant or an attribute would be valid, such as in a comparison. If a scalar subquery is designed to find, for instance, the ID of the person with a given name, and the data actually contains two people with that name, the query will fail at runtime. The system expects a single value and cannot resolve the ambiguity. If the subquery returns no rows, it is treated as a `NULL`.

\dfn{Scalar}{A single atomic value, such as an integer or a string, as opposed to a collection of values like a row or a table.}

\section{Conditions on Relations: IN and EXISTS}
When a subquery returns a set of values rather than a single scalar, it can be used with relational operators like `IN`. The expression `x IN (subquery)` evaluates to true if the value of x is found in the result set produced by the subquery. This is a powerful way to filter data based on membership in a dynamically calculated set. The `EXISTS` operator is another tool for dealing with subqueries. It takes a subquery as an argument and returns true if the subquery returns at least one row. Unlike `IN`, `EXISTS` does not look at the actual values returned; it only checks for the existence of results. This is often used in correlated subqueries to check for the presence of related records in another table.
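A short sketch of both operators, using Python with an in-memory SQLite database. The `Producers` and `Movies` tables and their contents are hypothetical and serve only to illustrate the two conditions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data: producer 3 has no movie.
conn.executescript("""
    CREATE TABLE Producers (producerID INTEGER, name TEXT);
    CREATE TABLE Movies (title TEXT, producerID INTEGER);
    INSERT INTO Producers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Movies VALUES ('Alpha', 1), ('Beta', 1), ('Gamma', 2);
""")

# IN: keep producers whose ID is a member of the subquery result.
via_in = conn.execute("""
    SELECT name FROM Producers
    WHERE producerID IN (SELECT producerID FROM Movies)
    ORDER BY name
""").fetchall()

# EXISTS: keep producers for which the correlated subquery returns
# at least one row; the selected values themselves are irrelevant.
via_exists = conn.execute("""
    SELECT name FROM Producers AS p
    WHERE EXISTS (SELECT * FROM Movies AS m WHERE m.producerID = p.producerID)
    ORDER BY name
""").fetchall()

# NOT EXISTS: producers with no related movie at all.
without_movies = conn.execute("""
    SELECT name FROM Producers AS p
    WHERE NOT EXISTS (SELECT * FROM Movies AS m WHERE m.producerID = p.producerID)
""").fetchall()

print(via_in, via_exists, without_movies)
```

On this data the first two queries both return Ann and Bob, and the third returns only Cy. Ann appears once even though she produced two movies, because both conditions test membership or existence rather than pairing rows as a join would.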
\thm{Existence Quantification}{The logical principle of checking whether there is at least one element in a set that satisfies a given property, implemented in SQL via the EXISTS operator.} \section{Correlated Subqueries and Scoping} A correlated subquery is a nested query that refers to attributes of the outer query. Because of this dependency, the subquery must, in concept, be re-evaluated for every row processed by the outer query. This creates a link between the two levels of the query, allowing for complex logic like "find all employees whose salary is higher than the average salary in their specific department." Scoping rules in SQL dictate how attribute names are resolved. An attribute in a subquery will first be looked for in the tables mentioned in that subquery's own FROM clause. If it is not found there, the system looks at the FROM clause of the next level out, and so on. If the same attribute name appears in multiple levels, we must use aliases to ensure the correct column is referenced. Correlated subqueries are often more expressive than simple joins but can be more computationally expensive if the optimizer cannot unnest them into a join. \dfn{Correlated Subquery}{A subquery that depends on the current row being processed by the outer query, identified by references to attributes defined in the outer scope.} \section{Join Expressions and Syntax Variants} While the select-from-where structure can express most joins, SQL also provides explicit join syntax. A `CROSS JOIN` is a direct representation of the Cartesian product. A `JOIN ... ON` allows the join condition to be specified explicitly in the FROM clause, which many developers find clearer than placing the condition in the WHERE clause. A `NATURAL JOIN` is a specialized form of join that automatically equates all columns with the same name in both tables and removes the redundant copies of those columns. While natural joins are concise, they can be risky because they depend on attribute names. 
If a schema change adds a column to one table that happens to share a name with a column in another, the natural join logic will change automatically and potentially break the query. The `USING` clause provides a middle ground, allowing the user to specify exactly which common columns should be used for the join. \dfn{Natural Join}{A join operation that matches tuples based on all attributes that have the same name in both relations, producing a result that contains only one copy of each common attribute.} \section{Outer Joins and Data Preservation} In a standard (inner) join, tuples that do not have a match in the other table are discarded. These are called "dangling tuples." If we wish to preserve these tuples in our result set, we use an `OUTER JOIN`. There are three types: `LEFT`, `RIGHT`, and `FULL`. A `LEFT OUTER JOIN` includes all tuples from the left table; if a tuple has no match in the right table, the columns from the right table are filled with `NULL`. The `RIGHT OUTER JOIN` is symmetric, preserving all rows from the right table. A `FULL OUTER JOIN` preserves all rows from both tables, ensuring that no information from either source is lost. Outer joins are essential when we need a comprehensive list of items, even if some of those items lack certain related data. \dfn{Dangling Tuple}{A tuple in one relation that does not match any tuple in another relation based on the join criteria.} \thm{The Outer Join Property}{The guarantee that all tuples of the specified operand relations will be represented in the result, with NULL values used to fill in missing components for non-matching rows.} \section{Aggregation and Data Summarization} SQL includes several built-in functions to perform calculations across entire columns of data. These are known as aggregation operators. The five standard operators are `SUM`, `AVG`, `MIN`, `MAX`, and `COUNT`. 
`SUM` and `AVG` can only be applied to numeric data, while `MIN` and `MAX` can also be applied to strings (using lexicographical order) or dates. The `COUNT` operator is versatile; `COUNT(*)` counts every row in a table, while `COUNT(attribute)` counts only the non-null values in that specific column. If we wish to count only the unique values, we can use the `DISTINCT` keyword inside the aggregation, such as `COUNT(DISTINCT studioName)`. It is vital to remember that all aggregations except for `COUNT` return `NULL` if they are applied to an empty set of values. `COUNT` returns 0 for an empty set. \dfn{Aggregation}{The process of summarizing multiple values into a single value through functions like summation or averaging.} \section{Grouping and Partitioning} The `GROUP BY` clause allows us to partition the rows of a relation into groups based on their values in one or more attributes. When a query contains a `GROUP BY` clause, the SELECT clause is limited in what it can contain. Every attribute listed in the SELECT clause must either be an attribute used for grouping or be part of an aggregate function. Conceptually, the system first creates the groups and then applies the aggregate functions to each group independently. The result is a single row for each unique combination of values in the grouping attributes. This is the primary way to generate reports and statistics, such as "the total number of movies produced by each studio per year." \dfn{Grouping Attribute}{An attribute used in the GROUP BY clause to define the partitions upon which aggregation functions will operate.} \section{Post-Aggregation Filtering with HAVING} Sometimes we want to filter the results of a query based on an aggregate value. However, the WHERE clause is evaluated before any grouping or aggregation takes place. Therefore, we cannot use a condition like `WHERE SUM(length) > 500`. To solve this, SQL provides the `HAVING` clause. 
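The aggregation rules, grouping, and the role of `HAVING` can be sketched together in Python with an in-memory SQLite database. The `Movies` table, the studio names, and the lengths are hypothetical values chosen only so that the results are easy to verify by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data; one length is NULL to show how it is ignored.
conn.execute("CREATE TABLE Movies (title TEXT, studioName TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?)",
    [
        ("Alpha", "StudioA", 300),
        ("Beta", "StudioA", 250),
        ("Gamma", "StudioB", 120),
        ("Delta", "StudioB", None),
    ],
)

# One output row per group: COUNT(*) counts rows, COUNT(length) counts
# non-null lengths, and SUM(length) skips the NULL.
totals = conn.execute("""
    SELECT studioName, COUNT(*), COUNT(length), SUM(length)
    FROM Movies
    GROUP BY studioName
    ORDER BY studioName
""").fetchall()

# HAVING filters whole groups by an aggregate value.
long_studios = conn.execute("""
    SELECT studioName, SUM(length)
    FROM Movies
    GROUP BY studioName
    HAVING SUM(length) > 500
""").fetchall()

# Aggregating an empty set: SUM yields NULL (None), COUNT yields 0.
empty_sum, empty_count = conn.execute(
    "SELECT SUM(length), COUNT(*) FROM Movies WHERE length > 1000"
).fetchone()

print(totals, long_studios, empty_sum, empty_count)
```

With this data the grouped query yields `('StudioA', 2, 2, 550)` and `('StudioB', 2, 1, 120)`, only StudioA survives the `HAVING` condition, and the aggregates over the empty selection are `None` and `0`.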
The `HAVING` clause is evaluated after the groups have been formed and the aggregations have been calculated. It allows the programmer to specify conditions that apply to the group as a whole. Only the groups that satisfy the `HAVING` condition will appear in the final output. While `HAVING` can technically contain any condition, it is best practice to only use it for conditions involving aggregates, leaving all tuple-level filtering to the WHERE clause. \dfn{HAVING}{A clause in SQL used to specify conditions that filter groups of rows created by the GROUP BY clause, typically involving aggregate functions.} \thm{Query Execution Order}{The logical sequence of operations in a SQL query: FROM (and JOINs), then WHERE, then GROUP BY, then HAVING, and finally SELECT (and DISTINCT) and ORDER BY.} \section{Ordering and Sorting the Result} The final step in many queries is to present the data in a specific order for the user. The `ORDER BY` clause facilitates this, allowing for sorting by one or more columns in either ascending (`ASC`) or descending (`DESC`) order. Sorting is the last operation performed before the data is returned; even if a column is not projected in the SELECT clause, it can still be used for sorting if it was available in the source tables. If multiple columns are listed in the `ORDER BY` clause, the system sorts by the first column first. If there are ties, it uses the second column to break them, and so on. This ensures a deterministic and readable presentation of the retrieved information. \dfn{Sorting}{The process of arranging the rows of a result set in a specific sequence based on the values of one or more attributes.} \section{Extended Projection and Constants} The extended projection operator allows for more than just choosing columns. It enables the use of expressions that combine attributes or apply functions to them. In SQL, this is manifested in the SELECT list, where we can perform additions, concatenations, or even call stored functions. 
Constants are also frequently used in the SELECT list. For example, a query might use the select list `'Movie', title, year`. Every resulting row would then have the string literal `'Movie'` as its first column. This is often used to label different parts of a union or to provide fixed formatting for an external application.

\thm{Functional Dependency in Aggregation}{The rule that in a grouped query, any attribute in the SELECT list that is not aggregated must be functionally determined by the grouping attributes to ensure the result is well-defined.}

\section{Nested Queries in the FROM Clause}
SQL allows a subquery to be placed in the FROM clause. In this case, the subquery acts as a temporary table that exists only for the duration of the outer query. This is particularly useful when we need to perform multiple levels of aggregation or when we want to join a table with a summarized version of itself. When a subquery is used in the FROM clause, it must be assigned an alias. This alias allows the outer query to refer to the columns produced by the subquery. This technique is often a cleaner alternative to using complex correlated subqueries in the WHERE clause, as it makes the flow of data more explicit.

\dfn{Derived Table}{A temporary result set returned by a subquery in the FROM clause, which is then used by the outer query as if it were a physical table.}

\section{Summary of Advanced SQL Syntax}
Throughout our exploration of Chapter 6 and the accompanying presentation, we have seen that SQL is far more than a simple tool for data retrieval. Its ability to nest logic, perform complex aggregations across partitioned data, and handle various join types allows it to solve sophisticated data analysis problems. The transition from the mathematical abstractions of relational algebra to the practical syntax of SQL reveals how each keyword serves a specific logical function in the data-processing pipeline.
By understanding the declarative nature of the language and the underlying bag semantics, developers can write queries that are not only correct but also efficient. The careful management of NULLs, the strategic use of subqueries, and the mastery of the `GROUP BY` and `HAVING` clauses form the foundation of expert database programming. This summary has detailed the syntax and the theoretical justifications for the most critical features of SQL querying, providing a roadmap for complex data manipulation.

\thm{The Universal Query Form}{The select-from-where block is the universal building block of SQL; combined with the set operators and subqueries, it can express any operation that can be represented by the core operators of relational algebra.}