\chapter{Queries with SQL}

Structured Query Language (SQL) serves as the primary interface for interacting with relational databases. While the Data Definition Language (DDL) handles the creation and modification of database structures, the Data Manipulation Language (DML) is used for the retrieval and modification of records. The querying aspect of SQL is essentially a high-level, declarative implementation of relational algebra. Because SQL is declarative, users specify the desired properties of the result set rather than the procedural steps required to compute it. This allows the database management system (DBMS) to utilize a query optimizer to determine the most efficient execution strategy, known as a query plan.

SQL engines operate on whole relations at a time rather than on individual records, and by default they follow bag semantics. Under bag semantics, relations are treated as multisets where duplicate records are permitted, contrasting with the strict set theory used in pure relational algebra. SQL queries are processed by a query compiler that translates the high-level syntax into a tree of algebraic operators, such as selection, projection, join, and grouping.

\dfn{Declarative Language}{A programming paradigm in which the programmer defines what the result should look like (the logic of the computation) without describing its control flow (the procedural steps).}

\thm{Query Plan}{A structured sequence of internal operations, often represented as a tree of relational algebra operators, that the DBMS execution engine follows to produce the results of a query.}
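
As a rough illustration of the split between a declarative query and its plan, the following Python sketch uses the standard-library \texttt{sqlite3} module and SQLite's \texttt{EXPLAIN QUERY PLAN} statement. The \texttt{Movies} table and its index are hypothetical names chosen for this example, and the wording of the reported plan differs between SQLite versions and between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, used only for illustration.
conn.execute("CREATE TABLE Movies (title TEXT, year INTEGER, length INTEGER)")
conn.execute("CREATE INDEX idx_movies_year ON Movies(year)")

# The query states WHAT is wanted; the plan describes HOW SQLite intends to compute it.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM Movies WHERE year = 1999"
).fetchall()

for row in plan_rows:
    # The last column holds a human-readable description of one plan step.
    print(row[-1])
```

The same query text can yield a different plan if the index is dropped, which is the optimizer's decision rather than the user's.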

\section{Basic Query Structure: SELECT-FROM-WHERE}

The fundamental building block of a SQL query is the select-from-where expression. Its three clauses correspond to three basic steps of relational algebra: choosing the input relations (\texttt{FROM}), selecting tuples (\texttt{WHERE}), and projecting attributes (\texttt{SELECT}).

The \texttt{FROM} clause identifies the relations (tables) from which data is to be retrieved. This is conceptually the first step of the query, as it defines the scope of the data. The \texttt{WHERE} clause specifies a predicate used to filter the tuples. Only records that satisfy this logical condition are passed to the next stage. Finally, the \texttt{SELECT} clause identifies which attributes (columns) should be returned in the output. This is equivalent to the projection operator ($\pi$) in relational algebra. If a user wishes to retrieve all columns, a wildcard asterisk (*) is used.

\thm{SELECT-FROM-WHERE Mapping}{A basic SQL query of the form \texttt{SELECT L FROM R WHERE C} is equivalent to the relational algebra expression $\pi_{L}(\sigma_{C}(R))$.}

\nt{In SQL, the select-list can include not only existing attributes but also constants and computed expressions, functioning like an extended projection.}
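
A minimal sketch of a select-from-where query, run through Python's standard-library \texttt{sqlite3} module. The \texttt{Movies} table and its rows are invented for illustration. The select-list mixes a stored attribute, a computed expression, and a constant, as in an extended projection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data.
conn.execute("CREATE TABLE Movies (title TEXT, year INTEGER, length INTEGER)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?)",
    [("Alpha", 1995, 120), ("Beta", 2005, 90), ("Gamma", 2010, 150)],
)

# FROM names the relation, WHERE keeps the matching tuples, SELECT projects.
rows = sorted(
    conn.execute(
        """
        SELECT title, length / 60.0 AS hours, 'hours' AS unit
        FROM Movies
        WHERE year > 2000
        """
    ).fetchall()
)  # sorted in Python, because the query itself imposes no order

print(rows)
```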

\section{Logic, Comparisons, and Three-Valued Logic}

Filters in the \texttt{WHERE} clause are constructed using comparison operators such as equality (\texttt{=}), inequality (\texttt{<>}, or the widely supported but non-standard \texttt{!=}), and range comparisons (\texttt{<}, \texttt{>}, \texttt{<=}, \texttt{>=}). SQL also supports pattern matching for strings through the \texttt{LIKE} operator, where the percent sign (\texttt{\%}) matches any sequence of zero or more characters and the underscore (\texttt{\_}) matches exactly one character.
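
The two \texttt{LIKE} wildcards can be tried out with a short Python sketch using the standard-library \texttt{sqlite3} module. The table, the titles, and the helper function are hypothetical. Note that SQLite's \texttt{LIKE} is case-insensitive for ASCII letters by default, which is not true of every system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data.
conn.execute("CREATE TABLE Movies (title TEXT)")
conn.executemany(
    "INSERT INTO Movies VALUES (?)",
    [("Star Wars",), ("Star Trek",), ("Stargate",), ("Solaris",)],
)

def titles_like(pattern):
    """Return the sorted titles that match the given LIKE pattern."""
    rows = conn.execute("SELECT title FROM Movies WHERE title LIKE ?", (pattern,))
    return sorted(title for (title,) in rows)

prefix_matches = titles_like("Star %")     # % matches any run of characters
single_char = titles_like("Star _ars")     # _ matches exactly one character
starts_and_ends = titles_like("S%s")       # starts with S and ends with s

print(prefix_matches, single_char, starts_and_ends)
```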

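A critical aspect of SQL logic is the treatment of \texttt{NULL} values. Because a \texttt{NULL} represents an unknown or missing value, comparing anything to \texttt{NULL} with an ordinary comparison operator (even another \texttt{NULL}) results in a truth value of \texttt{UNKNOWN}. This necessitates a three-valued logic system. To test explicitly for missing values, SQL provides the dedicated predicates \texttt{IS NULL} and \texttt{IS NOT NULL}.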

\dfn{Three-Valued Logic}{A logical framework where expressions can evaluate to TRUE, FALSE, or UNKNOWN, specifically required to handle comparisons involving NULL values.}

The behavior of logical operators under three-valued logic follows specific rules:

\begin{itemize}
	\item \textbf{AND}: The result is TRUE only if both operands are TRUE. If either operand is FALSE, the result is FALSE regardless of the other. In the remaining cases (TRUE with UNKNOWN, or UNKNOWN with UNKNOWN), the result is UNKNOWN.
	\item \textbf{OR}: The result is TRUE if at least one operand is TRUE, regardless of the other. It is FALSE only if both operands are FALSE. In the remaining cases (FALSE with UNKNOWN, or UNKNOWN with UNKNOWN), the result is UNKNOWN.
	\item \textbf{NOT}: TRUE and FALSE are swapped as usual, while the negation of UNKNOWN remains UNKNOWN.
\end{itemize}

\nt{The \texttt{WHERE} clause only retains tuples for which the predicate evaluates to TRUE. Records that evaluate to FALSE or UNKNOWN are filtered out.}
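
The rules above can be checked with a small Python sketch using the standard-library \texttt{sqlite3} module. SQLite reports TRUE as 1, FALSE as 0, and UNKNOWN as \texttt{NULL}, which Python shows as \texttt{None}. The \texttt{Movies} table is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

truth = conn.execute(
    """
    SELECT NULL = 1,          -- comparison with NULL: UNKNOWN
           (1 = 1) AND NULL,  -- TRUE AND UNKNOWN: UNKNOWN
           (1 = 0) AND NULL,  -- FALSE AND UNKNOWN: FALSE
           (1 = 1) OR NULL,   -- TRUE OR UNKNOWN: TRUE
           (1 = 0) OR NULL,   -- FALSE OR UNKNOWN: UNKNOWN
           NOT (NULL = 1)     -- NOT UNKNOWN: UNKNOWN
    """
).fetchone()
print(truth)

# Hypothetical table with one missing length.
conn.execute("CREATE TABLE Movies (title TEXT, length INTEGER)")
conn.executemany("INSERT INTO Movies VALUES (?, ?)", [("Alpha", 120), ("Beta", None)])

# 'Beta' appears in neither of the first two results: its predicate is UNKNOWN both times.
long_movies = conn.execute("SELECT title FROM Movies WHERE length > 100").fetchall()
not_long = conn.execute("SELECT title FROM Movies WHERE NOT (length > 100)").fetchall()
missing = conn.execute("SELECT title FROM Movies WHERE length IS NULL").fetchall()
print(long_movies, not_long, missing)
```

A predicate and its negation therefore do not always cover every row, which is a common source of surprising query results.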

\section{Ordering and Limiting Results}

While relational algebra results are conceptually unordered sets or bags, SQL allows users to impose a specific order on the output using the \texttt{ORDER BY} clause. Sorting can be performed in ascending (\texttt{ASC}, the default) or descending (\texttt{DESC}) order. Multiple columns can be specified to handle ties.

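Furthermore, SQL provides mechanisms to limit the size of the result set, which is particularly useful for performance and pagination. The \texttt{LIMIT} clause restricts the number of rows returned, while the \texttt{OFFSET} clause skips a specified number of rows before beginning to return results. \texttt{LIMIT} is not part of the SQL standard, which uses \texttt{FETCH FIRST} instead, but it is supported by many systems, including PostgreSQL, MySQL, and SQLite. Both clauses are only meaningful in combination with \texttt{ORDER BY}, since the order of rows is otherwise unspecified.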

\thm{List Semantics}{When an \texttt{ORDER BY} clause is applied, the result set is treated as a list rather than a bag, meaning the sequence of records is guaranteed and meaningful for the application.}
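
A short sketch of sorting and pagination, again using Python's standard-library \texttt{sqlite3} module with an invented \texttt{Movies} table. The second sort key breaks ties on the first, and \texttt{LIMIT} with \texttt{OFFSET} picks out one page of the ordered list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data.
conn.execute("CREATE TABLE Movies (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?)",
    [("Delta", 2001), ("Alpha", 1995), ("Gamma", 2001), ("Beta", 2010), ("Epsilon", 1995)],
)

# Newest first; ties on year are broken alphabetically by title.
ordered = conn.execute(
    "SELECT title, year FROM Movies ORDER BY year DESC, title ASC"
).fetchall()

# Second page of two rows: skip the first two rows, then return the next two.
page_two = conn.execute(
    "SELECT title, year FROM Movies ORDER BY year DESC, title ASC LIMIT 2 OFFSET 2"
).fetchall()

print(ordered)
print(page_two)
```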

\section{Multi-Relation Queries and Joins}

SQL allows queries to involve multiple relations by listing them in the \texttt{FROM} clause. When multiple tables are listed without a joining condition, the result is a Cartesian product, where every tuple from the first relation is paired with every tuple from the second.

To perform meaningful combinations, join conditions must be specified. These conditions link related data across tables, typically by equating a primary key in one table with a foreign key in another. If attribute names are identical across tables, they must be disambiguated using the table name or a tuple variable (alias).

\dfn{Tuple Variable (Alias)}{A temporary name assigned to a table in the \texttt{FROM} clause, used to shorten queries, disambiguate column references, or allow a table to be joined with itself (self-join).}

SQL provides explicit join syntax as an alternative to the comma-separated list in the \texttt{FROM} clause:

\begin{itemize}
	\item \textbf{CROSS JOIN}: Produces the Cartesian product.
	\item \textbf{INNER JOIN}: Returns only the tuples that satisfy the join condition.
	\item \textbf{NATURAL JOIN}: Automatically joins tables based on all columns with matching names and removes the redundant duplicate column.
	\item \textbf{OUTER JOIN}: Preserves ``dangling tuples'' that do not have a match in the other relation, padding the missing values with \texttt{NULL}. These come in \texttt{LEFT}, \texttt{RIGHT}, and \texttt{FULL} varieties.
\end{itemize}

\nt{The \texttt{USING} clause is a safer alternative to \texttt{NATURAL JOIN} as it allows the user to explicitly specify which columns with shared names should be used for the join, preventing accidental matches on unrelated columns.}
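
The join variants can be compared side by side in a small Python sketch using the standard-library \texttt{sqlite3} module. The \texttt{Studios} and \texttt{Movies} tables are hypothetical. Only a left outer join is shown, since older SQLite versions do not support \texttt{RIGHT} and \texttt{FULL} outer joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: 'Gamma' has no studio, and 'Disney' has no movies.
conn.executescript(
    """
    CREATE TABLE Studios (studio_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Movies  (title TEXT, studio_id INTEGER REFERENCES Studios(studio_id));
    INSERT INTO Studios VALUES (1, 'Fox'), (2, 'Disney');
    INSERT INTO Movies  VALUES ('Alpha', 1), ('Beta', 1), ('Gamma', NULL);
    """
)

# Cartesian product: 3 movies x 2 studios.
cross_count = conn.execute("SELECT COUNT(*) FROM Movies CROSS JOIN Studios").fetchone()[0]

# Comma-separated FROM list with the join condition in WHERE, using tuple variables m and s.
comma_join = conn.execute(
    "SELECT m.title, s.name FROM Movies m, Studios s "
    "WHERE m.studio_id = s.studio_id ORDER BY m.title"
).fetchall()

# Explicit inner join; USING names the shared column explicitly.
using_join = conn.execute(
    "SELECT title, name FROM Movies JOIN Studios USING (studio_id) ORDER BY title"
).fetchall()

# Left outer join keeps the dangling movie and pads the studio name with NULL.
left_join = conn.execute(
    "SELECT m.title, s.name FROM Movies m "
    "LEFT OUTER JOIN Studios s ON m.studio_id = s.studio_id ORDER BY m.title"
).fetchall()

print(cross_count, comma_join, using_join, left_join)
```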

\section{Subqueries and Nesting}

SQL is highly compositional: because the result of a query is itself a relation, queries can be nested within other queries. A subquery can appear in the \texttt{WHERE}, \texttt{FROM}, or \texttt{SELECT} clauses.

Subqueries that return a single row and a single column are called scalar subqueries and can be used anywhere a constant is expected. Subqueries that return a single column (a list of values) can be used with operators like \texttt{IN}, \texttt{ANY}, or \texttt{ALL}. The \texttt{EXISTS} operator is used to check if a subquery returns any results at all.

\thm{Correlated Subquery}{A subquery that references an attribute from the outer query. It conceptually requires the subquery to be evaluated once for every row processed by the outer query.}
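
The following Python sketch, using the standard-library \texttt{sqlite3} module, shows a scalar subquery, an \texttt{IN} subquery, and a correlated \texttt{EXISTS} subquery. The \texttt{Movies} and \texttt{StarsIn} tables are hypothetical. \texttt{ANY} and \texttt{ALL} are left out because SQLite does not implement them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data.
conn.executescript(
    """
    CREATE TABLE Movies  (title TEXT, year INTEGER, length INTEGER);
    CREATE TABLE StarsIn (movie_title TEXT, star_name TEXT);
    INSERT INTO Movies VALUES
        ('Alpha', 1995, 120), ('Beta', 1995, 90),
        ('Gamma', 2010, 150), ('Delta', 2010, 130);
    INSERT INTO StarsIn VALUES ('Beta', 'Ann'), ('Gamma', 'Bob'), ('Gamma', 'Cy');
    """
)

def titles(sql):
    return [title for (title,) in conn.execute(sql)]

# Scalar subquery: the inner query yields one value (the average length, 122.5).
above_average = titles(
    "SELECT title FROM Movies "
    "WHERE length > (SELECT AVG(length) FROM Movies) ORDER BY title"
)

# IN subquery: the inner query yields a single column of values.
with_star = titles(
    "SELECT title FROM Movies "
    "WHERE title IN (SELECT movie_title FROM StarsIn) ORDER BY title"
)

# Correlated subquery: the inner query refers to m, the current row of the outer query.
longest_in_year = titles(
    "SELECT title FROM Movies m WHERE NOT EXISTS ("
    "  SELECT 1 FROM Movies other"
    "  WHERE other.year = m.year AND other.length > m.length"
    ") ORDER BY title"
)

print(above_average, with_star, longest_in_year)
```

Conceptually the correlated subquery is re-evaluated for each outer row, although an optimizer is free to rewrite it into a join.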

\section{Duplicate Elimination and Set Operations}

Because SQL defaults to bag semantics, it often produces duplicate rows. The \texttt{DISTINCT} keyword can be added to the \texttt{SELECT} clause to force set semantics by removing these duplicates.

SQL also supports standard set operations: \texttt{UNION}, \texttt{INTERSECT}, and \texttt{EXCEPT} (called \texttt{MINUS} in some systems, such as Oracle). By default, these operations eliminate duplicates. If bag semantics are desired, the \texttt{ALL} keyword must be appended (e.g., \texttt{UNION ALL}).

\nt{Duplicate elimination is a computationally expensive operation because it requires sorting or hashing the entire result set to identify matching tuples.}
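
The difference between bag and set semantics can be observed in a short Python sketch using the standard-library \texttt{sqlite3} module. The single-column relations \texttt{R} and \texttt{S} are invented for illustration. SQLite supports \texttt{UNION ALL} but not \texttt{INTERSECT ALL} or \texttt{EXCEPT ALL}, so only the former is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical relations: R = {1, 1, 2, 3} and S = {1, 3, 3, 4} as bags.
conn.executescript(
    """
    CREATE TABLE R (x INTEGER);
    CREATE TABLE S (x INTEGER);
    INSERT INTO R VALUES (1), (1), (2), (3);
    INSERT INTO S VALUES (1), (3), (3), (4);
    """
)

def values(sql):
    # Sorted in Python so the printed lists are easy to compare.
    return sorted(x for (x,) in conn.execute(sql))

bag = values("SELECT x FROM R")                                  # duplicates kept
as_set = values("SELECT DISTINCT x FROM R")                      # duplicates removed
union = values("SELECT x FROM R UNION SELECT x FROM S")          # set union
union_all = values("SELECT x FROM R UNION ALL SELECT x FROM S")  # bag union
intersect = values("SELECT x FROM R INTERSECT SELECT x FROM S")
except_ = values("SELECT x FROM R EXCEPT SELECT x FROM S")

print(bag, as_set, union, union_all, intersect, except_)
```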

\section{Aggregation and Grouping}

Aggregation allows users to summarize large volumes of data into single representative values. SQL provides five standard aggregate functions: \texttt{COUNT}, \texttt{SUM}, \texttt{AVG}, \texttt{MIN}, and \texttt{MAX}.

The \texttt{GROUP BY} clause partitions the data into groups based on the values of one or more columns. Aggregate functions are then applied to each group independently. A critical restriction exists when using grouping: any column appearing in the \texttt{SELECT} list that is not part of an aggregate function must be included in the \texttt{GROUP BY} clause.

\thm{The Aggregation Rule}{In a query using grouping, the output can only consist of the attributes used for grouping and the results of aggregate functions applied to the groups.}

For filtering data after it has been aggregated, SQL uses the \texttt{HAVING} clause. Unlike \texttt{WHERE}, which filters individual rows before they are grouped, \texttt{HAVING} filters the groups themselves based on aggregate properties.

\dfn{Aggregate Function}{A function that takes a collection of values as input and returns a single value as a summary, such as a total or an average.}
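
A minimal sketch of aggregation, grouping, and the difference between \texttt{WHERE} and \texttt{HAVING}, using Python's standard-library \texttt{sqlite3} module. The \texttt{Movies} table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data.
conn.execute("CREATE TABLE Movies (title TEXT, studio TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?)",
    [
        ("Alpha", "Fox", 120), ("Beta", "Fox", 90), ("Eta", "Fox", 60),
        ("Gamma", "Disney", 150),
        ("Delta", "MGM", 100), ("Epsilon", "MGM", 120),
    ],
)

# Without GROUP BY, the aggregates summarize the whole table in one row.
overall = conn.execute(
    "SELECT COUNT(*), SUM(length), MIN(length), MAX(length) FROM Movies"
).fetchone()

# WHERE drops short movies before grouping; HAVING drops small groups afterwards.
# The select-list contains only the grouping attribute and aggregates.
per_studio = conn.execute(
    """
    SELECT studio, COUNT(*) AS n, AVG(length) AS avg_length
    FROM Movies
    WHERE length >= 90
    GROUP BY studio
    HAVING COUNT(*) >= 2
    ORDER BY studio
    """
).fetchall()

print(overall)
print(per_studio)
```

Here \texttt{Eta} is removed by \texttt{WHERE} before any group is formed, while the Disney group is removed by \texttt{HAVING} because only one of its rows survives.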

\section{Advanced Table Expressions and Common Table Expressions}

To improve query readability and maintainability, SQL provides mechanisms to define temporary relations within a single query. The \texttt{VALUES} clause can be used to construct a constant table on the fly. More importantly, the \texttt{WITH} clause allows for the definition of Common Table Expressions (CTEs).

\thm{Common Table Expression (CTE)}{A temporary named result set that exists only within the scope of a single query, providing a way to decompose complex queries into smaller, logical steps.}
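
A small sketch combining \texttt{VALUES} and \texttt{WITH}, run through Python's standard-library \texttt{sqlite3} module. The names \texttt{Ratings} and \texttt{Good} and the rating data are invented for illustration. The first CTE builds a constant table, and the second one refines it in a separate, named step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

well_rated = conn.execute(
    """
    WITH Ratings(title, stars) AS (
        -- Constant table constructed on the fly.
        VALUES ('Alpha', 4), ('Beta', 2), ('Gamma', 5)
    ),
    Good AS (
        -- A later CTE may refer to an earlier one.
        SELECT title FROM Ratings WHERE stars >= 4
    )
    SELECT title FROM Good ORDER BY title
    """
).fetchall()

print(well_rated)
```

Neither \texttt{Ratings} nor \texttt{Good} exists once the query has finished.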

CTEs can also be recursive, allowing SQL to perform operations that are impossible in standard relational algebra, such as computing the transitive closure of a graph (e.g., finding all reachable cities in a flight network).

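\nt{A \texttt{WITH RECURSIVE} definition typically consists of a base case (non-recursive query) and a recursive step, combined by a \texttt{UNION} or \texttt{UNION ALL} operator. With \texttt{UNION}, rows that were already produced are not added again, which helps the recursion terminate on cyclic data.}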
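
The flight-network example can be sketched as follows with Python's standard-library \texttt{sqlite3} module. The \texttt{Flights} table, the airport codes, and the CTE name \texttt{Reach} are hypothetical. The network contains a cycle, and \texttt{UNION} (rather than \texttt{UNION ALL}) is used so that cities already found are not added again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flight network with a cycle SF -> DEN -> CHI -> NY -> SF.
conn.executescript(
    """
    CREATE TABLE Flights (origin TEXT, destination TEXT);
    INSERT INTO Flights VALUES
        ('SF', 'DEN'), ('DEN', 'CHI'), ('CHI', 'NY'), ('NY', 'SF'), ('DAL', 'CHI');
    """
)

reachable = conn.execute(
    """
    WITH RECURSIVE Reach(city) AS (
        -- Base case: cities one flight away from SF.
        SELECT destination FROM Flights WHERE origin = 'SF'
        UNION
        -- Recursive step: extend every city found so far by one more flight.
        SELECT f.destination
        FROM Flights f JOIN Reach r ON f.origin = r.city
    )
    SELECT city FROM Reach ORDER BY city
    """
).fetchall()

print(reachable)
```

\texttt{DAL} does not appear in the result because no flight leads to it, and \texttt{SF} appears because the cycle leads back to the starting city.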

\section{Conclusion on Query Logic}

Querying with SQL represents a bridge between high-level human requirements and mathematical relational theory. By understanding the underlying relational algebra (selection, projection, products, and joins), users can write more efficient and accurate queries. The complexity of SQL arises from its need to handle real-world data nuances, such as missing information (\texttt{NULL}s) and the desire for summarized reports (aggregation). Mastering the logical order of evaluation is essential for any database engineer: a query starts from the \texttt{FROM} clause, moves through \texttt{WHERE}, \texttt{GROUP BY}, and \texttt{HAVING}, and only then reaches \texttt{SELECT} and finally \texttt{ORDER BY}.

The relationship between SQL and its execution can be viewed as a translation process: the user speaks in ``declarative'' desires, while the database engine converts those desires into a ``procedural'' query plan, much like a chef translating a customer's order into a sequence of kitchen tasks.