Notes V.2.0.0
Rewrote Notes
@@ -1,183 +1,108 @@
|
||||
\chapter{Queries with SQL}
|
||||
|
||||
\section{Overview of SQL Querying}
|
||||
Structured Query Language (SQL) serves as the primary interface for interacting with relational databases. While the Data Definition Language (DDL) handles the creation and modification of database structures, the Data Manipulation Language (DML) is used for the retrieval and modification of records. The querying aspect of SQL is essentially a high-level, declarative implementation of relational algebra. Because SQL is declarative, users specify the desired properties of the result set rather than the procedural steps required to compute it. This allows the database management system (DBMS) to utilize a query optimizer to determine the most efficient execution strategy, known as a query plan.
|
||||
|
||||
The Structured Query Language (SQL) serves as the primary interface for interacting with relational database management systems. At its core, SQL is a declarative language, meaning that a programmer specifies what data should be retrieved rather than providing the step-by-step procedure for how to find it. This approach allows the database's internal query optimizer to determine the most efficient path to the data, leveraging indexes and specific storage structures. SQL is fundamentally set-based, manipulating entire relations as units rather than processing individual records in a procedural loop. This aligns closely with the mathematical foundations of relational algebra, where operations like selection, projection, and join act on sets or bags of tuples.
|
||||
Modern SQL implementations typically follow a set-based or bag-based processing model. Under bag semantics, relations are treated as multisets where duplicate records are permitted, contrasting with the strict set theory used in pure relational algebra. SQL queries are processed by a query compiler that translates the high-level syntax into a tree of algebraic operators, such as selection, projection, join, and grouping.
|
||||
|
||||
The language is typically divided into two main components: the Data Definition Language (DDL) and the Data Manipulation Language (DML). DDL is concerned with the metadata and the structural level of the database, involving the creation, modification, and deletion of tables and columns. DML, which is the focus of this summary, operates at the record level. It allows for the insertion of new data, the updating of existing records, the deletion of information, and most importantly, the querying of the stored data.
|
||||
\dfn{Declarative Language}{A programming paradigm in which the programmer defines what the result should look like (the logic of the computation) without describing its control flow (the procedural steps).}
|
||||
|
||||
\dfn{SQL}{Structured Query Language, a declarative and set-based language used to define and manipulate data within a relational database management system.}
|
||||
\thm{Query Plan}{A structured sequence of internal operations, often represented as a tree of relational algebra operators, that the DBMS execution engine follows to produce the results of a query.}
|
||||
|
||||
\thm{Declarative Programming}{The principle where the programmer describes the desired result of a computation without specifying the control flow or algorithmic steps to achieve it.}
|
||||
\section{Basic Query Structure: SELECT-FROM-WHERE}
|
||||
|
||||
\section{Fundamental Selection and Projection}
|
||||
The fundamental building block of a SQL query is the select-from-where expression. Its three clauses correspond to three core operations of relational algebra: \texttt{SELECT} to projection, \texttt{FROM} to the choice (and product) of the source relations, and \texttt{WHERE} to the selection of tuples.
|
||||
|
||||
The most basic form of a SQL query follows the select-from-where structure. This construct allows a user to extract specific attributes from one or more tables based on certain criteria. To understand this in the context of relational algebra, we can view the three main clauses as distinct algebraic operators. The FROM clause identifies the source relations; if multiple relations are listed, it effectively represents a Cartesian product. The WHERE clause functions as a selection operator ($\sigma$), filtering the tuples produced by the product based on a logical predicate. Finally, the SELECT clause acts as a projection operator ($\pi$), narrowing the result set to only the desired columns.
|
||||
The \texttt{FROM} clause identifies the relations (tables) from which data is to be retrieved. This is conceptually the first step of the query, as it defines the scope of the data. The \texttt{WHERE} clause specifies a predicate used to filter the tuples. Only records that satisfy this logical condition are passed to the next stage. Finally, the \texttt{SELECT} clause identifies which attributes (columns) should be returned in the output. This is equivalent to the projection operator ($\pi$) in relational algebra. If a user wishes to retrieve all columns, a wildcard asterisk (*) is used.
|
||||
|
||||
In its simplest manifestation, a query can use the asterisk symbol (*) to denote that all columns from the source table should be included in the output. While this is useful for exploration, explicit projection is preferred in production environments to minimize data transfer and clarify the schema of the result set. Within the SELECT clause, we are not limited to just listing attributes; we can also include arithmetic expressions, such as calculating a value based on existing columns, or constants to provide context in the output.
|
||||
\thm{SELECT-FROM-WHERE Mapping}{A basic SQL query of the form \texttt{SELECT L FROM R WHERE C} is equivalent to the relational algebra expression $\pi_{L}(\sigma_{C}(R))$.}
|
||||
|
||||
\dfn{Query}{A formal request for information from a database, typically expressed in SQL to retrieve specific data matching a set of conditions.}
|
||||
\nt{In SQL, the select-list can include not only existing attributes but also constants and computed expressions, functioning like an extended projection.}
|
||||
|
||||
\thm{SQL to Relational Algebra Mapping}{A simple select-from-where query is equivalent to the relational algebra expression $\pi_L(\sigma_C(R_1 \times R_2 \times \cdots \times R_n))$, where $L$ is the select list, $C$ is the where condition, and $R_i$ are the relations in the from list.}
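The mapping can be tried out on a small example. The following is a minimal sketch using Python's \texttt{sqlite3} module; the tables \texttt{Movies} and \texttt{Producers} and their rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (title TEXT, year INTEGER, producerID INTEGER);
    CREATE TABLE Producers (producerID INTEGER, name TEXT);
    INSERT INTO Movies VALUES ('Alpha', 1999, 1), ('Beta', 2005, 2), ('Gamma', 2010, 1);
    INSERT INTO Producers VALUES (1, 'Ada'), (2, 'Bob');
""")

# FROM lists two relations (conceptually their Cartesian product),
# WHERE plays the role of the selection, SELECT the projection.
rows = conn.execute("""
    SELECT Movies.title, Producers.name
    FROM Movies, Producers
    WHERE Movies.producerID = Producers.producerID
      AND Movies.year > 2000
    ORDER BY Movies.title
""").fetchall()

# Size of the unfiltered product: 3 movies x 2 producers.
product_size = conn.execute("SELECT COUNT(*) FROM Movies, Producers").fetchone()[0]
print(rows, product_size)
```

Only two of the six combinations in the product survive the selection, and the projection keeps two columns of each.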
|
||||
\section{Logic, Comparisons, and Three-Valued Logic}
|
||||
|
||||
\section{Renaming and Aliasing}
|
||||
Filters in the \texttt{WHERE} clause are constructed using comparison operators such as equality (\texttt{=}), inequality (\texttt{<>} or \texttt{!=}), and range comparisons (\texttt{<}, \texttt{>}, \texttt{<=}, \texttt{>=}). SQL also supports pattern matching for strings through the \texttt{LIKE} operator, where the percent sign (\texttt{\%}) matches any sequence of zero or more characters and the underscore (\texttt{\_}) matches exactly one character.
|
||||
|
||||
When a query is executed, the resulting relation has column headers that default to the names of the attributes in the source tables. However, there are many cases where these names may be ambiguous or uninformative, particularly when calculations are involved. SQL provides the `AS` keyword to assign an alias to a column or an expression. This allows the programmer to rename the output for better readability or to comply with the requirements of an application.
|
||||
A critical aspect of SQL logic is the treatment of \texttt{NULL} values. Because a \texttt{NULL} represents an unknown or missing value, comparing anything to \texttt{NULL} results in a truth value of \texttt{UNKNOWN}. This necessitates a three-valued logic system.
|
||||
|
||||
Aliases can also be applied to relations in the FROM clause. These are referred to as tuple variables or correlation names. Aliasing a relation is essential when a query must compare different rows within the same table, a process known as a self-join. By assigning different aliases to the same table, the query can treat them as two distinct sources of data, enabling comparisons between tuples.
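Both kinds of alias can be seen in one self-join. This is a minimal sketch with Python's \texttt{sqlite3} module; the \texttt{Movies} table, its rows, and the alias names are hypothetical.

```python
import sqlite3

# Hypothetical data: two movies share a release year.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (title TEXT, year INTEGER);
    INSERT INTO Movies VALUES ('Alpha', 1999), ('Beta', 1999), ('Gamma', 2010);
""")

# m1 and m2 are tuple variables over the same table (a self-join);
# first_title and second_title are column aliases for the output.
cur = conn.execute("""
    SELECT m1.title AS first_title, m2.title AS second_title
    FROM Movies AS m1, Movies AS m2
    WHERE m1.year = m2.year
      AND m1.title < m2.title
""")
pairs = cur.fetchall()
headers = [column[0] for column in cur.description]
print(headers, pairs)
```

The condition \texttt{m1.title < m2.title} keeps each pair once and prevents a movie from being paired with itself.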
|
||||
\dfn{Three-Valued Logic}{A logical framework where expressions can evaluate to TRUE, FALSE, or UNKNOWN, specifically required to handle comparisons involving NULL values.}
|
||||
|
||||
\dfn{Alias}{A temporary name assigned to a table or a column within the scope of a single SQL query to improve clarity or disambiguate references.}
|
||||
The behavior of logical operators under three-valued logic follows specific rules:
|
||||
|
||||
\section{String Patterns and Comparison Operators}
|
||||
\begin{itemize}
|
||||
\item \textbf{AND}: The result is TRUE only if both operands are TRUE. If one is FALSE, the result is FALSE regardless of the other. If one is TRUE and the other is UNKNOWN, the result is UNKNOWN.
|
||||
\item \textbf{OR}: The result is TRUE if at least one operand is TRUE. If one is TRUE, the result is TRUE regardless of the other. If one is FALSE and the other is UNKNOWN, the result is UNKNOWN.
|
||||
\item \textbf{NOT}: The negation of UNKNOWN remains UNKNOWN.
|
||||
\end{itemize}
|
||||
|
||||
SQL provides a robust set of comparison operators to filter data within the WHERE clause. These include equality (\texttt{=}), inequality (\texttt{<>}), and various ordering comparisons (\texttt{<}, \texttt{>}, \texttt{<=}, \texttt{>=}). While numeric comparisons are straightforward, string comparisons follow lexicographical order.
|
||||
\nt{The \texttt{WHERE} clause only retains tuples for which the predicate evaluates to TRUE. Records that evaluate to FALSE or UNKNOWN are filtered out.}
|
||||
|
||||
For more flexible string matching, the `LIKE` operator is used with patterns. This operator allows for partial matches using two special wildcard characters: the percent sign (\texttt{\%}) and the underscore (\texttt{\_}). The percent sign matches any sequence of zero or more characters, while the underscore matches exactly one character. This is particularly useful for finding records where only a portion of a string is known or for identifying specific substrings.
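The two wildcards can be compared side by side. The sketch below uses Python's \texttt{sqlite3} module; the titles and the helper \texttt{match} are hypothetical. SQLite's `LIKE` ignores ASCII case by default, so the patterns are chosen so that case does not matter.

```python
import sqlite3

# Hypothetical titles, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (title TEXT);
    INSERT INTO Movies VALUES ('Star Wars'), ('Star Trek'), ('Stir'), ('Mars');
""")

def match(pattern):
    """Return the titles matching a LIKE pattern, sorted by title."""
    cur = conn.execute(
        "SELECT title FROM Movies WHERE title LIKE ? ORDER BY title", (pattern,)
    )
    return [row[0] for row in cur]

prefix = match("Star%")     # percent: any sequence of zero or more characters
one_char = match("St_r")    # underscore: exactly one character
substring = match("%ar%")   # 'ar' anywhere inside the string
print(prefix, one_char, substring)
```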
|
||||
\section{Ordering and Limiting Results}
|
||||
|
||||
\dfn{Predicate}{A logical expression that evaluates to true, false, or unknown, used in the WHERE clause to determine which tuples satisfy the query criteria.}
|
||||
While relational algebra results are conceptually unordered sets or bags, SQL allows users to impose a specific order on the output using the \texttt{ORDER BY} clause. Sorting can be performed in ascending (\texttt{ASC}, the default) or descending (\texttt{DESC}) order. Multiple columns can be specified to handle ties.
|
||||
|
||||
\thm{Lexicographical Comparison}{The method of ordering strings based on the alphabetical order of their component characters, where a string is ``less than'' another if it appears earlier in a dictionary.}
|
||||
Furthermore, SQL provides mechanisms to limit the size of the result set, which is particularly useful for performance and pagination. The \texttt{LIMIT} clause restricts the total number of rows returned, while the \texttt{OFFSET} clause skips a specified number of rows before beginning to return results.
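Sorting and pagination combine as follows. This is a minimal sketch with Python's \texttt{sqlite3} module on a hypothetical \texttt{Movies} table; the page size of two is an arbitrary choice.

```python
import sqlite3

# Hypothetical data with a tie on year (Beta and Delta).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (title TEXT, year INTEGER);
    INSERT INTO Movies VALUES
        ('Alpha', 1999), ('Beta', 2005), ('Gamma', 2010), ('Delta', 2005);
""")

# Newest first; ties on year are broken by title in ascending order.
# OFFSET 1 skips the first row, LIMIT 2 returns the next two.
page = conn.execute("""
    SELECT title, year
    FROM Movies
    ORDER BY year DESC, title ASC
    LIMIT 2 OFFSET 1
""").fetchall()
print(page)
```

Without the \texttt{ORDER BY} clause the rows returned by \texttt{LIMIT} and \texttt{OFFSET} would not be well defined, since the underlying result is a bag.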
|
||||
|
||||
\section{Handling Incomplete Information with Null Values}
|
||||
\thm{List Semantics}{When an \texttt{ORDER BY} clause is applied, the result set is treated as a list rather than a bag, meaning the sequence of records is guaranteed and meaningful for the application.}
|
||||
|
||||
In real-world databases, it is common for certain pieces of information to be missing or inapplicable. SQL represents this missing data with a special marker called `NULL`. It is important to recognize that `NULL` is not a value in the same way 0 or an empty string is; it is a placeholder indicating the absence of a value.
|
||||
\section{Multi-Relation Queries and Joins}
|
||||
|
||||
Because `NULL` represents unknown data, comparisons involving `NULL` cannot result in a standard true or false. Instead, SQL employs a three-valued logic system that includes `UNKNOWN`. For example, if we compare a column containing a `NULL` to a constant, the result is `UNKNOWN`. To explicitly check for these placeholders, SQL provides the `IS NULL` and `IS NOT NULL` operators. Standard equality comparisons like `= NULL` will always evaluate to `UNKNOWN` and therefore fail to filter the desired records.
|
||||
SQL allows queries to involve multiple relations by listing them in the \texttt{FROM} clause. When multiple tables are listed without a joining condition, the result is a Cartesian product, where every tuple from the first relation is paired with every tuple from the second.
|
||||
|
||||
\dfn{NULL}{A special marker in SQL used to indicate that a data value does not exist in the database, either because it is unknown or not applicable.}
|
||||
To perform meaningful combinations, join conditions must be specified. These conditions link related data across tables, typically by equating a primary key in one table with a foreign key in another. If attribute names are identical across tables, they must be disambiguated using the table name or a tuple variable (alias).
|
||||
|
||||
\thm{Three-Valued Logic}{A system of logic where expressions can evaluate to TRUE, FALSE, or UNKNOWN, requiring specialized truth tables for AND, OR, and NOT operations.}
|
||||
\dfn{Tuple Variable (Alias)}{A temporary name assigned to a table in the \texttt{FROM} clause, used to shorten queries, disambiguate column references, or allow a table to be joined with itself (self-join).}
|
||||
|
||||
\section{Logic and Truth Tables in SQL}
|
||||
SQL provides explicit join syntax as an alternative to the comma-separated list in the \texttt{FROM} clause:
|
||||
|
||||
The presence of `UNKNOWN` values necessitates a clear understanding of how logical operators behave. When combining conditions with `AND`, the result is the minimum of the truth values, where TRUE is 1, UNKNOWN is 0.5, and FALSE is 0. Conversely, `OR` takes the maximum of the truth values. The `NOT` operator subtracts the truth value from 1.
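The numeric encoding described above can be written down directly. The sketch below is a plain Python model of the truth tables, not a DBMS feature; the function names are hypothetical.

```python
# Numeric encoding of the three truth values, as in the text.
TRUE, UNKNOWN, FALSE = 1.0, 0.5, 0.0

def sql_and(a, b):
    """AND is the minimum of the two truth values."""
    return min(a, b)

def sql_or(a, b):
    """OR is the maximum of the two truth values."""
    return max(a, b)

def sql_not(a):
    """NOT subtracts the truth value from 1."""
    return 1.0 - a

print(sql_and(TRUE, UNKNOWN))   # UNKNOWN
print(sql_and(FALSE, UNKNOWN))  # FALSE wins under AND
print(sql_or(TRUE, UNKNOWN))    # TRUE wins under OR
print(sql_not(UNKNOWN))         # stays UNKNOWN
```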
|
||||
\begin{itemize}
|
||||
\item \textbf{CROSS JOIN}: Produces the Cartesian product.
|
||||
\item \textbf{INNER JOIN}: Returns only the tuples that satisfy the join condition.
|
||||
\item \textbf{NATURAL JOIN}: Automatically joins tables based on all columns with matching names and removes the redundant duplicate column.
|
||||
\item \textbf{OUTER JOIN}: Preserves ``dangling tuples'' that do not have a match in the other relation, padding the missing values with \texttt{NULL}. These come in \texttt{LEFT}, \texttt{RIGHT}, and \texttt{FULL} varieties.
|
||||
\end{itemize}
|
||||
|
||||
In the context of a WHERE clause, a tuple is only included in the final result set if the entire condition evaluates to `TRUE`. Tuples for which the condition is `FALSE` or `UNKNOWN` are excluded. This behavior can lead to unintuitive results, such as a query for ``all records where X is 10 OR X is not 10'' failing to return records where X is `NULL`, because the result of that OR operation would be `UNKNOWN`.
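This effect is easy to reproduce. The following minimal sketch uses Python's \texttt{sqlite3} module with a hypothetical one-column table \texttt{T}; it also contrasts \texttt{= NULL} with \texttt{IS NULL}.

```python
import sqlite3

# Hypothetical table with one NULL among three rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (x INTEGER);
    INSERT INTO T VALUES (10), (20), (NULL);
""")

# For the NULL row both comparisons are UNKNOWN, so the OR is UNKNOWN
# and the row is filtered out.
either = conn.execute(
    "SELECT x FROM T WHERE x = 10 OR x <> 10 ORDER BY x"
).fetchall()

# '= NULL' never evaluates to TRUE; 'IS NULL' is the explicit test.
equals_null = conn.execute("SELECT COUNT(*) FROM T WHERE x = NULL").fetchone()[0]
is_null = conn.execute("SELECT COUNT(*) FROM T WHERE x IS NULL").fetchone()[0]
print(either, equals_null, is_null)
```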
|
||||
\nt{The \texttt{USING} clause is a safer alternative to \texttt{NATURAL JOIN} as it allows the user to explicitly specify which columns with shared names should be used for the join, preventing accidental matches on unrelated columns.}
|
||||
|
||||
\dfn{Truth Table}{A mathematical table used to determine the result of logical operations given all possible combinations of input truth values.}
|
||||
\section{Subqueries and Nesting}
|
||||
|
||||
\section{Multi-Relation Queries and the Cartesian Product}
|
||||
SQL is highly recursive, allowing queries to be nested within other queries. A subquery can appear in the \texttt{WHERE}, \texttt{FROM}, or \texttt{SELECT} clauses.
|
||||
|
||||
When a query involves data spread across multiple tables, the FROM clause lists all the relevant relations. The logical starting point for such a query is the Cartesian product, which pairs every tuple from the first relation with every tuple from the second, and so on. This produces a very large intermediate relation where each row represents a potential combination of the source data.
|
||||
Subqueries that return a single row and a single column are called scalar subqueries and can be used anywhere a constant is expected. Subqueries that return a single column (a list of values) can be used with operators like \texttt{IN}, \texttt{ANY}, or \texttt{ALL}. The \texttt{EXISTS} operator is used to check if a subquery returns any results at all.
|
||||
|
||||
To make this product useful, the WHERE clause must contain join conditions that link the relations based on common attributes. For instance, if we are joining a 'Movies' table with a 'Producers' table, we might equate the 'producerID' column in both. This filtering process discards the vast majority of the Cartesian product, leaving only the rows where the related data actually matches. When attributes in different tables share the same name, we use the dot notation (e.g., TableName.AttributeName) to disambiguate the references.
|
||||
\thm{Correlated Subquery}{A subquery that references an attribute from the outer query. It conceptually requires the subquery to be evaluated once for every row processed by the outer query.}
|
||||
|
||||
\thm{The Join-Selection Equivalence}{The principle that a natural join or an equijoin can be logically expressed as a selection performed on a Cartesian product of relations.}
|
||||
\section{Duplicate Elimination and Set Operations}
|
||||
|
||||
\section{Interpretation of Multi-Relation Queries}
|
||||
Because SQL defaults to bag semantics, it often produces duplicate rows. The \texttt{DISTINCT} keyword can be added to the \texttt{SELECT} clause to force set semantics by removing these duplicates.
|
||||
|
||||
There are multiple ways to interpret the execution of a query involving several relations. One helpful mental model is the ``nested loops'' approach. In this model, we imagine a loop for each relation in the FROM clause. The outermost loop iterates through every tuple of the first relation, and for each of those, the next loop iterates through the second relation, and so on. Inside the innermost loop, the WHERE condition is tested against the current combination of tuples. If the condition is met, the SELECT clause produces an output row.
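The nested-loops reading can be written out literally and compared with what a DBMS returns. This is a sketch of the mental model only, not of how an engine actually executes the query; the relations and their rows are hypothetical.

```python
import sqlite3

# Hypothetical relations as Python lists of tuples.
movies = [("Alpha", 1), ("Beta", 2), ("Gamma", 1)]   # (title, producerID)
producers = [(1, "Ada"), (2, "Bob")]                 # (producerID, name)

# One loop per relation in FROM; WHERE is tested in the innermost loop;
# SELECT builds the output row.
loop_result = []
for m in movies:
    for p in producers:
        if m[1] == p[0]:                       # WHERE condition
            loop_result.append((m[0], p[1]))   # SELECT title, name

# The same query evaluated by SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT, producerID INTEGER)")
conn.execute("CREATE TABLE Producers (producerID INTEGER, name TEXT)")
conn.executemany("INSERT INTO Movies VALUES (?, ?)", movies)
conn.executemany("INSERT INTO Producers VALUES (?, ?)", producers)
sql_result = conn.execute("""
    SELECT title, name
    FROM Movies, Producers
    WHERE Movies.producerID = Producers.producerID
""").fetchall()

# Compare as sorted lists, since a bag has no inherent order.
same = sorted(loop_result) == sorted(sql_result)
print(loop_result, same)
```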
|
||||
SQL also supports standard set operations: \texttt{UNION}, \texttt{INTERSECT}, and \texttt{EXCEPT} (or \texttt{MINUS}). By default, these operations eliminate duplicates. If bag semantics are desired, the \texttt{ALL} keyword must be appended (e.g., \texttt{UNION ALL}).
|
||||
|
||||
Another interpretation is based on parallel assignment. In this view, we consider all possible assignments of tuples to the variables representing the relations. We then filter for those assignments that satisfy the condition. While the nested loop model is more algorithmic, the parallel assignment model highlights the declarative nature of the query, emphasizing that the order of the relations in the FROM clause should not, in theory, affect the result.
|
||||
\nt{Duplicate elimination is a computationally expensive operation because it requires sorting or hashing the entire result set to identify matching tuples.}
|
||||
|
||||
\dfn{Tuple Variable}{A variable that ranges over the tuples of a relation, often implicitly created for each table in the FROM clause or explicitly defined as an alias.}
|
||||
\section{Aggregation and Grouping}
|
||||
|
||||
\section{Set Operators and Bag Semantics}
|
||||
Aggregation allows users to summarize large volumes of data into single representative values. SQL provides five standard aggregate functions: \texttt{COUNT}, \texttt{SUM}, \texttt{AVG}, \texttt{MIN}, and \texttt{MAX}.
|
||||
|
||||
SQL provides operators for the traditional set-theoretic actions: `UNION`, `INTERSECT`, and `EXCEPT`. These allow the results of two queries to be combined, provided they have the same schema (compatible attribute types and order). By default, these operators follow set semantics, meaning that they automatically eliminate duplicate tuples from the result.
|
||||
The \texttt{GROUP BY} clause partitions the data into groups based on the values of one or more columns. Aggregate functions are then applied to each group independently. A critical restriction exists when using grouping: any column appearing in the \texttt{SELECT} list that is not part of an aggregate function must be included in the \texttt{GROUP BY} clause.
|
||||
|
||||
However, since SQL is fundamentally based on bags (multisets), it also provides versions of these operators that preserve duplicates using the `ALL` keyword. `UNION ALL` simply concatenates the two result sets. `INTERSECT ALL` produces a tuple as many times as it appears in both inputs (taking the minimum of the two counts). `EXCEPT ALL` produces a tuple as many times as it appears in the first input minus the number of times it appears in the second, and never fewer than zero times. Using bag semantics is often more efficient because the system does not need to perform the expensive work of sorting or hashing the data to find and remove duplicates.
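The counting rules for the `ALL` variants can be modeled with Python's \texttt{collections.Counter}, which behaves like a bag. This is an illustrative model rather than SQL itself (not every DBMS implements `INTERSECT ALL` and `EXCEPT ALL`); the bags \texttt{r} and \texttt{s} are hypothetical.

```python
from collections import Counter

# Two hypothetical bags: each key is a tuple value, each count its multiplicity.
r = Counter({"a": 3, "b": 1})
s = Counter({"a": 1, "b": 2, "c": 1})

union_all = r + s       # counts add up
intersect_all = r & s   # minimum of the two counts
except_all = r - s      # difference of counts, dropped when not positive

print(dict(union_all), dict(intersect_all), dict(except_all))
```

The value \texttt{b} disappears from the difference because it occurs once in \texttt{r} and twice in \texttt{s}.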
|
||||
\thm{The Aggregation Rule}{In a query using grouping, the output can only consist of the attributes used for grouping and the results of aggregate functions applied to the groups.}
|
||||
|
||||
\dfn{Bag}{A collection of elements that allows for multiple occurrences of the same element, where the order of elements remains immaterial.}
|
||||
For filtering data after it has been aggregated, SQL uses the \texttt{HAVING} clause. Unlike \texttt{WHERE}, which filters individual rows before they are grouped, \texttt{HAVING} filters the groups themselves based on aggregate properties.
|
||||
|
||||
\thm{Closure of Bag Operations}{The property that the result of any operation on bags is also a bag, ensuring that the relational model remains consistent through complex sequences of operations.}
|
||||
\dfn{Aggregate Function}{A function that takes a collection of values as input and returns a single value as a summary, such as a total or an average.}
|
||||
|
||||
\section{Nested Queries and Scalar Subqueries}
|
||||
\section{Advanced Table Expressions and Common Table Expressions}
|
||||
|
||||
A subquery is a query nested within another query. Subqueries can appear in various parts of a SQL statement, including the WHERE, FROM, and HAVING clauses. A scalar subquery is one that returns exactly one row and one column—a single value. Because it evaluates to a scalar, it can be used anywhere a constant or an attribute would be valid, such as in a comparison.
|
||||
To improve query readability and maintainability, SQL provides mechanisms to define temporary relations within a single query. The \texttt{VALUES} clause can be used to construct a constant table on the fly. More importantly, the \texttt{WITH} clause allows for the definition of Common Table Expressions (CTEs).
|
||||
|
||||
Suppose a scalar subquery is meant to find the ID of a person with a given name, but the data actually contains two people with that name. The subquery then yields two rows where a single value is expected, and in standard SQL the query fails at runtime because the system cannot resolve the ambiguity (some systems, such as SQLite, instead use the first row). If the subquery returns no rows, its value is `NULL`.
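A scalar subquery in a comparison, including the empty case, might look like the following sketch. It uses Python's \texttt{sqlite3} module; the tables, rows, and the helper \texttt{titles\_by} are hypothetical. The multi-row case is left out because its behavior depends on the DBMS.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Producers (producerID INTEGER, name TEXT);
    CREATE TABLE Movies (title TEXT, producerID INTEGER);
    INSERT INTO Producers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO Movies VALUES ('Alpha', 1), ('Beta', 2), ('Gamma', 1);
""")

def titles_by(name):
    """Titles whose producerID equals the scalar subquery's single value."""
    cur = conn.execute("""
        SELECT title FROM Movies
        WHERE producerID = (SELECT producerID FROM Producers WHERE name = ?)
        ORDER BY title
    """, (name,))
    return [row[0] for row in cur]

found = titles_by("Ada")
missing = titles_by("Zed")   # subquery is empty, so the comparison is UNKNOWN

# An empty scalar subquery evaluates to NULL (None in Python).
empty_value = conn.execute(
    "SELECT (SELECT producerID FROM Producers WHERE name = 'Zed')"
).fetchone()[0]
print(found, missing, empty_value)
```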
|
||||
\thm{Common Table Expression (CTE)}{A temporary named result set that exists only within the scope of a single query, providing a way to decompose complex queries into smaller, logical steps.}
|
||||
|
||||
\dfn{Scalar}{A single atomic value, such as an integer or a string, as opposed to a collection of values like a row or a table.}
|
||||
CTEs can also be recursive, allowing SQL to perform operations that are impossible in standard relational algebra, such as computing the transitive closure of a graph (e.g., finding all reachable cities in a flight network).
|
||||
|
||||
\section{Conditions on Relations: IN and EXISTS}
|
||||
\nt{The \texttt{WITH RECURSIVE} statement typically consists of a base case (non-recursive query) and a recursive step joined by a \texttt{UNION} operator.}
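The base case and recursive step can be seen in a small reachability query. This is a minimal sketch with Python's \texttt{sqlite3} module; the \texttt{Flights} table, its rows, and the CTE name \texttt{Reach} are hypothetical.

```python
import sqlite3

# Hypothetical flight network: A -> B -> C -> D, plus an unrelated X -> Y.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Flights (src TEXT, dst TEXT);
    INSERT INTO Flights VALUES ('A', 'B'), ('B', 'C'), ('C', 'D'), ('X', 'Y');
""")

# Base case: cities one flight away from 'A'.
# Recursive step: extend every reached city by one more flight.
# UNION (rather than UNION ALL) discards repeats, so cycles would terminate.
reachable = [row[0] for row in conn.execute("""
    WITH RECURSIVE Reach(city) AS (
        SELECT dst FROM Flights WHERE src = 'A'
        UNION
        SELECT f.dst FROM Reach AS r JOIN Flights AS f ON f.src = r.city
    )
    SELECT city FROM Reach ORDER BY city
""")]
print(reachable)
```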
|
||||
|
||||
When a subquery returns a set of values rather than a single scalar, it can be used with relational operators like `IN`. The expression `x IN (subquery)` evaluates to true if the value of x is found in the result set produced by the subquery. This is a powerful way to filter data based on membership in a dynamically calculated set.
|
||||
\section{Conclusion on Query Logic}
|
||||
|
||||
The `EXISTS` operator is another tool for dealing with subqueries. It takes a subquery as an argument and returns true if the subquery returns at least one row. Unlike `IN`, `EXISTS` does not look at the actual values returned; it only checks for the existence of results. This is often used in correlated subqueries to check for the presence of related records in another table.
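Membership and existence tests can be compared on the same data. The sketch below uses Python's \texttt{sqlite3} module; the tables and rows are hypothetical.

```python
import sqlite3

# Hypothetical data: producer 3 has no movies.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Producers (producerID INTEGER, name TEXT);
    CREATE TABLE Movies (title TEXT, producerID INTEGER);
    INSERT INTO Producers VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Movies VALUES ('Alpha', 1), ('Beta', 2);
""")

# IN: is the producer's ID a member of the set computed by the subquery?
with_in = [r[0] for r in conn.execute("""
    SELECT name FROM Producers
    WHERE producerID IN (SELECT producerID FROM Movies)
    ORDER BY name
""")]

# EXISTS: does the (correlated) subquery return at least one row?
with_exists = [r[0] for r in conn.execute("""
    SELECT name FROM Producers AS p
    WHERE EXISTS (SELECT 1 FROM Movies AS m WHERE m.producerID = p.producerID)
    ORDER BY name
""")]

# NOT EXISTS finds producers with no related movie at all.
without_movies = [r[0] for r in conn.execute("""
    SELECT name FROM Producers AS p
    WHERE NOT EXISTS (SELECT 1 FROM Movies AS m WHERE m.producerID = p.producerID)
""")]
print(with_in, with_exists, without_movies)
```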
|
||||
Querying with SQL represents a bridge between high-level human requirements and mathematical relational theory. By understanding the underlying relational algebra (selection, projection, products, and joins), users can write more efficient and accurate queries. The complexity of SQL arises from its need to handle real-world data nuances, such as missing information (\texttt{NULL}s) and the desire for summarized reports (aggregation). Mastering the logical order of operations, starting from the \texttt{FROM} clause, moving through \texttt{WHERE}, \texttt{GROUP BY}, and \texttt{HAVING}, and finally reaching \texttt{SELECT} and \texttt{ORDER BY}, is essential for any database engineer.
|
||||
|
||||
\thm{Existence Quantification}{The logical principle of checking whether there is at least one element in a set that satisfies a given property, implemented in SQL via the EXISTS operator.}
|
||||
|
||||
\section{Correlated Subqueries and Scoping}
|
||||
|
||||
A correlated subquery is a nested query that refers to attributes of the outer query. Because of this dependency, the subquery must, in concept, be re-evaluated for every row processed by the outer query. This creates a link between the two levels of the query, allowing for complex logic like ``find all employees whose salary is higher than the average salary in their specific department.''
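The salary example can be sketched as follows with Python's \texttt{sqlite3} module. The \texttt{Employees} table, its columns, and its rows are hypothetical.

```python
import sqlite3

# Hypothetical data: the average salary is 110 in 'eng' and 80 in 'ops'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO Employees VALUES
        ('Ann', 'eng', 120), ('Bo', 'eng', 100),
        ('Cy', 'ops', 80), ('Di', 'ops', 90), ('Ed', 'ops', 70);
""")

# The inner query refers to e.dept from the outer query, so conceptually
# it is re-evaluated for every outer row e.
above_average = [r[0] for r in conn.execute("""
    SELECT e.name
    FROM Employees AS e
    WHERE e.salary > (SELECT AVG(e2.salary)
                      FROM Employees AS e2
                      WHERE e2.dept = e.dept)
    ORDER BY e.name
""")]
print(above_average)
```

The aliases \texttt{e} and \texttt{e2} are what make the scoping explicit, since both levels range over the same table.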
|
||||
|
||||
Scoping rules in SQL dictate how attribute names are resolved. An attribute in a subquery will first be looked for in the tables mentioned in that subquery's own FROM clause. If it is not found there, the system looks at the FROM clause of the next level out, and so on. If the same attribute name appears in multiple levels, we must use aliases to ensure the correct column is referenced. Correlated subqueries are often more expressive than simple joins but can be more computationally expensive if the optimizer cannot unnest them into a join.
|
||||
|
||||
\dfn{Correlated Subquery}{A subquery that depends on the current row being processed by the outer query, identified by references to attributes defined in the outer scope.}
|
||||
|
||||
\section{Join Expressions and Syntax Variants}
|
||||
|
||||
While the select-from-where structure can express most joins, SQL also provides explicit join syntax. A `CROSS JOIN` is a direct representation of the Cartesian product. A `JOIN ... ON` allows the join condition to be specified explicitly in the FROM clause, which many developers find clearer than placing the condition in the WHERE clause.
|
||||
|
||||
A `NATURAL JOIN` is a specialized form of join that automatically equates all columns with the same name in both tables and removes the redundant copies of those columns. While natural joins are concise, they can be risky because they depend on attribute names. If a schema change adds a column to one table that happens to share a name with a column in another, the natural join logic will change automatically and potentially break the query. The `USING` clause provides a middle ground, allowing the user to specify exactly which common columns should be used for the join.
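The risk of an accidental match can be demonstrated directly. In this sketch (Python's \texttt{sqlite3} module, hypothetical tables), both tables happen to carry an unrelated column called \texttt{notes}.

```python
import sqlite3

# Hypothetical schema: 'notes' exists in both tables but means different things.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (title TEXT, producerID INTEGER, notes TEXT);
    CREATE TABLE Producers (producerID INTEGER, name TEXT, notes TEXT);
    INSERT INTO Movies VALUES ('Alpha', 1, 'draft');
    INSERT INTO Producers VALUES (1, 'Ada', 'vip');
""")

# NATURAL JOIN equates producerID AND notes, so the pair no longer matches.
natural = conn.execute(
    "SELECT title, name FROM Movies NATURAL JOIN Producers"
).fetchall()

# USING names the join column explicitly and ignores 'notes'.
using = conn.execute(
    "SELECT title, name FROM Movies JOIN Producers USING (producerID)"
).fetchall()
print(natural, using)
```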
|
||||
|
||||
\dfn{Natural Join}{A join operation that matches tuples based on all attributes that have the same name in both relations, producing a result that contains only one copy of each common attribute.}
|
||||
|
||||
\section{Outer Joins and Data Preservation}
|
||||
|
||||
In a standard (inner) join, tuples that do not have a match in the other table are discarded. These are called ``dangling tuples.'' If we wish to preserve these tuples in our result set, we use an `OUTER JOIN`. There are three types: `LEFT`, `RIGHT`, and `FULL`. A `LEFT OUTER JOIN` includes all tuples from the left table; if a tuple has no match in the right table, the columns from the right table are filled with `NULL`.
|
||||
|
||||
The `RIGHT OUTER JOIN` is symmetric, preserving all rows from the right table. A `FULL OUTER JOIN` preserves all rows from both tables, ensuring that no information from either source is lost. Outer joins are essential when we need a comprehensive list of items, even if some of those items lack certain related data.
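The difference between an inner and a left outer join shows up as soon as one tuple dangles. This is a minimal sketch with Python's \texttt{sqlite3} module on hypothetical tables; only the `LEFT` variant is shown because older SQLite versions do not support `RIGHT` and `FULL` outer joins.

```python
import sqlite3

# Hypothetical data: producer Bob has no movie, so his tuple dangles.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Producers (producerID INTEGER, name TEXT);
    CREATE TABLE Movies (title TEXT, producerID INTEGER);
    INSERT INTO Producers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO Movies VALUES ('Alpha', 1);
""")

inner = conn.execute("""
    SELECT p.name, m.title
    FROM Producers AS p JOIN Movies AS m ON m.producerID = p.producerID
    ORDER BY p.name
""").fetchall()

# The left table's dangling tuple is kept and padded with NULL (None).
left_outer = conn.execute("""
    SELECT p.name, m.title
    FROM Producers AS p LEFT OUTER JOIN Movies AS m ON m.producerID = p.producerID
    ORDER BY p.name
""").fetchall()
print(inner, left_outer)
```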
|
||||
|
||||
\dfn{Dangling Tuple}{A tuple in one relation that does not match any tuple in another relation based on the join criteria.}
|
||||
|
||||
\thm{The Outer Join Property}{The guarantee that all tuples of the specified operand relations will be represented in the result, with NULL values used to fill in missing components for non-matching rows.}
|
||||
|
||||
\section{Aggregation and Data Summarization}
|
||||
|
||||
SQL includes several built-in functions to perform calculations across entire columns of data. These are known as aggregation operators. The five standard operators are `SUM`, `AVG`, `MIN`, `MAX`, and `COUNT`. `SUM` and `AVG` can only be applied to numeric data, while `MIN` and `MAX` can also be applied to strings (using lexicographical order) or dates.
|
||||
|
||||
The `COUNT` operator is versatile; `COUNT(*)` counts every row in a table, while `COUNT(attribute)` counts only the non-null values in that specific column. If we wish to count only the unique values, we can use the `DISTINCT` keyword inside the aggregation, such as `COUNT(DISTINCT studioName)`. It is vital to remember that all aggregations except for `COUNT` return `NULL` if they are applied to an empty set of values. `COUNT` returns 0 for an empty set.
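The different counting modes and the empty-set behavior can be checked on a small table. The sketch uses Python's \texttt{sqlite3} module; the \texttt{Movies} table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical data with NULLs in both studioName and length.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (title TEXT, studioName TEXT, length INTEGER);
    INSERT INTO Movies VALUES ('A', 'Fox', 100), ('B', 'Fox', NULL), ('C', NULL, 120);
""")

# COUNT(*) counts rows; COUNT(column) skips NULLs; DISTINCT removes repeats.
counts = conn.execute("""
    SELECT COUNT(*), COUNT(length), COUNT(DISTINCT studioName)
    FROM Movies
""").fetchone()

# Aggregating over an empty set of rows: COUNT gives 0, the others NULL.
empty = conn.execute("""
    SELECT COUNT(*), SUM(length), AVG(length), MIN(length), MAX(length)
    FROM Movies
    WHERE 1 = 0
""").fetchone()
print(counts, empty)
```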
|
||||
|
||||
\dfn{Aggregation}{The process of summarizing multiple values into a single value through functions like summation or averaging.}
|
||||
|
||||
\section{Grouping and Partitioning}
|
||||
|
||||
The `GROUP BY` clause allows us to partition the rows of a relation into groups based on their values in one or more attributes. When a query contains a `GROUP BY` clause, the SELECT clause is limited in what it can contain. Every attribute listed in the SELECT clause must either be an attribute used for grouping or be part of an aggregate function.
|
||||
|
||||
Conceptually, the system first creates the groups and then applies the aggregate functions to each group independently. The result is a single row for each unique combination of values in the grouping attributes. This is the primary way to generate reports and statistics, such as ``the total number of movies produced by each studio per year.''
|
||||
|
||||
\dfn{Grouping Attribute}{An attribute used in the GROUP BY clause to define the partitions upon which aggregation functions will operate.}
|
||||
|
||||
\section{Post-Aggregation Filtering with HAVING}
|
||||
|
||||
Sometimes we want to filter the results of a query based on an aggregate value. However, the WHERE clause is evaluated before any grouping or aggregation takes place. Therefore, we cannot use a condition like `WHERE SUM(length) > 500`. To solve this, SQL provides the `HAVING` clause.
|
||||
|
||||
The `HAVING` clause is evaluated after the groups have been formed and the aggregations have been calculated. It allows the programmer to specify conditions that apply to the group as a whole. Only the groups that satisfy the `HAVING` condition will appear in the final output. While `HAVING` can technically contain any condition, it is best practice to only use it for conditions involving aggregates, leaving all tuple-level filtering to the WHERE clause.
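The contrast between the two clauses can be sketched with Python's \texttt{sqlite3} module. The \texttt{Movies} table and its rows are hypothetical, and the exact error raised for an aggregate in `WHERE` is specific to SQLite.

```python
import sqlite3

# Hypothetical data: total length is 550 for Fox and 300 for MGM.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movies (studioName TEXT, length INTEGER);
    INSERT INTO Movies VALUES ('Fox', 300), ('Fox', 250), ('MGM', 200), ('MGM', 100);
""")

# HAVING filters whole groups after aggregation.
long_studios = conn.execute("""
    SELECT studioName, SUM(length)
    FROM Movies
    GROUP BY studioName
    HAVING SUM(length) > 500
""").fetchall()

# An aggregate inside WHERE is rejected, because WHERE runs before grouping.
try:
    conn.execute(
        "SELECT studioName FROM Movies WHERE SUM(length) > 500 GROUP BY studioName"
    )
    where_rejected = False
except sqlite3.Error:
    where_rejected = True
print(long_studios, where_rejected)
```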
|
||||
|
||||
\dfn{HAVING}{A clause in SQL used to specify conditions that filter groups of rows created by the GROUP BY clause, typically involving aggregate functions.}
|
||||
|
||||
\thm{Query Execution Order}{The logical sequence of operations in a SQL query: FROM (and JOINs), then WHERE, then GROUP BY, then HAVING, and finally SELECT (and DISTINCT) and ORDER BY.}
|
||||
|
||||
\section{Ordering and Sorting the Result}
|
||||
|
||||
The final step in many queries is to present the data in a specific order for the user. The `ORDER BY` clause facilitates this, allowing for sorting by one or more columns in either ascending (`ASC`, the default) or descending (`DESC`) order. Sorting is logically the last of these clauses to be applied before the data is returned. In a simple query without grouping or `DISTINCT`, a column that is not projected in the SELECT clause can still be used for sorting, provided it is available in the source tables.
|
||||
|
||||
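If multiple columns are listed in the `ORDER BY` clause, the system sorts by the first column first. If there are ties, it uses the second column to break them, and so on. This gives a readable presentation of the retrieved information. The order is fully deterministic only when the listed columns together distinguish every row; rows that tie on all of them may be returned in any order.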
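
A small sketch of both points, multi-column sorting and sorting on a column that is not projected, is given here. The `Movies` table and its rows are hypothetical, and SQLite is assumed as the engine.

```python
import sqlite3

# Hypothetical data: two movies tie on year, so title breaks the tie.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT, year INTEGER, length INTEGER)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?)",
    [("B", 1990, 120), ("A", 1990, 100), ("C", 1991, 90)],
)

# Newest year first; within the same year, titles in ascending order.
by_year_then_title = conn.execute(
    "SELECT title, year FROM Movies ORDER BY year DESC, title ASC"
).fetchall()

# length is not in the SELECT list but is still usable as a sort key.
by_hidden_length = conn.execute(
    "SELECT title FROM Movies ORDER BY length DESC"
).fetchall()

print(by_year_then_title)
print(by_hidden_length)
```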
|
||||
|
||||
\dfn{Sorting}{The process of arranging the rows of a result set in a specific sequence based on the values of one or more attributes.}
|
||||
|
||||
\section{Extended Projection and Constants}
|
||||
|
||||
The extended projection operator allows for more than just choosing columns. It enables the use of expressions that combine attributes or apply functions to them. In SQL, this is manifested in the SELECT list, where we can perform additions, concatenations, or even call stored functions.
|
||||
|
||||
Constants are also frequently used in the SELECT list. For example, a query might use `SELECT 'Movie', title, year` on a table. Every resulting row would have the string literal `'Movie'` as its first column. Note that SQL string literals are written with single quotes; double quotes denote identifiers. Constants are often used to label different parts of a union or to provide fixed formatting for an external application.
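
As a hedged illustration of extended projection, the sketch below combines a constant, a concatenation, and an arithmetic expression in one SELECT list. The table, the rows, and the output column names `kind`, `label`, and `lengthInSeconds` are all assumptions made for the example. The string constant is written with single quotes, as SQL requires for literals.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT, year INTEGER, length INTEGER)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?)",
    [("M1", 1990, 100), ("M2", 1991, 120)],
)

# A constant column, a string concatenation (||), and an arithmetic
# expression, each given a name with AS.
rows = conn.execute(
    """
    SELECT 'Movie' AS kind,
           title || ' (' || year || ')' AS label,
           length * 60 AS lengthInSeconds
    FROM Movies
    ORDER BY title
    """
).fetchall()

for row in rows:
    print(row)
```

The `||` operator is the standard SQL concatenation operator and is what SQLite uses; some systems provide a `CONCAT` function instead.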
|
||||
|
||||
\thm{Functional Dependency in Aggregation}{The rule that in a grouped query, any attribute in the SELECT list that is not aggregated must be functionally determined by the grouping attributes to ensure the result is well-defined.}
|
||||
|
||||
\section{Nested Queries in the FROM Clause}
|
||||
|
||||
SQL allows a subquery to be placed in the FROM clause. In this case, the subquery acts as a temporary table that exists only for the duration of the outer query. This is particularly useful when we need to perform multiple levels of aggregation or when we want to join a table with a summarized version of itself.
|
||||
|
||||
When a subquery is used in the FROM clause, standard SQL requires it to be assigned an alias, although some systems treat the alias as optional. This alias allows the outer query to refer to the columns produced by the subquery. This technique is often a cleaner alternative to using complex correlated subqueries in the WHERE clause, as it makes the flow of data more explicit.
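
Both uses mentioned above, a second level of aggregation and a join against a summarized version of the same table, can be sketched as follows. The `Movies` data and the aliases `PerStudio` and `s` are hypothetical; SQLite is assumed as the engine.

```python
import sqlite3

# Hypothetical data, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Movies (title TEXT, year INTEGER, length INTEGER, studioName TEXT)"
)
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?, ?)",
    [
        ("M1", 1990, 100, "Fox"),
        ("M2", 1990, 120, "Fox"),
        ("M3", 1991, 90, "Fox"),
        ("M4", 1990, 110, "Disney"),
    ],
)

# Two levels of aggregation: count movies per studio in the derived
# table, then average those counts in the outer query.
avg_movies_per_studio = conn.execute(
    """
    SELECT AVG(numMovies)
    FROM (SELECT studioName, COUNT(*) AS numMovies
          FROM Movies
          GROUP BY studioName) AS PerStudio
    """
).fetchone()[0]

# Join the table with a summarized version of itself: movies that are
# longer than the average length for their own studio.
above_studio_average = conn.execute(
    """
    SELECT m.title
    FROM Movies AS m
    JOIN (SELECT studioName, AVG(length) AS avgLength
          FROM Movies
          GROUP BY studioName) AS s
      ON m.studioName = s.studioName
    WHERE m.length > s.avgLength
    ORDER BY m.title
    """
).fetchall()

print(avg_movies_per_studio, above_studio_average)
```

The second query could also be written with a correlated subquery in the WHERE clause; the derived-table form computes each studio's average once and makes the data flow visible in the FROM clause.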
|
||||
|
||||
\dfn{Derived Table}{A temporary result set returned by a subquery in the FROM clause, which is then used by the outer query as if it were a physical table.}
|
||||
|
||||
\section{Summary of Advanced SQL Syntax}
|
||||
|
||||
Throughout our exploration of Chapter 6 and the accompanying presentation, we have seen that SQL is far more than a simple tool for data retrieval. Its ability to nest logic, perform complex aggregations across partitioned data, and handle various join types allows it to solve sophisticated data analysis problems. The transition from the mathematical abstractions of relational algebra to the practical syntax of SQL reveals how each keyword serves a specific logical function in the data-processing pipeline.
|
||||
|
||||
By understanding the declarative nature of the language and the underlying bag semantics, developers can write queries that are not only correct but also efficient. The careful management of NULLs, the strategic use of subqueries, and the mastery of the `GROUP BY` and `HAVING` clauses form the foundation of expert database programming. This summary has covered the syntax and the theoretical justification for the most critical features of SQL querying, providing a roadmap for complex data manipulation.
|
||||
|
||||
\thm{The Universal Query Form}{The select-from-where block is the universal building block of SQL, capable of expressing any operation that can be represented by the core operators of relational algebra.}
|
||||
The relationship between SQL and its execution can be viewed as a translation process: the user speaks in "declarative" desires, while the database engine converts those desires into a "procedural" query plan, much like a chef translating a customer's order into a sequence of kitchen tasks.
|
||||
|
||||