
\chapter{Relational Algebra}

Relational algebra serves as the formal foundation for the manipulation of data within the relational model. It is a procedural language consisting of a set of operators that take one or more relations as input and produce a new relation as output. This mathematical framework is essential for database systems because it provides a precise way to represent queries and allows the system's query optimizer to reorganize expressions into more efficient execution plans. Unlike general-purpose programming languages such as C or Java, relational algebra is intentionally limited in power. For example, it cannot compute arbitrary functions such as factorials or determine whether the number of tuples in a relation is even or odd. However, this limitation is a strategic advantage, as it enables the database engine to perform high-level optimizations that would be impractical in a more expressive language.

The algebra is characterized by the property of closure: every result is a relation that can immediately serve as input to another operation. Operators are generally categorized into unary operators, which act on a single relation, and binary operators, which combine two relations. The core operations include the set operations (union, intersection, and difference) and operations specific to relations, such as selection, projection, joins, and renaming. In modern database implementations, these concepts are often extended to support bag semantics (allowing duplicates) and additional operations such as grouping and sorting.

\dfn{Relational Algebra}{A collection of mathematical operators that function on relations to perform data retrieval and manipulation. It is closed under its operations, ensuring that the output of any expression is itself a relation.}

\thm{The Power of Optimization}{By limiting the expressive power of the query language to relational algebra, database systems can effectively optimize code. This allows the system to replace inefficient execution strategies with mathematically equivalent but significantly faster algorithms.}

\section{The Unary Operators: Selection and Projection}

Selection and projection are the primary filters used to reduce the size of a relation. Selection, denoted by the Greek letter sigma ($\sigma$), acts as a horizontal filter. It examines each tuple in a relation and retains only those that satisfy a specific logical condition. This condition can involve attribute comparisons to constants or other attributes using standard operators such as equality, inequality, and logical connectors like AND and OR.

Projection, denoted by the Greek letter pi ($\pi$), acts as a vertical filter. It is used to choose specific columns from a relation while discarding others. In its classical set-based form, projection also serves to eliminate any duplicate tuples that may arise when certain attributes are removed. This ensures that the result remains a valid set. Extended versions of projection allow for the creation of new attributes through calculations or renamings of existing fields.

\dfn{Selection}{An operation $\sigma_C(R)$ that produces a relation containing all tuples from $R$ that satisfy the condition $C$. It does not change the schema of the relation but reduces the number of rows.}

\dfn{Projection}{An operation $\pi_L(R)$ that creates a new relation consisting of only the attributes listed in $L$. It transforms the schema of the relation and may reduce the number of rows if duplicates are removed.}
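
As a small illustration of both operators, consider a hypothetical relation $\mathit{Emp}(\mathit{name}, \mathit{dept}, \mathit{salary})$. The relation, its attribute names, and its three tuples are assumptions made up for this example.

\begin{center}
\begin{tabular}{lll}
\hline
name & dept & salary \\
\hline
Ana & IT & 50 \\
Ben & IT & 70 \\
Cle & HR & 50 \\
\hline
\end{tabular}
\end{center}

\[
\sigma_{\mathit{salary} > 60}(\mathit{Emp}) = \{(\mathrm{Ben}, \mathrm{IT}, 70)\},
\qquad
\pi_{\mathit{dept}}(\mathit{Emp}) = \{(\mathrm{IT}), (\mathrm{HR})\}.
\]

The selection keeps all three attributes but only one of the three rows. The projection keeps a single attribute and, under set semantics, returns two rows rather than three because the value IT is listed only once.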

\section{Set Operations and Compatibility Constraints}

Relational algebra incorporates standard set operations: union ($\cup$), intersection ($\cap$), and difference ($-$). Because these operations are inherited from set theory, they require the participating relations to be ``compatible.'' This means the relations must share the same schema: they must have the same set of attributes, and the domains (data types) associated with corresponding attributes must be identical.

Union combines all tuples from two relations into a single result. Intersection identifies tuples that appear in both input relations. Difference, which is not commutative, returns tuples found in the first relation but not the second. While these were originally defined for sets, modern systems often apply them to ``bags'' (multisets), where the rules for handling duplicates differ. For instance, in bag union, the number of occurrences of a tuple is the sum of its occurrences in the inputs, whereas in set union, it appears only once.

\dfn{Schema Compatibility}{The requirement that two relations involved in a set operation must have the same number of attributes, with matching names and identical data types for each corresponding column.}

\thm{Commutativity and Associativity}{Set union and intersection are both commutative ($R \cup S = S \cup R$) and associative ($(R \cup S) \cup T = R \cup (S \cup T)$), allowing the system to reorder these operations for better performance.}
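
The difference between set and bag union, and the non-commutativity of difference, can be checked on two tiny inputs. The relations $R$ and $S$ below have a single attribute and contain the invented values $a$, $b$, and $c$; writing a bag with curly braces and repeated elements is a notational assumption of this sketch.

\[
\begin{array}{ll}
\mbox{Bags:} & R = \{a, a, b\}, \quad S = \{a, c\}, \quad R \cup S = \{a, a, a, b, c\} \\
\mbox{Sets:} & R = \{a, b\}, \quad S = \{a, c\}, \quad R \cup S = \{a, b, c\} \\
\mbox{Sets:} & R - S = \{b\}, \quad S - R = \{c\}
\end{array}
\]

In the bag case, $a$ occurs twice in $R$ and once in $S$, so it occurs $2 + 1 = 3$ times in the union. In the set case it occurs once. The last row shows that $R - S \neq S - R$.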

\section{Renaming and Relational Variables}

In complex queries, it is often necessary to change the name of an attribute or the relation itself to avoid ambiguity or to prepare a relation for a set operation. The renaming operator, denoted by the Greek letter rho ($\rho$), allows for this modification. This is particularly useful when joining a relation with itself, as it provides a way to distinguish between the two copies.

The concept of a ``relvar'' (relational variable) is also important here. A relvar is essentially a variable that has a name and is assigned a specific relation. In algebraic expressions, we use these names to refer to the data stored within the tables.

\dfn{Renaming}{An operator $\rho_S(R)$ that returns a new relation identical to $R$ in content but renamed to $S$. The extended form $\rho_{S(A_1, \ldots, A_n)}(R)$ additionally renames the attributes of $R$, in order, to $A_1, \ldots, A_n$.}
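
A self-join shows why renaming is needed. The sketch below assumes a hypothetical relation $\mathit{Parent}(\mathit{parent}, \mathit{child})$ with the invented tuples $(\mathrm{Ann}, \mathrm{Bob})$ and $(\mathrm{Bob}, \mathrm{Cy})$, and uses the names $P2$, $\mathit{c2}$, and $\mathit{gc}$ purely for illustration. It also uses the Cartesian product $\times$, which pairs every tuple of one relation with every tuple of the other.

\[
\pi_{\mathit{parent}, \mathit{gc}}\Bigl(\sigma_{\mathit{child} = \mathit{c2}}\bigl(\mathit{Parent} \times \rho_{P2(\mathit{c2}, \mathit{gc})}(\mathit{Parent})\bigr)\Bigr) = \{(\mathrm{Ann}, \mathrm{Cy})\}
\]

Without the renaming, both copies would expose attributes named parent and child, and the condition could not say which copy it refers to. With it, the only pair that satisfies $\mathit{child} = \mathit{c2}$ combines $(\mathrm{Ann}, \mathrm{Bob})$ from the first copy with $(\mathrm{Bob}, \mathrm{Cy})$ from the second, which yields the grandparent pair shown.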

\section{Combining Relations: Products and Joins}

The most complex operations in relational algebra involve combining information from different relations. The Cartesian Product ($\times$) is the most basic of these, pairing every tuple of the first relation with every tuple of the second. While mathematically simple, the product often produces very large relations that contain many irrelevant pairings.

Joins are more refined versions of the product. The Theta-Join ($\bowtie_C$) performs a product followed by a selection based on a specific condition $C$. The most common join is the Natural Join ($\bowtie$), which automatically pairs tuples that have equal values in all attributes shared by the two relations. After the pairing, it removes the redundant columns, leaving a cleaner result.

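\dfn{Cartesian Product}{A binary operator $R \times S$ that produces a relation whose schema is the union of the schemas of $R$ and $S$, and whose tuples are all possible concatenations of a tuple from $R$ and a tuple from $S$. If $R$ and $S$ share an attribute name $A$, the two copies are distinguished in the result, for example as $R.A$ and $S.A$.}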

\dfn{Natural Join}{A specific type of join $R \bowtie S$ that connects tuples based on equality across all common attributes and then projects out the duplicate columns.}

\thm{The Join-Product Relationship}{A theta-join $R \bowtie_C S$ is mathematically equivalent to the expression $\sigma_C(R \times S)$. This relationship allows query optimizers to choose between different physical execution strategies for the same logical request.}
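
The equivalence can be traced on a small example. Assume two hypothetical relations $R(a, b) = \{(1, 2), (3, 4)\}$ and $S(b, c) = \{(2, 5), (4, 6), (4, 7)\}$; the contents are invented, and the prefixes $R.b$ and $S.b$ are used here to tell the two attributes named $b$ apart.

\[
\begin{array}{rcl}
|R \times S| & = & 2 \cdot 3 = 6 \\
R \bowtie_{R.b < S.b} S & = & \sigma_{R.b < S.b}(R \times S) = \{(1, 2, 4, 6), (1, 2, 4, 7)\} \\
R \bowtie S & = & \{(1, 2, 5), (3, 4, 6), (3, 4, 7)\}
\end{array}
\]

The theta-join result has the four-attribute schema $(a, R.b, S.b, c)$ because no columns are merged. The natural join instead matches on $R.b = S.b$ and keeps a single $b$ column, giving the schema $(a, b, c)$.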

\section{Linear Notation and Expression Trees}

Because every operator takes relations as input and returns a relation, complex queries are built by nesting operations. These can be represented in two main ways. Linear notation involves a sequence of assignments to temporary variables, making the steps of a query easier to read. Alternatively, expression trees provide a graphical representation where leaves are stored relations and interior nodes are algebraic operators.
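
For instance, a query that nests a projection around a selection can be written in either form. The relation $\mathit{Emp}(\mathit{name}, \mathit{dept}, \mathit{salary})$, the temporary name $T_1$, and the use of $:=$ for assignment are assumptions of this sketch.

\[
\begin{array}{ll}
\mbox{Nested:} & \pi_{\mathit{name}}(\sigma_{\mathit{salary} > 60}(\mathit{Emp})) \\
\mbox{Linear:} & T_1 := \sigma_{\mathit{salary} > 60}(\mathit{Emp}) \\
 & \mathit{Answer} := \pi_{\mathit{name}}(T_1)
\end{array}
\]

The corresponding expression tree has $\mathit{Emp}$ as its only leaf, $\sigma_{\mathit{salary} > 60}$ as the node above it, and $\pi_{\mathit{name}}$ as the root.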

The query processor uses these trees to represent the flow of data. By applying algebraic laws, the processor can ``push'' selections and projections down the tree, closer to the stored relations. This reduces the size of intermediate relations as early as possible, which is a hallmark of efficient query execution.

\thm{Selection Pushing}{In a query tree, moving a selection $\sigma$ below other operators like joins or unions is almost always beneficial, as it reduces the number of tuples that subsequent, more expensive operators must process.}
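
Two standard algebraic laws behind this rule can be stated as follows, where the first assumes that the condition $C$ mentions only attributes of $R$:

\[
\sigma_C(R \bowtie S) = \sigma_C(R) \bowtie S,
\qquad
\sigma_C(R \cup S) = \sigma_C(R) \cup \sigma_C(S).
\]

As a rough illustration with invented sizes, suppose $R$ and $S$ each hold 1000 tuples and $C$ keeps 10 tuples of $R$. A join that naively compares every pair of tuples examines $1000 \cdot 1000 = 1{,}000{,}000$ pairs on the left-hand side of the first law but only $10 \cdot 1000 = 10{,}000$ pairs on the right-hand side. Real systems use more refined join algorithms and cost models, so these counts are only indicative.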

\section{Extended Relational Algebra}

To meet the requirements of SQL, the basic algebra is often extended with additional operators. These include duplicate elimination ($\delta$), which explicitly turns a bag into a set; sorting ($\tau$), which orders the tuples of a relation; and grouping and aggregation ($\gamma$), which partitions tuples into groups and calculates summaries like sums or averages.

While these operators go beyond the original mathematical definition of the algebra, they are essential for practical database management. They allow the algebra to serve as a complete intermediate language for translating SQL queries into physical instructions for the machine.

\dfn{Duplicate Elimination}{The operator $\delta(R)$ that takes a bag $R$ as input and returns a set containing exactly one copy of every distinct tuple found in the input.}

\dfn{Aggregation}{The application of functions such as SUM, AVG, MIN, MAX, or COUNT to a column of a relation to produce a single summary value, or one summary value per group when combined with grouping ($\gamma$).}
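
A short worked example may help. It assumes a hypothetical relation $\mathit{Emp}(\mathit{name}, \mathit{dept}, \mathit{salary})$ with the invented tuples (Ana, IT, 50), (Ben, IT, 70), and (Cle, HR, 50), and it adopts the common textbook convention of writing the grouping attributes and aggregates as a subscript of $\gamma$, with an arrow naming the result column.

\[
\gamma_{\mathit{dept},\ \mathrm{SUM}(\mathit{salary}) \rightarrow \mathit{total}}(\mathit{Emp}) = \{(\mathrm{IT}, 120), (\mathrm{HR}, 50)\},
\qquad
\delta(\{a, a, b\}) = \{a, b\}.
\]

The grouping operator forms one group per distinct value of dept and returns one row per group: $50 + 70 = 120$ for IT and $50$ for HR. The second equation shows duplicate elimination applied to a bag of three tuples.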