information_systems_for_eng…/sections/relational_algebra.tex

\chapter{Relational Algebra}

Relational algebra serves as the formal mathematical foundation for manipulating data within the relational model. While Data Definition Language (DDL) is concerned with the static structure of the database, relational algebra provides the dynamic framework for the Data Manipulation Language (DML). It is a notation consisting of a set of operators that take one or more relations as input and produce a new relation as output. This operational approach allows for the expression of complex queries by nesting and combining simpler operations.

The power of relational algebra lies in its ability to abstract away from the physical storage of data, focusing instead on the logical transformation of information. It is essentially to relational tables what basic arithmetic is to numbers or matrix algebra is to vectors. By defining precise rules for how tables are filtered, combined, and restructured, it ensures that query results are predictable and mathematically sound.

\section{The Concept of Relational Variables and Closure}

A fundamental aspect of relational algebra is the way it identifies and treats data structures. In most mathematical contexts, an object does not inherently know the variable name assigned to it. However, in database theory, we often use the term "relvar" (relational variable) to describe a relation that is explicitly associated with a name. This allows the system to refer to stored data and intermediate results throughout the execution of a query.

Another critical property of relational algebra is closure. This principle dictates that because the input and output of every algebraic operator is a relation, operators can be nested indefinitely. This is identical to how the addition of two integers always results in another integer, allowing for the construction of complex arithmetic expressions.

\thm{The Property of Closure}{In relational algebra, the result of any operation is always another relation. This ensures that the output of one operator can serve as the valid input for any subsequent operator in a query tree.}

\dfn{Relational Variable (Relvar)}{A relvar is a named variable that is assigned a specific relation as its value, effectively allowing the database to track and manipulate data through an explicit identifier.}

\section{Unary Operators: Selection, Projection, and Renaming}

Unary operators are those that act upon a single relation. The three primary unary operators are selection, projection, and renaming. These tools allow a user to isolate specific rows, columns, or change the labels of the data structure.

Selection, denoted by the Greek letter sigma ($\sigma$), acts as a horizontal filter. It extracts only those records (tuples) that satisfy a specific condition, known as a predicate. This predicate can involve logical comparisons, arithmetic, and boolean operators. Importantly, selection does not change the schema of the table; the output has the exact same attributes and domains as the input.

Projection, denoted by the Greek letter pi ($\pi$), serves as a vertical filter. It allows a user to choose a specific subset of attributes from a relation, discarding the rest. Since the output is still a relation (under set semantics), any duplicate rows that might appear because of the removal of identifying columns must be eliminated.

Renaming, denoted by the Greek letter rho ($\rho$), does not change the data within a relation but alters the metadata. It can be used to change the name of the relation itself or the names of specific attributes. This is often necessary when joining a table with itself or preparing for set operations where attribute names must match.

\dfn{Selection ($\sigma$)}{The selection operator identifies and retrieves a subset of tuples from a relation that meet a defined logical condition.}

\dfn{Projection ($\pi$)}{The projection operator creates a new relation consisting only of a specified subset of attributes from the original relation.}

\nt{In modern query processors, an extended version of projection is often used. This allows not only the selection of attributes but also the creation of new columns through calculations or string manipulations based on existing data.}

\section{Binary Set Operations}

Relational algebra incorporates traditional set theory operations, including union ($\cup$), intersection ($\cap$), and subtraction ($-$). However, these cannot be applied to any two arbitrary tables. They require the operands to be "union-compatible." This means the two relations must share the exact same set of attributes, and each corresponding attribute must share the same domain.

\thm{Rules for Set Operations}{For two relations $R$ and $S$ to participate in a union, intersection, or difference, they must:
	\begin{enumerate}
		\item Possess the same set of attributes.
		\item Have identical domains for each corresponding attribute.
		\item (In mathematical relations) Maintain the same order of attributes to satisfy Cartesian product rules.
	\end{enumerate}}

The union of $R$ and $S$ includes all tuples that appear in either $R$, $S$, or both. The intersection includes only those tuples found in both relations. Subtraction (or set difference) retrieves tuples that are present in the first relation but not the second. It is important to note that while union and intersection are commutative, subtraction is not; the order of operands changes the result.

\section{Joining Relations: Products and Joins}

Combining data from different relations is achieved through joining operations. The most basic of these is the Cartesian Product ($\times$), which pairs every tuple of one relation with every tuple of another. While mathematically simple, this operation is computationally expensive and rarely used alone in practice, as it creates a massive amount of often irrelevant data.

To make combinations more meaningful, we use the Join operator ($\bowtie$). The natural join looks for attributes common to both relations and pairs tuples only when they share identical values for those common attributes. A more general version is the Theta-join, which pairs tuples based on an arbitrary condition (such as "greater than" or "not equal") rather than just simple equality.

\thm{The Equivalence of Theta-Joins}{Any Theta-join can be expressed as a Cartesian product followed immediately by a selection operation. Formally: $R \bowtie_{\theta} S = \sigma_{\theta}(R \times S)$.}

\nt{A "dangling tuple" refers to a record that does not find a match in the other relation during a join. In standard joins, these tuples are discarded from the result.}

\section{Bag Semantics and Extended Relational Algebra}

While mathematical relational algebra assumes set semantics (where every element is unique), real-world systems like SQL often utilize bag semantics. In a bag, the same tuple can appear multiple times. This affects how set operations are calculated. For example, in a bag union, if a tuple appears $m$ times in $R$ and $n$ times in $S$, it will appear $m+n$ times in the result.

Extended relational algebra introduces operators to handle these practical requirements. These include the duplicate elimination operator ($\delta$), which turns a bag into a set, and the sorting operator ($\tau$), which treats the relation as a list to arrange tuples by specific values.

The grouping and aggregation operator, denoted by gamma ($\gamma$), is perhaps the most powerful extended operator. It partitions tuples into groups based on "grouping keys" and applies an aggregate function—such as SUM, AVG, MIN, MAX, or COUNT—to each group.

\dfn{Aggregate Function}{A function that summarizes a collection of values from a column to produce a single representative value, such as a total or an average.}

\section{Relational Algebra as a Constraint Language}

Relational algebra is not just for querying; it can also be used to define the rules that data must follow to be considered valid. These constraints ensure the integrity of the database. We can express any constraint by stating that a specific algebraic expression must result in an empty set ($\emptyset$), or that the result of one expression must be a subset of another.

Key constraints can be represented by showing that if we join a table with itself and find two records with the same key but different attribute values, the set of such instances must be empty. Referential integrity (foreign keys) is expressed by asserting that the projection of a foreign key column in one table must be a subset of the projection of the primary key column in the referenced table.

\thm{Referential Integrity Constraint}{In relational algebra, referential integrity is enforced by the subset inclusion: $\pi_{A}(R) \subseteq \pi_{B}(S)$, meaning every value of attribute $A$ in relation $R$ must exist in the set of values of attribute $B$ in relation $S$.}

\section{Relational Algebra and Database Modifications}

The concepts of relational algebra also extend to how we modify the database state.
\begin{itemize}
	\item \textbf{Deletion}: Removing tuples from a relation $R$ can be modeled as $R := R - \sigma_C(R)$, where $C$ is the deletion condition.
	\item \textbf{Insertion}: Adding tuples is modeled as $R := R \cup S$, where $S$ is the set of new tuples.
	\item \textbf{Update}: Updating a tuple is logically equivalent to deleting the old version and inserting a new one with modified values.
\end{itemize}

By viewing modifications through this lens, the system can use algebraic laws to optimize not only how we retrieve data, but also how we maintain it.