Notes V.2.0.0

Rewrote Notes
2026-01-07 13:51:33 +01:00
parent c1878069fd
commit bcd2ddfe42
13 changed files with 787 additions and 623 deletions
--- a/sections/relational_algebra.tex
+++ b/sections/relational_algebra.tex
@@ -1,67 +1,83 @@
 \chapter{Relational Algebra}

-Relational algebra serves as the formal foundation for the manipulation of data within the relational model. It is a procedural language consisting of a set of operators that take one or more relations as input and produce a new relation as output. This mathematical framework is essential for database systems because it provides a precise way to represent queries and allows the system's query optimizer to reorganize expressions into more efficient execution plans. Unlike general-purpose programming languages such as C or Java, relational algebra is intentionally limited in power. For example, it cannot perform arbitrary calculations like factorials or determine if the number of tuples in a relation is even or odd. However, this limitation is a strategic advantage, as it enables the database engine to perform high-level optimizations that would be impossible in a more complex language.
+Relational algebra serves as the formal mathematical foundation for manipulating data within the relational model. While Data Definition Language (DDL) is concerned with the static structure of the database, relational algebra provides the dynamic framework for the Data Manipulation Language (DML). It is a notation consisting of a set of operators that take one or more relations as input and produce a new relation as output. This operational approach allows for the expression of complex queries by nesting and combining simpler operations.

-The algebra is characterized by the property of closure, where every result is a relation that can immediately serve as an input for another operation. Operators are generally categorized into unary operators, which act on a single relation, and binary operators, which combine two relations. The core operations include set-based maneuvers—such as union, intersection, and difference—and relational-specific maneuvers like selection, projection, joins, and renaming. In modern database implementations, these concepts are often extended to support bag semantics (allowing duplicates) and complex operations like grouping and sorting.
+The power of relational algebra lies in its ability to abstract away from the physical storage of data and focus on the logical transformation of information. It is to relations what basic arithmetic is to numbers or matrix algebra is to matrices. By defining precise rules for how tables are filtered, combined, and restructured, it ensures that query results are predictable and mathematically sound.

-\dfn{Relational Algebra}{A collection of mathematical operators that function on relations to perform data retrieval and manipulation. It is closed under its operations, ensuring that the output of any expression is itself a relation.}
+\section{The Concept of Relational Variables and Closure}

-\thm{The Power of Optimization}{By limiting the expressive power of the query language to relational algebra, database systems can effectively optimize code. This allows the system to replace inefficient execution strategies with mathematically equivalent but significantly faster algorithms.}
+A fundamental aspect of relational algebra is the way it identifies and treats data structures. In most mathematical contexts, an object does not inherently know the variable name assigned to it. However, in database theory, we often use the term "relvar" (relational variable) to describe a relation that is explicitly associated with a name. This allows the system to refer to stored data and intermediate results throughout the execution of a query.

-\section{The Unary Operators: Selection and Projection}
+Another critical property of relational algebra is closure. Because the inputs and the output of every algebraic operator are relations, operators can be nested to any depth. This is analogous to how adding two integers always yields another integer, which allows complex arithmetic expressions to be built from simple ones.

-Selection and projection are the primary filters used to reduce the size of a relation. Selection, denoted by the Greek letter sigma ($\sigma$), acts as a horizontal filter. It examines each tuple in a relation and retains only those that satisfy a specific logical condition. This condition can involve attribute comparisons to constants or other attributes using standard operators such as equality, inequality, and logical connectors like AND and OR.
+\thm{The Property of Closure}{In relational algebra, the result of any operation is always another relation. This ensures that the output of one operator can serve as the valid input for any subsequent operator in a query tree.}

-Projection, denoted by the Greek letter pi ($\pi$), acts as a vertical filter. It is used to choose specific columns from a relation while discarding others. In its classical set-based form, projection also serves to eliminate any duplicate tuples that may arise when certain attributes are removed. This ensures that the result remains a valid set. Extended versions of projection allow for the creation of new attributes through calculations or renamings of existing fields.
+\dfn{Relational Variable (Relvar)}{A relvar is a named variable that is assigned a specific relation as its value, effectively allowing the database to track and manipulate data through an explicit identifier.}

-\dfn{Selection}{An operation $\sigma_C(R)$ that produces a relation containing all tuples from $R$ that satisfy the condition $C$. It does not change the schema of the relation but reduces the number of rows.}
+\section{Unary Operators: Selection, Projection, and Renaming}

-\dfn{Projection}{An operation $\pi_L(R)$ that creates a new relation consisting of only the attributes listed in $L$. It transforms the schema of the relation and may reduce the number of rows if duplicates are removed.}
+Unary operators are those that act upon a single relation. The three primary unary operators are selection, projection, and renaming. They allow a user to isolate specific rows, keep specific columns, or change the names attached to a relation and its attributes.

-\section{Set Operations and Compatibility Constraints}
+Selection, denoted by the Greek letter sigma ($\sigma$), acts as a horizontal filter. It extracts only those records (tuples) that satisfy a specific condition, known as a predicate. This predicate can involve logical comparisons, arithmetic, and boolean operators. Importantly, selection does not change the schema of the table; the output has the exact same attributes and domains as the input.

-Relational algebra incorporates standard set operations: union ($\cup$), intersection ($\cap$), and difference ($-$). Because these operations are inherited from set theory, they require the participating relations to be "compatible." This means the relations must share the same schema—specifically, they must have the same set of attributes, and the domains (data types) associated with corresponding attributes must be identical.
+Projection, denoted by the Greek letter pi ($\pi$), serves as a vertical filter. It allows a user to choose a specific subset of attributes from a relation, discarding the rest. Since the output is still a relation (under set semantics), any duplicate rows that might appear because of the removal of identifying columns must be eliminated.

-Union combines all tuples from two relations into a single result. Intersection identifies tuples that appear in both input relations. Difference, which is not commutative, returns tuples found in the first relation but not the second. While these were originally defined for sets, modern systems often apply them to "bags" (multisets), where the rules for handling duplicates differ. For instance, in bag union, the number of occurrences of a tuple is the sum of its occurrences in the inputs, whereas in set union, it appears only once.
+Renaming, denoted by the Greek letter rho ($\rho$), does not change the data within a relation but alters the metadata. It can be used to change the name of the relation itself or the names of specific attributes. This is often necessary when joining a table with itself or preparing for set operations where attribute names must match.

-\dfn{Schema Compatibility}{The requirement that two relations involved in a set operation must have the same number of attributes, with matching names and identical data types for each corresponding column.}
+\dfn{Selection ($\sigma$)}{The selection operator identifies and retrieves a subset of tuples from a relation that meet a defined logical condition.}

-\thm{Commutativity and Associativity}{Set union and intersection are both commutative ($R \cup S = S \cup R$) and associative $((R \cup S) \cup T = R \cup (S \cup T))$, allowing the system to reorder these operations for better performance.}
+\dfn{Projection ($\pi$)}{The projection operator creates a new relation consisting only of a specified subset of attributes from the original relation.}
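+
+As a small illustration, assume a hypothetical relation $\mathit{Person}(\mathit{name}, \mathit{age}, \mathit{city})$ with three tuples; the relation and its values are invented for this sketch and do not come from the notes. Selection keeps the schema and drops the row that fails the predicate, while projection onto $\mathit{city}$ collapses the repeated value under set semantics:
+\[
+\begin{array}{rcl}
+\mathit{Person} & = & \{(\mathrm{Ann}, 34, \mathrm{Rome}),\ (\mathrm{Bob}, 28, \mathrm{Rome}),\ (\mathrm{Cid}, 41, \mathrm{Oslo})\} \\
+\sigma_{\mathit{age} > 30}(\mathit{Person}) & = & \{(\mathrm{Ann}, 34, \mathrm{Rome}),\ (\mathrm{Cid}, 41, \mathrm{Oslo})\} \\
+\pi_{\mathit{city}}(\mathit{Person}) & = & \{(\mathrm{Rome}),\ (\mathrm{Oslo})\}
+\end{array}
+\]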

-\section{Renaming and Relational Variables}
+\nt{In modern query processors, an extended version of projection is often used. This allows not only the selection of attributes but also the creation of new columns through calculations or string manipulations based on existing data.}

-In complex queries, it is often necessary to change the name of an attribute or the relation itself to avoid ambiguity or to prepare a relation for a set operation. The renaming operator, denoted by the Greek letter rho ($\rho$), allows for this modification. This is particularly useful when joining a relation with itself, as it provides a way to distinguish between the two copies.
+\section{Binary Set Operations}

-The concept of a "relvar" (relational variable) is also important here. A relvar is essentially a variable that has a name and is assigned a specific relation. In algebraic expressions, we use these names to refer to the data stored within the tables.
+Relational algebra incorporates traditional set theory operations, including union ($\cup$), intersection ($\cap$), and subtraction ($-$). However, these cannot be applied to any two arbitrary tables. They require the operands to be "union-compatible." This means the two relations must share the exact same set of attributes, and each corresponding attribute must share the same domain.

-\dfn{Renaming}{An operator $\rho_S(R)$ that returns a new relation identical to $R$ in content but renamed to $S$. It can also be used as $\rho_{S(A_1, ..., A_n)}(R)$ to rename individual attributes.}
+\thm{Rules for Set Operations}{For two relations $R$ and $S$ to participate in a union, intersection, or difference, they must:
+	\begin{enumerate}
+		\item Possess the same set of attributes.
+		\item Have identical domains for each corresponding attribute.
+		\item (When relations are treated mathematically, as sets of ordered tuples drawn from a Cartesian product of domains) List their attributes in the same order.
+	\end{enumerate}}

-\section{Combining Relations: Products and Joins}
+The union of $R$ and $S$ includes all tuples that appear in either $R$, $S$, or both. The intersection includes only those tuples found in both relations. Subtraction (or set difference) retrieves tuples that are present in the first relation but not the second. It is important to note that while union and intersection are commutative, subtraction is not; the order of operands changes the result.
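+
+A minimal worked example, assuming two hypothetical union-compatible relations $R$ and $S$ over a single integer attribute $A$ (values invented for illustration), makes the asymmetry of difference visible:
+\[
+\begin{array}{rclcrcl}
+R & = & \{(1), (2), (3)\} & \quad & S & = & \{(2), (3), (4)\} \\
+R \cup S & = & \{(1), (2), (3), (4)\} & \quad & R \cap S & = & \{(2), (3)\} \\
+R - S & = & \{(1)\} & \quad & S - R & = & \{(4)\}
+\end{array}
+\]
+The same values also illustrate that intersection can be derived from difference: $R - (R - S) = \{(2), (3)\} = R \cap S$.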

-The most complex operations in relational algebra involve combining information from different relations. The Cartesian Product ($\times$) is the most basic of these, pairing every tuple of the first relation with every tuple of the second. While mathematically simple, the product often produces very large relations that contain many irrelevant pairings.
+\section{Joining Relations: Products and Joins}

-Joins are more refined versions of the product. The Theta-Join ($\bowtie_C$) performs a product followed by a selection based on a specific condition $C$. The most common join is the Natural Join ($\bowtie$), which automatically pairs tuples that have equal values in all attributes shared by the two relations. After the pairing, it removes the redundant columns, leaving a cleaner result.
+Combining data from different relations is achieved through joining operations. The most basic of these is the Cartesian Product ($\times$), which pairs every tuple of one relation with every tuple of another. If $R$ has $n$ tuples and $S$ has $m$ tuples, the product has $n \cdot m$ tuples, so the operation is expensive and rarely used on its own in practice: most of the resulting pairings are irrelevant to the query.

-\dfn{Cartesian Product}{A binary operator $R \times S$ that produces a relation whose schema is the union of the schemas of $R$ and $S$, and whose tuples are all possible concatenations of a tuple from $R$ and a tuple from $S$.}
+To make combinations more meaningful, we use the Join operator ($\bowtie$). The natural join looks for attributes common to both relations, pairs tuples only when they share identical values for those common attributes, and keeps a single copy of each common attribute in the result. A more general version is the Theta-join, which pairs tuples based on an arbitrary condition (such as "greater than" or "not equal") rather than just simple equality.

-\dfn{Natural Join}{A specific type of join $R \bowtie S$ that connects tuples based on equality across all common attributes and then projects out the duplicate columns.}
+\thm{The Equivalence of Theta-Joins}{Any Theta-join can be expressed as a Cartesian product followed immediately by a selection operation. Formally: $R \bowtie_{\theta} S = \sigma_{\theta}(R \times S)$.}
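+
+To see the equivalence on concrete data, assume hypothetical relations $R(A) = \{(1), (3)\}$ and $S(B) = \{(2), (4)\}$ and the condition $\theta$: $A < B$ (all invented for this sketch). Building the product first and filtering afterwards gives:
+\[
+\begin{array}{rcl}
+R \times S & = & \{(1, 2),\ (1, 4),\ (3, 2),\ (3, 4)\} \\
+R \bowtie_{A < B} S = \sigma_{A < B}(R \times S) & = & \{(1, 2),\ (1, 4),\ (3, 4)\}
+\end{array}
+\]
+Only the pair $(3, 2)$ fails the condition and is discarded.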

-\thm{The Join-Product Relationship}{A theta-join $R \bowtie_C S$ is mathematically equivalent to the expression $\sigma_C(R \times S)$. This relationship allows query optimizers to choose between different physical execution strategies for the same logical request.}
+\nt{A "dangling tuple" refers to a record that does not find a match in the other relation during a join. In standard joins, these tuples are discarded from the result.}

-\section{Linear Notation and Expression Trees}
+\section{Bag Semantics and Extended Relational Algebra}

-Because relational algebra is a functional language, complex queries are built by nesting operations. These can be represented in two main ways. Linear notation involves a sequence of assignments to temporary variables, making the steps of a query easier to read. Alternatively, expression trees provide a graphical representation where leaves are stored relations and interior nodes are algebraic operators.
+While mathematical relational algebra assumes set semantics (where every element is unique), real-world SQL systems often use bag semantics. In a bag, the same tuple can appear multiple times. This affects how set operations are calculated. For example, in a bag union, if a tuple appears $m$ times in $R$ and $n$ times in $S$, it will appear $m+n$ times in the result.
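+
+The same counting idea extends to the other two set operations. The rules for intersection and difference below are the usual textbook definitions for bags and are an addition to these notes, so treat them as an assumption about the intended semantics. For a tuple that appears $m$ times in $R$ and $n$ times in $S$, the number of occurrences in the result is:
+\[
+R \cup S:\ m + n \qquad R \cap S:\ \min(m, n) \qquad R - S:\ \max(0,\ m - n)
+\]
+For two hypothetical tuples $a$ and $b$ with invented multiplicities, the counts work out as follows:
+\[
+\begin{array}{c|cc|ccc}
+\mathrm{tuple} & R & S & R \cup S & R \cap S & R - S \\ \hline
+a & 2 & 1 & 3 & 1 & 1 \\
+b & 1 & 2 & 3 & 1 & 0
+\end{array}
+\]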

-The query processor uses these trees to visualize the flow of data. By applying algebraic laws, the processor can "push" selections and projections down the tree, closer to the data sources. This reduces the size of intermediate relations as early as possible, which is a hallmark of efficient query execution.
+Extended relational algebra introduces operators to handle these practical requirements. These include the duplicate elimination operator ($\delta$), which turns a bag into a set, and the sorting operator ($\tau$), which treats the relation as a list to arrange tuples by specific values.

-\thm{Selection Pushing}{In a query tree, moving a selection $\sigma$ below other operators like joins or unions is almost always beneficial, as it reduces the number of tuples that subsequent, more expensive operators must process.}
+The grouping and aggregation operator, denoted by gamma ($\gamma$), is perhaps the most powerful extended operator. It partitions tuples into groups based on "grouping keys" and applies an aggregate function—such as SUM, AVG, MIN, MAX, or COUNT—to each group.

-\section{Extended Relational Algebra}
+\dfn{Aggregate Function}{A function that summarizes a collection of values from a column to produce a single representative value, such as a total or an average.}
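+
+As a sketch of the notation, assume a hypothetical relation $\mathit{Person}(\mathit{name}, \mathit{age}, \mathit{city}) = \{(\mathrm{Ann}, 34, \mathrm{Rome}),\ (\mathrm{Bob}, 28, \mathrm{Rome}),\ (\mathrm{Cid}, 41, \mathrm{Oslo})\}$, invented for illustration. The arrow used to name the aggregated column is one common textbook convention and is an assumption here, as is the attribute name $\mathit{avgAge}$. Grouping by $\mathit{city}$ and averaging $\mathit{age}$ gives:
+\[
+\gamma_{\mathit{city},\ \mathrm{AVG}(\mathit{age}) \rightarrow \mathit{avgAge}}(\mathit{Person}) = \{(\mathrm{Rome}, 31),\ (\mathrm{Oslo}, 41)\}
+\]
+The Rome group averages $34$ and $28$ to $31$, while the Oslo group holds a single tuple.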

-To meet the requirements of SQL, the basic algebra is often extended with additional operators. These include duplicate elimination ($\delta$), which explicitly turns a bag into a set; sorting ($\tau$), which orders the tuples of a relation; and grouping and aggregation ($\gamma$), which partitions tuples into groups and calculates summaries like sums or averages.
+\section{Relational Algebra as a Constraint Language}

-While these operators go beyond the original mathematical definition of the algebra, they are essential for practical database management. They allow the algebra to serve as a complete intermediate language for translating SQL queries into physical instructions for the machine.
+Relational algebra is not just for querying; it can also be used to define the rules that data must follow to be considered valid. These constraints ensure the integrity of the database. Many constraints can be expressed in one of two forms: by stating that a specific algebraic expression must result in an empty set ($\emptyset$), or by stating that the result of one expression must be a subset of another.

-\dfn{Duplicate Elimination}{The operator $\delta(R)$ that takes a bag $R$ as input and returns a set containing exactly one copy of every distinct tuple found in the input.}
+A key constraint can be written in the first form: pair every tuple of a table with every tuple of a copy of the same table, select the pairs that agree on the key but differ in some other attribute, and require that this selection be empty. Referential integrity (foreign keys) uses the second form: the projection of a foreign key column in one table must be a subset of the projection of the primary key column in the referenced table.

-\dfn{Aggregation}{The application of functions such as SUM, AVG, MIN, MAX, or COUNT to a column of a relation to produce a single summary value.}
+\thm{Referential Integrity Constraint}{In relational algebra, referential integrity is expressed as the subset inclusion $\pi_{A}(R) \subseteq \pi_{B}(S)$, meaning every value of attribute $A$ in relation $R$ must exist in the set of values of attribute $B$ in relation $S$.}
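+
+The key constraint described in prose above can be sketched in the same style. Assume a hypothetical relation $R(K, A)$ in which $K$ is meant to be a key; $R_1$ and $R_2$ are illustrative names for two renamed copies of $R$:
+\[
+\sigma_{R_1.K = R_2.K \,\land\, R_1.A \neq R_2.A}\bigl(\rho_{R_1}(R) \times \rho_{R_2}(R)\bigr) = \emptyset
+\]
+Any tuple on the left-hand side would be a pair of rows that agree on $K$ but disagree on $A$, which is exactly a key violation.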
+
+\section{Relational Algebra and Database Modifications}
+
+The concepts of relational algebra also extend to how we modify the database state.
+\begin{itemize}
+	\item \textbf{Deletion}: Removing tuples from a relation $R$ can be modeled as $R := R - \sigma_C(R)$, where $C$ is the deletion condition.
+	\item \textbf{Insertion}: Adding tuples is modeled as $R := R \cup S$, where $S$ is the set of new tuples.
+	\item \textbf{Update}: Updating a tuple is logically equivalent to deleting the old version and inserting a new one with modified values.
+\end{itemize}
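+
+Under set semantics, the update case can be written as a single assignment. In this sketch, $C$ selects the tuples to change and $L$ is a hypothetical extended projection list that recomputes the modified attributes and copies the others unchanged:
+\[
+R := \bigl(R - \sigma_C(R)\bigr) \cup \pi_L\bigl(\sigma_C(R)\bigr)
+\]
+For instance, with an assumed relation $\mathit{Person}(\mathit{name}, \mathit{age}, \mathit{city})$, incrementing the age of everyone in Rome would correspond to $C$: $\mathit{city} = \mathrm{Rome}$ and $L = \mathit{name},\ \mathit{age} + 1 \rightarrow \mathit{age},\ \mathit{city}$.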
+
+By viewing modifications through this lens, the system can use algebraic laws to optimize not only how we retrieve data, but also how we maintain it.