Notes V.2.0.0

Rewrote Notes
2026-01-07 13:51:33 +01:00
parent c1878069fd
commit bcd2ddfe42
13 changed files with 787 additions and 623 deletions

@@ -1,131 +1,83 @@
\chapter{The Relational Model}
The relational model stands as the preeminent framework for managing data in contemporary information systems. Historically, the organization of data in tabular formats is a practice that stretches back nearly four millennia, beginning with early clay tablets. However, the modern digital iteration was pioneered by Edgar Codd in 1970. Codd's primary contribution was the principle of data independence, which strictly separates the logical representation of information from its physical implementation on storage devices. Before this shift, programmers were often required to understand the underlying physical structure of the data to perform even basic queries. The relational model replaced these complex, system-dependent methods with a high-level abstraction based on tables, which are referred to as relations.
The relational model serves as the theoretical cornerstone of modern database systems, providing a structured yet flexible framework for data management. Proposed by Edgar Codd in 1970, this model revolutionized the field by introducing the principle of data independence. This principle decouples the logical representation of data—how users perceive and interact with it—from its physical storage on hardware. By representing information through intuitive two-dimensional tables, the model bridges the gap between complex mathematical theory and practical business applications. Interestingly, the tabular format is not a modern invention; historical evidence shows that humans have used clay tablets for relational data organization since at least 1800 BC. This enduring utility underscores the model's alignment with human cognitive patterns for managing structured facts.
In this model, data is represented as a collection of two-dimensional structures. This approach is simple and versatile, so anything from corporate records to scientific data can be modeled effectively. By restricting operations to a limited set of high-level queries, the relational model lets the database management system optimize aggressively, often performing tasks more efficiently than code written in general-purpose languages. This chapter details the structure, mathematical foundations, and design theories (specifically functional dependencies and normalization) that ensure data remains consistent and free from redundancy.
\thm{Data Independence}{The separation of the logical data model from the physical storage implementation, allowing changes to the machine-level storage without affecting user queries or the logical view of the data.}
\section{Core Terminology and Structural Components}
The architecture of the relational model is defined by a specific set of terms that describe both the structure and the content of the data.
In relational theory, specific terminology is used to describe the components of a database, often with synonyms used across different technical and business contexts. The primary structure is the relation, commonly referred to as a table. A relation consists of a set of attributes, which are the named columns that define the properties of the data stored. The set of these attributes, combined with the name of the relation itself, constitutes the relation schema.
\dfn{Attribute}{
An attribute is a named header for a column in a relation. It describes the meaning of the entries within that column. For example, in a table tracking information about movies, ``title'' and ``year'' would be typical attributes.
}
\dfn{Relation Schema}{The formal description of a relation, comprising its name and a set of attributes, typically denoted as $R(A_1, A_2, \dots, A_n)$.}
\dfn{Tuple}{
A tuple is a single row in a relation, excluding the header row. It represents a specific instance of the entity described by the relation. A tuple contains one component for every attribute defined in the relation's schema.
}
Each entry within a relation is called a tuple, which corresponds to a row in a table or a record in a file. A tuple contains a specific value for each attribute defined in the schema. These values, often called scalars, represent individual facts or characteristics.
\dfn{Relation Schema}{
A relation schema consists of the name of the relation and the set of attributes associated with it. This is typically expressed as $R(A_1, A_2, \dots, A_n)$. A database schema is the total collection of all relation schemas within a system.
}
\dfn{Tuple}{A single row or record within a relation, representing a specific instance of the entity or business object described by the schema.}
\dfn{Relation Instance}{
A relation instance is the specific set of tuples present in a relation at any given time. While schemas are relatively static, instances change frequently as data is inserted, updated, or deleted.
}
\nt{While mathematicians prefer to index tuples by numbers, database scientists identify components by their attribute names to provide semantic clarity.}
The components of a tuple must be atomic, meaning they are elementary types like integers or strings. The model explicitly forbids complex structures such as nested lists or sets as individual values. Every attribute is associated with a domain, which defines the set of permissible values or the specific data type for that column.
\section{Domains and Atomic Values}
Every attribute in a relation is associated with a domain. A domain is essentially a data type or a set of permissible values that can appear in a specific column. For example, a ``year'' attribute might be restricted to the domain of integers, while a ``name'' attribute is restricted to the domain of character strings. A fundamental requirement of the standard relational model is that these values must be atomic. This means they cannot be further decomposed into smaller components, such as nested tables, lists, or sets. This requirement is formally known as the First Normal Form.
\dfn{Domain}{A set of values of a specific elementary type from which an attribute draws its components.}
\thm{Atomic Integrity}{The rule that every component of every tuple must be an indivisible, elementary value rather than a structured or repeating group.}
\section{Mathematical Foundations of Relations}
Mathematically, a relation is defined as a subset of the Cartesian product of the domains of its attributes. If an attribute $A$ has a domain $D$, then the entries in the column for $A$ must be elements of $D$. A record can be viewed as a partial function or a ``map'' from the set of attribute names to a set of atomic values.
The relational model is built upon the mathematical concept of the Cartesian product. Given a family of domains $D_1, D_2, \dots, D_n$, a relation is defined as a subset of the Cartesian product $D_1 \times D_2 \times \dots \times D_n$. Each element of this subset is an $n$-tuple. This mathematical approach ensures that domain integrity and relational integrity are maintained by definition, as every value must belong to its prescribed set.
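As a minimal sketch of this definition, the following Python snippet builds the Cartesian product of two small domains and checks that a relation is a subset of it. The domains \texttt{titles} and \texttt{years} and their contents are assumptions invented for illustration.

```python
from itertools import product

# Hypothetical domains, chosen only for illustration.
titles = {"Star Wars", "Alien"}
years = {1977, 1979}

# The Cartesian product contains every possible (title, year) pair.
all_pairs = set(product(titles, years))

# A relation is any subset of that product.
movies = {("Star Wars", 1977), ("Alien", 1979)}

print(len(all_pairs))       # 4 possible pairs
print(movies <= all_pairs)  # True: every tuple draws its values from the domains
```

Because the relation is a subset of the product, each component necessarily belongs to its domain, which is the sense in which domain integrity holds by definition.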
\thm{Relation as a Set}{
In the abstract mathematical model, a relation is a set of tuples. This implies that the order of the rows is irrelevant and that every tuple must be unique. Furthermore, because attributes are a set, the order of columns does not change the identity of the relation, provided the components of the tuples are reordered to match.
}
\dfn{Cartesian Product}{The set of all possible ordered tuples that can be formed by taking one element from each of the participating sets or domains.}
While the theoretical model relies on set semantics, practical implementations often utilize alternate semantics:
\begin{itemize}
\item \textbf{Bag Semantics}: Used by SQL, this allows for duplicate records within a table.
\item \textbf{List Semantics}: In this variation, the specific sequence of the records is preserved and carries meaning.
\end{itemize}
An alternative mathematical representation views a record as a map. In this perspective, a record $t$ is a partial function from a set of attribute names to a global set of values. This mapping approach is often preferred because it makes the order of attributes irrelevant, reflecting how databases actually operate in practice.
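The map view can be sketched with Python dictionaries, which already behave like functions from attribute names to values. The attribute names and values below are illustrative assumptions.

```python
# The same record written with its attributes in two different orders.
t1 = {"title": "Alien", "year": 1979}
t2 = {"year": 1979, "title": "Alien"}

# As maps, the two records are equal: attribute order carries no meaning.
same_record = (t1 == t2)

# The support of a record is the set of attribute names it is defined on.
support = set(t1)

# Under set semantics, a relation holding both spellings contains one tuple.
relation = {frozenset(t.items()) for t in (t1, t2)}

print(same_record, sorted(support), len(relation))
```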
A database is formally defined as a set of these relational tables. To interact with this data, the model employs relational algebra, a system of operators that take one or more relations as input and produce a new relation as output.
\nt{In a relation, the order of both the attributes and the tuples is immaterial; a relation remains the same regardless of how its rows or columns are permuted.}
\section{Integrity Constraints and Consistency}
\section{Integrity and Consistency Rules}
To ensure the validity of data, the relational model enforces several categories of integrity.
For a collection of data to be considered a valid relational table, it must adhere to three primary integrity rules. These rules ensure the consistency and predictability of the data.
\thm{Relational Integrity}{
The requirement that every record within a specific relation must possess the exact same set of attributes. Broken relational integrity occurs if attributes are missing or if redundant attributes appear in individual rows.
}
\thm{Atomic Integrity}{
Also known as the First Normal Form (1NF), this rule dictates that every value in a cell must be a single, indivisible unit. Complex data types cannot be stored within a single attribute field.
}
\thm{Domain Integrity}{
This constraint requires that every value for an attribute must belong to the predefined set of values or the data type associated with its domain.
}
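The three rules can be checked mechanically. The sketch below is one possible checker, not a standard algorithm: the function \texttt{check\_integrity}, the schema encoding (attribute name mapped to a Python type), and the sample rows are all assumptions made for illustration.

```python
def check_integrity(rows, schema):
    """Return (row index, violated rule) pairs for rows given as dicts.

    `schema` maps each attribute name to the Python type of its domain.
    """
    problems = []
    for i, row in enumerate(rows):
        # Relational integrity: the row has exactly the schema's attributes.
        if set(row) != set(schema):
            problems.append((i, "relational"))
            continue
        for attr, value in row.items():
            # Atomic integrity: no nested collections inside a cell.
            if isinstance(value, (list, set, dict, tuple)):
                problems.append((i, "atomic"))
            # Domain integrity: the value has the type of its domain.
            elif not isinstance(value, schema[attr]):
                problems.append((i, "domain"))
    return problems


schema = {"title": str, "year": int}
rows = [
    {"title": "Alien", "year": 1979},         # valid
    {"title": "Star Wars"},                   # missing attribute
    {"title": "Heat", "year": [1995, 1996]},  # nested list in a cell
    {"title": "Jaws", "year": "1975"},        # string where an integer is required
]
violations = check_integrity(rows, schema)
print(violations)
```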
\section{Defining Relation Schemas in SQL}
SQL (Structured Query Language) is the primary tool for implementing the relational model. It is divided into the Data-Definition Language (DDL) for creating and modifying schemas, and the Data-Manipulation Language (DML) for querying and updating data. The most fundamental command in DDL is the \texttt{CREATE TABLE} statement, which establishes the table name, its attributes, and their types.
\subsection{SQL Data Types}
Attributes must be assigned a primitive type. Common SQL types include:
\begin{itemize}
\item \textbf{CHAR(n)}: A fixed-length string of $n$ characters.
\item \textbf{VARCHAR(n)}: A variable-length string up to $n$ characters.
\item \textbf{INT / INTEGER}: Standard whole numbers.
\item \textbf{FLOAT / REAL}: Floating-point numbers.
\item \textbf{BOOLEAN}: Stores TRUE, FALSE, or UNKNOWN.
\item \textbf{DATE / TIME}: Specific formats for calendar dates (e.g., YYYY-MM-DD) and clock times.
\end{itemize}
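A \texttt{CREATE TABLE} statement using some of these types can be tried from Python with the standard \texttt{sqlite3} module. This is only a sketch: the \texttt{Movies} table and its columns are assumed for illustration, and SQLite treats declared types as loose affinities, so it does not enforce the lengths in \texttt{CHAR(n)} or \texttt{VARCHAR(n)} the way stricter systems do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: declare the relation name, its attributes, and their types.
conn.execute("""
    CREATE TABLE Movies (
        title  VARCHAR(100),
        year   INT,
        length INT,
        genre  CHAR(10)
    )
""")

# DML: insert one tuple and query it back.
conn.execute("INSERT INTO Movies VALUES (?, ?, ?, ?)", ("Alien", 1979, 117, "sciFi"))
rows = conn.execute("SELECT title, year FROM Movies").fetchall()

# The schema can be read back; the second field of each PRAGMA row is the column name.
columns = [info[1] for info in conn.execute("PRAGMA table_info(Movies)")]
print(columns, rows)
```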
\subsection{Keys and Uniqueness}
\dfn{Key}{
A key is a set of one or more attributes such that no two tuples in any possible relation instance can share the same values for all these attributes. A key must be minimal; no subset of its attributes can also be a key.
}
In SQL, keys are declared using the \texttt{PRIMARY KEY} or \texttt{UNIQUE} keywords. Attributes designated as a primary key are forbidden from containing NULL values, whereas \texttt{UNIQUE} columns may allow them depending on the system.
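The effect of a key declaration can be observed with Python's \texttt{sqlite3} module. In this sketch the \texttt{Movies} table, the choice of \texttt{(title, year)} as its key, and the inserted values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Movies (
        title  VARCHAR(100),
        year   INT,
        length INT,
        PRIMARY KEY (title, year)
    )
""")

conn.execute("INSERT INTO Movies VALUES ('King Kong', 1933, 100)")
# Same title but a different year: the key values differ, so this is allowed.
conn.execute("INSERT INTO Movies VALUES ('King Kong', 2005, 187)")

try:
    # Agrees with an existing tuple on every key attribute: rejected.
    conn.execute("INSERT INTO Movies VALUES ('King Kong', 1933, 99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute("SELECT COUNT(*) FROM Movies").fetchone()[0]
print(duplicate_rejected, count)
```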
\section{Functional Dependencies}
A central concept in database design theory is the functional dependency (FD), which generalizes the idea of a key.
\thm{Functional Dependency}{
A functional dependency on a relation $R$ is an assertion that if two tuples agree on a set of attributes $A_1, \dots, A_n$, they must also agree on another set of attributes $B_1, \dots, B_m$. This is written as $A_1 A_2 \dots A_n \rightarrow B_1 B_2 \dots B_m$, or $A \rightarrow B$ for short when $A$ and $B$ denote the two attribute sets.
}
FDs are not merely observations about a specific instance of data but are constraints that must hold for every possible legal instance of the relation. They describe the relationships between attributes; for example, a movie's title and year might functionally determine its length and studio, as there is only one specific length and studio for a unique movie released in a given year.
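Since an FD is a constraint on all legal instances, a single instance can refute an FD but can never prove it. The following sketch tests whether a given instance is consistent with an FD; the helper \texttt{fd\_holds} and the sample rows are assumptions for illustration.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the instance `rows` (a list of dicts) satisfies lhs -> rhs."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[b] for b in rhs)
        # Two tuples that agree on lhs must also agree on rhs.
        if seen.setdefault(left, right) != right:
            return False
    return True


movies = [
    {"title": "Star Wars", "year": 1977, "length": 124, "studio": "Fox"},
    {"title": "Alien", "year": 1979, "length": 117, "studio": "Fox"},
    {"title": "King Kong", "year": 1933, "length": 100, "studio": "RKO"},
    {"title": "King Kong", "year": 2005, "length": 187, "studio": "Universal"},
]

print(fd_holds(movies, ["title", "year"], ["length", "studio"]))  # consistent with the FD
print(fd_holds(movies, ["title"], ["studio"]))                    # refuted by the two King Kong rows
```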
\dfn{Superkey}{
A superkey is a set of attributes that contains a key as a subset. Therefore, every superkey functionally determines all attributes of the relation, but it may not be minimal.
}
The closure of a set of attributes under a set of FDs is the collection of all attributes that are functionally determined by that set. Calculating the closure allows designers to identify all keys of a relation and test if a new FD follows from the existing ones.
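The closure can be computed by repeatedly applying every FD whose left side is already contained in the result, until nothing changes. The sketch below assumes single-letter attribute names, so that a string such as \texttt{"AB"} stands for the set $\{A, B\}$; the relation and its FDs are a made-up example.

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs if lhs is already covered and rhs adds something new.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# Hypothetical FDs over R(A, B, C, D, E, F).
fds = [("AB", "C"), ("BC", "AD"), ("D", "E"), ("CF", "B")]

print(sorted(closure("AB", fds)))  # ['A', 'B', 'C', 'D', 'E']
# D -> A follows from the FDs only if A is in the closure of {D}.
print("A" in closure("D", fds))    # False
```

Here $\{A, B\}$ is not a key of $R$, because $F$ is missing from its closure.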
\section{Anomalies and the Need for Decomposition}
Careless schema design leads to ``anomalies,'' which are problems that occur when too much information is crammed into a single table. There are three primary types:
\begin{enumerate}
\item \textbf{Redundancy}: Information is repeated unnecessarily across multiple rows (e.g., repeating a studio's address for every movie they made).
\item \textbf{Update Anomalies}: If a piece of redundant information changes, it must be updated in every row. Failure to do so leads to inconsistent data.
\item \textbf{Deletion Anomalies}: Deleting a record might inadvertently destroy the only copy of unrelated information (e.g., deleting the last movie of a studio might remove the studio's address from the database entirely).
\item \textbf{Relational Integrity:} This requires that all records within a specific table have the exact same set of attributes. A table cannot have ``holes'' or missing attributes in some rows but not others.
\item \textbf{Atomic Integrity:} As previously noted, this prohibits the nesting of structures within a cell. A value must be a single fact.
\item \textbf{Domain Integrity:} This ensures that every value in a column is of the same kind, matching the type specified for that attribute in the schema.
\end{enumerate}
To eliminate these issues, designers use decomposition—the process of splitting a relation into two or more smaller relations whose attributes, when combined, include all the original attributes.
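A decomposition can be sketched as two projections, and its losslessness checked by joining the pieces back together. The relation, the studio addresses, and the helper \texttt{project} below are invented for illustration.

```python
# A single wide relation that repeats the studio address for every movie.
movies = [
    {"title": "Star Wars", "year": 1977, "studio": "Fox", "address": "Los Angeles"},
    {"title": "Alien", "year": 1979, "studio": "Fox", "address": "Los Angeles"},
    {"title": "Jaws", "year": 1975, "studio": "Universal", "address": "Universal City"},
]


def project(rows, attrs):
    """Project rows onto attrs; the result is a set, so duplicates disappear."""
    return {tuple(row[a] for a in attrs) for row in rows}


movie_studio = project(movies, ["title", "year", "studio"])
studio_address = project(movies, ["studio", "address"])

# Natural join on the shared attribute `studio`.
rejoined = {
    (title, year, studio, address)
    for (title, year, studio) in movie_studio
    for (studio2, address) in studio_address
    if studio == studio2
}
original = project(movies, ["title", "year", "studio", "address"])

print(len(studio_address))   # each address is now stored once per studio
print(rejoined == original)  # the join reconstructs the original relation
```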
\dfn{Domain Integrity}{The constraint that every value in a specific column must belong to the domain (data type) associated with that attribute.}
\section{Normal Forms}
\thm{Relational Integrity}{The requirement that every record in a relation must possess the same support, meaning they all share the identical set of attributes defined in the schema.}
The goal of decomposition is to reach a normal form that guarantees the absence of certain anomalies.
\section{Keys and Uniqueness}
\thm{Boyce-Codd Normal Form (BCNF)}{
A relation $R$ is in BCNF if and only if for every nontrivial functional dependency $A \rightarrow B$, the set of attributes $A$ is a superkey. In simpler terms, the left side of every nontrivial dependency must contain a key.
}
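The BCNF condition can be tested with attribute closures: a left side is a superkey exactly when its closure contains every attribute of the relation. The sketch below checks only the FDs it is given, and the relation, FDs, and function names are assumptions for illustration.

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Return the nontrivial FDs whose left side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs                              # skip trivial FDs
        and not attributes <= closure(lhs, fds)        # lhs fails to determine everything
    ]


attributes = {"title", "year", "studio", "address"}
fds = [
    ({"title", "year"}, {"studio"}),
    ({"studio"}, {"address"}),
]
violations = bcnf_violations(attributes, fds)
print(violations)  # only studio -> address violates BCNF
```

In this hypothetical schema, \texttt{studio} determines \texttt{address} without being a superkey, which is exactly the situation that causes the studio address to be repeated for every movie.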
To distinguish between tuples, the relational model relies on the concept of keys. A key is a set of one or more attributes that uniquely identifies a tuple within a relation instance. No two tuples in a valid relation can share the same values for all attributes in the key. Typically, one key is designated as the primary key.
Any relation can be decomposed into a collection of BCNF relations. This process effectively removes redundancy caused by functional dependencies. However, while BCNF is very powerful, it does not always preserve all original dependencies. This leads to the use of a slightly relaxed condition.
\dfn{Primary Key}{A specific attribute or minimal set of attributes chosen to uniquely identify each tuple in a relation, often indicated in a schema by underlining the attributes.}
\thm{Third Normal Form (3NF)}{
A relation $R$ is in 3NF if for every nontrivial FD $A \rightarrow B$, either $A$ is a superkey, or every attribute in $B$ that is not in $A$ is ``prime'' (a member of some key).
}
\nt{Identifying a primary key is essential for establishing relationships between different tables and maintaining data accuracy.}
3NF is useful because it is always possible to find a decomposition that is both lossless (the original data can be reconstructed) and dependency-preserving, which is not always true for BCNF.
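To make the contrast concrete, consider a small worked example in the style of standard textbook treatments; the relation and its attributes are illustrative assumptions. Let $R(\textit{title}, \textit{theater}, \textit{city})$ with the FDs
\[
\textit{theater} \rightarrow \textit{city}, \qquad \textit{title}\;\textit{city} \rightarrow \textit{theater}.
\]
The keys are $\{\textit{title}, \textit{city}\}$ and $\{\textit{theater}, \textit{title}\}$, so every attribute is prime. The FD $\textit{theater} \rightarrow \textit{city}$ violates BCNF because $\{\textit{theater}\}$ is not a superkey, yet $R$ is in 3NF because $\textit{city}$ is prime. Decomposing $R$ into $R_1(\textit{theater}, \textit{city})$ and $R_2(\textit{theater}, \textit{title})$ yields two BCNF relations, but the FD $\textit{title}\;\textit{city} \rightarrow \textit{theater}$ can no longer be checked within either relation alone, so the decomposition is not dependency-preserving.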
\section{Relation Instances and Temporal Change}
\section{Modifying and Removing Schemas}
A relation is not a static object; it changes over time as tuples are inserted, deleted, or updated. The set of tuples present in a relation at any given moment is called an instance. Standard database systems typically only maintain the ``current instance,'' representing the data as it exists right now. Changing a schema (adding or deleting columns) is a much more significant and expensive operation than changing an instance, as it may require restructuring every tuple currently stored.
Database structures are dynamic. SQL provides the \texttt{DROP TABLE} command to remove a relation and all its data permanently. For structural changes, the \texttt{ALTER TABLE} command is used. This allows for the addition of new attributes via \texttt{ADD} or the removal of existing ones via \texttt{DROP}. When new columns are added, existing tuples typically receive a \texttt{NULL} value or a specified \texttt{DEFAULT} value.
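These commands can be tried with Python's \texttt{sqlite3} module. The sketch below assumes a hypothetical \texttt{MovieStar} table and shows how an existing tuple is filled in when columns are added; other systems accept slightly different \texttt{ALTER TABLE} syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MovieStar (name VARCHAR(30), birthdate DATE)")
conn.execute("INSERT INTO MovieStar VALUES ('Example Star', '1970-01-01')")

# Add one column without a default and one with a default.
conn.execute("ALTER TABLE MovieStar ADD COLUMN phone CHAR(16)")
conn.execute("ALTER TABLE MovieStar ADD COLUMN gender CHAR(1) DEFAULT '?'")

# The existing tuple receives NULL (None in Python) and the default value.
row = conn.execute("SELECT name, phone, gender FROM MovieStar").fetchone()
print(row)

# DROP TABLE removes the relation together with all of its tuples.
conn.execute("DROP TABLE MovieStar")
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(remaining)
```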
\dfn{Relation Instance}{The specific set of tuples contained within a relation at a given point in time.}
\section{Alternative Storage Semantics}
While the classical relational model is based on set semantics, where duplicate tuples are strictly forbidden, practical implementations often utilize different semantics based on the needs of the system.
\begin{enumerate}
\item \textbf{Set Semantics:} No duplicate records are allowed.
\item \textbf{Bag Semantics:} Duplicate records are permitted. This is common in SQL results, as eliminating duplicates is computationally expensive.
\item \textbf{List Semantics:} The specific order of the records is preserved and significant.
\end{enumerate}
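The three semantics can be compared on the same data using Python's built-in collections. The list of genres below is an assumed projection of a single column, in row order.

```python
from collections import Counter

# Hypothetical result of projecting a genre column, in row order.
genres = ["drama", "sciFi", "drama", "comedy"]

as_list = list(genres)    # list semantics: order and duplicates both matter
as_bag = Counter(genres)  # bag semantics: duplicates counted, order ignored
as_set = set(genres)      # set semantics: neither duplicates nor order

reordered = list(reversed(genres))

print(as_list == reordered)             # False: a different list
print(as_bag == Counter(reordered))     # True: the same bag
print(as_bag["drama"], len(as_set))     # 2 copies in the bag, 3 distinct values in the set
```

Producing the set requires detecting duplicates, which is the extra work that bag semantics avoids.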
\thm{Bag Semantics}{A variation of the relational model where duplicate tuples are allowed to exist within a relation, often used to improve the efficiency of query operations.}
\nt{The choice between set, bag, and list semantics is often a trade-off between mathematical purity and the performance requirements of a real-world database engine.}
\section{Conclusion}
The relational model's power lies in its simplicity and its firm mathematical grounding. By treating data as a collection of relations and providing a clear set of integrity rules, it allows for the creation of robust, scalable information systems. The use of schemas provides a stable contract for applications, while the principle of data independence ensures that the system can evolve technologically without breaking the logical structures that users depend on.
\nt{The relational model effectively acts as the ``physics'' of data, providing the laws that govern how digital information is structured and transformed.}