Notes V.2.0.0

Rewrote Notes
This commit is contained in:
2026-01-07 13:51:33 +01:00
parent c1878069fd
commit bcd2ddfe42
13 changed files with 787 additions and 623 deletions

\chapter{Data Definition with SQL}
\section{Overview of Data Definition}
Data definition is the fundamental process of specifying the logical structure of a database, often referred to as the schema. In the context of SQL (Structured Query Language), this involves declaring the tables that will store information, identifying the types of data permitted in each column, and establishing rules to maintain the correctness and consistency of that data. The relational model, which serves as the foundation for modern database systems, represents information as two-dimensional tables called relations. By defining these relations, developers create a rigid framework that ensures data independence, allowing the underlying physical storage to be optimized without affecting the high-level queries used by applications.
\section{Mathematical Foundations of Relations}
The concept of a relational table originates from mathematical set theory. At its core, a relation is defined over a series of sets, which are known as attribute domains. While a general relation can represent any subset of a Cartesian product, SQL tables require more specific semantic structures to function effectively as data stores.
A collection of data can be viewed as a set of records, where each record acts as a ``map'', that is, a function from a set of attributes to values. To transform a simple collection into a formal relational table, three specific types of integrity must be enforced.
\dfn{Atomic Integrity}{The requirement that every value in a relational table must be a single, indivisible data item of an elementary type, such as an integer or a string.}
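These two views can be stated formally; the following is only a restatement of the definitions in this section, with $D_1, \dots, D_n$ denoting the attribute domains and $A_1, \dots, A_n$ the attribute names.
\[
R \;\subseteq\; D_1 \times D_2 \times \cdots \times D_n
\]
Equivalently, under the record-as-map view, each tuple $t \in R$ is a function
\[
t : \{A_1, \dots, A_n\} \to \bigcup_{i=1}^{n} D_i
\qquad \text{with} \qquad t(A_i) \in D_i \text{ for every } A_i,
\]
where the shared attribute set $\{A_1, \dots, A_n\}$ expresses relational integrity and the condition $t(A_i) \in D_i$ expresses domain integrity.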
\dfn{Relational Table (Set Semantics)}{A relational table is a set of maps from attribute names to values that satisfies relational integrity, domain integrity, and atomic integrity.}
\thm{Relational Schema}{The formal definition of a relation, comprising its unique name and the set of attributes along with their corresponding data types or domains.}
\thm{The Three Rules of Relational Integrity}{To qualify as a relational table, a collection must fulfill:
\begin{enumerate}
\item \textbf{Relational Integrity}: Every record in the collection must have the same support, meaning they all share the exact same set of attributes.
\item \textbf{Domain Integrity}: Each attribute is associated with a specific domain (type), and every value for that attribute must belong to that domain.
\item \textbf{Atomic Integrity}: Every value in the table must be atomic, meaning it cannot be broken down into smaller components (e.g., no tables within tables).
\end{enumerate}}
\nt{While mathematics primarily uses set semantics (no duplicates), practical SQL implementations often use bag semantics, allowing duplicate records, or list semantics, where the order of records is preserved.}
\section{The Evolution and Nature of SQL}
SQL was developed in the early 1970s at IBM's San Jose research facility, originally under the name SEQUEL (Structured English Query Language). Created by Don Chamberlin and Raymond Boyce, the language was designed to be more intuitive than earlier procedural languages by using English-like syntax. It was later renamed SQL because of a trademark conflict over the name SEQUEL.
The primary characteristic of SQL is that it is a declarative language. Unlike imperative languages such as Java or C++, where the programmer must define exactly how to retrieve or calculate data, a SQL user simply declares what the desired result looks like. The database engine then determines the most efficient way to execute the request.
\thm{Set-Based Processing}{SQL is a set-based language, meaning it manipulates entire relations with a single command rather than processing one record at a time.}
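To make the declarative style concrete, the following minimal sketch asks for a set of rows without prescribing how to find them; the \texttt{movies} table and its attributes are purely illustrative.
\begin{verbatim}
-- Declarative: state WHAT is wanted, not HOW to compute it.
SELECT title
FROM movies
WHERE year >= 2000;
\end{verbatim}
Whether the engine scans the whole table or uses an index is an implementation decision left entirely to the DBMS.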
\section{SQL Data Types and Domains}
Every attribute in a SQL table must be assigned a data type. These types define the nature of the data and the operations that can be performed on it.
\subsection{String Types}
SQL provides several ways to store text. Fixed-length strings are defined as \texttt{char(n)}, where the system reserves exactly $n$ characters. If the input is shorter, it is padded with spaces. Variable-length strings with a specified limit are defined as \texttt{varchar(n)}. For very long text without a specific limit, PostgreSQL uses the \texttt{text} type, while the SQL standard refers to this as \texttt{clob} (Character Large Object).
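A brief sketch of the three options; the table and column names are illustrative.
\begin{verbatim}
CREATE TABLE string_demo (
  code char(4),     -- exactly 4 characters: 'ab' is stored as 'ab  '
  name varchar(50), -- at most 50 characters, no padding
  bio  text         -- unbounded length (PostgreSQL; the standard uses clob)
);
\end{verbatim}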
\subsection{Numeric Types}
Numbers are categorized into exact and approximate types. Exact numbers include integers (\texttt{smallint}, \texttt{integer}, \texttt{bigint}) and fixed-point decimals.
\dfn{Fixed-Point Decimal}{A numeric type defined by \texttt{decimal(p, s)}, where $p$ is the total number of significant digits (precision) and $s$ is the number of digits after the decimal point (scale).}
Approximate numbers are represented as floating-point values using \texttt{real} (single precision) or \texttt{double precision}. These follow the IEEE 754 standard and are highly efficient because they are handled directly by computer hardware.
\subsection{Temporal and Binary Types}
SQL supports complex date and time tracking. The \texttt{date} type follows the Gregorian calendar, while \texttt{time} tracks hours, minutes, and seconds. \texttt{timestamp} combines both, and can optionally include time zone data to handle global information.
\nt{The \texttt{interval} type represents a duration, such as ``two years and four months''. However, there is a ``duration wall'' between months and days because the number of days in a month is variable, making certain additions ambiguous.}
Binary data, such as images or videos, is stored using \texttt{binary(p)}, \texttt{varbinary(p)}, or \texttt{blob} (referred to as \texttt{bytea} in PostgreSQL).
\subsection{Boolean Type}
The \texttt{boolean} type stores the logical values \texttt{TRUE} and \texttt{FALSE}. Because comparisons involving \texttt{NULL} evaluate to \texttt{UNKNOWN}, boolean expressions in SQL follow a three-valued logic.
\section{Advanced Constraints and Deferred Checking}
For constraints that span multiple tables or require global validation, SQL offers assertions. Unlike table-based checks, assertions are standalone schema elements that the DBMS must verify whenever any involved relation is modified. This makes them powerful but potentially expensive to implement efficiently.
In complex transactions where multiple interrelated tables are updated, immediate constraint checking can be problematic. SQL addresses this with deferred checking. By declaring a constraint as \texttt{DEFERRABLE}, the system can postpone validation until the very end of a transaction, just before it commits. This allows for temporary inconsistencies that are resolved by the time the transaction completes its entire sequence of actions.
\section{Active Database Elements: Triggers}
Triggers, or Event-Condition-Action (ECA) rules, represent the transition from a passive database to an active one. A trigger is awakened by a specific event—such as an insertion, deletion, or update—and then evaluates a condition. If the condition is true, the system executes a predefined set of actions.
\dfn{Trigger}{A stored procedure that is automatically invoked by the DBMS in response to specified changes to the database, consisting of a triggering event, a condition, and a resulting action.}
Triggers offer significant flexibility compared to standard constraints. They can be set to execute either \texttt{BEFORE} or \texttt{AFTER} the triggering event and can operate at either the row level (executing for every modified tuple) or the statement level (executing once for the entire SQL command). They are frequently used to enforce complex business rules, maintain audit logs, or automatically fix data inconsistencies that simple constraints cannot handle.
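A sketch of a row-level trigger in SQL-standard style (exact syntax varies by DBMS; PostgreSQL, for example, requires the action to be packaged in a separate function), with illustrative table and column names:
\begin{verbatim}
-- Event: insertion into movies; Condition: negative length;
-- Action: repair the offending value.
CREATE TRIGGER fix_negative_length
AFTER INSERT ON movies
REFERENCING NEW ROW AS n
FOR EACH ROW
WHEN (n.length < 0)
BEGIN ATOMIC
  UPDATE movies SET length = 0 WHERE title = n.title;
END;
\end{verbatim}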
\section{Structural Management of Tables}
The Data Definition Language (DDL) subset of SQL provides commands to manage the lifecycle of tables.
\subsection{Creating and Dropping Tables}
The \texttt{CREATE TABLE} statement is used to define a new relation. It requires a unique table name, a list of attributes, and their associated domains. A newly created table is initially empty.
\nt{In SQL, names are generally case-insensitive. However, if a developer needs to force a specific case for an attribute name, they must surround it with double quotes. Single quotes are reserved exclusively for string literals (values).}
To remove a table entirely from the database, the \texttt{DROP TABLE} command is used. If there is uncertainty about whether a table exists, the \texttt{IF EXISTS} clause can be added to prevent execution errors.
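A minimal sketch combining both statements; the \texttt{movies} table is illustrative.
\begin{verbatim}
DROP TABLE IF EXISTS movies;  -- no error if the table is absent

CREATE TABLE movies (         -- the new table starts out empty
  title  varchar(100),
  year   integer,
  length integer
);
\end{verbatim}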
\subsection{Modifying Tables}
The \texttt{ALTER TABLE} command allows for changes to an existing table's schema. This includes adding new columns, removing existing ones, or renaming attributes and the table itself.
\thm{Adding Columns to Populated Tables}{When a new column is added to a table that already contains data, the system must fill the new attribute for existing rows. By default, it uses \texttt{NULL}, but a specific \texttt{DEFAULT} value can be specified instead.}
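The common forms of \texttt{ALTER TABLE}, sketched against an illustrative \texttt{movies} table:
\begin{verbatim}
-- Existing rows receive 0.0 instead of NULL for the new column.
ALTER TABLE movies ADD COLUMN rating decimal(3, 1) DEFAULT 0.0;
ALTER TABLE movies RENAME COLUMN length TO runtime;
ALTER TABLE movies DROP COLUMN rating;
\end{verbatim}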
\section{Data Population and Manipulation}
While data modification is primarily part of the Data Manipulation Language (DML), it is closely tied to definition through constraints.
\subsection{Insertion Strategies}
The most basic way to populate a table is the \texttt{INSERT INTO} statement followed by \texttt{VALUES}. One can insert a single record or multiple records in one command. If certain columns are omitted, the system will attempt to fill them with \texttt{NULL} or a defined default value.
\thm{Insertion via Subqueries}{Instead of providing explicit values, an \texttt{INSERT} statement can use a \texttt{SELECT} subquery to compute a set of tuples from other tables and insert them into the target relation.}
The system evaluates the subquery in its entirety before any tuple is added to the target table. This matters when the target relation itself appears in the subquery: a newly inserted tuple can never satisfy its own insertion criteria, which prevents infinite loops and inconsistent states.
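Both insertion strategies, sketched with illustrative table names (\texttt{movies} and \texttt{classics} are not defined elsewhere in these notes):
\begin{verbatim}
-- Explicit values; omitted columns receive NULL or their default.
INSERT INTO movies (title, year) VALUES
  ('Metropolis', 1927),
  ('Alien', 1979);

-- Bulk insertion via a subquery, evaluated fully before any row is added.
INSERT INTO classics (title, year)
SELECT title, year
FROM movies
WHERE year < 1950;
\end{verbatim}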
\subsection{Updates and Deletions}
Data can be modified using \texttt{UPDATE}, which changes values in existing tuples based on a condition, or removed using \texttt{DELETE FROM}, which deletes specific rows while keeping the table's structure intact.
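A sketch of both commands; the condition selects which tuples are affected, and the table is illustrative.
\begin{verbatim}
UPDATE movies SET length = 0 WHERE length < 0;  -- modify matching tuples
DELETE FROM movies WHERE year < 1900;           -- remove matching tuples
\end{verbatim}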
\section{Consistency and Integrity Constraints}
Constraints are rules used to prevent the entry of invalid data, effectively enforcing relational and domain integrity at the database level.
\dfn{NULL Value}{A special marker used in SQL to indicate that a data value is unknown, inapplicable, or kept secret. It is not equivalent to zero or an empty string.}
\subsection{Fundamental Constraints}
\begin{itemize}
\item \textbf{NOT NULL}: This ensures that a column can never hold a \texttt{NULL} value (note that an empty string is still a value and is permitted). This is a primary tool for pushing a database toward strict relational integrity.
\item \textbf{UNIQUE}: This requires that every non-null value in a column be distinct. It can be applied to a single column or a combination of columns (table constraint). How \texttt{NULL}s are treated under \texttt{UNIQUE} varies by DBMS implementation.
\item \textbf{CHECK}: This allows for arbitrary conditions that every row must satisfy, such as ensuring a price is positive or a date is within a valid range.
\end{itemize}
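The three constraints can appear together in one declaration; the \texttt{persons} table is illustrative.
\begin{verbatim}
CREATE TABLE persons (
  name       varchar(50) NOT NULL,      -- a value is mandatory
  email      varchar(100) UNIQUE,       -- non-null values must be distinct
  age        integer CHECK (age >= 0),  -- column-level check
  start_date date,
  end_date   date,
  CHECK (start_date <= end_date)        -- table-level check over two columns
);
\end{verbatim}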
\section{Primary and Foreign Keys}
Keys are the most critical constraints in relational design as they define how records are identified and linked.
\subsection{Primary Keys}
A primary key is an attribute or set of attributes that uniquely identifies a row. By definition, a primary key must be \texttt{UNIQUE} and \texttt{NOT NULL}. Every table should ideally have one primary key to ensure each record can be referenced without ambiguity.
\subsection{Foreign Keys and Referential Integrity}
A foreign key is an attribute in one table that references a unique or primary key in another table. This creates a link between the two relations.
\dfn{Foreign Key}{An attribute or set of attributes in a relation that serves as a reference to a primary or unique key in a different relation, enforcing a logical connection between the two.}
\thm{Referential Integrity}{This constraint ensures that every value in a foreign key column must either be \texttt{NULL} or exist in the referenced primary key column of the related table.}
\subsection{Handling Deletions in References}
If a referenced value is deleted, the system must follow a specific policy to maintain integrity.
\nt{Common policies for handling the deletion of a referenced record include:
\begin{itemize}
\item \textbf{CASCADE}: Automatically delete or update the referencing rows.
\item \textbf{RESTRICT/NO ACTION}: Prohibit the deletion if references exist.
\item \textbf{SET NULL}: Reset the foreign key of the referencing rows to \texttt{NULL}.
\item \textbf{SET DEFAULT}: Reset the foreign key to its default value.
\end{itemize}}
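A sketch tying keys, foreign keys, and deletion policies together; the \texttt{studios}/\texttt{movies} pairing is illustrative.
\begin{verbatim}
CREATE TABLE studios (
  name varchar(50) PRIMARY KEY
);

CREATE TABLE movies (
  title  varchar(100) PRIMARY KEY,
  studio varchar(50) REFERENCES studios(name)
         ON DELETE SET NULL   -- deleting a studio nullifies the link
         ON UPDATE CASCADE    -- renaming a studio propagates to movies
);
\end{verbatim}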
\section{Lexical vs. Value Space}
A sophisticated concept in data definition is the distinction between how data is represented and what it actually is. The ``Value Space'' refers to the abstract mathematical object (e.g., the concept of the number four), while the ``Lexical Space'' refers to the various ways that value can be written in a query (e.g., \texttt{'4'}, \texttt{'4.0'}, \texttt{'04'}, or \texttt{'4e0'}). SQL engines are responsible for mapping the various lexical representations to the correct underlying value space to perform comparisons and arithmetic accurately.
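The mapping from lexical to value space can be observed directly; a sketch using standard \texttt{CAST} syntax, assuming a PostgreSQL-like dialect in which comparisons may appear in a \texttt{SELECT} list:
\begin{verbatim}
SELECT CAST('4'   AS integer) = 4,  -- distinct spellings,
       CAST('04'  AS integer) = 4,  -- same underlying value:
       CAST('4.0' AS decimal) = 4;  -- each comparison holds
\end{verbatim}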