\chapter{Data Definition with SQL}
\section{Overview of Data Definition and the SQL Language}
The process of managing data within an information system begins with a rigorous definition of its structure. The Data Definition Language (DDL) serves as the primary tool for database administrators to specify the logical organization of data, known as the schema. Historically, the evolution of database languages was marked by the development of SEQUEL (Structured English Query Language) in the early 1970s at the IBM San Jose Research Laboratory; the language was eventually renamed SQL due to trademark concerns. SQL is a declarative, set-based language, meaning it allows users to specify what data they want to retrieve or manipulate without detailing the step-by-step physical procedures. This abstraction significantly enhances the productivity of developers by separating the conceptual model from the physical storage layer.
Data in a relational system is organized into two-dimensional tables called relations. These relations must adhere to fundamental integrity principles to maintain data quality. Relational integrity ensures that every tuple in a table conforms to the structure defined by the relation's attributes. Domain integrity mandates that every attribute value belongs to a specific, predefined set of acceptable values. Finally, atomic integrity requires that every component of a tuple be an indivisible unit, preventing the use of complex structures such as lists or nested records as attribute values.
\section{Core Domain Types and Atomic Integrity}
A central aspect of data definition is the selection of appropriate domain types for each attribute. SQL provides a rich set of standardized types to handle various data categories. Character data is typically managed through fixed-length strings (\texttt{CHAR}) or variable-length strings with a maximum limit (\texttt{VARCHAR}). For exceptionally large textual content, types such as \texttt{CLOB} or \texttt{TEXT} are utilized. Numeric data is bifurcated into exact and approximate types. Exact numbers include various sizes of integers (\texttt{SMALLINT}, \texttt{INTEGER}, \texttt{BIGINT}) and fixed-point decimals (\texttt{DECIMAL} or \texttt{NUMERIC}), where precision and scale can be strictly defined. Approximate numbers, such as \texttt{REAL} and \texttt{DOUBLE PRECISION}, follow floating-point standards to represent scientific data where a degree of approximation is acceptable.
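The declarations below sketch how these character and numeric domains might appear in practice; the table and attribute names are purely illustrative:

\begin{verbatim}
CREATE TABLE ProductSketch (
    code       CHAR(8),          -- fixed length: always 8 characters, padded
    name       VARCHAR(120),     -- variable length, at most 120 characters
    notes      TEXT,             -- very large text (CLOB in some systems)
    quantity   INTEGER,          -- exact whole number
    price      DECIMAL(10,2),    -- exact fixed point: 10 digits, 2 after the point
    weight_kg  DOUBLE PRECISION  -- approximate floating-point value
);
\end{verbatim}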
\dfn{Atomic Integrity}{The requirement that every value in a relational table must be a single, indivisible data item of an elementary type, such as an integer or a string.}
Temporal data is equally vital, with SQL supporting the \texttt{DATE}, \texttt{TIME}, and \texttt{TIMESTAMP} types. These allow for the storage of specific points in time, often including time zone information for global applications. Furthermore, intervals represent durations, such as ``two years and four months.'' Binary data, such as images or passport scans, is stored using \texttt{BLOB} or \texttt{BYTEA} types. Boolean types provide the foundation for logical operations, supporting \texttt{TRUE}, \texttt{FALSE}, and the three-valued logic involving \texttt{UNKNOWN} when \texttt{NULL} values are present.
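A similar illustrative sketch covers the temporal, binary, and boolean domains; the names are hypothetical, and the binary type in particular varies by system:

\begin{verbatim}
CREATE TABLE EventSketch (
    held_on     DATE,                      -- a calendar date
    starts_at   TIME,                      -- a time of day
    recorded_at TIMESTAMP WITH TIME ZONE,  -- a zone-aware point in time
    duration    INTERVAL YEAR TO MONTH,    -- a span such as 2 years, 4 months
    poster      BLOB,                      -- binary data (BYTEA in PostgreSQL)
    is_public   BOOLEAN                    -- TRUE, FALSE, or NULL (unknown)
);
\end{verbatim}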
\section{Structural Operations: Creating and Modifying Tables}
The lifecycle of a database schema involves the creation, modification, and removal of tables. The \texttt{CREATE TABLE} command is the primary DDL statement used to introduce new relations. It requires the specification of a table name, a list of attributes, and their associated domains. A newly created table is initially empty, representing an extension of zero tuples.
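As a minimal sketch, the following statement creates a hypothetical relation for the movie-tracking example used later in this chapter; immediately after execution, the table exists in the schema but holds no tuples:

\begin{verbatim}
CREATE TABLE Movies (
    title   VARCHAR(100),
    year    INTEGER,
    length  INTEGER,
    studio  CHAR(30)
);
\end{verbatim}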
\thm{Relational Schema}{The formal definition of a relation, comprising its unique name and the set of attributes along with their corresponding data types or domains.}
As requirements change, the \texttt{ALTER TABLE} statement allows administrators to evolve the schema without deleting existing data. This includes adding new columns, which may be initialized with NULL or a specific default value, and renaming or removing existing columns. When a table is no longer required, the \texttt{DROP TABLE} command removes both the schema and all stored data from the system. To avoid errors during automated scripts, the \texttt{IF EXISTS} clause is frequently employed to ensure a command only executes if the target relation is present.
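The statements below illustrate these evolutionary steps on the hypothetical \texttt{Movies} table; note that the exact syntax for renaming a column varies between systems:

\begin{verbatim}
ALTER TABLE Movies ADD COLUMN genre VARCHAR(30) DEFAULT 'unknown';
ALTER TABLE Movies DROP COLUMN length;
ALTER TABLE Movies RENAME COLUMN studio TO studioName;
DROP TABLE IF EXISTS Movies;   -- no error even if Movies is already gone
\end{verbatim}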
\section{Data Manipulation and Logic of Modifications}
Once the schema is defined, Data Manipulation Language (DML) commands are used to populate and maintain the data. The \texttt{INSERT} statement adds new records to a relation. It can take specific values for a single tuple or use a subquery to perform bulk insertions from other tables. A critical rule in SQL modification is that the system must evaluate the query portion of an insertion entirely before any data is actually added to the target table. This prevents infinite loops or inconsistent states where a new tuple might satisfy its own insertion criteria.
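Both forms of insertion can be sketched as follows; \texttt{ClassicMovies} is a hypothetical target table with compatible attributes:

\begin{verbatim}
-- Single-tuple insertion with an explicit attribute list
INSERT INTO Movies (title, year, studio)
VALUES ('Pretty Woman', 1990, 'Disney');

-- Bulk insertion: the subquery is evaluated completely
-- before any tuple is added to ClassicMovies
INSERT INTO ClassicMovies (title, year)
SELECT title, year FROM Movies WHERE year < 1960;
\end{verbatim}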
\dfn{Cascading Rollback}{A situation where the abort of one transaction necessitates the rollback of other dependent transactions that have read data written by the aborted transaction.}
The \texttt{DELETE} and \texttt{UPDATE} commands provide the means to remove or modify existing tuples based on specific conditions. Similar to insertions, these operations apply the condition to every tuple in the relation and execute the change only on those that satisfy the predicate. Through these commands, the system transitions between different database states while aiming to preserve overall consistency.
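These conditional modifications can be sketched on the same hypothetical table:

\begin{verbatim}
-- The condition is tested against every tuple;
-- only the tuples that satisfy it are changed
UPDATE Movies
SET studio = 'Disney'
WHERE studio = 'Touchstone';

-- Remove every tuple satisfying the predicate
DELETE FROM Movies WHERE year < 1920;
\end{verbatim}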
\section{Integrity Constraints and Key Definitions}
Constraints are declarative rules that restrict the data permitted in the database to prevent inaccuracies. The most fundamental constraints are those defining keys. A primary key uniquely identifies each row in a table and, by definition, cannot contain NULL values. In contrast, the \texttt{UNIQUE} constraint ensures distinctness but may permit NULLs depending on the specific DBMS implementation.
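A sketch of both key declarations on a hypothetical \texttt{Studios} relation:

\begin{verbatim}
CREATE TABLE Studios (
    name     CHAR(30) PRIMARY KEY,  -- unique and implicitly NOT NULL
    address  VARCHAR(255),
    presCert INTEGER UNIQUE         -- distinct where present; NULLs may be
                                    -- permitted, depending on the DBMS
);
\end{verbatim}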
\thm{The Thomas Write Rule}{A principle in timestamp-based concurrency control that allows an outdated write operation to be skipped when a transaction with a later timestamp has already written the same data element, thereby preserving the intended final state.}
Beyond keys, \texttt{NOT NULL} constraints ensure that critical attributes always have a value. \texttt{CHECK} constraints provide more complex logic, allowing the system to validate that an attribute or an entire tuple meets specific boolean conditions. For instance, a check could ensure that a person's age is never negative or that a start date precedes an end date.
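Both constraint forms might be declared as in this illustrative sketch:

\begin{verbatim}
CREATE TABLE Employees (
    name     VARCHAR(60) NOT NULL,       -- a value is always required
    age      INTEGER CHECK (age >= 0),   -- attribute-level check
    hired_on DATE,
    left_on  DATE,
    CHECK (left_on IS NULL OR hired_on <= left_on)  -- tuple-level check
);
\end{verbatim}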
\section{Referential Integrity and Foreign Keys}
Referential integrity is maintained through foreign keys, which establish a link between tables. A foreign key in one table must reference a unique or primary key in another. This ensures that the relationship between entities remains valid; for example, every movie in a tracking system must be associated with an existing studio.
\dfn{Foreign Key}{An attribute or set of attributes in a relation that serves as a reference to a primary or unique key in a different relation, enforcing a logical connection between the two.}
The management of these links during data removal or updates is governed by specific policies. The \texttt{CASCADE} policy ensures that changes in the parent table are automatically reflected in the child table. Alternatively, the \texttt{SET NULL} policy breaks the link by nullifying the foreign key when the referenced record is deleted. If neither is appropriate, the \texttt{RESTRICT} policy blocks any modification that would break referential integrity.
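A sketch combining a foreign key with two of these policies, assuming the hypothetical \texttt{Studios} table from the previous section:

\begin{verbatim}
CREATE TABLE Movies (
    title  VARCHAR(100),
    year   INTEGER,
    studio CHAR(30) REFERENCES Studios(name)
        ON DELETE SET NULL   -- deleting a studio nullifies the link
        ON UPDATE CASCADE    -- renaming a studio propagates to its movies
);
\end{verbatim}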
\section{Advanced Constraints and Deferred Checking}
For constraints that span multiple tables or require global validation, SQL offers assertions. Unlike table-based checks, assertions are standalone schema elements that the DBMS must verify whenever any involved relation is modified. This makes them powerful but potentially expensive to implement efficiently.
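The standard syntax can be sketched as below, although \texttt{CREATE ASSERTION} is rarely implemented by mainstream systems; the rule itself (no studio may hold more than 100 movies) is invented for illustration:

\begin{verbatim}
CREATE ASSERTION LimitedCatalog CHECK (
    NOT EXISTS (
        SELECT studio FROM Movies
        GROUP BY studio
        HAVING COUNT(*) > 100
    )
);
\end{verbatim}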
\thm{Two-Phase Locking (2PL)}{A concurrency control protocol that guarantees conflict-serializability by requiring that all lock acquisitions by a transaction must occur before any of its locks are released.}
In complex transactions where multiple interrelated tables are updated, immediate constraint checking can be problematic. SQL addresses this with deferred checking. By declaring a constraint as \texttt{DEFERRABLE}, the system can postpone validation until the very end of a transaction, just before it commits. This allows for temporary inconsistencies that are resolved by the time the transaction completes its entire sequence of actions.
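A hedged sketch of a deferrable foreign key, inserting a movie before its studio exists; the dangling reference is tolerated until commit (transaction delimiters vary slightly by system):

\begin{verbatim}
CREATE TABLE Movies (
    title  VARCHAR(100),
    studio CHAR(30) REFERENCES Studios(name)
        DEFERRABLE INITIALLY DEFERRED  -- checked at commit, not per statement
);

BEGIN;
INSERT INTO Movies  VALUES ('Logan', 'Fox');             -- temporarily dangling
INSERT INTO Studios VALUES ('Fox', 'Los Angeles', NULL); -- resolves the link
COMMIT;                                                  -- validation happens here
\end{verbatim}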
\section{Active Database Elements: Triggers}
Triggers, or Event-Condition-Action (ECA) rules, represent the transition from a passive database to an active one. A trigger is awakened by a specific event—such as an insertion, deletion, or update—and then evaluates a condition. If the condition is true, the system executes a predefined set of actions.
\dfn{Trigger}{A stored procedure that is automatically invoked by the DBMS in response to specified changes to the database, consisting of a triggering event, a condition, and a resulting action.}
Triggers offer significant flexibility compared to standard constraints. They can be set to execute either \texttt{BEFORE} or \texttt{AFTER} the triggering event and can operate at either the row level (executing for every modified tuple) or the statement level (executing once for the entire SQL command). They are frequently used to enforce complex business rules, maintain audit logs, or automatically fix data inconsistencies that simple constraints cannot handle.
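The SQL standard's ECA syntax can be sketched as follows; this hypothetical row-level trigger repairs a negative length after an update (production systems such as PostgreSQL use dialect-specific variants of this syntax):

\begin{verbatim}
CREATE TRIGGER FixNegativeLength
AFTER UPDATE OF length ON Movies     -- Event
REFERENCING NEW ROW AS NewTuple
FOR EACH ROW                         -- row-level granularity
WHEN (NewTuple.length < 0)           -- Condition
UPDATE Movies SET length = 0
    WHERE title = NewTuple.title;    -- Action
\end{verbatim}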