information_systems_for_eng…/sections/data_definition_sql.tex

\chapter{Data Definition with SQL}

\section{Overview of Data Definition}

Data definition is the fundamental process of specifying the logical structure of a database, often referred to as the schema. In the context of SQL (Structured Query Language), this involves declaring the tables that will store information, identifying the types of data permitted in each column, and establishing rules to maintain the correctness and consistency of that data. The relational model, which serves as the foundation for modern database systems, represents information as two-dimensional tables called relations. By defining these relations, developers create a rigid framework that ensures data independence, allowing the underlying physical storage to be optimized without affecting the high-level queries used by applications.

\section{Mathematical Foundations of Relations}

The concept of a relational table originates from mathematical set theory. At its core, a relation is defined over a series of sets, which are known as attribute domains. While a general relation can represent any subset of a Cartesian product, SQL tables require more specific semantic structures to function effectively as data stores.

A collection of data can be viewed as a set of records, where each record acts as a "map" or a function from a set of attributes to values. To transform a simple collection into a formal relational table, three specific types of integrity must be enforced.

\dfn{Relational Table (Set Semantics)}{A relational table is a set of maps from attribute names to values that satisfies relational integrity, domain integrity, and atomic integrity.}

\thm{The Three Rules of Relational Integrity}{To qualify as a relational table, a collection must fulfill:

	\begin{enumerate}
		\item \textbf{Relational Integrity}: Every record in the collection must have the same support, meaning they all share the exact same set of attributes.
		\item \textbf{Domain Integrity}: Each attribute is associated with a specific domain (type), and every value for that attribute must belong to that domain.
		\item \textbf{Atomic Integrity}: Every value in the table must be atomic, meaning it cannot be broken down into smaller components (e.g., no tables within tables).
	\end{enumerate}}

\nt{While mathematics primarily uses set semantics (no duplicates), practical SQL implementations often utilize bag semantics, allowing for duplicate records, or list semantics, where the order of records is preserved.}

\section{The Evolution and Nature of SQL}

SQL was developed in the early 1970s at IBM’s San Jose research facility, originally under the name SEQUEL (Structured English Query Language). Created by Don Chamberlin and Raymond Boyce, the language was designed to be more intuitive than earlier procedural languages by using English-like syntax.

The primary characteristic of SQL is that it is a declarative language. Unlike imperative languages such as Java or C++, where the programmer must define exactly how to retrieve or calculate data, a SQL user simply declares what the desired result looks like. The database engine then determines the most efficient way to execute the request.

\thm{Set-Based Processing}{SQL is a set-based language, meaning it manipulates entire relations with a single command rather than processing one record at a time.}

\section{SQL Data Types and Domains}

Every attribute in a SQL table must be assigned a data type. These types define the nature of the data and the operations that can be performed on it.

\subsection{String Types}

SQL provides several ways to store text. Fixed-length strings are defined as \texttt{char(n)}, where the system reserves exactly $n$ characters. If the input is shorter, it is padded with spaces. Variable-length strings with a specified limit are defined as \texttt{varchar(n)}. For very long text without a specific limit, PostgreSQL uses the \texttt{text} type, while the SQL standard refers to this as \texttt{clob} (Character Large Object).

\subsection{Numeric Types}

Numbers are categorized into exact and approximate types. Exact numbers include integers (\texttt{smallint}, \texttt{integer}, \texttt{bigint}) and fixed-point decimals.

\dfn{Fixed-Point Decimal}{A numeric type defined by \texttt{decimal(p, s)}, where $p$ is the total number of significant digits (precision) and $s$ is the number of digits after the decimal point (scale).}

Approximate numbers are represented as floating-point values using \texttt{real} (single precision) or \texttt{double precision}. These follow the IEEE 754 standard and are highly efficient because they are handled directly by computer hardware.

\subsection{Temporal and Binary Types}

SQL supports complex date and time tracking. The \texttt{date} type follows the Gregorian calendar, while \texttt{time} tracks hours, minutes, and seconds. \texttt{timestamp} combines both, and can optionally include time zone data to handle global information.

\nt{The \texttt{interval} type represents a duration. However, there is a "duration wall" between months and days because the number of days in a month is variable, making certain additions ambiguous.}

Binary data, such as images or videos, is stored using \texttt{binary(p)}, \texttt{varbinary(p)}, or \texttt{blob} (referred to as \texttt{bytea} in PostgreSQL).

\section{Structural Management of Tables}

The Data Definition Language (DDL) subset of SQL provides commands to manage the lifecycle of tables.

\subsection{Creating and Dropping Tables}

The \texttt{CREATE TABLE} statement is used to define a new relation. It requires a unique table name, a list of attributes, and their associated domains. A newly created table is initially empty.

\nt{In SQL, names are generally case-insensitive. However, if a developer needs to force a specific case for an attribute name, they must surround it with double quotes. Single quotes are reserved exclusively for string literals (values).}

To remove a table entirely from the database, the \texttt{DROP TABLE} command is used. If there is uncertainty about whether a table exists, the \texttt{IF EXISTS} clause can be added to prevent execution errors.

\subsection{Modifying Tables}

The \texttt{ALTER TABLE} command allows for changes to an existing table's schema. This includes adding new columns, removing existing ones, or renaming attributes and the table itself.

\thm{Adding Columns to Populated Tables}{When a new column is added to a table that already contains data, the system must fill the new attribute for existing rows. By default, it uses \texttt{NULL}, but a specific \texttt{DEFAULT} value can be specified instead.}

\section{Data Population and Manipulation}

While data modification is primarily part of the Data Manipulation Language (DML), it is closely tied to definition through constraints.

\subsection{Insertion Strategies}

The most basic way to populate a table is the \texttt{INSERT INTO} statement followed by \texttt{VALUES}. One can insert a single record or multiple records in one command. If certain columns are omitted, the system will attempt to fill them with \texttt{NULL} or a defined default value.

\thm{Insertion via Subqueries}{Instead of providing explicit values, an \texttt{INSERT} statement can use a \texttt{SELECT} subquery to compute a set of tuples from other tables and insert them into the target relation.}

\subsection{Updates and Deletions}

Data can be modified using \texttt{UPDATE}, which changes values in existing tuples based on a condition, or removed using \texttt{DELETE FROM}, which deletes specific rows while keeping the table's structure intact.

\section{Consistency and Integrity Constraints}

Constraints are rules used to prevent the entry of invalid data, effectively enforcing relational and domain integrity at the database level.

\dfn{NULL Value}{A special marker used in SQL to indicate that a data value is unknown, inapplicable, or kept secret. It is not equivalent to zero or an empty string.}

\subsection{Fundamental Constraints}

\begin{itemize}
	\item \textbf{NOT NULL}: This ensures that a column cannot have an empty or unknown value. This is a primary tool for pushing a database toward strict relational integrity.
	\item \textbf{UNIQUE}: This requires that every non-null value in a column be distinct. It can be applied to a single column or a combination of columns (table constraint).
	\item \textbf{CHECK}: This allows for arbitrary conditions that every row must satisfy, such as ensuring a price is positive or a date is within a valid range.
\end{itemize}

\section{Primary and Foreign Keys}

Keys are the most critical constraints in relational design as they define how records are identified and linked.

\subsection{Primary Keys}

A primary key is an attribute or set of attributes that uniquely identifies a row. By definition, a primary key must be \texttt{UNIQUE} and \texttt{NOT NULL}. Every table should ideally have one primary key to ensure each record can be referenced without ambiguity.

\subsection{Foreign Keys and Referential Integrity}

A foreign key is an attribute in one table that references a unique or primary key in another table. This creates a link between the two relations.

\thm{Referential Integrity}{This constraint ensures that every value in a foreign key column must either be \texttt{NULL} or exist in the referenced primary key column of the related table.}

\subsection{Handling Deletions in References}

If a referenced value is deleted, the system must follow a specific policy to maintain integrity.

\nt{Common policies for handling the deletion of a referenced record include:

	\begin{itemize}
		\item \textbf{CASCADE}: Automatically delete or update the referencing rows.
		\item \textbf{RESTRICT/NO ACTION}: Prohibit the deletion if references exist.
		\item \textbf{SET NULL}: Reset the foreign key of the referencing rows to \texttt{NULL}.
		\item \textbf{SET DEFAULT}: Reset the foreign key to its default value.
	\end{itemize}}

\section{Lexical vs. Value Space}

A sophisticated concept in data definition is the distinction between how data is represented and what it actually is. The "Value Space" refers to the abstract mathematical object (e.g., the concept of the number four), while the "Lexical Space" refers to the various ways that value can be written in a query (e.g., '4', '4.0', '04', or '4e0'). SQL engines are responsible for mapping various lexical representations to the correct underlying value space to perform comparisons and arithmetic accurately.