Notes V.2.0.0

Rewrote Notes
This commit is contained in:
2026-01-07 13:51:33 +01:00
parent c1878069fd
commit bcd2ddfe42
13 changed files with 787 additions and 623 deletions


\chapter{Transactions and the Three Tiers}
The evolution of data management has shifted from localized, single-machine installations to complex, multi-tiered architectures that support massive user bases across the globe. This chapter explores the foundational structures of modern information systems, focusing on how databases operate within a server environment: the three-tier architecture that connects users to data, the hierarchical organization of data into environments, clusters, catalogs, and schemas, and the call-level interfaces, such as JDBC, that let general-purpose programming languages like Java interact with SQL. Central to the discussion is the transaction, a logical unit of work that preserves the integrity and consistency of data even in highly concurrent, distributed settings, through adherence to the ACID properties, the management of isolation levels, and locking protocols such as Two-Phase Locking (2PL).
\dfn{Database Management System}{A specialized software system designed to create, manage, and provide efficient, safe, and persistent access to large volumes of data over long periods of time.}
\section{The Three-Tier Architecture}
Large-scale database installations typically utilize a three-tier architecture to separate functional concerns, which improves scalability, security, and maintainability. This organization also allows different components of the system to run on dedicated hardware, optimizing performance for each specific task.
\dfn{Three-Tier Architecture}{A system organization consisting of three distinct layers: the Web Server tier for user interaction, the Application Server tier for processing logic, and the Database Server tier for data management.}
The first tier consists of \textbf{Web Servers}. These processes act as the entry point for clients, who usually interact with the system through a web browser over the Internet. When a user enters a URL or submits a form, the browser sends an HTTP (Hypertext Transfer Protocol) request to the web server, which responds with an HTML page, possibly including images and other data to be displayed. The browser handles the user's input and transmits it back to the web server, which in turn communicates with the application tier.
\nt{Common web server software includes Apache and Tomcat, which are frequently used in both professional and academic environments to bridge the gap between web browsers and database systems.}
The second tier is the \textbf{Application Server}, often referred to as the \textbf{business logic} layer. This is where the core functionality of the system resides. When the web server receives a request that requires data, it passes the request to this tier, where programmers use languages such as Java, Python, C++, or PHP to encode how the system responds: determining what data is needed, generating the corresponding SQL queries, and formatting the returned results into a programmatically built HTML page or other response. In complex systems, this tier may be divided into subtiers, such as one for object-oriented data handling and another for information integration, where data from multiple disparate sources is combined.
The third tier is the \textbf{Database Server}. These are the processes running the Database Management System (DBMS), such as PostgreSQL or MySQL. This tier executes the query and modification requests issued by the application tier, manages data persistence on disk, and keeps the system responsive through buffering and connection management; it often maintains a pool of open connections that can be shared among application processes, avoiding the overhead of repeatedly opening and closing connections.
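To make the division of labor concrete, here is a deliberately minimal, single-process sketch (an illustration added to these notes, not part of the original material): an HTTP handler stands in for the web tier and delegates to a hypothetical \texttt{businessLogic()} method standing in for the application tier, which in a real system would query the database tier over a pooled connection.
\begin{verbatim}
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class TinyThreeTier {
    public static void main(String[] args) throws Exception {
        // "web tier": answer HTTP requests with an HTML page
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/balance", exchange -> {
            String html = "<html><body>" + businessLogic() + "</body></html>";
            byte[] body = html.getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
    }

    // "application tier": in a real system this method would build an SQL
    // query, send it to the database tier, and format the result
    static String businessLogic() {
        return "Current balance: 100";
    }
}
\end{verbatim}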
\section{The SQL Environment}
Within the database tier, data is organized in a hierarchical framework known as the SQL environment. This structure provides a clear namespace and organizational scope for all database elements.
\dfn{SQL Environment}{The overall framework under which database elements exist and SQL operations are executed, typically representing a specific installation of a DBMS.}
At the top of the hierarchy is the \textbf{Cluster}. A cluster is a collection of catalogs and represents the maximum scope over which a single database operation can occur; in effect, it is the entire database as perceived by a particular user.
Below the cluster is the \textbf{Catalog}. Catalogs organize schemas and are the primary unit for supporting unique naming. Each catalog contains a special schema that holds information about all the other schemas within that catalog.
The most basic unit of organization is the \textbf{Schema}: a collection of database elements such as tables, views, triggers, and assertions. A schema is created with a specific declaration and can be modified over time. The full name of a table therefore has the form \texttt{CatalogName.SchemaName.TableName}; if the catalog or schema is not explicitly specified, the system falls back on the current session's defaults (e.g., \texttt{public} is often the default schema).
\section{Establishing Connections and Sessions}
For a program or a user to interact with the database server, a link must first be established. This is handled through connections and sessions. A connection is the physical or logical link between a SQL client (often the application server) and a SQL server. A user can open multiple connections, but only one of them can be active at any given moment.
\dfn{Session}{The sequence of SQL operations performed while a specific connection is active. It includes state information such as the current catalog, the current schema, and the authorized user.}
When a connection is established, it usually requires an authorization clause containing a username and password. This ensures that the current authorization ID has the necessary privileges to perform the requested actions. In this context, a "Module" refers to the application program code, while a "SQL Agent" is an execution of that code.
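As a small illustration (an addition to these notes, using the standard \texttt{java.sql} API that the JDBC section later in this chapter describes in detail), the following sketch opens a connection whose session runs under the authorization ID \texttt{alice}; the URL, database name, and credentials are placeholders.
\begin{verbatim}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectDemo {
    public static void main(String[] args) throws SQLException {
        // hypothetical PostgreSQL database; modern JDBC drivers on the
        // classpath are loaded automatically
        String url = "jdbc:postgresql://localhost:5432/coursedb";
        try (Connection conn = DriverManager.getConnection(url, "alice", "secret")) {
            // a session is now active: unqualified names resolve against
            // the session's current catalog and schema
            System.out.println("session open for user alice");
        } // closing the connection ends the session
    }
}
\end{verbatim}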
\section{Transactions and the ACID Properties}
Transactions are the fundamental units of work in a database system. A transaction is a single execution of a program, or a batch of queries, that must be treated as an indivisible unit; the goal of the transaction manager is to ensure that the result remains correct even if the system crashes or multiple users access the same records at the same time.
\dfn{Transaction}{A collection of one or more database operations, such as reads and writes, that are grouped together to be executed atomically and in isolation from other concurrent actions.}
To be considered reliable, every transaction must satisfy the \textbf{ACID} test. These four properties are a cornerstone of database design theory.
\thm{ACID Properties}{
\begin{itemize}
\item \textbf{Atomicity:} Often described as "all-or-nothing," this ensures that a transaction is either fully completed or not executed at all. If a failure occurs halfway through, any partial changes must be undone.
\item \textbf{Consistency:} A transaction must take the database from one consistent state to another, satisfying all integrity constraints like primary keys and check constraints.
\item \textbf{Isolation:} Each transaction should run as if it were the only one using the system, regardless of how many other users are active.
\item \textbf{Durability:} Once a transaction has been committed, its effects must persist in the database even in the event of a power outage or system crash.
\end{itemize}}
\nt{Atomicity in transactions should not be confused with atomic values in First Normal Form. In this context, it refers to the indivisibility of the execution process itself.}
Atomicity ensures that if a transaction is interrupted, any partial changes are rolled back, leaving the database as if the transaction never started. Consistency guarantees that a transaction moves the database from one valid state to another, respecting all defined rules. Isolation is managed by a scheduler to ensure that the concurrent execution of multiple transactions results in a state that could have been achieved if they were run one after another. Finally, Durability ensures that once a transaction is committed, its effects will survive even a subsequent system crash.
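To see atomicity and durability in operation, consider this hedged sketch (an illustrative addition, assuming a table \texttt{accounts(id, balance)} and a JDBC connection as described later in the chapter): the two updates of a money transfer are committed together or rolled back together.
\begin{verbatim}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferDemo {
    static void transfer(Connection conn, int from, int to, int amount)
            throws SQLException {
        conn.setAutoCommit(false); // group both updates into one transaction
        try (PreparedStatement debit =
                 conn.prepareStatement("UPDATE accounts SET balance = balance - ? WHERE id = ?");
             PreparedStatement credit =
                 conn.prepareStatement("UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
            debit.setInt(1, amount);  debit.setInt(2, from);  debit.executeUpdate();
            credit.setInt(1, amount); credit.setInt(2, to);   credit.executeUpdate();
            conn.commit();            // durability: effects persist from here on
        } catch (SQLException e) {
            conn.rollback();          // atomicity: undo any partial debit
            throw e;
        }
    }
}
\end{verbatim}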
\section{Concurrency and Isolation Levels}
When multiple transactions run at the same time, their actions may interleave in ways that lead to inconsistencies. A \textbf{schedule} is the actual sequence of actions (reads and writes) performed by these transactions. While a \textbf{serial schedule} (running one transaction after another) is always safe, it is inefficient, so schedulers instead aim for \textbf{serializability}.
\thm{Serializability}{A schedule is serializable if its effect on the database is identical to the effect of some serial execution of the same transactions.}
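As a concrete illustration (a standard textbook example added here): suppose $T_1$ adds $100$ to a balance $A$ while $T_2$ doubles it. The two serial orders leave $A$ at $2(A+100)$ or $2A+100$. The interleaving $r_1(A),\; r_2(A),\; w_1(A),\; w_2(A)$, however, leaves $A$ at $2A$: $T_2$ read the balance before $T_1$ wrote it, so $T_1$'s update is lost. Since this outcome matches neither serial order, the schedule is not serializable.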
If isolation is not properly enforced, several kinds of undesirable "anomalies" can arise from the interaction of concurrent transactions.
\dfn{Dirty Read}{A situation where one transaction reads data that has been modified by another transaction but has not yet been committed. If the first transaction subsequently aborts, the second transaction has based its work on data that "never existed."}
\dfn{Non-repeatable Read}{Occurs when a transaction reads the same data element twice but finds different values because another transaction modified and committed that element in the interim.}
\dfn{Phantom Read}{A phenomenon where a transaction runs a query to find a set of rows, but upon repeating the query, finds additional "phantom" rows that were inserted and committed by a concurrent transaction.}
A fourth phenomenon, the \textbf{serialization anomaly}, occurs when the result of a group of concurrent transactions is inconsistent with every possible serial ordering of those same transactions.
To manage these risks, SQL defines four \textbf{isolation levels} that allow developers to trade strictness for performance; lower levels permit higher concurrency at the risk of encountering some of these phenomena.
\begin{itemize}
\item \textbf{Read Uncommitted:} The most relaxed level; allows dirty reads.
\item \textbf{Read Committed:} Forbids dirty reads but allows non-repeatable reads.
\item \textbf{Repeatable Read:} Forbids dirty and non-repeatable reads but may allow phantoms.
\item \textbf{Serializable:} The strictest level; ensures the result is equivalent to some serial order.
\end{itemize}
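In practice, the level is chosen per transaction. A minimal sketch using the standard JDBC constants (the API itself is introduced in the next section; everything else here is illustrative):
\begin{verbatim}
import java.sql.Connection;
import java.sql.SQLException;

public class IsolationDemo {
    // run a unit of work under the strictest level; in plain SQL this
    // corresponds to SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
    static void runSerializable(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        // ... execute queries and updates here ...
        conn.commit();
    }
}
\end{verbatim}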
\section{Java Database Connectivity (JDBC)}
One of the most common ways to implement the application tier is through Java, using the JDBC call-level interface. JDBC allows a Java program to interact with virtually any SQL database through a standard set of classes and methods.
\dfn{JDBC}{A Java-based API that provides a standard library of classes for connecting to a database, executing SQL statements, and processing the results.}
The process begins by loading a driver for the specific DBMS, such as MySQL or PostgreSQL. Once the driver is loaded, a connection is established using a URL that identifies the database, along with credentials for authorization.
Different kinds of statements are used to interact with the data. A simple \texttt{Statement} is used for queries without parameters, while a \texttt{PreparedStatement} is used when a query must be executed repeatedly with different values; its parameters are denoted by question marks in the SQL string and are bound to concrete values before execution.
The result of a query is returned as a \texttt{ResultSet} object. This object acts like a cursor, allowing the program to iterate through the resulting tuples one at a time using the \texttt{next()} method. For each tuple, the programmer extracts data with getter methods, such as \texttt{getInt()} or \texttt{getString()}, based on the attribute's position in the result.
\thm{JDBC Interaction Pattern}{The standard flow of database access in Java: Load Driver $\rightarrow$ Establish Connection $\rightarrow$ Create Statement $\rightarrow$ Execute Query/Update $\rightarrow$ Process Results via ResultSet $\rightarrow$ Close Connection.}
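The pattern above, written out as a compact, hedged sketch (the table \texttt{Movies(title, year)}, the URL, and the credentials are illustrative assumptions):
\begin{verbatim}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class QueryDemo {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/coursedb"; // hypothetical
        try (Connection conn = DriverManager.getConnection(url, "alice", "secret");
             PreparedStatement ps =
                 conn.prepareStatement("SELECT title, year FROM Movies WHERE year > ?")) {
            ps.setInt(1, 2000);                  // bind the ? parameter
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {              // cursor over the result tuples
                    System.out.println(rs.getString(1) + " (" + rs.getInt(2) + ")");
                }
            }
        } // try-with-resources closes the statement and the connection
    }
}
\end{verbatim}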
This interface effectively solves the "impedance mismatch" between the set-oriented world of SQL and the object-oriented world of Java. By providing a mechanism to fetch rows individually, it allows Java's iterative control structures to process data retrieved from SQL's relational queries. It also supports the execution of updates, which encompass all non-query operations such as insertions, deletions, and schema modifications. This framework is essential for building the business logic required in the application tier of the three-tier architecture.
\nt{In practice, many developers rely on call-level interfaces such as JDBC for Java or PHP's PEAR DB library to handle the complexities of database connections and transaction boundaries programmatically.}
\section{Locking and Two-Phase Locking (2PL)}
The most common way for a database to enforce serializability is through the use of \textbf{locks}. Before a transaction can read or write a piece of data, it must obtain a lock on that element. Locks are managed via a \textbf{lock table} maintained by the scheduler.
\dfn{Shared and Exclusive Locks}{A Shared (S) lock is required for reading and allows multiple transactions to read the same element. An Exclusive (X) lock is required for writing and prevents any other transaction from accessing that element.}
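To make the compatibility rules tangible, here is a toy lock-table sketch (an illustrative addition: no queueing, no fairness, and no deadlock handling; all names are invented). Shared locks coexist with other shared locks, while an exclusive lock excludes everyone else:
\begin{verbatim}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ToyLockTable {
    private final Map<String, Set<Integer>> shared = new HashMap<>();  // element -> reader txns
    private final Map<String, Integer> exclusive = new HashMap<>();    // element -> writer txn

    // S lock: allowed unless another transaction holds X on the element
    synchronized boolean lockShared(int txn, String elem) {
        Integer writer = exclusive.get(elem);
        if (writer != null && writer != txn) return false;
        shared.computeIfAbsent(elem, e -> new HashSet<>()).add(txn);
        return true;
    }

    // X lock: allowed only if no *other* transaction holds any lock on the element
    synchronized boolean lockExclusive(int txn, String elem) {
        Integer writer = exclusive.get(elem);
        if (writer != null && writer != txn) return false;
        for (int reader : shared.getOrDefault(elem, Set.of()))
            if (reader != txn) return false;
        exclusive.put(elem, txn);
        return true;
    }

    synchronized void unlock(int txn, String elem) {
        Set<Integer> readers = shared.get(elem);
        if (readers != null) readers.remove(txn);
        exclusive.remove(elem, Integer.valueOf(txn)); // only if txn holds the X lock
    }
}
\end{verbatim}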
Simply using locks is not enough to guarantee serializability; the timing of when locks are released is critical. If a transaction releases a lock too early, another transaction might intervene and change the data, leading to a non-serializable schedule. To prevent this, systems use the \textbf{Two-Phase Locking (2PL)} protocol.
\thm{Two-Phase Locking (2PL)}{A protocol requiring that in every transaction, all locking actions must precede all unlocking actions. This creates two distinct phases: a "growing phase" where locks are acquired and a "shrinking phase" where they are released.}
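For example (an added illustration in standard notation, where $l$, $u$, $r$, $w$ denote lock, unlock, read, and write): the sequence $l_1(A)\, r_1(A)\, l_1(B)\, u_1(A)\, r_1(B)\, w_1(B)\, u_1(B)$ obeys 2PL, since every lock action precedes every unlock action; acquiring any further lock after $u_1(A)$ would violate the protocol.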
\nt{Strict Two-Phase Locking is a variation where a transaction does not release any exclusive locks until it has committed or aborted. This prevents other transactions from reading dirty data and avoids the need for cascading rollbacks.}
A potential downside of locking is the risk of a \textbf{deadlock}. This occurs when two or more transactions are stuck in a cycle, each waiting for a lock held by the next: for instance, $T_1$ holds a lock on $A$ and waits for $B$, while $T_2$ holds $B$ and waits for $A$. Schedulers must be able to detect these cycles, often using a \textbf{waits-for graph}, and resolve them by aborting one of the transactions.
In conclusion, the management of transactions requires a deep integration of architectural tiers, hierarchical environments, and rigorous concurrency control. By utilizing ACID properties, various isolation levels, and the 2PL protocol, database systems provide a robust platform where users can safely interact with data as if they were the sole occupants of the system.
To think of it another way, a transaction is like a single entry in a shared diary. Even if twenty people are writing in the same diary simultaneously, the system acts like a careful librarian, ensuring that each person's entry is written cleanly on its own line without anyone's ink smudging another's work.