\chapter{Database Architecture}

Database architecture is the structural foundation that bridges high-level data models with the physical realities of computer hardware. It spans a range of components, from the multi-layered memory hierarchy that stores bits and bytes to logical abstractions such as virtual views that provide tailored interfaces for users. A central goal of this architecture is data independence: the underlying storage methods can change without affecting how application programs interact with the data. This summary explores the mechanics of physical storage, the implementation of virtual relations, and the optimization techniques involving index structures that ensure efficient data retrieval and system resilience.

\section{The Storage and Memory Hierarchy}

Behind the daily operations of a database system lies a hardware hierarchy designed to manage the trade-off between access speed and storage capacity. At the fastest and most expensive level is the processor's internal cache (levels 1 and 2), providing access to data in nanoseconds. Below this is main memory (RAM), which acts as the primary workspace for the database management system (DBMS). While RAM provides significant capacity, it is volatile: its contents are lost if power is interrupted. This volatility is a critical consideration for the ACID properties of transactions, specifically durability.

Secondary storage, primarily magnetic disks, serves as the non-volatile repository where data persists. Accessing a disk involves mechanical movement, introducing latencies in the millisecond range, which are orders of magnitude slower than RAM. For massive data sets that exceed disk capacity, tertiary storage such as tapes or DVDs is used, offering enormous capacity (terabytes to petabytes) at the cost of much longer access times.

\dfn{Memory Hierarchy}{A systematic organization of storage devices in a computer system, ranked by speed, capacity, and cost per bit, typically including cache, main memory, and secondary/tertiary storage.}

\thm{The Dominance of I/O}{The performance of a database system is largely determined by the number of disk I/O operations performed, as the time required to access a block on disk is significantly greater than the time needed to process data in main memory.}

\section{Physical Data Representation and PGDATA}

The physical storage of a database is typically organized as a directory structure on the host machine, often referred to as the data directory or, in PostgreSQL, PGDATA. This directory contains the actual files representing tables, configuration parameters, and transaction logs. To manage this data effectively, the DBMS divides information into blocks, or pages, which are the fundamental units of transfer between disk and memory; in systems like PostgreSQL these pages are normally 8~KB.

Data within these pages is organized into records. Fixed-length records are straightforward to manage, but variable-length fields require more elaborate structures, such as offset tables in the block header. When fields become exceptionally large, for example multi-gigabyte video files or documents, techniques like TOAST (The Oversized-Attribute Storage Technique) store these values in separate chunks, preventing them from bloating the primary data pages.
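To make this concrete, the following minimal sketch shows how the page size and the TOAST machinery can be observed in a PostgreSQL installation; the \texttt{film\_reviews} table and its columns are hypothetical, invented for this illustration.

\begin{verbatim}
-- Report the page size used by this PostgreSQL cluster
-- (a build-time constant, typically 8192 bytes).
SHOW block_size;

-- A table with a potentially very large text column; values that
-- do not fit comfortably in one page are moved out of line by TOAST.
CREATE TABLE film_reviews (
    id    serial PRIMARY KEY,
    title varchar(255),
    body  text
);

-- The associated TOAST table is visible in the pg_class catalog.
SELECT relname, reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'film_reviews';
\end{verbatim}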
\section{Virtual Views and Data Abstraction}

Virtual views are relations that do not exist physically in the database but are instead defined by a stored query over base tables. Views provide a layer of abstraction, letting different users see the data in formats that suit their specific needs without duplicating the underlying information. When a user queries a view, the query processor replaces the view name with its definition, transforming the query into one that operates directly on the stored base tables.

\dfn{Virtual View}{A relation that is not stored in the database but is defined by an expression that constructs it from other relations whenever it is needed.}

Attributes within a view can be renamed for clarity using the AS keyword or by listing names in the view declaration. This allows the architect to present a clean logical model to the end user while hiding the complexity of the underlying join operations or attribute names.

\section{Modification of Virtual Relations}

Modifying a view is inherently more complex than modifying a base table, because the system must determine how to map the change onto the underlying physical data. SQL permits ``updatable views'' under specific conditions: the view must be defined over a single relation, it cannot use aggregation or duplicate elimination, and the selection criteria must be simple enough that the system can unambiguously identify which base tuples are affected.

\dfn{Updatable View}{A virtual view that is sufficiently simple to allow insertions, deletions, and updates to be translated directly into equivalent modifications on the underlying base relation.}

For more complex views that involve multiple tables or aggregation, ``instead-of'' triggers provide a solution. Such a trigger intercepts a modification attempt on a view and executes custom logic, written by the database designer, that updates the base tables appropriately. Both mechanisms are sketched in the examples that follow.
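As an illustration of an updatable view, the following sketch assumes a hypothetical base relation Movies(title, year, length, studioName); the names are invented for this example.

\begin{verbatim}
-- A view over a single relation, with no aggregation or
-- duplicate elimination: simple enough to be updatable.
CREATE VIEW ParamountMovies AS
    SELECT title, year
    FROM Movies
    WHERE studioName = 'Paramount';

-- This insertion is translated into an insertion on Movies.
-- Attributes not exposed by the view (length, studioName) are
-- set to NULL, so the new tuple is not visible through the view
-- itself; WITH CHECK OPTION can be used to reject such inserts.
INSERT INTO ParamountMovies VALUES ('Star Trek', 1979);
\end{verbatim}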
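For a view that is not updatable, an instead-of trigger supplies the translation by hand. The sketch below is written in the SQL-standard trigger style and assumes a second hypothetical relation Studios(name, address); exact syntax varies by system (PostgreSQL, for instance, requires the trigger to invoke a separately defined function).

\begin{verbatim}
-- A join view over two base tables; too complex to be updatable.
CREATE VIEW MovieStudios AS
    SELECT m.title, m.year, s.name AS studio
    FROM Movies m, Studios s
    WHERE m.studioName = s.name;

-- The trigger intercepts insertions on the view and maps them
-- to an insertion on the Movies base table.
CREATE TRIGGER MovieStudioInsert
INSTEAD OF INSERT ON MovieStudios
REFERENCING NEW ROW AS NewRow
FOR EACH ROW
INSERT INTO Movies(title, year, studioName)
VALUES (NewRow.title, NewRow.year, NewRow.studio);
\end{verbatim}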
\section{Index Structures and Motivation}

As relations grow, scanning every block to find specific information becomes prohibitively slow. Indexes are specialized data structures designed to accelerate this search: an index takes the value of a particular attribute, known as the search key, and provides pointers directly to the records containing that value.

\dfn{Index}{A stored data structure that facilitates the efficient retrieval of records in a relation based on the values of one or more attributes.}

The primary motivation for indexing is the reduction of disk I/O. For example, finding a specific movie in a massive relation is much faster if the system can use an index on the title rather than performing a full table scan. In joins, indexes on the join attributes let the system look up only the relevant matching tuples, avoiding the exhaustive pairing of every row from both relations.

\section{Strategic Selection of Indexes}

While indexes speed up queries, they impose a cost: every time a record is inserted, deleted, or updated, the associated indexes must also be modified. This creates a strategic trade-off for the database architect. A clustering index, where the physical order of records matches the index order, is exceptionally efficient for range queries because it minimizes the number of blocks that must be read. A non-clustering index is useful for locating individual records but may require many disk accesses when many rows match the search key, since those records may be scattered across different blocks.

\thm{Index Cost-Benefit Analysis}{The decision to create an index depends on the ratio of queries to modifications; an index is beneficial if the time saved during data retrieval exceeds the additional time required to maintain the index during updates.}

Architects often use a cost model based on the number of disk I/Os to evaluate the utility of a proposed index. The model considers the number of tuples $T$, the number of blocks $B$, and the number of distinct values $V$ of an attribute. For instance, looking up a single value of an attribute through a clustering index touches roughly $B/V$ of the relation's blocks, whereas a full scan must read all $B$ blocks; that saving must then be weighed against the extra I/O the index adds to every modification.

\section{Historical Foundations of Relational Theory}

The modern concepts of database architecture and data independence trace back to the seminal work of Edgar Codd in 1970. Codd's relational model revolutionized the field by proposing that data be viewed as sets of tuples in tables, independent of their physical storage. This shift enabled the development of high-level query languages like SQL and the sophisticated optimization techniques that define current database management systems. Subsequent research into integrity checking and semistructured data models such as XML continues to build on these relational foundations.

\dfn{Data Independence}{The principle that the logical structure of data (the schema) should be separated from its physical storage, ensuring that changes to the storage method do not require changes to application programs.}