\chapter{Views and Indices}

The management of information within a relational database system involves a balance between logical abstraction and physical performance. While base tables provide the primary storage for data, they are not always optimized for the specific ways in which users interact with information. To bridge this gap, database systems utilize two critical components: \textbf{Views} and \textbf{Indices}. Views allow designers to create virtual relations that simplify complex queries and provide security by filtering access to specific attributes or rows. Indices, on the other hand, focus on the physical layer, providing specialized data structures that accelerate the retrieval of tuples without requiring exhaustive table scans. Together, these tools ensure that a database is both easy to use for the developer and efficient for the machine. This chapter explores the declaration and querying of virtual views, the limitations of updating them, the mechanics of indexing, and the mathematical models used to determine when an index provides a genuine performance benefit.

\section{Virtual Views and Logical Abstraction}

A virtual view is a relation that does not exist as a separate entity on the physical disk but is instead defined by a query over one or more base tables. From the perspective of the user or the application layer, a view can be queried exactly like a stored table: it can appear in a \texttt{FROM} clause, be joined with other relations, and be used in subqueries.

\dfn{Virtual View}{A named relation defined by an expression or query that acts as a shortcut to data stored in other relations. It is persistent in definition but transient in representation, meaning its contents are recomputed or mapped back to base tables every time it is accessed.}

The primary advantage of using views is \textbf{productivity}. Instead of repeating a complex subquery multiple times across different parts of an application, a developer can define that subquery as a view and refer to it by name.

\thm{The Interpretation of View Queries}{When a query refers to a virtual view, the query processor logically replaces the view name with its underlying definition. This effectively turns the query into a larger expression that operates directly on the base tables, ensuring that the view always reflects the most current state of the database.}

\nt{To declare a view in SQL, write \texttt{CREATE VIEW} followed by the view name, the keyword \texttt{AS}, and a standard \texttt{SELECT-FROM-WHERE} block. To change the column names presented by the view, whether to make them more descriptive or to avoid name collisions, list the new attribute names in parentheses immediately after the view name.}
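
As a minimal sketch of this syntax, assume a hypothetical base table \texttt{Movies(title, year, length, studioName)}. The table, the view names, and the attribute names are illustrative only and do not come from any particular schema.

\begin{verbatim}
-- Hypothetical base table: Movies(title, year, length, studioName)
CREATE VIEW ParamountMovies AS
    SELECT title, year
    FROM Movies
    WHERE studioName = 'Paramount';

-- The same view with renamed columns
CREATE VIEW ParamountTitles(movieTitle, releaseYear) AS
    SELECT title, year
    FROM Movies
    WHERE studioName = 'Paramount';

-- A query against the view ...
SELECT title FROM ParamountMovies WHERE year = 1979;

-- ... is logically answered as if it were written on the base table
SELECT title FROM Movies
WHERE studioName = 'Paramount' AND year = 1979;
\end{verbatim}

The last two statements illustrate the substitution performed by the query processor: the view name is replaced by its definition, and the conditions of the view and of the query are combined.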

\section{Updatable Views and Modification Criteria}

While querying a view is straightforward, modifying one—through \texttt{INSERT}, \texttt{DELETE}, or \texttt{UPDATE}—presents a logical challenge. Since a view is virtual, any change must be translated into a corresponding change in the underlying base tables. SQL allows this only under specific, restrictive conditions to ensure that the translation is unambiguous.

\dfn{Updatable View}{A virtual view that the DBMS can modify by passing the changes through to the base relation. In standard SQL, this generally requires the view to be defined over a single relation and to include enough attributes so that a valid tuple can be formed in the underlying table.}

To be considered updatable without the help of triggers, a view typically must meet several criteria:
\begin{itemize}
	\item The \texttt{FROM} clause must contain exactly one relation.
	\item There can be no \texttt{DISTINCT} keyword, as this would make it impossible to determine which original tuple a change refers to.
	\item The \texttt{WHERE} clause cannot use the relation itself in a subquery.
	\item The \texttt{SELECT} list must include enough attributes to satisfy the \texttt{NOT NULL} and primary key constraints of the base table, or those omitted attributes must have default values.
\end{itemize}

\nt{If an insertion is made into a view and the view projects out the attribute used in the \texttt{WHERE} clause, the new tuple might disappear from the view immediately after being added. This is because the underlying table receives the tuple with a \texttt{NULL} or default value that may not satisfy the view’s selection criteria.}
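
The effect can be sketched as follows, assuming a hypothetical table \texttt{Movies(title, year, studioName)} in which \texttt{studioName} may be \texttt{NULL} and has no declared default. All names are illustrative.

\begin{verbatim}
-- The view selects on studioName but does not project it
CREATE VIEW ParamountMovies AS
    SELECT title, year
    FROM Movies
    WHERE studioName = 'Paramount';

-- Translated into an insertion on Movies with studioName = NULL
INSERT INTO ParamountMovies VALUES ('Star Trek', 1979);

-- NULL = 'Paramount' is not true, so the new tuple is not returned
SELECT * FROM ParamountMovies WHERE title = 'Star Trek';
\end{verbatim}

One way to avoid the surprise is to include \texttt{studioName} in the \texttt{SELECT} list of the view, so that an insertion can supply the value \texttt{'Paramount'} explicitly.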

\section{Instead-Of Triggers}

When a view is too complex to be automatically updatable (for instance, when it involves joins or aggregations), a database designer can use \textbf{Instead-Of Triggers}. These allow the programmer to explicitly define how a modification to a view should be handled by the system.

\thm{Instead-Of Trigger Principle}{An instead-of trigger intercepts a modification command intended for a view and executes a specified block of code in its place. This code usually involves custom logic to distribute the modification across multiple base tables or to calculate missing values.}

\nt{By using the \texttt{REFERENCING NEW ROW AS} clause, the trigger can access the values the user attempted to insert into the view and use them as parameters for updates to the actual stored tables.}
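
A minimal sketch in standard SQL trigger syntax is given below. It assumes a hypothetical view \texttt{ParamountMovies}, defined as the \texttt{title} and \texttt{year} of those tuples in a table \texttt{Movies(title, year, studioName)} whose studio is \texttt{'Paramount'}. The names are illustrative, and the exact trigger syntax differs between database systems.

\begin{verbatim}
-- Assumed view: ParamountMovies(title, year) over Movies
CREATE TRIGGER ParamountInsert
INSTEAD OF INSERT ON ParamountMovies
REFERENCING NEW ROW AS NewRow
FOR EACH ROW
INSERT INTO Movies(title, year, studioName)
VALUES (NewRow.title, NewRow.year, 'Paramount');
\end{verbatim}

Here the trigger supplies the studio name that the view does not expose, so a tuple inserted through the view satisfies the selection condition of the view and remains visible in it.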

\section{Physical Storage and the Motivation for Indices}

In a database without indices, the only way to find a specific record is through a \textbf{Full Scan}. This requires the system to read every block of the relation from the disk and check every tuple against the search condition. While this is feasible for small tables, it becomes a massive bottleneck as the data grows into millions or billions of rows.

\dfn{Index}{A supplementary data structure that associates specific values of one or more attributes (the search key) with pointers to the physical locations of the records containing those values. Its purpose is to allow the system to bypass irrelevant data and jump directly to the desired blocks.}

\thm{The I/O Model of Computation}{The cost of a database operation is primarily determined by the number of disk I/O actions it requires. Because moving data from the disk to main memory is orders of magnitude slower than CPU operations, the efficiency of a physical plan is measured by how many blocks must be read or written.}

\nt{It is important to distinguish between the \textbf{search key} of an index and the \textbf{primary key} of a relation. An index can be built on any attribute or set of attributes, regardless of whether they are unique.}
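
The statements below sketch how indices are typically declared. \texttt{CREATE INDEX} is not part of the SQL standard, but most systems accept a form close to this one. The table \texttt{Movies} and the index names are hypothetical, and the syntax for dropping an index varies between systems.

\begin{verbatim}
-- Search key is a single, non-unique attribute
CREATE INDEX YearIndex ON Movies(year);

-- Search key is a pair of attributes
CREATE INDEX KeyIndex ON Movies(title, year);

-- Removing an index leaves the tuples of Movies untouched
DROP INDEX YearIndex;
\end{verbatim}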

\section{Clustered and Non-Clustered Indices}

The relationship between the order of an index and the physical arrangement of tuples on the disk significantly impacts performance.

\dfn{Clustering Index}{An index where the physical order of the records in the data file matches or closely follows the order of the keys in the index. This ensures that all tuples with the same or similar key values are packed into the minimum possible number of blocks.}

\thm{Clustering Efficiency}{A clustering index is much more efficient for range queries than a non-clustering index. In a clustered scenario, the system can retrieve a range of values by reading consecutive blocks. In a non-clustered scenario, every matching tuple might reside on a different block, potentially requiring one disk I/O per tuple.}

\nt{A relation can only have one clustering index because the data can only be physically sorted in one way. However, it can have multiple non-clustering (secondary) indices.}
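
To make the difference concrete, consider a sketch with made-up numbers: a relation $R$ with $T(R) = 100{,}000$ tuples stored 100 per block, so that $B(R) = 1000$ blocks, and a range query that matches 1000 tuples. Reads of the index blocks themselves are ignored for simplicity. Writing $C_{\mathrm{cl}}$ and $C_{\mathrm{ncl}}$ for the number of data blocks read through a clustering and a non-clustering index respectively,
\[
C_{\mathrm{cl}} \approx \frac{1000}{100} = 10,
\qquad
C_{\mathrm{ncl}} \le 1000 .
\]
In the worst case the non-clustering index costs as much as scanning all $B(R) = 1000$ blocks, so under these assumptions it offers no benefit for this query, whereas the clustering index touches about 1\% of the file.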

\section{The Mechanics of B-Trees}

The most prevalent index structure in modern database systems is the \textbf{B-Tree}, specifically the B+ Tree variant. This structure is a balanced tree that automatically scales with the size of the data.

\dfn{B-Tree}{A balanced tree structure where every path from the root to a leaf is of equal length. Each node corresponds to a single disk block and contains a sorted list of keys and pointers to either child nodes or data records.}

B-Trees are characterized by a parameter $n$, which defines the maximum number of keys a block can hold. The rules for a B-Tree include:
\begin{itemize}
	\item \textbf{Internal Nodes:} Must have between $\lceil (n+1)/2 \rceil$ and $n+1$ children, except for the root, which can have as few as two.
	\item \textbf{Leaves:} Hold the actual search keys and pointers to the records. They also include a pointer to the next leaf in sequence to facilitate range scans.
\end{itemize}
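
As an illustration of how $n$ is determined, suppose (purely as an assumption) that a block holds 4096 bytes, a key occupies 4 bytes, and a pointer occupies 8 bytes. A node stores $n$ keys and $n+1$ pointers, so $n$ is the largest integer satisfying
\[
4n + 8(n+1) \le 4096
\quad\Longrightarrow\quad
12n \le 4088
\quad\Longrightarrow\quad
n = 340 .
\]
Under these assumptions an internal node has between $\lceil 341/2 \rceil = 171$ and $341$ children.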

\thm{Logarithmic Search Complexity}{In a B-Tree, the number of steps required to find any specific key is proportional to the height of the tree. For a tree with $N$ records and a fan-out of $f$, the height is approximately $\log_f N$. Because $f$ is typically large (often hundreds of keys per block), even a billion records can be searched in just three or four disk accesses.}
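
As a worked instance of this estimate, assume a fan-out of $f = 300$ and $N = 10^9$ records (both values are illustrative):
\[
\log_{300} 10^9 = \frac{9}{\log_{10} 300} \approx \frac{9}{2.48} \approx 3.6,
\qquad
300^3 = 2.7 \times 10^7 < 10^9 < 8.1 \times 10^9 = 300^4 .
\]
Four levels therefore suffice. If the root and the upper levels are kept in main memory, which is plausible because they occupy few blocks, the number of actual disk accesses per lookup is smaller still.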

\nt{B-Trees are dynamic; they grow by splitting nodes when they become too full and shrink by merging nodes when deletions leave them under-populated. This ensures that every block other than the root remains at least half-full, which keeps disk usage efficient.}

\section{Hash Indices and Constant Time Lookups}

As an alternative to tree-structured indices, databases may use \textbf{Hash Indices}. These rely on a hash function, which can be pictured as a ``smoothie machine'': a deterministic function that turns any input into a seemingly random integer.

\dfn{Hash Index}{A structure that uses a hash function to map search keys into specific buckets. Each bucket corresponds to one or more disk blocks holding pointers to the relevant records.}

\thm{Constant Complexity}{A hash index provides expected $O(1)$ lookup time for equality queries. As long as the buckets are kept from overflowing, the time to locate a specific key stays roughly constant no matter how large the table becomes, because a lookup requires only the computation of the hash and a direct jump to the indicated bucket.}

\nt{The major limitation of hash indices is their lack of support for range queries. Because the hash function randomizes the placement of keys, two keys that are close in value (e.g., 19 and 20) will likely end up in completely different parts of the index.}
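
A small worked example shows the scattering. Assume a hypothetical hash function $h(k) = (7k + 3) \bmod 8$ over 8 buckets; the function is chosen only for illustration. Then
\[
h(19) = 136 \bmod 8 = 0,
\qquad
h(20) = 143 \bmod 8 = 7 .
\]
The adjacent keys 19 and 20 land in buckets 0 and 7, so a query such as $19 \le k \le 20$ must probe each candidate key separately. For a wide range, or a range over non-integer values, this is impractical, and the system falls back to a scan or to a tree-structured index.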

\section{Index Selection and Cost Modeling}

Creating an index is not a ``free'' performance boost. Every index added to a relation imposes costs that must be weighed against its benefits. The decision process involves analyzing a query workload and estimating the average disk I/O.

\dfn{Index Creation Costs}{The costs associated with indices include the initial CPU and I/O time to build the structure, the additional disk space required to store it, and the ``write penalty'': every \texttt{INSERT}, \texttt{DELETE}, or \texttt{UPDATE} to the base table must also update every associated index.}

\thm{The Selection Formula}{If $p$ is the probability that a query uses a certain attribute and $1-p$ is the probability of an update, an index on that attribute is beneficial only if the time saved during the queries outweighs the extra time spent on updates. This can be expressed as a linear combination of costs based on the parameters $B(R)$ (blocks) and $T(R)$ (tuples).}
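
One way to write this comparison out is sketched here. The symbols $q_0, q_1, u_0, u_1$ are introduced only for this example: $q_0$ and $q_1$ are the I/O costs of the query without and with the index, and $u_0$ and $u_1$ are the I/O costs of the update without and with the index. The index pays off when its average cost is lower:
\[
p\,q_1 + (1-p)\,u_1 < p\,q_0 + (1-p)\,u_0
\quad\Longleftrightarrow\quad
p\,(q_0 - q_1) > (1-p)\,(u_1 - u_0) .
\]
With assumed values $q_0 = B(R) = 1000$ (full scan), $q_1 = 1 + 3 = 4$ (one index block plus three data blocks for three matching tuples through a non-clustering index), $u_0 = 2$ (read and write one data block for an insertion), and $u_1 = 4$ (additionally read and write one index block), the condition becomes
\[
996\,p > 2\,(1-p)
\quad\Longleftrightarrow\quad
p > \frac{2}{998} \approx 0.002 .
\]
Under these assumptions the index is worthwhile unless updates outnumber queries by roughly 500 to 1.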

\nt{In practice, many systems use an ``automatic tuning advisor'' that applies a greedy algorithm to suggest the best set of indices for a specific historical workload.}

\section{Indices and Complex Queries}

Indices are particularly powerful when dealing with joins or multiple selection criteria.

\thm{The Index-Join Strategy}{In a join $R \bowtie S$, if $S$ has an index on the join attribute, the system can iterate through $R$ and for each tuple, use the index on $S$ to find matching records. This is significantly faster than a nested-loop join if $R$ is small and the index on $S$ is efficient.}
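
A rough cost comparison, with made-up numbers, illustrates the claim. Assume $B(R) = 10$, $T(R) = 100$, $B(S) = 10{,}000$, and an average of $c = 3$ disk I/Os to probe the index on $S$ and fetch the matching tuples for one tuple of $R$; the symbol $c$ is introduced only for this sketch. The index join costs about
\[
B(R) + T(R) \cdot c = 10 + 100 \cdot 3 = 310
\]
disk I/Os, whereas any strategy without the index must read $S$ at least once, for a cost of at least
\[
B(R) + B(S) = 10 + 10{,}000 = 10{,}010 .
\]
The advantage shrinks as $R$ grows: once $T(R) \cdot c$ exceeds $B(S)$, a plan that scans $S$ can become cheaper than probing the index once per tuple.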

\nt{When multiple indices are available for a single query, a technique called ``pointer intersection'' can be used. The system retrieves lists of pointers from several indices, intersects them in main memory, and only then reads the data blocks for the tuples that satisfy all conditions.}
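
As a small illustration, suppose a query has one condition on an attribute \texttt{year} and one on an attribute \texttt{studioName}, each with its own index. The attribute names and the pointer values are hypothetical. If the two indices return the pointer lists
\[
P_{\mathrm{year}} = \{p_3, p_7, p_{12}, p_{20}\},
\qquad
P_{\mathrm{studio}} = \{p_7, p_9, p_{20}, p_{31}\},
\]
then
\[
P_{\mathrm{year}} \cap P_{\mathrm{studio}} = \{p_7, p_{20}\},
\]
and only two tuples have to be fetched from disk, instead of the four that either index alone would require.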

\section{Information Retrieval and Inverted Indices}

A specialized form of indexing used for documents is the \textbf{Inverted Index}. This is the technology that powers web search engines and large-scale document repositories.

\dfn{Inverted Index}{A mapping from words (keywords) to the list of documents in which those words appear. Often, these lists include metadata such as the position of the word in the document or whether it appeared in a title or anchor tag.}

\nt{To optimize inverted indices, systems often use ``stemming'' (reducing words to their root form) and ``stop words'' (ignoring common words like ``the'' or ``and'' that do not help distinguish documents).}
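
A toy example, with three invented documents, shows the resulting structure. Let $D_1$ be ``the cat chased the dog'', $D_2$ be ``dogs and cats'', and $D_3$ be ``the dog barked''. After dropping the stop words ``the'' and ``and'' and stemming the remaining words, the inverted index could look as follows.

\begin{center}
\begin{tabular}{ll}
Word (stemmed) & Documents \\
\hline
bark & $D_3$ \\
cat & $D_1$, $D_2$ \\
chase & $D_1$ \\
dog & $D_1$, $D_2$, $D_3$ \\
\end{tabular}
\end{center}

A search for documents containing both ``cat'' and ``dog'' intersects the two lists, $\{D_1, D_2\} \cap \{D_1, D_2, D_3\} = \{D_1, D_2\}$, without reading any document text.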

\section{Summary of Design Principles}

The theory of views and indices suggests that database design is as much about managing the physical medium as it is about logical modeling.

\thm{The Table Universe and Consistency}{While views provide a filtered perspective of the data, they must remain consistent with the ``table universe'', the set of all possible valid states of the database. Constraints must hold in every one of these states, and indices must stay synchronized with the base tables, so that the logical integrity of the system is preserved however the data changes.}

\nt{Ultimately, the goal of indices is to turn linear or quadratic problems into logarithmic or constant ones. By carefully selecting which attributes to index based on the parameters $B(R)$ (blocks), $T(R)$ (tuples), and $V(R,a)$ (the number of distinct values of attribute $a$), a designer can create a system that remains responsive even under the weight of massive datasets.}

In summary, views provide the necessary abstraction to keep application code clean and secure, while indices provide the surgical precision required to extract data from high-volume storage. A database is essentially a large library; a view is a specific bookshelf curated for a student, while an index is the card catalog that allows a librarian to find one specific page in a million volumes without having to read every book in the building.