Notes V.2.0.0

Rewrote Notes
2026-01-07 13:51:33 +01:00
parent c1878069fd
commit bcd2ddfe42
13 changed files with 787 additions and 623 deletions

@@ -1,63 +1,121 @@
\chapter{Views and Indices}
This chapter explores the conceptual and physical layers of database management, focusing on the mechanisms that allow users to interact with data flexibly while ensuring that the underlying hardware performs at its peak. The discussion is divided into two primary concepts: views and indexes.
The management of information within a relational database system involves a balance between logical abstraction and physical performance. While base tables provide the primary storage for data, they are not always optimized for the specific ways in which users interact with information. To bridge this gap, database systems utilize two critical components: \textbf{Views} and \textbf{Indices}. Views allow designers to create virtual relations that simplify complex queries and provide security by filtering access to specific attributes or rows. Indices, on the other hand, focus on the physical layer, providing specialized data structures that accelerate the retrieval of tuples without requiring exhaustive table scans. Together, these tools ensure that a database is both easy to use for the developer and efficient for the machine. This chapter explores the declaration and querying of virtual views, the limitations of updating them, the mechanics of indexing, and the mathematical models used to determine when an index provides a genuine performance benefit.
Virtual views represent a method of logical abstraction. They allow a database designer to present users with data organized in a way that is most convenient for their specific tasks, without necessarily altering the structure of the base tables where the information is physically stored. These virtual relations are computed on demand and provide a layer of data independence, protecting applications from changes in the underlying schema and offering a simplified interface for complex queries.
\section{Virtual Views and Logical Abstraction}
On the physical side, indexes are specialized data structures used to circumvent the high cost of exhaustive table scans. By providing direct paths to specific tuples based on the values of search keys, indexes significantly reduce the number of disk accesses required for lookups and joins. However, the creation of an index is not a cost-free operation. It involves a fundamental trade-off between the acceleration of read operations and the increased overhead associated with insertions, deletions, and updates. This summary evaluates the criteria for view updatability, the mechanics of index implementation, and the rigorous cost models used to determine the optimal configuration of physical storage.
A virtual view is a relation that does not exist as a separate entity on the physical disk but is instead defined by a query over one or more base tables. From the perspective of the user or the application layer, a view is indistinguishable from a standard table; it can be queried, joined with other relations, and used in subqueries.
\section{Virtual Views in a Relational Environment}
\dfn{Virtual View}{A named relation defined by an expression or query that acts as a shortcut to data stored in other relations. It is persistent in definition but transient in representation, meaning its contents are recomputed or mapped back to base tables every time it is accessed.}
In a standard database, relations created through table declarations are considered persistent or "base" tables. These structures are physically stored on disk and remain unchanged unless modified by specific commands. In contrast, a virtual view is a relation defined by a SQL expression, typically a query. It does not exist in storage as a set of tuples; instead, its content is dynamically generated whenever it is referenced.
The primary advantage of using views is \textbf{productivity}. Instead of repeating a complex subquery multiple times across different parts of an application, a developer can define that subquery as a view and refer to it by name.
\dfn{Virtual View}{A named virtual relation defined by a query over one or more existing base tables or other views, which is not physically materialized in the database.}
\thm{The Interpretation of View Queries}{When a query refers to a virtual view, the query processor logically replaces the view name with its underlying definition. This effectively turns the query into a larger expression that operates directly on the base tables, ensuring that the view always reflects the most current state of the database.}
\thm{View Expansion}{The query processing mechanism whereby the name of a view in a SQL query is replaced by the query expression that defines it, allowing the system to optimize the operation as if it were performed directly on the base tables.}
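As a minimal sketch of view expansion in relational algebra, assume a hypothetical base relation $\mathit{Movies}(\mathit{title}, \mathit{year}, \mathit{length}, \mathit{studioName})$ and a hypothetical view $\mathit{ParamountMovies}$; both names are illustrative only. Suppose the view is defined as
\[ \mathit{ParamountMovies} := \pi_{\mathit{title},\,\mathit{year}}\bigl(\sigma_{\mathit{studioName} = \mbox{`Paramount'}}(\mathit{Movies})\bigr). \]
A query $\sigma_{\mathit{year} = 1979}(\mathit{ParamountMovies})$ is then expanded by substituting the definition for the view name:
\[ \sigma_{\mathit{year} = 1979}\Bigl(\pi_{\mathit{title},\,\mathit{year}}\bigl(\sigma_{\mathit{studioName} = \mbox{`Paramount'}}(\mathit{Movies})\bigr)\Bigr). \]
Because $\mathit{year}$ survives the projection, an optimizer is free to merge the two selections and evaluate the equivalent expression
\[ \pi_{\mathit{title},\,\mathit{year}}\bigl(\sigma_{\mathit{studioName} = \mbox{`Paramount'} \,\wedge\, \mathit{year} = 1979}(\mathit{Movies})\bigr) \]
directly on the base table.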
\nt{To declare a view in SQL, the \texttt{CREATE VIEW} statement is used followed by the keyword \texttt{AS} and a standard \texttt{SELECT-FROM-WHERE} block. If a developer wishes to change the column names presented by the view to be more descriptive or to avoid name collisions, they can list the new attribute names in parentheses immediately following the view name.}
When a view is defined, the system stores only its definition. From the perspective of a user, the view is indistinguishable from a base table. It possesses a schema and can be the target of queries. Furthermore, attributes in a view can be renamed during declaration to provide clearer identifiers for the end-user. This is particularly useful when the underlying table uses technical or ambiguous column names. For instance, a view might extract movie titles and production years from a comprehensive database to present a simplified list of films belonging to a specific studio.
\section{Updatable Views and Modification Criteria}
\section{Modification and Update Logic for Views}
While querying a view is straightforward, modifying one—through \texttt{INSERT}, \texttt{DELETE}, or \texttt{UPDATE}—presents a logical challenge. Since a view is virtual, any change must be translated into a corresponding change in the underlying base tables. SQL allows this only under specific, restrictive conditions to ensure that the translation is unambiguous.
While querying a view is straightforward, modifying one—through insertions, updates, or deletions—is conceptually complex because the view contains no physical tuples. For a modification to be successful, the database management system must be able to translate the request into an equivalent sequence of operations on the underlying base tables.
\dfn{Updatable View}{A virtual view that the DBMS can modify by passing the changes through to the base relation. In standard SQL, this generally requires the view to be defined over a single relation and to include enough attributes so that a valid tuple can be formed in the underlying table.}
\dfn{Updatable View}{A virtual view that is sufficiently simple for the system to automatically map modifications back to the original base relations without ambiguity.}
To be considered updatable without the help of triggers, a view typically must meet several criteria:
\begin{itemize}
\item The \texttt{FROM} clause must contain exactly one relation.
\item There can be no \texttt{DISTINCT} keyword, as this would make it impossible to determine which original tuple a change refers to.
\item The \texttt{WHERE} clause cannot use the relation itself in a subquery.
\item The \texttt{SELECT} list must include enough attributes to satisfy the \texttt{NOT NULL} and primary key constraints of the base table, or those omitted attributes must have default values.
\end{itemize}
\thm{Criteria for Updatability}{To be updatable, a view must generally be defined by a simple selection and projection from a single relation. It cannot involve duplicate elimination, aggregations, or group-by clauses, and it must include all attributes necessary to form a valid tuple in the base relation.}
\nt{If an insertion is made into a view and the view projects out the attribute used in the \texttt{WHERE} clause, the new tuple might disappear from the view immediately after being added. This is because the underlying table receives the tuple with a \texttt{NULL} or default value that may not satisfy the view's selection criteria.}
If a view is defined over multiple relations, such as through a join, it is typically not updatable because the logic for handling the change is not unique. For example, if a tuple is deleted from a view joining movies and producers, it is unclear whether the system should delete the movie, the producer, or both. To overcome these limitations, SQL provides "instead-of" triggers. These allow the designer to intercept a modification attempt on a view and define a custom set of actions to be performed on the base tables instead. This ensures that the intended semantics of the operation are preserved regardless of the complexity of the view's definition.
\section{Instead-Of Triggers}
\section{Physical Indexes and Retrieval Performance}
When a view is too complex to be automatically updatable (for instance, when it involves joins or aggregations), a database designer can use \textbf{Instead-Of Triggers}. These allow the programmer to explicitly define how a modification to a view should be handled by the system.
The efficiency of data retrieval is largely determined by the number of disk blocks the system must access. Without an index, the database must perform a full scan of a relation to find specific tuples. For large relations spanning thousands of blocks, this process is prohibitively slow. An index is a physical structure that maps values of a search key to the physical locations of the tuples containing those values.
\thm{Instead-Of Trigger Principle}{An instead-of trigger intercepts a modification command intended for a view and executes a specified block of code in its place. This code usually involves custom logic to distribute the modification across multiple base tables or to calculate missing values.}
\dfn{Index}{A physical data structure designed to accelerate the location of tuples within a relation based on specified attribute values, bypassing the need for an exhaustive scan of all blocks.}
\nt{By using the \texttt{REFERENCING NEW ROW AS} clause, the trigger can access the values the user attempted to insert into the view and use them as parameters for updates to the actual stored tables.}
\dfn{Multi-attribute Index}{An index built on a combination of two or more attributes, allowing the system to efficiently find tuples when values for all or a prefix of those attributes are provided in a query.}
\section{Physical Storage and the Motivation for Indices}
Indexes are most commonly implemented as B-trees or hash tables. A B-tree is a balanced structure where every path from the root to a leaf is of equal length, ensuring predictable performance for both point lookups and range queries. In most modern systems, the B+ tree variant is used, where pointers to the actual data records are stored only at the leaf nodes. This structure allows the system to navigate through the index by comparing search keys, moving from a root block down to the appropriate leaf with minimal disk I/O.
In a database without indices, the only way to find a specific record is through a \textbf{Full Scan}. This requires the system to read every block of the relation from the disk and check every tuple against the search condition. While this is feasible for small tables, it becomes a massive bottleneck as the data grows into millions or billions of rows.
\section{Selection and Performance Analysis of Indexes}
\dfn{Index}{A supplementary data structure that associates specific values of one or more attributes (the search key) with pointers to the physical locations of the records containing those values. Its purpose is to allow the system to bypass irrelevant data and jump directly to the desired blocks.}
The decision of whether to build an index on a particular attribute requires a careful analysis of the expected workload. While an index speeds up queries, every modification to the underlying relation requires a corresponding update to the index. This secondary update involves reading and writing index blocks, which can double the cost of insertions and deletions.
\thm{The I/O Model of Computation}{The cost of a database operation is primarily determined by the number of disk I/O actions it requires. Because moving data from the disk to main memory is orders of magnitude slower than CPU operations, the efficiency of a physical plan is measured by how many blocks must be read or written.}
\dfn{Clustering Index}{An index where the physical order of the tuples on disk corresponds to the order of the index entries, ensuring that all tuples with a specific key value are stored on as few blocks as possible.}
\nt{It is important to distinguish between the \textbf{search key} of an index and the \textbf{primary key} of a relation. An index can be built on any attribute or set of attributes, regardless of whether they are unique.}
\thm{The Index Selection Trade-off}{The process of evaluating whether the time saved during the execution of frequent queries outweighs the time lost during the maintenance of the index for insertions, updates, and deletions.}
\section{Clustered and Non-Clustered Indices}
To make this determination, database administrators use a cost model centered on disk I/O. If a relation is clustered on an attribute, the cost of retrieving all tuples with a specific value is approximately the number of blocks occupied by the relation divided by the number of distinct values of that attribute. If the index is non-clustering, each retrieved tuple may potentially reside on a different block, leading to a much higher retrieval cost. A tuning advisor or administrator will calculate the average cost of all anticipated operations (queries and updates) to decide which set of indexes minimizes the total weighted cost for the system.
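The two retrieval estimates can be illustrated with a small worked example. The figures below are assumptions chosen for the arithmetic: a relation $R$ with $B(R) = 1000$ blocks, $T(R) = 20000$ tuples, and $V(R,a) = 100$ distinct values of the indexed attribute $a$, with values spread uniformly and the cost of reading the index blocks themselves ignored.
\[ \mbox{clustering index:} \quad \frac{B(R)}{V(R,a)} = \frac{1000}{100} = 10 \mbox{ block reads} \]
\[ \mbox{non-clustering index:} \quad \frac{T(R)}{V(R,a)} = \frac{20000}{100} = 200 \mbox{ block reads (worst case, one per tuple)} \]
Under these assumptions both are cheaper than the full scan of $B(R) = 1000$ blocks, but the non-clustering index is roughly twenty times more expensive than the clustering one.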
The relationship between the order of an index and the physical arrangement of tuples on the disk significantly impacts performance.
\section{Materialized Views and Automated Tuning}
\dfn{Clustering Index}{An index where the physical order of the records in the data file matches or closely follows the order of the keys in the index. This ensures that all tuples with the same or similar key values are packed into the minimum possible number of blocks.}
Beyond virtual views and physical indexes, database systems often employ materialized views. Unlike a virtual view, a materialized view is physically computed and stored on disk. This approach is beneficial for high-complexity queries that are executed frequently, such as those involving expensive joins or aggregations in a data warehousing environment.
\thm{Clustering Efficiency}{A clustering index is much more efficient for range queries than a non-clustering index. In a clustered scenario, the system can retrieve a range of values by reading consecutive blocks. In a non-clustered scenario, every matching tuple might reside on a different block, potentially requiring one disk I/O per tuple.}
\dfn{Materialized View}{A view whose query result is physically stored in the database, requiring an explicit maintenance strategy to synchronize its content with changes in the base tables.}
\nt{A relation can only have one clustering index because the data can only be physically sorted in one way. However, it can have multiple non-clustering (secondary) indices.}
The use of materialized views introduces a maintenance cost similar to that of indexes. Every time a base table changes, the materialized view must be updated, either immediately or on a periodic schedule. Because the number of possible views is virtually infinite, modern systems use automated tuning advisors. These tools analyze a query log to identify representative workloads and then use a greedy algorithm to recommend the combination of indexes and materialized views that will provide the greatest overall benefit to the system's performance.
\section{The Mechanics of B-Trees}
\section{Strategic Balance in Database Design}
The most prevalent index structure in modern database systems is the \textbf{B-Tree}, specifically the B+ Tree variant. This structure is a balanced tree that automatically scales with the size of the data.
The successful implementation of a database requires a strategic balance between logical flexibility and physical efficiency. Virtual views provide the necessary abstraction to simplify application development and manage data security. Meanwhile, the careful selection of indexes and materialized views ensures that the system remains responsive as the volume of data grows.
\dfn{B-Tree}{A balanced tree structure where every path from the root to a leaf is of equal length. Each node corresponds to a single disk block and contains a sorted list of keys and pointers to either child nodes or data records.}
By employing a formal cost model based on disk access times, designers can objectively evaluate the merits of different storage configurations. The goal is to reach a state where the most frequent and critical operations are prioritized, even if it necessitates a penalty for less common tasks. This continuous process of tuning and optimization is a hallmark of modern relational database management, allowing these systems to handle massive datasets while providing the illusion of instantaneous access to information. An index on a primary key, for example, is almost always beneficial because it is frequently queried and guarantees that only a single block needs to be retrieved to find a unique record. In contrast, an index on a frequently updated non-key attribute requires a more nuanced analysis to ensure it does not become a performance bottleneck.
B-Trees are characterized by a parameter $n$, which defines the maximum number of keys a block can hold. The rules for a B-Tree include:
\begin{itemize}
\item \textbf{Internal Nodes:} Must have between $\lceil (n+1)/2 \rceil$ and $n+1$ children, except for the root, which can have as few as two.
\item \textbf{Leaves:} Hold the actual search keys and pointers to the records. They also include a pointer to the next leaf in sequence to facilitate range scans.
\end{itemize}
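As a concrete instance of the occupancy rule, assume a block holds at most $n = 340$ keys (an illustrative value, not one taken from a particular system). An internal node other than the root must then have between
\[ \left\lceil \frac{n+1}{2} \right\rceil = \left\lceil \frac{341}{2} \right\rceil = 171 \quad \mbox{and} \quad n + 1 = 341 \]
children.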
Ultimately, the choice of views and indexes defines the operational efficiency of the entire information system. A well-designed logical and physical schema acts as the foundation for scalable, high-performance applications, enabling efficient data exploration and robust transaction processing in even the most demanding environments.
\thm{Logarithmic Search Complexity}{In a B-Tree, the number of steps required to find any specific key is proportional to the height of the tree. For a tree with $N$ records and a fan-out of $f$, the height is approximately $\log_f N$. Because $f$ is typically large (often hundreds of keys per block), even a billion records can be searched in just three or four disk accesses.}
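As a rough check of the height estimate, assume $N = 10^9$ records and two illustrative fan-outs, $f = 256$ and $f = 1000$ (both are assumptions chosen for the arithmetic, not measurements of a real system):
\[ 256^3 \approx 1.7 \times 10^7 < 10^9 \le 256^4 \approx 4.3 \times 10^9 \quad \Longrightarrow \quad \lceil \log_{256} 10^9 \rceil = 4, \]
\[ 1000^3 = 10^9 \quad \Longrightarrow \quad \lceil \log_{1000} 10^9 \rceil = 3. \]
With fan-outs in this range, a lookup therefore touches about three or four index levels; a noticeably smaller fan-out, such as $f = 100$, would need five.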
\nt{B-Trees are dynamic; they grow by splitting nodes when they become too full and shrink by merging nodes when deletions leave them under-populated. This ensures that every block remains at least half-full, optimizing disk usage.}
\section{Hash Indices and Constant Time Lookups}
As an alternative to tree-structured indices, databases may use \textbf{Hash Indices}. These are based on a "smoothie machine" analogy: a deterministic function that turns any input into a seemingly random integer.
\dfn{Hash Index}{A structure that uses a hash function to map search keys into specific buckets. Each bucket corresponds to one or more disk blocks holding pointers to the relevant records.}
\thm{Constant Complexity}{A hash index provides expected $O(1)$ lookup time for equality queries. As long as the buckets do not overflow into long chains, the time to locate a specific key stays roughly constant regardless of table size, since it requires only the computation of the hash and a direct jump to the indicated bucket.}
\nt{The major limitation of hash indices is their lack of support for range queries. Because the hash function randomizes the placement of keys, two keys that are close in value (e.g., 19 and 20) will likely end up in completely different parts of the index.}
\section{Index Selection and Cost Modeling}
Creating an index is not a "free" performance boost. Every index added to a relation imposes costs that must be weighed against its benefits. The decision process involves analyzing a query workload and estimating the average disk I/O.
\dfn{Index Creation Costs}{The costs associated with indices include the initial CPU and I/O time to build the structure, the additional disk space required to store it, and the "write penalty"—the fact that every \texttt{INSERT}, \texttt{DELETE}, or \texttt{UPDATE} to the base table must also update every associated index.}
\thm{The Selection Formula}{If $p$ is the probability that a query uses a certain attribute and $1-p$ is the probability of an update, an index on that attribute is beneficial only if the time saved during the queries outweighs the extra time spent on updates. This can be expressed as a linear combination of costs based on the parameters $B(R)$ (blocks) and $T(R)$ (tuples).}
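One way to write this trade-off out is sketched below. The symbols $C_q$ and $C_u$ (average I/O cost of a query and of an update) and $\Delta$ (extra I/O per update caused by the index) are notation introduced here for illustration. The average cost per operation is
\[ p \, C_q^{\mathrm{no\,idx}} + (1-p) \, C_u^{\mathrm{no\,idx}} \quad \mbox{without the index, and} \quad p \, C_q^{\mathrm{idx}} + (1-p) \, C_u^{\mathrm{idx}} \quad \mbox{with it,} \]
so the index pays off exactly when
\[ p \, \bigl(C_q^{\mathrm{no\,idx}} - C_q^{\mathrm{idx}}\bigr) > (1-p) \, \Delta, \qquad \Delta = C_u^{\mathrm{idx}} - C_u^{\mathrm{no\,idx}}. \]
As a worked instance, assume a full scan costs $B(R) = 1000$, a lookup through a non-clustering index costs $T(R)/V(R,a) = 200$, and each update costs $\Delta = 2$ extra I/Os (one read and one write of an index block). Then
\[ p > \frac{\Delta}{B(R) - T(R)/V(R,a) + \Delta} = \frac{2}{802} \approx 0.0025, \]
meaning that under these assumed numbers the index is worthwhile as soon as more than about one operation in four hundred is a query on that attribute.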
\nt{In practice, many systems use an "automatic tuning advisor" that applies a greedy algorithm to suggest the best set of indices for a specific historical workload.}
\section{Indices and Complex Queries}
Indices are particularly powerful when dealing with joins or multiple selection criteria.
\thm{The Index-Join Strategy}{In a join $R \bowtie S$, if $S$ has an index on the join attribute, the system can iterate through $R$ and for each tuple, use the index on $S$ to find matching records. This is significantly faster than a nested-loop join if $R$ is small and the index on $S$ is efficient.}
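A common back-of-the-envelope estimate for this strategy, assuming a non-clustering index on the join attribute $a$ of $S$, uniformly distributed values, and negligible cost for reading the index blocks, is
\[ \mbox{cost} \approx B(R) + T(R) \cdot \frac{T(S)}{V(S,a)}. \]
With illustrative (assumed) values $B(R) = 10$, $T(R) = 200$, $T(S) = 100000$, $V(S,a) = 50000$, and $B(S) = 5000$, this gives
\[ 10 + 200 \cdot \frac{100000}{50000} = 410 \mbox{ block reads,} \]
whereas any plan that scans $S$ in full must read at least $B(R) + B(S) = 5010$ blocks. The advantage shrinks as $R$ grows, since the second term scales with $T(R)$.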
\nt{When multiple indices are available for a single query, a technique called "pointer intersection" can be used. The system retrieves lists of pointers from several indices, intersects them in main memory, and only then reads the data blocks for the tuples that satisfy all conditions.}
\section{Information Retrieval and Inverted Indices}
A specialized form of indexing used for documents is the \textbf{Inverted Index}. This is the technology that powers web search engines and large-scale document repositories.
\dfn{Inverted Index}{A mapping from words (keywords) to the list of documents in which those words appear. Often, these lists include metadata such as the position of the word in the document or whether it appeared in a title or anchor tag.}
\nt{To optimize inverted indices, systems often use "stemming" (reducing words to their root form) and "stop words" (ignoring common words like "the" or "and" that do not help distinguish documents).}
\section{Summary of Design Principles}
The theory of views and indices suggests that database design is as much about managing the physical medium as it is about logical modeling.
\thm{The Table Universe and Consistency}{While views provide a filtered perspective of the data, they must remain consistent with the "table universe"—the set of all possible valid states of the database. Constraints and indices must be applied such that they hold true across all these potential states, ensuring that neither hardware failure nor concurrent access can corrupt the logical integrity of the system.}
\nt{Ultimately, the goal of indices is to turn linear or quadratic problems into logarithmic or constant ones. By carefully selecting which attributes to index based on the $B, T,$ and $V$ parameters, a designer can create a system that remains responsive even under the weight of massive datasets.}
In summary, views provide the necessary abstraction to keep application code clean and secure, while indices provide the surgical precision required to extract data from high-volume storage. A database is essentially a large library; a view is a specific bookshelf curated for a student, while an index is the card catalog that allows a librarian to find one specific page in a million volumes without having to read every book in the building.