% Notes V.2.0.0 (Rewrote Notes)
\chapter{Introduction}
Modern engineering relies increasingly on the structured management of information, and data can be regarded as the fundamental "matter" of the digital world. Scientific inquiry can be arranged into a matrix of paradigms: mathematics explores necessary truths through natural thought, computer science analyzes the theoretical necessity of artificial computation, and physics observes the world as it exists. Data science then acts as the "physics of computer science," using machine-driven computation to observe and interpret the world through empirical evidence.

The objective of an information system is to transform raw observations into actionable intelligence, and this process follows a strict hierarchy. Data consists of raw, uninterpreted facts that are stored and moved between systems. When these facts are associated with specific meanings, they become information. When that information is applied to meaningful tasks or decision-making, it evolves into knowledge.
\dfn{Information System}{A software program or a synchronized set of programs designed to manage, store, and provide efficient access to information.}
\thm{The Knowledge Hierarchy}{The structured progression from raw data to information through added meaning, culminating in knowledge through practical application.}
\section{The Historical Evolution of Data Management}

\nt{In modern engineering, superior decisions come not merely from observing numbers but from leveraging knowledge derived through information systems.}
The history of data management is a narrative of scaling human memory and communication. Before the advent of technology, information was transmitted through oral tradition, hindered by the limits of human recall and distance. The invention of writing marked the first major turning point, allowing symbols to be preserved on durable media such as stone or clay.
\dfn{Relational Database}{An organized collection of related data presented to the user as a set of two-dimensional tables called relations.}

Ancient civilizations intuitively adopted the tabular format for data: clay tablets thousands of years old have been found containing relational data, such as Pythagorean triples, organized in rows and columns, which suggests that tables are a primary cognitive tool for organizing information. The invention of the printing press in the fifteenth century enabled the mass distribution of data, leading eventually to the mechanical and electronic computing revolutions of the twentieth century.
In the early decades of computing, specifically the 1960s, data management relied on direct file systems: programmers had to know the physical location of data on disk and write complex logic to retrieve it. This changed in 1970, when Edgar Codd introduced the relational model, arguing that users should interact with data through intuitive tables while the underlying machine complexities remain hidden. This principle of data independence paved the way for the Object Era of the 1980s and the NoSQL Era of the 2000s, the latter driven by the massive scale of modern social networks and search engines.

\nt{The tabular format has remained the most intuitive and enduring method for humans to represent structured data, from ancient clay to modern SQL.}
\section{The Structure and Shapes of Data}
Data within an information system typically takes one of three shapes. Structured data, such as that found in relational databases or spreadsheets, follows a rigid schema. Semi-structured data, including XML, JSON, and YAML, possesses some internal structure but is more flexible and can be validated against frames or schemas. Unstructured data, such as raw text, audio, images, or video, lacks a predefined format and often requires advanced linear algebra and vector-based processing to analyze. To manage these shapes, engineers use data models: mathematical notations for describing data structures, the operations allowed on them, and the constraints they must follow.

\dfn{Data Model}{A formal notation that describes the structure of data, the methods for querying and modifying it, and the rules that maintain its integrity.}

\thm{The Three Vs of Big Data}{The defining challenges of modern data management are Volume (the sheer amount of bytes), Variety (the diversity of data types), and Velocity (the speed at which data is generated and must be processed).}
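The three shapes can be contrasted with a short sketch. This assumes nothing beyond Python's standard \texttt{json} module, and the records shown are invented for illustration.

```python
import json

# Structured: rigid schema, every record has the same fields (hypothetical row).
structured_row = ("alice", 30, "engineering")

# Semi-structured: self-describing tags; fields may vary from record to record.
semi_structured = json.loads('{"name": "bob", "age": 25, "hobbies": ["chess"]}')

# Unstructured: raw text with no predefined format.
unstructured = "Bob, who is twenty-five, enjoys chess."

# The JSON tags let a program navigate the data without a fixed schema.
print(semi_structured["name"])        # -> bob
print(semi_structured.get("salary"))  # -> None (the field is simply absent)
```

Note that the semi-structured record tolerates a missing field gracefully, whereas a structured table would either require the column or reject the row.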
\dfn{Structured Data}{Data that is organized into a highly formatted structure, typically using the relational model, which makes it easily searchable via languages like SQL.}

\section{The Necessity of Database Management Systems}

In primitive computing environments, applications accessed files on local disks directly. As systems grew, this approach caused severe problems: data was often redundant (stored in multiple places) and inconsistent (conflicting versions of the same facts), and it was difficult to combine data from different sources or to control who had access to specific information.
A Database Management System (DBMS) resolves these issues by serving as a central software layer between applications and their data. A robust DBMS is expected to fulfill five primary roles:
\begin{enumerate}
    \item Allow users to define the structure (schema) of new databases.
    \item Provide high-level languages for querying and modifying data.
    \item Facilitate the storage of massive datasets over long durations.
    \item Ensure durability by recovering data after system failures.
    \item Manage concurrent access by multiple users to prevent data corruption.
\end{enumerate}
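The first two roles can be seen in miniature with Python's built-in \texttt{sqlite3} module; the \texttt{people} table and its contents are invented for illustration. The sketch also contrasts a declarative SQL query with an imperative loop that computes the same answer by hand.

```python
import sqlite3

# An in-memory database; a hypothetical 'people' schema for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")   # role 1: define a schema
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("alice", 30), ("bob", 25), ("carol", 41)])  # modify data

# Role 2: a high-level, declarative query -- we state WHAT we want...
declarative = conn.execute(
    "SELECT name FROM people WHERE age > 28 ORDER BY name").fetchall()

# ...whereas imperative code must spell out HOW to filter and sort.
rows = conn.execute("SELECT name, age FROM people").fetchall()
imperative = sorted(name for name, age in rows if age > 28)

print(declarative)  # [('alice',), ('carol',)]
print(imperative)   # ['alice', 'carol']
```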
\dfn{Database Management System (DBMS)}{A specialized software suite used to create, manage, and query databases, shielding the user from physical storage details.}
Internally, the DBMS is divided into several components. A storage manager controls how data is placed on disk and moved between the disk and main memory. A query processor parses and optimizes requests to find the most efficient execution plan. A transaction manager ensures that database operations are performed safely and reliably.

\nt{A "Database System" is the holistic term for the combination of the DBMS software and the actual data stored within it.}

\section{System Architecture and Data Independence}

\dfn{3-Tier Architecture}{A software design pattern consisting of three layers: the presentation layer (User Interface), the logic layer (Business Logic), and the data layer (Database System).}
Most modern information systems adopt a three-tier architecture to ensure modularity and scalability. The top layer is the User Interface (UI), which handles human interaction. The middle layer is the Business Logic, where the rules of the application are processed. The bottom layer is the Persistence layer, where the DBMS manages data storage on a disk or in the cloud.

The most vital concept within this architecture is data independence, championed by Edgar Codd. This principle separates the logical level (the tables humans see) from the physical level (the bits stored on the machine). Thanks to this separation, an engineer can change the physical storage medium, from a hard drive to a data center or even DNA storage, without users ever needing to change their queries.
\dfn{Data Independence}{The ability of a database system to provide a stable logical view of data that is entirely independent of its physical storage implementation.}
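A small illustration of this principle, again using Python's \texttt{sqlite3} (the \texttt{readings} table is hypothetical): the same logical schema and the same query run unchanged whether the bits live in RAM or in a file on disk.

```python
import os
import sqlite3
import tempfile

def build_and_query(target):
    # The logical model (table 'readings') and the query are identical;
    # only the physical location of the bits differs.
    conn = sqlite3.connect(target)
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)",
                     [("t1", 20.5), ("t2", 21.0)])
    conn.commit()
    result = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
    conn.close()
    return result

in_memory = build_and_query(":memory:")             # data lives in RAM
path = os.path.join(tempfile.mkdtemp(), "phys.db")
on_disk = build_and_query(path)                     # data lives in a file
assert in_memory == on_disk == 2                    # same logical answer
```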
\section{Database Languages: DDL and DML}

Interaction with a DBMS occurs through two primary families of languages. The Data Definition Language (DDL) establishes and modifies the metadata, the "data about data" that describes the schema and constraints of the database, such as the names and types of its columns. The Data Manipulation Language (DML), most prominently SQL, is used to search, retrieve, and modify the actual records stored within that schema.

\dfn{Metadata}{The structural information that defines the types and constraints of the data, essentially acting as a blueprint for the database.}

SQL is distinctive because it is a declarative language. In imperative languages such as C++ or Python, the programmer writes step-by-step instructions for \emph{how} to perform a task; in a declarative language, the user describes only \emph{what} result they want. The DBMS uses a query compiler to analyze the request and an execution engine to carry out the most efficient path, the "query plan," to retrieve the data.

\thm{Declarative Language Property}{The characteristic of languages like SQL that allows users to describe the desired result of a query without defining the physical execution steps or algorithms required to reach that result.}

\section{Transaction Management and the ACID Test}

A transaction is a single unit of work consisting of one or more database operations that must be treated as an indivisible whole. To maintain integrity, transactions must satisfy the ACID properties.
\dfn{Transaction}{A program or set of actions that manages information and must be executed as an atomic unit to preserve database consistency.}

\nt{The efficiency of modern databases is largely due to the query compiler's ability to optimize a declarative request into a high-performance execution strategy.}
\thm{The ACID Properties}{
The fundamental requirements for reliable transaction processing:
\begin{itemize}
    \item \textbf{Atomicity}: All-or-nothing execution; if any part of the transaction fails, the entire transaction is rolled back.
    \item \textbf{Consistency}: Every transaction must leave the database in a state that satisfies all predefined rules and constraints.
    \item \textbf{Isolation}: Each transaction must appear to execute in a vacuum, as if no other transactions were running simultaneously.
    \item \textbf{Durability}: Once a transaction is completed, its effects are permanent and must survive any subsequent system failure.
\end{itemize}
}
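Atomicity in particular can be observed directly with Python's \texttt{sqlite3}, whose connection context manager commits on success and rolls back on error. The account data and the overdraft rule below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 100)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE owner = 'a'")
        # Simulate a consistency rule: no negative balances allowed.
        (bal,) = conn.execute(
            "SELECT balance FROM accounts WHERE owner = 'a'").fetchone()
        if bal < 0:
            raise ValueError("overdraft")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE owner = 'b'")
except ValueError:
    pass  # atomicity: the failed transfer was rolled back in full

balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(balances)  # {'a': 100, 'b': 100} -- as if the transfer never happened
```

Had the transfer been small enough to pass the check, both updates would have committed together; there is no state in which only one of them applies.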
\section{Measurement and Scaling in the Era of Big Data}

The scale of data generated today grows exponentially and is often said to double every few years. Engineers must therefore be familiar with the units used to measure data volume. The standard SI prefixes kilo and mega denote powers of ten ($10^3$ and $10^6$), continuing through Peta ($10^{15}$) and Exa ($10^{18}$), while computer science often relies on binary prefixes such as kibi ($2^{10} = 1024$) and mebi ($2^{20}$) for precision in memory and storage calculations. We are now entering the age of Zettabytes and Yottabytes, which demands a deep understanding of how to scale information systems to meet these unprecedented volumes.

\section{Database Roles and Ecosystems}

A database environment involves several distinct roles. The Database Administrator (DBA) is responsible for coordination, monitoring, and access control. The Database Designer creates the structure and schema of the content. Power Users may interact with the system through complex programming, while Data Analysts use the DML for updates and queries. Finally, Parametric Users interact with the database through simplified interfaces such as menus and forms.

As systems grow, the challenge often becomes one of information integration. Large organizations may run many legacy databases that use different models or terms. Integration strategies include building data warehouses, centralized repositories into which data from the various sources is translated and copied, or deploying mediators, which present an integrated model of the data while translating requests for each individual source database.
\dfn{Data Warehouse}{A centralized database used for reporting and data analysis, which stores integrated data from one or more disparate sources.}
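A toy sketch of the warehouse approach, once more with Python's \texttt{sqlite3}: two hypothetical source databases with different local schemas are translated and copied into one integrated table. All table names, column names, and the 0.9 exchange rate are arbitrary assumptions.

```python
import sqlite3

# Two hypothetical legacy sources, each with its own local schema.
sales_eu = sqlite3.connect(":memory:")
sales_eu.execute("CREATE TABLE orders (item TEXT, eur REAL)")
sales_eu.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("widget", 9.0), ("gear", 4.5)])

sales_us = sqlite3.connect(":memory:")
sales_us.execute("CREATE TABLE purchases (product TEXT, usd REAL)")
sales_us.execute("INSERT INTO purchases VALUES ('widget', 10.0)")

# The warehouse translates both schemas into one integrated model.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (item TEXT, amount_eur REAL, region TEXT)")
for item, eur in sales_eu.execute("SELECT item, eur FROM orders"):
    warehouse.execute("INSERT INTO sales VALUES (?, ?, 'EU')", (item, eur))
for product, usd in sales_us.execute("SELECT product, usd FROM purchases"):
    warehouse.execute("INSERT INTO sales VALUES (?, ?, 'US')",
                      (product, usd * 0.9))  # assumed USD-to-EUR rate

total = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(total)  # 3
```

A mediator would instead leave the rows in the source databases and perform this translation on the fly for each query.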
\thm{Legacy Database Problem}{The difficulty of decommissioning old database systems because existing applications depend on them, necessitating the creation of integration layers to combine their data with newer systems.}

\nt{The total amount of data created in just the last few years is estimated to be greater than the sum of all information produced in the entirety of previous human history.}