\chapter{Introduction}
Modern technological infrastructure relies heavily on the ability to manage vast quantities of facts effectively. At its most fundamental level, we distinguish between data—which represents raw, stored facts—and information, which is data endowed with specific meaning. When this information is applied to solve meaningful problems or used in decision-making processes, it matures into knowledge. An information system is essentially a collection of software programs designed to manage this progression of information efficiently.
In the broader context of science, data management reflects a paradigm shift. While classical mathematics and physics study the world as it must be or as it is observed in nature, computer science and data science study what can be computed and what insights can be drawn from data. Data can be viewed as the "matter" of the digital world, and studying its behavior and storage is central to modern engineering.
\dfn{Information System}{
A software program or a complex suite of programs dedicated to the management, storage, and exchange of information.
}
\thm{Data-Information-Knowledge Hierarchy}{
The progression from raw facts (data) to interpreted meaning (information) and finally to the purposeful application of that information (knowledge).
}
\section{Historical Context and the Evolution of Storage}
The necessity of recording information spans human history, moving from oral traditions to the invention of writing, accounting, and eventually the printing press. However, the mid-20th century marked the beginning of the computational era. In the 1960s, data was primarily managed through file systems. These systems were rudimentary, as they essentially involved independent programs reading from local disks, often leading to data redundancy and inconsistency across different applications.
The 1970s ushered in the Relational Era, largely defined by the work of Edgar Codd. He proposed a model where data is organized into tables (relations), allowing users to interact with data logically rather than worrying about its physical placement on a disk. The 1980s saw the rise of object-oriented models, and the 2000s introduced the NoSQL era, which addressed the needs of massive, distributed data through key-value stores, document stores, and graph databases.
\dfn{Relational Database}{
An organized collection of related data presented to the user as a set of two-dimensional tables called relations.
}
\section{The Purpose and Functionality of a DBMS}
A Database Management System (DBMS) is a specialized software suite designed to manage and query databases. Relying on simple file systems for complex applications is problematic because it is difficult to combine data from different files, and there is no built-in support for multiple users or protection against data loss. A DBMS provides five critical functionalities to solve these issues.
First, it allows for the creation of new databases and the definition of their schemas, or logical structures. Second, it enables efficient querying and modification of data through specialized languages. Third, it supports the storage of immense datasets—reaching into the terabytes and petabytes—over long periods. Fourth, it ensures durability, meaning the system can recover from failures or errors without losing data. Finally, it manages concurrent access, allowing many users to interact with the data simultaneously without causing inconsistencies.
\dfn{Database Management System (DBMS)}{
A powerful tool for creating, managing, and efficiently querying large amounts of persistent, safe data.
}
\thm{Data Independence}{
The principle, championed by Edgar Codd, that separates the physical storage of data from its logical representation, allowing users to interact with a logical model that the software then translates into physical structures.
}
\section{Data Classification and the Three Vs}
Data within an information system typically takes one of three shapes. Structured data, such as that found in relational databases or spreadsheets, follows a rigid schema. Semi-structured data, including XML, JSON, and YAML, possesses some internal structure but is more flexible and can be validated against frames or schemas. Unstructured data, such as raw text, audio, images, or video, lacks a predefined format and often requires advanced linear algebra and vector-based processing to analyze.
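The difference between a rigid schema and semi-structured data can be sketched with Python's built-in \texttt{json} module; the record below and its field names are illustrative, not taken from the source.

```python
import json

# A semi-structured record: fields may be nested or missing entirely,
# unlike a column in a rigid relational schema.
record = json.loads(
    '{"name": "Ada", "skills": ["SQL", "C"], "office": {"building": "B12"}}'
)

# Access is by key; absent fields must be handled explicitly.
print(record["name"])                       # Ada
print(record.get("phone", "not recorded"))  # not recorded
```

Note that no schema was declared anywhere: the structure travels with the data itself, which is exactly what makes semi-structured formats flexible and, at the same time, harder to validate.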
The scale of modern data is often described by the "Three Vs": Volume (the sheer amount of data, moving from terabytes to zettabytes), Variety (the different formats and sources of data), and Velocity (the speed at which new data is generated and must be processed). Understanding the prefixes of the International System of Units, such as Peta ($10^{15}$) and Exa ($10^{18}$), is essential for engineers working at this scale.
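The SI prefixes mentioned above can be written down as simple powers of ten, which makes unit conversions at scale a matter of integer arithmetic (a minimal sketch; the dictionary below is just a convenience, not a standard library):

```python
# SI prefixes relevant to data volume, as powers of ten.
prefixes = {
    "kilo": 10**3, "mega": 10**6, "giga": 10**9,
    "tera": 10**12, "peta": 10**15, "exa": 10**18, "zetta": 10**21,
}

# One petabyte expressed in terabytes:
print(prefixes["peta"] // prefixes["tera"])  # 1000
```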
\dfn{Structured Data}{
Data that is organized into a highly formatted structure, typically using the relational model, which makes it easily searchable via languages like SQL.
}
\section{System Architecture and the Three-Tier Model}
Modern information systems are often organized into a three-tier architecture to separate concerns and improve scalability. The top layer is the user interface or front-end, which manages the presentation and user interaction. The middle layer is the business logic, where the specific rules and processes of the application are defined. The bottom layer is the database system, which handles data persistence and management.
Within this architecture, the DBMS itself is divided into various components. A storage manager controls how data is placed on disk and moved between the disk and main memory. The query processor parses and optimizes requests to find the most efficient execution plan. The transaction manager ensures that database operations are performed safely and reliably.
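The query processor's work can actually be observed from outside. As a sketch, SQLite (bundled with Python as \texttt{sqlite3}) exposes its chosen execution plan via \texttt{EXPLAIN QUERY PLAN}; the table and index names here are illustrative.

```python
import sqlite3

# In-memory database; 'employee' and 'idx_name' are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_name ON employee (name)")

# Ask the query processor which execution plan it selected.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM employee WHERE name = ?", ("Codd",)
).fetchall()
for row in plan:
    print(row)
```

On a typical SQLite build the plan reports an index search rather than a full table scan, illustrating that the optimizer, not the user, decides how the data is physically accessed.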
\dfn{3-Tier Architecture}{
A software design pattern consisting of three layers: the presentation layer (User Interface), the logic layer (Business Logic), and the data layer (Database System).
}
\section{Database Languages: DDL and DML}
Interaction with a DBMS occurs through two primary types of languages. The Data Definition Language (DDL) is used to establish and modify the metadata, which is the "data about data" describing the schema and constraints of the database. The Data Manipulation Language (DML) is used to search, retrieve, and modify the actual data stored within that schema.
These languages can be further categorized as imperative or declarative. Imperative languages require the programmer to specify \emph{how} to perform a task (e.g., C++, Java), while declarative languages, most notably SQL, allow the user to specify \emph{what} they want, leaving the "how" to the system's query optimizer.
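The DDL/DML split can be demonstrated with Python's built-in \texttt{sqlite3} module; the table \texttt{student} and its columns are illustrative names chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema -- this creates metadata, not data.
conn.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, year INTEGER)"
)

# DML: manipulate the actual data stored within that schema.
conn.execute("INSERT INTO student (name, year) VALUES (?, ?)", ("Ada", 1))
conn.execute("INSERT INTO student (name, year) VALUES (?, ?)", ("Edgar", 2))

# A declarative query: we state WHAT we want, not HOW to retrieve it.
rows = conn.execute("SELECT name FROM student WHERE year = 1").fetchall()
print(rows)  # [('Ada',)]
```

The \texttt{SELECT} statement never mentions indexes, scan order, or file layout; translating the request into physical access steps is entirely the system's job.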
\dfn{Metadata}{
The structural information that defines the types and constraints of the data, essentially acting as a blueprint for the database.
}
\thm{Declarative Language Property}{
The characteristic of languages like SQL that allows users to describe the desired result of a query without defining the physical execution steps or algorithms required to reach that result.
}
\section{Transaction Management and the ACID Test}
A transaction is a single unit of work consisting of one or more database operations that must be treated as an indivisible whole. To maintain integrity, transactions must satisfy the ACID properties.
\dfn{Transaction}{
A program or set of actions that manages information and must be executed as an atomic unit to preserve database consistency.
}
\thm{The ACID Properties}{
The fundamental requirements for reliable transaction processing:
\begin{itemize}
\item \textbf{Atomicity}: All-or-nothing execution; if any part of the transaction fails, the entire transaction is rolled back.
\item \textbf{Consistency}: Every transaction must leave the database in a state that satisfies all predefined rules and constraints.
\item \textbf{Isolation}: Each transaction must appear to execute in a vacuum, as if no other transactions are occurring simultaneously.
\item \textbf{Durability}: Once a transaction is completed, its effects are permanent and must survive any subsequent system failures.
\end{itemize}
}
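Atomicity in particular can be made concrete with a small sketch, again using Python's \texttt{sqlite3}; the \texttt{account} table and the simulated failure are hypothetical, constructed only to show a rollback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()

# A transfer is one atomic unit: either both updates happen, or neither.
try:
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    # Simulate a crash before the matching credit is applied.
    raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    conn.rollback()  # atomicity: the partial debit is undone

balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(balances)  # [(100,), (50,)]
```

After the rollback both balances are unchanged: the half-finished transfer left no trace, which is exactly the all-or-nothing guarantee described above.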
\section{Database Roles and Ecosystems}
A database environment involves several distinct roles. The Database Administrator (DBA) is responsible for coordination, monitoring, and access control. The Database Designer creates the structure and schema of the content. Power Users may interact with the system through complex programming, while Data Analysts use DML for updates and queries. Finally, Parametric Users interact with the database through simplified interfaces like menus and forms.
As systems grow, the challenge often becomes one of information integration. Large organizations may have many legacy databases that use different models or terms. Integration strategies include creating data warehouses—centralized repositories where data from various sources is translated and copied—or using mediators, which provide an integrated model of the data while translating requests for each individual source database.
\dfn{Data Warehouse}{
A centralized database used for reporting and data analysis, which stores integrated data from one or more disparate sources.
}
\thm{Legacy Database Problem}{
The difficulty of decommissioning old database systems because existing applications depend on them, necessitating the creation of integration layers to combine their data with newer systems.
}