\chapter{Introduction}
Modern engineering increasingly relies on the structured management of information, treating data as the digital counterpart of physical matter. In the pursuit of understanding the world, scientific inquiry can be arranged in a matrix of paradigms: mathematics explores the necessary truths of natural thought, computer science analyzes the necessary truths of artificial computation, and physics observes the natural world as it exists. Data science completes the matrix as the ``physics of computer science,'' using machine-driven computation to observe and interpret the world through empirical evidence.
The objective of an information system is to transform raw observations into actionable intelligence. This process follows a strict hierarchy. Data consists of raw, uninterpreted facts that are stored and moved between systems. When these facts are associated with specific meanings, they become information. Finally, when this information is applied to meaningful tasks or decision-making, it evolves into knowledge.
\dfn{Information System}{A software program or a synchronized set of programs designed to manage, store, and provide efficient access to information.}
\thm{The Knowledge Hierarchy}{The structured progression from raw data to information through added meaning, culminating in knowledge through practical application.}
\nt{In modern engineering, making superior decisions is no longer just about observing numbers but about leveraging knowledge derived through information systems.}
\section{The Historical Evolution of Data Management}
The history of data management is a narrative of scaling human memory and communication. Before the advent of writing, information was transmitted through oral tradition, hindered by the limits of human recall and physical distance. The invention of writing marked the first major turning point, allowing symbols to be preserved on durable media such as stone and clay.
Ancient civilizations intuitively adopted the tabular format for data. Babylonian clay tablets nearly four thousand years old, most famously Plimpton 322, record Pythagorean triples organized in rows and columns, which suggests that the table is a primary cognitive tool for human information organization. The invention of the printing press in the 15th century enabled the mass distribution of data, leading eventually to the mechanical and electronic computing revolutions of the 20th century.
In the early decades of computing, specifically the 1960s, data management was handled through direct file systems: programmers had to know the physical location of data on disk and write complex logic to retrieve it. This changed in 1970, when Edgar F. Codd introduced the relational model. Codd argued that users should interact with data through intuitive tables while the underlying machine complexities remain hidden. This principle of data independence paved the way for the Object Era in the 1980s and the NoSQL Era in the 2000s, the latter driven by the massive scale of modern social networks and search engines.
\nt{The tabular format has remained the most intuitive and enduring method for humans to represent structured data, from ancient clay to modern SQL.}
\section{The Structure and Shapes of Data}
Data is categorized by its degree of organization. Unstructured data, such as natural-language text, audio, images, and video, exists in a raw form that was historically difficult for computers to process. However, recent machine-learning breakthroughs built on linear algebra, which represent such content as numerical vectors, have enabled modern systems to interpret and even generate it.
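As a minimal illustration of the vector idea, the Python sketch below compares words by the cosine of the angle between their vectors. The three-dimensional ``embeddings'' are invented toy values; real systems learn far higher-dimensional vectors from data.
\begin{verbatim}
# Toy "embeddings": unstructured words become comparable as vectors.
# These 3-D values are invented; real systems learn thousands of dimensions.
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = {
    "database": np.array([0.9, 0.1, 0.2]),
    "table":    np.array([0.8, 0.2, 0.1]),
    "guitar":   np.array([0.1, 0.9, 0.7]),
}

print(cosine_similarity(embeddings["database"], embeddings["table"]))   # high
print(cosine_similarity(embeddings["database"], embeddings["guitar"]))  # low
\end{verbatim}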
Structured data is the highly organized information typically found in spreadsheets and relational databases. Between these lies semi-structured data, which uses tags (like XML or JSON) to provide some semantic context without the rigid requirements of a fixed schema. To manage these types, engineers utilize data models—mathematical notations for describing data structures, the operations allowed on them, and the constraints they must follow.
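A short Python sketch makes the semi-structured case concrete: each JSON record below carries its own tags, and records need not share an identical set of fields. The records themselves are invented for illustration.
\begin{verbatim}
# Semi-structured data: each record carries its own tags, and records
# need not share an identical set of fields.
import json

raw = """
[
  {"name": "Ada",   "role": "engineer",  "languages": ["SQL", "Python"]},
  {"name": "Edgar", "role": "researcher"}
]
"""

for person in json.loads(raw):
    # .get() tolerates a missing field: no rigid schema is enforced.
    print(person["name"], person.get("languages", "(no languages listed)"))
\end{verbatim}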
\dfn{Data Model}{A formal notation that describes the structure of data, the methods for querying and modifying it, and the rules that maintain its integrity.}
\thm{The Three Vs of Big Data}{The defining challenges of modern data management are Volume (the sheer amount of bytes), Variety (the diversity of data types), and Velocity (the speed at which data is generated and must be processed).}
\section{The Necessity of Database Management Systems}
In early computing environments, applications directly accessed files on local disks. This approach caused severe problems as systems grew: data was often redundant (stored in multiple places) and inconsistent (different copies of the same data conflicting), and it was difficult to combine data from different sources or to control who had access to specific information.
A Database Management System (DBMS) resolves these issues by serving as a central software layer. A robust DBMS is expected to fulfill five primary roles, sketched in code after the list below:
\begin{enumerate}
\item Allow users to define the structure (schema) of new databases.
\item Provide high-level languages for querying and changing data.
\item Facilitate the storage of massive datasets over long durations.
\item Ensure durability by recovering data after system failures.
\item Manage concurrent access by multiple users to prevent data corruption.
\end{enumerate}
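The following minimal Python sketch maps the five roles onto SQLite, the embedded DBMS shipped with Python's standard library. The file and table names are invented for illustration, and a single SQLite file is of course a stand-in for a full server-class DBMS.
\begin{verbatim}
# The five DBMS roles, sketched with SQLite from Python's standard library.
import sqlite3

conn = sqlite3.connect("example.db")      # role 3: long-term storage on disk

# Role 1: define the structure (schema) of a new database.
conn.execute("""CREATE TABLE IF NOT EXISTS measurements
                (id INTEGER PRIMARY KEY, value REAL)""")

# Role 2: a high-level language for changing and querying data.
conn.execute("INSERT INTO measurements (value) VALUES (?)", (42.0,))
conn.commit()                             # role 4: durable once committed

print(conn.execute("SELECT id, value FROM measurements").fetchall())

# Role 5: concurrency control. SQLite serializes writers with locks;
# larger systems use full transaction managers for many simultaneous users.
conn.close()
\end{verbatim}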
\dfn{Database Management System (DBMS)}{A specialized software suite used to create, manage, and query databases, shielding the user from physical storage details.}
\nt{A "Database System" is the holistic term for the combination of the DBMS software and the actual data stored within it.}
\section{System Architecture and Data Independence}
Most modern information systems utilize a three-tier architecture to ensure modularity and scalability. The top layer is the User Interface (UI), which handles human interaction. The middle layer is the Business Logic, where the rules of the application are processed. The bottom layer is the Persistence layer, where the DBMS manages data storage on a disk or in the cloud.
The most vital concept within this architecture is data independence, championed by Codd. This principle separates the logical level (the tables humans see) from the physical level (the bits stored on the machine). Because of this separation, an engineer can change the physical storage medium, from a hard drive to a data center or even DNA storage, without the user ever needing to change their queries.
\dfn{Data Independence}{The ability of a database system to provide a stable logical view of data that is entirely independent of its physical storage implementation.}
\thm{Three-Tier Architecture}{A design pattern that divides an application into the presentation, logic, and data management layers to simplify development and maintenance.}
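A minimal Python sketch of the principle, with invented class and method names: the logical interface stays fixed while the physical storage behind it is swapped freely.
\begin{verbatim}
# Data independence in miniature: the logical interface never changes,
# while the physical storage behind it does. All names are invented.
from abc import ABC, abstractmethod

class PersonStore(ABC):
    """Logical level: callers only ever see this interface."""
    @abstractmethod
    def save(self, name): ...
    @abstractmethod
    def all_names(self): ...

class InMemoryStore(PersonStore):
    """Physical level A: a Python list in RAM."""
    def __init__(self):
        self._names = []
    def save(self, name):
        self._names.append(name)
    def all_names(self):
        return list(self._names)

class FileStore(PersonStore):
    """Physical level B: one name per line in a text file."""
    def __init__(self, path):
        self._path = path
    def save(self, name):
        with open(self._path, "a") as f:
            f.write(name + "\n")
    def all_names(self):
        with open(self._path) as f:
            return f.read().splitlines()

def report(store):
    # The "query" is identical no matter how the bytes are stored.
    store.save("Codd")
    print(store.all_names())

report(InMemoryStore())         # swap the physical medium freely
report(FileStore("names.txt"))
\end{verbatim}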
\section{Query Languages and Internal Processes}
Interaction with a DBMS occurs through specialized languages. The Data Definition Language (DDL) defines metadata, the ``data about the data,'' such as the names of columns and their types. The Data Manipulation Language (DML) searches for and updates the actual records. In practice both are expressed as parts of SQL: CREATE TABLE is DDL, while SELECT, INSERT, UPDATE, and DELETE are DML.
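As a concrete illustration of metadata, SQLite can describe a table's own structure through its PRAGMA table_info statement. In the sketch below (table and column names invented), DDL first creates the metadata, a query then reads that metadata, and DML finally touches the records themselves.
\begin{verbatim}
# DDL creates metadata; the DBMS can then describe its own structure.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensors (id INTEGER PRIMARY KEY, label TEXT)")  # DDL

# Query the metadata itself: column names and types, not the records.
for _, name, col_type, *_ in conn.execute("PRAGMA table_info(sensors)"):
    print(name, col_type)                       # id INTEGER, label TEXT

conn.execute("INSERT INTO sensors (label) VALUES ('thermo')")              # DML
print(conn.execute("SELECT label FROM sensors").fetchall())
\end{verbatim}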
SQL is distinct because it is a declarative language. In imperative languages like C++ or Python, a programmer writes step-by-step instructions for how to perform a task; in a declarative language, the user describes only what result they want. The DBMS uses a query compiler to analyze the request and choose the most efficient path, the ``query plan,'' which its execution engine then carries out to retrieve the data.
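The contrast can be seen directly in a short Python sketch against SQLite (table contents invented): the imperative version spells out the scan and the test row by row, the declarative version states only the desired result, and EXPLAIN QUERY PLAN reveals the path the engine chose.
\begin{verbatim}
# The same question asked twice: imperatively and declaratively.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ada", 36), ("Edgar", 46), ("Grace", 43)])

# Imperative: we spell out HOW -- scan every row, test it, collect it.
over_forty = []
for name, age in conn.execute("SELECT name, age FROM people"):
    if age > 40:
        over_forty.append(name)

# Declarative: we state only WHAT we want; the engine picks the path.
declared = [row[0] for row in
            conn.execute("SELECT name FROM people WHERE age > 40")]
assert over_forty == declared

# Ask the engine for its chosen path -- the query plan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM people WHERE age > 40")
print(plan.fetchall())   # a full table SCAN here; an index would change it
\end{verbatim}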
\nt{The efficiency of modern databases is largely due to the query compiler's ability to optimize a declarative request into a high-performance execution strategy.}
\section{Measurement and Scaling in the Era of Big Data}
The volume of data generated today grows exponentially, often said to double every few years. Engineers must therefore be fluent in the units of data volume. The standard SI prefixes kilo and mega denote powers of 10 ($10^3$ and $10^6$), while computing often relies on the IEC binary prefixes kibi ($2^{10} = 1024$) and mebi ($2^{20}$) for precision in memory and storage calculations. We are now entering the age of zettabytes ($10^{21}$ bytes) and yottabytes ($10^{24}$ bytes), requiring a deep understanding of how to scale information systems to meet these unprecedented demands.
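The gap between the two conventions is small per prefix but compounds with scale, as a worked comparison shows:
\[
1\,\text{kB} = 10^{3} = 1000 \text{ bytes},
\qquad
1\,\text{KiB} = 2^{10} = 1024 \text{ bytes},
\]
and at the giga/gibi level
\[
\frac{2^{30}}{10^{9}} = \frac{1\,073\,741\,824}{1\,000\,000\,000} \approx 1.074,
\]
so a ``gigabyte'' read as $2^{30}$ bytes is roughly $7.4\%$ larger than one read as $10^{9}$ bytes, a discrepancy that matters when provisioning storage at scale.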
\nt{The total amount of data created in just the last few years is estimated to be greater than the sum of all information produced in the entirety of previous human history.}