Evolution of Databases and Codd Rules
The evolution of databases began with IBM’s development of the first commercially successful database management system (DBMS) in the 1960s. The relational model of databases, introduced by E.F. Codd in the 1970s, organized data into tables consisting of rows and columns, leading to the development of Structured Query Language (SQL). The rise of the internet and e-commerce in the 1990s led to the development of NoSQL databases for handling vast amounts of unstructured data. The Chord protocol, proposed in 2001, is a distributed hash table (DHT) algorithm used for maintaining the consistency and reliability of data across multiple nodes in distributed databases.
Codd’s 12 principles for relational databases established a framework for designing and implementing a robust, flexible, and scalable data management system. These principles are relevant in data warehousing today because they provide a standard for evaluating data warehousing systems and ensuring that they can handle large volumes of data, support complex queries, maintain data integrity, and evolve over time to meet changing business needs.
The 12 principles of Codd’s rules for relational databases are as follows:
1. Information Rule: All data in the database should be represented as values in tables. This means that the database should be structured as a collection of tables, with each table representing a single entity or relationship.
2. Guaranteed Access Rule: Each value in the database should
be accessible by specifying its table name, primary key value, and column name. This ensures that every piece of data in the database is uniquely identifiable and can be accessed efficiently.
3. Systematic Treatment of Null Values: The database should support the use of null values to represent missing or unknown data. These null values should be treated consistently throughout the system, with appropriate support for operations such as null comparisons and null concatenations.
4. Dynamic Online Catalog Based on the Relational Model: The database should provide a dynamic online catalog that describes the structure of the database in terms of tables, columns, indexes, and other relevant information. This catalog should be accessible to users and applications and should be based on the relational model.
5. Comprehensive Data Sublanguage Rule: The database should support a comprehensive data sublanguage that allows users to define, manipulate, and retrieve data in a variety of ways. This sublanguage should be able to express complex queries, data definitions, and data modifications.
6. View Updating Rule: The database should support the updating of views, which are virtual tables that are defined in terms of other tables. This allows users to modify data in a flexible and intuitive way, without having to worry about the underlying structure of the database.
7. High-Level Insert, Update, and Delete Rule: The database should support high-level insert, update, and delete operations that allow users to modify multiple rows or tables at once. This simplifies data management and improves performance by reducing the number of database interactions required.
8. Physical Data Independence: The database should be able to store and retrieve data without being affected by changes to the physical storage or indexing structure of the database. This allows the database to evolve over time without requiring significant changes to the application layer.
9. Logical Data Independence: The database should be able to store and retrieve data without being affected by changes to the logical structure of the database. This means that the database schema can be modified without requiring changes to the application layer.
10. Integrity Independence: The database should be able to enforce integrity constraints such as primary keys, foreign keys, and other business rules without being affected by changes to the application layer. This ensures that data is consistent and accurate at all times.
11. Distribution Independence: The database should be able to distribute data across multiple locations without being affected by changes to the application layer. This allows the database to scale horizontally and geographically without requiring changes to the application layer.
12. Non-Subversion Rule: The database should not be susceptible
to subversion by unauthorized users or applications. This means that the database should enforce access controls, encryption, and other security measures to protect against unauthorized access or modification of data.
Traditional tabular systems based on Codd rules were relevant, but with the rise of the internet and e-commerce, there was a huge increase in the volume and variety of data being generated. To handle this data, new NoSQL databases were developed, which are more flexible and scalable, especially for unstructured data. In building a universally accepted data warehouse, it’s important to consider the strengths and weaknesses of both traditional and NoSQL databases and follow best practices, such as data quality, data modeling, data governance, and security measures. In the upcoming section of this chapter, we will explore this transition in a step-by-step manner while giving special attention to the areas that remain relevant for creating a strong and widely accepted modern data warehouse.