Data Sources (Data Format and Common Sources)
In a data warehouse, data source refers to any system or application that provides data to the data warehouse. A data source can be any type of system or application that generates data, such as a transactional system, a customer relationship management (CRM) application, or an enterprise resource planning (ERP) system.
The data from these sources is extracted and transformed before it is loaded into the data warehouse. This process involves cleaning, standardizing, and consolidating the data to ensure that it is accurate, consistent, and reliable. Once the data has been transformed, it is then loaded into the data warehouse for storage and analysis.
In some cases, data sources may be connected to the data warehouse using extract, transform, and load (ETL) processes, while in other cases, they may be connected using other data integration methods, such as data replication, data federation, or data virtualization.
Note Data sources are a critical component of a data warehouse, as they provide the data that is needed to support business intelligence and analytics. By consolidating data from multiple sources into a single location, a data warehouse enables organizations to gain insights into their business operations and make more-informed decisions.
There are various types and formats of data sources that can be used in a data warehouse. Here are some examples:
• Relational databases: A common data source for a data warehouse is a relational database, such as Oracle, Microsoft SQL Server, or MySQL. These databases store data in tables with defined schemas and can be queried using SQL.
• Flat files: Data can also be sourced from flat files, such as CSV files, Parquet, Excel, or any other formatted text files. These files typically have a delimited format with columns and rows.
• Cloud storage services: Cloud storage services, such as Amazon S3 or Azure Data Lake Storage, can also be used as a data source for a data warehouse. These services can store data in a structured or unstructured format and can be accessed through APIs.
• NoSQL databases: NoSQL databases, such as MongoDB or Cassandra, can be used as data sources for data warehouses. These databases are designed to handle large volumes of unstructured data and can be queried using NoSQL query languages.
• Real-time data sources: Real-time data sources, such as message queues or event streams, can be used to stream data into a data warehouse in real-time. This type of data source is often used for applications that require up-to-date data.
• APIs: APIs can also be used as a data source, providing access to data from third-party applications or web services.
Format of the data coming from multiple sources can also vary depending on the type of data. For example, data can be structured or unstructured, semi-structured, such as JSON or XML. The data format needs to be considered when designing the data warehouse schema and the ETL processes. It is important to ensure that the data is properly transformed and loaded into the data warehouse in a format that is usable for analysis.
Data can flow to the data warehouse through different systems, some of the most used of which include the following:
• Transactional databases: Transactional databases are typically the primary source of data for a data warehouse. These databases capture and store business data generated by various systems, such as sales, finance, and operations.
• ERP systems: Enterprise resource planning (ERP) systems are used by many organizations to manage their business processes. ERP systems can provide a wealth of data that can be used in a data warehouse, including information on customer orders, inventory, and financial transactions.
• CRM systems: Customer relationship management (CRM) systems provide data on customer interactions that can be used to support business analytics and decision-making.
• Legacy systems: Legacy systems are often used to store important historical data that needs to be incorporated into the data warehouse. This data may be stored in a variety of formats, including flat files or proprietary databases.
• Cloud-based systems: Cloud-based systems, such as software-as-a-service (SaaS) applications, are becoming increasingly popular as data sources for data warehouses. These systems can provide access to a variety of data, including customer behavior, website traffic, and sales data.
• Social media: Social media platforms are another source of data that can be used in a data warehouse. This data can be used to gain insights into customer behavior, sentiment analysis, and brand reputation.
One effective approach for documenting data-related artifacts, such as data sources and data flows, is using data dictionaries and data catalogs. These tools can capture relevant information about data elements, including their structure and meaning, as well as provide more comprehensive details about data sources, flows, lineage, and ownership. By leveraging these tools, implementation teams and data operations teams can gain a better understanding of this information, leading to improved data quality, consistency, and collaboration across various teams and departments within an organization.
Note When categorizing data into structured or unstructured sources, you’ll find that older systems like transactional, ERP, CRM, and legacy tend to have well-organized and -classified data compared to that sourced from cloud-based systems or social media. It’s not entirely accurate to say that all data from cloud platforms and website analytics activities are unstructured, but analyzing such data requires additional computing power to derive significant insights. With the adoption of cloud computing, organizations are increasingly storing unstructured data.