What is Data Warehousing?

A data warehouse is a collection of non-volatile, subject-oriented, and time-variant data. Data analysts may use this information to make better decisions for the company. Every day, the operational database undergoes several modifications at the expense of the transactions. This blog will teach you the fundamentals of data warehousing so that you can see why data warehousing has become so important for online processing and data analysis.

Rating: 5
  
 
3710

Data Warehouse is an integrated, subject-oriented, non-volatile, and time-variant data collection. This data assists the data analysts in taking knowledgeable decisions in the organization. The functional database experiences frequent changes every single day at the expense of the transactions that occur. Data Warehouse is the database that stores the data aimed at satisfying the decision-making requests. It is a set of decision support technologies intended to allow the knowledge worker for making supreme and bigger decisions. In the what is DataWareHouse Tutorial, you will learn data warehouse basics to understand why data warehouses become an essential platform for online analytical processing and data analysis.

If you would like to become a Data Warehouse certified professional, “AWS Data Warehousing Training” Course. This course will help you to achieve excellence in this domain.

What is Data Warehouse?

Data Warehouse is a relational database that we design for analysis and querying instead of transaction processing. It comprises historical data that we derive from transaction data from single or multiple sources. It offers enterprise-wide, historical, and integrated data and emphasizes, giving support for decision-makers for data modeling and analysis. 

Data Warehouse is a set of data to the entire organization, in addition to a specific group of users. We will not use it for regular operations and transaction processing, but we will use it for decision-making.

The following diagram illustrates what happens in the data:

1. Time-Variant

We store historical information in the data warehouse. For instance, we can get files from 6months, 12months, and previous data from the data warehouse. These modifications with the transaction system, where frequently we store only the recent file. 

2. Non-Volatile

The data warehouse is an individual data storage that transforms from the source operational RDBMS. The operational renewals of the data will not happen in the data warehouse, i.e., we will not perform insert, delete, and update operations. Generally, it needs two processes in data accessing: early data loading and access to the data. So, the data warehouse does not require recovery, concurrency, and transaction processing capabilities, which enables significant speedup of data revival. 

Objectives of Data Warehouse

  • Data Warehouse is useful for data reporting and analysis.
  • Data Warehouse maintains the historical information of the organization.
  • Data Warehouse is the foundation for decision-making.

MindMajix Youtube Channel

Need for the Data Warehouse

Data Warehouse is vital because of the following reasons:

1. Business Users: Business users need a data warehouse for seeing the outlined past data. As the people are non-technical, we can present the data to them in an elementary form. 

2. Strategic Decision-Making: Some strategies depend on the data present in the data warehouse. Therefore, a data warehouse plays an important role in making strategic decisions.

3. Storing Historical Data: We require a data warehouse for storing time variable data from the past. We use that input for several purposes.

4. High Response Time: Data Warehouse must be ready for unpredicted data loads and query types that require a valid degree of quick response time and flexibility. 

The following diagrams illustrate the implementation of the data warehouse

Data Warehouse Terminology

The view on an operational data warehouse is called Virtual Warehouse. We can build the Virtual Warehouse quickly.

1. Metadata

We define Metadata as data about data. The data which we use for representing other data is called Metadata. 

2. Metadata Repository

Metadata Repository is a basic part of the data warehouse system. It has the following Metadata:

  • Business Metadata: It includes data ownership information, modifying policies, and business definition.
  • Operational Metadata: It contains data currency and data lineage. Data currency relates to the data which is active, purged, and archived. 
  • Algorithms for Summarization: It contains dimension algorithms, data on aggregation, granularity, etc.
Related Article: Types of facts in Data Warehouse

Data Warehouse Architecture

The Online Analytical Processing server is on the basis of the multidimensional data model. It enables analysts, managers, to achieve an understanding of the data through consistent, interactive, and fast access to the information.

1. Bottom Tier:

The Data warehouse database server is the bottom tier of the Data warehouse architecture. To inject the data into the bottom tier, we utilise back-end utilities and tools. Back-end utilities and tools carry out clean, extract, refresh, and load functions.

2. Middle Tier:

In this tier, we will implement the OLAP server in the following ways:

  • By using ROLAP: It is an extended relational database management system. ROLAP plots the operations on the multidimensional data to the standard relational operations.
  • By using the Multidimensional OLAP(MOLAP) model: This top-tier is the front-end client layer. 

3. Top-Tier:

This layer plots the reporting tools, query tools, data mining tools, and analysis tools.

What is OLAP?

The Online Analytical Processing server is on the basis of the multidimensional data model. It enables analysts, managers, to achieve an understanding of the data through consistent, interactive, and fast access to the information.

Kinds of OLAP Servers

Following are the four kinds of OLAP Servers:

1. Relational OLAP(ROLAP)

ROLAP servers exist between the client front-end and relational back-end tools. For storing and managing the warehouse data, ROLAP uses extend-relational or relational DBMS. ROLAP comprises the following:

  • Optimisation for every DBMS back end.
  • Supplementary services and tools
  • Aggregation Navigation Logic Implementation.

Relational OLAP Architecture

The ROLAP architecture contains the following components:

  • Database server
  • Front-end tool
  • ROLAP server

Advantages Of ROLAP:

  • We can easily use the ROLAP servers with the existing RDBMS.
  • We can store the data efficiently, as we can store the Zero facts.
  • ROLAP tools will not use pre-computed data cubes.
  • The DSS Server of the micro-strategy follows the ROLAP approach.

2. Multidimensional OLAP

Multidimensional OLAP  uses array-based multidimensional storage engines for the multidimensional views of the data. Through multidimensional data stores, the storage usage is low when the dataset is sparse. So, various OLAP servers utilise two types of data storage representation for handling sparse and dense datasets.

MOLAP Architecture

MOLAP comprises the below components:

  • Database Server
  • MOLAP Server
  • Front-end tool

Advantages

  • MOLAP enables the fastest indexing of the pre-calculated summarised data.
  • It allows the users linked to a network who require to analyse larger and less-defined data.
  • It is easy to use. So, MOLAP is useful for unsophisticated users.

MOLAP vs ROLAP

MOLAPROLAP
Information retrieval is quick Information retrieval is relatively low.
It uses a sparse array for storing data sets.It uses a relational table for storing data sets.
It holds an individual database for the data cubes.It does not require space apart from the space available in the data warehouse.
MOLAP is best for unsophisticated users.ROLAP is best for experienced users.
 

What is OLTP?

OLTP is an operational system that supports transactions-aligned applications in a three-tier architecture. It manages the regular transactions of an organisation. Generally, OLTP focuses on data-integrity maintenance and query processing in multi-access environments. OLTP full form is Online Transaction Processing.

  • OLTP Characteristics
  • By using OLTP, we can access the Indexed data easily.
  • OLTP contains fast response times.
  • OLTP has a huge number of users.
  • Through OLTP, end-users can access the databases directly.
  • OLTP supports complex data tables and models.

OLTP Architecture

Following are the Components of the OLTP Architecture:

  • Business Enterprise Strategy: Enterprise strategy relates to the issues which impact the complete organization. Generally, in OLTP, we develop the enterprise strategy at a high4 level inside the organization.
  • Business Processes: OLTP business processes are a group of tasks and activities which, after their completion, will achieve an organisational goal.
  • Data Analytics, Mining, and Decision Making: Data Mart is an access or a structured pattern that is particular to data warehouse environments. OLAP uses it for storing the processed data.
  • Customer, Products, and Orders: OLTP database holds information about orders, products, customers, employees, and suppliers.
  • ETL Processes: It distinguishes the data from different RDBMS source systems, after which it transforms the data and stacks the processed data into a data warehouse system.

OLTP Advantages

  • OLTP provides an exact forecast for expenses and revenue.
  • OLTP offers support for larger databases.
  • In OLTP, the data partition for data manipulation is simple.
  • OLTP is useful when we require only a small amount of records.
  • OLTP tasks include deletion, updating or insertion of data.

OLAP vs OLTP

The following are the difference between OLAP vs OLTP

OLAPOLTP
OLAP full form is an online analysis and data retrieving process.OLTP full form is an online transactional system.
OLAP uses a data warehouse.OLTP uses standard DBMS.
OLAP response time is second to minutes.OLTP response time is milliseconds.
OLAP is a customer-based process.OLTP is a market-based process.
It enables only read and rarely write operations.It allows read and writes operations.
We can create and maintain it easily.It allows the users to create a view through a spreadsheet.
It will enable hundreds of users.It will enable thousands of users.
In OLAP, query throughput is the performance measure.In OLTP, transaction throughput is the performance measure.

What is ETL?

ETL is a process of extracting the data from the systems and storing it in the data warehouse is generally called ETL, which refers to Extract, Load, and Transform. ETL process needs inputs from different stakeholders comprising testers, analysts, developers, and top executives. For maintaining their worth, we must upgrade the data warehouses according to the business requirements. ETL is a repeated(monthly, daily, weekly) method of the data warehouse system and it must be automated, well documented, and agile. ETL contains three phases:

     1. Extraction

     2. Transform

     3. Load

Phases of ETL

1. Extraction

Extraction is the process of extracting the data from one source system for additional use in the data warehouse environment. This is the first phase of the ETL process.

The extraction process is the most time-consuming task of the ETL.

2. Cleansing

The cleansing phase is important in the data warehouse technique because it has to enhance data quality. The main data cleansing features discovered in the ETL tools are homogenisation. They use particular dictionaries for rectifying the typing mistakes and for recognising the synonyms. 

3. Transformation

Transformation is the key to the reconciliation phase. It transforms the records from functional source format into a specific data warehouse format. If we deploy a three-layer architecture, this phase gives a reconciled data layer. 

4. Loading

Loading is the mechanism of writing the data into the intent database. In the loading step, we must ensure that we perform the data loading correctly. We can perform data loading in two ways:

 Refresh: We completely rewrite the data warehouse data. This indicates that we replace the older file. Generally, we use the refresh in association with the static extraction for populating the data warehouse originally.

 Update: Generally, we carry out the update without modifying or deleting the existing data. We use this method in association with the incremental extraction for updating the data warehouses constantly.

Kinds of Data Warehouses

Following are the different kinds of data warehouses:

1. Host-Based Data Warehouses

We can implement two kinds of host-based data warehouses, they are:

a. Host-Based Mainframe warehouses that exist on the large volume databases. Endorsed by the reliable and strong capacity structure like IBM System/390, Data general sequent systems and databases like Oracle, DB2, and Informix.

b.Host-Based LAN data warehouses, where we can handle the data delivery either centrally or through the workgroup environment. 

2. LAN Based Workgroup Data Warehouses

LAN based workgroup warehouse is an incorporated structure to maintain and build the data warehouse in the LAN environment. In this type of warehouse, we can extract the information from different sources and endorse multiple LAN supported warehouses, typically selected warehouse databases, including Oracle, DB2 family, Informix, and Sybase. LAN based workgroup warehouse is suitable for any business corporation because we develop it for the workgroup environment. The LAN based workgroup data warehouse needs minimal primary investment and technological training. 

3. Host-Based Single Stage(LAN) Data Warehouse

Through a LAN based data warehouse, we can handle the data delivery either centrally or through the workgroup environment; therefore, business groups can fulfil their data required without overloading the centralised IT resources.

4. Multi-Stage data warehouses

It applies to multiple stages in transforming the methods to analyse the data through the aggregations and, in other terms, the staging of data multiple times before loading operation in the data warehouse. This configuration is ideal for the environments where end-users with countless capacities need access to both commutative record and summarised information for the tiny strategic decisions and outlined an interchangeable record for longer-term strategic decisions.

5. Stationary Data Warehouses

In this kind of data warehouse, we cannot change the data from the sources. Alternatively, the customers provide the direct access to the data. For various organisations, volume issues, infrequent access, and company requirements order like approach. 

6. Virtual Data Warehouses

We create virtual data warehouses in the below phases:

  • Installing a group of data approach, process management facilities, and data dictionary. 
  • Training the end-clients.
  • Monitoring the utilisation of DW facilities.

According to the actual stage, we create the data warehouse physically for providing higher frequency results.

7. Distributed Data Warehouse

The distributed data warehouse indicates that we have two kinds of distributed data warehouses and their chances for the native enterprise warehouses we spread across the company.

Local Data Warehouses Characteristics

  • Activity exists at the local level.
  • Mass Operational processing.
  • Local sites are independent.
  • Every local data warehouse has its separate architecture and data contents.

What is Data Mart?

Data Mart is a subdivision of a managerial data store, typically based on a particular response or main data subject that we can share for providing the business requirements. Data Marts are the analytical record stores developed for focusing on specific business operations for a particular community within the organisation. 

Kinds of Data Marts

We have two approaches for designing the data marts. Following are the two approaches:

1. Dependent Data Marts

Dependent Data Marts are the logical subdivision of a larger data warehouse. Based on this technique, we can treat the data marts as a subdivision of the data warehouse. In this method, first, we will create a data warehouse from which we will create various data marts. Dependent data marts are dependent on the data warehouse and extract the important records from it. 

2. Independent Data Marts

In this approach, first, we create independent data marts, and then we design a data warehouse by using the independent data marts. In this approach, we design the data marts independently, so we require integration of the data marts.

3. Hybrid Data Marts

It enables us to merge the input from the sources apart from a data warehouse. This can be useful for various circumstances.

Importance of Data Mart

  • To enforce access control strategies, data mart does data partition.
  • For segmenting the data into different hardware platforms, a data mart is useful.
  • Data Mart organises the data in a suitable form for a user access tool.
  • Data Mart accelerates the queries by minimising the data volume that we can scan.

Cost-effective Data Marting

We perform the following steps to make cost-effective data marting:

1. Function Splits Identification

In this step, we decide whether the organisation has general practical splits. We see for the departmental splits, and we decide if the method in which the departments utilise information tends to have in isolation from the organisation. 

2. User Access Tool Requirements Identification

  • We require the data marts for supporting user access tools that need internal data structures. The data in those structures are beyond the data warehouse control, yet we must populate and update it regularly. 
  • We have some tools which populate promptly from the source system, but we cannot prompt 

Schema

Schema is a rational description of the complete database. It contains a description and name of the records of all types of records comprising all the related aggregates and data-items. Similar to a database, the data warehouse also needs to preserve a schema. A database uses a relational model, while the data warehouse uses snowflake, star, and Fact Constellation Schema.

1. Star Schema

  • We represent every dimension in the star schema with only a one-dimension table.
  • This dimension table includes a group of attributes.

2. Snowflake Schema

  • In the Snowflake Schema, we normalise dimensional tables.
  • Normalisation divides the data into further tables.
  • Contrary to Star Schema, we normalise the dimensions tables in a  snowflake schema. 

3. Fact Constellation Schema

  • Fact Constellation contains various fact tables. It is called galaxy schema.
  • The sales fact table is similar to the star schema.

Schema Definition

Coste defines a Multidimensional schema through Data Mining Query Language(DMQL). We will use the two primitives, dimension definition, cube definition, to define data marts, data warehouses.

1. Cube definition Syntax:

Define cube 

<cube_name> [<dimension_list> } : <measure_list>

2. Dimension Definition Syntax:

Define definition 

<dimension_name> as ( <attribute_or_dimension_list> )

Star Schema vs Snowflake Schema

Star SchemaSnowflake Schema
In the star schema, the fact table will be at the centre, and we link it to the dimension tables.The snowflake schema is an expansion of the star schema where we link it to multiple dimensions.
In a star schema, tables are fully in a denormalised structure.In the Snowflake schema, tables are partially in denormalised structures.
Performance of SQL queries is goodThe performance of the SQL queries is not good.
Data redundancy is high and holds a large disk space.Data redundancy is low and holds less disk space.

Conclusion

A data warehouse is a group of software tools that assist in analysing large amounts of diverse data. The objective of the data warehouse is to obtain understandings beneficially from the data. This tutorial covers advanced topics like Data Marts, Schemas, OLAP, etc. I hope this tutorial provides you with the required knowledge about Data Warehouse. 

"What is DataWareHousing ", you can share them via comments below and help us make this post a good read for data warehousing.

Join our newsletter
inbox

Stay updated with our newsletter, packed with Tutorials, Interview Questions, How-to's, Tips & Tricks, Latest Trends & Updates, and more ➤ Straight to your inbox!

Course Schedule
NameDates
Snowflake TrainingMar 30 to Apr 14View Details
Snowflake TrainingApr 02 to Apr 17View Details
Snowflake TrainingApr 06 to Apr 21View Details
Snowflake TrainingApr 09 to Apr 24View Details
Last updated: 23 Feb 2024
About Author

Viswanath is a passionate content writer of Mindmajix. He has expertise in Trending Domains like Data Science, Artificial Intelligence, Machine Learning, Blockchain, etc. His articles help the learners to get insights about the Domain. You can reach him on Linkedin

read more