Data acquisition – ETL Process in Data Warehousing

Data Acquisition:

In DWH terminology, Extraction, Transformation, Loading (ETL) is called as Data Acquisition

It is a process of extracting relevant business information from multiple operational source systems, transforming the data into a homogenous format and loading into the DWH/Data mart

Two types of ETL’s used in implementing data acquisition.

A ) 1)  Code  Based  ETL: An ETL application can be developed using programming languages such as PL/SQL

Ex: Teradata ETL Utilities

2) SAS based & SAS access

B) GUI Based ETL: An application can be designed with simple GUI interface, point & click techniques

Ex: Informatica, Datastage, Data junction, Abintio, Data services, Data manager, Oracle data integrator (OWB), SSIS (SQL Server Integration Service)

Data Extraction:  It is the process of reading the data from following operational source systems (External to the organization and Internal databases)

  • Relational Systems—-Oracle, SQL Server, DB2, Sybase, Informix, Redbrick
  • File sources: XML, COBOL, Flat files, weblog
  • Legacy systems: Mainframes, VSAM
  • Other sources: MS-Access, MS-Excel, MQ Series, TIBCO
  • ERP Sources: SAP R/3, Peoplesoft, JDEdwards, BAAN, RAMCO systems

Data Transformation:

Transforming the  data  from one  format to  client required  format/business  standard format. A staging is a temporary memory or buffer where the following data transformation activities take place

  • Data merging
  • Data Cleansing
  • Data scrubbing
  • Data aggregation

Data Merging:

It’s a process of integrating the data from multiple data sources into a single operational data set Data Sources can be homogenous/heterogeneous

Two types of merging:

A) Horizontal Merging: Process of joining the records horizontally when the 2 sources are having different data definitions.

Ex: Source1—Emp Table

columns –Empno, Ename, Sal, Deptno

Source2—-Dept Table

columns in the Dept Table– Dept no, Dname, Loc

B) Vertical Merging: Process of joining the records vertically when the 2 sources are having similar data structures

Data cleansing:

Data cleansing is the process of ensuring that a set of data is correct and accurate. During this process, records are checked for accuracy and consistency, and they are either corrected or deleted as necessary. This can occur within a single set of records or between multiple sets of data that need to be merged or that will work together

Data scrubbing:

Data scrubbing is the process of detecting and removing or correcting any information in a database that has some sort of error. This error can be because the data is wrong, incomplete, formatted incorrectly, or is a duplicate  copy of another entry. Many data-intensive fields of business such as banking, insurance, retail, transportation, and telecommunications may use these sophisticated software applications to clean up a database’s information.

Data aggregation:

Is a process of calculating the summaries from the detailed data

  • Data scrubbing is applied for each record
  • Data aggregation is applied for a group of records

Data Loading:

Dumping/inserting of the processed data into the target system (DWH)

2 types of data loads:

A) Initial/Full load: is a process of loading the data into an empty target table

B) Incremental/Delta load: is a process of loading only new records/changed records, which takes place after initial load

  • First we load dimensions
  • Next we load facts


Get Updates on Tech posts, Interview & Certification questions and training schedules