ETL Testing Tutorial

This tutorial gives you an overview and talks about the fundamentals of ETL Testing.

1. In computing, Extract, Transform and Load (ETL) refers to a process in database usage and especially in data warehousing that:

  • Extracts data from homogeneous or heterogeneous data sources
  • Transforms the data for storing it in proper format or structure for querying and analysis purpose
  • Loads it into the final target (database, more specifically, operational data store, data mart, or data warehouse)

2. Usually all the three phases execute in parallel since the data extraction takes time, so while the data is being pulled another transformation process executes, processing the already received data and prepares the data for loading and as soon as there is some data ready to be loaded into the target, the data loading kicks off without waiting for the completion of the previous phases.
3. ETL systems commonly integrate data from multiple applications(systems), typically developed and supported by different vendors or hosted on separate computer hardware. The disparate systems containing the original data are frequently managed and operated by different employees. For example a cost accounting system may combine data from payroll, sales and purchasing.

Challenges in ETL

1. ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems.The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that must be managed by transform rules specifications. This leads to an amendment of validation rules explicitly and implicitly implemented in the ETL process.
2. Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process to bring all the data together in a standard, homogeneous environment.
3. Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that must be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to multiple-day micro batch to integration with message queues or real-time change-data capture for continuous transformation and update.

Dealing with keys

1. Keys are some of the most important objects in all relational databases, as they tie everything together. A primary key is a column that identifies a given entity, where a foreign key is a column in another table that refers a primary key. These keys can also be made of several columns, in which case they are composite keys. In many cases the primary key is an auto generated integer that has no meaning for the business entity being represented, but solely exists for the purpose of the relational database – commonly referred to as a surrogate key.
2. As there is usually more than one data source being loaded into the warehouse, the keys are an important concern to be addressed.
Your customers might be represented in several data sources, and in one their (Social Security Number) might be the primary key, their phone number in another and a surrogate in the third. All of the customers information needs to be consolidated into one dimension table.


Get Updates on Tech posts, Interview & Certification questions and training schedules