Data Quality Services (DQS) Cleansing Transformation
Introduced in SQL Server 2012 was a component called Data Quality Services (DQS). This is not a feature of Integration Services, but it is very much connected to the data cleansing processes within SSIS. In fact, there is a data transformation called the DQS Cleansing Task. This task connects to DQS, enabling you to connect incoming Data Flow data and perform data cleansing operations.
Because this Tutorial focuses on SSIS, a full DQS tutorial is not included; however, this section provides a brief overview of DQS and highlights a few data quality examples. To gain more understanding, you can also watch the DQS one day course by the Microsoft DQS Team at http://technet.microsoft.com/en-us/sqlserver/hh780961.aspx.
Data Quality Services
The workflow to use DQS within SSIS requires a few preparatory steps. These need to be performed within the DQS client tool connected to a DQS service. The DQS client is available in the SQL Server 2014 Programs folder (from the Start button). There is a 32-bit version and a 64-bit version. In order to use Data Quality Services, you must have installed it during the SQL Server setup and run the configuration executable, called DQSInstaller.exe. The full setup instructions can be found on MSDN, http://msdn.microsoft.com/enus/ library/gg492277(v=SQL.120).aspx.
Once you pull up the client and connect to the server, you will be in the DQS main screen, shown in Below screen shot.
You can perform three primary tasks with DQS:
- Knowledge Base Management is how you define the data cleansing rules and policies.
- Data Quality Projects are for applying the data quality definitions (from the knowledge base) against real data. We will not be considering projects in this Advanced Data Cleansing in SSIS Topic ; instead, you will see how to use the SSIS DQS Cleansing Task to apply the definitions.
- Administration is about configuring and monitoring the server and external connections.
To begin the process of cleansing data with DQS, you need to perform two primary steps within the Knowledge Base Management pane:
- Create a DQS Knowledge Base (DQS KB). A DQS KB is a grouping of related data quality definitions and rules (called domains) that are defined up front. These definitions and rules are applied against data with various outcomes (such as corrections, exceptions, etc.). For example, a DQS KB could be a set of domains that relate to address cleansing, or a grouping of valid purchase order code rules and code relationships within your company.
- Define DQS domains and composite domains. A DQS domain is a targeted definition of cleansing and validation properties for a given data point. For example, a domain could be “Country” and contain the logic on how to process values that relate to countries around the world. The value mapping and rules define what names are valid and how abbreviations map to which countries.
When you select the Open knowledge base option, you are presented with a list of KBs that you have worked with. The built-in KB included with DQS, DQS Data, contains several predefined domains and rules, and connections to external data. Below screen shot shows the right-click context menu, which enables you to open the KB and see the definition details.
Knowledge bases are about domains, which are the building blocks of DQS. A domain defines what the DQS engine should do with data it receives: Is it valid? Does it need to be corrected? Should it look at external services to cleanse the data? For example, Below screen shot highlights the Domain Values tab of the State domain. It shows how values are cleansed and which values should be grouped. In this example, it lists state abbreviations and names and the Correct To value.
In the next example, a composite domain is selected. A composite domain is just what it sounds like: a group of domains. In this case, the domains involve companies, based on the business name, city, country, and state. Below screen shot shows the partial configuration of a composite domain. In this case, there is an external web service reference called “D&B – D&B Company Cleanse & Match” through which data will be processed. There are many sources you could connect to, such as Melissa Data for address cleansing (www.melissadata.com) or a host of data sources from the Windows Azure Data Marketplace (https://datamarket.azure.com). There are a variety of premium data sources available here. Some can be free on a trial basis, while others have a paid subscription–based fee.
Domains can also contain rules that validate the data as it is processed through DQS. In the example in Below screen shot, the Zip (Address Check) field is validated so that the length is equal to 6. You can also see some of the other options in the list. Multiple rules can be applied with logical AND or OR conditions. If a data element fails the rules, it is marked as bad data during the processing.
Other common rules include range rules to check that numeric data values fall within a given range and value lists to make sure that the data coming in meets specific requirements.
As shown in these few examples, DQS can serve as a powerful data quality engine for your organization. In addition to the common data validation and cleansing operations, you can apply a host of custom rules, matching criteria, and external links.
The next step, after your knowledge base is defined, is to process your data through SSIS.
DQS Cleansing Transformation
SSIS can connect to DQS using the DQS Cleansing Transformation. This is one of two ways that data can be applied against the knowledge bases within DQS. (A data quality project is the primary way to process data if you are not using SSIS for ETL. This is found in the DQS client tool, but it’s not described in this Tutorial, which focuses on SSIS.)
In order to use the DQS Cleansing Transformation, you will first connect to a source within your Data Flow that contains the data you plan to associate with the knowledge base.
The next step is to connect the source (or other transformation) to a DQS Cleansing Transformation and edit the task.
Figure 10-26 shows the Connection Manager tab of the DQS Cleansing Transformation. You need to connect to the DQS server and choose the knowledge base that you will be using for your source input data within SSIS.
In this example, the source data contains states/provinces and countries, so you will use the built-in DQS Data KB to connect the states and countries. To see the list of domains, choose DQS Data from the Data Quality Knowledge Base dropdown, as shown in below screen shot.
The Mapping tab contains the list of input columns that can be used against the KB domains. In below figure, both the StateProvinceCode and the CountryRegionName columns are selected in the input column list and matched to the US – State (2-letter leading) and Country/Region domains in the Domain dropdown.
You are also able to redirect the errors to the error output for the rows that do not meet the domain criteria and rules, using the Configure Error Output dropdown at the bottom of the DQS editor.
Below screen shot shows the simple Data Flow with a couple of Multicast Transformations so that the data can be viewed (for demo purposes).
In addition to mapping the inputs to the DQS domain, the DQS Cleansing Transformation also provides additional data in the output of the transformation. Figure 10-30 shows a Data Viewer with the output rows and columns resulting from the DQS cleansing process.
In this example, note the highlighted row indicating where the country was corrected and standardized to the DQS domain definition. Besides the original and corrected value returned, you can also see a reason code, as well as a confidence level on the correction. These are similar to the Fuzzy Component outputs shown earlier, except you have much more control and flexibility in terms of how you define your data cleansing process within DQS and apply it in SSIS.
An alternate way to see the data flowing through the DQS transformation is to use a Data Tap. This is for when your package is deployed to an SSIS server catalog. In Administering SSIS Topic covers how to use a Data Tap in SSIS.