Advanced Data Cleansing in SSIS
What’s In This Topic?
- Using the Derived Column Transformation for advanced data cleansing
- Applying the Fuzzy Lookup and Fuzzy Grouping transformations and understanding how they work
- Introducing Data Quality Services
- Introducing Master Data Services
In this Topic, you will learn the ins and outs of data cleansing in SSIS, from the basics to the advanced. In a broad sense, one of SSIS’s main purposes is to cleanse data — that is, transform data from a source to a destination and perform operations on it along the way. In that sense, someone could correctly say that every transformation in SSIS is about data cleansing. For example, consider the following transformations:
- The Data Conversion adjusts data types.
- The Sort removes duplicate data.
- The Merge Join correlates data from two sources.
- The Derived Column applies expression logic to data.
- The Data Mining predicts values and exceptions.
- The Script applies .NET logic to data.
- The Term Extraction and Term Lookup perform text mining.
In a stricter sense, data cleansing is about identifying incomplete, incorrect, or irrelevant data and then updating, modifying, or removing the “dirty” data. From this perspective, SSIS has four primary data cleansing transformations, which are reviewed in this Topic:
- Derived Column Transformation: This transformation can perform advanced expression-based data cleansing. If you have just basic data cleansing needs, like blanks or nulls or simple text parsing, this is the right place to start. The next section will walk through some examples.
- Fuzzy Lookup Transformation: Capable of joining to external data based on data similarity, the Fuzzy Lookup Transformation is a core data cleansing tool in SSIS. This transformation is perfect if you have dirty data input that you want to associate to data in a table in your database based on similar values. Later in the Topic, you’ll take a look at the details of the Fuzzy Lookup Transformation and what happens behind the scenes.
- Fuzzy Grouping Transformation: The main purpose is de-duplication of similar data. The Fuzzy Grouping Transformation is ideal if you have data from a single source and you know you have duplicates that you need to find.
- DQS Cleansing: The Data Quality Services Cleansing Transformation leverages the DQS engine to perform predefined data quality rules and mapping. If you have any advanced cleansing where you would like to apply rules and manage cleansing logic, the DQS Transformation using the DQS engine is the right choice for you.
In addition to these data cleansing transformations, SSIS also has a Data Profiling Task that can help you identify any issues within your dirty data as you plan its necessary data cleansing. See SSIS Task Topic for an overview of the Data Profiling Task and the topic of Loading the Data Warehouse for a more detailed review of its functionality.
This Topic will also explore Master Data Services as a way of standardizing reference data. MDS give users the familiar interface of Excel to manage and correct data to truly have one version of the truth.