Typical Mainframe ETL with Data Scrubbing in SSIS
Recommended by 0 users
With the basic ETL out of the way, you will now jump into a more complex SSIS package that attempts to scrub data. You can start this scenario by downloading the 010305c.dat public data file from the website for this Tutorial into a directory called C:\ProSSIS\Data\Ch08. This file contains public record data from the Department of State of Florida.
In this scenario, you run a credit card company that’s interested in marketing to newly formed domestic corporations in Florida. You want to prepare a data extract each day for the marketing department to perform a mail merge and perform a bulk mailing. Yes, your company is an old-fashioned, snail-mail spammer. Luckily, the Florida Department of State has an interesting extract you can use to empower your marketing department.
The business goals of this package are as follows:
- Create a package that finds the files in the C:\ProSSIS\Data\Ch08 directory and loads the file into your relational database.
- Archive the file after you load it to prevent it from being loaded twice.
- The package must self-heal. If a column is missing data, the data should be added automatically.
- If the package encounters an error in its attempt to self-heal, output the row to an error queue.
- You must audit the fact that you loaded the file and how many rows you loaded.
Start a new package in your existing ProSSISCh08 SSDT project from the first tutorial. Right-click the SSIS Packages folder in the Solution Explorer and select New SSIS Package. This will create Package1.dtsx, or some numeric variation on that name. Rename the file CorporationLoad.dtsx. Doubleclick the package to open it if it is not already open.
Since the OLEDB_AdventureWorks connection you created earlier was a project-level connection, it should automatically appear in the Connection Manager window of the package. You now have two packages using the same project-level connection. If you were to change the database or server name in this connection, it would change for both packages.
Next, create a new Flat File Connection Manager just as you did in the last tutorial. When the configuration screen opens, call the connection FF_Corporation_DAT in the General tab.
Note Using naming conventions like this are a best practice in SSIS. The name tells you the type of file and the type of connection.
Enter any description you like. For this Connection Manager, you’re going to configure the file slightly differently. Click Browse and point to the C:\ProSSIS\Data\Ch08\010305c.dat file (keep in mind that the default file filter is *.txt so you will have to change the filter to All Files in order to see the file). You should also change the Text Qualifier option to a single doublequote (“). Check the “Column names in the first data row” option. The final configuration should resemble shown in below screen shot. Go to the Columns tab to confirm that the column delimiter is Comma Delimited.
Next, go to the Advanced tab. By default, each of the columns is set to a 50- character [DT_STR] column. However, this will cause issues with this file, because several columns contain more than 100 characters of data, which would result in a truncation error. Therefore, change the AddressLine1 and AddressLine2 columns to String [DT_STR], which is 150 characters wide, as shown in below screen shot. After you’ve properly set these two columns, click OK to save the Connection Manager.
Creating the Data Flow
With the mundane work of creating the connections now out of the way, you can create the transformations. As you did in the last package, you must first create a Data Flow Task by dragging it from the Toolbox. Name this task Load Corporate Data. Double-click the task to go to the Data Flow tab.
Drag and drop a Flat File Source onto the design pane and rename it Uncleansed Corporate Data. (You could also use the Source Assistant as shown previously; you are being shown a different method here intentionally.) Double-click the source and select FF_Corporation_DAT as the Connection Manager you’ll be using. Click OK to close the screen. You’ll add the destination and transformation in a moment after the scenario is expanded a bit.
Handling Dirty Data
Before you go deeper into this scenario, take a moment to look more closely at this data. As you were creating the connection, if you are a very observant person (I did not notice this until it was too late), you may have noticed that some of the important data that you’ll need is missing. For example, the city and state are missing from some of the records.
NOTE The Data Profiling Task can also help with this situation; it is covered in Loading a Data Warehouse Topic.
To fix this for the marketing department, you’ll use some of the transformations that were discussed in the last few Post to send the good records down one path and the bad records down a different path. You will then attempt to cleanse the bad records and then send those back through the main path. There may be some records you can’t cleanse (such as corporations with foreign postal codes), which you’ll have to write to an error log and deal with another time.
First, standardize the postal code to a five-digit format. Currently, some have five digits and some have the full nine-digit zip code with a dash (five digits, a dash, and four more digits). Some are nine-digit zip codes without the dash. To standardize the zip code, you use the Derived Column Transformation. Drag it from the Toolbox and rename it Standardize Zip Code.
Connect the source to the transformation and double-click the Standardize Zip Code Transformation to configure it. Expand the Columns tree in the upper-left corner, find [ZipCode], and drag it onto the Expression column in the grid below. This will prefill some of the information for you in the derived column’s grid area. You now need to create an expression that will take the various zip code formats in the [ZipCode] output column and output only the first five characters. An easy way to do this is with the SUBSTRING function. The SUBSTRING function code would look like this:
This code should be typed into the Expression column in the grid. Next, specify that the derived column will replace the existing ZipCode output by selecting that option from the Derived Column dropdown box. Figure 8-8 shows the completed options. When you are done with the transformation, click OK.
The Conditional Split Transformation
Now that you have standardized the data slightly, drag and drop the Conditional Split Transformation onto the design pane and connect the blue arrow from the Derived Column Transformation called Standardize Zip Code to the Conditional Split. Rename the Conditional Split Transformation Find Bad Records. The Conditional Split Transformation enables you to push bad records into a data-cleansing process.
To cleanse the data that lacks city or state, you’ll write a condition specifying that any row missing a city or state should be moved to a cleansing path in the Data Flow. Double-click the Conditional Split Transformation after you have connected it from the Derived Column Transformation in order to edit it.
Create a condition called Missing State or City by typing its name in the Output Name column. You now need to write an expression that looks for empty records. One way to do this is to use the LTRIM function. The two vertical bars (||) in the following code are the same as a logical OR in your code. Two & operators would represent a logical AND condition. The following code will check for a blank Column 6 or Column 7:
LTRIM([State]) == “” || LTRIM([City]) == “”
The last thing you need to do is give a name to the default output if the coded condition is not met. Call that output Good Data, as shown in below screen shot. The default output contains the data that did not meet your conditions. Click OK to close the editor.
NOTE If you have multiple cases, always place the conditions that you think will capture most of the records at the top of the list, because at runtime the list is evaluated from top to bottom, and you don’t want to evaluate records more times than needed.
The Lookup Transformation
Next, drag and drop the Lookup Transformation onto the design pane. When you connect to it from the Conditional Split Transformation, you’ll see the Input Output Selection dialog (shown in below screenshot). Select Missing State or City and click OK. This will send any bad records to the Lookup Transformation from the Conditional Split. Rename the Lookup Transformation Fix Bad Records.
The Lookup Transformation enables you to map a city and state to the rows that are missing that information by looking the record up in the ZipCode table you loaded earlier. Open the transformation editor for the Lookup Transformation. Then, in the General tab, ensure that the Full Cache property is set and that you have the OLE DB Connection Manager property set for the Connection Type. Change the No Matching Entries dropdown box to “Redirect rows to no match output,” as shown in below screen shot
In the Connection tab, select OLEDB_AdventureWorks as the Connection Manager that contains your Lookup table. Select ZipCode from the “Use a Table or View” dropdown menu. For simplicity, you are just selecting the table, but the best practice is always to type in a SQL command and select only the needed columns.
Next, go to the Columns tab and drag ZipCode from the left Available Input Columns to the right ZipCode column in the Available Lookup Columns table. This will create an arrow between the two tables, as shown in Figure 8-12. Then, check the StateAbbr and City columns that you wish to output. This will transfer their information to the bottom grid. Change the Add as New Column option to Replace for the given column name as well. Specify that you want these columns to replace the existing City and State. Refer to Figure 8-12 to see the final configuration. Click OK to exit the transformation editor. There are many more options available here, but you should stick with the basics for the time being. With the configuration you just did, the potentially blank or bad city and state columns will be populated from the ZipCode table.
The Union All Transformation
Now that your dirty data is cleansed, send the sanitized data back into the main data path by using a Union All Transformation. Drag and drop the Union All Transformation onto the design pane and connect the Fix Bad Records Lookup Transformation and the Find Bad Records Conditional Split Transformation to the Union All Transformation. When you drag the blue line from the Lookup Transformation, you are prompted to define which output you want to send to the Union All Transformation. Select the Lookup Match Output. There is nothing more to configure with the Union All Transformation.
The last step in the Data Flow is to send the data to an OLE DB Destination. Drag the OLE DB Destination to the design pane and rename it Mail Merge Table. Connect the Union All Transformation to the destination. Doubleclick the destination and select OLEDB_AdventureWorks from the Connection Manager dropdown. For the Use a Table or View option, select the New button next to the dropdown. The default DDL for creating the table uses the destination’s name (AdventureWorks), and the data types may not be exactly what you want, as shown here:
CREATE TABLE [Mail Merge Table] (
Go ahead and change the schema to something a bit more useful. Change the table name and each column to something more meaningful, as shown in the following example (Ch08SQL.txt). These changes may cause the destination to show warnings about truncation after you click OK. If so, these warnings can be ignored for the purpose of this example.
NOTE Warnings in a package do not indicate the package will fail. In this case the zip code is trimmed to 5 characters so you know the data is not going to be truncated as indicated by the warning. It is acceptable to run packages with warning, especially in cases where unnecessary tasks would need to be added to remove the warning.
CREATE TABLE MarketingCorporation(
FilingDate varchar(10) NULL
You may have to manually map some of the columns this time because the column names are different. Go to the Mappings tab and map each column to its new name. Click OK to close the editor.
Handling More Bad Data
The unpolished package is essentially complete, but it has one fatal flaw that you’re about to discover. Execute the package. As shown in below screen shot, when you do this, you can see, for example, that in the 010305c.dat file, four records were sent to be cleansed by the Lookup Transformation. Of those, only two had the potential to be cleansed. The other two records were for companies outside the country, so they could not be located in the Lookup Transformation that contained only Florida zip codes. These two records were essentially lost because you specified in the Lookup Transformation to redirect the rows without a match to a “no match output” , but you have not set up a destination for this output. Recall that the business requirement was to send marketing a list of domestic addresses for their mail merge product. They didn’t care about the international addresses because they didn’t have a business presence in other countries.
In this example, you want to send those two rows to an error queue for further investigation by a business analyst and to be cleaned manually. To do this properly, you need to audit each record that fails the match and create an ErrorQueue table on the SQL Server. Drag over the Audit Transformation found under the Other Transformations section of the SSIS Toolbox. Rename the Audit Transformation Add Auditing Info and connect the remaining blue arrow from the Fix Bad Records Transformation to the Audit Transformation.
With the Lookup problems now being handled, double-click the Audit Transformation to configure it. Add two additional columns to the output. Select Task Name and Package Name from the dropdown boxes in the Audit Type column. Remove the spaces in each default output column name, as shown in below screenshot, to make it easier to query later. You should output this auditing information because you may have multiple packages and tasks loading data into the corporation table, and you’ll want to track from which package the error actually originated. Click OK when you are done.
The last thing you need to do to polish up the package is send the bad rows to the SQL Server ErrorQueue table. Drag another OLE DB Destination over to the design pane and connect the Audit Transformation to it. Rename the destination Error Queue. Double-click the destination and select OLEDB_AdventureWorks as the Connection Manager, and click New to add the ErrorQueue table. Name the table ErrorQueue and follow a schema similar to the one shown here (Ch08SQL.txt):
CREATE TABLE [ErrorQueue] (
NOTE In error queue tables like the one just illustrated, be very generous when defining the schema. In other words, you don’t want to create another transformation error trying to write into the error queue table. Instead, consider defining everything as a varchar column, providing more space than actually needed.
You may have to map some of the columns this time because of the column names being different. Go to the Mappings tab and map each column to its new name. Click OK to close the editor.
You are now ready to re-execute the package. This time, my data file contained four records that need to be fixed, and two of those were sent to the error queue. The final package would look something like the one shown in below screen shot when executed.
Looping and the Dynamic Tasks
You’ve gone a long way in this Creating an End-to-End Package Topic toward creating a self-healing package, but it’s not very reusable yet. Your next task in the business requirements is to configure the package so that it reads a directory for any .DAT file and performs the preceding tasks on that collection of files.
Your first task is to loop through any set of .DAT files in the C:\ProSSIS\Data\Ch08 folder and load them into your database just as you did with the single file. To meet this business requirement, you need to use the Foreach Loop Container. Go to the Control Flow tab in the same package that you’ve been working in, and drag the container onto the design pane. Then, drag the “Load Corporate Data” Data Flow Task onto the container. Rename the container Loop Through Files.
Double-click the container to configure it. Go to the Collection tab and select Foreach File Enumerator from the Enumerator dropdown box. Next, specify that the folder will be C:\ProSSIS\Data\Ch08 and that the files will have the *.DAT extension, as shown in below screen shoot.
You need to now map the variables to the results of the Foreach File enumeration. Go to the Variable Mappings tab inside the Foreach Loop Editor and select <New Variable…> from the Variable column dropdown box. This will open the Add Variable dialog. For the container, you’ll remain at the package level. You could assign the scope of the variable to the container, but keep things simple for this example. Name the variable strExtractFileName and click OK, leaving the rest of the options at their default settings.
You will then see the variable in the Variable column and the number 0 in the Index option. Because the Foreach File Enumerator option has only one column, you’ll see only an index of 0 for this column. If you used a different enumerator option, you could enter a number for each column that was returned from the enumerator. Click OK to leave the Foreach Loop editor.
Making the Package Dynamic
Now that the loop is created, you need to set the filename in the Corporation Extract Connection Manager to be equal to the filename that the enumerator retrieves dynamically. To meet this business requirement, right-click the Corporation Extract Connection Manager and select Properties (note that you’re clicking Properties, not Edit as you’ve done previously). In the Properties pane for this Connection Manager, click the ellipsis button next to the Expressions option. By clicking the ellipsis button, you open the Property Expressions Editor. Select ConnectionString from the Property dropdown box and then click the ellipsis under the Expression column next to the connection string property you just selected, this will open the Expression Builder window, as shown in below screen shot. You can either type @[User::strExtractFileName] in the Expression column or click the ellipsis button, and then drag and drop the variable into the expression window. By entering @[User::strExtractFileName], you are setting the filename in the Connection Manager to be equal to the current value of the strExtractFileName variable that you set in the Foreach Loop earlier. Click OK to exit the open windows. Note in the Property window that there is a single expression by clicking the plus sign next to Expressions.
As it stands right now, each time the loop finds a .DAT file in the C:\ProSSIS\Data\Ch08 directory, it will set the strExtractFileName variable to that path and filename. Then, the Connection Manager will use that variable as its filename and run the Data Flow Task one time for each file it finds. You now have a reusable package that can be run against any file in the format you designated earlier.
The only missing technical solution to complete is the archiving of the files after you load them. Before you begin solving that problem, manually create an archive directory under C:\ProSSIS\Data\Ch08 called C:\ProSSIS\Data\Ch08\Archive. Right-click in the Connection Manager window and select New File Connection. Select Existing Folder for the Usage Type, and point the file to the C:\ProSSIS\Data\Ch08\Archive directory. Click OK and rename the newly created Connection Manager Archive.
Next, drag a File System Task into the Loop Through Files Container and connect the container to the “Load Corporate Data” Data Flow Task with an On Success constraint (the green arrow should be attached to the File System Task). Rename that task Archive File.
Double-click the “Archive File” File System Task to open the editor (shown in below screen shot). Set the Operation dropdown box to Move file. Next, change the Destination Connection from a variable to the archive Connection Manager that you just created. Also, select True for the OverwriteDestination option, which overwrites a file if it already exists in the archive folder. The SourceConnection dropdown box should be set to the FF_Corporation_DAT Connection Manager that you created earlier in Creating an End-to-End Package Topic. You have now configured the task to move the file currently in the Foreach Loop to the directory in the Archive File Connection Manager. Click OK to close the editor.
Your complete package should now be ready to execute. Save the package before you execute it. If you successfully implemented the solution, your Control Flow should look something like shown in below screen shot when executed. When you execute the package, you’ll see the Control Flow items flash green once for each .DAT file in the directory. To run the package again, you must copy the files back into the working directory from the archive folder.