Data Preparation Tasks in SSIS
DATA PREPARATION TASKS
Before processing data from other systems, you sometimes have to first retrieve it or validate the content to determine your level of confidence in the data’s quality. SSIS provides a set of tasks that can be used to retrieve data files using the files and folders available in the file system, or it can reach out using FTP and web service protocols. The following sections explore these tasks in SSIS.
Data profiling is the process of examining data and collecting metadata about the quality of the data, about frequency of statistical patterns, interdependencies, uniqueness, and redundancy. This type of analytical activity is important for the overall quality and health of an operational data store (ODS) or data warehouse. In fact, you’ve most likely been doing this activity whether or not you actually have a defined tool to perform it. Now, rather than use a set of complicated queries or rely on a third-party product, you have a Data Profiling Task as part of the SSIS development environment.
The Data Profiling Task is located in the SSIS Toolbox, but you probably shouldn’t attempt to use the results to make an automated workflow decision in the SSIS package Control Flow. Rather, it is more of an ad hoc tool for placement in a design-time package that will be run manually outside of a scheduled process. In fact, the task doesn’t have built-in conditional workflow logic, but technically you can use XPath queries on the results. The profiler can only report on statistics in the data; you still need to make judgments about these statistics. For example, a column may contain an overwhelming amount of NULL values, but the profiler doesn’t know whether this reflects a valid business scenario.
You can view the structured output file that is produced by the Data Profiling Task in a special Data Profiler Viewer that provides drill-downs back to the detail level. To access this viewer, select SQL Server ⇒ Integration Services from the Start menu. Once the tool is loaded, use the Open button to browse to the output file that will be generated by the Data Profiling Task. Figure 3-7 shows an example of an analysis of the DimCustomer table in the AdventureWorksDW database. You can see here that the majority of the rows in the MiddleName column are null.
The task provides a set of defined profile request types that can be modified like the other tasks in specific properties. The following list describes the different request types and how you can use them to profile your data:
- Candidate Key Profile Request: The profile request will examine a column or set of columns to determine the likelihood of there being a unique candidate key for the data set. Use this to determine whether you have duplicate key values or whether it is possible to build a natural key with the data.
- Column Length Distribution Profile: This profile request enables you to analyze the statistical profile of all the data in a column, with the percentage of incidence for each length. You can use this to help you determine whether your data column length settings are set correctly or to look for bad data in attributes that are known to be one fixed size.
- Column Null Ratio Profile Request: This profile request looks at the ratio of NULL values in a column. Use this to determine whether you have a data quality problem in your source system for critical data elements.
- Column Pattern Profile Request: This profile request enables you to apply regular expressions to a string column to determine the pass/fail ratio across all the rows. Use this to evaluate business data using business formatting rules.
- Column Statistics Profile Request: This profile request can analyze all the rows and provide statistical information about the unique values across the entire source. This can help you find low incidence values that may indicate bad data. For example, a finding of only one color type in a set of 1 million rows may indicate that you have a bad color attribute value.
- Functional Dependency Profile Request: This is one of two profile requests that enable you to examine relationships between tables and columns to look for discrepancies within a known dependency. For example, you can use this request to find countries with incorrect currency codes.
- Value Inclusion Profile Request: This profile request tests to determine whether the values in one column are all included in a separate lookup or dimension table. Use this to test foreign key relationships.
There are two ways to activate these profiles. The first is to click the Quick Profile button on the Data Profiling Task Editor. This creates a set of profiles to run against the same table. You can also skip the quick profile option and create the profiles one by one. Either way you can navigate to the Profile Requests table to configure the request and add regular expressions or other parameter values to the task properties. Figure 3-8 shows the Data Profiling Task Editor with all the requests defined for the DimCustomer table.
For each profile request type, the lower section of the editor for the Request Properties will change to accept the configurable values. Note that the ConnectionManager property must be set to an ADO.NET-based Connection Manager, like the one here connected to AdventureWorksDW. Moreover, you must create this connection prior to attempting to configure this task, but this is a minor inconvenience for such a powerful and welcome addition to the SSIS toolset, which rivals more expensive ETL tools.
File System Task
The File System Task is a configurable GUI component that performs file operations available in the System.IO.File .NET class. If you are used to coding in VBScript, this is an out-of-the-box replacement for the VBScript utility classes that you used to write using the COM-based FileSystemObject. In either case, the File System Task can perform basic file operations such as the following:
- Copy Directory: Copies all files from one directory to another. You must provide the source and destination directories.
- Copy File: Copies a specific file. You must provide the source and destination filename.
- Create Directory: Creates a directory. You must provide the source directory name and indicate whether the task should fail if the destination directory already exists.
- Delete Directory: Deletes a directory. You must provide the source directory to delete.
- Delete Directory Content: Deletes all files in a source directory Delete File: Deletes a specifically provided source file
- Move Directory: Moves a provided source directory to a destination directory. You must indicate whether the task should fail if the destination directory already exists.
- Move File: Moves a specific provided source file to a destination. You must indicate whether the task should fail if the destination file already exists.
- Rename File: Moves a specific provided source file to a destination by changing the name. You must indicate whether the task should fail if the destination file already exists.
- Set Attributes: Sets Hidden, Read-Only, Archive, or System attributes on a provided source file.
One benefit that may not be apparent in these functional descriptions is that the creation of directory structures does not have to be made recursively. For example, you may create the path named c:\ssis\tasks\my file system task\ using the Create Directory form of the File System Task by simply providing the path. You don’t have to create each part of the directory separately, as you did in the DTS legacy product. This capability greatly reduces the typical file operation coding to a simple configuration task for directory operations. However, don’t assume that you can do the same with a file-level operation. If you attempt to rename a file from c:\ssis\ to c:\ssis\my archive\ and the folder \my archive\ doesn’t exist, you will get an error that the path is not found.
Another feature of this task that may not be apparent is that it is written for a single operation. This is by design. If you need to iterate over a series of files or directories, the File System Task can be simply placed within a Looping Container. By keeping the task granular and singularly focused, it is simplified and easily reused.
Most of the properties in this task are set in the General tab of the File System Task Editor (see below diagram). The contents of this tab may vary widely according to what you set in the Operation property. These options correspond to specific file operations that the task can perform. Once the option is set, you may be prompted to complete other properties not shown in this figure.
In this case, the Operation property is set to Move file, which should move a file from a working source path to an archive destination path. The IsDestinationPathVariable property enables you to specify whether the destination path will be set to an SSIS variable or use a Connection Manager. If this is set to true, the dynamic property DestinationVariable sets the destination path to a variable. If it is set to false, then the DestinationConnection option will be available so you can select the Connection Manager that contains your file or directory. These same properties exist for the source connection in the bottom of the tab. The OverwriteDestination option, set to false by default, specifies whether the task will overwrite the destination file or directory if it already exists. To get an idea of how you configure this task, see the example in the next section.
Archiving a File
Consider a typical use of the File System Task for an ETL process from a mainframe system. To automate a nightly data load, the process would look like this:
- A file or series of similar files would be generated from a mainframe or other source system and dumped to a network drive.
- An SSIS package would start on a schedule to poll a directory looking for files to process. If any files were found, they would be moved into a staging or working directory.
- The data would be extracted out of the file(s).
- The file(s) would then be archived to another directory.
In legacy DTS packages, each of these steps would require some coding in the ActiveX Script Task. You would have to write one task in VBScript to poll the directory to determine whether the file arrived. Another script would pick up the file and move it to another directory. The last script would archive the file. Even worse, the ActiveX scripts had to use the code-and-paste method of code reuse to perform the same function in other packages. Typically, this led to various states of unfinished code that was better in some packages than others and most certainly a nightmare to maintain.
In SSIS, the File System Task can simplify the creation of a package by performing these ETL file-based requirements. We’ll postpone the task of polling until later in this chapter when we get to the WMI Event Watcher Task. The iteration of files is also discussed in detail in Chapter 6. However, you can use what you know about the File System Task to move the file to an archive directory.
1. Create a new project named Tasks and a package in c:\ProSSIS\Tasks\FileSystemTask or download the complete code from www.wrox.com/go/prossis2014.
2. Create a subdirectory called c:\ProSSIS\Tasks\FileSystemTask\Archive, and create a dummy file called myfile.txt in c:\ProSSIS\Tasks\FileSystemTask.
3. Add a File System Task into the Control Flow.
4. In the new task, change the operation to Move File. Select <New Connection> from the SourceConnection dropdown.
5. When the File Connection Manager Editor opens, select Existing File and type C:\ProSSIS\Tasks\FileSystemTask\MyFile.txt for your file.
6. For the DestinationConnection property, select the <New Connection> option and choose Existing Folder when the File Connection Manager Editor reopens. This time type
C:\ProSSIS\Tasks\FileSystemTask\Archive for the path name.
7. Now run the SSIS package. You’ll see the file myfile.txt move into the archive directory.
8. If you want to rename the file as you move it to a date-based filename, you need to specify the full filename in the variable and use the Rename File option of the File System Task. That achieves the movement of the file and a new filename in one task. The filename can also be dynamically set using a variable as an expression.
For examples of how you can rename this file using a dynamically generated name as you archive, see Chapter 5.
The SSIS FTP Task enables the use of the File Transfer Protocol (FTP) in your package development tasks. This task now exposes more FTP command capability, enabling you to create or remove local and remote directories and files. Another change from the legacy DTS FTP Task is the capability to use FTP in passive mode. This solves the problem that DTS had in communicating with FTP servers when the firewalls filtered the incoming data port connection to the server.
The General tab in the FTP Task Editor is where you specify the FTP Connection Manager for the FTP site you wish to access. If you haven’t specified one, follow these steps:
1. Select <New Connection…> under the FTPConnection property. This will open the FTP Connection Manager, where you can configure the FTP connection. In below diagram, the Server Name property contains the FTP address for the FTP server. The Server Port property is set to 21, which is the default port for most FTP sites. You can change this if necessary. The other important option to note here is the “Use passive mode” checkbox.
2. Once you have the FTP connection configured, move to the File Transfer tab. The IsRemotePathVariable and IsLocalPathVariable properties allow the paths to be set to an optional variable. Using variables enables you to set these values dynamically at runtime. The RemotePath property sets the directory or files for the remote FTP system.
3. Once the FTPConnection property from the General tab has been selected, you can browse to the actual remote file system to select the remote path or file by clicking the ellipsis in the Remote Path property. You’ll see a dialog similar to the one shown in Figure 3-11 for browsing the FTP remote paths (and files if you choose Receive files for the Operation property).
The LocalPath property is the Connection Manager that contains a directory on the SSIS side that is going to receive or send the files via FTP. The OverwriteFileAtDest option specifies whether the file at the destination will be overwritten if a conflict exists. Like many FTP clients, you can specify transporting the files in ASCII format by setting the IsTransferAscii option to true. If you set this option to false, the files will be transported in a default binary format. The most important option, of course, is the Operation option, which specifies what type of action you want to perform. In the next section, you’ll set up an SSIS FTP Task to get a file from an FTP server like Microsoft.com.
Getting a File Using FTP
To build an SSIS package that can use FTP to retrieve a file from an FTP server, follow these steps:
1. Create a directory called c:\prossis\tasks\ftptask\ or copy the code from www.wrox.com/go/prossis2014.
2. Create a new project and package in this folder and add an FTP Task to the Control Flow work surface.
3. Double-click the FTP Task to open the editor, and set it up.
4. In the General tab, select <New connection…> for the FTPConnection dropdown. This will open the FTP Connection Editor. Set the Server Name option to ftp.microsoft.com and click Test Connection (refer to Figure 3-10).
5. Click OK to go back to the FTP Task Editor.
6. Go to the File Transfer tab and set it up to resemble Figure 3-12. Select Receive files for the operation.
7. For the RemotePath property, click the ellipsis to browse to the /bussys/readme.txt folder on the remote path.
8. For the Local Path option, set the IsLocalPathVariable property to true and select <New Variable …> to create a new variable named LocalPath that is set to the value of c:\Prossis\Tasks\ftptask\.
9. For the OverwriteFileAtDest property, select True. The final task should look like Figure 3-12.
If you run the package, you’ll see that the file is downloaded from the FTP site at Microsoft to your local file system. In a real-world scenario, you would probably download the file, load it into a SQL Server, and then archive it. This complete scenario is discussed in detail in Chapter 8.
Web Service Task
The Web Service Task in SSIS is used to retrieve XML-based result sets by executing a method on a web service. Just like the other tasks we’ve separated out into the Data Preparation Task category for this chapter, this task only retrieves the data; it doesn’t yet address the need to navigate through the data, or extract sections of the resulting documents. Web services are a big part of advancing service-oriented architectures, and they can be used in SSIS to provide real-time validation of data in your ETL processes or to maintain lookup or dimensional data.
The task requires creation of an HTTP Connection Manager to a specific HTTP endpoint on a website or to a specific Web Services Description Language (WSDL) file on a website. Because of this HTTP connection, keep in mind that the task may not work for some locked-down server environments. If the HTTP Connection Manager doesn’t point to a WSDL file on the site, a local version must be provided. The WSDL file provides a standard XMLformatted list of available methods that can be called in the web service. The WSDL file also provides information about what type of parameters can be used and what results can be expected in return. Figure 3-13 shows how you can configure the HTTP Connection Manager to access a web service called USZIP at www.webservicex.net/uszip.asmx.
This is a simplistic HTTP Connection Manager setup. In this case, no special proxy, credentials, or certificate setup is required. If you are using secure, corporate web services, this undoubtedly will not be the case.
The General Tab
The General tab on the Web Service Task is where you set the HttpConnection property of the task to the HTTP Connection Manager that you have already created or alternatively create at the same time by selecting the <New Connection…> option in the property. In Figure 3-14, the value for the WSDL parameter file has been provided. This indicates to the Connection Manager that the definitions of the web service can be obtained remotely. In this case, you are not required to provide a local version of the WSDL file as well. This property is required only if you don’t provide the WSDL parameter in the Connection Manager. If this is the case, simply provide the local filename and click the Download WSDL button to have the task, at design time, reach out to the HTTP endpoint and retrieve a copy of the WSDL file for you.
The Input Tab
The next step is to define the input that you want to retrieve from the web service. It makes sense that if you have defined the web service in the General tab, you now need to specify the web method that you want to access for the input to the task. The Web Service Task makes this easy by using the WSDL file to provide a dropdown in the Input tab where you can specify the method, as shown in Figure 3-15. If you get an error trying to access the Input tab, consider downloading the WSDL file directly onto your computer.
After you select a web method, such as GetInfoByAreaCode, the Web Service Task uses the WSDL to set up the interface for you to provide how the input parameters will be fed into the task. You can choose to set up hardcoded values as shown in Figure 3-15 or you can fill these parameters with variables.
You can see here that all the named parameters, in this case only USAreaCode, are provided with the expected data types. If you select the Variable option here, the Value column would morph into a dropdown list to allow the selection of a variable. Using variables gives you the flexibility to send something into the Web Service Task dynamically at runtime.
The Output Tab
The remaining tab is the Output tab. Here you have only two options in this task. The resulting retrieval from the web service method can be stored in a file or in a variable. The output is in XML format, so if you choose to save in a variable, select a data type of string. In this example, we’ll set the OutputType property to a file connection, and then set the location of the file to a spot on the local file system.
Running the Web Service Task using this configuration will result in calling the web method GetInfoByZipCode on the web service USZip and retrieving data into an XML file that looks like this:
Retrieving data into a file is good, but using it in an SSIS package is even better. The next example demonstrates how you would use the XML Task to retrieve this same zip code data and use it in a Data Flow.
Retrieving Data Using the Web Service Task and XML Source Component
In this example, you’ll configure the data retrieved from the Web Service Task to be read through the XML source in the Data Flow. Don’t worry if the Data Flow is a little confusing at this time. You’ll see much more about it in the next chapter.
1. Set up a project and package in the directory c:\ProSSIS\tasks\websvc or download the complete package from www.wrox.com/go/prossis2014.
2. Drop a Web Service Task onto the Control Flow design surface and configure the task to use the GetInfoByZipCode method on the web service USZip as shown in the preceding section.
3. Go to the Output tab and set the OutputType to store the results of the web service method to a file of your choosing, such as C:\ProSSIS\Tasks\WebSVC\Output.xml.
4. Drop a Data Flow Task onto the Control Flow design surface and connect the Web Service Task to the Data Flow.
5. In the Data Flow, drop an XML source component on the design surface.
If the XML source contained schema information, you could select the Use Inline Schema option — the Data Access Mode should be set to “XML file location” — and you’d be done. However, you’ve seen the data we are getting from the web service, and no schema is provided. Therefore, you need to generate an XML Schema Definition language file so that SSIS can predict and validate data types and lengths.
NOTE Here’s a little trick that will save you some time. To demonstrate the Web Service Task initially, you set the XML output to go to a file. This was not by accident. Having a concrete file gives you a basis to create an XSD, and you can do it right from the design-time XML Source Component. Just provide the path to the physical XML file you downloaded earlier and click the Generate XSD button. Now you should have an XSD file that looks similar to this:
Notice that the XSD generator is not perfect. It can only predict a data type based on what it sees in the data. Not to give the generator anthropomorphic qualities, but the ZIP and AREA_CODE data elements “look” like numeric values to the generator. You should always examine the XSD that is created and edit it accordingly. Change the sequence element lines for ZIP and AREA_CODE to look like this:
<xs:element minOccurs=”0″ name=”ZIP” type=”xs:string ” />
<xs:element minOccurs=”0″ name=”AREA_CODE” type=” xs:string” />
Now if you refresh the XML Source and select the Columns tab, as shown in Figure 3-16, you should be able to see the columns extracted from the physical XML file.
6. To complete the package, add a Flat File Destination to dump the data into a comma-separated file (CSV file).
7. Connect the output pipeline of the XML source to the Flat File Destination.
8. Click the New button next to the Connection Manager dropdown box to create a new Flat File Connection Manager. Place the file somewhere in the C:\ProSSIS\Tasks\WebSVC directory and call it whatever you’d like, as shown in Figure 3-17.
9. Click OK to go back to the Flat File Destination and then click the Mappings tab to confirm that the columns map appropriately (straight arrows between the left and right). If you save and run the package, it will download the XML file into a variable, and then export the columns and rows to a flat file.
This is hardly a robust example, but it demonstrates that the Web Service Task makes retrieving data from a web service a very simple point-and-click task. However, the Web Service Task can retrieve only the results of a web service call. You may find that you need to prepare, extract, or validate your XML files before running them through your ETL processes. This is where the XML Task comes in.
The XML Task is used when you need to validate, modify, extract, or even create files in an XML format. Earlier we used a Web Service Task to retrieve data in an XML-formatted web service response. In terms of validating this type of XML result, the WSDL that you copy down locally is your contract with the web service, which will break if the XML contents of the results change. In other situations, you may be provided with XML data from a thirdparty source outside of a contractual relationship. In these cases, it is a good practice to validate the XML file against the schema definition before processing the file. This provides an opportunity to handle the issue programmatically.
If you look at the task in Figure 3-18, the editor looks simple. There are two tabs: only one for General configuration and the obligatory Expressions tab.
The current OperationType is set in this example to the Diff operation. This option is one of the more involved operations and requires two XML sources, one as the Input and the other as the Second Operand. However, these properties change based on the selection you make for the OperationType property. The options are as follows:
Validate: This option allows for the schema validation of an XML file against Document Type Definition (DTD) or XML Schema Definition (XSD) binding control documents. You can use this option to ensure that a provided XML file adheres to your expected document format.
XSLT: The Extensible Stylesheet Language Transformations (XSLT) are a subset of the XML language that enables transformation of XML data. You might use this operation at the end of an ETL process to take the resulting data and transform it to meet a presentation format.
XPATH: This option uses the XML Path Language and allows the extraction of sections or specific nodes from the structure of the XML document. You might use this option to extract data from the XML document prior to using the content. For example, you might want to pull out only the orders for a specific customer from an XML file.
Merge: This option allows for the merging of two XML documents with the same structure. You might use this option to combine the results of two extracts from disparate systems into one document.
Diff: This option uses difference algorithms to compare two XML documents to produce a third document called an XML Diffgram that contains the differences between them. Use this option with another XML Task using the Patch option to produce a smaller subset of data to insert into your data store. An example use of this task is extracting only the prices that have changed from a new price sheet in XML format.
Patch: This option applies the results of a Diff operation to an XML document to create a new XML document.
As you might expect, you can configure the task to use either a file source or a variable. The option to input the XML directly is also available, but it’s not as practical. The best way to get an idea of how this task can be used is to look at a few examples.
Validating an XML File
First up is a basic use case that demonstrates how to validate the internal schema format of an XML file. To make sure you are clear on what the XML Task does for you, the validation is not about whether the XML file is properly formed but about whether it contains the proper internal elements. If an XML file is malformed, then simply attempting to load the XML file in the task will generate an error. However, if a missing node is defined within the XSD contract, the XML Task Validation option will inform you that the XML file provided doesn’t meet the conditions of the XSD validation.
For this example, we’ll borrow the information from the XML and XSD files in the Web Service Task example. Recall that we had an XSD that validated a string node for City, State, Zip, Area_Code, and Time_Zone. (See the Web Service Task example to view the XSD format.) You can download this complete example at www.wrox.com/go/prossis2014.
We’ll use three files to exercise this task. The first is a valid XML file named MyGetZipsData.xml that looks like this:
The second file is an invalid XML file named MyGetZipsData_Bad.xml. This file has an improperly named node <CITYZ> that doesn’t match the XSD specification:
The last file is a malformed XML file named MyGetZipsData_ReallyBad.xml. This file has an empty <Table> node and is not a valid XML format:
For this example, follow these steps:
1. Create a package and add a new XML Task to the Control Flow surface.
2. Select the OperationType of Validate, set the Input Source Type to a new file connection, and browse to select the MyGetZipsData.xml file.
3. Expand the OperationResult property in the Output section to configure an additional text file to capture the results of the validation. The result values are only true or false, so you only need a simple text file to see how this works. Typically, you store the result in a variable, so you can test the results to determine the next action to take after validation.
4. Set the OverwriteDestination property to True to allow the result to be overwritten in each run of the task.
5. In the Second Operand, you’ll need to create another file connection to the XSD file. This will be used for validation of the schema.
6. Create another file connection using an existing file that points to this XSD file.
7. Finally, set the validation type to XSD, as we are using an XSD file to validate the XML. The editor at this point should look like Figure 3-19.
This completes the happy path use case. If you execute this task, it should execute successfully, and the results file should contain the value of true to indicate that the XML file contains the correct schema as defined by the XSD file. Now on to the true test:
8. Change the source to a new connection for the MyGetZipsData_Bad.xml file.
9. Execute the task again. This time, although the task completes successfully, the result file contains the value of false to indicate a bad schema. This is really the whole point of the Validation option.
10. Finally, change the source to create a new connection to the poorly formatted XML file MyGetZipsData_ReallyBad.xml to see what happens. In this case, the task actually fails — even though the Validation option’s FailOnValidationFail property is set to False. This is because the validation didn’t fail — the loading of the XML file failed. The error message indicates the problem accurately:
Just be aware of the difference between validating the schema and validating the XML file itself when designing your package Control Flows, and set up accordingly. You need to have a Control Flow for the failure of the task and for the successful completion with a failure result.
This is just one example demonstrating how you can use the XML Task for SSIS development. There are obviously several other uses for this task that are highly legitimate and useful for preparing data to feed into your SSIS ETL package Data Flows. The next section turns to another set of data preparation tasks, which we have separated into their own category, as they deal specifically with retrieval and preparation of RDBMS data.