The Script Component provides another area where programming logic can be applied in an SSIS package. This component, which can be used only in the Data Flow portion of an SSIS package, allows programmatic tasks to occur in the data stream. This component exists to provide, consume, or transform data using .NET code. To differentiate between the various uses of the Script Component, when you create one, you have to choose one of the following three types:
- Source Type Component: The role of this Script Component is to provide data to your Data Flow Task. You can define outputs and their types and use script code to populate them. An example would be reading in a complex file format, possibly XML or something that requires custom coding to read, like HTTP or RSS Sources.
- Destination Type Component: This type of Script Component consumes data much like an Excel or Flat File Destination. This component is the end of the line for the data in your data stream. Here, you’ll typically put the data into a DataSet variable to pass back to the Control Flow for further processing, or send the stream to custom output destinations not supported by built-in SSIS components. Examples of these output destinations can be web service calls, custom XML formats, and multi-record formats for mainframe systems. You can even programmatically connect and send a stream to a printer object.
- Transformation Type Component: This type of Script Component can perform custom transformations on data. It consumes input columns and produces output columns. You would use this component when one of the built-in transformations just isn’t flexible enough.
In this section, you’ll get up to speed on all the specifics of the Script Component, starting first with an explanation of the differences between the Script Task and the Script Component, and then looking at the coding differences in the two models. Finally, you’ll see an example of each implementation type of the Script Component to put all of this information to use.
Differences from a Script Task
You might ask, “Why are there two controls, both the Script Task and the Script Component?” Well, underlying the SSIS architecture are two different implementations that define how the VSTA environment is used for performance. Each Script Task is called only once within a Control Flow, unless it is in a looping control. The Script Component has to be higher octane because it is going to be called per row of data in the data stream. You are also in the context of being able to access the data buffers directly, so you will be able to perform more tasks.
When you are working with these two controls, the bottom line is that there are slightly different ways of doing the same types of things in each. This section of the Topic cycles back through some of the things you did with the Script Task and points out the differences. First you’ll look at the differences in configuring the editor. Then you’ll see what changes when performing programmatic tasks such as accessing variables, using connections, raising events, and logging. Finally, you’ll look at an example that ties everything together.
Configuring the Script Component Editor
You’ll notice the differences starting with the item editor. Adding a Script Component to the Data Flow designer brings up the editor shown in below the screenshot, requesting the component type.
NOTE In order to add the Script Component, you must first add a Data Flow Task to a package.
Selecting one of these options changes how the editor is displayed to configure the control. Essentially, you are choosing whether the control has input buffers, output buffers, or both. below the screen shot shows an example of a Script Component Transformation that has both buffers.
The Script Component Source has only output buffers available, and the Script Component Destination has only input buffers available. You are responsible for defining these buffers by providing the set of typed columns for either the input or outputs. If the data is being fed into the component, the editor can set these up for you. Otherwise, you have to define them yourself. You can do this programmatically in the code, or ahead of time using the editor. Just select the input or output columns collection on the user interface, and click the Add Column button to add a column, as shown in below screen shot.
A helpful tip is to select the Output Columns node on the tree view, so that the new column is added to the bottom of the collection. Once you add a column, you can’t move it up or down. After adding the column, you need to set the Data Type, Length, Precision, and Scale. For details about the SSIS data types, see Using Variables, Parameters, and Expressions Topic.
When you access the scripting environment, you’ll notice some additional differences between the Script Component and the Script Task. Namely, some new classes have been added to the Solution Explorer, as shown in below screen shot.
The name of the class that is used to host custom code is different from that used for the Script Task. Rather than ScriptMain, the class is called main. Internally there are also some differences. The primary difference is the existence of more than one entry point method. The methods you’ll see in the main class depend upon the Script Component type. At least three of the following methods are typically coded and can be used as entry points in the Script Component:
- PreExecute is used for preprocessing tasks like creating expensive connections or file streams.
- PostExecute is used for cleanup tasks or setting variables at the completion of each processed row.
- CreateNewOutputRows is the method to manage the output buffers.
- Input0_ProcessInputRow is the method to manage anything coming from the input buffers. Note that the Input0 part of the name will differ based on the name of the input set in the editor.
The remaining classes are generated automatically based on your input and output columns when you enter into the script environment, so don’t make any changes to these; otherwise, they will be overwritten when you reenter the script environment.
One problem you might encounter in the Script Component Editor and the generation of the BufferWrapper class is that you can name columns in the editor that use keywords or are otherwise invalid when the BufferWrapper class is generated. An example would be an output column named 125K_AMOUNT. If you create such a column, you’ll get an error in the BufferWrapper class stating the following:
Invalid Token 125 in class, struct, or interface member declaration
Don’t attempt to change the property in the buffer class to something like _125K_AMOUNT, because this property is rebuilt the next time you edit the script. Change the name of the output column to _125K_AMOUNT, and the buffer class will change automatically. The biggest difference that you need to pay attention to with the Script Component is that if you make any changes to this editor, you’ll need to open the script environment so that all these base classes can be generated.
Last, but not least, you’ll notice a Connection Managers tab that is not available in the Script Task Editor. This enables you to name specifically the connections that you want to be able to access within the Script Component. Although you are not required to name these connections up front, it is extremely helpful to do so. You’ll see why later, when you connect to a data source. Below the screen shot shows an example of the AdventureWorks connection added to a Script Component.
Now that you understand the differences between the Script Task and Script Component from a setup perspective, you can examine how the coding differs.
Accessing Variables in a Script Component
The same concepts behind accessing variables also apply to the Script Component. You can send the variables into the control by adding them to the ReadOnlyVariables or ReadWriteVariables properties of the editor. You can also choose not to specify them up front and just use the variable dispenser within your Script Component to access, lock, and manipulate variables. We recommend using the properties in the editor for this component because the variables provided in the editor are added to the auto-generated base class variables collection as strongly typed variables. In this control, adding variables to the editor not only removes the need to lock and unlock the variables but also means you don’t have to remember the variable name within the component. Keep in mind that variables can’t be modified in all aspects of the Script Component. Here’s an example of setting the variable ValidationErrors within a Script Component:
As you can see, using variables is easier and more maintainable than in the Script Task because the variable names are available in IntelliSense and checked at compile time. However, if you don’t want to add a variable to each Script Component for some reason, you can still use the variable dispenser in this component. It is located on the base class and can be accessed using the base class, instead of the Dts object. Other than these differences, the variable examples in the Script Task section of this Scripting in SSIS Topic are still applicable. The remaining tasks of connecting to data sources, raising events, and logging follow a similar pattern. The methods for performing the tasks are more strongly named, which makes sense because any late binding (or runtime type checking) within a high-performing Data Flow Task would slow it down.
Subscribe to our youtube channel to get new updates..!
Connecting to Data Sources in a Script Component
A typical use of a connection is in the Source type of the Script Component, because in these types of Data Flow Tasks, the mission is to create a data stream. The origination of that data is usually another external source. If you had a defined SSIS Source Component, then it would be used and you wouldn’t need the Script Component to connect to it.
The coding to connect to a Connection Manager is very simple. You can instantiate a specific Connection Manager and assign the reference to a connection in the component’s collection. Using the connections collection in the Script Component is very similar to using the variables collection. The collection of strongly typed Connection Managers is created every time the script editor is opened. Again, this is helpful because you don’t have to remember the names, and you get compile-time verification and checking.
For example, if you had a package with an OLE DB Connection Manager named myOracleServer and added it to the Script Component with the name OracleConnection, you’d have access to the connection using this code:
For the Script Task, you’ve looked at SSIS’s ability to raise events, and you walked through some examples that demonstrated its scripting capabilities for managing how the package can respond to these events. These same capabilities exist in Script Components, although you need to keep in mind that Script Components run in a data pipeline or stream, so the potential for repeated calls is highly likely. You should fire events sparingly within a Script Component that is generating or processing data in the pipeline to reduce overhead and increase performance. The methods are essentially the same, but without the static Dts object.
NOTE Event handling is covered in more detail in Topic Error and Event Handling.
Here is the code to raise an informational event in a Script Component
Either version of code will generate an event in the Progress Tab that looks like this:
[myScriptComponent] Information: Removed non-ASCII Character
Raising an event is preferred to logging because it enables you to develop a separate workflow for handling the event, but in some instances logging may be preferred.
Like the Script Task, logging in the Script Component writes a message to all enabled log providers. It has the same interface as the Script Task, but it is exposed on the base class. Remember that Script Components run in a data pipeline or stream, so the potential for repeated calls is highly likely. Follow the same rules as those for raising events, and log sparingly within a Script Component that is generating or processing data in the pipeline to reduce overhead and increase performance. If you need to log a message within a Data Flow, you can improve performance by logging only in the PostExecute method, so that the results are logged only once.
Example: Scripting a Log Entry
This example shows how to log one informational entry to the log file providers at the end of a Data Flow Task. To use this code, create a package with a Data Flow Task and add a Script Component as a source with one output column named NewOutputColumn. Create these integer variables as private variables to the main.cs class: validationBadChars, validationLength, and validationInvalidFormat. Then add the following code to the CreateNewOutputRows() method in the main.cs class
In order for this sample to produce a log entry, remember that you have to set up a logging provider (use the menu option SSIS ⇒ Logging). Make sure you specifically select the Data Flow Task in which the Script Component is hosted within SSIS and the logging events specifically for the Script Component. Running the package will produce logging similar to this:
Example: Data Validation
Compared to the Script Task, the Script Component has a steeper learning curve. The example presented in this section is more comprehensive and should enable you to get the bigger picture of how you can use this component in your everyday package development.
A typical use of the Script Component is to validate data within a Data Flow. In this example, contact information from a custom application did not validate its data entry, resulting in poor data quality. Because the destination database has a strict set of requirements for the data, your task is to validate the contact information from a Flat File Source and separate valid from invalid records into two streams: the good stream and the error stream. The good records can continue to another Data Flow; the error records will be sent to an error table for manual cleansing.
Create the contacts table with the following script
The error queue table is virtually identical except it has no strict requirements and a column has been added to capture the rejection reason. All data fields are nullable and set to the maximum known size
Finally, the incoming data format is fixed-width and is defined as follows:
The data file provided as a test sample looks like this
Create a sample of this data file or download a copy from WWW.WROX.COM/GO/PROSSIS2014. Create a new package and add a Data Flow Task. Click on the Data Flow design surface and add a Connection Manager to the Connection Managers tab. Name the Connection Manager “Contacts Mainframe Extract,” browse to the data file, and set the file format to Ragged Right. Flat files with spaces at the end of the specifications are typically difficult to process in some ETL platforms. The Ragged Right option in SSIS provides a way to handle these easily without having to run the file through a Script Task to put a character into a consistent spot or without having the origination system reformat its extract files. Use the Columns tab to visually define the columns. Flip to the Advanced tab to define each of the column names, types, and widths to match the desired values and the new database field name. (You may need to delete an unused column if this is added by the designer.) The designer at this point looks like below the screen shot.
Typically, you may want to define some data with strong types. You can decide to do that here in the Connection Manager or you can do so later using a derived column depending on how confident you are in the source of the data. If the data source is completely unreliable, import data using Unicode strings and use your Data Flow Tasks to validate the data. Then move good data into a strong data type using the Derived Column Transformation.
On the Data Flow surface, drag a Flat File Source to the Data Flow editor pane. Edit the Flat File Source and set the Connection Manager to the Contract Mainframe Extract Connection Manager. This sets up the origination of the data to stream into the Data Flow Task. Check the box labeled “Retain null values from the source as null values in the Data Flow.” This feature provides the consistent testing of null values later.
Now add a Script Component to the Data Flow. When you drop the Script Component, you will be prompted to pick the type of component to create. Select Transformation and click OK. Connect the output of the Flat File Source to the Script Component to pipe the data into this component, where you can program some validation on the data.
Open the Script Component and set the ScriptLanguage property to the language of your choice. On the Input Columns tab, you will notice that Input Name is a dropdown with the name Input 0. It is possible to have more than one source pointed to this Script Component. If so, this dropdown would allow you to individually configure the inputs and select the columns from each input. For this example, select all the input columns. Set the Usage Type for the State and Zip columns to ReadWrite. The reason will be clear later.
Select the Inputs and Outputs tab to see the collection of inputs and outputs and the input columns defined previously. Here you can create additional input and output buffers and columns within each. Expand all the nodes and add these two output columns:
You’ll use the flag to separate the data from the data stream. The rejection reason will be useful to the person who has to perform any manual work on the data later. The designer with all nodes expanded should look like below the screen shot.
Back on the Script tab, click the Edit Script button to enter the VSTA scripting IDE. In the main class, the rules for validation need to be programmatically applied to each data row. In the Input0_ProcessInputRow method that was cogenerated by SSIS using the Script Component designer, add the rules for data validation:
- All fields are required except for the zip code.
- The zip code must be in the format #####-#### or ##### and use numeric digits from 0 through 9. If the zip code is valid for the first five characters but the whole string is not, strip the trailing records and use the first five.
- The state must be two uppercase characters.
Here’s the overall plan: the contents of the file will be sent into the Script Component. This is where programmatic control will be applied to each row processed. The incoming row has three data fields that need to be validated to determine whether all necessary data is present. The State and Zip columns need to be validated additionally by rule, and even cleaned up if possible. The need to fix the data in the stream is why the Zip and State column usage types had to be set to ReadWrite in the designer earlier.
To aid in accomplishing these rules, the data will be validated using regular expressions. Regular expressions are a powerful utility that should be in every developer’s tool belt. They enable you to perform powerful string matching and replacement routines. You can find an excellent tutorial on regular expressions at WWW.REGULAR-EXPRESSIONS.INFO. The regular expressions for matching the data are shown here:
To use the regular expression library, add the .NET System.Text.RegularExpressions namespace to the top of the main class. For performance reasons, create the instances of the RegEx class to validate the ZipCode and the State in the PreExecute() method of the Script Component. This method and the private instances of the Regex classes should look like this
To break up the tasks, create two new private functions to validate the ZipCode and State. Using byRef arguments for the reason and the ZipCode enables the data to be cleaned and the encapsulated logic to return both a true or false and the reason. The ZipCode validation functions should look like this
The state validation functions look like this
Now, to put it all together, add the driver method Input0_ProcessInputRow()
that is fired upon each row of the flat file
Notice that all fields are checked for null values using a property on the Row class that is the field name and an additional tag _IsNull. This is a property code generated by SSIS when you set up the input and output columns on the Script Component. Properties like Zip_IsNull explicitly allow the checking of a null value without encountering a null exception. This is handy as the property returns true if the particular column is NULL.
Next, if the Zip column is not NULL, its value is matched against the regular expression to determine whether it’s in the correct format. If it is, the value is assigned back to the Zip column as a cleaned data element. If the value of the Zip column doesn’t match the regular expression, the script checks whether it is at least five characters long. If true, then the first five characters are retested for a valid ZipCode pattern. Nonmatching values result in a GoodFlag in the output columns being set to False.
The state is trimmed of any leading or trailing white space, and then converted to uppercase and matched against the regular expression. The expression simply checks to see if it contains two uppercase letters between A and Z. If it does, the GoodFlag is set to True and the state value is updated; otherwise, the GoodFlag is set to False.
To send the data to the appropriate table based on the GoodFlag, you must use the Conditional Split Transformation. Add this task to the Data Flow designer and connect the output of the Script Component Task to the Conditional Split Transformation. Edit the Conditional Split Transformation, and add an output named Good with the condition GoodFlag == TRUE and name the default output Bad. This separates the data rows coming out of the Script Component Task into two separate streams. The Conditional Split Transformation Editor should look like below the screen shot.
Add an OLE DB Connection Manager that uses the database you created for the Contacts and ContactsErrorQueue tables. Add two SQL Server Destinations to the Data Flow designer. One, named Validated Contacts SQL Server Destination, should point to the Contacts table; the other, named Error Contacts SQL Server Destination, should point to the ContactsErrorQueue table. Drag the output of the Conditional Split Transformation to the Validated Destination. Set the output stream named Good to the destination. Then open the Mappings tab in the Destination to map the input stream to the columns in the Contacts table. Repeat this for the other Bad output of the Conditional Split Transformation to the Error Destination.
Your final Data Flow should look something like below the screen shot. If you run this package with the Contacts.dat file described at the top of the use case, three contacts will validate, and two will fail with these rejection reasons:
Synchronous versus Asynchronous
Data Flow transformations can handle data rows in one of two ways: synchronously or asynchronously.
- A synchronous component performs its stated operation for every row in the buffer of rows. It does not need to copy the buffer to a new memory space, and does not need to look at multiple rows to create its output. Examples of synchronous components include the Derived Column Transformation and the Row Count Transformation.
- The second type of transformation, an asynchronous component, creates another buffer for the output. It typically used multiple (or all) of the input rows to create a new output. The output usually looks quite different from the input, and the component tends to be slower because of the copying of memory. Asynchronous component examples include the Aggregate Transformation and Sort Transformation.
Script Components can be written to act synchronously or asynchronously. The Data Validation example previously discussed is an example of a synchronous component. Let’s create an asynchronous example for comparison. This example will show how to derive the median value from a set of source values.
Example: Creating a Median Value Asynchronously
As a starting point, use the AdventureWorks database to pull a set of values using an OLE DB Source, such as the TotalDue column from the Sales.SalesOrderHeader table. Similar to when you create a synchronous component, you can use a Script Component from the SSIS Toolbox as a transformation object and select the appropriate input columns, which in this case is the TotalDue column.
The Input and Outputs menu is where you veer off the same path that you would have followed with the synchronous component. The output property named SynchronousInputID needs to be set to None, which lets the component know that it should create a new buffer. The inputs and outputs created can be seen in below the screen shot.
Once the inputs and outputs are prepared, it is time to write the script to perform the median calculation. The full script can be seen here in both languages
Note that there is an ArrayList that sits outside of the methods. This variable is accessed by multiple functions throughout the execution of the component, so it needs to be accessible by all. When then component runs its pre-execute phase, it will initialize the ArrayList and prepare it to be used. Then as each input row is processed, the value will be added to the ArrayList. Finally, in the FinishOutputs method, the median is calculated by sorting the values and pulling the middle value. This value is added to the output buffer and can be inserted into a file or database. The finished and executed package is shown in below screen shot.
At this point, you have a good overview of how scripting works in SSIS and the difference between the Script Task and the Script Component, but as with any programming environment, you need to know how to troubleshoot and debug your code to ensure that everything works correctly. The next section describes some techniques you can use for more advanced SSIS scripting development.
List of Related Microsoft Certification Courses:
|SCCM||SQL Server DBA|
|Team Foundation Server||BizTalk Server Administrator|