How to Use Derived Column Transformation in SSIS with Example
ADVANCED DERIVED COLUMN USE
If you’ve used the data flow in SSIS for any amount of data transformation logic, you will no doubt have used the Derived Column Transformation. It has many basic uses, from basic replacing of NULLs or blanks to text parsing and manipulation.
Using SSIS expressions, the Derived Column Transformation can be used for more advanced data cleansing operations than a simple single expression, such as the following:
- Advanced text code logic to identify and parse text values
- Checking for data ranges and returning a specified value
- Mathematical operations with advanced logic
- Date comparison and operations
The expression language in thorough detail. Below screenshot highlights the Derived Column Transformation expression toolbox within the Derived Column Transformation Editor.
One challenge with the Derived Column Transformation is parsing more complicated text strings and effectively using expressions without duplicating expression logic. This next section walks you through an example of pulling out information from values.
Text Parsing Example
To see an example of text mining, consider the example source data from the following list. It contains oceanographic buoy locations off the coast of the United States. Some of them are near cities, while others are in locations farther off the coast. In addition to the location, the text values also contain some codes and switches irrelevant to what you need to parse.
6N26 /V S. HATTERAS, NC
3D13 /A EDISTO, SC
3D14 /A GRAYS REEF
6N46 /A CANAVERAL, FL
6N47 /A CANAVERAL EAST, FL
3D56 /A ST. AUGUSTINE, FL
3D55 /A FRYING PAN SHOALS
3D36 /D BILOXI, MS
3D35 /D LANEILLE, TX
3D44 /D EILEEN, TX
Can you use the Derived Column Transformation to pull out the locations embedded within the text? For locations that are near cities, can you also identify the appropriate state code? More important, can you do this efficiently and clearly?
Most ETL developers would try to do this in a single Derived Column step with one expression. They would end up with something like this:
SUBSTRING((ISNULL(Location) ? “Unknown” :
TRIM(Location)),FINDSTRING((ISNULL(Location) ? “Unknown” :
+ 3,(FINDSTRING((ISNULL(Location) ? “Unknown” :
TRIM(Location)),”,”,1) == 0 ?
(LEN((ISNULL(Location) ? “Unknown” : TRIM(Location))) –
FINDSTRING((ISNULL(Location) ? “Unknown” : TRIM(Location)),”/”,1) +
(FINDSTRING((ISNULL(Location) ? “Unknown” : TRIM(Location)),”,”,1) –
FINDSTRING((ISNULL(Location) ? “Unknown” : TRIM(Location)),”/”,1) –
To be sure, this code will work. It identifies text values, where the location begins, and when a location has a state code appended to it. However, the clarity of the code leaves much to be desired.
One thing you can notice in the preceding code is the redundancy of some expressions. For example, it is replacing a NULL value in the Location column with “Unknown“. In addition, several FINDSTRING functions are used to locate the “/” in the code.
A better approach is to break the code into multiple steps. below the screen shot shows illustrates a Data Flow that contains two Derived Column Transformations.
The first Derived Column Transformation performs a few preparation steps in the data that is then used in the second transformation.Below screen shot highlights the expressions used in the first “Parsing Preparation” transformation.
This transformation performs the three common expressions needed to handle the string logic that pulls out the location information from the data:
- LocationPosition: This new column simply identifies where the “/” is in the code, since that is immediately before the location is named.
- StatePosition: This expression looks for the existence of a comma (,), which would indicate that the location is a city with an accompanying state as part of the location description.
- Location: This column is replaced with “Unknown” if the Location value is missing.
With these preparation steps, the expression logic needed to perform the parsing of the text becomes a lot cleaner. The following code is part of the second Derived Column Transformation, which parses out the name of the location:
SUBSTRING(Location,LocationPosition + 3,(StatePosition == 0 ?
(LEN(Location) – LocationPosition + 4) :
(StatePosition – LocationPosition – 3)))
Now the expression is more readable and easier to follow. Note that to employ this approach, you need to break your Data Flow into two Derived Column Transformations because in order for expression logic to reference a Data Flow column, it must be available in the input of the transformation.