Understanding Data Types Using Variables, Parameters and Expressions
UNDERSTANDING DATA TYPES
In SSIS, you must pay attention to data types, whether the data is coming from your Data Flow, is stored in variables, or is included in expressions. Failure to do so will cause a lot of frustration because the syntax checker will complain about incompatible data types when you are building expressions. If your Data Flow contains incompatible data types, your packages will raise either warnings or errors (if implicit conversions are made). This will happen even if the conversion is between Unicode and non-Unicode character sets. Comparison operations also are subject to either hard or soft errors during implicit conversion. Bad data type decisions can have a serious impact on performance. This seemingly simple topic causes significant grief for SSIS developers who haven’t yet learned the specifics of data types and how they are converted. The following sections provide a brief overview of how to resolve common data type conversion issues, beginning with a primer on SSIS data types.
SSIS Data Types
If you research the topic of “Integration Services Data Types” in Books Online, you’ll first notice that the data types are named much differently than similar types found in .NET or T-SQL. This nomenclature is troublesome for most new users. The following table provides a matrix between SSIS data types and a typical SQL Server set of data types. You’ll need this table to map between Data Flow columns and variable or parameters data types. The .NET managed types are important only if you are using Script component, CLR, or .NET-based coding to manipulate your Data Flows.
The following table is just for SQL Server. To do a similar analysis for your own data source, look at the mapping files that can be found in this directory: C:\Program Files\Microsoft SQL Server\120\DTS\MappingFiles\. If you’re familiar with OLE DB data types, you’ll understand these SSIS data type enumerations, because they are similar. However, there is more going on than just naming differences.
First, SSIS supports some data types that may not be familiar at all, nor are they applicable to SQL Server — namely, most of the unsigned integer types and a few of the date types. You’ll also notice the availability of the separate date-only (DT_DBDATE) and time-only (DT_DBTIME) types, which prior to SQL Server 2008 were available only for RDBMS databases like DB2 and ORACLE.
With the introduction of similar data types in the SQL Server 2008 engine, they are also applicable in SSIS. Finally, notice the arrow “⇒” in the table, which indicates that these data types are converted to other SSIS data types in Data Flow operations that may be opportunities for performance enhancements.
Date and Time Type Support
SQL Server 2008 included new data types for separate date and time values and an additional time zone-based data type compliant with the ISO 8601 standard. SSIS has always had these data type enumerations for the other RDBMS sources, but as of SQL Server 2008, these can also be used for SQL Server as well, including DT_DBTIMESTAMP2 and DT_DBTIME2, added for more precision, and DT_DBTIMESTAMPOFFSET, added for the ISO DateTimeOffset SQL Server data type.
A common mistake made in SSIS packages is the improper selection of an SSIS date data type. For some reason, DT_DBDATE and DT_DATE are often used for date types in Data Flow components, but improper use of these types can result in overflow errors or the removal of the time element from the date values. SSIS data types provide a larger net for processing incoming values than you may have in your destination data source. It is your responsibility to manage the downcasting or conversion operations. Make sure you are familiar with the data type mappings in the mapping file for your data source and destination, and the specific conversion behavior of each type. A good start would be the date/time types, because there are many rules regarding their conversion, as evidenced by the large section about them in Books Online. You can find these conversion rules for date/time data types under the topic “Integration Services Data Types” found here: http://msdn.microsoft.com/en-us/library/ms141036(v=SQL.120).aspx.
How Wrong Data Types and Sizes Can Affect Performance
If you’ve been working with SSIS for a while, you know that it can use serious memory resources and sometimes be slower than you expect. That’s because the Data Flow components do most of their work in memory. This can be good because it eliminates the most time-consuming I/O operations. However, because SSIS uses memory buffers to accomplish this, the number of rows that can be loaded into a buffer is directly related to the width of the row. The narrower the row, the more rows that can be processed per buffer.
If you are defining the data types of a large input source, pick your data types carefully, so that you are not using the default 50 characters per column for a text file, or the suggested data types of the Connection Manager, when you do not need this extra safety cushion. Also, be aware that there are some tradeoffs when selecting specific data types if they require any conversion as the data is being loaded into the buffers.
Data conversion is a fact of life, and you’ll have to pay for it somewhere in the ETL process. These general guidelines can give you a start:
- Convert only when necessary. Don’t convert columns from a data source that will be dropped from the data stream. Each conversion costs something.
- Convert to the closest type for your destination source using the mapping files. If a value is converted to a nonsupported data type, you’ll incur an additional conversion internal to SSIS to the mapped data type.
- Convert using the closest size and precision. Don’t import all columns as 50-character data columns if you are working with a fixed or reliable file format with columns that don’t require as much space.
- Evaluate the option to convert after the fact. Remember that SSIS is still an ETL tool and sometimes it is more efficient to stage the data and convert it using set-based methods.
The bottom line is that data type issues can be critical in high-volume scenarios, so plan with these guidelines in mind.
Unicode and Non-Unicode Conversion Issues
One aspect of ETL package development that you might not be used to is the default use of Unicode data types in SSIS packages. Not only is this the default import behavior, but all the string functions in SSIS expect Unicode strings as input. Unicode is a great choice if you’re unsure of the incoming data for handling data from import files with special characters, but if you’re not familiar with using this character set, it can be confusing at first. At the very least, using Unicode requires an additional step that is frequently missed, resulting in errors. For a typical demonstration, create a package that imports an Excel data source into a table defined with non-Unicode fields, or download the samples from www.wrox.com. Excel data is imported as Unicode by default, so the mapping step in the destination component complains that the data is not compatible, as shown in Below Screen Shot.
NOTE You may experience some data being replaced by NULLs when importing Excel files using the Excel Connection Manager. This typically occurs when numeric and text data is stored within one column. One solution is to update the extended properties section of the connection string to look like this:
Extended Properties=”EXCEL 14.0;HDR=YES;IMEX=1″
At first, you might assume that all you need to do is change the source data type to match the non-Unicode destination. Using the SQL conversion table as a guide, right click on the source, select the Show Advanced Editor option, and change the column type to DT_STR to match the destination SQL Server varchar data type. Now you’ll find that the same error from Figure 5-4 is occurring on both the source and the destination components. As discussed earlier in this section, SSIS requires purposeful conversion and casting operations. To complete the task, you need to add only a Data Conversion Transformation to convert the DT_WSTR and DT_R8 data types to DT_STR and DT_CY, respectively. The Data Conversion Transformation should look similar to the below Screen Shot.
Notice in this Data Conversion Transformation that the data types and lengths are changed to truncate and convert the incoming string to match the destination source. Also, notice the Code Page setting that auto-defaults to 1252 for ANSI Latin 1. The Code Page setting varies according to the source of the Unicode data you are working with. If you are working with international data sources, you may need to change this to interpret incoming Unicode data correctly.
This type casting operation is a good, simple example of how SSIS packages handle data of differing types. However, within expressions it is not necessary to bring in the conversion component to cast between different types. You can simply use casting operators to change the data types within the expression.
Casting in SSIS Expressions
If you want to experience the developer’s equivalent of poking your eye out, forget to put a casting operator in your Data Flow expressions. SSIS is tightly tied to data types and requires casting, which simply defines the data type for a value or expression. If you forget to use casting or choose the wrong data type, the package may fail or cause errors when trying to insert that column into the final destination.
While you can run into some frustrating issues if you don’t do it, the need for casting is not always intuitive. For example, the result of any string function defaults to the Unicode string type. If you are attempting to store that value in a non-Unicode column, you need to cast. Conversely, if you are storing the value in a variable, you don’t need to cast. (That’s because the data types in variable definitions allow only Unicode; more about that later in the section “Defining Variables.”)
The good news is that casting is easy. In the expression language, this looks just like a .NET primitive cast. The new data type is provided in parentheses right next to the value to be converted. A simple example is casting a 2-byte signed integer to a 4-byte signed integer:
Of course, not all the casting operators are this simple. Some require additional parameters when specific precision, lengths, or code pages have to be considered to perform the operation. These operators are listed in the following table:
Casting causes the most obvious trouble during comparison operations and logical expressions. Remember that all operands in comparison operations must evaluate to a compatible data type. The same rule applies to complex or compound logical expressions. In this case, the entire expression must return a consistent data type, which may require casting sections of the expression that may not readily appear to need casting. This is similar to the situation that you have in T-SQL programming when you attempt to use a number in a where clause for a numeric column, or when using a case statement that needs to return columns of different types. In the where predicate, both the condition and the column must be convertible into a compatible type. For the case statement, each column must be cast to the same variant data type. Later in the Data Flow Topic, you’ll look at examples in which you need to pay attention to casting when using comparison and logical expressions.
A less obvious issue with casting can occur when data becomes truncated during casting. For example, casting Unicode double-byte data to non- Unicode data can result in lost characters. Significant digits can be lost in forced casting from unsigned to signed types or within types like 32-bit integers to 16-bit integers. These errors underscore the importance of wiring up the error outputs in the Data Flow components that have them. Before you look at that, however, look at the following section about variables and parameters and how they are used in dynamic SSIS package development.