Creating the Basic Package in SSIS
To simplify the explanation of the Lookup Transformation’s operation in the next few sections, this section presents a typical ETL problem that is used to demonstrate several solutions using the components configured in various modes.
The AdventureWorks database is a typical OLTP store for a bicycle retailer, and AdventureWorksDW is a database that contains the corresponding denormalized data warehouse structures. Both of these databases, as well as some secondary data, are used to represent a real-world ETL scenario. (If you do not have the databases, download them from www.wrox.com.)
The core operation focuses on extracting fact data from the source system (fact data is discussed in Loading a Data Warehouse Topic); in this scenario you will not yet be loading data into the warehouse itself. Obviously, you would not want to do one without the other in a real-world SSIS package, but it makes it easier to understand the solution if you tackle a smaller subset of the problem by itself.
You will first extract sales order (fact) data from the AdventureWorks, and later you will load it into the AdventureWorksDW database, performing multiple joins along the way. The order information in AdventureWorks is represented by two main tables: SalesOrderHeader and SalesOrderDetail. You need to join these two tables first.
The SalesOrderHeader table has many columns that in the real world would be interesting, but for this exercise you will scale down the columns to just the necessary few. Likewise, the SalesOrderDetail table has many useful columns, but you will use just a few of them. Here are the table structure and first five rows of data for these two tables:
As you can see, you need to join these two tables together because one table contains the order header information and the other contains the order details. below the screen shot shows a conceptual view of what the join would look like.
However, this does not get us all the way there. The CustomerID column is a surrogate key that is specific to the source system, and the very definition of surrogate keys dictates that no other system — including the data warehouse— should have any knowledge of them. Therefore, in order to populate the warehouse you need to get the original business (natural) key. Thus, you must join the SalesOrderHeader table (Sales.SalesOrderHeader) to the Customer table (Sales.Customer) in order to find the customer business key called AccountNumber. After doing that, your conceptual join now looks like shown below the screen shoot.
Similarly for Product, you need to add the Product table (Production.Product) to this join in order to derive the natural key called ProductNumber, as shown in below screen shot.
Referring to the below screen shoot, you can get started by creating a new SSIS package that contains an OLE DB Connection Manager called localhost.AdventureWorks that points to the AdventureWorks database and a single empty Data Flow Task.
Using a Relational Join in the Source
The easiest and most obvious solution in this particular scenario is to use a relational join to extract the data. In other words, you can build a package that has a single source (use an OLE DB Source Component) and set the query string in the source to utilize relational joins. This enables you to take advantage of the benefits of the relational source database to prepare the data before it enters the SSIS Data Flow.
Drop an OLE DB Source Component on the Data Flow design surface, hook it up to the localhost. AdventureWorks Connection Manager, and set its query string as follows:
–columns from Sales.SalesOrderHeader
oh.SalesOrderID, oh.OrderDate, oh.CustomerID,
–columns from Sales.Customer
–columns from Sales.SalesOrderDetail
od.SalesOrderDetailID, od.ProductID, od.OrderQty, od.UnitPrice,
–columns from Production.Product
from Sales.SalesOrderHeader as oh
inner join Sales.Customer as c on (oh.CustomerID = c.CustomerID)
left join Sales.SalesOrderDetail as od on (oh.SalesOrderID =
inner join Production.Product as p on (od.ProductID = p.ProductID);
Note that you can either type this query in by hand or use the Build Query button in the user interface of the OLE DB Source Component to construct it visually. Click the Preview button and make sure that it executes correctly (see the below screen shot).
For seasoned SQL developers, the query should be fairly intuitive — the only thing worth calling out is that a left join is used between the SalesOrderHeader and SalesOrderDetail tables because it is conceivable that an order header could exist without any corresponding details. If an inner join was used here, it would have lost all such rows exhibiting this behavior. Conversely, inner joins were used everywhere else because an order header cannot exist without an associated customer, and a details row cannot exist without an associated product. In business terms, a customer will buy one or (hopefully) more products.
Close the preview dialog; click OK on the OLE DB Source Editor UI, and then hook up the Source Component to a Union All Transformation as shown in below screen shot, which serves as a temporary destination. Add a Data Viewer to the pipeline in order to watch the data travel through the system. Execute the package in debug mode and notice that the required results appear in the Data Viewer window.
Note:The Union All Transformation has nothing to do with this specific solution; it serves simply as a dead end in the Data Flow in order to get a temporary trash destination so that you don’t have to physically land the data in a database or file. This is a great way to test your Data Flows during development; placing a Data Viewer just before the Union All gives you a quick peek at the data. After development you would need to replace the Union All with a real destination. Note that you could also use some other component such as the Conditional Split. Keep in mind that some components, like the Row Count, require extra setup (such as variables), which would make this approach more cumbersome. Thirdparty tools are also available (such as Task Factory by Pragmatic Works) that have trash destinations for testing purposes only.
Using the Merge Join Transformation
Another way you could perform the join is to use Merge Join Transformations. In this specific scenario it does not make much sense because the database will likely perform the most optimal joins, as all the data resides in one place. However, consider a system in which the four tables you are joining reside in different locations; perhaps the sales and customer data is in SQL Server, and the product data is in a flat file, which is dumped nightly from a mainframe. The following steps explain how you can build a package to emulate such a scenario:
- Start again with the basic package and proceed as follows. Because you do not have any actual text files as sources, you will create them inside the same package and then utilize them as needed. Of course, a real solution would not require this step; you just need to do this so that you can emulate a more complex scenario.
- Name the empty Data Flow Task “DFT Create Text Files.” Inside this task create a pipeline that selects the required columns from the Product table in the AdventureWorks database and writes the data to a text file. Here is the SQL statement you will need: select ProductID, ProductNumber
- Connect the source to a Flat File destination and then configure the Flat File Destination Component to write to a location of your choice on your local hard drive, and make sure you select the delimited option and specify column headers when configuring the destination options, as shown in Figure 7-10. Name the flat file Product.txt.
- Execute the package to create a text file containing the Product data. Now create a second Data Flow Task and rename it “DFT Extract Source.” Connect the first and second Data Flow Tasks with a precedence constraint so that they execute serially, as shown in Screen shoot. Inside the second (new) Data Flow Task, you’ll use the Lookup and Merge Join solutions to achieve the same result you did previously.
When using the Lookup Transformation, make sure that the largest table (usually a fact table) is streamed into the component, and the smallest table (usually a dimension table) is cached. That’s because the table that is cached will block the flow while it is loaded into memory, so you want to make sure it is as small as possible. Data Flow execution cannot begin until all Lookup data is loaded into memory. Since all of the data is loaded into memory, it makes the 3GB process limit on 32-bit systems a real challenge. In this case, all the tables are small, but imagine that the order header and details data is the largest, so you don’t want to incur the overhead of caching it. Thus, you can use a Merge Join Transformation instead of a Lookup to achieve the same result, without the overhead of caching a large amount of data. In some situations you can’t control the table’s server location, used in the Lookup, because the source data needs to run through multiple Lookups. A good example of this multiple Lookup Data Flow would be the loading of a fact table.
The simplest solution for retrieving the relational data would be to join the order header and order details tables directly in the Source Component (in a similar manner to that shown earlier). However, the following steps take a more complex route in order to illustrate some of the other options available:
- Drop an OLE DB Source Component on the design surface of the second Data Flow Task and name it “SRC Order Header.” Hook it up to the AdventureWorks Connection Manager and use the following statement as the query:
select SalesOrderID, OrderDate, CustomerID
Note:Of course, you could just choose the Table or View option in the source UI, or use a select* query, and perhaps even deselect specific columns in the Columns tab of the UI. However, these are all bad practices that will usually lead to degraded performance. It is imperative that, where possible, you specify the exact columns you require in the select clause. Furthermore, you should use a predicate (where clause) to limit the number of rows returned to just the ones you need.
- Confirm that the query executes successfully by using the Preview button, and then hook up a Sort Transformation downstream of the source you have just created. Open the editor for the Sort Transformation and choose to sort the data by the SalesOrderID column, as shown in Figure 7-12. The reason you do this is because you will use a Merge Join Transformation later, and it requires sorted input streams. (Note that the Lookup Transformation does not require sorted inputs.) Also, an ORDER BY clausein the source would be better for performance, but this example is giving you experience with the Sort Transform.
- To retrieve the SalesOrderDetails data, drop another OLE DB Source Component on the design surface, name it SRC Details, and set its query as follows. Notice how in this case you have included an ORDER BY clause directly in the SQL select statement. This is more efficient than the way you sorted the order header data, because SQL Server can sort it for you before passing it out-of-process to SSIS. Again, you will see different methods to illustrate the various options available:select SalesOrderID, SalesOrderDetailID, ProductID, OrderQty,
order by SalesOrderID, SalesOrderDetailID, ProductID;
- Now drop a Merge Join Transformation on the surface and connect the outputs from the two Source Components to it. Specify the input coming from SRC Header (via the Sort Transformation) to be the left input, and the input coming from SRC Details to be the right input. You need to do this because, as discussed previously, you want to use a left join in order to keep rows from the header that do not have corresponding detail records.
After connecting both inputs, try to open the editor for the Merge Join Transformation; you should receive an error stating that “The IsSorted property must be set to True on both sources of this transformation.” The reason you get this error is because the Merge Join Transformation requires inputs that are sorted exactly the same way. However, you did ensure this by using a Sort Transformation on one stream and an explicit T-SQL ORDER BY clause on the other stream, so what’s going on? The simple answer is that the OLE DB Source Component works in a pass-through manner, so it doesn’t know that the ORDER BY clause was specified in the second SQL querystatement due to the fact that the metadata returned by SQL Server includes column names, positions, and data types but does not include the sort order. By using the Sort Transformation, you forced SSIS to perform the sort, so it is fully aware of the ordering.
In order to remedy this situation, you have to tell the Source Transformation that its input data is presorted. Be very careful when doing this — by specifying the sort order in the following way, you are asking the system to trust that you know what you are talking about and that the data is in factsorted. If the data is not sorted, or it is sorted other than the way you specified, then your package can act unpredictably, which could lead to data integrity issues and data loss. Use the following steps to specify the sort order:
- Right-click the SRC Details Component and choose Show Advanced Editor. Select the Input and Output Properties tab, shown in below screen shot, and click the Root Node for the default output (not the error output). Inthe property grid on the right-hand side is a property called IsSorted. Change this to True.
- The preceding step tells the component that the data is presorted, but it does not indicate the order. Therefore, the next step is to select the columns that are being sorted on, and assign them values as follows:I)If the column is not sorted, then the value should be zero.
II)If the column is sorted in ascending order, then the value should be positive.
III)If the column is sorted in descending order, then the value should be negative.The absolute value of the number should correspond to the column’s position in the order list. For instance, if the query was sorted as follows, “SalesOrderID ascending, ProductID descending,” then you would assign the value 1 to SalesOrderID and the value -2 to ProductID, with all other columns being 0.
- Expand the Output Columns Node under the same default Output Node, and then select the SalesOrderID column. In the property grid, set the SortKeyPosition value to 1, as shown in below screen shot.
- Close the dialog and try again to open the Merge Join UI; this time you should be successful. By default, the component works in inner join mode, but you can change that very easily by selecting (in this case) Left OuterJoin from the Join type dropdown (see below the screen shot). You can also choose a Full Outer Join, which would perform a Cartesian join of all the data,though depending on the size of the source data, this will have a high memory overhead.
- If you had made a mistake earlier while specifying which input was the left and which was the right, you can click the Swap Inputs button to switch their places. The component will automatically figure out which columns you are joining on based on their sort orders; if it gets it wrong, or there are more columns you need to join on, you can drag a column from the left to the right in order to specify more join criteria. However, the component will refuse any column combinations that are not part of the ordering criteria.
- Finally, drop a Union All Transformation on the surface and connect the output of the Merge Join Transformation to it. Place a Data Viewer on the output path of the Merge Join Transformation and execute the package. Check the results in the Data Viewer; the data should be joined as required.
Merge Join is a useful component to use when memory limits or data size restricts you from using a Lookup Transformation. However, it requires the sorting of both input streams — which may be challenging to do with large data sets — and by design it does not provide any way of caching either data set. The next section examines the Lookup Transformation, which can help you solve join problems in a different way.