Fact Table Loading
Recommended by 0 users
Fact table loading is often simpler than dimension ETL, because a fact table usually involves just inserts and, occasionally, updates. When dealing with large volumes, you may need to handle partition inserts and deal with updates in a different way.
In general, fact table loading involves a few common tasks:
- Preparing your source data to be at the same granularity as your fact table, including having the dimension business keys and measures in the source data
- Acquiring the dimension surrogate keys for any related dimension
- Identifying new records for the fact table (and potentially updates)
The Sales Quota fact table is relatively straightforward and will give you a good start toward developing your fact table ETL:
- In your SSIS project for this Loading a Data Warehouse Topic , create a new package and rename it
- Just like the other packages you developed in this Loading a Data Warehouse Topic , you will use two Connection Managers, one for AdventureWorks, and the other for AdventureWorksDW. If you haven’t already created project-level Connection Managers for these in Solution Explorer, add them before continuing.
- Create a new Data Flow Task and add an OLE DB Source component. Name it Sales Quota Source. Configure the OLE DB Source component to connect to the AdventureWorks Connection Manager, and change the data access mode to SQL command, as shown in Figure 12-29. Add the following code to the SQL command text window:
SELECT QuotaDate, SalesQuota, NationalIDNumber as
EmployeeNationalIDAlternateKey FROM Sales.SalesPersonQuotaHistory
INNER JOIN HumanResources.Employee
ON SalesPersonQuotaHistory.BusinessEntityID =
- To acquire the surrogate keys from the dimension tables, you will use a Lookup Transformation. Drag a Lookup Transformation onto the Data Flow and connect the blue data path output of the OLE DB Source component onto the Lookup Transformation. Rename the Lookup Employee Key.
- Double-click the Employee Key Transformation to bring up the Lookup Editor. On the General property page, leave the Cache mode set to Full cache and the Connection type set to OLE DB Connection Manager.
- On the Connection property page, change the OLE DB Connection Manager dropdown to AdventureWorksDW and enter the following code:
SELECT EmployeeKey, EmployeeNationalIDAlternateKey
WHERE EndDate IS NULL
Including the EndDate IS NULL filter ensures that the most current dimension record surrogate key is acquired in the Lookup.
- Change to the Columns property page and map the EmployeeNationalIDAlternateKey from the input columns to the lookup columns. Then select the checkbox next to the EmployeeKey of the Lookup, as shown in below screen shot.
- Click OK to save your changes to the Lookup Transformation.
- For the DateKey, a Lookup is not needed because the DateKey is a “smart key,” meaning the key is an integer value based on the date itself in YYYYMMDD format. Therefore, you will use a Derived column to calculate the DateKey for the fact table. Add a Derived Column Transformation to the Data Flow and connect the blue data path output of the Employee Lookup to the Derived Column Transformation. When prompted, choose the Lookup Match Output from the Lookup Transformation. Name the Derived Column Date Keys.
- Double-click the Derived Column Transformation and add the following three new Derived Column columns and their associated expressions, as shown in below screen shot:
DateKey: YEAR([QuotaDate]) *10000 + MONTH([QuotaDate]) *100 + DAY([QuotaDate])
CalendarYear: (DT_I2) YEAR([QuotaDate])
CalendarQuarter: (DT_UI1) DATEPART(“q”,[QuotaDate])
At this point in your Data Flow, the data is ready for the fact table. If your data has already been incrementally extracted, so that you are getting only new rows, you can use an OLE DB Destination to insert it right into the fact table. Assume for this tutorial that you need to identify which records are new and which records are updates, and handle them appropriately. The rest of the steps accomplish fact updates and inserts.A Merge Join will be used to match source input records to the actual fact table records, but before you add the Merge Join, you need to add a Sort Transformation to the source records (a requirement of the Merge Join) and extract the fact data into the Data Flow.
- Add a Sort Transformation to the Data Flow and connect the blue data path output from the Derived Column Transformation to the Sort Transformation. Double-click the Sort Transformation to bring up the Sort Transformation Editor and sort the input data by the following columns: EmployeeKey, CalendarYear, and CalendarQuarter, as shown in below screen shot. The CalendarYear and CalendarQuarter are important columns for this fact table because they identify the date grain, the level of detail at which the fact table is associated with the date dimension. As a general rule, the Sort Transformation is a very powerful transformation as long as it is working with manageable data sizes, in the thousands and millions, but not the tens or hundreds of millions (if you have a lot of memory, you can scale up as well). An alternate to the Sort is described in steps 12–14, as well as in Joining Data Topic and Understanding and Tuning the Data Flow Engine Topic.
Below screen shot shows what your Data Flow should look like at this point.
- Add a new OLE DB Source component to the Data Flow and name it Sales Quota Fact. Configure the OLE DB Source to use the AdventureWorksDW Connection Manager and use the following SQL command:
SELECT EmployeeKey, CalendarYear
, CalendarQuarter, SalesAmountQuota
ORDER BY 1,2,3
- Because you are using an ORDER BY statement in the query (sorting by the first three columns in order), you need to configure the OLE DB Source component to know that the data is entering the Data Flow sorted. First, click OK to save the changes to the OLE DB Source and then right-click the Sales Quota Fact component and choose Show Advanced Editor.
- On the Input and Output Properties tab, click the OLE DB Source Output object in the left window; in the right window, change the IsSorted property to True, as shown in below screen shot.
- Expand the OLE DB Source Output on the left and then expand the Output Columns folder. Make the following changes to the Output Column properties:
a. Select the EmployeeKey column and change its SortKeyPosition to 1, as shown in Figure 12-35. (If the sort order were descending, you would enter a -1 into the SortKeyPosition.)
b. Select the CalendarYear column and change its SortKeyPosition to 2.
c. Select the CalendarQuarter column and change its SortKeyPosition to 3.
d. Click OK to save the changes to the advanced properties.
- Add a Merge Join Transformation to the Data Flow. First, connect the blue data path output from the Sort Transformation onto the Merge Join. When prompted, choose the input option named Merge Join Left Input. Then connect the blue data path output from the Sales Quota Fact Source to the Merge Join.
- Double-click the Merge Join Transformation to open its editor. You will see that the EmployeeKey, CalendarYear, and CalendarQuarter columns are already joined between inputs. Make the following changes, as shown in Figure 12-36:
a.Change the Join type dropdown to a Left outer join.
b. Check the SalesQuota, EmployeeKey, DateKey, CalendarYear, CalendarQuarter, and QuotaDate columns from the Sort input list and then change the Output Alias for QuotaDate to Date.
c. Check the SalesAmountQuota from the Sales Quota Fact column list and then change the Output Alias for this column to SalesAmountQuota_Fact.
- Click OK to save your Merge Join configuration.
- Your next objective is to identify which records are new quotas and which are changed sales quotas. A conditional split will be used to accomplish this task; therefore, drag a Conditional Split Transformation onto the Data Flow and connect the blue data path output from the Merge Join Transformation to the Conditional Split. Rename the Conditional Split to Identify Inserts and Updates.
- Double-click the Conditional Split to open the editor and make the following changes, as shown in below screen shot:
a.Add a new condition named New Fact Records with the following condition: ISNULL([SalesAmountQuota_Fact]). If the measure from the fact is null, it indicates that the fact record does not exist for the employee and date combination.
b. Add a second condition named Fact Updates with the following condition: [SalesQuota] != [SalesAmountQuota_Fact].
c. Change the default output name to No Changes.
- Click OK to save the changes to the Conditional Split.
- Add an OLE DB Destination component to the Data Flow and name it Fact Inserts. Drag the blue data path output from the Conditional Split Transformation to the OLE DB Destination. When prompted to choose an output from the Conditional Split, choose the New Fact Records output.
- Double-click the Fact Inserts Destination and change the OLE DB Connection Manager to AdventureWorksDW. In the “Name of the table or view” dropdown, choose the [dbo].[FactSalesQuota] table.
- Switch to the Mappings property page and match up the SalesQuota column from the Available Input Columns list to the SalesAmountQuota in the Available Destinations column list, as shown in Below screen shot. The other columns (EmployeeKey, DateKey, CalendarYear, and CalendarQuarter) should already match. Click OK to save your changes to the OLE DB Destination.
- To handle the fact table updates, drag an OLE DB Command Transformation to the Data Flow and rename it Fact Updates. Drag the blue data path output from the Conditional Split onto the Fact Updates Transformation, and when prompted, choose the Fact Update output from the Conditional Split.
- Double-click the OLE DB Command Transformation and change the Connection Manager dropdown to AdventureWorksDW. On the Component Properties tab, add the following code to the SQLCommand property (make sure you click the ellipsis button to open an editor window):
SET SalesAmountQuota = ?
WHERE EmployeeKey = ?
AND CalendarYear = ?
AND CalendarQuarter = ?
- Switch to the Column Mappings tab and map the SalesQuota to Param_0, Employee_Key to Param_1, CalendarYear to Param_2, and CalendarQuarter to Param_3, as shown in below screen shot.
- Click OK to save your changes to the OLE DB Command update. Your fact table ETL for the FactSalesQuota is complete and should look similar to below screen shot.
If you test this package out, you will find that the inserts fail. This is because the date dimension is populated through 2006, but several 2007 and 2008 dates exist that are needed for the fact table. For the purposes of this exercise, you can just drop the foreign key constraint on the table, which will enable your FactSalesQuota package to execute successfully. In reality, as part of your ETL, you would create a recurring script that populated the DateDim table with new dates:
ALTER TABLE [dbo].[FactSalesQuota] DROP CONSTRAINT
Here are some final considerations for fact table ETL:
- A Merge Join was used in this case to help identify which records were updates or inserts, based on matching the source to the fact table. Refer to the Joining Data Topic to see other alternatives to associating the source to fact table.
- For the inserts and updates, you may want to leverage the relational engine to handle either the insert or the update at the same time. T-SQL in SQL Server supports a MERGE statement that will perform either an insert or an update depending on whether the record exists. See Using the Relational Engine Topic for more about how to use this feature.
- Another alternative to the OLE DB Command fact table updates is to use a set-based update. The OLE DB command works well and is easy for small data volumes; however, your situation may not allow per-row updates. Consider staging the updates to a table and then performing a set-based update (through a multirow SQL UPDATE statement) by joining the staging table to the fact table and updating the sales quota that way.
- Inserts are another area of improvement considerations. Fact tables often contain millions of rows, so you should look for ways to optimize the inserts. Consider dropping the indexes, loading the fact table, and then recreating the indexes. This could be much faster. See Understanding and Tuning the Data Flow Engine Topic for ideas on how to tune the inserts.
- If you have partitions in place, you can insert the data right into the partitioned fact table; however, when you are dealing with high volumes, the relational engine overhead may inhibit performance. In these situations, consider switching the current partition out in order to load it separately, then you can switch it back into the partitioned table.
Inferred members are another challenge for fact table ETL. How do you handle a missing dimension key? One approach includes scanning the fact table source for missing keys and adding the inferred member dimension records before the fact table ETL runs. An alternative is to redirect the missing row when the lookup doesn’t have a match, then add the dimension key during the ETL, followed by bringing the row back into the ETL through a Union All. One final approach is to handle the inferred members after the fact table ETL finishes. You would need to stage the records that have missing keys, add the inferred members, and then reprocess the staged records into the fact table.
As you can see, fact tables have some unique challenges, but overall they can be handled effectively with SSIS. Now that you have loaded both your dimensions and fact tables, the next step is to process your SSAS cubes, if SSAS is part of your data warehouse or business intelligence project.