Incremental Loading Data using Control Table Pattern
So far, most of the data loading procedures that have been explained in this tutorial have done a full load or a truncate and load. While this is fine for smaller amounts of rows, it would be unfeasible to do with millions of rows. In this Topic, you’re going to learn how to take the knowledge you’ve gained and apply the practices to an incremental load of data.
The first pattern will be a control table pattern. In this pattern, you’ll use a table to determine when the last load of the data was. Then the package will determine which rows to load based on the last load date. The other alternative used in this Topic is a Change Data Capture (CDC) pattern. This pattern will require that you have Enterprise Edition of SQL Server and will automatically identify the rows to be transferred based on a given date.
CONTROL TABLE PATTERN
The most conventional incremental load pattern is the control table pattern. The pattern uses a table that the developer creates to store operational data about the last load. A sample table looks like this:
CREATE TABLE [dbo].[ControlTable](
[SourceTable] [varchar](50) NOT NULL,
[LastLoadID] [int] NOT NULL,
[LastLoadDate] [datetime] NOT NULL,
[RowsInserted] [int] NOT NULL,
CONSTRAINT [PK_ControlTable] PRIMARY KEY CLUSTERED
In this pattern, you would have a row in the control table for each table that you wish to create a load process for. This table is not only used by your SSIS package to determine how much data to load but it also becomes an audit table to see which tables have and have not been loaded. Each of the incremental load patterns in this Topic follow these steps:
- An Execute SQL Task reads the last load data from the control table into a
- A Data Flow Task reads from the source table where they were modified or created after the date in the variable.
- An Execute SQL Task sets the last load date in the control table to the time when the package began.
To start the example, run the Control Table Example Creation.sql file in the Topic accompanying material . This will create a table to read from called SourceTable and a table to load called DestTable. It will also create and load the control table. Notice the ControlTable table shows a LastLoadDate column of 1900-01-01, meaning the SourceTable has never been read from. The TableName column holds a record for each table you wish to read from. Optionally, there’s a LastLoadID that could be used for identity columns.
Querying the Control Table
Querying the control table you created is simply done through an Execute SQL Task. Start by creating an OLE DB connection manager to whichever database you ran the setup script in. For the purpose of this example, we’ll assume you created the tables in a variant of the AdventureWorks database.
To configure the Execute SQL Task, direct the task to use the previously created connection manager. Then, use a query similar to the one that follows for your SQLStatement property. This query will find out the last time you retrieved data from the table called SourceTable.
SELECT LastLoadDate from
ControlTable where SourceTable = ‘SourceTable’
The answer to the query should be stored into a variable to be used later. To do this, set the ResultSet property (shown in below screen shot) to Single Row. Doing this will allow you to use the ResultSet tab. Go to that tab, and click the Add button to create a new resultset. Then change the ResultName property from NewResultName to 0. This stores the result from the first column into the variable of your choosing. You could have also typed the column name from the query as well (LastLoadDate) into the property.
Next, select New Variable from the drop-down box in the ResultSet tab. This will open the Add Variable dialog box (shown in Figure 11-2). Ensure the variable is scoped to the package and call it SourceTableLoadDate. Define the data type of the variable as a DateTime and set the default value to 2099- 01-01. This ensures that if someone were to run the package without running this Execute SQL Task, that no date will be retrieved.
Querying the Source Table
With the date now set in the variable, you’re ready to retrieve any new data from your table called SourceTable. You’ll do this with a Data Flow Task that you connect to the Execute SQL Task. Create an OLE DB Source in the Data Flow and have it use the connection manager you created earlier. Then, set the Data Access Mode property to SQL Command and type the following query in the query window below:
SELECT * from SourceTable WHERE CreatedDate BETWEEN ? and ?
The two question marks represent input parameters that will be passed into the query. To set the values for the placeholders click Parameters, which opens the Set Query Parameters dialog box (shown in below screen shot). Set the first parameter to User::SourceTableLoadDate and the second parameter to System::StartTime. The StartTime variable represents the start time of the package. When both parameters are passed into the query, it essentially requests all the rows that have not been loaded since the last load until the time the package started.
With the OLE DB Source now configured, drag an OLE DB Destination over to the Data Flow and connect it to the OLE DB Source. Configure the OLE DB Destination to use the same connection manager and load the table called DestTable. After configuring the mappings, the simple Data Flow is complete and the final Data Flow should resemble Below screen shot.
Updating the Control Table
Back in the Control Flow you need one more Execute SQL Task to update the control table. Connect it to the Data Flow Task. To configure the Execute SQL Task, connect it to the same connection manager you have been using and type the following query into the SQLStatement property.
SET LastLoadDate = ?
WHERE SourceTable = ‘SourceTable’
The question mark in this case represents the time the package began, and to pass it in, go to the Parameter Mapping page and configure it as shown in below screen shot. Click the Add button and set the variable name being passed in as System::StartTime, the Data Type as Date, and the Parameter Name as 0.
Once you’re done configuring this, your package is ready to run (shown in below screen shot). This task is going to update the Control Table and set the last load date as the time the package started, so the next time the package runs it will only get current values. The first time you run the package, you should see three rows go through the data flow. Subsequent runs should show zero rows but try also adding a row to the SourceTable table and running the package over again. If configured correctly, only the single row will go through. You’ll also see that the control table is constantly being updated with a new LastLoadDate column.