Data Loading

This section focuses on data loading. Many of the same techniques presented in the data extraction section apply here too, so the focus is on areas that have not been covered before.

Database Snapshots

Database snapshots were introduced as a way to persist the state of a database at a specific point in time. The underlying technology is referred to as copy-on-first-write, which is a fancy way of saying that once you create the database snapshot, it is relatively cheap to maintain because it only tracks things that have changed since the database snapshot was created. Once you have created a database snapshot, you can change the primary database in any way — for instance, changing rows, creating indexes, and dropping tables. If at any stage you want to revert all your changes back to when you created the database snapshot, you can do that very easily by doing a database restore using the database snapshot as the media source.

In concept, the technology sounds very similar to backup and restore, the key difference being that this is a completely online operation, and depending on your data loads, the operations can be near instantaneous. This is because when you create the snapshot, it is a metadata operation only — you do not physically “back up” any data. When you “restore” the database from the snapshot, you do not restore all the data; rather, you restore only what has changed in the interim period.

This technique proves very useful in ETL when you want to prototype any data changes. You can create a package that makes any data changes you like, confident in the knowledge that you can easily roll back the database to a clean state in a short amount of time. Of course, you could achieve the same goals using backup and restore (or transactional semantics), but those methods typically have more overhead and/or take more time. Snapshots may also be a useful tool in operational ETL; you can imagine a scenario whereby a snapshot is taken before an ETL load and then if there are any problems, the data changes can be easily rolled back.

There is a performance overhead to using snapshots, because you can think of them as a “live” backup. Any activity on the source database incurs activity on the snapshot database, because the first change to any database page causes that page to be copied to the database snapshot. Any subsequent changes to the same page do not cause further copy operations but still have some overhead due to the writing to both source and snapshot. You need to test the performance overhead in the solutions you create, though you should expect to see an overhead of anywhere from 5 percent to 20 percent.

Because you are writing data to the destination database in this section, it is useful to create a database snapshot so you can roll back your changes very easily. Run this complete script:

–Use a snapshot to make it simple to rollback the DML
USE master;
–To create a snapshot you need to close all other connections on the
–Check if there is already a snapshot on this DB
IF EXISTS (SELECT [name] FROM sys.databases
WHERE [name] = N’AdventureWorksDW_Snapshot’) BEGIN
–If so RESTORE the database from the snapshot
FROM DATABASE_SNAPSHOT = N’AdventureWorksDW_Snapshot’;
–If there were no errors, drop the snapshot
IF @@error = 0 DROP DATABASE [AdventureWorksDW_Snapshot];
END; –if
–OK, let’s create a new snapshot on the DB
CREATE DATABASE [AdventureWorksDW_Snapshot] ON (
NAME = N’AdventureWorksDW_Data’,
–Make sure you specify a valid location for the snapshot file here
FILENAME = N’c:\ProSSIS\Data\Ch13\AdventureWorksDW_Data.ss’)
AS SNAPSHOT OF [AdventureWorksDW];

The script should take only a couple of seconds to run. It creates a database file in the specified folder that it tagged as being a snapshot of the AdventureWorksDW database. You can run the following command to list all the database snapshots on the server:

–List database snapshots
d.[name] AS DatabaseName,
s.[name] AS SnapshotName
FROM sys.databases AS s
INNER JOIN sys.databases AS d
ON (s.source_database_id = d.database_id);

You should now have a snapshot called “AdventureWorksDW_Snapshot.” This snapshot is your “live backup” of AdventureWorksDW. Once you have ensured that the database snapshot is in place, test its functionality by changing some data or metadata in AdventureWorksDW. For instance, you can create a new table in the database and insert a few rows:

–Create a new table and add some rows
USE AdventureWorksDW;
CREATE TABLE dbo.TableToTestSnapshot(ID INT);

You can confirm the table is present in the database by running this statement. You should get back three rows:

–Confirm the table exists and has rows
SELECT * FROM dbo.TableToTestSnapshot;

Now you can test the snapshot rollback functionality. Imagine that the change you made to the database had much more impact than just creating a new table (perhaps you dropped the complete sales transaction table, for instance) and you now want to roll the changes back. Execute the same script that you used to originally create the snapshot; you will notice that the script includes a check to ensure that the snapshot exists; then, if so, it issues a RESTORE … FROM DATABASE_SNAPSHOT command.

After running the script, try running the SELECT command again that returned the three rows. You should get an error saying the table “TableToTestSnapshot” does not exist. This is good news; the database has been restored to its previous state! Of course, this same logic applies whether you created a table or dropped one, added or deleted rows, or performed just about any other operation. The really cool benefit is that it should have taken only a couple of seconds to run this “live restore.”

As part of the original snapshot script, the database was rolled back, but the script should also have created a new snapshot in the old one’s place. Make sure the snapshot is present before continuing with the next sections, because you want to make it simple to roll back any changes you make.

Not only can you use database snapshots for prototyping, you can add tasks to your regularly occurring ETL jobs to create the snapshots — and even restore them if needed! A solution that uses this methodology is robust enough to correct its own mistakes and can be part of an enterprise data warehouse solution.

The MERGE Operator

If your source data table is conveniently partitioned into data you want to insert, data you want to delete, and data you want to update, then it is simple to use the INSERT, UPDATE, and DELETE statements to perform the respective operations. However, it is often the case that the data is not presented to you in this format. More often than not you have a source system with a range of data that needs to be loaded, but you have no way of distinguishing which rows should be applied in which way. The source contains a mix of new, updated, and unchanged rows.

One way you can solve this problem is to build logic that compares each incoming row with the destination table, using a Lookup Transformation (see Joining Data Topic for more information). Another way to do this would be to use Change Data Capture (see Incremental Loads in SSIS Topic for more information) to tell you explicitly which rows and columns were changed, and in what way. 

There are many other ways of doing this too, but if none of these methods are suitable, you have an alternative, which comes in the form of the T-SQL operator called MERGE (also known in some circles as “upsert” because of its mixed Update/Insert behavior).

The MERGE statement is similar in usage to the INSERT, UPDATE, and DELETE statements; however, it is more useful in that it can perform all three of their duties within the same operation. Here is pseudocode to represent how it works; after this you will delve into the real syntax and try some examples:

MERGE INTO Destination
Using these semantics:
all actions optional>
If a row in the Destination matches a row in the Source then: UPDATE
If a row exists in the Source but not in the Destination then: INSERT
If a row exists in the Destination but not in the Source then: DELETE
FROM Source;

As you can see, you can issue a single statement to SQL Server, and it is able to figure out on a row-by-row basis which rows should be INSERT-ed, UPDATEed, and DELETE-ed in the destination. This can provide a huge time savings compared to doing it the old way: issuing two or three separate statements to achieve the same goal. Note that SQL Server is not just cleverly rewriting the MERGE query back into INSERT and UPDATE statements behind the scenes; this functionality is a DML primitive deep within the SQL core engine, and as such it is highly efficient.

Now you are going to apply this knowledge to a real set of tables. In the extraction section of this Using the Relational Engine Topic you used customer data from AdventureWorks and compared it to data in AdventureWorksDW. There were some rows that occurred in both tables, some rows that were only in the source, and some rows that were only in the destination. You will now use MERGE to synchronize the rows from AdventureWorks to AdventureWorksDW so that both tables contain the same data.

This is not a real-world scenario because you would not typically write rows directly from the source to the destination without cleaning and shaping the data in an ETL tool like SSIS, but for the sake of convenience the example demonstrates the concepts.

First, you need to add a new column to the destination table so you can see what happens after you run the statement. This is not something you would need to do in the real solution.

USE AdventureWorksDW;
–Add a column to the destination table to help us track what
–You would not do this in a real solution, this just helps the
ALTER TABLE dbo.DimCustomer ADD Operation NVARCHAR(10);

Now you can run the MERGE statement. The code is commented to explain what it does. The destination data is updated from the source in the manner specified by the various options. There are blank lines between each main section of the command to improve readability, but keep in mind that this is a single statement:

USE AdventureWorksDW;
–Merge rows from source into the destination
–Define the destination table
INTO AdventureWorksDW.dbo.DimCustomer AS [Dest] –Friendly alias
–Define the source query
SELECT AccountNumber AS CustomerAlternateKey,
–Keep example simple by using just a few data columns
FROM AdventureWorks.Sales.Customer c
INNER JOIN AdventureWorks.Person.Person p on
) AS [Source] –Friendly alias
–Define the join criteria (how SQL matches source/destination rows)
ON [Dest].CustomerAlternateKey = [Source].CustomerAlternateKey
–If the same key is found in both the source & destination
–For *illustration* purposes, only update every second row
–AND CustomerAlternateKey % 2 = 0
–Then update data values in the destination
[Dest].FirstName = [Source].FirstName ,
[Dest].LastName = [Source].LastName,
[Dest].Operation = N’Updated’
–Note: <WHERE CustomerAlternateKey =…> clause is implicit
–If a key is in the source but not in the destination
–Then insert row into the destination
GeographyKey, CustomerAlternateKey, FirstName,
LastName, DateFirstPurchase, Operation
1, [Source].CustomerAlternateKey, [Source].FirstName,
[Source].LastName, GETDATE(), N’Inserted’
–If a key is in the destination but not in the source…
–Then do something relevant, say, flagging a status field
[Dest].Operation = N’Deleted’;
–Note: <WHERE ContactID =…> clause is implicit
–Alternatively you could have deleted the destination row
–but in AdventureWorksDW that would fail due to FK constraints

After running the statement, you should get a message in the query output pane telling you how many rows were affected:

(19119 row(s) affected)

You can now check the results of the operation by looking at the data in the destination table. If you scroll through the results you should see each row’s Operation column populated with the operation that was applied to it:

–Have a look at the results
SELECT CustomerAlternateKey, DateFirstPurchase, Operation
FROM AdventureWorksDW.dbo.DimCustomer;

Here is a subset of the results. For clarity, the different groups of rows have been separated in this Tutorial by blank lines:

different groups of rows have been separated in this book by blank linesAs you can see, a single MERGE statement has inserted, updated, and deleted rows in the destination in the context of just one operation. The reason why some of the updates could show a NULL operation is if a predicate was used in the WHEN MATCHED section to only UPDATE every second row.

Note that the source query can retrieve data from a different database (as per the example); furthermore, it can even retrieve data using the OPENROWSET() function you read about earlier. However, MERGE requires sorting the source data stream on the join key; SQL Server will automatically sort the source data for you if required, so ensure that the appropriate indexes are in place for a more optimal experience. These indexes should be on the join key columns. Do not confuse this operator with the Merge Join Transformation in SSIS.

If the source query happens to be of the form OPENROWSET(BULK…) — in other words, you are reading from a text file — then make sure you have specified any intrinsic sort order that the text file may already have. If the text file is already sorted in the same manner as the order required for MERGE (or you can ask the source extract system to do so), then SQL is smart enough to not incur a redundant sort operation.

The MERGE operator is a very powerful technique for improving mixed- operation data loads, but how do you use it in the context of SSIS?

If you do not have the benefit of Change Data Capture (discussed in Incremental Loads in SSIS Topic) and the data sizes are too large to use the Lookup Transformation in an efficient manner (see Joining Data Topic), then you may have to extract your data from the source, clean and shape it in SSIS, and then dump the results to a staging table in SQL Server. From the staging table, you now need to apply the rows against the true destination table. You could certainly do this using two or three separate INSERT, UPDATE, and DELETE statements — with each statement JOIN-ing the staging table and the destination table together in order to compare the respective row and column values. However, you can now use a MERGE statement instead. The MERGE operation is more efficient than running the separate statements, and it is more intentional and elegant to develop and maintain. This is also more efficient with larger data sets than the SCD wizard and its OLE DB Command Transformation approach.

Make sure you execute the original snapshot script again in order to undo the changes you made in the destination database.

0 Responses on Data Loading"

Leave a Message

Your email address will not be published. Required fields are marked *

Copy Rights Reserved © Mindmajix.com All rights reserved. Disclaimer.