Contrasting SSIS and The Relational Join
Though the methods and syntax you employ in the relational and SSIS worlds may differ, joining multiple row sets together using congruent keys achieves the same desired result. In the relational database world, the equivalent of a Lookup is accomplished by joining two or more tables together using declarative syntax that executes in a set-based manner. The operation remains close to the data at all times; there is typically no need to move the data out-of-process with respect to the database engine as long as the databases are on the same SQL Server instance (except when joining across databases, though this is usually a nonoptimal operation). When joining tables within the same database, the engine can take advantage of multiple different internal algorithms, knowledge of table statistics, cardinality, temporary storage, cost-based plans, and the benefit of many years of ongoing research and code optimization. Operations can still complete in a resource constrained environment because the platform has many intrinsic functions and operators that simplify multi-step operations, such as implicit parallelism, paging, sorting, and hashing.
In a cost-based optimization database system, the end-user experience is typically transparent; the declarative SQL syntax abstracts the underlying relational machinations such that the user may not in fact know how the problem was solved by the engine. In other words, the engine is capable of transforming a problem statement as defined by the user into an internal form that can be optimized into one of many solution sets — transparently. The end-user experience is usually synchronous and nonblocking; results are materialized in a streaming manner, with the engine effecting the highest degree of parallelism possible.
The operation is atomic in that once a join is specified, the operation either completes or fails in total — there are no substeps that can succeed or fail in a way the user would experience independently. Furthermore, it is not possible to receive two result sets from the query at the same time — for instance, if you specified a left join, then you could not direct the matches to go one direction and the nonmatches somewhere else.
Advanced algorithms allow efficient caching of multiple joins using the same tables — for instance, round-robin read-ahead enables separate T-SQL statements (using the same base tables) to utilize the same caches.
The following relational query joins two tables from the AdventureWorksDW database together. Notice how you join only two tables at a time, using declarative syntax, with particular attention being paid to specification of the join columns:
select sc.EnglishProductSubcategoryName, p.EnglishProductName
from dbo.DimProductSubcategory sc
inner join dbo.DimProduct p
on sc.ProductSubcategoryKey = p.ProductSubcategoryKey;
For reference purposes, Below Screen shoot shows the plan that SQL Server chooses to execute this join.
In SSIS, the data is usually joined using a Lookup Transformation on a buffered basis. The Merge Join Transformation can also be used, though it was designed to solve a different class of patterns. The calculus/algebra for these components is deterministic; the configuration that the user supplies is directly utilized by the engine — in other words, there is no opportunity for the platform to make any intelligent choices based on statistics, cost, cardinality, or count. Furthermore, the data is loaded into out-of-process buffers (with respect to the database engine) and is then treated on a row-byrow manner; therefore, because this moves the data away from the source, you can expect performance and scale to be affected.
Any data moving through an SSIS Data Flow is loaded into memory in data buffers. A batch process is performed on the data in synchronous transformations. The asynchronous transformations, such as the Sort or Aggregate, still perform in batch, but all rows must be loaded into memory before they complete, and therefore bring the pipeline process to a halt. Other transformations, like the OLE DB Command Transformation, perform their work using a row-by-row approach.
The end-user experience is synchronous, though in the case of some modes of the Lookup Transformation the process is blocked while the cache loads in its entirety. Execution is nonatomic in that one of multiple phases of the process can succeed or fail independently. Furthermore, you can direct successful matches to flow out the Lookup Transformation to one consumer, the nonmatches to flow to a separate consumer, and the errors to a third.
Resource usage and performance compete: in Lookup’s full-cache mode —which is typically fastest with smaller data sets — the cache is acquired and then remains in memory until the process (package) terminates, and there are no implicit operators (sorting, hashing, and paging) to balance resource usage. In no-cache or partial-cache modes, the resource usage is initially lower because the cache is charged on the fly; however, overall performance will almost always be lower. The operation is explicitly parallel; individual packages scale out if and only if the developer intentionally created multiple pipelines and manually segmented the data. Even then, bringing the data back together with the Union All Transformation, which is partial blocking, can negate any performance enhancement. Benchmark testing your SSIS packages is necessary to determine the best approach.
There is no opportunity for the Lookup Transformation to implicitly perform in an SMP (or scale-out) manner. The same applies to the Merge Join Transformation — on suitable hardware it will run on a separate thread to other components, but it will not utilize multiple threads within itself.
Below the screen shoot shows an SSIS package that uses a Lookup Transformation to demonstrate the same functionality as the previous SQL statement. Notice how the Product table is pipelined directly into the Lookup Transformation, but the SubCategory table is referenced using properties on the component itself. It is interesting to compare this package with the query plan generated by SQL Server for the previous SQL query. Notice how in this case SQL Server chose to utilize a hash-join operation, which happens to coincide with the mechanics underlying the Lookup Transformation when used in full-cache mode. The explicit design chosen by the developer in SSIS corresponds almost exactly to the plan chosen by SQL Server to generate the same result set.
Below the screen shot shows the same functionality, this time built using a Merge Join Transformation. Notice how similar this looks to the SQL Server plan (though in truth the execution semantics are quite different).