Error outputs can obviously be used to improve reliability, but they also have an important part to play in terms of scalability as well. From a reliability perspective, they are a critical feature for coping with bad data. An appropriately configured component will direct failing rows down the error output, rather than the main output. After being removed from the main Data Flow path, these rows may receive additional treatment and cleansing to enable them to be recovered and merged back into the main flow. They can be merged explicitly, such as with a Union All Transformation, or implicitly through a second adapter directed at the final destination. Alternatively, they could be discarded. Be careful of discarding rows entirely; you may find out you need them later! More often, rows on the error output are logged and dealt with later.
WARNING The error outputs are not always accurate in removing just the bad rows. In the scenario of using delimited flat files as a source, you may have rows missing delimiters. This can cause havoc with the data written to the error output and the expected good data.
The capability to recover rows is perhaps the most useful course of action. If a data item is missing in the source extract but required in the final destination, the error flow path can be used to fix this. If the data item is available from a secondary system, then a lookup could be used. If the data item is not available elsewhere, then perhaps a default value could be used instead.
In other situations, the data may be out of range for the process or destination. If the data causes an integrity violation, then the failed data could be used to populate the constraining reference with new values, and then the data itself could be successfully processed. If a data type conflict occurs, then maybe a simple truncation would suffice, or an additional set of logic could be applied to try to detect the real value, such as with data time values held in strings. The data could then be converted into the required format.
When assumptions or fixes have been made to data in this way, it is best practice to always mark rows as having been manipulated; that way, if additional information becomes available later, they can be targeted directly. In addition, whenever an assumption is made, it should be clearly identified as such to the end user.
All the scenarios described here revolve around trying to recover from poor data, within the pipeline and the current session, allowing processing to continue, and ideally fixing the problem such that data is recovered. This is a new concept compared with DTS and several other products, but the ability to fix errors in real time is a very valuable option that you should always consider when building solutions.
The obvious question that then occurs is this: Why not include the additional transformations used to correct and cleanse the data in the main Data Flow path, so that any problems are dealt with before they cause an error? This would mean that all data flows down a single path, and the overall Data Flow design may be simpler, with no branching and merging flows. This is where the scalability factor should enter into your solution design. Ideally, you want to build the simplest Data Flow Task possible, using as few transformations as possible. The less work you perform, the greater the performance and therefore scalability.
Below screenshot illustrates a Data Flow Task used to load some data. In this contrived scenario, some of the rows will be missing values for SpecialtyCode and ConsultantCode. The source data contains text descriptions as well, so these are being used to perform a lookup to retrieve the missing values. The initial design logic specifies evaluating the column for NULL values in a Conditional Split Transformation. Bad rows are directed to an alternate output that connects to the Lookup Transformation. Once the lookup has populated the missing value, the rows are then fed back into the main pipeline through the Union All Transformation. The same pattern is followed for the SpecialtyCode and ConsultantCode columns, ensuring that the final insert through the OLE DB Destination has all good data. This is the base design for solving your problem, and it follows the procedural logic quite closely.
Figure below shows two alternative Data Flow Task designs, presented side by side for easy comparison. In the first design, you disregard any existing data in the SpecialtyCode and ConsultantCode columns and populate them entirely through the lookup. Although this may seem like wasted effort, the overall design is simpler, and in testing it was slightly faster compared to the more complicated design in Figure 15-32. This example used a test data set that had a bad row ratio of 1 in 3 — that is, one row in three had missing values. If the ratio dropped to 1 in 6 for the bad rows, then the two methods performed the same.
The second design assumes that all data is good until proven otherwise, so you insert directly into the destination. Rows that fail because of the missing values pass down the error output, “OLE DB Destination Error Output,” and are then processed through the two lookups. The choice between the two designs is whether you fix all rows or only those that fail. Using the 1 in 3 bad rows test data, fixing only the failed rows was 20 percent faster than fixing all rows. When the bad row ratio dropped to 1 in 6, the performance gain also dropped, to only 10 percent.
As demonstrated by the preceding examples, the decision regarding where to include the corrective transformations is based on the ratio of good rows to bad rows, as compared to how much work is required to validate the quality of the data. The cost of fixing the data should be excluded if possible, as that is required regardless of the design, but often the two are inseparable.
The performance characteristics of the corrective transformations should also be considered. In the preceding examples, you used lookups, which are inherently expensive transformations. The test data and lookup reference data included only six distinct values to minimize the impact on overall testing. Lookups with more distinct values, and higher cardinality, will be more expensive, as the caching becomes less effective and itself consumes more resources.
In summary, the more expensive the verification, the more bad rows you require to justify adding the validation and fix to the main flow. For fewer bad rows, or a more expensive validation procedure, you have increased justification for keeping the main flow simple and for using the error flow to perform the corrective work.
The overall number of rows should also influence your design, because any advantages or disadvantages are amplified with a greater number of rows, regardless of the ratio. For a smaller number of rows, the fixed costs may outweigh the implied benefits, as any component has a cost to manage at runtime, so a more complicated Data Flow Task may not be worthwhile with fewer overall rows.
This concept of using error flows versus the main flow to correct data quality issues and related errors is not confined to those outputs that implement the error output explicitly. You can apply the same logic manually, primarily through the use of the Conditional Split Transformation, as shown in the first example. You can perform a simple test to detect any potential issues, and direct rows of differing quality down different outputs. Where expensive operations are required, your goal is to ensure that as few rows as possible follow this path and that the majority of the rows follow cheaper, and usually simpler, routes to their destination.
Finally, don’t be put off by the term error output; it isn’t something to be avoided at all costs. Component developers often take advantage of the rich underlying pipeline architecture, using error outputs as a simple way of indicating the result of a transformation for a given row. They don’t affect the overall success or failure state of a package, so don’t be put off from using them.
Of course, the performance figures quoted here are for demonstration purposes only. They illustrate the differences between the methods described, but they should not be taken as literal values that you can expect to reproduce, unless you’re using exactly the same design, data, and environment. The key point is that testing such scenarios should be a routine part of your development practice.
Get Updates on Tech posts, Interview & Certification questions and training schedules