Joining Data Using tMap, Hierarchical Joins | Talend
Joining data using tMap
tMap is more powerful in terms of functionality.
1. tMap can have many outputs links.
2. With tMap we can use the expression on the columns while providing the joining condition.
3. In tMap we have option to store the intermediate data in the disc.
4. In tMap, we can enable the option to reload the look-up for every record.
5. tMap supports more types of join model, includes unique join, first join and all join.
6. tMap allows you to link multiple look-up flows into it, and supports to load multiple look-up flows parallel.
7. tMap supports ‘die on error’ option.
Hence tMap is quite powerful component. Being a powerful component, it generates more code and it may take more space and time to load in the memory.
Open the job jo_cook_ch04_0070_tMapJoin.
How to do it…
- Right-click tFileInputDelimited. Go to Row | Main and connect it to tMap_1. Change the name of the flow to order.
- Open tMap, and you should see two input tables: customer and order.
- Select the customerId field from the customer table and drag it to the customerId Expr. key in the order table.
- You will see a purple key icon and a flow showing the linked fields.
- Type “Card” into the key field for orderType.
- Drag all the order fields apart from customerId to the output. Your tMap should now look like the following screenshot:
- Close tMap and run the job.
- You will see that there is a single row for each customer, and many of the fields are null.
- Re-open the tMap, and click tMap settings for the input flow order.
- Change Match Model to All matches and Join Model to Inner Join.
- Close tMap and run the job. You will see that only the rows that have an orderType of card have output, but there are now multiple records per customer.
- Add a new output to tMap and rename it to notMatched.
- Drag all the customer fields into the new output.
- Click tMap settings, and set Catch lookup inner join reject to true.
- Close tMap and add another tLogRow. Select tLogRow mode of Table (print values in cells of a table).
- Join the notMatched flow from tMap to the new tLogRow and run the job.
- You should now see two tables: one containing all Card transactions for customers and another showing all customers who have no Card transactions.
How it works…
tMap allows for different join types to be defined using expressions as keys. In this example, we used a variable from the main flow, plus a constant (“Card”) as our join keys.
The first execution of the job performed a left outer join, so all input records are output and non-matched fields are set to null (or default value if they are Java primitives). In addition, the first execution also specified to use only a unique match, thus printing out only one row per customer.
The second execution, however, specified that we wanted to do an inner join with all matches, so the output contained all orders where the customer paid with a credit card.
In the second execution, we also defined a second output that caught all the rows from the main flow that did not have any matches to the lookup.
This recipe illustrates the main features of joining using tMap, but only joins one table to another. It is also possible to join the same table with many other different keys from many lookups in a single tMap.
The next two recipes will show some examples of this.
The lookups are processed slightly earlier than the main flow. Due to the small volumes of data in this recipe, it isn’t apparent, but if you replace the file for tFileInputDelimited_2 with chapter04_jo_0080_orderData_large.csv, then this will become very apparent.
You will see that tMap loads the lookup data into memory tables at the start of the job before it begins processing the main data flow.
For batch data integration jobs, this is an efficient method, since it reduces the lookup time per transaction on the main flow, however, in the recipe Using reload at each row to process real-time/near real-time data, we will see how this method is not appropriate for small volume, real-time or near real-time data.
Also, be aware that in order to process large lookups, you will need to ensure that you have enough memory available and allocated to hold all the lookup data. If not, then the process will return out of memory errors.
Hierarchical joins using tMap
tMap has another level of joining capability, in that it can join together data in a hierarchical fashion. This simple example shows how easily this can be achieved using tMap.
Open the job jo_cook_ch04_0080_hierarchicaltMapJoin.
How to accomplish it…
- Open the tMap. You will see three input tables.
- Select customerId from the customer table and drag it into Expr.key of the customerId in the order table.
- You will see that a join link, a purple key symbol has been added to the column.
- Change the tMap settings for the order table to Inner Join and All Matches
- Now, select orderId from the order table and drag it to orderId in the orderItem table.
- Change the tMap settings for the orderItem table to Inner Join and All Matches. Exit tMap and run the job.
- You should see a printed table containing denormalized customer/order/orderitem rows.
How it works…
This job works on the hierarchy that exists between customer, order, and order item. A customer has many orders and an order has many order items.
The key for orders is the customer, and the key for order items is order. Thus, to get all the order items for a customer, it is necessary to first find the keys for all the orders, and then find all the order items that match the order keys.
As you can see, tMap allows this relationship to be defined easily, simply by dragging the relevant parent key to the child structure.