DATABASE SESSIONS allow the developer to control how and when the data is committed to a database. This recipe shows how this is achieved in Talend. Many management options are available for database connections including editing and duplicating the connection or adding a task to it.
The sections below explain in detail these management options.
Getting ready
Open the job jo_cook_ch07_0060_databaseSession.
On inspection, you will see that the job has been set up to commit after each record that has been written.
How to do it…
The steps to be performed are as follows:
How it works…
The tMysqlConnection component establishes a connection and begins a session. When you select Use an existing connection in an output component, it adds the component to the session, thus ensuring that the records written require an explicit commit. The commit component will commit the writes to the database and close the session.
Executions
Multiple outputs
Don’t forget the commit
A common beginner’s error is to assume that the connection simply shortcuts the need for manually setting up a component, which is true, but it also begins a database session. So, if you do not add a commit component, you will not get any data written to the database.
Committing but not closing
The commit component is automatically set to close a session. If you wish to commit, but keep a session open, then the tMysqlCommit component has an option to enable the session to be kept open after the commit.
Passing a value from the parent Job to the child Job is a common real-world requirement.
Related Article: Talend Interview Questions |
Environment
This procedure was written with:
The following environment was used to create the suggested procedure.
Getting ready
Open the Job
jo_cook_ch07_0070_databaseSessionParent
which is the same as the completed version from the previous recipe, but with the main process replaced with a child job. On inspection, you should see that the child job has a connection set up and it is the same connection as the parent's job.
How to achieve it…
The steps to be performed are as follows:
How it works…
The tMysqlConnection component establishes a connection and begins a session in the parent job, as does tMysqlConnection in the child job. The problem in this scenario is that they both create individual sessions, so that when we run the parent, no records are committed to the database despite records being written by the child.
When we define a shared connection in the parent of “cookbook”, the session information then becomes available as a session in the child if we choose to use it, and in this example, we do so by using the shared connection registered by the parent.
So the connection for the parent and child are now using the same session and when the commit is executed, the records added by the child are also committed.
Many applications will write to/delete from the same table in many different ways, using different fields as keys and often updating different fields at different times. This recipe shows how this can be achieved without having to create new schemas each time.
Related Article: Talend Database Tutorial |
Getting ready
Open the job jo_cook_ch07_0080_fieldSelection.
How to achieve it…
The steps to be performed are as follows:
How it works…
First, we set the insert method to Insert or update. This allows us to write to new and existing records using the same component.
The first execution is an insert, so that the createdDate and createdBy columns are populated for the record in the table and the updatedDate and updatedBy columns are null.
Any subsequent write to this table for the given key is an update, so this will leave the createdDate fields as they were set when first inserted and now populates the updatedDate and updatedBy columns and the new value of customerName.
There’s more…
This is a great method for ensuring that pre-defined schemas are used within the jobs, thus encouraging re-usability and traceability, and also allows us to update rows using a different set of key values depending upon the data we have at hand.
Updating
Any key may be used to perform an update, not just the primary key for the table, since Talend will be creating a SQL WHERE clause under the covers. You simply need to select the fields that you wish to use as the key in the column, Update key within the Field options section.
Deleting
You should also notice that there is a column in the list of fields for defining the deletion key. The same method applies to deleting rows as for update; however, the column Deletion key should be used instead in the Advanced settings tab and the Action on data set to Delete in the Basic settings.
Many database applications require a log of rejects and errors to be kept to enable erroneous records to be identified, so that manual repairs can take place. This recipe shows how to capture errors for individual rows using the reject row facility.
Getting ready
Open the job jo_cook_ch07_0090_rejectsAndErrors.
How to achieve it…
The steps to be performed are as follows:
How it works…
There’s more…
There are some more points to take a glance:
Die on error
In addition to having any bulk insert methods deactivated, the option to Die on error must also be deactivated in order for rejects to be captured using this method.
Efficiency
The ability to reject rows without killing the job is incredibly useful and does make for simpler code and error management, but does come at a cost; namely, the rows must be inserted one at a time.
Inserting rows one at a time is nowhere near as efficient as using bulk insertion methods or by specifying block sizes, so you may find that for some databases there is a trade-off between loading speeds and error reporting and management.
Error management
Having individual rows being rejected makes fixing a problem much simpler since we have a one-to-one match between the error for a single reject row.
In contrast, the use of batch insert methods, such as the MySQL extended insert method will return one error but reject a whole batch of rows, both good and bad.
So this gives us slightly more of a headache, as when one row fails, the whole batch is rejected, giving us a situation of having good and bad records in a batch of rejects, which in turn forces us to create more complex methods of fixing them for a single reject.
This simple recipe shows how to execute database management and table-related commands.
Getting ready
Create a new job jo_cook_ch07_0100_tableManagement.
How to do it…
The steps to be performed are as follows:
"
CREATE TABLE `test_mysqlrow` ( `id` int(11) NOT NULL,
`data` varchar(45) DEFAULT NULL, PRIMARY KEY (`id`)
)
"
How it works…
The tMysqlRow component is the database equivalent of both tJava and tJavaRow, it can be used within a flow like tJavaRow or standalone like tJava.
That said, the tMysqlRow component is most commonly used standalone, like tJava, as in this case, where we create a table.
There’s more…
This simple example shows a single, isolated use of tMysqlRow. On most occasions, it is used prior to processing to create temporary tables or to drop constraints or indexes prior to bulk loading, and also after processing to remove temporary tables and restore constraints or indexes.
Note
tMysqlRow can also be used on a row-by-row basis to perform say inserts, but this is usually simpler to do, using tMySQLOutput.
Many application databases will use surrogate keys as a means of uniquely identifying rows in a table. As a result of this, it is often necessary to capture the surrogate key for the record after writing a record, so that any associated child elements will be able to reference the parent’s surrogate as a foreign key. This recipe shows one method of creating surrogate keys in a relation and later discusses a few more methods.
Getting ready
Open the job jo_cook_ch07_0110_surrogateKeys.
How to achieve it…
The steps to be performed are as follows:
"
SELECT COALESCE(MAX(customerId),0) FROM globalCustomer
"
globalMap.put("maxCustomerId",input_row.maxCustomerId);
System.out.println("Max customer id = "
+ globalMap.get("maxCustomerId"));
Numeric.sequence("customer",((Integer)
globalMap.get("maxCustomerId"))+1,1)
Numeric.sequence("order",((Integer)
globalMap.get("maxOrderId"))+1,1)
How it works…
The unique keys are created through the following two stages:
The first stage is performed in tMysqlInput, and that is to capture the maximum value for the key in the customer3NF and order3NF table. These values are then written to the globalMap variables for later use.
Now that we know the highest key value in the customer3NF table, the second stage is to create a sequence that starts at the maximum value plus one and use the result of the sequence as the surrogate key for each input in the customer row.
We then write the customer data to the customer3NF table, then read the order data and in tMap, join to the customer table on the natural key(customerURN and source), so that we can get the customer surrogate for use as the foreign key for the order data.
In tMap, we also generate the order surrogate and then write the completed order row to the database.
There’s more…
The method shown is very efficient for batch processing or data migrations where we can blitz a table at a time. There is one note of caution, we need to be certain that no other job or process is writing to the same table using the same method at the same time. If we cannot be certain, then we should use an alternative method.
Added efficiency using hashMap key table
This method can be made more efficient by not re-reading the globalCustomer table. Provided you have enough memory, it is better to copy the surrogate key (generated) and natural keys (customerURN and source) into a hashMap. This hashMap can then be used as a lookup, avoiding the need to reload the customer table in its entirety from the database, instead of simply reading data already stored in memory.
Ranges
A slight variation in the preceding method is to earmark a range of values prior to writing. This can be achieved by writing a temporary marker record into the table with a key of the current maximum value plus the number of records to be written plus one (or a given range plus one). This means that a second process will start its load at the top of the range, thus avoiding any overlap. Remember, though, to remove the marker record at the end of the writes for the range.
Sequences
Sequences are database objects that will return a unique value every time they are read. To generate surrogate keys, the process simply adds a lookup to the main flow that will select the next sequence value for the table and use the returned lookup.
Note
Note that the lookup must be made to reload at each row to ensure each row gets a value.
This method is good for both real-time and batch updates, however, not all database versions support sequences.
It is slightly less efficient than the Talend generated sequence method, but this method does have the advantage of being usable by many processes at the same time. It is, therefore, the best method to use in real-time/web service-based environment.
Autoincrement keys
Some databases will allow fields that will automatically generate a new one-up number whenever a record is written to the table. If the key field is set as autoincrement, then we do not need to generate a key, we simply need to write the record to a table and the database will do the rest.
This method, however, does have a downside, in that we need to re-read the table using the natural key to find the database-created surrogate.
The LastInsertId component
Autoincrement procedure
The final option in this list is to use an auto-increment key field but to write the record to the database via a stored procedure. The procedure can then write the record and capture the last value written. This has the advantage of working with most databases, but the disadvantage of having to maintain a stored procedure/function.
Related Article: Talend Tutorial for Beginners |
The tHash components are great for storing intermediate data in memory and are very efficient, but do not allow updates. Database tables allow updates, but aren’t as efficient when writing and reading data on a row-by-row basis, especially when there are large numbers of rows to be processed.
This recipe shows how we can get the best of both worlds using a feature of the HSQL database that allows us to define databases that only reside in memory for the given process.
Open the new job jo_cook_ch07_0120_inProcessDatabase.
How to achieve it…
The first thing we need to do is to create a memory copy of the current MySQL country table, by copying the previous execution position from the persistent table in MySQL:
How it works…
The key features of the job are as described in the following section.
In-memory components
All of the HSQLDB components are set up in the same way and use the same database (inmem).
Initialize the data
The data is initialized from the persistent copy of the country table.
tMap
There are a few key features to be noted within tMap:
Both the lookups are reloaded at each row. This is to ensure that any in-flight changes are made available immediately after the update.
We have a lookup for the maximum value of the ID in the country table. This is used when we aren’t able to find a country and need a new key, which will be the highest current ID plus one.
So when writing to the output, we either copy the country key (if the country is found) or we copy the ID for a new country record (maximum ID value plus one).
This is achieved using the code:
country.id == null ? maxCountryId.id + 1 : country.id.
When a new country is found, then we create the new ID using max ID plus one and write it to the new country flow. This flow is used to insert the new row into the inline table.
Write back
Finally, we need to copy the new version of the table to the persistent country table ready for the next execution.
There’s more…
This method is most useful when you need to refer values in a table, but where those values are likely to change during the course of a single execution.
This example is a fairly typical application of this method and while it is possible to achieve the same results using other techniques, this method is by far the simplest to understand and to implement.
It is also incredibly valuable when large numbers of records are to be processed, which would normally mean large numbers of individual reads in the database, which will be very slow.
If the in-memory table is very large then consider using the reload at each row method with a key as detailed in the recipe Using reload at each row to process real-time/near real-time data, in MAPPING DATA.
Explore TALEND Sample Resumes! Download & Edit, Get Noticed by Top Employers! |
Memory
As with all memory storage techniques, ensure that you have enough memory to hold all of the reference tables, before and after the execution. Ensuring that you only store the fields that are required in memory that will allow you to fit a large number of records in the memory.
Name | Dates | |
---|---|---|
Talend Training | Sep 17 to Oct 02 | View Details |
Talend Training | Sep 21 to Oct 06 | View Details |
Talend Training | Sep 24 to Oct 09 | View Details |
Talend Training | Sep 28 to Oct 13 | View Details |
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.