Talend Database Tutorial
This segment contains recipes that show some of the techniques used to read and write data to databases. It also contains recipes that show techniques to manage tables within a database.
Many applications and organizations rely on databases to store their corporate data. These can be application databases to support operational systems, data warehouses, data marts, or ODS (Operational Data Store), and any data integration developer must understand how to manipulate the database objects and the data held within the database.
Talend usually connects to a database using JDBC, so it can connect to any data source for which there is a JDBC driver, which means that Talend can connect to all of the most popular databases and a host of less well-known ones too.
In order to perform the recipes, you need to ensure that you have a database installed.
You will also need a general working knowledge of databases and have a reasonable understanding of SQL.
It’s worth noting that there are many flavors of databases and while all act generally in the same manner, there are nuances and extensions that may exist in say, Oracle, but not in MySQL.
Setting up a database connection
If you often need to connect to database tables of any kind, then you may want to centralize the connection information details in the Metadata folder in the Repository tree view.
This recipe shows how a database connection can be easily created using the Talend supplied wizard. Note that this recipe and the next use the connection cookbookDB _myCopy. This version of the database metadata is used for demonstration only.
Ensure that MySQL is running and you have executed the command.
How to achieve it…
The steps to be performed are as follows:
- Navigate to Metadata | Db Connections. Right-click on it and select Create connection.
- Set the name as cookbookDB_myCopy and click on Next.
- Select the database as MySQL from the drop-down list and then enter the details as shown in the following screenshot:
- Click on the Check button and you should see the message: cookbookDB_myCopy connection successful.
- Next, click on the button Export as context.
- The context dialogue will appear.
- Change the name to CookbookDB_myCopy, click on Next then Finish.
- Click on Finish, then Yes to the modifications, then Ok to complete the setup.
Now that you have the required parameters for connection, you will need to update the context information in the databases cookbookDB and cookbookDBCompleted to ensure that the recipes in this section will complete successfully. You can do this by amending the details in the contexts for both connections.
Once you have changed the context variables. Go back to the metadata and check connections for both.
How it works…
This wizard allows you to easily build up the JDBC connection string for your MySQL database and check that you can connect to it.
The database information is then exported to a context of the same name.
Database connections are very important within Talend and can be a great time-saver.
Using the connection
If you drag the connection to an open job, you will be presented with a list of options to create components that are pre-populated with the job information.
If, however, you decide to use a component from the palette, then the database information can be populated in the component by dragging the connection and dropping it on the component. This should work for most of the components, but for those where it doesn’t, for example, tCreateTable, you need to open the component, change the Property Type to repository, and then select the connection.
Always create database connections
While you can work with all the database components without using a metadata connection, it is highly recommended that you set one up for every database you use. Shortest time taken to do this is easily recovered later on.
As with all Talend artifacts, careful naming can make life easier in the long term, so do check to see if you have naming standards regarding connections already defined.
If not, then it is recommended that the database connections be named exactly the same as the database/schema (including case) for easy identification. If you are using a schema within the database then include both, separated by an underscore, for example, FINANCE_SALES.
When developing jobs in Talend, it’s sometimes necessary to run them on different environments. For other business cases, you need to pass values between multiple sub-jobs in a project. To solve this kind of issues, Talend introduced the notion of “contexts”.
In the above recipe, you will notice that immediately after checking the connection, we exported the settings to a context. This is a good practice and worth doing immediately after the connection has been tested and is working, since it will allow the database settings to be defined as context parameters; a must when the code is promoted between environments.
Importing the table schemas
Once we have the connection set up, it is very easy to grab table metadata from the database for use in our jobs. The following example shows how simple this is.
If you have set up the database correctly, then you should be able to see three tables in the database; customer, order, and order_item.
How to accomplish it…
The next step is to import the database table schema for the desired table.
Select “Retrieve Schema” from the context menu of your database connection. You can skip any filter restrictions and proceed with “Next” to the selection of the table schema of the “” database table.
After selecting the schema for the table it can be imported with the “Finish” button and it will be available as a data input source for the job in the “Table schemas” of your “Db Connections”.
You should check and maybe adapt the column data types in the schema definition (especially for the data types of kind date or datetime) by means of editing the schema again (“Edit Schema” from the context menu).
Verify if the date parser format for your values from the database are correctly or change the „Date Pattern“ appropriately:
In our example, we have to change the „Date Pattern“ from the default value of “dd-MM-yyyy” to the data representation in our database table (as seen above): “yyyy-MM-dd”. The time part for datetime values will be passed directly and recognized without any special definition.
This screenshot shows the correctly adapted first Date Pattern for the column createdon. The second column has not been changed yet but also needs to be changed to the same value: “yyyy-MM-dd”.
Finally save the changed Schema settings.
How it works…
Talend uses the connection information to gather schema and table information from the database for presenting to the wizard. When we select the tables, Talend uses the information to create Talend schemas for use in our jobs.
As with database connections, schemas, when dragged onto the canvas, will provide us a host of options as to how we wish to use the schema; for example, as an input table or output table.
It is quicker and easier to work with schemas via the metadata panel than to try to hand-crank them from the palette, so do get into the habit of using the metadata panel for data sources and targets.
Using context and globalMap variables in SQL queries
It is possible to use parameters passed or created in a job to drive the results of a SQL query. This short recipe shows how.
Open the job jo_cook_ch07_0020_contextInQuery, which is a copy of the job from the previous recipe.
How to achieve it…
The steps to be performed are as follows:
- Open the tMysqlInput
- Change the final line to
HAVING SUM(o.`orderTotal`) > "+context.minOrderValue.
- Run the job, and you will be prompted for a value.
- Enter 10 and click on Ok.
- You should see that the result set contains orders where the sum of the lines is > 10.
How it works…
The SQL statement used by tMysqlInput is held in a string, so it can be manipulated just as any other string in Java. Thus, we can alter the value of the string using normal Java conventions; in this case, a concatenate (+).
The statement is thus translated at runtime and the value of the context is substituted into the SQL query, which returns the customized result set.
Following are some additional points to be noted regarding the use of parameterized queries.
The globalMap variables
globalMap variables can be used in exactly the same way as the context variable.
Developing the query
In most cases, it is better to test a query using fixed values in the database client and then substitute these for variables once the query has been proven.
Following are some other notes on developing queries with parameters:
- Be careful of spacing. If the query needs a space in place between items, then the space needs to exist in the query.
- String values need to be quoted in the query, for example, the code where name = ‘Fred’, would be parameterized as where name = ‘”+context.selectedName+”‘”.
- As usual, make small changes, then test, especially with SQL strings. In fact, it is wise to make changes to SQL that use parameters in isolation, since missed quotation marks will often result in a large number of compile-time errors.
Reloading at each row
This technique is absolutely essential for the reload at each row processing used in real time scenarios that is described in detail in Mapping Data