How to Manage Metadata using Talend Schema
This segment contains a detailed discussion about metadata and Talend schemas and recipes that highlight some of the less used / less known features associated with schemas, along with more commonly used features, such as generic and fixed schemas:
- Hand-cranking a built-in schema
- Propagating schema changes
- Creating a generic schema from existing metadata
- Cutting and pasting schema information
- Creating schemas from lists
Managing metadata is one of the most important aspects of developing Talend jobs, and the most common form of metadata used within Talend jobs is the schema.
The Talend repository supports database metadata. This allows you to define connections to databases and then import schema objects such as tables. Setting up this metadata allows you to test your database connectivity from within Talend, as well as making it convenient to work with databases within your Jobs.
For successful development of jobs, it is essential that the metadata defined in a data source accurately describes the format of its underlying data. Failure to correctly define the data will result in numerous errors and waste of time tracking down problems with data formats that could otherwise be avoided.
Talend provides a host of wizards for capturing metadata from a variety of data sources such as database tables, delimited files, and Excel worksheets and stores them within its built-in metadata repository.
A Schema define the inputs and outputs of your Jobs. Whether you’re reading or writing from Databases, flat files or other data sources, schemas define how data moves around your Job.
Any component that supports data flow provides a Schema Editor that allows you to see and define the schemas that are used by that component. Components may use schemas that are defined by Repository objects, and you may also use copy & paste to copy schema elements from one component to another.
Talend stores metadata definitions in schemas, which may be built into individual components or stored in its metadata repository, as shown in the following screenshot:
In general, it is best practice to define source and target metadata using a repository schema and mid-flow metadata as a Built-In schema.
The main exception to this rule is when dealing with one-off generated source data, such as a database query. Despite being a data source, it is easier to store the schemas for these custom queries as Built-In rather than cluttering the repository with single-use schemas.
Although you can manually define schemas for each of your components, a much easier approach is to define schemas is through Repository Metadata. This is especially important if you intend to use the same schema definition more than once in your Job, or use it in multiple Jobs.
Many of these repository definitions can use wizards that can not only read a source definition to define your schema; but can also sample data to determine attributes that cannot be obtained from the source definition alone. A good example of this is Repository->Metadata->File delimited.
You can also define Generic schemas, where a specific schema type is not available.
The benefits of using Repository schemas are:
- They can be re-used across multiple jobs, thus reducing the amount of re-keying.
- Talend will ensure that changes made to a Repository schema are cascaded to all jobs that use the schema, thus avoiding the need to scan jobs manually for Built-In schemas that need to be changed.
- Impact analysis reports can be generated showing where a Repository schema is being used within a project. This enables the impact of changes to be assessed more accurately when planning changes to any underlying data sources.
Any schema can be easily converted into a generic schema to enable it to be re-used.
Generic schemas aren’t tied to a particular source, so they can be used as a shared resource across multiple types of data source or they can be used to define data sources that are generated, such as the output from custom SQL queries.
How to do it…
From repository schema
Open repository schema
fd_cook_0020_customerDelimitedthat can be found in the delimited schemas, ensuring that you click the metadata, rather than the parent schema.
Right-click metadata, and then select copy to Generic schema. This creates a schema
- Move the new schema to the chapter 2 folder and double click it to edit it.
Change the name to
From a built-in schema
Open the Talend Job
jo_cook_ch02_0020_builtInSchemaand open the
Click the highlighted button, shown in the following screenshot:
Now create a new generic schema from the saved XML file by right-clicking Generic schemas, and selecting the option Create generic schema from xml:
- Select the XML file that was just saved, and click finish to create the new Generic schema
How it works…
Under the covers, Talend stores schemas in XML format files, regardless of the type of schema. This means that schemas can be easily converted between types, in this case between built-in and repository.
Schemas captured from a particular type of data source are stored in the metadata repository in a folder for that data type (for example, CSV file schemas are stored in the directory for delimited files).
There are, however instances where schemas will be shared across multiple types. For example, a CSV file and Excel file could be used to directly load a database table.
If you import the metadata from one of the sources, it will be stored in the folder for that source, which could make it hard to find.
By storing the schema as a Generic schema, it is more obvious that the schema isn’t used just for a single source.
Generated data sources
It is often necessary to perform a query against a database and return the result set to the Talend job. It is often the case that the same query is used multiple times in many jobs.
By storing the schema for the result set in a generic schema, it removes the tedious process of having to create the same schema over and over again manually every time the query is used.
Fixed schemas and columns
Some components, such as tLogCatcher, have predefined schemas that are read-only. These can be easily recognized due to the fact that the whole schema is gray.
You may also find that certain flows, for instance the reject flows, have fixed columns that have been added to the original schema. This is because Talend will add the errorCode and errorMessage fields to the schema to store the error information. These additional fields will be green to distinguish them as Talend fields.
Hand-cranking a built-in schema
In this recipe, we are presented with a CSV file that does not have a heading row and needs to create a schema for the data. This is a basic recipe with which most readers should be familiar: however, it does provide a framework for discussion of some of the more important principles of Talend schemas.
The record we will be defining is as follows:
John Smith,27/11/1990,2012-01-10 10:24:54.953
As you can see, this contains the fields; first name, last name, date of birth, timestamp, and age. Note that age is an empty string.
Open a new Talend Job (jo_cook_ch02_0000_handCrankedSchema), so that the right-hand palette becomes available.
How to do it…
- Drag a tFileInputDelimited component from the palette, and open it by double clicking it.
- Click the Edit Schema button (…), shown in the following screenshot, to open the schema editor:
- Click the + button to add a column:
- Type name into the column, and set the length to 50.
- Click the + button three more times to add three more columns.
- Type dateOfBirth into the second column, select a type of date, and set the date pattern to dd/MM/yyyy. Alternatively, press Ctrl+Space to open a list of common patterns and select this one.
- Type timestamp into the third column, select a type of date and set the date pattern to yyyy-MM-dd HH:mm:ss.SSS.
- Type age into the fourth column, set the type to Integer, tick the Null box, and set the length to 3. Your schema should now look like the following screenshot:
- Click OK to return to the component view.
Date patterns within Talend conform to the Java date format.
Date patterns are case sensitive in Java, and upper and lower case letters often have a very different function.
In the timestamp, there are MM and mm characters. These are the month and minute definitions and care should be taken to ensure that they are used correctly in the date and time portions of a date field.
Note also the ss and SSS fields. These are seconds and milliseconds. Again, care must be taken in their use within the time portion of a date.
HH and hh are also case sensitive. HH is the hour portion of a 24-hour timestamp, whereas hh is a 12-hour time.
All Talend data types have the potential to be set to null, but in some cases, this may result in a type change, as described in the following section.
Try removing the tick from the null box for age. You will notice that the type changes from Integer to int. This is because int is a primitive Java type that cannot be null, whereas for the Object type Integer, null is an acceptable value.
A good example of the use of int over Integer is when mandatory values are required for say a database table. If the field is set as int, a null value will cause an error to be thrown, highlighting either a data or job error.
The distinction between primitives and objects becomes more important as you use Talend and Java more frequently, because primitive types do not always act in the same way or have the same range of features as object types.
Talend will generally ignore field lengths in a schema, but that does not mean that they are unimportant. In fact, it is best practice to ensure that field lengths are complete and accurate for all schemas, especially database schemas.
When creating a temporary table in a database using Talend, all field lengths must be present for the DBMS to create the table. Failure to do so will result in job errors.
Most schemas will not require any keys; however, like field lengths, they become very important for database schemas.
Key fields are used during database update statements to match records to be updated. If the insert or update method is used to populate a table, then failure to specify the correct key(s) will result in a record being inserted rather than updated.
Often during development, it is necessary to change schemas by adding, removing, or re-ordering columns. This often is a very onerous task, especially if a schema is used in multiple jobs.
Storing schemas in the metadata enable the schema to be re-used. If a shared schema is changed, then Talend will prompt to find out if the changes should be applied to all jobs.
If the change is performed, then the next time that the job is opened, the component using the schema will normally be highlighted as in error, because the schema no longer matches.
Talend provides mechanisms within the schema dialogues that takes some of the pain away from ensuring that changes are assimilated into all the jobs.
How to do it…
- Open the Talend Job jo_cook_ch02_0010_propagateSchema so that the right-hand palette becomes available. Then, from the metadata palette, open the Generic schema sc_cook_0010_genericCustomer.
- Add a new field emailAddress, as shown in the following screenshot:
- Click Finish to save the change to the schema. Then, click Yes to apply the changes to all jobs when prompted.
- Click OK to accept the changes in the next dialogue box. You will now see that the job has an error in the output.
- Open the tFileOutputDelimited, and click the Edit Schema button to open the schema and select the View Schema option.
- As you can see in the following screenshot, the table on the left-hand side is different from that on the right-hand side. Click the to copy the right hand schema into the left-hand panel.
- Click OK to save the changes.
How it works…
When Talend updates the job schema for an output component, it does not propagate the change to the upstream component. Using the << option allows the developer to copy all the changes from the output schema back into the previous component, ready for a rule to be applied.
Using this method also ensures that the link to the Generic schema is maintained. It is possible to make the change in the previous tMap output; however, this would cause the output schema to become Built-in, which is an undesirable result.
In the preceding example, only one component is changed and the error is removed; however, in many jobs, this will not ensure that the changes are complete. It is a rarity to add fields only to then do nothing with them. Thus, it is often necessary to propagate the changed row forward through all components in a job to ensure it is copied to the output correctly or ensure that a field that has been reverse propagated is correctly populated from upstream data.
When adding new fields to an output, it is best to change the schema of the output and reverse propagate the new field, especially when using Repository schemas. The reason for this is that if the schema is changed using tMap, then Talend will automatically change the type of schema from repository to Built-In, thus breaking the link to the Repository schema. In most cases, this is not a desirable outcome.
Be careful during reverse propagation that field names have not changed, especially with the tMap outputs. If you change the name of a field and reverse propagate to tMap, then the rule will disappear and will need to be re-entered.
In these cases, it is worth changing the field names in the tMap output schema prior to reverse propagating a schema. Make sure that you choose not to propagate this change from tMap to avoid the output being changed to Built-in. This will cause the output file to be in error, but when the Repository schema change is applied, the schemas will match, and the error will disappear.
Creating a generic schema from the existing metadata
Any schema can be easily converted into a generic schema to enable it to be re-used. The following recipe shows two methods of creating generic schemas; the first from a pre-existing schema in the metadata repository and the second from a built-in schema.
How to do it…
From repository schema
- Open repository schema fd_cook_0020_customerDelimited, ensuring that you click the metadata, rather than the parent schema.
- Right-click metadata, and then select copy to Generic schema. This creates a schema fd_cook_0020_customerDelimited.
- Move the new schema folder and double click it to edit it.
- Change the name to sc_cook_0020_genericCustomer1.
From a built-in schema
- Open the Talend Job jo_cook_ch02_0020_builtInSchema and open the tFileOutput Delimited component.
- Click the highlighted button, shown in the following screenshot:
- This will open a windows file save dialogue. Save the file as sc_cook_0020_genericCustomer.xml.
- Now, create a new generic schema from the saved XML file by right-clicking Generic schemas, and selecting the option Create generic schema from xml:
- Select the XML file that was just saved, and click finish to create the new Generic schema sc_cook_0020_genericCustomer2.
Cutting and pasting schema information
This technique is a real time saver, but isn’t always immediately obvious, because the schema dialogue does not contain a right-click option.
How to do it…
- Make a copy of the job jo_cook_ch02_0020_builtInSchema, rename it to jo_cook_ch02_0030_copySchema, and open the new job.
- Drag a tFileOutputExcel component from the right-hand palette.
- Open tFileOutputDelimited, and then open the schema.
- Click the left-hand panel and press Ctrl+A to select all the columns.
- Press Ctrl+C to copy the highlighted columns.
- Open the tFileOutputExcel component, and then open the schema. It should be blank.
- Press CTRL+V to paste the columns.
How it works…
Press CTRL+V to paste the columns. Talend allows the standard windows shortcut keys to be used to cut and paste column information between schemas.
You can also use Ctrl+left mouse button to highlight individual columns and Shift+left mouse button to highlight a range as per the usual Windows conventions.
You can also use Ctrl+left mouse button to highlight individual columns and Shift+left mouse button to highlight a range as per the usual Windows conventions. Note that the pasted columns are added to the end of a schema, they do not replace existing columns. This means that in many cases further work will be needed to move the new columns to the correct place in the schema.
Dropping schemas to empty components
This simple tip is a useful time saver, especially when using generic schemas and the Hash components.
How to do it…
- Open the job jo_cook_ch02_0040_dragSchema. If you open the tHashOutput components, you will see that they all share the same schema; the schemas are all Built- In.
- In the left-hand window, open the generic schema sc_cook_0040_genericCustomer, so that you can see the actual metadata.
- Drag the metadata icon over each of the tHashOutput components.
- You will now see that all three components share the same generic schema.
How it works…
When you drag a metadata schema onto a component, the component is automatically populated with the new schema.
This is not generally a well-known feature of Talend, but it saves you having to navigate through the schema dialogues when you wish to share a common schema.
This method is particularly useful when using tHashInput components as re-usable lookups, based upon the schema of an existing tHashOutput. Each time you add an additional lookup, the generic schema can simply be dragged from the repository onto the new component, saving time and effort.
Creating schemas from lists
This next recipe doesn’t make use of Talend at all. Rather, it is a technique to save lots of tedious typing when creating schemas from documents and/or spreadsheets.
How to do it…
- Open the MS Word document customerFieldList.docx. As you can see, there are a reasonable number of field descriptions that would take a reasonable amount of time to define individually.
- Select all the column names from the word document and paste into an Excel spreadsheet:
- Now select all the fields, right click it, and select Copy.
- Go to the second worksheet and click the top-left cell.
- Then, right-click and select Paste Special, and select the option Transpose
- This will copy the previous vertical list into a horizontal list.
- Delete the initial worksheet and save the file as a CSV file named TransposedCustomer.csv
- You can then import the CSV file using the wizard for File delimited and stating that the file has a heading row.
- Set the field separator as Comma, and tick the box Set heading row as column names.
- Click on Next, and you should see the individual fields listed in the schema. You are now able to add the field types and lengths.
- If you wish, you can then copy the delimited schema to a generic schema.
How it works…
The transpose facility of the spreadsheet enables a vertical list of fields to be converted into a horizontal list. By saving this list as a CSV file, the horizontal field list can be highlighted as a heading row during an import into Talend. This automatically fills in the field names in the schema, thus avoiding the need to type in the names of the columns individually.
Even after importing a list using this method, you will still have to ensure that column types and lengths are populated, however, if you also add data to the CSV file prior to importing it, Talend will try to guess the type and length of each column during the import stage.
It is possible to force Talend to guess correctly by adding data to the file that matches the type exactly. There are two methods that can be used:
Transpose the data
Starting with the original list, add a second column to the list, and populate it with data values for each of the fields.
When transposing the data in the spreadsheet, copy both the column of field names and the data; and transpose both list columns, so that they become a heading row and a row of data.
Edit the CSV file
The second method is to add a row of data either to excel or CSV files manually prior to importing the metadata.
What data to add? If you take care to add data that is the maximum representative size of the column, then Talend will usually guess the correct types and lengths.
For example, if the field is a ten character string, for example, then ensure that you add ten characters to the data in either the list column or the CSV file. For numbers, ensure that you use numbers to let Talend know that the field is numeric.
In the preceding example, if you only set the number fields to 99999999.999 prior to import, it will save significant time. This is easy to do in Word or Excel and can save time when defining large schemas.