Processing Multiple and Validation Files – Talend
Processing multiple files at once
Often, with batch processes, it is required that multiple files are processed by the same job in a single tranche. This example shows how this can be achieved by merging a group of input files into a single output.
Open the jo_cook_ch08_0120_multipleFiles job. You will notice that it currently reads a single file into a temporary file and then copies the temporary file to a permanent output.
How to do it…
The steps for processing multiple files at once are as follows:
- Add a tFileList component, open it, and set the directory to
- Click on the + button under the Filemask box, and add the filemask
- Your tFileList should look like the one shown, as follows:
- Move the OnSubjobOk from the tFileInputDelimited to the tFileList.
- Add a tJava
- Right-click on tFileList and select Row, then Iterate, and link to the tJava.
- Right-click on the tJava and select Trigger, then OnComponentOk.
- Link it to the tFileInputDelimited (customer)
- Open the tFileInputDelimited component, and change the file name to
- Move the OnSubjobOk link from tFileInputDelimited (customer) to the tFileList component.
- Your job should look like the one shown as follows:
- Run the job, and you will see that the output file contains information from the three input files.
- To make the job output more useful, open tJava and insert the following code:
System.out.println("Processing file: " + ((String)globalMap.get("tFileList_1_CURRENT_FILE")));
- Run the job again, and you will see that the console now logs the individual files as they are found.
How it works…
This job merges all files in a directory into a temporary file ready for processing as a single entity; in this case, renaming the temporary file to a permanent output file name.
The tFileList component is an iterator that is triggered by each file found that fits the specified mask.
So as each file is found, the file details are stored in globalMap, and then all linked components and sub jobs will be processed until no more files are found.
As you can see from the job, the tFileInputDelimited component reads from the file that tFileList has stored in globalMap, and tFileOutputDelimited writes to the temporary file whose name tCreateTemporaryFile has stored in globalMap.
Once all files have been read and processed, tFileList is complete and the OnSubjobOk link is triggered, which copies the temporary file into a final, permanent merged file.
In this job, we have only one sub job that is executed as part of the Iterate, but it is probably more common to have many. In a traditional programming language, this would mean that all the processing linked to the Iterate would be in a programming loop.
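To make the loop analogy concrete, here is a minimal plain-Java sketch of the same pattern: iterate over the files that match a mask, append each one to a temporary file, and copy the temporary file to the final output once the loop finishes. It is not the code Talend generates. The class name, method name, and file names are hypothetical, and the sketch simply concatenates bytes rather than parsing delimited rows as tFileInputDelimited would.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MergeFilesSketch {

    // Hypothetical helper: merges every file in dir that matches the glob
    // into output and returns the number of files merged.
    static int mergeMatching(Path dir, String glob, Path output) throws IOException {
        // Plays the role of tCreateTemporaryFile.
        Path temp = Files.createTempFile("merge", ".tmp");
        try {
            // Plays the role of tFileList: collect the files that fit the mask.
            List<Path> matches = new ArrayList<>();
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
                for (Path p : stream) {
                    matches.add(p);
                }
            }
            Collections.sort(matches); // fixed order, so the result is repeatable

            // Everything linked to the Iterate sits inside this loop.
            for (Path current : matches) {
                System.out.println("Processing file: " + current.getFileName());
                Files.write(temp, Files.readAllBytes(current), StandardOpenOption.APPEND);
            }

            // Plays the role of the OnSubjobOk step: runs once, after the loop.
            Files.copy(temp, output, StandardCopyOption.REPLACE_EXISTING);
            return matches.size();
        } finally {
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("customers");
        Files.write(dir.resolve("customer_1.txt"), "row one\n".getBytes());
        Files.write(dir.resolve("customer_2.txt"), "row two\n".getBytes());
        int merged = mergeMatching(dir, "customer_*.txt", dir.resolve("merged.out"));
        System.out.println("Merged " + merged + " files");
    }
}
```

The copy happens outside the loop for the same reason the OnSubjobOk link hangs off tFileList rather than off the input component: it must run once, after the last file, not once per file.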
It is also possible to have further iterations below the first one, for instance, if you are navigating your way down a set of directories to find input files for processing.
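A nested iteration of this kind might look like the following plain-Java sketch, in which an outer loop visits each subdirectory and an inner loop visits each matching file. It is an illustration only, not Talend-generated code. The class and method names are hypothetical, and the sketch assumes the input files sit exactly one directory level below the root.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NestedIterateSketch {

    // Hypothetical helper: returns every file matching the glob in the
    // immediate subdirectories of root, sorted by path.
    static List<Path> findInputs(Path root, String glob) throws IOException {
        List<Path> found = new ArrayList<>();
        // Outer iteration: one pass per subdirectory of root.
        try (DirectoryStream<Path> dirs =
                 Files.newDirectoryStream(root, p -> Files.isDirectory(p))) {
            for (Path dir : dirs) {
                // Inner iteration: one pass per file that fits the mask.
                try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, glob)) {
                    for (Path file : files) {
                        found.add(file);
                    }
                }
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("inputs");
        Path north = Files.createDirectory(root.resolve("north"));
        Files.write(north.resolve("customer_1.txt"), "row\n".getBytes());
        for (Path p : findInputs(root, "customer_*.txt")) {
            System.out.println("Found input file: " + p);
        }
    }
}
```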
The tJava component named dummy is just that. It performs no logic and is present in the job only to make it more readable: it allows the processing for each iteration to sit in individual sub jobs, as if they were part of a normal, atomic job that processes just one file.
Processing control/validation files
Some organizations prefer to use a companion (control/validation) file containing file information instead of storing the information in the file header or trailer. This means that the detail file is much simpler to process, because it is a normal flat file.
In this recipe, the control file has the same name as the detail file; however, it is suffixed with .ctrl rather than .txt. This recipe shows how the control file is processed.
Open the jo_cook_ch08_0130_controlFile job. You will see that tFileList_1 is looking for files with the mask of chapter08_jo_0130_customerData*.txt. There are two of these in the directory.
How to do it…
The steps for processing control/validation files are as follows:
- Copy the first sub job.
- Change the new tFileList mask to StringHandling.EREPLACE(((String)globalMap.get(“tFileList_1_CURREN
- Open tJava_2 and change the command to:
System.out.println("Found control file: " + ((String)globalMap.get("tFileList_2_CURRENT_FILE")));
- Connect the first and second sub jobs using OnComponentOk.
- Repeat the same for the second and third sub jobs.
- Your job should now look like this:
- Run the job, and you will see that the main process is called once per file/control combination.
How it works…
The first tFileList looks for files that fit the mask “chapter08_jo_0130_customerData*.txt”, of which there are three.
For each .txt file that fits the mask, the job then runs a second tFileList. This time, however, the mask is the actual file name with .txt replaced by .ctrl. This has the effect of searching for a control file that has exactly the same name as the text file.
Once a match is found, then we have both file names in globalMap together, and the file details can be validated and processed by whatever means within the main processing section.
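The pairing logic can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the job's generated code: the class and method names are hypothetical, and where the recipe uses Talend's StringHandling.EREPLACE, the sketch substitutes the standard String.replaceAll.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

public class ControlFileSketch {

    // Derives the control file name by swapping a trailing .txt for .ctrl.
    static String controlNameFor(String detailName) {
        return detailName.replaceAll("\\.txt$", ".ctrl");
    }

    // Hypothetical helper: maps each detail file that fits the glob to its
    // control file, skipping detail files that have no companion.
    static Map<Path, Path> pairWithControl(Path dir, String glob) throws IOException {
        Map<Path, Path> pairs = new TreeMap<>();
        // First iteration: the detail files (the role of tFileList_1).
        try (DirectoryStream<Path> details = Files.newDirectoryStream(dir, glob)) {
            for (Path detail : details) {
                // Second lookup: the exact control file name (the role of tFileList_2).
                String controlName = controlNameFor(detail.getFileName().toString());
                Path control = detail.resolveSibling(controlName);
                if (Files.isRegularFile(control)) {
                    // Both names are now known together, ready for the main processing.
                    pairs.put(detail, control);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("control");
        Files.write(dir.resolve("customerData_1.txt"), "detail\n".getBytes());
        Files.write(dir.resolve("customerData_1.ctrl"), "control\n".getBytes());
        for (Map.Entry<Path, Path> e : pairWithControl(dir, "customerData*.txt").entrySet()) {
            System.out.println("Processing file: " + e.getKey().getFileName());
            System.out.println("Found control file: " + e.getValue().getFileName());
        }
    }
}
```

As in the job, a detail file with no matching control file never reaches the main processing, because the inner lookup finds nothing for it.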