This segment contains recipes that show some of the techniques used to read and write data to files. It also contains recipes that show techniques to MANAGE FILES within a file system.
It isn’t very efficient to process large batches of information via a web service, nor is it particularly desirable to pull data from an application database during peak hours. Thus, many organizations still maintain a file-based overnight batch process using large extracts in file format.
In addition, many older, legacy applications rely solely on file-based data for communicating with the outside world.
It is therefore very important for the data integration developer to understand many file types and be able to manage them efficiently and effectively.
If you would like to Enrich your career with a Talend certified professional, then visit Mindmajix - A Global online training platform: “Talend Certification Course” . This course will help you to achieve excellence in this domain.
This simple recipe shows how a file can be built-in within different sub jobs by appending data to an existing file. The append method is one way of building complex files.
Open the jo_cook_ch08_0010_fileAppend job.
The steps for appending records to a file are as follows:
The first sub job creates the file, and the second sub job appends to the same file.
While relatively trivial, this recipe demonstrates a very powerful method for creating files that do not adhere to the norm, such as files containing a mixture of fixed and delimited data or free format strings.
Concatenating files using the append method
In addition to concatenating records using the append method, it is also possible to use the same method to concatenate many files into one file.
In most cases, the files would be identified using tFileList, and then appended to a single output file. One word of caution though; if you forget to set the mode to append, it will result in each file overwriting its predecessor, leaving you with just the output from a single file; the last one to be found.
It also presents an additional problem when the job is run for a second time. In that the file will already exist, and the new data will simply be added to the old data.
[Related Article:-Administering File]
It is thus a good idea when using file append to add a tFileDelete component to the beginning of the job. This means that a file is always created anew by the first iteration (but always make sure Fail on error is unchecked for the deleted component, otherwise it will fail when you run the job at the first time).
Ensure that the file name is unique for each run, perhaps by adding the process ID to the file name or even the datetime including milliseconds.
A regular expression (regex) is a powerful method for pattern matching and replacement within many programming languages. One interesting use for regular expressions is when dealing with unusual input formats that are difficult to describe using normal delimited or fixed-width file formatting. This recipe shows how regex can be used to identify a set of input columns from an unstructured input row.
The screenshot of the chapter8_jo_0020_jobLogData.txt file is as follows:
You should notice that there is neither an obvious delimiter nor does each record fit a fixed-width format.
Now, open the jo_cook_ch08_0020_readRegexData job.
How to achieve it…
The steps for reading rows using regular expressions are as follows:
"^job: "+ "([a-zA-Z0-9_]*)"+ ". *start *"+ "([0-9][0-9]:[0-9][0-9]:[0-9][0-9])"+ " - "+ "(success|failure)"+ ".*"
Regular expressions require a pattern that will match a whole line of data, with parts of the pattern in brackets being retained and the rest being discarded. A short explanation of the regular expression code for the preceding example is detailed as follows:
As mentioned, this is only a brief description of the previous code and of regular expressions.
[Related Article:-Creating Validation Rules and Binary Error Codes in Talend]
You will probably have noticed that the lines in the pattern are organized in terms of patterns to keep and to discard. The regular expression doesn’t need to be specified in this way; however, it does make identifying the fields much easier than if the whole pattern were simply on a single line.
Java regular expressions will not ignore carriage return or newline characters, unless explicitly told to do so. If you need to create regular expressions that span multiple lines, then simply add (?s) to the beginning of a regular expression. This has the same effect as the Java DOTALL option.
Occasionally, it is necessary to create intermediate files within a job that are only used during the lifetime of the job. This recipe shows how to use Talend temporary files.
Open the jo_cook_ch04_0030_temporaryFile job.
The steps for using temporary files are as follows:
How it works…
The tCreateTemporaryFile component creates an empty file that is then available for writing in the main sub job. The name of the file is stored in the globalMap variable tCreateTemporaryFile_1_FILEPATH, which is referenced by both the output and input components.At the end of the job, Talend then deletes the temporary file to free the space.
Temporary files are best used when there are large volumes of data that would or could possibly cause memory issues. For lower volume data, it is recommended that tHashMap or an in process table is used instead, because this will create much more performant code. See WORKING WITH DATABASES, for an example of using the in-process table.
It is necessary to keep the XXXX format in the temporary file name, because Talend uses this to add additional information to the file name to ensure that it is unique for the job instance. This is very important if multiple instances of a job are executed at the same time, because it would be catastrophic if all of the jobs are written to the same temporary file.
While not strictly file based, there are alternative methods for storing intermediate data that are more efficient than using temporary files, as long as there is enough memory to hold the temporary data. This recipe shows how to do this using the tHashMap component.
Open the jo_cook_ch08_0040_temporaryDatatHashMap job. You will notice that this is the same job as in the previous recipe.
How to do it…
The steps for storing intermediate data in memory using tHashMap are as follows:
How it works…
tHashMap creates an in memory structure that holds all the data in the flow. It can then be used as an input in a downstream sub job.
tHashMap relies on enough memory being present for all the hash mapped data to be stored. If there is not enough memory, then it is best to use a temporary file.
tHashMap is a very useful component for storing data in memory, and can also be used to store lookup data that can be reused across multiple joins.
This recipe follows on from the previous recipe, but shows how the information in the header can be added to the detail data, and the data in the trailer used for validation, as is typically the case with files of this type.
Open the jo_cook_ch08_0080_useHeaderAndTrailerInfo job. This job is the completed job from the previous recipe; however, do note that the tLogRow components have now been replaced with tHashOutput components. Also, note that three tHashInput components have also been added and configured.
How to achieve it…
We will be performing two main tasks; the first is to use the trailer information to validate the file, and then take a column from the header to use in all the output records.
and change the final row in the file to 00004.
How it works…
The job first separates the header trailer and detail rows, validates the row count in the trailer against the number of physical rows read, and then adds the file date to each detail row.
Validating using the trailer information
The validation sub-job compares the row count from the trailer with the actual row count. The only output from tMap is controlled by a condition that checks the value of the trailer count with the number of lines (tExtractDelimitedFields_2_NB_LINE) passed through the tExtractDelimitedFields component, which is stored in globalMap.
The header data is added to the tMap as a lookup. You will notice that there are no join links between the detail row and the header, which means that we have an all-all-join. So, every detail record is matched to the header record, thus populating the filing date for every detail record.
You could choose to copy the header and trailer information into globalMap as an alternative to using tHashMap components, using a tFlowToIterate component in place of the header and trailer components, or a tJava component, where the column values are explicitly copied into globalMap.This method does make adding the data to the detail simpler (because no join is needed, just a reference to globalMap); however, it also makes the validation slightly more complex, because the compare function does need a row to be created using a tFixedFlowInput.
The tDie message in this job isn’t very useful. If you wish to make it more useful, then change the message to that given as follows:
“Error: Counts do not match. Trailer: “+rowCountError.trailerCount+”. Actual : “+rowCountError.actualCount
To copy data to globalMap using tJava, see the Setting context variables and globalMap variables using tJava recipe is shown in USING JAVA IN TALEND.
Ravindra Savaram is a Content Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.