DataStage is a widely used ETL tool, and proficiency in it can open up job opportunities in many organizations. This DataStage Interview Questions blog covers the important questions that top companies ask in most DataStage-related job interviews. Studying them will help you prepare for your interview, so check them out to see what recruiters are asking today.
DataStage Interview Questions and Answers 2021. Here Mindmajix shares a list of 60 real-time DataStage interview questions for freshers and experienced candidates. These questions were asked in various interviews and were prepared by DataStage experts.
We have categorized the DataStage Interview Questions into 4 levels:
Below are the most frequently asked DataStage interview questions and answers to help you prepare for a DataStage interview. Let's have a look at them.
DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition, the Enterprise Edition, and the MVS Edition.
Parallel extender in DataStage is the data extraction and transformation application for parallel processing.
Two types of parallel processing are available:
Every parallel job runs a conductor process, where execution starts; a section leader process for each processing node; a player process for each set of combined operators; and an individual player process for each uncombined operator.
To kill a job's processes, destroy the player processes first, then the section leader processes, and finally the conductor process.
Use the "dsjob" command as follows:
dsjob -run -jobstatus projectname jobname
ex: $dsjob -run and also the options like
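A script that launches a job usually needs to act on the result. The sketch below wraps the `dsjob -run -jobstatus` call shown above and turns its exit code into a message. The project and job names are hypothetical, and the mapping of exit codes (1 for a clean finish, 2 for a finish with warnings) is an assumption worth confirming against your own installation.

```shell
#!/bin/sh
# Translate the exit code of "dsjob -run -jobstatus" into a readable message.
# Assumption: 1 = finished OK, 2 = finished with warnings, anything else = problem.
describe_job_status() {
  case "$1" in
    1) echo "finished OK" ;;
    2) echo "finished with warnings" ;;
    *) echo "failed or not run (status $1)" ;;
  esac
}

# Hypothetical project and job names; replace with your own.
PROJECT="dstage_dev"
JOB="load_customers"

if command -v dsjob >/dev/null 2>&1; then
  dsjob -run -jobstatus "$PROJECT" "$JOB"
  status=$?
  echo "$JOB: $(describe_job_status "$status")"
else
  echo "dsjob not found on PATH; skipping run"
fi
```

Keeping the code-to-message mapping in its own function makes it easy to adjust once the codes have been checked on the target system.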
Sequential File:
Dataset:
Descriptor File: Created in a defined folder/path.
Data File: Created in the Dataset folder mentioned in the configuration file.
Fileset:
DataStage Flow Designer Features:
Flow Designer offers many benefits:
HBase connector is used to connect to tables stored in the HBase database and perform the following operations:
Hive connector supports modulus partition mode and minimum-maximum partition mode during the read operation.
Kafka connector has been enhanced with the following new capabilities:
Amazon S3 connector now supports connecting by using an HTTP proxy server.
File connector has been enhanced with the following new capabilities:
InfoSphere Information Server is capable of scaling to meet any information volume requirement so that companies can deliver business results faster and with higher quality results. InfoSphere Information Server provides a single unified platform that enables companies to understand, cleanse, transform, and deliver trustworthy and context-rich information.
InfoSphere Information Server has four tiers:
The client tier includes the client programs and consoles that are used for development and administration and the computers where they are installed.
The engine tier includes the logical group of components (the InfoSphere Information Server engine components, service agents, and so on) and the computer where those components are installed. The engine runs jobs and other tasks for product modules.
The services tier includes the application server, common services, and product services for the suite and product modules, and the computer where those components are installed. The services tier provides common services (such as metadata and logging) and services that are specific to certain product modules. On the services tier, the WebSphere® Application Server hosts the services. The services tier also hosts InfoSphere Information Server applications that are web-based.
The metadata repository tier includes the metadata repository, the InfoSphere Information Analyzer analysis database (if installed), and the computer where these components are installed. The metadata repository contains the shared metadata, data, and configuration information for InfoSphere Information Server product modules. The analysis database stores extended analysis data for InfoSphere Information Analyzer.
DataStage provides the elements that are necessary to build data integration and transformation flows.
These elements include:
Stages are the basic building blocks in InfoSphere DataStage, providing a rich, unique set of functionality that performs either a simple or advanced data integration task. Stages represent the processing steps that will be performed on the data.
A link is a representation of a data flow that joins the stages in a job. A link connects data sources to processing stages, connects processing stages to each other, and also connects those processing stages to target systems. Links are like pipes through which the data flows from one stage to the next.
Jobs include the design objects and compiled programmatic elements that can connect to data sources, extract and transform that data, and then load that data into a target system. Jobs are created within a visual paradigm that enables instant understanding of the goal of the job.
A sequence job is a special type of job that you can use to create a workflow by running other jobs in a specified order. This type of job was previously called a job sequence.
Table definitions specify the format of the data that you want to use at each stage of a job. They can be shared by all the jobs in a project and between all projects in InfoSphere DataStage. Typically, table definitions are loaded into source stages. They are sometimes loaded into target stages and other stages.
Containers are reusable objects that hold user-defined groupings of stages and links. Containers create a level of reuse that allows you to use the same set of logic several times while reducing the maintenance. Containers make it easy to share a workflow because you can simplify and modularize your job designs by replacing complex areas of the diagram with a single container.
A project is a container that organizes and provides security for objects that are supplied, created, or maintained for data integration, data profiling, quality monitoring, and so on.
InfoSphere DataStage brings the power of parallel processing to the data extraction and transformation process. InfoSphere DataStage jobs automatically inherit the capabilities of data pipelining and data partitioning, allowing you to design an integration process without concern for data volumes or time constraints, and without any requirements for hand-coding.
InfoSphere DataStage jobs use two types of parallel processing:
Data pipelining is the process of extracting records from the data source system and moving them through the sequence of processing functions that are defined in the data flow that is defined by the job. Because records are flowing through the pipeline, they can be processed without writing the records to disk.
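Unix pipes offer a rough analogy for data pipelining. In the sketch below, three shell functions stand in for extract, transform, and load stages; the function names and sample records are invented for illustration and have nothing to do with how the parallel engine is implemented.

```shell
#!/bin/sh
# Three "stages" connected by pipes: extract, transform, load.
# Records pass from stage to stage through pipes; no intermediate file is written.
extract()   { printf '%s\n' "1,alice" "2,bob" "3,carol"; }
transform() { awk -F, '{ print $1 "," toupper($2) }'; }
load()      { while IFS= read -r rec; do echo "loaded: $rec"; done; }

run_pipeline() { extract | transform | load; }
run_pipeline
```

All three stages run at the same time, and a record can be consumed downstream while upstream stages are still producing more.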
Data partitioning is an approach to parallelism that involves breaking the records into partitions, or subsets of records. Data partitioning generally provides linear increases in application performance.
When you design a job, you select the type of data partitioning algorithm that you want to use (hash, range, modulus, and so on). Then, at runtime, InfoSphere DataStage uses that selection for the number of degrees of parallelism that are specified dynamically at run time through the configuration file.
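To make the idea concrete, here is a minimal sketch of modulus partitioning written with awk. It assumes records of the form `key,value` with an integer key, and the partition count passed to the function plays the role of the degree of parallelism. It illustrates the arithmetic only, not how DataStage distributes records internally.

```shell
#!/bin/sh
# Modulus partitioning sketch: partition number = key mod number_of_partitions.
# Input records are "key,value"; each output line is prefixed with its partition.
modulus_partition() {
  # $1 = number of partitions (hypothetical stand-in for the degree of parallelism)
  awk -F, -v n="$1" '{ print "p" ($1 % n) ": " $0 }'
}

printf '%s\n' "10,a" "11,b" "12,c" "13,d" | modulus_partition 2
```

Changing the argument from 2 to 4 spreads the same records over four partitions without touching the logic, which mirrors the separation between job design and the run-time configuration file.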
A single stage might correspond to a single operator, or a number of operators, depending on the properties you have set, and whether you have chosen to partition or collect or sort data on the input link to a stage. At compilation, InfoSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
OSH is the scripting language used internally by the parallel engine.
Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).
The two major ways of combining data in an InfoSphere DataStage job are via a Lookup stage or a Join stage.
Aim to use modular development techniques in your job designs in order to maximize the reuse of parallel jobs and components and save yourself time.
InfoSphere DataStage automatically performs buffering on the links of certain stages. This is primarily intended to prevent deadlock situations arising (where one stage is unable to read its input because a previous stage in the job is blocked from writing to its output).
Here are the points on how to import and export data in DataStage:
The collection library is a set of related operators that are concerned with collecting partitioned data.
The collection library contains three collectors:
The Ordered collector reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the sorted order of an input data set that has been totally sorted. In a totally sorted data set, the records in each partition of the data set, as well as the partitions themselves, are ordered.
The round-robin collector reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the collector starts over. After reaching the final record in any partition, the collector skips that partition.
The sortmerge collector reads records in an order based on one or more fields of the record. The fields used to define record order are called collecting keys.
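The three collection methods can be imitated with standard Unix tools, which helps when comparing the record order each one produces. This is only an analogy: it assumes each partition is a plain text file with one numeric record per line, already sorted within the partition, and the file names are invented.

```shell
#!/bin/sh
# Emulate the three collection methods with standard tools.
# Assumption: each partition is a text file of records, one per line.
workdir=$(mktemp -d)
printf '%s\n' 1 5 9 > "$workdir/part0"
printf '%s\n' 2 3   > "$workdir/part1"
printf '%s\n' 4 6 7 > "$workdir/part2"

# Ordered: all of partition 0, then all of partition 1, and so on.
ordered=$(cat "$workdir/part0" "$workdir/part1" "$workdir/part2")

# Round-robin: one record from each partition in turn. paste emits an empty
# line for an exhausted partition, which grep drops (assumes no empty records).
round_robin=$(paste -d '\n' "$workdir/part0" "$workdir/part1" "$workdir/part2" | grep -v '^$')

# Sort-merge: merge already-sorted partitions on a numeric collecting key.
sort_merge=$(sort -m -n "$workdir/part0" "$workdir/part1" "$workdir/part2")

echo "ordered:     $(echo $ordered)"
echo "round-robin: $(echo $round_robin)"
echo "sort-merge:  $(echo $sort_merge)"
rm -rf "$workdir"
```

With this sample data the ordered result is not globally sorted, because the partitions themselves are not ordered relative to each other; only the sort-merge result is.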
The aggtorec restructure operator groups records that have the same key-field values into an output record.
The field_export restructure operator combines the input fields specified in your output schema into a string- or raw-valued field.
The field_import restructure operator exports an input string or raw field to the output fields specified in your import schema.
The makesubrec restructure operator combines specified vector fields into a vector of subrecords.
The makevect restructure operator combines specified fields into a vector of fields of the same type.
The promotesubrec restructure operator converts input subrecord fields to output top-level fields.
The splitsubrec restructure operator separates input subrecords into sets of output top-level vector fields.
The splitvect restructure operator promotes the elements of a fixed-length vector to a set of similarly named top-level fields.
The tagbatch restructure operator converts tagged fields into output records whose schema supports all the possible fields of the tag cases.
The tagswitch restructure operator converts the contents of tagged aggregates to InfoSphere DataStage-compatible records.
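As a loose analogy for the grouping that aggtorec performs, the awk sketch below collapses `key,value` lines that share a key into a single output line. The record layout and the function name are invented for illustration, the sketch assumes its input is already sorted by key, and it does not reproduce the operator's actual output schema.

```shell
#!/bin/sh
# Group "key,value" records that share a key into one output line per key.
# Assumption of this sketch: input is already sorted by key.
group_by_key() {
  awk -F, '
    $1 != key { if (NR > 1) print key ": " vals; key = $1; vals = $2; next }
    { vals = vals " " $2 }
    END { if (NR > 0) print key ": " vals }
  '
}

printf '%s\n' "a,1" "a,2" "b,3" | group_by_key
```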
The easiest way to display the first line of a file is using the [head] command.
$> head -1 file.txt
If you specify [head -2], it prints the first 2 lines of the file.
Another way is to use the [sed] command. [sed] is a powerful stream editor that can be used for many text manipulation tasks like this one. The command below deletes every line from the second to the last, leaving only the first.
$> sed '2,$ d' file.txt
The easiest way is to use the [tail] command.
$> tail -1 file.txt
If you want to do it using the [sed] command, here is what you should write:
$> sed -n '$ p' file.txt
The easiest way to do it is with the [sed] command:
$> sed -n '<n> p' file.txt
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> sed -n '4 p' file.txt
Of course, you can also do it with the [head] and [tail] commands, as below:
$> head -<n> file.txt | tail -1
Again, replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> head -4 file.txt | tail -1
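The two approaches can be wrapped in small functions and compared side by side. This is a minimal sketch; the function names and the sample file contents are made up for the demonstration.

```shell
#!/bin/sh
# Print line N of a file in two equivalent ways.
nth_line_sed()  { sed -n "${1}p" "$2"; }
nth_line_head() { head -n "$1" "$2" | tail -n 1; }

sample=$(mktemp)
printf '%s\n' one two three four five > "$sample"
nth_line_sed 4 "$sample"
nth_line_head 4 "$sample"
rm -f "$sample"
```

The two differ when N is larger than the number of lines in the file: the sed version prints nothing, while the head and tail version prints the last line.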
We already know how [sed] can be used to delete a certain line from the output by using the 'd' command. So if we want to delete the first line, the command should be:
$> sed '1 d' file.txt
The issue with the above command is that it only prints all the lines except the first to standard output. It does not change the file in place. So if you want to delete the first line from the file itself, you have two options.
Either you can redirect the output to another file and then rename that file back to the original name, as below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or, you can use the built-in [sed] switch '-i', which changes the file in place. See below:
$> sed -i '1 d' file.txt
Always remember that the [sed] address '$' refers to the last line. Using this knowledge, we can deduce the command below:
$> sed -i '$ d' file.txt
If you want to remove line <m> to line <n> from a given file, you can accomplish the task in a similar way. Here is an example:
$> sed -i '5,7 d' file.txt
The above command will delete line 5 to line 7 from the file file.txt.
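The deletions above can be tried safely on a scratch file. The sketch below uses the redirect-and-rename approach rather than '-i', because the '-i' syntax differs between GNU and BSD versions of sed; the helper name and the sample contents are invented for the demonstration.

```shell
#!/bin/sh
# Delete lines from a file by writing to a temp file and moving it into place.
delete_lines() {
  # $1 = sed address (for example 1, '$', or 5,7), $2 = file to modify
  tmp=$(mktemp) && sed "$1 d" "$2" > "$tmp" && mv "$tmp" "$2"
}

f=$(mktemp)
printf '%s\n' l1 l2 l3 l4 l5 l6 l7 l8 > "$f"
delete_lines 1 "$f"      # drop the first line
delete_lines '$' "$f"    # drop the last line
delete_lines 3,4 "$f"    # drop a range of the remaining lines
cat "$f"
rm -f "$f"
```

Line numbers are re-evaluated after each deletion, so the range `3,4` in the last call refers to the third and fourth lines of what is left at that point, not of the original file.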
Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.