Hadoop – How to Build a Workflow Using Oozie
Develop a sample workflow using Oozie:-
Maven is used to build the application bundle, and it is assumed that Maven is installed and available on your path.
To build the application, simply run:
Cmd # mvn package
The Maven assembly plug-in is used to generate a .tar.gz file which contains all of the workflow and configuration files in the required layout:
/config-default.xml
/conf/(job config)
rm -rf examples-oozie
tar -xvzf oozie-examples-0.0.1-SNAPSHOT-bundle.tar.gz
hadoop fs -rmr /workflows/oozie-examples
export OOZIE_URL=http://hostname:11000/oozie
oozie job -config oozie-examples/job.properties -run
Run parallel MapReduce jobs in a sub-workflow:-
oozie job -config oozie-examples/job.properties -D jump.to=parallel-run -run
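One common way for an Oozie workflow to run jobs in parallel is a fork node paired with a join node. The following is only a minimal sketch of that pattern, not the workflow bundled with this example: the names parallel-sketch, job-a and job-b, the /tmp paths, and the fs actions standing in for the real MapReduce actions are all assumptions made for illustration.

```shell
#!/bin/sh
# Write a hypothetical fork/join workflow definition to a scratch directory.
# All names below (parallel-sketch, job-a, job-b) are illustrative assumptions.
out_dir=$(mktemp -d)

# The quoted 'EOF' keeps ${nameNode} literal so that Oozie, not the shell, resolves it.
cat > "$out_dir/workflow.xml" <<'EOF'
<workflow-app name="parallel-sketch" xmlns="uri:oozie:workflow:0.4">
    <start to="fork-node"/>
    <!-- fork starts both paths at the same time -->
    <fork name="fork-node">
        <path start="job-a"/>
        <path start="job-b"/>
    </fork>
    <!-- fs actions stand in for the real MapReduce actions -->
    <action name="job-a">
        <fs><mkdir path="${nameNode}/tmp/parallel-sketch/a"/></fs>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>
    <action name="job-b">
        <fs><mkdir path="${nameNode}/tmp/parallel-sketch/b"/></fs>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>
    <!-- join waits for every forked path before moving on -->
    <join name="join-node" to="end"/>
    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

echo "wrote $out_dir/workflow.xml"
```

If a node such as fork-node were the target of the jump.to property, both paths would start together and the workflow would only reach its end node once both had finished.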
The Oozie Coordinator system allows you to define and execute recurrent and interdependent workflow jobs (data application pipelines).
A data application pipeline is a chain of coordinator workflow jobs that can run at regular intervals, at different intervals, or be triggered by some external event (data availability).
For example, the output of the last 4 runs of a workflow that runs every 15 minutes becomes the input of another workflow that runs every 60 minutes.
The coordinator job bundled with this example simply runs the workflow at 5-minute intervals between the given start and end dates.
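A coordinator that runs a workflow every 5 minutes could be defined roughly as follows. This is a sketch under stated assumptions rather than the coordinator.xml shipped with the example: the name five-minute-sketch and the workflowAppUri property are hypothetical, while start and end are assumed to be supplied as job properties.

```shell
#!/bin/sh
# Write a hypothetical coordinator definition to a scratch directory.
out_dir=$(mktemp -d)

# The quoted 'EOF' keeps the ${...} expressions literal for Oozie to evaluate.
cat > "$out_dir/coordinator.xml" <<'EOF'
<coordinator-app name="five-minute-sketch" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <!-- workflowAppUri is an assumed property pointing at the workflow in HDFS -->
            <app-path>${workflowAppUri}</app-path>
        </workflow>
    </action>
</coordinator-app>
EOF

echo "wrote $out_dir/coordinator.xml"
```

The frequency attribute sets the interval, and the start and end attributes bound the period in which the coordinator creates workflow runs.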
To deploy the coordinator job, run the following command:
oozie job -config oozie-examples/coordinator/word.properties -D start=$(date -u +"%FT%H:%MZ") -D end=$(date -u -d "+1 hour" +"%FT%H:%MZ") -D mode=single -run
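The start and end values in the command above are UTC timestamps in the yyyy-MM-ddTHH:mmZ form that Oozie coordinators use. The sketch below computes the same one-hour window on its own so the values can be inspected before submitting; it assumes GNU date, since the -d "+1 hour" relative form is not available in every date implementation.

```shell
#!/bin/sh
# Compute a one-hour coordinator window in UTC (assumes GNU date).
# %F expands to YYYY-MM-DD; the trailing Z is a literal marking UTC.
start=$(date -u +"%FT%H:%MZ")
end=$(date -u -d "+1 hour" +"%FT%H:%MZ")

echo "start=$start"
echo "end=$end"
```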
To stop the coordinator job, run:
oozie job -kill [coordinator job id]
Oozie scheduler execution using Pig and MapReduce:-
A Workflow Engine:-
Oozie executes workflows defined as a DAG (directed acyclic graph) of jobs.
The job types include: MapReduce, Pig, Hive, any script, custom Java code, etc.
Oozie executes workflows based on:
- Time dependency (frequency)
- Data dependency
Command line tool in Oozie:-
Oozie provides a command line utility, oozie, to perform job and admin tasks.
All operations are done via sub-commands of the oozie CLI.
The oozie CLI interacts with Oozie via its WS (web services) API.
To show the client version of Oozie:
Cmd # oozie version
For job operations
# oozie job <options>
For jobs status
# oozie jobs <options>
For admin operations
# oozie admin
To validate a workflow XML file
# oozie validate <ARGS>
To submit a Pig job (everything after '-X' is passed through as parameters to Pig)
# oozie pig <options> -X <ARGS>
All oozie CLI sub-commands expect the -oozie OOZIE_URL option, which indicates the URL of the Oozie system to run the command against.
If the -oozie option is not specified, the oozie CLI will look for the OOZIE_URL environment variable and use it if set.
If the option is not provided and the environment variable is not set, the Oozie CLI will fail.
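The lookup order described above can be sketched as a small shell function. resolve_oozie_url is a hypothetical helper written for illustration only; it is not part of Oozie and only mirrors the order in which the CLI is described as looking for the URL.

```shell
#!/bin/sh
# Hypothetical helper (not part of Oozie): an explicit URL wins,
# then the OOZIE_URL environment variable, otherwise fail.
resolve_oozie_url() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "${OOZIE_URL:-}" ]; then
        echo "$OOZIE_URL"
    else
        echo "error: no Oozie URL given and OOZIE_URL is not set" >&2
        return 1
    fi
}

OOZIE_URL=http://hostname:11000/oozie
resolve_oozie_url ""                              # prints http://hostname:11000/oozie
resolve_oozie_url "http://otherhost:11000/oozie"  # prints http://otherhost:11000/oozie
```

In practice this means that exporting OOZIE_URL once per shell session lets you leave the -oozie option off every later command.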
The -timezone TIME_ZONE_ID option in the job and jobs sub-commands allows you to specify the time zone to use in the output of those sub-commands.
The TIME_ZONE_ID should be one of the standard Java time zone IDs, and you can get a list of available time zones with the command oozie info -timezones.