Hadoop – How to Build a Workflow Using Oozie

Develop a sample workflow using Oozie:-

Build:-

Maven is used to build the application bundle, and it is assumed that Maven is installed and available on your path.

To build the application, simply run:

Cmd # mvn package

The Maven assembly plug-in is used to generate a .tar.gz file which contains all of the workflow and configuration files in the required layout:

oozie-examples-[VERSION]-bundle.tar.gz
/workflow.xml
/config-default.xml
/conf/(job config)
/lib/(*.jar; *.so)
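
As a quick sanity check, an extracted bundle can be compared against this layout before it is deployed. The following is only a sketch: the function name check_bundle_layout and the demo directory are made up for illustration and are not part of the example project.

```shell
# check_bundle_layout DIR: return 0 if DIR contains the expected bundle layout.
# Hypothetical helper, not part of Oozie or of the example project.
check_bundle_layout() {
    bundle_dir=$1
    missing=0
    # Files expected at the top level of the bundle
    for f in workflow.xml config-default.xml; do
        if [ ! -f "$bundle_dir/$f" ]; then
            echo "missing file: $f" >&2
            missing=1
        fi
    done
    # Directories holding job configuration and libraries (*.jar, *.so)
    for d in conf lib; do
        if [ ! -d "$bundle_dir/$d" ]; then
            echo "missing directory: $d" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demo: build an empty skeleton with the expected layout and check it.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/conf" "$demo_dir/lib"
: > "$demo_dir/workflow.xml"
: > "$demo_dir/config-default.xml"
check_bundle_layout "$demo_dir" && echo "layout ok"
```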

Deploy:-

rm -rf oozie-examples
tar -xvzf oozie-examples-0.0.1-SNAPSHOT-bundle.tar.gz
hadoop fs -rmr /workflows/oozie-examples

Run:-

export OOZIE_URL=http://hostname:11000/oozie
oozie job -config oozie-examples/job.properties -run

Run parallel MapReduce jobs in a sub-workflow:-

oozie job -config oozie-examples/job.properties \
  -D jump.to=parallel-run -run

Coordinator:-

The Oozie Coordinator system allows you to define and execute recurrent and interdependent workflow jobs (data application pipelines).

A data application pipeline is a chain of coordinator workflow jobs that can run at regular intervals, run at different intervals, or be triggered by an external event (such as data availability).

For example, the output of the last 4 runs of a workflow that runs every 15 minutes can become the input of another workflow that runs every 60 minutes.

The coordinator job bundled with this example simply runs the workflow at 5-minute intervals between the given start and end dates.
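
For orientation, a coordinator definition of this kind could look roughly like the one written out below. This is a minimal sketch and not the file shipped with the example: the application name sample-coord and the workflowAppPath property are assumptions, and the schema version may differ on your Oozie installation.

```shell
# Write a minimal coordinator definition to a scratch directory.
# The quoted 'EOF' keeps ${...} literal so Oozie, not the shell, expands it.
out_dir=$(mktemp -d)
cat > "$out_dir/coordinator.xml" <<'EOF'
<coordinator-app name="sample-coord" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- workflowAppPath is a hypothetical property name -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
EOF
echo "wrote $out_dir/coordinator.xml"
```

The start and end attributes are left as properties so that they can be supplied on the command line with -D when the job is submitted.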

To deploy the coordinator job, run the following command:

oozie job -config oozie-examples/coordinator/word.properties \
  -D start=$(date -u +"%FT%H:%MZ") \
  -D end=$(date -u -d "+1 hour" +"%FT%H:%MZ") -D mode=single -run

To stop the coordinator job, run:

oozie job -kill [coordinator job id]
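
The start and end values are UTC timestamps in the form yyyy-MM-ddTHH:mmZ. To see what is passed to the coordinator, the two values can be computed and printed on their own first. This sketch assumes GNU date, since the -d option is not available in BSD or macOS date.

```shell
# Compute a one-hour window starting now, in UTC.
# %F expands to YYYY-MM-DD; the trailing Z is a literal marking UTC.
start=$(date -u +"%FT%H:%MZ")
end=$(date -u -d "+1 hour" +"%FT%H:%MZ")

echo "start=$start"
echo "end=$end"
```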

Oozie scheduler execution using Pig and MapReduce:-

A Workflow Engine:-

Oozie executes workflows defined as a DAG (directed acyclic graph) of jobs.

Job types include MapReduce, Pig, Hive, arbitrary scripts, custom Java code, and so on.

Oozie executes workflows based on:

1. Time dependency (frequency)
2. Data dependency
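
To make the DAG idea concrete, the sketch below writes out a small workflow definition in which two actions run one after the other and any error routes to a kill node. It is illustrative only: the application name, node names, and HDFS path are hypothetical, simple fs actions stand in for real MapReduce or Pig jobs, and the schema version may differ on your installation.

```shell
# Write a minimal two-step workflow definition to a scratch directory.
# The quoted 'EOF' keeps ${...} literal so Oozie, not the shell, expands it.
out_dir=$(mktemp -d)
cat > "$out_dir/workflow.xml" <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.5" name="sample-wf">
    <start to="make-dir"/>
    <!-- First node of the DAG: create a directory on HDFS -->
    <action name="make-dir">
        <fs>
            <mkdir path="${nameNode}/tmp/sample-wf/out"/>
        </fs>
        <ok to="cleanup"/>
        <error to="fail"/>
    </action>
    <!-- Second node: runs only after make-dir succeeds -->
    <action name="cleanup">
        <fs>
            <delete path="${nameNode}/tmp/sample-wf/out"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
echo "wrote $out_dir/workflow.xml"
```

Each action names the node to go to on success (ok) and on failure (error), and these transitions are what form the edges of the graph.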

 

Command line Tool in Oozie:-

Oozie provides a command line utility, oozie, to perform job and admin tasks.

All operations are done via sub-commands of the oozie CLI.

The oozie CLI interacts with Oozie via its WS (web services) API.

Commands:-

To show the client version of Oozie:

Cmd # oozie version

For job operations:

Cmd # oozie job

For jobs status:

Cmd # oozie jobs

For admin operations:

Cmd # oozie admin

To validate a workflow XML file:

Cmd # oozie validate

To submit a Pig job (everything after '-X' is passed through as parameters to Pig):

Cmd # oozie pig -X

Oozie URL:-

All oozie CLI sub-commands expect the -oozie OOZIE_URL option, which indicates the URL of the Oozie system to run the command against.

If the -oozie option is not specified, the oozie CLI will look for the OOZIE_URL environment variable and use it if set.

If the option is not provided and the environment variable is not set, the Oozie CLI will fail.
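
The lookup order can be mimicked with a small shell function. This is only an illustration of the rule described above, not how the oozie client is implemented; the function name resolve_oozie_url and the job id in the demo are made up.

```shell
# resolve_oozie_url ARGS...: print the URL a command line would use.
# Hypothetical helper that mirrors the documented lookup order.
resolve_oozie_url() {
    resolved_url=""
    # 1. An explicit -oozie option takes precedence
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -oozie)
                if [ "$#" -ge 2 ]; then
                    resolved_url=$2
                    shift
                fi
                ;;
        esac
        shift
    done
    # 2. Otherwise fall back to the OOZIE_URL environment variable
    if [ -z "$resolved_url" ]; then
        resolved_url=${OOZIE_URL:-}
    fi
    # 3. With neither available, fail
    if [ -z "$resolved_url" ]; then
        echo "error: no -oozie option and OOZIE_URL is not set" >&2
        return 1
    fi
    printf '%s\n' "$resolved_url"
}

# Demo: the option is used when given, the variable otherwise.
resolve_oozie_url job -oozie http://localhost:11000/oozie -info some-job-id
( OOZIE_URL=http://hostname:11000/oozie; resolve_oozie_url job -info some-job-id )
```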

Time Zone:-

The -timezone TIME_ZONE_ID option in the job and jobs sub-commands allows you to specify the time zone to use in the output of those sub-commands.

The TIME_ZONE_ID should be one of the standard Java time zone IDs, and you can get a list of the available time zones with the command oozie info -timezones.

