Hadoop – How To Build A Work Flow Using Oozie

Develop a sample workflow using Oozie

Build:

Maven is used to build the application bundle, and it is assumed that Maven is installed and available on your path.

To build the application, simply run:

# mvn package

The Maven assembly plug-in is used to generate a .tar.gz file, which contains all of the workflow and configuration files in the required layout:

oozie-examples-[VERSION]-bundle.tar.gz
/workflow.xml
/config-default.xml
/conf/(job config)
/lib/(*.jar; *.so)
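As a quick sanity check on the layout above, the following sketch lists an archive's entries and confirms that workflow.xml and config-default.xml are present. The function name check_bundle and the demo archive are hypothetical; the demo builds a throwaway archive rather than using the real Maven output, and it accepts the files with or without a leading directory, since the exact prefix inside the bundle is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the archive contains the two XML
# files named in the layout, at any directory depth.
check_bundle() {
  archive="$1"
  entries="$(tar -tzf "$archive")" || return 2
  for name in workflow.xml config-default.xml; do
    # Compare the last path component of every entry with the wanted name.
    if ! printf '%s\n' "$entries" |
        awk -F/ -v want="$name" '$NF == want { found = 1 } END { exit !found }'; then
      echo "missing: $name" >&2
      return 1
    fi
  done
  echo "layout ok"
}

# Demo on a throwaway archive (not the real Maven output).
demo_dir="$(mktemp -d)"
touch "$demo_dir/workflow.xml" "$demo_dir/config-default.xml"
mkdir -p "$demo_dir/conf" "$demo_dir/lib"
tar -czf "$demo_dir/demo-bundle.tar.gz" -C "$demo_dir" \
  workflow.xml config-default.xml conf lib
check_bundle "$demo_dir/demo-bundle.tar.gz"
rm -rf "$demo_dir"
```

Running the same function against the archive produced by mvn package should give a similar result, provided the bundle follows the layout described here.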

Deploy:

rm -rf oozie-examples
tar -xvzf oozie-examples-0.0.1-SNAPSHOT-bundle.tar.gz
hadoop fs -rmr /workflows/oozie-examples

Run:

export OOZIE_URL=https://hostname:11000/oozie
oozie job -config oozie-examples/job.properties -run

Run parallel map-reduce jobs in sub-workflow:

oozie job -config oozie-examples/job.properties \
-D jump.to=parallel-run -run

Coordinator:

  • The Oozie Coordinator system allows you to define and execute recurrent and interdependent workflow jobs (data application pipelines).
  • A data application pipeline is a chain of coordinator workflow jobs that can run at regular intervals, at different intervals, or be triggered by some external event (such as data availability).
  • For example, the output of the last 4 runs of a workflow that runs every 15 minutes can become the input of another workflow that runs every 60 minutes.
  • The coordinator job bundled with this example simply runs the workflow at 5-minute intervals between the given start and end dates.
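The interval arithmetic in the bullets above can be sketched in a few lines of shell. The function name runs_in_window is hypothetical, and the calculation assumes the end of the window is exclusive, so a run is counted only if its nominal time falls before the end time.

```shell
#!/bin/sh
# Ceiling division: number of runs whose nominal time falls in a window.
# Assumes the end time is exclusive, so a 60-minute window at a 5-minute
# frequency yields runs at minutes 0, 5, ..., 55.
runs_in_window() {
  window_minutes="$1"
  frequency_minutes="$2"
  echo $(( (window_minutes + frequency_minutes - 1) / frequency_minutes ))
}

runs_in_window 60 5    # prints 12
runs_in_window 60 15   # prints 4
```

The second call matches the example in the list: a 60-minute workflow consumes the output of 4 runs of a 15-minute workflow.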

To deploy the coordinator job, run the following command:

oozie job -config oozie-examples/coordinator/word.properties \
-D start=$(date -u +"%FT%H:%MZ") \
-D end=$(date -u -d "+1 hour" +"%FT%H:%MZ") -D mode=single -run
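The start and end values passed to the coordinator are UTC timestamps of the form YYYY-MM-DDTHH:MMZ. A minimal sketch of how they can be generated is shown here; it assumes GNU date (the -d option is not portable to BSD or macOS date), and the helper name oozie_ts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a UTC timestamp in the YYYY-MM-DDTHH:MMZ form.
# The optional argument is any date string GNU date understands,
# for example "+1 hour" or "@3600" (seconds since the epoch).
oozie_ts() {
  date -u -d "${1:-now}" +'%FT%H:%MZ'
}

start="$(oozie_ts)"
end="$(oozie_ts '+1 hour')"
echo "start=$start end=$end"

# A fixed input makes the format easy to see:
oozie_ts '@3600'   # prints 1970-01-01T01:00Z
```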

To stop the coordinator job, run:

oozie job -kill [word job id]


Oozie Scheduler Execution Using Pig and MapReduce:

A Workflow Engine:

  • Oozie executes workflows defined as a DAG (directed acyclic graph) of jobs.
  • The job types include MapReduce, Pig, Hive, any script, custom Java code, etc.
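To illustrate what "a DAG of jobs" means for execution order, the sketch below feeds a small dependency list to the standard tsort utility, which prints the nodes so that every job appears after the jobs it depends on. This is only an illustration of the ordering idea, not how Oozie schedules work internally (Oozie workflows are defined in workflow.xml), and the node names are hypothetical.

```shell
#!/bin/sh
# Each line is "upstream downstream": the downstream job waits for the
# upstream one. The two middle jobs do not depend on each other, so a
# workflow engine would be free to run them in parallel.
dag_edges() {
  cat <<'EOF'
start pig-clean
pig-clean mr-count
pig-clean hive-load
mr-count end
hive-load end
EOF
}

# Print one valid execution order (the relative order of mr-count and
# hive-load is not fixed).
dag_edges | tsort
```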


Oozie executes a workflow based on:

1. Time dependency (frequency)
2. Data dependency


Command Line Tool in Oozie:

  • Oozie provides a command line utility, oozie, to perform job and admin tasks.
  • All operations are done via sub-commands of the oozie CLI.
  • The oozie CLI interacts with Oozie via its WS (web services) API.

Commands:

To show the client version of Oozie

# oozie version

For job operations

# oozie job

For the status of jobs

# oozie jobs

For admin operations

# oozie admin

To validate a workflow XML file

# oozie validate

To submit a Pig job (everything after '-X' is passed through as parameters to Pig)

# oozie pig -X

Oozie URL:

  • All oozie CLI sub-commands expect the -oozie OOZIE_URL option, which indicates the URL of the Oozie system to run the command against.
  • If the -oozie option is not specified, the oozie CLI will look for the OOZIE_URL environment variable and use it if set.
  • If the option is not provided and the environment variable is not set, the oozie CLI will fail.
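The lookup order described in the list above can be sketched as a small shell function. This only mirrors the documented precedence for illustration; it is not the CLI's actual code, and the function name resolve_oozie_url is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper mirroring the precedence described above:
#   1. the value given to the -oozie option (passed here as $1),
#   2. otherwise the OOZIE_URL environment variable,
#   3. otherwise fail.
resolve_oozie_url() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  elif [ -n "${OOZIE_URL:-}" ]; then
    printf '%s\n' "$OOZIE_URL"
  else
    echo "error: no -oozie option given and OOZIE_URL is not set" >&2
    return 1
  fi
}

# Demo in a subshell so the caller's environment is left untouched.
(
  OOZIE_URL="https://hostname:11000/oozie"
  resolve_oozie_url ""                              # falls back to OOZIE_URL
  resolve_oozie_url "https://otherhost:11000/oozie" # explicit option wins
)
```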

Time Zone:

  • The -timezone TIME_ZONE_ID option in the job and jobs sub-commands allows you to specify the time zone to use in the output of those sub-commands.
  • The TIME_ZONE_ID should be one of the standard Java time zone IDs; you can get a list of available time zones with the command oozie info -timezones.

Last updated: 04 Apr 2023
About Author

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
