Hadoop Stack

In this post, I am exploring Hadoop stack and it’s ecosystem.



Oozie is a server-based workflow engine specialized in running workflow jobs with actions.  It is typically used for managing Apache Hadoop Map/Reduce and Pig Jobs. In Oozie, there are workflow jobs and Coordinator jobs. Typically workflow jobs are Directed Acyclical Graph (DAG) of actions while coordinator jobs are recurrent Ozzie workflow jobs which are triggered by time (or frequency) and based on data availability.

Due to Oozie’s integration with rest of the Hadoop stack, it is easy to support several types of Hadoop jobs out of the box.

From a product point of view, it’s a Java Web Application that runs on Java Servlet container. In Oozie, a workflow is a collection of actions (Hadoop Map/Reduce jobs, Pig jobs) arranged in control dependency DAG (Direct Acyclic Graph)… Here control dependency dictates that from one action to another action – but second action can’t run until the first action is completed.

These workflow definitions are written in hPDL (Process Definition Language). Oozie workflow actions start their jobs in remote systems (like Pig, Hadoop etc.). Once completed, remote systems callback Oozie to notify the action completion and then Oozie proceeds to the next actoin in workflow.

credit: https://oozie.apache.org/docs/4.2.0/DG_Overview.html

From Stackoverflow: DAG (Direct Acyclic Graph)

Graph = structure consisting of nodes, that are connected to each other with edges.
Directed  = The connections between nodes (edges) have a direction: A –> B is not the same as B -> A.
Acyclic = “non-circular” = moving from node to node by following the edges, you will never encounter the same node for the second time.

A good example of a directed acyclic graph is a tree. Note, however, not all directed acyclic graphs are trees 🙂