Hadoop Stack

In this post, I am exploring the Hadoop stack and its ecosystem.



Oozie is a server-based workflow engine specialized in running workflow jobs with actions. It is typically used for managing Apache Hadoop Map/Reduce and Pig jobs. Oozie has two kinds of jobs: workflow jobs and coordinator jobs. Workflow jobs are Directed Acyclic Graphs (DAGs) of actions, while coordinator jobs are recurrent Oozie workflow jobs that are triggered by time (frequency) and data availability.
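To make the coordinator idea a bit more concrete, here is a minimal Python sketch of a schedule that is driven by both time and data availability. It is a simplified model and not Oozie's implementation: the function names and the `data_available` callback are assumptions made up for this example.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """Yield the scheduled run times in the half-open window [start, end)."""
    current = start
    while current < end:
        yield current
        current += frequency


def ready_runs(start, end, frequency, now, data_available):
    """Return the scheduled times that are due and whose input data exists.

    data_available is a hypothetical callback: it takes a scheduled time and
    returns True when the input data for that run is present.
    """
    return [
        t
        for t in nominal_times(start, end, frequency)
        if t <= now and data_available(t)
    ]


# Example: a run every 6 hours on one day; the 06:00 input has not landed yet.
start = datetime(2016, 3, 1, 0, 0)
end = datetime(2016, 3, 2, 0, 0)
now = datetime(2016, 3, 1, 13, 0)
runs = ready_runs(start, end, timedelta(hours=6), now, lambda t: t.hour != 6)
```

In this toy run, the 18:00 slot is not yet due and the 06:00 slot is still waiting for its data, so only the 00:00 and 12:00 slots would trigger a workflow.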

Because Oozie is integrated with the rest of the Hadoop stack, it supports several types of Hadoop jobs out of the box.

From a product point of view, it is a Java web application that runs in a Java servlet container. In Oozie, a workflow is a collection of actions (Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph). A control dependency from one action to another means that the second action can't run until the first action has completed.

These workflow definitions are written in hPDL (an XML Process Definition Language). Oozie workflow actions start their jobs in remote systems (such as Hadoop or Pig). Once a job completes, the remote system calls back Oozie to notify it that the action has completed, and Oozie then proceeds to the next action in the workflow.
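The control-dependency behaviour can be sketched in a few lines of Python with the standard library's `graphlib` module (Python 3.9+). This is only an illustration, not how Oozie is implemented: the action names are made up, and `run_action` is a hypothetical stand-in for submitting a job to a remote system and waiting for its completion callback.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, run_action):
    """Run every action only after all actions it depends on have finished.

    dependencies maps an action name to the set of actions that must
    complete first. run_action is a hypothetical callable that stands in
    for "start the remote job and wait for its completion callback".
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    completed = []
    while sorter.is_active():
        for action in sorter.get_ready():
            run_action(action)
            sorter.done(action)  # completion unblocks the dependent actions
            completed.append(action)
    return completed


# Hypothetical workflow: two jobs fan out after cleaning, then join again.
workflow = {
    "import-logs": set(),
    "clean-logs": {"import-logs"},
    "pig-aggregate": {"clean-logs"},
    "mr-report": {"clean-logs"},
    "publish": {"pig-aggregate", "mr-report"},
}
order = run_workflow(workflow, lambda name: None)
```

Here `import-logs` always runs first and `publish` always runs last, while the two middle jobs have no dependency on each other and may run in either order.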

credit: https://oozie.apache.org/docs/4.2.0/DG_Overview.html

From Stack Overflow: DAG (Directed Acyclic Graph)

Graph = structure consisting of nodes, that are connected to each other with edges.
Directed = The connections between nodes (edges) have a direction: A -> B is not the same as B -> A.
Acyclic = “non-circular” = moving from node to node by following the edges, you will never encounter the same node for the second time.

A good example of a directed acyclic graph is a tree. Note, however, not all directed acyclic graphs are trees 🙂
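The definitions above can be checked with a short Python sketch. The helper names `is_acyclic` and `is_tree` and the sample graphs are made up for this example; the "diamond" graph shows a DAG that is not a tree, because node D has two parents.

```python
def all_nodes(edges):
    """Collect every node that appears as a source or a target."""
    return set(edges) | {m for targets in edges.values() for m in targets}


def is_acyclic(edges):
    """Return True if the directed graph (node -> list of targets) has no cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {n: WHITE for n in all_nodes(edges)}

    def visit(node):
        color[node] = GRAY
        for nxt in edges.get(node, ()):
            if color[nxt] == GRAY:
                return False  # we came back to a node on the current path
            if color[nxt] == WHITE and not visit(nxt):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in color if color[n] == WHITE)


def is_tree(edges):
    """Return True if the graph is a rooted tree: a DAG with exactly one
    root and exactly one parent for every other node."""
    if not is_acyclic(edges):
        return False
    parents = {n: 0 for n in all_nodes(edges)}
    for targets in edges.values():
        for m in targets:
            parents[m] += 1
    roots = [n for n, count in parents.items() if count == 0]
    return len(roots) == 1 and all(
        count == 1 for n, count in parents.items() if n != roots[0]
    )


tree = {"A": ["B", "C"], "B": ["D"]}
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
cycle = {"A": ["B"], "B": ["A"]}
```

With these graphs, `tree` is both acyclic and a tree, `diamond` is acyclic but not a tree, and `cycle` is not acyclic at all.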

Bare Metal – A dreary (but essential) part of Cloud

Recently I got a chance to attend the Open Compute Summit 2016 in San Jose, CA. It was full of industry peers from web-scale companies such as Facebook, Google, and Microsoft, along with many financial institutions like Goldman Sachs, Bloomberg, and Fidelity. The overall theme of the summit was to embrace openness in hardware and to embrace commodity hardware.
From a historical point of view, OCP is a project initiated by Facebook a few years ago, in which they opened up the designs of many hardware components: motherboard, power supply, chassis, rack, and later the switch. They needed hardware at scale, and branded servers (pre-cut for the enterprise by HP, Dell, IBM) weren't going to cut it for them, so they designed their own gear. More details here.
Below is one of the OCP-certified servers (courtesy: http://www.wiwynn.com). It has a minimalistic feature set and is a stripped-down version of a typical rack-mount server.
Coming back to this year's summit: this was my first year at the OCP summit, and I came with certain expectations. Having been there, I can say one thing for sure: “Bare Metal does look interesting again”. Why do I say that? Bare metal on its own is certainly a boring thing, but when you combine it with an API, and particularly if you operate at scale (it doesn't have to be Facebook scale), it's fun time. Let's take a look.
The keynote was opened by Facebook's Jason Taylor, who covered the journey over the last year or so and where the community stands now. But the fun began when (another Jason) Jason Waxman from Intel talked about their involvement, how the server and storage (think NVMe) industry is growing, and what they see coming in the future, including Xeon D and Yosemite.

A good talk was given by Peter Winzer of Bell Labs. I knew UNIX and C were born at Bell Labs, but it was fascinating to hear about the history and future of Bell Labs, with innovations under way in fiber optics and fiber capacity: 100G is a no-brainer, but 1 Tbps is on the horizon.

Microsoft Azure's CTO Mark Russinovich discussed how open Microsoft is. To be honest, other than their .NET framework being open, I had no idea that they have been contributing back to the open source community. Well, it's a good thing! In the past, Microsoft has contributed their server design specs, Open Cloud Server (OCS), as well as the Switch Abstraction Interface (SAI). OCS is the same server and data center design that powers their Azure hyper-scale cloud (~1M servers). SAI and its APIs help network infrastructure providers integrate software with hardware platforms that are continually and rapidly evolving at cloud speed and scale. This year, they have been working on a network switch and proposed a new innovation for OCP inclusion called Software for Open Networking in the Cloud (SONiC). More details here.

There were many interesting technologies showcased at the Expo, but the one that struck me was a storage archival solution. Its basic configuration can hold 26,112 disks (7.8 PB), and with expandable modules spanning a pair of datacenter rows it gives a total capacity of up to 181 petabytes (HUGE!!). Is AWS Glacier running this underneath? Some details here.
For a coder at heart, it was a good demonstration by companies such as Microsoft and Intel, showing some love for OpenBMC to manage the bare metal. Firmware updates seem to be a common pain across industries, but the innovative approach taken by Intel and Microsoft using Capsule, which brings an API and an envelope via UEFI, tries to make it easier than it seems.
Overall, it was good exposure to a newer generation of hardware technologies. By accepting contributions from multiple companies, OCP is moving towards standardization in hardware. With standardization and API integration, Bare Metal will be fun to play with.
Do you still think Bare Metal is dreary?

This article originally appeared on LinkedIn under the title Bare Metal – A dreary (but essential) part of Cloud