Open Data Platform Initiative: Putting an End to “Faux-pen” Source Apache Hadoop Distributions

February 19, 2015 Roman Shaposhnik

This week, Pivotal was among the visionary industry leaders coming together to address some of the biggest customer pain points around rapid evolution and standardization in the big data arena. The new initiative, christened the Open Data Platform (ODP), received great accolades from across the industry as a needed step in the right direction.

I was at the Pivotal press event and the excitement was palpable. When the live streaming event was over, the really fun part began: smart people asking great questions and double clicking into the next level of detail.

While answering these questions, and sorting through the conversations that emerged in the blogosphere and on Twitter, it dawned on me that I was answering similar questions in many places. What follows is a condensed version of those conversations.

But before we dive into that, I’d like to get one misconception out of the way first: the assumption that the ecosystem of Hadoop projects being developed at the Apache Software Foundation (ASF) achieves natural standardization by virtue of different vendors working on the same set of projects.

Standardization Does Not Come Naturally

The beauty of the ASF development and governance model (also known as the “Apache Way”) is that it is fully decentralized. Projects are being developed on completely different schedules, typically on multiple development branches. It is also not uncommon for projects to release from multiple branches simultaneously.

This situation is ideal for maintaining the pace of innovation that has allowed the Apache Hadoop® ecosystem to be the engine of a multi-billion dollar industry. And while thousands of these blooming flowers look pretty and exciting, their sheer number becomes a significant hurdle for anybody trying to integrate with the Hadoop ecosystem as a whole (rather than with each individual project). With more than a dozen projects in the Hadoop ecosystem (and counting!), the problem has grown big enough that vendors need to step in. The way they solve it is the tried-and-true method of selecting exact versions of all these individual projects, integrating them together, and calling the end result a distribution.
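To make that concrete, here is a minimal, purely illustrative sketch of what a distribution boils down to: a pinned manifest of component versions, plus a check that an installed stack actually matches what was integrated and tested together. The component list, version numbers, and function names below are made up for the example and don’t come from any real vendor’s tooling.

```python
# Hypothetical distribution manifest: the exact versions a vendor integrated,
# packaged, and validated together. Placeholder versions, not any real stack.
PINNED_STACK = {
    "hadoop": "2.6.0",
    "hbase": "0.98.4",
    "hive": "0.14.0",
    "spark": "1.2.1",
}


def check_against_distribution(installed):
    """Return human-readable mismatches between an installed stack and the
    versions this (fictional) distribution was validated with."""
    problems = []
    for component, expected in PINNED_STACK.items():
        actual = installed.get(component)
        if actual is None:
            problems.append(f"{component}: missing (expected {expected})")
        elif actual != expected:
            problems.append(f"{component}: found {actual}, expected {expected}")
    return problems


if __name__ == "__main__":
    # A cluster that has drifted away from the pinned manifest.
    found = {"hadoop": "2.6.0", "hbase": "0.98.9", "hive": "0.14.0"}
    for issue in check_against_distribution(found):
        print(issue)
```

The pinning itself is trivial; the real work (and the real value) is in the integration, packaging, and validation that stands behind each pinned version, which is exactly where vendors differentiate.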

The Benefits of a Standardized Distribution

Software distributions take away your ability to select components à la carte, but in return they give you predictability, backwards compatibility and API stability.

They are also great for curating out complexity for end users, which is usually the value commercial vendors add to open source projects in order to make a profit. However, it is extremely tempting for vendors to misuse this as an unnatural form of “stickiness”. Controlling the distribution becomes a new way of exerting the dreaded vendor lock-in.

Even if all the components in each distribution are completely open source, the software responsible for integration, packaging and validation isn’t. Think of it this way: even though both you and Gordon Ramsay can purchase exactly the same list of raw ingredients from the grocery store, the way to combine them and the cooking technique—the recipe—is what makes all the difference.

The claims of some vendors that they, just like everybody else, ship 100% compatible versions of Apache Hadoop®, and that every vendor shipping a Hadoop distribution builds off the Hadoop trunk, are misleading at best. A careful look at the bits of core Hadoop available from one of these vendors reveals 700+ code changes (all open source, no foul play there!) on top of an officially released version of Apache Hadoop®. By default, without adequate testing, such a distribution can no longer be assumed to be compatible with the core.

So, in order to publish any compatible software that works with each distribution, 3rd parties need to go through a lengthy, vendor-controlled certification process. And importantly, the software used for certification is all closed source. The certification is valid for that vendor’s distribution and that vendor’s distribution only. What that means is, if you’re a 3rd-party application developer who wants to support a large number of Hadoop ecosystem projects, you need to re-certify on as many Hadoop distributions as there are vendors.

On Tuesday we talked about fragmentation, and this is the problem we were speaking about. The arduous task of certifying 3rd-party components against every combination is impossible for every project to take on: if, say, ten applications each needed to be certified against five distributions, that would be fifty certification cycles instead of the ten a single common core would require. And so mini-ecosystems evolve around vendor-specific distributions, limiting choice and locking users into specific distributions as they build out their Hadoop deployments.

Real Openness To Reduce Fragmentation

Wouldn’t we, as an industry, all be better off if there were a 100% open, community-driven core distribution of Apache Hadoop®, complete with validation and certification software? Wouldn’t it be great if all the commercial Hadoop vendors shipped the bits coming from that core distribution the same way that Ubuntu ships the core bits of Debian? After all, this would greatly simplify customer and partner relationships with the various Hadoop distributions. Any application certified to run on the open Hadoop distro would be guaranteed to run on all the commercial ones.

The good news is that one Apache project is already pursuing this agenda. It is called Apache® Bigtop and it is a comprehensive packaging, testing, and configuration of the leading open source big data components, including, but not limited to, Hadoop, HBase and Spark.
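To give a flavor of what that kind of validation does, here is a tiny smoke test that drives the stock Hadoop command line to verify an HDFS write/read round trip against an integrated stack. This is only a hedged sketch of the idea, not Bigtop code; Bigtop’s actual test suites are written largely in Java and Groovy and are far more thorough.

```python
# A toy smoke test in the spirit of integration validation: write a file into
# HDFS, read it back, and confirm the bytes survived the round trip.
import os
import subprocess
import tempfile


def hdfs_round_trip(test_dir="/tmp/odp-smoke-test"):
    payload = b"hello, integrated stack\n"
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        local_path = f.name
    try:
        # All of these are stock `hadoop fs` commands available on any cluster
        # with the Hadoop client installed and on the PATH.
        subprocess.run(["hadoop", "fs", "-mkdir", "-p", test_dir], check=True)
        subprocess.run(["hadoop", "fs", "-put", "-f", local_path,
                        f"{test_dir}/probe.txt"], check=True)
        out = subprocess.run(["hadoop", "fs", "-cat", f"{test_dir}/probe.txt"],
                             check=True, capture_output=True)
        assert out.stdout == payload, "HDFS returned different bytes than were written"
    finally:
        subprocess.run(["hadoop", "fs", "-rm", "-r", "-skipTrash", test_dir])
        os.unlink(local_path)


if __name__ == "__main__":
    hdfs_round_trip()
    print("HDFS put/cat round trip OK")
```

Multiply this pattern across packaging formats, deployment recipes, and every component in the stack, and you have the kind of open integration and validation work Bigtop exists to do.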

The bad news is that unless there’s a clear forcing function to coalesce around Bigtop, vendors tend to be extremely reluctant to contribute to the project. This is exactly the kind of negative dynamic in the world of open source development that ODP (among other things) is looking to change.

Apache Bigtop

Before I explain how ODP changes these dynamics, let me tell you a few more things about Apache® Bigtop. It is, after all, a project that I personally co-founded and brought to the Apache Software Foundation (ASF).

You see, I am obsessed with building software distributions. It started when a hero of mine, Linus Torvalds, completely changed the world of UNIX back in ‘91 by making his Linux kernel freely available. The kernel, as we all know, is not of much use by itself: it requires userland applications to unlock its true value. Those early GNU applications had to be managed around the Linux kernel in a very careful way, similar to how the Hadoop ecosystem is managed around core Apache Hadoop® these days. The need for this careful curation set off a mad rush of home-grown Linux distributions; my own attempt, RULIX, came an hour late and a dollar short. Even though the world didn’t get to experience RULIX, the seed for what would eventually become Bigtop was sown in my mind.

Apache® Bigtop was brought to the ASF while I was at Cloudera. When the project got started, the company fully believed in doing all of the integration, packaging, and validation work completely in the open, and I was thrilled to run wild with this idea, bootstrapping the Bigtop community. Within a few years, as commercial pressures for differentiation built up, the attitude changed dramatically: while we were still allowed to contribute things like packaging code, our fellow QA engineers were completely barred from making validation and certification code available.

Then it got worse: the next-generation packaging format was developed as a proprietary extension of a closed-source management solution (Cloudera Manager), and it became clear that Bigtop needed a new home.

Around that time, I was introduced to Pivotal and was given a chance to pitch this idea to Scott Yara. The slide deck I presented to him was titled “Bigtop Foundation”. We discussed an early vision of how we could fix the situation where all the vendors leverage Apache® Bigtop but very few are willing to contribute to it. Based on what I was seeing Pivotal do with Cloud Foundry, I knew Pivotal had the guts and the wherewithal to pull this kind of thing off.

I also loved what Pivotal was doing for developers, focusing on making them more agile and productive. What’s more, Scott shared with me how the Pivotal big data portfolio of products, including HAWQ, GemFire and Greenplum Database, could make the open source big data ecosystem even bigger if Pivotal open sourced them. The idea of joining Cloud Foundry and big data efforts together to improve how data could be baked into applications was icing on the cake.

This is when I found a home for both myself and Bigtop. As I said when I started, in my mind Pivotal is an ideal sponsor for an effort to bring us closer to a fully integrated, easy-to-use Hadoop platform. This week’s announcement is the first public milestone of that conversation, and I couldn’t be prouder of what we’ve achieved already.

The Open Data Platform + The ASF = Better Together

Fundamentally, the ODP initiative aims to complement the Apache Software Foundation (ASF) by providing a structured way for vendors to agree on a fully integrated and validated core distribution of Apache Hadoop®.

It breaks the spell of the ‘tragedy of the commons’ that slowed the original Apache® Bigtop development and adoption, by forcing participating vendors to have skin in the game. By organizing the ODP, we now have a guiding function: it means dollars will flow to fund the development, it means products will align with the artifacts produced by the association, and it most definitely means playing by ASF rules when it comes to contributing to individual projects. A big ODP budget means more development funded within the ASF, and that is a good thing for everybody involved.

After all, as they said in the 1983 movie The Right Stuff, “No bucks, no Buck Rogers.”

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
