Big Data & Brews: Pivotal’s Milind Bhandarkar on Why Hadoop is Like Rocket Science

April 29, 2014 Paul M. Davis

Hadoop’s impact on the emergent Big Data industry cannot be overstated, and Pivotal’s Chief Scientist Milind Bhandarkar has played a key role in its development since its early days. In conversations with Datameer’s CEO Stefan Groschupf for the Big Data & Brews video series, Bhandarkar talks about the early days of Hadoop’s development, his professional experience, and how his unique long-view perspective has influenced Pivotal’s vision.

During the video interviews, Bhandarkar also speaks to his role within Pivotal, as well as his involvement in the early days of Hadoop development while working at Yahoo. In both technology and talent, heritage is a key differentiator for Pivotal’s Hadoop distribution. Pivotal HD includes custom components developed by a team with combined decades of Hadoop experience, backed by Pivotal’s analytical data management portfolio.

The two major components are HAWQ, an extremely fast SQL engine running on top of the Hadoop Distributed File System (HDFS), and Pivotal GemFire XD, an in-memory SQL processing engine which enables persistent storage on top of HDFS. These components represent the evolution of Greenplum Database and VMware vFabric SQLFire respectively, bringing those technologies’ speed, reliability, and maturity into Pivotal’s Hadoop stack.

As Bhandarkar explains in the video, HAWQ iterates upon a decade of work invested in building the lightning-fast Greenplum Database, bringing that product’s advantages — speed, reliability, and the common, powerful, and expressive language of SQL — to data stored natively in HDFS.

Concurrently, the Pivotal GemFire XD component enables real-time SQL analysis of in-memory data. This allows for high-speed data ingestion and processing for a tiered system geared toward data prioritization and availability — a hot layer of in-memory data, and warm and cold layers of data that resides in HDFS but can be easily queried through HAWQ, MapReduce, or GemFire XD.

As Bhandarkar states, Pivotal’s Hadoop stack enables rapid ingestion and analysis of new data in-memory, while iteratively moving the data to HDFS clusters where it can be easily accessed and quickly processed. “Within a few minutes that data gets on to HDFS,” he says, “which then becomes queryable not only by GemFire itself, but also by MapReduce, by HAWQ, or whatever technologies you have which [are] actually querying the same data.”
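
To make the tiering idea concrete, here is a minimal Python sketch of a hot in-memory layer that periodically flushes older records to a cold layer, with queries spanning both. It is a toy under stated assumptions: `TieredStore`, `flush_after_seconds`, and the plain dictionaries standing in for memory and HDFS are all hypothetical names for illustration, not the GemFire XD, HAWQ, or HDFS APIs.

```python
from dataclasses import dataclass, field


@dataclass
class TieredStore:
    """Toy two-tier store: a hot in-memory layer plus a cold layer.

    Hypothetical illustration only; this is not the GemFire XD or HDFS API.
    """
    flush_after_seconds: float = 300.0
    hot: dict = field(default_factory=dict)   # key -> (ingest_time, value)
    cold: dict = field(default_factory=dict)  # stands in for data on HDFS

    def ingest(self, key, value, now):
        # New data always lands in the in-memory layer first.
        self.hot[key] = (now, value)

    def flush(self, now):
        # Move records older than the threshold to the cold layer.
        expired = [k for k, (t, _) in self.hot.items()
                   if now - t >= self.flush_after_seconds]
        for k in expired:
            _, value = self.hot.pop(k)
            self.cold[k] = value
        return len(expired)

    def query(self, predicate):
        # A query sees both tiers, so callers need not know where a record lives.
        results = {k: v for k, v in self.cold.items() if predicate(v)}
        results.update({k: v for k, (_, v) in self.hot.items() if predicate(v)})
        return results


store = TieredStore(flush_after_seconds=300.0)
store.ingest("a", 10, now=0.0)
store.ingest("b", 25, now=200.0)
moved = store.flush(now=320.0)   # "a" is 320 s old and moves; "b" is 120 s old and stays
matches = store.query(lambda v: v >= 10)
```

The point of the sketch is the last method: once data has moved to the cold layer it remains visible to the same query path, which mirrors the quote above about data on HDFS staying queryable by more than one engine.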

This points to the Business Data Lake model embraced by Pivotal and partner Capgemini in a recent announcement. The Data Lake metaphor speaks to Hadoop’s ability to store seemingly unlimited amounts of data inexpensively; extracting insights from that mass of data is the role of components running atop HDFS, such as HAWQ and GemFire XD.

Bhandarkar’s experience affords him a long-view perspective, demonstrated both by his career and choice of beer for the Big Data & Brews segment. In the interview, he tells Groschupf that he would break his career into two distinct eras, the first being from 1991 to 2005, when he worked on high-performance computing projects for the government of India, building the country’s first indigenous supercomputer. Following this achievement, he moved to the United States to earn a PhD in Parallel Computing at the University of Illinois at Urbana–Champaign.

After he received his PhD, Bhandarkar decided that the academic path was not for him. Soon after, he was tapped by Yahoo to work on a project that would basically “revamp the entire search content engine.” They began working on a project named Juggernaut — research and development work inspired by Google’s seminal MapReduce and GFS papers.

In the interview, Bhandarkar recalls working on a small team in those early days of developing what would become Hadoop. The team was full of people focused on search infrastructure, such as Eric Baldeschwieler (head of Yahoo search engine content, formerly at Inktomi), Sameer Paranjpye (who built the first version of Dreadnaught, a precursor to Juggernaut), and former NASA/Ames researcher Owen O’Malley.

“Hadoop is rocket science,” jokes Groschupf during the video interview, to which Bhandarkar jokingly replies, “Yes it is rocket science,” recounting his previous experience at the Center for Simulation of Advanced Rockets at the University of Illinois.

In short order, the team realized that what they were working on was a tool for data science, one that came to be known as Hadoop during the five and a half years that Bhandarkar spent at Yahoo. Bhandarkar was there on the ground floor: his first batch of code, the serialization system known as Hadoop Record I/O, went into Hadoop with its 0.1.1 release. This legacy continues with Pivotal’s commitment to contributing to Hadoop and many other open source software projects.

A decade later, Bhandarkar serves as Pivotal’s Chief Scientist. His role is to build technical strategies for developing Big Data technologies. During his Big Data & Brews talks, Bhandarkar cites the integration of Apache Spark as an example of a project that he’s been watching for two years, investigating how its innovations can be applied to real use cases undertaken by customers using Pivotal’s unified platform-as-a-service.

Watch the entire Big Data & Brews video interview with Pivotal’s Milind Bhandarkar:

Big Data & Brews: Milind Bhandarkar on his Experience Leading up to Pivotal

Big Data & Brews: Milind Bhandarkar of Pivotal Talks About the Beginnings of Hadoop
