3 Biggest Questions Companies Have Before Starting To Tackle Apache Hadoop

June 5, 2015 Stacey Schneider


Last month, I attended the Pivotal Big Data Roadshow in my hometown of Atlanta. I’ve been spouting shocking statistics, such as that only 4% of companies think they are using big data analytics effectively, and the roadshow was a clear reminder that this statistic is likely very true.

At one point, early in the morning, one of the presenters asked the room how many people had Apache Hadoop® deployed at their company. With a variety of people from 35+ local area companies in attendance, including all our “big ones”, surprisingly few hands went up around the room. Many had experience with traditional business intelligence tools, but none with the modern distributed computing approach pioneered by Apache Hadoop®.

As the day progressed, I listened to the questions put to the presenters and talked to folks at breaks and at lunch. It quickly became obvious that the majority of the people there were doing discovery. They know analytics is a differentiator, and they already have well-established, albeit traditional, business intelligence solutions. Now they are looking to bring their skills and businesses into the real-time intelligence powered by Hadoop. They are doing the due diligence now so they aren’t completely unprepared if a smart, data-savvy competitor pops up. Everyone is afraid of those. No one wants their industry’s Uber to snatch away their market share while they are asleep at the wheel.

The Big Data Roadshow is organized really well for this type of exploration. The morning is a series of presentations that orient technologists and business people to the big data market, provide high-level architecture guidance, and stress the use cases and data science of putting your data to work for you. The afternoon is a hands-on lab that can be repeated remotely after the class, helping the technologists in the room get their hands dirty and gain confidence in working with big data technologies. (Check out new dates for the Pivotal Big Data Roadshow below!)

Big data is a big topic, however, and three questions came up repeatedly.

1. How can I convince my org to start on Hadoop now?

I saw this at Strata Hadoop as well. The journey of transforming into a technology company is hard. You have to learn new skills or hire them in, which is hard if your project isn’t even real yet. You need support from across the organization, both for budget approvals and to sponsor real change. Change is hard, especially when you have to pay salaries and please investors.

Our advice usually is to start small. Do one project that is either of high value or isolated enough that the opponents of change in your organization can’t object too furiously. If you can, pick a case that doesn’t fit the existing investment profile you have for your data infrastructure, meaning one that uses data you’ve never kept before because of cost, as it will help justify storing more data. Embrace agile development practices if you can, but regardless, plan for results in 6 weeks or less. It may not be the full picture of what you want, but get results and prove merit quickly. Learn from this cycle, and use its benefits to prove that change is good.

For most, that approach is the slow route, and they see the guillotine coming if they don’t speed the process up. For companies in that situation, I really recommend a tour of Pivotal Labs. As I said after Strata Hadoop, Pivotal Labs is Silicon Valley’s secret weapon. Within minutes of being there, anyone from human resources to IT will understand the change and what works. A contact at one major insurance company said he had spent months lobbying for change, but just 10 minutes into a visit to Pivotal Labs, the rest of his company finally got it and approved the change. For many, this is enough to realize that they need to partner with Pivotal Labs and open new offices that are dedicated to this new model of technology. This could sound extreme, but having a new office gives a level of isolation that lets the change take root and protects results, which will soon speak for themselves.

2. Do I really have to run it? How can we just get to the good parts fast?

At lunch, one world-famous brand confessed that, like many consumer goods companies, they are a marketing company. They don’t want to spend time building data centers and maintaining them if they don’t have to. They want to be a tech-savvy, smart marketing company, but they are fighting the idea that they will have to manage their big data infrastructure. They want it in the cloud.

For those companies, the answer is twofold. Yes, you can outsource the hosting, infrastructure, and much of the operations associated with a successful big data stack (or data lake), but in the long run the analytics itself is not something it is smart to foist off into someone else’s cloud.

Simply put, you need to become an expert in Big Data and Analytics if you want to remain competitive. (Notice I didn’t say data center hosting and infrastructure design!) This is foundational to our belief at Pivotal. Can you obtain Pivotal solutions in a fully hosted solution? Absolutely, we’d be happy to direct you to a number of clouds doing just that, but ultimately you need to become the expert in collecting, analyzing, and operationalizing your big data to fully harness it. Shared analytic service solutions only provide a static, narrowly focused, non-customizable, and locked-in shortcut to limited insights. While you’re busy being one of the lemmings, your competition, as Google and Facebook did before them, will be building internal expertise and flexible data infrastructures that enable an agility and excellence that generic shared hosted services will not.

If this process seems daunting considering the skills you have in-house, I strongly suggest you engage with our Pivotal Data Science team. Pivotal Data Scientists are part of the Pivotal Labs organization, which embraces the Pivotal Way of building brilliant software. Part of this involves pairing, so our data scientists will pair with your engineers and teach them collaboratively how to build, maintain, and fully capitalize on your big data investment, just like Google and Facebook. Here, knowledge is power for sure.

3. Is a Data Lake really all one big thing?

It is one concept, but it is not one data sink that holds all the data for all applications. The reality is that for your data to work for you, you will have to build a system of rivers for your data that flow to analytical and operational stores. There are many architectural details here, but generally speaking, yes, you put all data into one HDFS system. However, it doesn’t have to go to HDFS first. Real-time data flows coming in can be intercepted and processed via known models to produce actions and output BEFORE the data lands in the main lake. Tools like Spring XD and Pivotal GemFire facilitate this, allowing output from the initial flow analysis to be delivered to the app data store, as well as being dropped into the lake for subsequent cross-checking.
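To make the idea of intercepting a flow before it reaches the lake a little more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the `Pipeline` class, the `overheating` rule, and the two in-memory lists are hypothetical stand-ins, not the Spring XD or GemFire APIs, and a real deployment would write to HDFS and an operational store instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Pipeline:
    """Hypothetical stream tap; in-memory lists stand in for real stores."""
    model: Callable[[Dict], bool]                         # known model: True means act on the event
    app_store: List[Dict] = field(default_factory=list)   # stand-in for the app data store
    data_lake: List[Dict] = field(default_factory=list)   # stand-in for HDFS

    def handle(self, event: Dict) -> None:
        # Score the event while it is still in flight.
        if self.model(event):
            self.app_store.append({"id": event["id"], "action": "alert"})
        # Every event still lands in the lake for later cross-checking.
        self.data_lake.append(event)


def overheating(event: Dict) -> bool:
    # Illustrative threshold rule; a real model would be built offline.
    return event["temp_c"] > 90


pipeline = Pipeline(model=overheating)
for e in [{"id": 1, "temp_c": 72}, {"id": 2, "temp_c": 95}]:
    pipeline.handle(e)
```

The point of the design is that the action is produced before storage, while the raw event is never lost: both events end up in the lake, and only the second one triggers an entry in the app store.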

For the data left in the Data Lake, as new ideas come up, you explore that data using MapReduce or SQL tools like Pivotal HAWQ. One customer said they needed to start doing this now, because they weren’t sure what data they were failing to collect. For instance, in manufacturing, in order to predict faults or failures in certain products, they would need to start putting part numbers on all inventory and measuring its status, service records, and other performance measures, including load or use, in order to determine if true defects exist. Depending on how complex the analysis is, you will pipe some of that data out, possibly enriching it, put it into another, smaller store, and run more complex batch analytics on it using tools like Pivotal Greenplum Database. Of course, ultimately any interactions or enrichments should be archived back to the Data Lake for future reference and reuse.
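The extract, enrich, analyze, and archive loop described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the record fields (`part`, `load`, `failed`), the 0.8 load threshold, and the function names are invented, and plain lists stand in for the lake and the smaller analytics store rather than HAWQ or Greenplum Database.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical raw records as they might sit in the lake.
data_lake: List[Dict] = [
    {"part": "A-100", "load": 0.9, "failed": True},
    {"part": "A-100", "load": 0.4, "failed": False},
    {"part": "B-200", "load": 0.5, "failed": False},
    {"part": "B-200", "load": 0.6, "failed": False},
]


def extract_and_enrich(records: List[Dict]) -> List[Dict]:
    """Pipe records out of the lake, adding a derived field."""
    return [{**r, "high_load": r["load"] > 0.8} for r in records]


def failure_rate_by_part(records: List[Dict]) -> Dict[str, float]:
    """Batch analytic: share of failed records per part number."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["part"]] += 1
        failures[r["part"]] += int(r["failed"])
    return {part: failures[part] / totals[part] for part in totals}


# Smaller store holding the enriched subset.
analytics_store = extract_and_enrich(data_lake)
rates = failure_rate_by_part(analytics_store)

# Archive the enriched records back to the lake for future reuse.
archive = [{**r, "source": "enriched"} for r in analytics_store]
data_lake.extend(archive)
```

With the sample records, part A-100 fails in one of its two records and B-200 in none, which is the kind of per-part signal the manufacturing example is after.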

To operationalize it, you may want to move it again to a real-time in-memory store, like Pivotal GemFire, so users have fast access. This is exactly why Pivotal saw fit to build Big Data Suite. There are many ways to slice and dice the data, and data must flow into different data stores on its way to becoming useful. So you need a flexible toolkit, and the most complete one out there is Pivotal Big Data Suite.

Special Guest From The Red Sox Speaking In Burlington

Each Pivotal Big Data Roadshow is delivered by mostly local resources who are connected to the industries and customers in that region, allowing us to tailor some of the discussion to the attendees. As such, the Burlington Pivotal Big Data Roadshow has been able to invite an exciting special guest speaker, Tim Zue, SVP of Business Development, Boston Red Sox.

Tim Zue will share with attendees his journey providing analytical support behind strategic business decisions to ensure the ultimate customer experience for fans engaging with the Boston Red Sox and Fenway Park.

“Zue is now part of nearly every business decision made at Fenway, from ticket prices to game-day experiences.” -The Robot Builder who helps run the Boston Red Sox, Washington Post

Upcoming Roadshow Dates

  1. Rocky Hill, CT, June 16
  2. Boston, MA, June 18
  3. Melbourne, Australia, July 7
  4. Sydney, Australia, July 8

Editor’s Note: ©2015 Pivotal Software, Inc. All rights reserved. Pivotal, Greenplum Database, GemFire and HAWQ are trademarks and/or registered trademarks of Pivotal Software, Inc. in the United States and/or other countries. Apache, Apache Hadoop and Hadoop are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
