Almost any exec today knows there is a huge potential for big data to make a big impact on their business. They also believe big data usually comes with a big price tag, and in most cases takes years to realize.
In the past they were right. Being cautious made sense. It is hard to justify massive capital expenditure for a project that won’t be realized this fiscal year, never mind when you are unsure what information even lurks in that data. This is why Bain & Co found only 4% of enterprises are actually getting value out of big data today.
In yet another signal today that this view of big data now belongs in the rearview mirror, Capgemini and Pivotal announced a partnership focused on advancing strategies for building Data Lakes that can be ‘fished’ with low-cost tools. This partnership is not built on an idea that might pay off in the future. It is firmly rooted in the pent-up demand from Capgemini’s customers and in the maturity of Pivotal’s tools to make this happen.
In fact, as a signal of how real this demand is, Pivotal and Capgemini are forming a dedicated Center of Excellence (CoE) in India that will scale to 500 dedicated Pivotal experts by 2015, with access to over 8,000 information management practitioners and 6,000 Java developers.
By all measures, this is a serious investment from IT leaders. The reason? We collectively believe we are at an inflection point for big data. Open source communities like Hadoop and Redis, along with technology providers like Pivotal, have made big data easier and cheaper to conquer.
Challenge: Harvesting Unstructured, Unconnected Data
Most companies today have customer, inventory, and financial data stored and likely automated just to run the business. However, they have access to a tremendous amount of data they are not using in any kind of automated way. Social media, sensor data, and anything in ‘blob’ form such as documents, pictures and video, all contain important insights.
As I outlined in an earlier post about making big data ubiquitous, these insights can be game changers and serious strategic advantages. They can show you customer buying patterns that tell you when to discontinue a product line, how to avoid customer churn, or, even better, when to attempt to upsell or cross-sell an existing customer or prospect. They can help optimize the scheduling of internal resources, including maintenance schedules, which can avoid failures or downtime. They can serve as the perfect, all-encompassing focus group to guide product development.
The problem with doing this is that all this information may be accessible, but it’s in disparate systems. Typically, to solve these problems companies build a dedicated data warehouse. The data is expertly architected for fast querying on one facet of a problem.
At the end of the day, you’ve got a one trick pony.
Understanding the Data Lake
Today, we have figured out how to federate data across nodes and use data strategies like MapReduce to access it quickly. This solves a couple of things.
First, it solves scalability. While there is absolutely more to it, essentially companies are no longer locked into storing and using data that fits on just one massive server.
Second, it has become affordable. Hardware costs are mitigated because you can use commodity servers or the cloud. Software expenses can be controlled using open source technologies.
Last but not least, that leaves just one requirement: people with the skill set to do this. This is probably the hardest part, since these skills are increasingly scarce today, which is definitely one of the reasons Capgemini is partnering with us to help their clients get started and start solving problems quickly. (Sidenote: Veteran data nerds can learn these skills and become productive relatively quickly, as evidenced in our Rough Guide to Data Science post, but even then you may want a dedicated coach or just a head start on your first project.)
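To make the MapReduce idea mentioned above a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop or Pivotal HD code. The `partitions` list and the SKU values are hypothetical stand-ins for records spread across several nodes, and on a real cluster the map step would run in parallel on each node rather than in a loop on one machine.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list stands in for the records held on one node.
partitions = [
    ["sku-1", "sku-2", "sku-1"],
    ["sku-3", "sku-1"],
    ["sku-2", "sku-2"],
]

def map_partition(records):
    """Map step: count SKUs using only the records on one node."""
    return Counter(records)

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# On a real cluster each map call would run on the node that holds the data.
partial_counts = [map_partition(p) for p in partitions]

# The reduce step folds the small partial results into a single answer.
total_counts = reduce(merge_counts, partial_counts, Counter())

print(dict(total_counts))
```

The point of the pattern is that only the small partial counts travel between machines, so no single server ever has to hold the full data set.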
So, let’s say you have access to the skills, either in-house or through partners like Capgemini and Pivotal. Since it is possible to put all the data together, and the hardware and software are becoming more and more affordable, why would you consider putting only one slice of data into your data analytics?
Why wouldn’t you put it all in?
This is what a Data Lake is. It is a data store that pulls a flood of data, internal and external, into one place. Conceptually, it behaves like one large table. Unstructured or structured, it is all accessible from one place. Any question you have, you can query the single store.
It paves the path so that if you want to find out answers on customer buying patterns today, the information is there. If tomorrow you want to study the volume of service requests against product SKUs, you can do that too.
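As a rough illustration of this kind of open-ended querying, the sketch below keeps a handful of mixed records in one in-memory list and asks two unrelated questions of it. Everything here is hypothetical: the `lake` list, the field names, and the sample values are invented for the example, and a real Data Lake would sit on a distributed store and be queried with a SQL engine or similar tool rather than Python loops.

```python
from collections import Counter

# Hypothetical records: a tiny stand-in for a lake that holds structured
# and unstructured data side by side, each in its original shape.
lake = [
    {"source": "orders", "customer": "c1", "sku": "A100", "qty": 2},
    {"source": "orders", "customer": "c2", "sku": "B200", "qty": 1},
    {"source": "orders", "customer": "c1", "sku": "A100", "qty": 1},
    {"source": "service_requests", "sku": "A100", "text": "Unit stopped charging"},
    {"source": "service_requests", "sku": "A100", "text": "Battery drains fast"},
    {"source": "social", "text": "Loving my new B200!"},
]

def units_sold_by_sku(records):
    """Today's question: which products are customers buying?"""
    totals = Counter()
    for record in records:
        if record["source"] == "orders":
            totals[record["sku"]] += record["qty"]
    return totals

def service_requests_by_sku(records):
    """Tomorrow's question: which products generate service requests?"""
    return Counter(
        record["sku"] for record in records
        if record["source"] == "service_requests"
    )

sold = units_sold_by_sku(lake)
requests = service_requests_by_sku(lake)

for sku in sold:
    print(sku, "units sold:", sold[sku], "service requests:", requests[sku])
```

Neither question had to be anticipated when the records were loaded, which is the contrast with a warehouse architected around a single facet of a problem.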
In essence, the Data Lake provides a way for ad-hoc, open-ended querying of big data for your whole company. And as I said in my previous post, Making Big Data Real-time and Ubiquitous Is Key to Enterprise Adoption, by placing big data within reach in real time for the entire organization, you open up amazing possibilities to change your business processes, improve the bottom line, and secure strategic advantages.
More about Pivotal & Capgemini’s Announcement
- See the press release
- See the video and other resources for the new co-innovation partnership between Pivotal and Capgemini.
- Want to start fishing in your own data lake? Contact the new Pivotal and Capgemini CoE.
- Learn more about Pivotal HD, our Hadoop distribution, and HAWQ, our SQL query engine that works on Pivotal HD.
About the Author