Welcome to the July 2015 version of the Pivotal Build Newsletter!
This month, it seems the hottest topic, whether you call it “The Industrial Internet”, “The Internet of Things (IoT)”, or “The Internet of Everything”, has completely taken over. However, it seems that even the emergence of SkyNet will have data quality issues. We found some interesting articles detailing new methods for big data cleanliness. We finish off with some news and commentary in app development, big data, in-memory data grids, open source software, and data science.
Now, to kick things off—why should the average developer or architect care about the Industrial Internet? Some posit the $330 billion enterprise software industry will turn into $1 trillion.
As all products and services become more digitized, there are tons of data-generating devices. But, there is also a new breed of information worker coming online. This includes construction, factory, farm, equipment, and many other types of workers. Often referred to as blue collar, these occupations interact with machinery and equipment. This means they will need apps to help manage machines. In the U.S. alone, there are 87 million citizens in these types of jobs.
The Industrial Internet: Building Design to Ball Bearings
Who thought software developers would help us design buildings, infrastructure, and utilities based on information—allowing one person to do the work of 5? How about using augmented reality to view manuals or see real-time analytics as you walk around a building, factory, oil rig, or shopping mall?
The IoT megatrend is reaching its hands into all facets of industry, getting as far down as ball bearings. Last month, the Harvard Business Review shared how sensors, powered by kinetic motion, are located inside ball bearings and transmit performance information. The manufacturer, a company called SKF Group, also provides 45 different iPad apps so that maintenance crews can monitor 8000 different products. Half a million machines have been connected to the SKF cloud for over two years.
As a result, now we can get an iPad alert to add lubricant—or risk paying for an expensive replacement process.
The Industrial Internet: New Reference Architecture, Lack of Developer Readiness, and Revenue
Hot off the press—13 companies, including Pivotal, joined the AllSeen Alliance, an effort meant to foster standards across IoT.
As reported by Drivers & Controls, Control Design, and others, the Industrial Internet Consortium released their Industrial Internet Reference Architecture. Authored by members from ABB, AT&T, Cisco, Fujitsu, GE, IBM, Intel, RSA, SAP, and other companies, the reference architecture was created to speed development and scale Industrial Internet systems quickly.
Why are these traditionally slow-to-adopt B2B manufacturing companies hiring developers and architects in droves and moving quickly? Better yet, why did SiliconANGLE report that only 50% of developers are ready for IoT? Perhaps it is because there are too many different tools? Well, we techies do like to play with stuff.
No, the answer is money! GE predicted that revenue from its Predix platform, powered by Pivotal, will grow from $1 billion to $5 billion this year—that is 3% of GE’s revenues last year. For those techies out there who follow financial markets, their CEO said, “We’re basically paying for this investment with our own productivity savings. So, as an investor, you kind of get the growth for free.” That is a pretty bold quote and a promise to investors. Among other large players, Pitney Bowes recently came on board as a GE Predix partner. As well, China Telecom is working with GE Predix to accelerate the development of optical networks. Komatsu recently announced working with GE in the same vein. Lastly, BP is going to use the platform to monitor oil rigs.
GE coined the term Industrial Internet and has set the bar with Pivotal products. For example, GE recently tested Pivotal GemFire (now incubating as Apache Geode), proving its massive ability to scale and operate on real-time data compared to other products.
In-Memory & Big Data: Good Bye Databases & Warehouses, I’m Apache Hadoop + Spark + Geode
Given the velocity of the Industrial Internet space, there was also compelling news in the in-memory and big data spaces, which are integrating, converging, and necessary to support the Industrial Internet.
Interestingly and in support of the Hadoop-as-Data-Warehouse debate, Michael Vizard pointed out how Apache Hadoop® PLUS other things (like Apache Spark™) can pose a serious threat to the incumbent data warehouse players. Similarly, it is very clear that Oracle’s database and data warehouse dominance is highly threatened by open source. This is why Pivotal has published concepts like the butterfly architecture, which combines various types of data processing capabilities into one place. No silos.
Now, Apache Spark™ is very popular for good reason—here is a recent, quick overview. But, some folks get confused about Spark replacing Hadoop. Nope. They play well together, as outlined by TechCrunch, Wikibon Analyst George Gilbert, and Bernard Marr in Forbes. At last month’s Spark Summit, many use cases were shared specifically about Spark—companies like Baidu, NBCUniversal, Autotrader, and Autodesk shared stories. Read them. You’ll see where Spark fits.
Both Hadoop and Spark also play well with Apache Geode and our commercial distribution of it, Pivotal GemFire. Pivotal Big Data Suite includes Apache Spark™ and many other components, including Pivotal GemFire, Pivotal Greenplum Database, and Redis. Why all these seemingly similar pieces? Well, as you can read in our Geode Proposal, there are important differences. We explained many aspects of this at the recent In Memory Computing Summit (IMC Summit). Hopefully, the videos will be out soon. Until then, you can check out these video playlists: overview of the Geode product, background, architecture, etc. or using Geode to create a scalable stock prediction system.
Before we mention some of the Geode info shared at IMC Summit, which we helped sponsor, there is a key, core principle to understand regarding memory versus network or disk: if flash memory took 2 days to access data, then disk would take at least one month, and a network trip across the U.S. would take 4 years. This gap holds for all memory versus disk-based systems, and we all need both kinds of systems.
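To make the analogy concrete, here is a minimal Python sketch of the scaling arithmetic behind it. The latency figures are assumed, order-of-magnitude values of the kind commonly quoted for hardware, not numbers from the IMC Summit talks, and the names `ASSUMED_LATENCY_S`, `scale_to_human`, and `humanize` are made up for illustration. The exact outputs depend on those assumed inputs, so they will land in the same ballpark as the rounded figures above rather than match them.

```python
# Assumed, order-of-magnitude access latencies in seconds.
# These are illustrative inputs, not measurements from the newsletter.
ASSUMED_LATENCY_S = {
    "main memory reference": 100e-9,
    "flash (SSD) random read": 150e-6,
    "disk seek": 10e-3,
    "long-haul network round trip": 150e-3,
}


def scale_to_human(latencies, anchor, anchor_human_seconds):
    """Stretch every latency by the factor that maps `anchor` to a human-scale duration."""
    factor = anchor_human_seconds / latencies[anchor]
    return {name: value * factor for name, value in latencies.items()}


def humanize(seconds):
    """Format a duration using the largest unit that fits."""
    units = [("years", 365 * 86400), ("days", 86400), ("hours", 3600), ("minutes", 60)]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    # Anchor the scale so that one flash read "takes" 2 days, as in the analogy.
    scaled = scale_to_human(ASSUMED_LATENCY_S, "flash (SSD) random read", 2 * 86400)
    for name, seconds in scaled.items():
        print(f"{name}: {humanize(seconds)}")
```

Anchoring on a different tier, for example treating a main memory reference as one second, only changes the scale factor. The ratios between tiers stay the same, and those ratios are what make an in-memory data grid attractive for real-time work.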
With that said, our first talk explained how Apache Geode started off as GemFire, and we included a slide on the GemFire journey to date. Pivotal GemFire now has 1000+ customers, and we power portions of every major Wall Street bank, the Department of Defense, travel portals, airlines, trade clearing, online gambling, telecommunications companies, manufacturers, payroll processors, insurance giants, and the largest rail systems on earth. We also covered some of Geode’s performance, like linear scale in general and speed versus Apache Cassandra™. We then outlined the Geode roadmap—with HDFS persistence, off-heap storage, Apache Lucene™ indexes, Apache Spark™ integration, and Cloud Foundry services. The second talk covered the architecture for real-time stock prediction, which you can see here.
Last month also included the Hadoop Summit, which we sponsored too. One of the key questions was about Hadoop crossing the chasm into the mainstream. We all know there is still a road ahead to take the architecture where it needs to go. Yet, it is in process. Outside of making SQL work on Hadoop, one of Pivotal’s strongest capabilities, one of the big hurdles is going to be how companies address the quality of their data.
Big Data Quality: Preparation, Cleansing, Wrangling, Munging, & Gettin’ It Ready
As a recent survey explained, developers spend as much as 90% of their time cleaning data so that it can be analyzed. The top culprits are integrating data stores, combining relational and non-relational data, and the sheer volume. Ultimately, this is a big waste of time, which is why the topic was prevalent in the media as we gathered up the info for this newsletter.
What is the impact? Another report, though it is probably pushing the premise for this metric, claims that data quality can help companies generate 70% more revenue, particularly with sales and marketing data. Ok. Even if that is pushing it, we know data quality is a problem. In fact, it's been a big problem since long before data became big data; now it's just at scale. For a while, some data science gurus were even telling us that data quality no longer mattered, or at least that there were different approaches to getting more quality information from data.
However, as pointed out in this Datanami article, it's possible to approach data quality improvement from an automation point of view. The better we can prepare and improve our source data, the more productive our data scientists and analysts will be.
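As a rough illustration of what automated data preparation can look like, here is a minimal Python sketch of rule-based quality checks. It is not the approach described in the Datanami article. The record layout (`device_id`, `timestamp`, `temperature_c`), the plausible temperature range, and the function names are all hypothetical and chosen only for the example.

```python
from datetime import datetime

# Hypothetical schema for a sensor reading; these field names are assumptions.
REQUIRED_FIELDS = ("device_id", "timestamp", "temperature_c")


def find_issues(record):
    """Return a list of human-readable quality issues for one record."""
    issues = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")

    ts = record.get("timestamp")
    if ts:
        try:
            datetime.fromisoformat(ts)
        except (TypeError, ValueError):
            issues.append("unparseable timestamp")

    temp = record.get("temperature_c")
    if temp not in (None, ""):
        try:
            value = float(temp)
            # Assumed plausible range for this made-up sensor.
            if not -50.0 <= value <= 150.0:
                issues.append("temperature out of range")
        except (TypeError, ValueError):
            issues.append("non-numeric temperature")
    return issues


def split_by_quality(records):
    """Separate clean records from rejected ones, flagging duplicates too."""
    clean, rejected = [], []
    seen = set()
    for record in records:
        issues = find_issues(record)
        key = (record.get("device_id"), record.get("timestamp"))
        if not issues and key in seen:
            issues.append("duplicate reading")
        if issues:
            rejected.append((record, issues))
        else:
            seen.add(key)
            clean.append(record)
    return clean, rejected
```

Keeping the rejected records together with the reasons they failed, rather than silently dropping them, lets an analyst review the edge cases while the routine checks run unattended.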
Humans remain integral to most data quality improvement processes, but more and more we can augment them with automation. There have also been articles, like this one, that talk about using machine learning for self-serve data quality. Along these lines, computer scientists have been looking at evolution and genetic improvement as a way to improve source code. This article explains how such a program took 50,000 lines of source code and sped it up by 70 times. Somewhere, the concept might be applied to data as well.
In any event, this article by Paxata explains their perspective regarding the implications of data quality on IoT. The author believes IoT data will still have quality issues, particularly when it is combined with other data sources. At first, one might think that IoT data is more like log file or clickstream data, but there are other opportunities. While they didn’t explain the use cases in a concrete way, we believe IoT data will be combined with transactional field services data, assignment and dispatch systems, invoices and work orders, mobile and tablet-based behavioral data, video, audio, emails, texts, phone call logs, training, and more.
We think Talend’s CEO captured a great thought—the concept of managing data as a data supply chain. Would you accept poor quality parts from suppliers for a manufacturing process? Well, aren’t we manufacturing data, reports, and data science-driven outcomes? The leader of the open source data integration platform also goes on to give some unique perspectives on open source.
Editor’s Note: ©2015 Pivotal Software, Inc. All rights reserved. Pivotal, Greenplum Database, GemFire and HAWQ are trademarks and/or registered trademarks of Pivotal Software, Inc. in the United States and/or other countries. Apache, Apache Hadoop, Hadoop, Apache Lucene, Apache Cassandra, Apache Geode and Apache Spark are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author