In 2015, we can expect the data storage and data science landscapes to mature, consolidate, break our trust, and surprise us. Last year was a big year for big data, but, in 2015, more enterprises will showcase production uses of Apache Hadoop®, demonstrate successes with a broader array of big data technologies, and solidify innovative ways of using data science to improve business outcomes. At the same time, innovators will continue to push machine learning boundaries, introducing and maturing techniques that require less human intervention, like deep learning. As data science practitioners face fewer doubters and are technically empowered to drive new usage of existing and new data sources, they will deliver their share of big, unexpected, and cost-effective wins. There will also be a group of companies that struggle to get Apache Hadoop® right, use data science without the proper guardrails, and either reveal this spectacularly and publicly, or more quietly start competing less effectively against their early-moving, technically thoughtful, and strategic competitors.
Here are the top data science and data-related predictions showing up in our collective crystal ball:
Taking Action on Data, Not Just Storing It: Apache Hadoop® Adoption and Growth Continues
In 2015, we will see more and more companies doing more with the data they have sitting in Apache Hadoop®. They will perform more analytics and run more applications, lighting up dark data and adding greater utility to the data from which they already derive value. This will be partially enabled by further and faster penetration of SQL on Hadoop, as these environments allow more and more data to shift from traditional stores, colocate into one environment, and become accessible to downstream apps and learning algorithms. In addition, use of YARN will grow because it allows new classes of applications to run beyond MapReduce, and more software vendors will package solutions to run on YARN. This will partially be proven as mainstream enterprises (retail, telecom, financial services, travel, etc.), as opposed to internet companies (Amazon, Facebook, Google), make personalized offers to consumers through digital touch points.
The Shift from “One Algorithm Wonders” (Point Solutions) to Data Science Platforms
Within data science, a complex market is evolving, and one analyst recently delivered a snapshot of this “machine intelligence” landscape. There will still be a fervor to hire data scientists, and a number of specialized new apps will power unique use cases in the market, while introducing some of the challenges noted as dangers further below. Pivotal predicts 2015 will end with a diminished set of one-algorithm wonders previously proliferated by VCs. While the market fragmentation these upstarts have created offers enterprises a large group of point solutions for exploring new opportunities to extract value from their data, they simultaneously introduce integration pain points that enterprises will increasingly reject. In 2015, the market will grow savvier about the importance and value of horizontal analytical platforms enabling data transformation, feature creation, model development, operationalization (likely leveraging PMML), and score delivery to applications. Consolidation will occur around vertical plays, driving point solutions into platforms.
There Will Be More News-Worthy Data Science Ethical Failures
Companies across all sectors are becoming increasingly data-rich, and increasingly armed with powerful data modeling tools to extract insights from this data. While high-profile consumer-centric digital players like OkCupid, Uber, and Facebook have experienced public shaming over their potential misuse of consumer data, Pivotal sees strong potential in 2015 for a new class of ethical failures associated with the practice of data science itself. As data science-enabling tools mature and become more accessible to technical practitioners without a proper grounding in statistics, the temptation to power new modeling efforts without sufficient critical rigor will grow. While this broadening access is an important evolution in the maturity of our industry, it introduces a whole new era of potential modeling failures, whether that means mishandling data formats, powering incorrect insights with poor-quality or overly sparse data, or failing to properly draw attention to model accuracies and their implications for operationalization. Pivotal sees particular potential for ethical challenges with the rise of the Internet of Things, but the risk doesn’t stop there or with consumer-facing services. In the next year, as the market explores ethical guidelines, we will see more of these types of failures, and collectively learn from them.
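To make the sparse-data failure mode concrete, here is a minimal, hypothetical sketch (the data is invented) of how a flexible model can fit a handful of noisy observations perfectly while generalizing far worse than it appears to, which is exactly the kind of accuracy gap a careful practitioner must surface before operationalization:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny, sparse dataset: 6 noisy observations of a roughly linear process.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_train = 2.0 * x_train + rng.normal(0, 1.0, size=6)

# Held-out data drawn from the same underlying process.
x_test = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
y_test = 2.0 * x_test + rng.normal(0, 1.0, size=5)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A degree-5 polynomial interpolates the 6 training points exactly...
overfit = np.polyfit(x_train, y_train, deg=5)
# ...while a simple linear model leaves some training error.
linear = np.polyfit(x_train, y_train, deg=1)

print("train MSE (deg 5):", mse(overfit, x_train, y_train))  # essentially zero
print("test  MSE (deg 5):", mse(overfit, x_test, y_test))
print("train MSE (deg 1):", mse(linear, x_train, y_train))
print("test  MSE (deg 1):", mse(linear, x_test, y_test))
```

Reporting only the near-zero training error here would be precisely the kind of failure described above; the held-out error tells the real story.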
The Concept Of Data Arbitration Will Become More Popular
As enterprises continue to amass and curate rich data assets representing their footprint in their respective markets, these assets are understood as rich sources of competitive differentiation, not to mention artifacts of potentially very private enterprise-consumer interactions requiring careful stewardship. In this new Big Data age, there is strong potential for key industries (healthcare, energy, heavy industry, and education, for example) to bring data assets together to power new efficiency-creating insights that benefit the industry as a whole. These discoveries and efficiencies will be blocked by enterprise-level shielding of proprietary data, but enterprises are increasingly aware of the downside to this. The need to bring data of differing pedigrees and ownership together into shared, anonymized modeling processes will drive demand for a new breed of data arbitration technologies and services that allow increased industry-level collaboration. For instance, companies like Fitbit store lots of information about a person’s physical activity, Glooko has very granular data on blood glucose levels and drug administration, and healthcare providers have rich medical histories, lab results, and prescribed medications. Obviously, there is value in bringing these data sources together to power new insights, but companies are still reluctant to share their proprietary data given a lack of tools to support this, among other concerns. We predict the emergence of a new market in 2015 to address this growing need. Privacy-preserving data mining practices, enabling algorithms to roam multiple databases to build models on disparate data sources, will gain popularity, as will third-party data arbitration services, which are common now in digital media. Industry-level attention on data standards, like the utility sector’s Common Information Model, will draw increasing focus.
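As a simplified illustration of the kind of tooling such arbitration services might provide, the sketch below joins two hypothetical data sets (the salt, identifiers, and fields are all invented) on salted-hash pseudonyms so neither party exposes raw identifiers. Real privacy-preserving data mining would go much further, with keyed hashing held by a trusted arbiter, aggregation thresholds, or differential privacy:

```python
import hashlib

def pseudonymize(record_id: str, shared_salt: str) -> str:
    """Derive a join key both parties can compute without exchanging raw IDs.

    A real deployment would use keyed hashing (HMAC) with a key held by a
    trusted arbiter, plus aggregation safeguards; this is only a sketch.
    """
    return hashlib.sha256((shared_salt + record_id).encode()).hexdigest()

SALT = "industry-consortium-2015"  # hypothetical shared secret

# Hypothetical activity data (a wearables vendor) and glucose data (a
# diabetes-management vendor), each keyed by a pseudonym of the same patient.
activity = {pseudonymize("patient-42", SALT): {"daily_steps": 8500}}
glucose = {pseudonymize("patient-42", SALT): {"avg_glucose_mgdl": 110}}

# An arbiter can now join on pseudonyms and model the combined data
# without ever seeing the original identifiers.
joined = {k: {**activity[k], **glucose[k]}
          for k in activity.keys() & glucose.keys()}
print(joined)
```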
Data From The Internet Of Things (IoT) Shows More Ubiquity and More Use Cases
The well-known Harvard Business School professor Michael E. Porter co-authored an article in November 2014, “How Smart, Connected Products Are Transforming Competition.” One of its key tenets is that the expansive use of built-in sensing and computing capabilities is transforming industry structures and the nature of competition, changing the way value is captured and managed, redefining channels, and even forcing companies to ask themselves, “What business am I in?” For example, cell phone companies are already using anonymous, aggregate information for a myriad of purposes beyond allowing two people to communicate.
With this in mind, we will see many additional interesting IoT applications gain adoption in 2015. The concepts of automatons (modern, self-operating machines) and robotics will blend, and new, more basic robotic applications will be brought to life, beyond high-profile examples like self-driving cars. Companies of all types will look to sensors to instrument the physical world, and as such, 2015 will see an amazing richness in creative use cases for the data flowing in. Disney, for example, has prompted guests to wear data-collecting wristbands, and retailers have begun using location-aware apps. Samsung kicked off 2015 with a splash in this space, announcing 90% of its products would be instrumented by 2017, with all products instrumented within five years.
[Big] Data Science Will Shift to the Cloud(s)
Data science practitioners would love to be able to quickly drive ad hoc discovery and build predictive models in temporary cloud development environments, internal or otherwise, that scale to their data volumes and performance requirements for complex modeling. Today, public clouds like AWS, Google, and Microsoft provide easy access to compute environments, but with limited support for truly big data and limited access to sophisticated modeling tools that support distributed architectures. As enterprises scale their Apache Hadoop® infrastructure into private PaaS contexts, and data science tools are certified for these environments, ad hoc cloud support for internal data scientists will become more commonplace. IT leaders will see immediate advantages in tighter control over security, privacy, and compliance, while practitioners will revel in quickly sourced distributed environments that power agile, iterative discovery, with a clearer path to model operationalization in production-managed clouds for ongoing maintenance and scoring. The cloud-enabled shift of IT spend from capex to opex will also appeal to IT leaders in certain sectors. Easy availability of a private cloud for Apache Hadoop® or a massively parallel data architecture will be at the center of this shift.
Data Science Apps Will Pressure Enterprise Data Architectures
Data-driven apps are commonplace, but data science-driven apps are not. Data from many sources are stored in the data lake, as are multiple models that use that data to solve different business problems. The results of these models can be consumed in different ways. The best way to enhance the consumption of data science results, and hence their impact, is to create multiple lightweight apps geared towards specific business roles. For example, a scenario analysis app for strategic planning of marketing spend will be different from a sales monitoring app, although both would use the results of a market-mix model that predicts sales. Our Pivotal Data Labs and Pivotal Labs teams worked on many data science-driven apps this year. Operationalizing these apps also involves managing multiple predictive models, including refreshing them when required and making their scores available through APIs to be consumed by apps. Enterprises are beginning to treat this as a serious focus, but it requires a new way of looking at operationalizing the conclusions drawn from numerous data sets. The enterprise data architecture needs to address these new requirements, and we will see new APIs becoming standard. We will also see standards like PMML, for sharing models across different tools, becoming more popular.
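A minimal sketch of this pattern might look like the following; the registry class, model name, and coefficients here are hypothetical, and a production system would version models and serve scores over HTTP rather than in-process:

```python
from typing import Callable, Dict

class ModelRegistry:
    """A toy registry: one place to manage models and expose their scores."""

    def __init__(self) -> None:
        self._models: Dict[str, Callable[[dict], float]] = {}

    def register(self, name: str, scorer: Callable[[dict], float]) -> None:
        """Register (or refresh) a model; re-registering swaps in a new fit."""
        self._models[name] = scorer

    def score(self, name: str, features: dict) -> float:
        """The API each lightweight app calls to consume model output."""
        return self._models[name](features)

registry = ModelRegistry()

# One underlying market-mix model (invented coefficients), consumable by
# both a scenario-analysis app and a sales-monitoring app.
registry.register("market_mix",
                  lambda f: 1.2 * f["tv_spend"] + 0.8 * f["digital_spend"])

predicted_sales = registry.score("market_mix",
                                 {"tv_spend": 10.0, "digital_spend": 5.0})
print(predicted_sales)  # 16.0
```

The point of the registry is the separation of concerns: apps depend only on the scoring API, so models can be refreshed behind it without touching the apps.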
Realizing Apache Hadoop® Alone Is Not Enough
Our data science teams were often asked this year, “We have Apache Hadoop®. Can this power our data science?” The answer is usually “no,” but a more nuanced response acknowledges the type of data and the data science work intended to be performed. This will be the epiphany for enterprise IT teams in 2015. More and more companies will realize they need more than Apache Hadoop® to power the complex, changing requirements of data science discovery. To power data science apps that will continue to grow in ubiquity, colocate new sets of data for injection into existing models, and run various preprocessing calculations, we need a data lake architecture to house as much data as needed, all in one place. Of course, different tools and algorithms are needed as well. Spark has caught interest, and Tachyon can make Spark even faster. High-scale, real-time apps need in-memory solutions. In other cases, people need a relational layer to run SQL in a massively parallel way. Then there are machine learning libraries like MLlib, MADlib, and GraphLab, which will be leveraged for varying projects as data science tools in a broader toolbox. To power an ambitious data science strategy, a data science-centric platform and tool environment will move beyond Apache Hadoop® alone.
Increased Use and Exploitation of Video, Image, and Sound Data
Perhaps this is a safe prediction; we have seen increased use cases and published many blogs on the topic. As much as 90% of media files, such as video, images, and sound, are considered dark data, where the files are stored but not used, even for analytics. Deep learning on distributed computing platforms is opening brand new opportunities for this data to be used for search and pattern recognition. Today, even when images are analyzed, it is typically done in isolation, meaning the images are not brought together with other structured and unstructured data sources. With the increased popularity of Apache Hadoop®, we will see more models being built on combined data from images, videos, text, and structured data. For instance, Pivotal recently explored ways to improve how Earthwatch tracks and analyzes various data sources to study climate change. One idea was to record video of areas where birds fly and use image detection to identify and track their movements. This information stream would pair well with the other structured and unstructured data the organization uses to build a better understanding of the impacts of climate change.
Making Machine Learning More Accessible: New Tools and Dangers
With McKinsey’s public research on the shortage of data scientists and deep analytical talent, we know the U.S. alone could be short by as many as 190,000 workers by 2018. With this in mind, both university and for-profit educators are making moves to address the gap. Of course, vendors are also increasingly interested in making machine learning available beyond data scientists. Today, the overall data science process includes extraction, transformation, model building, and scoring, and separate tools typically address a single point in this chain. However, the solution needs to be looked at holistically, and building tools for people without the academic background or real-world experience can get companies in big trouble. These tools will continue to mature the collective industry, but they will need to embed safeguards ensuring proper use of data by non-data scientists (per the earlier trend, “There Will Be More News-Worthy Data Science Ethical Failures”).
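The chain described above can be sketched end to end in a few lines; the data and model here are invented, but they show why extraction, transformation, model building, and scoring need to be treated holistically rather than as disconnected point tools:

```python
# Extraction: raw strings as they might arrive from a source system.
raw_rows = ["12,ok,3.5", "7,ok,2.1", "30,bad,9.9"]

def transform(row: str):
    """Transformation: parse, validate, and drop records failing checks."""
    count, status, value = row.split(",")
    if status != "ok":
        return None  # a guardrail a disconnected point tool might silently skip
    return (float(count), float(value))

clean = [r for r in map(transform, raw_rows) if r is not None]

# Model building: one-variable least squares through the origin,
# value ~ slope * count.
xs, ys = zip(*clean)
slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def score(count: float) -> float:
    """Scoring: apply the fitted model to new inputs."""
    return slope * count

print(round(score(10.0), 2))
```

If the transformation stage lived in a separate tool, the bad record could slip into model building unnoticed; keeping the stages in one holistic process makes such guardrails explicit.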
No One Algorithm To Rule Them All
Unlike the one ring to rule them all from The Lord of the Rings, there will be no one algorithm to rule them all. 2015 will bring further establishment of, and plenty of examples for, the fact that no one algorithm or tool can properly solve every problem. For example, neural networks have been around a long time, but deep learning is a fairly new and remarkable advancement, made practical by parallel architectures. Businesses want to bring together data from multiple arenas and systems, and they are driving broader requirements. Data scientists are advancing new algorithms, bringing new data into the fold, taking advantage of parallel processing, and improving model robustness and accuracy. There are more tools and options than ever before.
More on Pivotal’s Data Science Predictions
- See the webinar on Pivotal’s Data Science Predictions for 2015
- Or check out the slides for the webinar on Slideshare
Recent News for Big Data and Data Science Professionals
- Deep Learning and Machine Intelligence Will Eat the World
- Recent Survey Research on Advanced Analytics and Big Data
- You Need an Algorithm, Not a Data Scientist
- How to Scale Native (C/C++) Applications on Pivotal’s MPP Platform
- Distributed Deep Learning on MPP and Apache Hadoop®
- Data Science Case Study: A Healthcare Company’s Journey to Big Data
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author