Spring XD addresses the new demands of big data and real-time data pipelining, but it sets a foundation for much more.
Data Science, Machine Learning, and Predictive Analytics are becoming more common across industries. The most successful and innovative companies are exploring live data streaming scenarios instead of the traditional cycle of batch collection, storage, ETL-like transformation, and offline analysis. Two main forces are driving this change. First, some data is really only valuable in the moment it is collected, because the half-life of its business value degrades quickly. Second, the high-volume, highly diverse data being ingested into the data lake comes from different sources, arrives in different formats, and needs several different transformation, enrichment, and filtering steps. These scenarios become more and more of a reality with Internet of Things, Industrial Internet, social, and mobile use cases.
For these cases, the open source Spring XD project gives architects, data scientists, and developers a robust, productivity-enhancing foundation; as explained below, it can cut the 50-80% of project time typically spent on data wrangling. It simplifies big data projects by orchestrating and automating all steps across multiple data stream pipelines, creating, deploying, and managing many pipelines in a unified, extensible, distributed way.
As Director of Technical Marketing at Pivotal, I work with many customers and prospects on design, architecture, and implementation recommendations, covering how to ingest, manage, and process big and fast data. As the first in a series, this blog article is going to cover streaming scenarios and challenges where the open source Spring XD is a solid choice to lay a foundation.
What is Stream Processing?
Streams process data in motion. Current trends with the Internet of Things (IoT), mobile applications, digital business, big data, data science, and machine learning are driving streaming requirements—typically referred to as real-time streaming, stream computing, and stream processing. Importantly, the business use cases and technical implementations are characterized by the need to act in real-time on large volumes of data in motion—as data is created and ingested. Streaming workflows are often made up of continuous ingestion, various types of data wrangling or transformation, advanced analytical processes, and export steps. With timeframes measured in seconds or even milliseconds, companies process streaming information to act on opportunities before they are lost, using data science and machine learning algorithms to achieve predictive and prescriptive insight wherever possible, as long as it doesn’t impact the real-time requirements for the application in question.
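To make the stages above concrete, here is a minimal sketch (plain Python, not Spring XD itself) of a streaming workflow modeled as composable generators: continuous ingestion, a wrangling transformation, an analytical step, and an export step. The event fields, threshold, and stage names are illustrative assumptions, not part of any real pipeline.

```python
def ingest(events):
    """Continuous ingestion: yield events as they arrive from a source."""
    for event in events:
        yield event

def transform(stream):
    """Data wrangling: normalize types and enrich each event with defaults."""
    for event in stream:
        yield {"value": float(event["value"]),
               "source": event.get("source", "unknown")}

def analyze(stream, threshold=100.0):
    """Analytical step: flag events whose value exceeds a threshold."""
    for event in stream:
        event["alert"] = event["value"] > threshold
        yield event

def export(stream):
    """Export step: collect results (a stand-in for a sink such as HDFS or a database)."""
    return list(stream)

# Wire the stages together, mirroring a source | processor | sink flow.
events = [{"value": "42", "source": "sensor-1"}, {"value": "250"}]
results = export(analyze(transform(ingest(events))))
```

Because each stage is a generator, events flow through one at a time rather than being batched, which is the essential difference between stream and batch processing.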
Stream Processing—Technical Requirements, Industries, and Use Cases
Besides the truly real-time technical requirements of streams, most architects quickly see that, just as there was never a single integration flow, there is never just one enterprise processing pipeline. There are many streams to manage, which calls for a single place to run, manage, and monitor a large number of jobs across a variety of sources, processing steps, and sinks within each pipeline. Data management can span files, file systems, mail, Twitter streams, RabbitMQ integration, JMS, MQTT, Apache Kafka™, MongoDB, Redis, Splunk, Pivotal GemFire continuous queries, and more. Of course, distributed, scale-out environments are needed to handle the processing, and the runtimes should be easy to both port and extend. Importantly, streams should support analytical models and integrate through predictive model markup (PMML) and other mechanisms.
From a business perspective, streaming shows up in scenarios where compliance, loss prevention, cost reduction, or revenue-generating opportunities can only be acted upon in a small window of time. In B2B, streaming is being applied anywhere systems, servers, machines, and sensors are managed: manufacturing, energy, telecommunications, utilities, surveillance, network security, logistics, oil rigs, and healthcare. Use cases include alert monitoring, preventative maintenance, re-routing, optimization, availability, and utilization. Companies are also innovating with the Internet of Things (IoT), making machines and sensors do new things, like using tractors as soil sensors. In B2C-oriented industries such as media, advertising, consumer packaged goods, and retail, there are also use cases based on data streaming from mobile applications, social networks, and clickstreams. Here, streaming helps companies better engage, react to, and support consumers in real time. Of course, financial services companies have historically managed risk, fraud, transactions, markets, and trades with systems that are now ancestors of modern streaming systems.
Spring XD Use Case Example—Fraud
One of the most well-known and highly applicable cases for real-time response is fraud, and the basic scenario and architecture for this need apply to all the scenarios above. Historically, companies might load five sets of fraud-related data into a data warehouse, run daily reports across the data, and make daily decisions. However, thousands of dollars can be lost within 30 seconds, so waiting a full day can have a big impact.
So, it is valuable to instantaneously detect, and even predict, the likelihood of fraudulent activity as financial transactions are made and considered for approval. In addition, the system should continuously learn and improve. Assuming machine learning techniques are applied, a real-time data streaming system would include pipelines that automatically examine extensive historical data, likely stored in the data lake, and learn the patterns most likely associated with fraudulent transactions. As events are ingested, the system would look at the new data and try to match it against the patterns the machine learning algorithm has been trained on. Scoring results that identify possible fraud within each ingested transaction are available in real time, declining suspicious transactions as soon as they are attempted. The prediction algorithm is also constantly re-trained: new events become part of the data lake, and the system gradually improves itself, learning and identifying new patterns. A high-level view of the approach is shown in the diagram below.
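The train-then-score loop described above can be sketched in a few lines of Python. This is a hedged illustration only: a simple statistical outlier rule stands in for a real machine learning algorithm, and the transaction amounts and z-score cutoff are invented for the example.

```python
from statistics import mean, stdev

def train(historical_amounts):
    """'Learn' a pattern from historical data in the data lake:
    amounts far above the historical norm look suspicious."""
    return {"mu": mean(historical_amounts), "sigma": stdev(historical_amounts)}

def score(model, amount, z_cutoff=3.0):
    """Score a new transaction as it is ingested; True means decline as likely fraud."""
    z = (amount - model["mu"]) / model["sigma"]
    return z > z_cutoff

# Train on historical transactions...
model = train([20.0, 35.0, 25.0, 30.0, 40.0, 22.0, 28.0, 33.0])
# ...then score incoming transactions in real time.
decisions = [score(model, amt) for amt in (31.0, 5000.0)]
# In the full system, new events would land in the data lake and the
# model would periodically be re-trained on the enlarged history.
```

A production pipeline would swap the toy rule for a trained model (for example, one built with R, Python, or MADlib) while keeping the same ingest-score-retrain shape.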
Spring XD Solves Foundational Challenges, Saving 50-80% on Data Wrangling
While machine learning algorithms are readily available as built-in libraries in data science languages and tools such as R, Python, and MADlib, the challenges are often related to the integration of diverse data into those algorithms. Teaching algorithms to identify the right patterns is heavily dependent on historical data. Usually this type of data comes from the data lake. However, online evaluation is based on current information that usually comes from NoSQL engines, in-memory grids, and transactional RDBMSs. These different data sources must be combined, adapted, transformed, and enriched before being analyzed.
Whatever steps are necessary for the data streaming pipeline, they need to be performed in a highly available, scalable, and lightning-fast way so that increasing volumes of ingested data do not bottleneck the entire process and results remain available immediately. As data sets constantly change and new devices become reality, agility in creating, maintaining, and modifying the data streams, without recompilation or downtime of the running system, also becomes extremely important.
All this data filtering, preparation, conversion, and integration is known as data wrangling and is currently responsible for 50-80% of the time spent on data science projects. Spring XD automates and simplifies this work by using a simple DSL (and an upcoming web-based graphical interface) to dynamically build data stream pipelines that are fast, scalable, resilient, and extensible. It is built on top of the Spring Integration and Spring Batch projects, which have over a decade of maturity and many production users for critical use cases across various industries.
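As an illustration of the DSL, a wrangling pipeline can be created and deployed from the Spring XD shell with a single pipe-style definition. The stream name and expressions below are hypothetical, while `http`, `filter`, `transform`, and `hdfs` are standard Spring XD source, processor, and sink modules:

```
xd:> stream create --name enrich-orders --definition "http --port=9000 | filter --expression=payload.length()>0 | transform --expression=payload.toUpperCase() | hdfs" --deploy
```

Because the pipeline is declared rather than compiled, it can be undeployed, modified, and redeployed without touching the running system, which is exactly the agility the preceding paragraph calls for.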
Spring XD also integrates dozens of different solutions into a concise data stream pipelining and orchestration system. It can delegate distributed processing to Spark Streaming, for example, and feed Spark's response directly into the next steps of the data pipeline. Those next steps could include running a machine learning algorithm, updating an in-memory data grid, or simply delivering information to a traditional database. The same approach can be applied to a variety of different tools, scripts, languages, protocols, and platforms. A high-level view of the components is shown below; they include JDBC, JMS, MQTT, PostgreSQL, Apache Kafka™, MongoDB, Project Geode, Pivotal GemFire, Pivotal HD, Apache Hadoop™ and HDFS, Pivotal HAWQ, Pivotal Greenplum Database, RabbitMQ, and Redis. A complete list of sources, sinks, and processors is available in the Spring XD Reference Documentation.
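One feature that helps when orchestrating many pipelines is the tap, which lets a new stream observe an existing one without modifying it. Assuming a running stream named `enrich-orders` (a hypothetical name for this example), a tap that counts its messages with the built-in `counter` sink could look like:

```
xd:> stream create --name order-analytics --definition "tap:stream:enrich-orders > counter" --deploy
```

Taps are how Spring XD attaches real-time analytics to an operational pipeline without changing or redeploying the pipeline itself.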
In future articles within this series, we will be using the built-in domain specific language (DSL) to configure Spring XD (no code required). We will explore data streaming use cases, incorporate advanced, real-time analytics scenarios and integrate with Apache Spark and other tools.
We will also cover how Spring XD runs and manages various data workloads piped to and from anywhere, including Hadoop, and how its Java runtimes run on Cloud Foundry, Docker, and other cloud environments in a fault tolerant, horizontally scalable way.
- Explore Spring XD.
- Read more about the Pivotal Big Data Suite and the data capabilities mentioned above.
- Read additional blog articles on big data and data science.
Editor’s Note: ©2015 Pivotal Software, Inc. All rights reserved. Pivotal and HAWQ are trademarks and/or registered trademarks of Pivotal Software, Inc. in the United States and/or other countries. Apache, Apache Hadoop, Hadoop, Apache Spark, Apache Kafka and Apache Ambari are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author