Joint work by Chris Rawles and Jarrod Vawdrey.
If you are using data science for only one-time, ad-hoc analysis, then you are doing it wrong.
There is no doubt that companies can benefit greatly from this type of one-time data science exercise and most start here. However, much more value is created when data science can be applied in real-time scenarios and in an ongoing manner. We can’t just build a machine learning (ML) model and share the insights, we have to go to the next step and operationalize it, making it part of the fabric of our business processes and affecting outcomes in real-time. For example, what becomes possible when we can score human movement in real-time—like a system that can tell you that someone is currently running or moving at 30+ MPH when they shouldn’t be or just fell down on the floor.
In this post, we lay out the foundational components—data ingestion, data storage, model training, and model scoring—for real-time data science and for operationalizing the entire modeling pipeline, all within the context of microservices deployed to Pivotal Cloud Foundry. We also discuss why and when operationalizing real-time data science models is most crucial, industries which gain the most from real-time data science, and introduce a real-time data science pipeline centered around personalized human activity classification.
Operationalizing Data Science Models
As data scientists, we spend much of our time preparing and analyzing data, building machine learning models, running experiments, and writing code—all requirements for a successful data science project. The crucial, concluding step of successful data science project is model operationalization.
Model operationalization is the process of implementing a data science pipeline, including data ingestion, data storage, model training, and model scoring, into the real world and applying it to real data in an automated, low latency manner—affecting business activities as they are happening. Predictive models need to interact, grow, and come alive, often in real-time, to actually induce change and drive action. And, model operationalization is what brings out the full value of a data science workflow. To that end, production-ready, high-quality code is a key requirement for any data science workflow, and crucial steps must be taken to harden data flow pipelines, embed data validation checks, support exception handling, and ensure model validation. They should also run on a Cloud Native stack for the purposes of reliability, accessibility, and scale.
Applications Of Real-time Data Science
While there are many non-real-time data science use cases, real-time data science is a requirement for many applications including the Internet of Things (IoT), fraud detection, risk calculation, health-related alerts, network analysis, marketing personalization, customized rewards programs, and more.
Within the IoT realm, our heavy industry customers strive towards zero unplanned downtime, and we have built predictive models to target their high-risk equipment. For example, the Pivotal Data Science team recently worked with a major oil and gas customer to implement a mud motor failure prediction model. Operationalizing this predictive model has significant impact— according to The American Oil & Gas Reporter, mud motor failure could account for 35% of non-productive time and cost $150,000 per incident. The Pivotal Data Science team has also tackled many other IoT problems—detecting and tracking jet-engine degradation using sensor data, predictive maintenance in automobiles, automobile driver identification, virtual machine capacity forecasting, a connected car pipeline, and more.
Regardless of the domain, enterprises realize tremendous value in operationalizing data science models, incorporating new data and responding appropriately—all in real time.
Real-time Data Science: Building And Scoring A Personalized Activity Model
To make data science operational on a real time basis, we’ve created an example pipeline for the entire, real-time data science stack—data ingestion, data storage, model training, and model scoring. The pipeline demonstrates real-time data science using Pivotal Cloud Foundry (PCF), Pivotal Big Data Suite, Spring Cloud Data Flow, Node.js, RabbitMQ, and Python-based open source machine learning. The pipeline builds and scores a customized “personal activity recognition model” using streaming accelerometric sensor data from a smartphone. To put it into business scenario parlance, this pipeline would allow us to evaluate and score almost any feed of streaming data to drive real time action.
Figure 1. The data science pipeline for personalized real-time activity classification.
The Real-Time Modeling Pipeline
In Figure 1, data is created by the accelerometer on your mobile phone and a mobile app—this is acting as a proxy for any IoT or similar app with streaming data. From there, the data is sent via a TCP channel at 30 Hz through the WebSocket protocol. This data lands on a Node.js endpoint. We could have chosen a faster rate for this feed – 30 Hz was determined to be optimal based on research prior to operationalization. From there, the data moves to RabbitMQ. RabbitMQ is a message passing interface that publishes data through an exchange to the remaining services.
To develop our model we first need a small cache of data to train on. We decided to cache this data in Redis however we could have used Pivotal GemFire, which is especially useful for high-volume and latency-sensitive systems. Redis does not natively consume data from RabbitMQ so we deploy a Spring-XD stream via an API call to pipe the data from RabbitMQ into Redis. As depicted in Figure 2, the data that is piped to Redis is the training data, and it is used as input for the feature engineering and machine learning components, which are written in Python as a microservice deployable to Cloud Foundry using Ian Huston’s Anaconda Python buildpack.
Figure 2. The training workflow. The feature engineering and model training are contained in a Pivotal Cloud Foundry (PCF) Python application.
Our training service accesses each training batch from Redis. These are not batch data uploads but specific groups of time-series data used in the near real-time portion of the architecture. Then, it builds model features, and trains a machine learning model using the Python scikit-learn machine learning library.
The PCF training application workflow consists of a feature engineering stage and a model training stage. Feature engineering is performed using the Python packages numpy and scipy. During feature engineering, a one second moving window is applied and a bandpass filter is subsequently applied to each window. For each window, features are then generated using time-domain summary statistics and the frequency domain Fourier Transform coefficients. Finally, a Random Forest model is trained on the data and the resulting model is stored in Redis.
In addition to the Random Forest algorithm we tested Support Vector Machines, Logistic Regression, Naive Bayes, but the Random Forest model was chosen for operationalization as it demonstrated the best performance at identifying different human activities. It is important to have the ability to rapidly test multiple models during the R&D stage as each approach has tradeoffs between accuracy, computation time, and interpretability. Finally, we chose 20 seconds per activity for the training stage as it generated optimal tradeoff between performance and practicality.
Figure 3. Learning Curve for an individual user showing accuracy as a function of activity training time.
Figure 4. The model scoring workflow. The model scoring PCF application is provided as a service via an API call.
The architecture’s real-time component is shown in Figure 4—the model scoring application. This app has two components. First, the feature engineering component is fed by the streaming input window data from RabbitMQ, and this then feeds the prediction component. The trained model is also retrieved from Redis and scores each 1 second window. At the end, the scoring application is accessed via an API call where it applies the feature engineering process described above and outputs an activity score upon request. With this API, any external app can request the current status of movement activity for a specific user, and get the most likely state based on the highest score.
Scoring-as-a-Service With Cloud Scale
By providing the scoring services as an API, there are two outstanding benefits. First, multiple applications can access the scoring application, even those outside the scope of the initial system. Second, exposed API endpoints allow for autonomy—the application can go through its own set of development iterations without requiring the deployment of other apps.
By implementing model training as a service on Pivotal Cloud Foundry and its Elastic Runtime Services, we have implemented a highly scalable approach as well. For example, when there are more sensors to train on, more data collected, or more real-time queries, we could scale up our pipeline by just spinning up more instances of the apps.
Heading To Strata?
Don’t miss co-author Chris Rawles speak on this same topic:
The Internet of Things: How to do it. Seriously!
11:50am–12:30pm Wednesday, 03/30/2016
Location: 230 B
This framework applies broadly and can incorporate any streaming data source—from machine sensor data to unstructured text data. In our next posts, we take a deep dive into the real-time data science architecture, including both the training and scoring applications. We will also show how to operationalize models by taking prototypical code into production.
- Pivotal Cloud Foundry and Cloud Foundry Foundation
- More blog articles from Pivotal’s data science team
- More blog articles on Redis and RabbitMQ
About the Author