Data Science How-To: Text Analytics-as-a-Service

September 20, 2016 Chris Rawles


At Pivotal, we’ve written about the untapped business value in unstructured data and how we utilize natural language processing (NLP) to help our customers. We continue building upon this work by demonstrating how—after the data exploration, feature engineering, and model building stages—to deploy and operationalize a text analytics model. In this example, we show an approach to deploying a scalable trained sentiment classifier that can also conveniently be used for additional text analytics and other data science tasks.

Specifically, in this post we’ll demonstrate how to:

Serve a classifier as a microservice accessible via a RESTful API
Deploy a Jupyter Notebook (via Jupyter Kernel Gateway) as a microservice to Pivotal Cloud Foundry

If you’d like to jump straight to the code, the GitHub repository is available here.

Data Scientists And End Users: Completing The Analytics Loop

We’ve written about the business impact of deploying machine learning models as a service using a microservice-based approach and API first data science. The full value of data science is realized by operationalizing the data science workflow and exposing model predictions and insights to the end user.

Successful data science models are developed to serve and bring value to an end user—whether the end user is a customer, analyst, or domain expert. End users consume and interact with models in different contexts such as via a web application or a command line API request.

Ultimately, however, no model is perfect and a successful model is an evolving one. The fastest cycle for improving a model is an iterative process of continually gaining user feedback and new data, re-training, and re-deploying over and over again to continuously hone the model. This is something that should be done often and programmatically, as shortening this analytics loop results in better models and more business impact.

Cloud Foundry helps tighten the analytics loop by providing a scalable platform for deploying and managing analytical models. A key benefit of Cloud Foundry is that it lets you bring a model to life without spending energy on routing and domain configuration, load balancing, environment installation, etc. This means data scientists spend more time building models and writing code, which makes data scientists happy and end users even happier.

Deploying The Model

We demonstrate model operationalization on Cloud Foundry by deploying this sentiment classifier. The classifier was trained in a distributed computing environment, using PL/Python in Greenplum Database, on 1.6 million Tweets, with distant supervision for automatic labeling and a logistic regression model for sentiment analysis. The following example uses Jupyter Notebook (via Jupyter Kernel Gateway) for model deployment. We also built a Flask implementation of this example.

The example consists of 3 files:

text-analytics-service-pcf.ipynb – the Jupyter Python notebook applying the model
manifest.yml – instructs Cloud Foundry how to deploy our application
environment.yml – defines the required environment for our app

The Jupyter notebook text-analytics-service-pcf.ipynb reads a trained, serialized Python scikit-learn model and exposes it through an HTTP POST endpoint.
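As a rough sketch of what such a notebook cell could look like: in Jupyter Kernel Gateway's notebook-http mode, a cell that begins with a comment such as `# POST /path` becomes the handler for that route, and the gateway injects the incoming request as a JSON string named REQUEST. The route name /polarity_compute, the {"data": [...]} payload shape, the handle function, and the StubModel class are assumptions made for illustration. In the real notebook the model would be the deserialized scikit-learn classifier; the stub only mimics its predict_proba interface so that the sketch runs on its own.

```python
# POST /polarity_compute
# In notebook-http mode the comment above marks this cell as the route handler.
import json


class StubModel:
    """Stand-in for the deserialized scikit-learn classifier (illustration only)."""

    POSITIVE = {"good", "great", "love", "happy"}
    NEGATIVE = {"bad", "awful", "hate", "sad"}

    def predict_proba(self, texts):
        # Mimics scikit-learn: one [P(negative), P(positive)] row per text.
        rows = []
        for text in texts:
            words = text.lower().split()
            pos = sum(w in self.POSITIVE for w in words)
            neg = sum(w in self.NEGATIVE for w in words)
            p_pos = (pos + 1) / (pos + neg + 2)  # smoothed share of positive words
            rows.append([1.0 - p_pos, p_pos])
        return rows


model = StubModel()

# Kernel Gateway injects REQUEST as a JSON string; fake one here for local runs.
REQUEST = json.dumps({"body": {"data": ["I love this", "this is awful"]}})


def handle(request_json, clf):
    """Parse the request, score each text, and return the JSON response string."""
    body = json.loads(request_json).get("body") or {}
    texts = body.get("data", [])
    scores = [row[1] for row in clf.predict_proba(texts)]
    return json.dumps({"polarity": scores})


RESPONSE = handle(REQUEST, model)
print(RESPONSE)  # what the cell prints becomes the HTTP response body
```

Keeping the scoring logic in a small function rather than directly in the annotated cell makes it possible to exercise the handler from an ordinary notebook session before deploying it.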

Next, we write the manifest.yml file, which instructs Cloud Foundry to call the jupyter-kernelgateway command, exposing our model as a RESTful microservice.
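The manifest itself is not reproduced here. As a hedged illustration of the kind of start command it might contain, the Python sketch below assembles a jupyter-kernelgateway invocation for notebook-http mode. The KernelGatewayApp options shown are the ones documented for Kernel Gateway's notebook-http mode and should be checked against the installed version; the gateway_command helper and the fallback port 8080 are assumptions, not part of the original example.

```python
import os
import shlex


def gateway_command(notebook, port):
    """Build the jupyter-kernelgateway argument list for notebook-http mode."""
    return [
        "jupyter-kernelgateway",
        "--KernelGatewayApp.api=kernel_gateway.notebook_http",  # serve cells as HTTP routes
        f"--KernelGatewayApp.seed_uri={notebook}",  # notebook that defines the routes
        "--KernelGatewayApp.ip=0.0.0.0",  # listen on all interfaces inside the container
        f"--KernelGatewayApp.port={port}",  # Cloud Foundry supplies the port via $PORT
    ]


# Fall back to 8080 (an arbitrary choice) when PORT is not set, e.g. on a laptop.
port = int(os.environ.get("PORT", "8080"))
cmd = gateway_command("text-analytics-service-pcf.ipynb", port)
print(shlex.join(cmd))
```

The printed string is what would go into the manifest's start command; running the same list through subprocess.run is a convenient way to try the service locally first.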

The manifest file specifies instructions and metadata (name, memory usage, disk usage, buildpack, etc.) for pushing an app to Cloud Foundry. The buildpack provides the framework for installing the necessary Python packages using the package managers conda and pip. We list the specific required packages in the environment.yml file.

That’s it! With these 3 files, we can now run cf push to deploy our app to Cloud Foundry.

We can access the classifier with a POST request, which returns a score from 0 to 1, where 0 indicates more negative sentiment and 1 indicates more positive sentiment.
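A client call of this kind might look like the following minimal sketch, which uses only the Python standard library. The service URL, the /polarity_compute route, and the {"data": [...]} payload shape are hypothetical placeholders; substitute the route reported for your own deployment and the payload your notebook expects.

```python
import json
import urllib.request

# Hypothetical route; substitute the URL that cf push reports for your app.
SERVICE_URL = "https://text-analytics.example.com/polarity_compute"


def build_request(url, texts):
    """Create a JSON POST request carrying the texts to score."""
    payload = json.dumps({"data": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(url, texts, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(url, texts), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_request(SERVICE_URL, ["i am so happy today", "worst service ever"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# scores = classify(SERVICE_URL, ["i am so happy today"])  # needs a deployed service
```

Building the request separately from sending it keeps the payload easy to inspect without a live service.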

Finally, we can easily scale our classifier by using cf scale to spin up new instances in response to changes in demand.

Taking A Model Into Production

Our model is now served as a scalable, autonomous microservice. Because the model is decoupled, different users can consume it in different contexts through our API, whether that user is a developer integrating the model into a web application or a business analyst accessing it from the command line. Decoupling the model from the surrounding systems also reduces the complexity of our modeling architecture, allowing us to deploy and update the model in isolation. Our autonomous model can also be easily integrated into a data processing framework such as Spring Cloud Data Flow.

Model Persistence

Prior to operationalization, the data science workflow (data exploration, feature engineering, and model building) is frequently performed in a distributed architecture optimized for machine learning, such as Greenplum Database, Apache HAWQ (incubating), or Apache Spark™. Models developed in such environments can be persisted and deployed using Predictive Model Markup Language (PMML). In addition to PMML, models developed in Greenplum and HAWQ using PL/Python, for example, can be persisted using serialization or other markup languages and then deployed on Cloud Foundry.
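To illustrate the serialization route, here is a minimal sketch of a save-and-restore round trip with Python's pickle module. A scikit-learn estimator would typically be pickled directly; to keep the sketch dependency-free, a plain dictionary of made-up logistic regression parameters stands in for the trained model. The weights, file name, and helper functions are all hypothetical.

```python
import math
import os
import pickle
import tempfile

# Hypothetical logistic-regression parameters, as they might be exported after training.
trained = {
    "intercept": -0.1,
    "weights": {"love": 1.8, "great": 1.2, "hate": -2.0, "awful": -1.5},
}


def save_model(params, path):
    """Serialize the parameters to disk on the training side."""
    with open(path, "wb") as fh:
        pickle.dump(params, fh)


def load_model(path):
    """Deserialize the parameters on the serving side (trusted files only)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def score(params, text):
    """Logistic regression: sigmoid of the intercept plus summed word weights."""
    words = text.lower().split()
    z = params["intercept"] + sum(params["weights"].get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-z))


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "sentiment_model.pkl")
    save_model(trained, model_path)
    restored = load_model(model_path)

print(restored == trained)
print(round(score(restored, "I love it"), 3))
```

Because unpickling can execute arbitrary code, the serving side should only load model files that come from a trusted training pipeline.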

Example of a data science workflow. Model training occurs in Greenplum Database and operationalization occurs in Pivotal Cloud Foundry. The model is accessible via an API request accessed from a Spring Cloud Data Flow data processing pipeline.

Jupyter Kernel Gateway

Jupyter Notebook is an essential tool in the data scientist’s toolkit. Deploying a notebook as a microservice offers the advantage of enabling a data scientist to operationalize her code while staying within the Jupyter environment—a setup often ideal for testing and prototyping.

Deploying a model to production requires crucial steps such as incorporating a security layer using API authentication, embedding data validation checks, supporting exception handling, etc.

Web frameworks such as Flask or Django include authentication support and many other essential components for bringing a model to production and building RESTful APIs.
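The production steps mentioned above (authentication, data validation, and exception handling) can be sketched in a framework-agnostic way as follows. This is an illustration under stated assumptions, not the approach used in the original example: the bearer-token scheme, the API_TOKEN and MAX_TEXTS values, the {"data": [...]} payload shape, and all function names are hypothetical, and a framework such as Flask or Django would normally supply much of this plumbing.

```python
import hmac
import json

MAX_TEXTS = 100  # hypothetical limit on batch size
API_TOKEN = "change-me"  # hypothetical; load from an environment variable or secret store


def authorized(headers, expected=API_TOKEN):
    """Compare the bearer token in constant time to avoid timing leaks."""
    supplied = headers.get("Authorization", "")
    return hmac.compare_digest(supplied.encode(), f"Bearer {expected}".encode())


def validate(raw_body):
    """Return the list of texts or raise ValueError describing the problem."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ValueError("body is not valid JSON") from exc
    texts = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(texts, list) or not texts:
        raise ValueError("'data' must be a non-empty list")
    if len(texts) > MAX_TEXTS:
        raise ValueError(f"at most {MAX_TEXTS} texts per request")
    if not all(isinstance(t, str) and t.strip() for t in texts):
        raise ValueError("every item in 'data' must be a non-empty string")
    return texts


def handle(headers, raw_body, scorer):
    """Map authentication, validation, and scoring outcomes to (status, body)."""
    if not authorized(headers):
        return 401, {"error": "unauthorized"}
    try:
        texts = validate(raw_body)
    except ValueError as exc:
        return 400, {"error": str(exc)}
    try:
        return 200, {"polarity": [scorer(t) for t in texts]}
    except Exception:
        return 500, {"error": "scoring failed"}  # avoid leaking internals to callers


ok_headers = {"Authorization": "Bearer change-me"}
print(handle(ok_headers, '{"data": ["nice"]}', scorer=lambda t: 0.5))
print(handle({}, '{"data": ["nice"]}', scorer=lambda t: 0.5))
print(handle(ok_headers, '{"data": []}', scorer=lambda t: 0.5))
```

Returning a status code and body from one function keeps the checks testable without a running server, whichever web layer ends up wrapping them.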

Next Steps

We are continuing this work by incorporating our model into a real-time text analytics application. Check out our GitHub repository for updates.

Additional Resources

Python Cloud Foundry examples
Cloud Foundry Documentation
Pivotal Blog: Model Scoring as a Service
Sentiment classifier

About the Author

Chris Rawles is a senior data scientist at Pivotal in New York, New York, where he works with customers across a variety of domains, building models to derive insight and business value from their data. He holds an MS and BA in geophysics from UW-Madison and UC Berkeley, respectively. During his time as a researcher, Chris focused his efforts on using machine learning to enable research in seismology.


