API First for Data Science

July 29, 2016

Joint work by Dat Tran (Data Scientist) and Alicia Bozyk (Senior Software Engineer).

Key Takeaways

  • Think about wrapping up your data science model as an API as early as possible
  • Cloud Foundry enables us to reliably expose models as scalable predictive APIs
  • Use a suitable continuous delivery tool to automatically deploy your code
  • Deploy as early and often as possible so users can test it and give regular feedback
  • Test-driven development is not appropriate for all phases of data science, especially the exploration phase at the beginning, where we need to experiment a lot
  • Pairing with fellow software engineers helps in writing better production-ready code

The Problem

During our engagements we often come across business problems which clients want to address in a data-driven manner. Usually, we start by understanding the data, creating useful features from it and then building a model. We cross-validate the model for evaluation. In the past, we would present the results to our clients as PowerPoint slides, which often just ended up unused.

Often a major contributor to models going unused is that the tools used by data scientists differ from those software engineers commonly use to create production-ready applications. Data scientists use R, Python, SAS and/or SQL to solve their problems. Those languages are well established in the analytics community but are not as commonly used by software engineers to create production-ready apps. There are solutions like Shiny, SAS or yhat which help data scientists expose their models, but they are limited in scope: you either have to pay a high price to use them or you end up with vendor lock-in. On the other hand, software engineers can create apps using languages like Java, Ruby or Swift, but in most cases they don't have an understanding of data science techniques, or of languages like R or SAS.

Our data science team has found that a good way to bridge this gap is to follow an API first approach, and has adopted this as one of our core principles. In this blog article, I want to discuss how this can help to create smart apps. I will use an end-to-end smart app example to demonstrate this. The whole example can be found in my repo.

From an Idea to the Smart App

Creating a smart app involves many steps, from data science to building the app itself. But where do we start? Depending on data availability, we can either kick-start the data science part first or lay a foundation to collect enough data.

The Example

In our case, let's assume we have the data at hand and our problem is to recognize handwritten numbers from zero to nine using the famous Mixed National Institute of Standards and Technology (MNIST) database. The MNIST dataset consists of 60,000 training and 10,000 test images, each 28x28 pixels in size.

The primary goal is to convert handwritten input into a format which the computer can understand. For instance, we can draw our numbers on a sketchpad and the output should be the corresponding digit. This is a typical handwriting recognition problem, and the same approach can be extremely useful for many other use cases where we want to recognize not only numbers but also text.

The Exploration Phase

Typically, in our data science engagements we start with an exploration phase during which we experiment with different models and approaches to solve our problem. This phase is usually not test-driven, and we use interactive tools like Jupyter Notebooks to create model prototypes or just to get a visual understanding of the data. In our example, the MNIST problem is a typical classification problem which can be solved with many approaches. In this case, we will use a deep learning model, specifically a multilayer perceptron (MLP), to solve it.

An MLP is a feedforward artificial neural network that is particularly good at learning non-linear relations in the data. Nowadays, neural network models are widely used in computer vision, handwriting recognition and many other areas because they consistently outperform traditional machine learning models on these tasks.

To create our neural network we will use Keras, a deep learning library written in Python that can run on top of either Theano or TensorFlow. Keras is especially designed for fast prototyping, and its simple API allows us to create, train and evaluate neural networks quickly and easily.

First we load the data. Luckily, Keras has some built-in datasets, including the MNIST dataset, which are used to demonstrate Keras’ capabilities. Using this option, we can easily load the data and get the train and test dataset in the appropriate numerical format.

from keras.datasets import mnist

# load the built-in MNIST dataset, already split into train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In reality, this might not be so straightforward. The data we get from our clients is usually very messy, and we spend substantial time understanding, cleaning and transforming it.

Next, we do some transformations, such as rescaling the pixel values to the range zero to one. This helps speed up the training of the neural network later.

from keras.utils import np_utils

# reshape the 28x28 images into flat vectors of 784 pixels,
# use float32 and rescale the pixel values to [0, 1]
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")
X_train /= 255
X_test /= 255

# convert the class vectors (digits 0-9) to one-hot binary class matrices
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)

Then we create the MLP model with two hidden layers, dropout after each hidden layer and ReLU activations. For the output layer we use softmax since this is a multi-class problem.

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(512, input_shape=(784,)))  # first hidden layer
model.add(Activation("relu"))
model.add(Dropout(0.2))
model.add(Dense(512))                      # second hidden layer
model.add(Activation("relu"))
model.add(Dropout(0.2))
model.add(Dense(10))                       # output layer: one unit per digit
model.add(Activation("softmax"))

Afterwards, we train and evaluate our model. As the evaluation metric we use accuracy, the share of correctly identified predictions over all predictions.

from keras.optimizers import RMSprop

# compile the model with a loss suited for multi-class classification
model.compile(loss="categorical_crossentropy",
              optimizer=RMSprop(),
              metrics=["accuracy"])

# train for 20 epochs, tracking validation performance on the test set
model.fit(X_train, Y_train,
          batch_size=128, nb_epoch=20,
          verbose=1, validation_data=(X_test, Y_test))

# compute the final loss and accuracy on the test set
model.evaluate(X_test, Y_test, verbose=0)

The result shows a promising test accuracy of approximately 98%, so we will store this model. Keras stores the network architecture and the model weights separately. Based on the model's performance we could try to improve the result further; in our case the model already performs quite well, so we can use it for our smart app.
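As a minimal sketch of this storage step (the file names here are illustrative), the architecture can be exported as JSON and the weights in HDF5 format:

# store the network architecture as JSON and the learned weights as HDF5
with open("model_architecture.json", "w") as f:
    f.write(model.to_json())
model.save_weights("model_weights.h5")

The model can later be restored with keras.models.model_from_json and load_weights.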

Now we know the input and output of the model, and the question is how we would design an interaction point for this app. We have been working with 28x28 pixel images, which is not optimal from an app point of view: imagine drawing on a sketchpad that small, it would be painful. It therefore makes sense to resize the image drawn on a larger sketchpad to fit our machine learning model (a sketch of this preprocessing step follows below). With those thoughts in mind, we are basically ready to create an API for this problem. At this point, it makes sense to pair with software engineers who can help in writing better quality code.
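To illustrate the resizing step, here is a sketch of how a sketchpad image could be turned into the flat 784-dimensional vector the model expects (the function name and file path are hypothetical; the actual code in the repo may differ):

import numpy as np
from PIL import Image

def preprocess_sketch(path):
    img = Image.open(path).convert("L")          # convert to grayscale
    img = img.resize((28, 28))                   # downscale to the MNIST input size
    arr = np.array(img).astype("float32") / 255  # rescale pixel values to [0, 1]
    return arr.reshape(1, 784)                   # flatten to match the model input

digit = model.predict_classes(preprocess_sketch("digit.png"))[0]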

The Production Phase

Next, we can start to put our code into production. At this stage, we start doing test-driven development, which helps to keep our code base clean and trustworthy. In our repo, you can see that we have a particular directory structure for the production phase: all production-ready code goes into a folder called src. In our example, we have one module to train the model and store its output to Redis (a key-value store), another to expose the model as a RESTful API, and one to consume the API. My Labs colleague Alicia uses sketch.js, a simple canvas-based drawing tool for jQuery, to create the sketchpad.
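As a rough sketch of what such a prediction endpoint might look like, here using Flask (the framework choice and file names are assumptions; the actual code in the repo may differ):

import numpy as np
from flask import Flask, jsonify, request
from keras.models import model_from_json

app = Flask(__name__)

# restore the stored architecture and weights (file names are illustrative)
with open("model_architecture.json") as f:
    model = model_from_json(f.read())
model.load_weights("model_weights.h5")

@app.route("/predict", methods=["POST"])
def predict():
    # the client posts the flattened, rescaled 28x28 image as a JSON list
    pixels = np.array(request.json["pixels"], dtype="float32").reshape(1, 784)
    digit = int(model.predict_classes(pixels, verbose=0)[0])
    return jsonify({"prediction": digit})

if __name__ == "__main__":
    app.run()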

For deployment and testing of our apps, we use Pivotal Cloud Foundry (PCF) Dev, which is a smaller distribution of PCF. Cloud Foundry (CF) enables us to reliably expose models as scalable predictive APIs. We use Concourse CI for auto deployment to CF.

Concourse checks whether a new commit has been made to our git repo and then runs the tests. If the tests pass, our apps are automatically pushed to CF. In our instance, we have two spaces, one for testing and one for production. A production deployment is only triggered when there is a new tagged version of the code. This makes sense because you might want to test your app with a smaller number of users, e.g. via A/B testing, before rolling it out to all users in production.

The Smart App

Finally, here is the app in action:

A live demo is hosted on Pivotal's own Cloud Foundry instance, PWS. Here is the link for the sketch app! The model is not perfect yet, so there are some incorrect recognitions. Try it out yourself! We could improve our model by using a different algorithm, such as a convolutional neural network (CNN), or by increasing the size of the hidden layers. LeCun et al. list the test error rates for many different models on their website.
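For illustration, a small CNN in the Keras API used above might look like the following sketch (the layer sizes are illustrative and not tuned; the input would be reshaped to (n, 1, 28, 28) images instead of flat vectors):

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout, Activation

model = Sequential()
model.add(Convolution2D(32, 3, 3, input_shape=(1, 28, 28)))  # 32 filters of size 3x3
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))  # downsample the feature maps
model.add(Flatten())                       # flatten for the dense layers
model.add(Dense(128))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(10))                       # output layer: one unit per digit
model.add(Activation("softmax"))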

The Conclusion

We hope the idea of creating smart apps has become clearer and that this was useful for you. Do get in touch with us if you want to discuss this further!
