Continuous Integration for Data Science

February 16, 2017 Dat Tran

This is a follow-up post to Test-Driven Development for Data Science and API First for Data Science, focusing on Continuous Integration.

Motivation

Last time we wrote about the importance of test-driven development for data science, especially in the context of what we call smart applications. There are many examples of smart applications. Google’s Inbox, for instance, has a feature called Smart Reply which uses machine learning to suggest three possible answers to your incoming messages.

Figure 1: Google’s Smart Reply feature. Photo source: Techcrunch.

Another instance of a smart app is Apple’s Photos app on the iPhone. When you take a photo, it recognizes faces and places through machine learning and automatically links them so that searching for photos becomes much smarter.

Figure 2: Apple’s smart photo search feature. Photo source: Appleinsider.

What these apps have in common is that they are powered by machine learning features embedded in a piece of software which is continuously updated and delivered to consumers. A common problem is integrating those parts well within a shared team, especially for a more complex product. In this article, we will discuss how we can use continuous integration (CI) to prevent such problems, with a specific focus on data science.

The Problem

At Pivotal Labs, we also help our clients create smart applications with a balanced team concept. A balanced team for us consists of software engineers, designers, product managers and data scientists (see Figure 3). Together, we build meaningful products in an agile and iterative fashion, which means that our piece of software is continuously updated and shipped to customers, so that customers can test it very early and give feedback which is then integrated.

Figure 3: Balanced Team.

An important challenge here is that different team members work on different features at the same time. For example, one team consisting of software engineers and designers might work on the front-end application while another team of software engineers and data scientists might work on the smart feature. The final goal is that every piece of the product should integrate well with the others.

“Team programming isn’t a divide and conquer problem. It is a divide, conquer, and integrate problem. The integration step is unpredictable, but easily can take more time than the original programming. The longer you wait to integrate, the more it costs and the more unpredictable the cost becomes.” Kent Beck - Extreme Programming Explained

Software engineers have faced the integration problem for a long time. They tend to work on a code base which is shared by many developers. The challenge is that each individual or pair works on a separate problem, and integrating each solution into the larger code base can be very difficult, especially the longer they wait.

Continuous Integration

To mitigate the integration problem, software engineers came up with the concept of continuous integration, or CI for short. CI is a development practice that requires developers to integrate code into a shared repository multiple times a day. This code is then automatically tested so that teams can detect problems early.

Figure 4: The CI/CD cycle.

Another approach, which extends the concept of continuous integration, is continuous delivery (CD), which makes sure that the code we integrate is always in a deployable state for users.

Relationship to Data Science

Since machine learning features are also just a part of the larger code base of a software project, continuous integration should be used to prevent integration problems there as well. In the worst case, if we don’t do it, there might be huge integration costs.

In our team, we set up a CI pipeline when putting models in production. A CI pipeline for data science is more about testing the workflow, e.g. that all the scripts, from data cleaning and feature extraction to model exposure, work as expected. The testing part is something that we already discussed thoroughly in one of our earlier posts.
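Such a workflow maps naturally onto Concourse jobs. As a sketch (the repo URI, job names and task file paths below are illustrative assumptions, not our actual setup), each stage of the data science workflow can become its own job, each running its own tests:

```yaml
resources:
- name: data-science-repo
  type: git
  source: {uri: https://github.com/example/data-science-repo.git}

jobs:
- name: test-data-cleaning
  plan:
  - get: data-science-repo
    trigger: true
  - task: run-cleaning-tests
    file: data-science-repo/ci/test-data-cleaning.yml

- name: test-model-training
  plan:
  - get: data-science-repo
    trigger: true
    passed: [test-data-cleaning]
  - task: run-training-tests
    file: data-science-repo/ci/test-model-training.yml
```

Each task file encapsulates the tests for one stage, so a failure in data cleaning blocks the training job from running on that commit.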

Concourse

There are many CI/CD tools out there, such as Jenkins, Travis and many others, each with their pros and cons. In our team, however, we mainly use Concourse. It has many advantages: for example, it treats pipelines as first-class citizens, and it uses Docker to encapsulate tests so that everything is reproducible. A more detailed list of benefits can be found in this blog post.

CI Example for a Smart Application Project

Now we want to illustrate with a simple pipeline how Concourse can be used in practice for a smart app project.

Figure 5: Simple example for a smart app project.

Figure 5 shows the full pipeline of an example project for a smart app. In this project there are two work streams where one stream is working on the machine learning part and the other is working on the application. An application could be for example a web app, an Android app or an iOS app. Therefore we will have two different work streams (Figure 6):

Figure 6: Separate work streams for the application and ML part.

Figure 7 shows that those two streams have different workflows. For the machine learning pipeline, for example, it is important that the model is trained and evaluated before it can be exposed via an API, so there is a dependency between those two steps. This can be handled easily in Concourse via the config file with a one-liner (set the passed option):

- name: test-model-api
  plan:
  # only run against commits that already passed test-model-training
  - get: data-science-repo
    trigger: true
    passed: [test-model-training]

Of course, this is just an example. In practice there might be more steps involved: you might need to do feature extraction, for instance, which is another step that might also involve SQL testing. The same applies to the application work stream; in our case we only build the app.
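For completeness, the task that a job runs can itself live as a separate task file inside the repo. A hypothetical test-model-training task (the Docker image, requirements file and test path are assumptions) might look like this:

```yaml
platform: linux
image_resource:
  type: docker-image
  source: {repository: python, tag: "3.6"}
inputs:
- name: data-science-repo
run:
  path: sh
  args:
  - -exc
  - |
    cd data-science-repo
    pip install -r requirements.txt
    pytest tests/test_model_training.py
```

Because the task runs inside a Docker image, the training tests run in the same reproducible environment on every commit.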

Then at some point those two work streams need to be integrated. Figure 7 also shows that there is an integration job at the end which takes care of it.

Figure 7: Integration of the two work streams.

In Concourse, this can be achieved easily. We just need to get the two different repos and then use a task to run the tests:

- name: integration
  plan:
  # only trigger when both repos have passed their own test jobs
  - get: data-science-repo
    trigger: true
    passed: [test-model-api]

  - get: application-repo
    trigger: true
    passed: [test-application-repo]

  # both repos are available as inputs to the integration task
  - task: build-application-repo
    config:
      platform: linux
      image_resource:
        type: docker-image
        source: {repository: ubuntu}
      inputs:
        - name: application-repo
        - name: data-science-repo
      run:
        path: echo
        args: ["Integrate and run tests here."]

A typical integration could be that, having embedded the machine learning feature into the application, a set of tests is automatically run to check the behavior of the user interface together with the ML feature. Apple, for example, has the QuickType predictive text feature which suggests the next words given the words typed before. In the integration part, we could test whether the output of the suggestion bar is as expected (see Figure 8). There are many other ways we could test the integration part, but this simple example should convey the concept.

Figure 8: Apple’s QuickType predictive text feature. Photo source: iMore.

Finally, the last step is to deploy the application, either manually or automatically depending on your deployment strategy. In our example, we automatically deploy it to Pivotal Web Services, a hosted version of Pivotal Cloud Foundry. Figure 9 shows that we use the cf resource to do this, but in general we are not limited to it. There are many third-party resource types for Concourse, and if one is missing, you can easily build your own resources.
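As a sketch, a deployment job using the cf resource could look like the following (the API endpoint, credential variables, org, space and manifest path are placeholders, not our actual configuration):

```yaml
resources:
- name: pws
  type: cf
  source:
    api: https://api.run.pivotal.io
    username: ((cf-username))
    password: ((cf-password))
    organization: my-org
    space: production

jobs:
- name: deploy
  plan:
  # deploy only builds that passed the integration job
  - get: application-repo
    trigger: true
    passed: [integration]
  - put: pws
    params:
      manifest: application-repo/manifest.yml
```

The put step pushes the application described by the manifest to Cloud Foundry whenever a build has passed integration.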

Figure 9: Deployment of the smart application.

Conclusion

We showed that the integration problem is not only a software engineering problem but is also very important for smart app projects in order to mitigate integration costs, particularly when working in a balanced team environment. We mainly use Concourse as our CI tool due to its simplicity: you can build your own pipeline very easily in a programmatic way.

About the Author

Dat Tran

Dat works as a Senior Data Scientist at Pivotal. His focus is helping clients understand their data and how it can be used to add value. To do so, he employs a wide range of machine learning algorithms, statistics, and open source tools to help solve his clients’ problems. He is a regular speaker and has presented at PyData and Cloud Foundry Summit. His background is in operations research and econometrics. Dat received his MSc in economics from Humboldt University of Berlin.
