How Data Scientists Can Tame Jupyter Notebooks for Use in Production Systems

July 12, 2018 Timothy Kopp

Uncounted pixels have been spilled about how great Jupyter Notebooks are (shameless plug: I've spilled some of those pixels myself). Jupyter Notebooks allow data scientists to quickly iterate as we explore data sets, try different models, visualize trends, and perform many other tasks. We can execute code out-of-order, preserving context as we tweak our programs. We can even convert our notebooks into documents or slides to present to our stakeholders.

Jupyter Notebooks help us work through a project from its earliest stages to a point where we can say a great deal. "Yes, we now know which demographics are most responsive to your advertisements." "Yes, we can build a model and expect it to give you useful predictions." But what happens when we want to say, "Here is an artifact that will generate these predictions when I am gone"? Or, "Here is a model that you can integrate with your other analytics systems"? Because of their interactive nature, Jupyter Notebooks require a person to drive them. While Jupyter has built-in facilities to convert a notebook to an executable script, this is rarely sufficient in practice.

In this post, I'll present a tool I've created for using Jupyter Notebooks to create and modify production-ready code for data science applications.

Command-line Arguments: A Motivating Example

A common task when productionalizing code originally developed in a notebook is integrating it with the environment in which it will run. Often we want our program to be executable from the command line so that it can be run by tools like cron and Concourse. That almost always means accepting and parsing command-line arguments and reporting errors in them. Most languages have built-in utilities for doing this, such as Python's argparse, but one usually doesn't write a Jupyter Notebook expecting it to accept command-line arguments.
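To make the boilerplate concrete, here is a minimal sketch of the kind of argparse code a script version typically gains. The argument names (input_path, --model, --threshold) and the parse_args helper are hypothetical and chosen only for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments for a hypothetical scoring job."""
    parser = argparse.ArgumentParser(
        description="Score a data set with a trained model."
    )
    # Positional argument: the file to score (hypothetical name).
    parser.add_argument("input_path", help="CSV file to score")
    # Optional arguments with defaults, so cron jobs can omit them.
    parser.add_argument("--model", default="model.pkl",
                        help="path to a serialized model")
    parser.add_argument("--threshold", type=float, default=0.5,
                        help="decision threshold")
    # argparse prints a usage message and exits on malformed input.
    return parser.parse_args(argv)


# Passing a list instead of reading sys.argv makes the parser easy to try out.
args = parse_args(["data.csv", "--threshold", "0.8"])
print(args.input_path, args.model, args.threshold)
```

None of this is meaningful inside a notebook, where there is no command line to parse, which is exactly why it tends to live only in the script copy.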

A solution to this problem is to maintain two versions of our code. We convert our notebook to a script using the built-in utility, and add in the command-line boilerplate. When we change the code in the notebook, we copy those changes over to our script version, a process prone to human error and forgetfulness. If we're tracking changes in the script version with a version control system like git (something that's messy to do with a notebook, given Jupyter's JSON file format), we have to manually inspect the different commits to know the version to which our notebook corresponds. If we want to match our notebook up to a different version, we have to manually copy the changes over.

Manually maintaining a command-line version and a notebook version of a codebase isn't the worst thing in the world. But what happens when the difference between notebook and production isn't command-line execution, but interfacing with a database or serving an API? Maintaining separate, slightly different "production" and "development/notebook" versions of our program quickly becomes a nightmare.

Solution: Automate the Conversion with nbformat

Since our workflow needs both versions of the code, it would be great if we could maintain them in the same file. Ideally, we could execute the code both as a notebook and as a standalone script, sharing common code and documentation but behaving differently depending on how it is run. This was my goal when writing a pair of scripts I've named notebook-tools, which use the nbformat library. While these tools are specifically for Python notebooks, the idea is easily applied to other programming languages.

We define a small annotation syntax on top of Python. An annotated file is still valid Python, so a standard Python interpreter can execute it, but our tools can also parse it to convert between Python scripts and Jupyter Notebooks in either direction. The syntax consists of four elements:

markdown cell
general code cell
Jupyter code cell
script code cell

Markdown Cell

A markdown cell is denoted by a multiline Python string. A string literal that is neither assigned to a name nor used in a call is evaluated and then discarded by the Python interpreter, so encoding markdown cells this way does not change what the program does when it runs.

"""
# This is a Markdown cell
It can encode \LaTeX and everything!
"""

General Code Cell

A normal code cell is denoted by "#>". The lines following it are treated as Python to be executed no matter the context.

#>
print("This code will be executed in Jupyter and when run as a script")
    
#>
print("So will this, but in a notebook, it will be in its own cell")

Jupyter Code Cell

This is the tricky one. A Jupyter code cell starts with "#nb>" ("nb" stands for "notebook"). Every following line that belongs to the same cell should start with "#", the Python comment character, so that the interpreter skips notebook-only code when the file is run as a script.

#nb>
#print("I'll only be executed when converted to a Jupyter notebook")

Script Code Cell

Script code cells are the complement of Jupyter code cells: they are executed when the file is run as a script but not in a notebook. We denote a script code cell with "#py>". When the file is converted to a notebook, the tool comments out all of the code in the script code cell. This way, the cell can be viewed and even executed in the notebook, but its code has no effect. This is important if you're accustomed to mindlessly executing cells in a row until you reach the one in which you are interested.

#py>
print("I will execute when run as a script, but my notebook cell will be commented out")
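Putting the four cell types together, a complete annotated file might look like the sketch below. The variable names and the data are made up for illustration. Because every annotation is either a bare string or a comment, the file runs unchanged in a standard Python interpreter.

```python
"""
# Summarize a list of values
This markdown cell is a bare string, so running the file as a script ignores it.
"""

#>
values = [3, 1, 4, 1, 5]
total = sum(values)

#nb>
#print("exploring interactively:", values)

#py>
print("total =", total)
```

Run as a script, this executes only the general cell and the script cell. The "#nb>" line stays a comment.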

Using These Tools in a Production Workflow

These tools make developing a data science application that runs in production much easier. We can seamlessly switch between the notebook and script formats. One moment we're debugging in Jupyter, the next we're submitting a dozen long-running jobs from the command line.

All of this is enabled with two scripts:

# Convert notebook to executable Python script
$ to-script my-cool-notebook.ipynb my-production-script.py   

# Convert a script enriched with the specified format
# to a notebook
$ to-notebook my-production-script.py my-notebook-for-debugging.ipynb
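To give a feel for what a script-to-notebook conversion involves, here is a simplified sketch. It is not the notebook-tools implementation: the names split_cells, to_notebook, and MARKERS are hypothetical, and it builds the notebook JSON by hand with the standard library instead of using nbformat. It also assumes that markdown cells open and close with triple quotes on their own lines and that anything before the first marker can be ignored.

```python
import json

# Cell markers from the post, mapped to hypothetical internal labels.
MARKERS = {"#>": "code", "#nb>": "notebook_only", "#py>": "script_only"}


def split_cells(text):
    """Split annotated script text into a list of (kind, lines) pairs."""
    cells, kind, body, in_markdown = [], None, [], False

    def flush():
        nonlocal kind, body
        if kind is not None:
            while body and not body[-1].strip():  # drop trailing blank lines
                body.pop()
            cells.append((kind, body))
        kind, body = None, []

    for line in text.splitlines():
        stripped = line.strip()
        if in_markdown:
            if stripped == '"""':  # closing quotes end the markdown cell
                in_markdown = False
                flush()
            else:
                body.append(line)
        elif stripped == '"""':  # opening quotes start a markdown cell
            flush()
            kind, in_markdown = "markdown", True
        elif stripped in MARKERS:
            flush()
            kind = MARKERS[stripped]
        elif kind is not None:  # text before the first marker is ignored
            body.append(line)
    flush()
    return cells


def to_notebook(cells):
    """Build a notebook as a plain dict in the nbformat 4 JSON layout."""
    nb_cells = []
    for kind, lines in cells:
        if kind == "markdown":
            nb_cells.append({"cell_type": "markdown", "metadata": {},
                             "source": "\n".join(lines)})
            continue
        if kind == "notebook_only":  # restore code hidden from the script
            lines = [ln[1:] if ln.startswith("#") else ln for ln in lines]
        elif kind == "script_only":  # neutralize script code in the notebook
            lines = ["#" + ln if ln.strip() else ln for ln in lines]
        nb_cells.append({"cell_type": "code", "execution_count": None,
                         "metadata": {}, "outputs": [],
                         "source": "\n".join(lines)})
    return {"cells": nb_cells, "metadata": {},
            "nbformat": 4, "nbformat_minor": 4}


SAMPLE = '''"""
# Title
"""

#>
x = 1

#nb>
#print(x)

#py>
print("script")
'''

notebook = to_notebook(split_cells(SAMPLE))
print(json.dumps(notebook, indent=1))
```

This sketch does not record which notebook cells came from "#nb>" or "#py>" markers, so it cannot convert the notebook back to the annotated script. A real round-trip tool needs some way to preserve that information.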

A motivated data scientist who buys into this workflow completely could even use Git hooks to run the conversion automatically each time a particular Git command is run.
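As one possible shape for such a hook, the sketch below regenerates a script for every staged notebook. It makes several assumptions: the hook is a pre-commit hook, to-script is on the PATH and takes the notebook path followed by the script path as shown above, and each generated script sits next to its notebook with a .py suffix. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def staged_notebooks():
    """Return staged .ipynb files that were added, copied, or modified."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [Path(p) for p in result.stdout.splitlines() if p.endswith(".ipynb")]


def conversion_command(notebook):
    """Build the to-script invocation for one notebook."""
    notebook = Path(notebook)
    # Assumption: the script lives beside the notebook, e.g. model.ipynb -> model.py.
    return ["to-script", str(notebook), str(notebook.with_suffix(".py"))]


def main():
    """Convert each staged notebook and stage the regenerated script."""
    for notebook in staged_notebooks():
        command = conversion_command(notebook)
        subprocess.run(command, check=True)
        subprocess.run(["git", "add", command[-1]], check=True)
    return 0
```

To use it, one would save this as an executable .git/hooks/pre-commit file with a Python shebang line and end the file with a call such as raise SystemExit(main()).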

Jupyter Notebooks are a boon to data scientists, helping us quickly get from the exploratory stages of a project to a proof-of-concept. By leveraging the nbformat library, we can continue to use this tool effectively as we transition our project into a production data science application. Even better, we can develop our applications with a mind for production right from the start.

About the Author

Tim Kopp is a senior data scientist at Pivotal, where he works with customers to build and deploy machine learning models to leverage their data. He holds a PhD in computer science from the University of Rochester. As a researcher, Tim developed algorithms for inference in statistical-relational machine learning models.
