Data Science with Python & JupyterHub on Kubernetes - Part 2

January 7, 2021

In a previous post, we discussed the advantages of running JupyterHub on Kubernetes. We also showed you how to install a local Kubernetes cluster using kind on your Mac, as well as how to install the JupyterHub Helm chart on a Kubernetes cluster.

In this post, we will focus on the experience of the developers who will be leveraging our service to develop new models using scikit-learn or to perform calculations and transformations on large datasets using pandas. To illustrate the value that Jupyter Notebooks and JupyterHub provide in a multiuser environment, we will clone a Git repository containing two example Jupyter Notebooks that we can work with.

Using JupyterHub

Each user who accesses JupyterHub gets their own workspace, complete with a single-user Jupyter Notebook server running the JupyterLab interface. To demonstrate the capabilities of JupyterHub and Python, we will check out the following sample notebooks, which we have already written and executed:

  • industry-revenue-analysis.ipynb – Analysis of historical financial data organized by industry that leverages the pandas library
  • ml-stock-predictor-knn-v4.ipynb – Machine learning (ML) based on revenue data from public financial statements that leverages the scikit-learn library for Python

Note: Each time a user logs in to JupyterHub, a pod is spawned for that user (if one is not already running) and a 10 GB persistent volume is mounted as the user's home directory.
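The volume size is controlled by the Zero to JupyterHub Helm chart's singleuser.storage.capacity value (10Gi is the chart default). As a minimal sketch, and assuming the hypothetical release name jhub for the jupyterhub/jupyterhub chart installed in Part 1, you could override it like this:

❯ helm upgrade jhub jupyterhub/jupyterhub -n jupyter --reuse-values \
    --set singleuser.storage.capacity=20Gi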

  1. In Part 1, we showed you how to expose the JupyterHub proxy-public Kubernetes service locally on your Mac using port forwarding. If you have not already done this, execute the following command:
❯ kubectl port-forward -n jupyter svc/proxy-public 8080:80 &
[2] 39859
Forwarding from 127.0.0.1:8080 -> 8000
Forwarding from [::1]:8080 -> 8000
  2. To access the JupyterHub web UI from your Mac, open a web browser and enter localhost (or the localhost IP, 127.0.0.1) with the port specified above (8080): http://localhost:8080/. JupyterHub is running with a default dummy authenticator, so entering any username and password combination will let you into the hub.

Jupyter login screen

Once you are authenticated, you will be brought to the workspace screen, which allows you to open a terminal inside the container, run interactive Python in a console, or create a new Python 3 Jupyter notebook. You can also get interactive help or browse the local container's directory structure.

JupyterHub UI

  3. Click the Terminal button under Other on the main workspace screen. At the terminal prompt, type the following command:
git clone https://github.com/tkrausjr/data-science-demos.git data-science

Running git clone in a terminal in JupyterHub

Running the Machine Learning Notebook

To jump-start your data science learning, two sample notebooks are included in the data-science directory created by your git clone operation in the previous step. The repositories and sample datasets are available here: https://github.com/tkrausjr/finance-analysis.git.

  1. Open the /data-science/jupyter-hub directory.

  2. To run the machine learning notebook, double-click the ml-stock-predictor-knn-v4.ipynb file. You can run each cell individually by clicking in the cell and then hitting the Run (▶) button or using the keyboard shortcut Shift + Enter. To run all cells from top to bottom (in other words, the entire program), go to the menu and choose Run –> Run All Cells.

This notebook aims to classify companies as high growth or low growth according to their historical annual revenue growth, using supervised learning with the k-nearest neighbors (KNN) classification algorithm. To do so, it will (a condensed code sketch follows the list):

  • Using the pandas library, create a data frame from a CSV containing real public companies’ annual sales figures and stock price data.
  • Clean up the data frame, then calculate and create new columns representing year-over-year sales growth. In our classification problem, the annual revenue growth figures become the features, or attributes, that are fed into our ML model.
  • Single out the highest-growth companies by labeling each company with a 0 or 1 to represent low growth or high growth, respectively. In supervised learning, we have to label our data so the algorithm knows which answers to base its learning on.
  • Visualize and explore the data with the pandas and seaborn libraries.
  • Train, test, and split the data with the scikit-learn library.
  • Run a KNN ML model against the dataset and evaluate it using the scikit-learn library.
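To make these steps concrete, here is a minimal sketch of the same workflow. The file name, column names, and the 20% growth threshold are illustrative assumptions; the actual notebook defines its own.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Hypothetical CSV and column names -- the notebook defines its own
df = pd.read_csv("annual-sales.csv")

# Year-over-year revenue growth columns
df["growth_2017"] = df["rev_2017"] / df["rev_2016"] - 1
df["growth_2018"] = df["rev_2018"] / df["rev_2017"] - 1
df["growth_2019"] = df["rev_2019"] / df["rev_2018"] - 1

# Label from the most recent year's growth (1 = high growth, 0 = low growth);
# the earlier years' growth figures are the features
df["high_growth"] = (df["growth_2019"] > 0.20).astype(int)
X = df[["growth_2017", "growth_2018"]]
y = df["high_growth"]

# Train/test split, then fit and evaluate a KNN classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))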

The output of the classification report in cell 36 shows us that although the model seems to be working, it is actually just predicting the majority class, which is "0", every time. To make this model work as a classification problem, we would need to do some additional work, namely undersample the majority class or oversample the minority class in order to create a more balanced dataset. A minimal sketch of the undersampling approach follows.
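This sketch uses plain pandas to downsample the majority class; the tiny frame and the high_growth label column are illustrative, not taken from the notebook.

import pandas as pd

# Illustrative, imbalanced frame: far more low-growth (0) than high-growth (1) rows
df = pd.DataFrame({
    "rev_growth":  [0.02, 0.01, 0.03, 0.04, 0.02, 0.55, 0.60],
    "high_growth": [0,    0,    0,    0,    0,    1,    1],
})

majority = df[df["high_growth"] == 0]
minority = df[df["high_growth"] == 1]

# Randomly draw from the majority class until the two classes are the same size,
# then shuffle the combined frame
balanced = pd.concat([
    majority.sample(n=len(minority), random_state=42),
    minority,
]).sample(frac=1, random_state=42)

print(balanced["high_growth"].value_counts())  # now two rows of each class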

Running the Data Analytics Notebook

To run the pandas data analysis notebook, double-click the industry-revenue-analysis.ipynb file and run the cells just as you did for the machine learning notebook: individually with Shift + Enter, or all at once via Run –> Run All Cells.

This notebook will do the following (a condensed code sketch follows the list):

  • Using the pandas library, create two data frames from two separate CSVs containing real public companies’ annual sales figures and industry sector data.
  • Merge the two data frames using “ticker” as the key value.
  • Aggregate the historical revenue data by industry and year and create a pandas pivot table.
  • Calculate the annual percentage change in revenue per industry and sort from highest to lowest for the most recent period.
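Here is a minimal sketch of those four steps. The file and column names are illustrative assumptions; only the "ticker" merge key comes from the notebook.

import pandas as pd

# Hypothetical file and column names -- the notebook defines its own
sales = pd.read_csv("annual-sales.csv")        # columns: ticker, year, revenue
sectors = pd.read_csv("industry-sectors.csv")  # columns: ticker, industry

# Merge the two data frames using "ticker" as the key
merged = sales.merge(sectors, on="ticker")

# Aggregate revenue by industry and year into a pivot table
pivot = merged.pivot_table(
    index="industry", columns="year", values="revenue", aggfunc="sum"
)

# Annual percentage change per industry, sorted high to low for the latest year
growth = pivot.pct_change(axis="columns")
latest_year = growth.columns[-1]
print(growth[latest_year].sort_values(ascending=False))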

Multiuser Support

When another data scientist wants to use the platform, they just need to log in to JupyterHub, and it will take care of the rest.

When a second data scientist (in this case, myself, Thomas) logs into JupyterHub, a second persistent volume claim, persistent volume, and pod will be provisioned for the new user.

$ kubectl get pvc -n jupyter
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
claim-admin   Bound    pvc-6e02d65b-c47a-4cf4-a1ad-489dc7cc49de   10Gi       RWO            standard       45m
claim-john    Bound    pvc-3c1ee9a3-68ab-4d71-a1f4-8cf8c04fb975   10Gi       RWO            standard       109s
hub-db-dir    Bound    pvc-b0a0657d-a9f8-4a17-9002-2a3c8f2cade6   3Gi        RWO            standard       88m
$ kubectl get po -n jupyter -l component=singleuser-server
NAME            READY   STATUS    RESTARTS   AGE
jupyter-admin   1/1     Running   0          73m
jupyter-john    1/1     Running   0          2m15s

Summary

In this post, we have shown how deploying JupyterHub on top of Kubernetes provides various benefits, including:

  • An isolated workspace, where each data scientist gets their own dedicated instance of a Jupyter Notebook server, with guaranteed CPU and memory, spawned on demand and run on a shared Kubernetes platform.
  • Efficient usage of resources, including the culling of unused Jupyter instances: idle servers are shut down (after an hour by default) and any additions or changes made to running notebooks are autosaved. Users can log back in and be reconnected to their Jupyter Notebook instance with all of their previous work, thanks to the persistent volumes allocated to each Jupyter pod. (A sketch of the culling settings follows this list.)
  • Data persistence, with each data scientist's work saved to a dynamically created persistent volume, so they can either log out and back in to JupyterHub and resume where they left off or have their work survive a pod failure.
  • A way to provide data scientists with a lower-cost alternative to more expensive compute options for certain tasks, such as data cleanup, exploration, and model development.
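Culling is controlled by the Zero to JupyterHub chart's cull.enabled and cull.timeout values (the timeout is in seconds; 3600, or one hour, is the default). A minimal sketch, again assuming the hypothetical release name jhub:

❯ helm upgrade jhub jupyterhub/jupyterhub -n jupyter --reuse-values \
    --set cull.enabled=true --set cull.timeout=3600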

Hopefully you now better understand how a JupyterHub implementation running on Kubernetes can provide a scalable, simple, and powerful platform for data science teams to work with. If you are new to Jupyter Notebooks in general, I would recommend experimenting with the sample notebooks used in this post or walking through a Jupyter Notebook tutorial that goes into more detail on developing in Jupyter Notebooks with pandas, numpy, or scikit-learn.
