1. Introduction
Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
What does a Kubeflow deployment look like?
A Kubeflow deployment is:
- Portable - Works on any Kubernetes cluster, whether it lives on Google Cloud Platform (GCP), on premises, or across providers.
- Scalable - Can utilize fluctuating resources and is constrained only by the resources allocated to the Kubernetes cluster.
- Composable - Lets you pick the ML components you need (notebooks, pipelines, training, serving) and combine them into a workflow, deploying only the pieces you use.
It is a means of organizing loosely-coupled microservices as a single unit and deploying them to a variety of locations, whether that's a laptop or the cloud.
This codelab will walk you through creating your own Kubeflow deployment using MiniKF, and running a Kubeflow Pipelines workflow from inside a Jupyter Notebook.
What you'll build
In this codelab, you will build a complex data science pipeline with Kubeflow Pipelines, without using any CLI commands or SDKs. You don't need to have any Kubernetes or Docker knowledge. Upon completion, your infrastructure will contain:
- A MiniKF (Mini Kubeflow) VM that automatically installs:
- Kubernetes (using Minikube)
- Kubeflow
- Kale, a tool that converts general-purpose Jupyter Notebooks to Kubeflow Pipelines workflows (see the Kale GitHub repository)
- Arrikto Rok for data versioning and reproducibility
What you'll learn
- How to install Kubeflow with MiniKF
- How to convert your Jupyter Notebooks to Kubeflow Pipelines without using any CLI commands or SDKs
- How to run Kubeflow Pipelines from inside a Notebook with the click of a button
- How to automatically version your data in a Notebook and in every pipeline step
What you'll need
- An active GCP project for which you have Owner permissions
This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.
2. Set up the environment
Select your GCP project
To find your project ID, visit the GCP Console's Home panel, found in the hamburger menu at the top left. If the screen is empty, click on Yes at the prompt to create a dashboard.
If the project is not already selected, click Select a project and choose your project (you should have only one).
3. Install MiniKF
Create a Compute instance
In the GCP Marketplace, search for "MiniKF".
Select the MiniKF virtual machine by Arrikto.
Click the Launch on Compute Engine button and select your project.
In the Configure & Deploy window, choose a name for your MiniKF instance and leave the default options. Then click the Deploy button.
Wait for the MiniKF Compute instance to boot.
Log in to MiniKF
When the MiniKF VM is up, connect and log in by clicking the SSH button. Follow the on-screen instructions to run the command minikf, which starts the deployment of Minikube, Kubeflow, and Rok. This takes a few minutes to complete.
Log in to Kubeflow
Once installation is complete and all pods are ready, visit the MiniKF dashboard. Log in to Kubeflow using the MiniKF username and password.
The login screen looks slightly different depending on your browser (Chrome, Firefox, or Safari).
Log in to Rok
After logging in to Kubeflow, open the left menu by clicking on the hamburger icon. Navigate to the Snapshot Store and log in to Rok using the MiniKF username and password.
Congratulations! You have successfully deployed MiniKF on GCP! You can now create Notebooks, write your ML code, and run Kubeflow Pipelines. Use Rok for data versioning and reproducibility.
4. Run a Pipeline from inside your Notebook
In this section, you will run the Titanic example, based on the Kaggle competition that challenges you to predict which passengers survived the Titanic shipwreck.
Create a Notebook Server
Navigate to the Notebook Servers link on the Kubeflow central dashboard.
Click on New Server.
Specify a name for your Notebook Server.
Make sure you have selected this image:
gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop
Add a new, empty Data Volume of size 5GB and name it data.
Click Launch to create the notebook server.
When the notebook server is available, click Connect to connect to it.
Download the data and notebook
A new tab will open up with the JupyterLab landing page. Create a new Terminal in JupyterLab.
In the Terminal window, run these commands to navigate to the data folder and download the notebook and the data that you will use for the remainder of the lab.
cd data/
git clone -b kubecon-workshop https://github.com/kubeflow-kale/examples
This repository contains a series of curated examples with data and annotated Notebooks. Navigate to the folder data/examples/titanic-ml-dataset/ in the sidebar and open the notebook titanic_dataset_ml.ipynb.
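If you want a quick look at the data before stepping through the notebook, you can load it from a notebook cell. This is only an optional sketch: the file names below are an assumption based on the standard Kaggle Titanic layout, so check the cloned folder for the actual CSV names.
import pandas as pd

# Hypothetical file names - adjust to whatever CSVs the cloned example actually contains.
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print(train_df.shape, test_df.shape)
train_df.head()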
Explore the ML code of the Titanic challenge
Run the notebook step-by-step. Note that the code fails because a library is missing.
Go back to the Terminal and install the missing library.
pip3 install --user seaborn
Restart the notebook kernel by clicking on the Refresh icon.
Run the cell again with the correct libraries installed and watch it succeed.
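If you want to double-check that the newly installed package is visible to the restarted kernel, an optional one-cell sanity check is enough:
import seaborn as sns

# If the package installed correctly and the kernel was restarted,
# this prints a version string instead of raising ImportError.
print(sns.__version__)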
Convert your notebook to a Kubeflow Pipeline
Enable Kale by clicking on the Kubeflow icon in the left pane.
Explore per-cell dependencies. See how multiple cells can be part of a single pipeline step, and how a pipeline step may depend on previous steps.
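If you are curious where these step assignments live, Kale records its annotations in the notebook file itself; assuming they are stored as cell metadata tags (worth double-checking against the Kale documentation), a short snippet run from a Terminal in the notebook's folder lists whatever tags are present:
import json

# Print the metadata tags carried by each cell of the notebook.
# Only cells that actually have tags are shown.
with open('titanic_dataset_ml.ipynb') as f:
    nb = json.load(f)

for index, cell in enumerate(nb['cells']):
    tags = cell.get('metadata', {}).get('tags', [])
    if tags:
        print(index, tags)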
Click the Compile and Run button.
Watch the progress of the snapshot.
Watch the progress of the Pipeline Run.
Click the link to go to the Kubeflow Pipelines UI and view the run.
Wait for it to complete.
Congratulations! You just ran an end-to-end Kubeflow Pipeline starting from your notebook!
5. Reproducibility with Volume Snapshots
Examine the results
Have a look at the logs for the second-to-last pipeline step Results. Notice that all the predictors show a score of 100%. An experienced data scientist would immediately find this suspicious. This is a good indication that our models are not generalizing and are instead overfitting on the training data set. This is likely caused by an issue with the data consumed by the models.
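To see why a perfect score usually points to label leakage rather than a great model, here is a minimal self-contained sketch using synthetic data and plain scikit-learn (not the lab's notebook). It scores a random forest twice: once with the label accidentally included as a feature, and once without it.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data: two noisy features and a target derived from one of them.
rng = np.random.RandomState(0)
X = pd.DataFrame({'feature_a': rng.randn(500), 'feature_b': rng.randn(500)})
y = (X['feature_a'] + rng.randn(500) > 0).astype(int)

# Leaky version: the target sneaks into the feature matrix.
X_leaky = X.copy()
X_leaky['label'] = y

model = RandomForestClassifier(n_estimators=100, random_state=0)
print('with leaked label:', cross_val_score(model, X_leaky, y, cv=5).mean())  # close to 1.0 - suspiciously perfect
print('without the label:', cross_val_score(model, X, y, cv=5).mean())        # a realistic, lower score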
Reproduce prior state
Fortunately, Rok takes care of data versioning and reproduces the whole environment as it was at the time you clicked the Compile and Run button. This way, you have a time machine for your data and code. Let's resume the state of the pipeline as it was before training one of the models and see what is going on. Take a look at the randomforest step, then click on Artifacts.
Follow the steps in the Markdown, i.e. view the snapshot in the Rok UI by clicking on the corresponding link.
Copy the Rok URL.
Navigate to the Notebook Servers link.
Click on New Server.
Paste the Rok URL you copied previously and click the Autofill button.
Specify a name for your notebook.
Make sure you have selected this image:
gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop
Click Launch to create the notebook server.
When the notebook server is available, click Connect to connect to it.
Note that the notebook opens at the exact cell of the pipeline step you spawned it from.
In the background, Kale has resumed the Notebook's state by importing all the libraries and loading the variables from the previous steps.
Debug prior state
Add a print command to this cell:
print(acc_random_forest)
Run the active cell by pressing Shift + Return to retrain the random forest and print the score. It is 100.
Now it's time to see if there is something strange in the training data. To explore and fix this issue, add a cell above the Random Forest markdown by selecting the previous cell and clicking the plus icon (+).
Add the following text and execute the cell to print the training set.
train_df
Oops! The column with the training labels ("Survived") has mistakenly been included as an input feature! The model has learned to focus on the "Survived" column and ignore the rest, polluting the input. This column exactly matches what the model is asked to predict and is not present at prediction time, so it needs to be removed from the training dataset so that the model learns from the other features.
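A quick way to confirm the leak from the same cell is to check whether the label column is still part of the training features:
# The label column should not be among the input features; the fact that it is, is the bug.
print('Survived' in train_df.columns)
print(train_df.columns.tolist())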
Add a bugfix
To remove this column, edit the cell to add this command:
train_df.drop('Survived', axis=1, inplace=True)
train_df
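As an optional sanity check before re-running the pipeline, you can assert that the label is gone:
# Raises an AssertionError if the label column is still among the features.
assert 'Survived' not in train_df.columns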
Enable Kale and ensure that the cell that removes the Survived labels is part of the featureengineering pipeline step (it should have the same outline color).
Run the pipeline again by clicking on the Compile and Run button.
Click the link to go to the Kubeflow Pipelines UI and view the run.
Wait for the results step to complete and view the logs to see the final results. You now have realistic prediction scores!
6. Clean up
Destroy the MiniKF VM
Navigate to Deployment Manager in the GCP Console and delete the minikf-1 deployment.
7. Congratulations
Congratulations, you have successfully run an end-to-end data science workflow using Kubeflow (MiniKF), Kale, and Rok!
What's next?
Join the Kubeflow Community:
- github.com/kubeflow
- Kubeflow Slack
- kubeflow-discuss@googlegroups.com
- Community call on Tuesdays