From Notebook to Kubeflow Pipelines with MiniKF and Kale

1. Introduction

Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

What does a Kubeflow deployment look like?

A Kubeflow deployment is:

  • Portable - Works on any Kubernetes cluster, whether it lives on Google Cloud Platform (GCP), on premises, or across providers.
  • Scalable - Can utilize fluctuating resources and is constrained only by the resources allocated to the Kubernetes cluster.
  • Composable - Lets you configure each step of your ML workflow independently, drawing on a curated set of frameworks and libraries.

It is a means of organizing loosely-coupled microservices as a single unit and deploying them to a variety of locations, whether that's a laptop or the cloud.

This codelab will walk you through creating your own Kubeflow deployment using MiniKF, and running a Kubeflow Pipelines workflow from inside a Jupyter Notebook.
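
For context, authoring a Kubeflow Pipelines workflow by hand means writing Python against the KFP SDK. The sketch below shows roughly what that looks like, using the KFP 1.x-era API; the step names, images, and commands are illustrative placeholders, not the code Kale generates. Kale's job in this codelab is to produce the equivalent pipeline for you, straight from notebook cells.

import kfp
import kfp.dsl as dsl

@dsl.pipeline(name="titanic", description="Illustrative two-step pipeline.")
def titanic_pipeline():
    # Each step runs as a container; images and commands here are placeholders.
    load = dsl.ContainerOp(
        name="loaddata",
        image="python:3.7",
        command=["python", "-c", "print('load the dataset')"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="python:3.7",
        command=["python", "-c", "print('train a model')"],
    )
    train.after(load)  # run the training step after the data-loading step

# Compile to an archive you can upload through the Kubeflow Pipelines UI.
kfp.compiler.Compiler().compile(titanic_pipeline, "titanic.tar.gz")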

What you'll build

In this codelab, you will build a complex data science pipeline with Kubeflow Pipelines, without using any CLI commands or SDKs. You don't need to have any Kubernetes or Docker knowledge. Upon completion, your infrastructure will contain:

  • A MiniKF (Mini Kubeflow) VM that automatically installs:
      • Kubernetes (using Minikube)
      • Kubeflow
      • Kale, a tool to convert general-purpose Jupyter Notebooks to Kubeflow Pipelines workflows (GitHub)
      • Arrikto Rok for data versioning and reproducibility

What you'll learn

  • How to install Kubeflow with MiniKF
  • How to convert your Jupyter Notebooks to Kubeflow Pipelines without using any CLI commands or SDKs
  • How to run Kubeflow Pipelines from inside a Notebook with the click of a button
  • How to automatically version your data in a Notebook and in every pipeline step

What you'll need

  • An active GCP project for which you have Owner permissions

This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.

2. Set up the environment

Select your GCP project

To find your project ID, visit the GCP Console's Home panel, found in the hamburger menu at the top left. If the screen is empty, click on Yes at the prompt to create a dashboard.

Open the GCP Console

3fdc4329995406a0.png

If the project is not already selected, click on Select a project:

e8952c0b96067dea.png

Then select your project from the list:

fe25c1925487142.png

3. Install MiniKF

Create a Compute instance

In the GCP Marketplace, search for "MiniKF".

Select the MiniKF virtual machine by Arrikto.

d6b423c1911ea85a.png

Click the Launch on Compute Engine button and select your project.

b5eeba43053db4bd.png

In the Configure & Deploy window, choose a name for your MiniKF instance and leave the default options. Then click the Deploy button.

dc401e2bb5a884d9.png

Wait for the MiniKF Compute instance to boot.

5228086caadc44c6.png

Log in to MiniKF

When the MiniKF VM is up, connect and log in by clicking on the SSH button. Follow the on-screen instructions to run the command minikf, which starts the deployment of Minikube, Kubeflow, and Rok. This takes a few minutes to complete.

774e83c3e96cf7b3.png

Log in to Kubeflow

Once installation is complete and all pods are ready, visit the MiniKF dashboard. Log in to Kubeflow using the MiniKF username and password.

251b0bcdbf6d3c71.png

9d49d899bb0b5bd1.png

Chrome users will see this screen:

6258e0f09e46a6c2.png

Firefox users will see this screen:

8cff90ce2f0670bd.png

Safari users will see this screen:

1c6fd768d71c0a92.png

Log in to Rok

After logging in to Kubeflow, open the left menu by clicking on the hamburger icon. Navigate to the Snapshot Store and log in to Rok using the MiniKF username and password.

a683198ac4ba900d.png

80aad6ba5d298a7e.png

Congratulations! You have successfully deployed MiniKF on GCP! You can now create Notebooks, write your ML code, and run Kubeflow Pipelines. Use Rok for data versioning and reproducibility.

4. Run a Pipeline from inside your Notebook

In this section, you will run the Titanic example, a Kaggle competition that asks you to predict which passengers survived the Titanic shipwreck.

Create a Notebook Server

Navigate to the Notebook Servers link on the Kubeflow central dashboard.

4115cac8d8474d73.png

Click on New Server.

f9303c0a182e47f5.png

Specify a name for your Notebook Server.

a2343f30bc9522ab.png

Make sure you have selected this image:

gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop

Add a new, empty Data Volume of size 5 GB and name it data.

8544d9b05826b316.png

Click Launch to create the notebook server.

28c024bcc55cc70a.png

When the notebook server is available, click Connect to connect to it.

2f06041475f45d3.png

Download the data and notebook

A new tab will open up with the JupyterLab landing page. Create a new Terminal in JupyterLab.

2482011174f7bc75.png

In the Terminal window, run these commands to navigate to the data folder and download the notebook and the data that you will use for the remainder of the lab.

cd data/
git clone -b kubecon-workshop https://github.com/kubeflow-kale/examples

This repository contains a series of curated examples with data and annotated Notebooks. Navigate to the folder data/examples/titanic-ml-dataset/ in the sidebar and open the notebook titanic_dataset_ml.ipynb.

c85baf68b36c63b2.png

Explore the ML code of the Titanic challenge

Run the notebook step by step. Note that the code fails because a library is missing.

bf2451fd7407e334.png

Go back to the Terminal and install the missing library.

pip3 install --user seaborn

d90593b21425dd12.png

Restart the notebook kernel by clicking on the Refresh icon, so that the kernel picks up the newly installed library.

a21f5f563b36ce4d.png

Run the cell again with the correct libraries installed and watch it succeed.

Convert your notebook to a Kubeflow Pipeline

Enable Kale by clicking on the Kubeflow icon in the left pane.

3f4f9c93b187b105.png

Explore per-cell dependencies. See how multiple cells can be part of a single pipeline step, and how a pipeline step may depend on previous steps.

15cca32444c1f12e.png
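
These step assignments live in each cell's metadata as Kale tags. The snippet below is only an illustration of the idea: it assumes the block:/prev: tag format Kale used around this workshop (the exact format may differ in other Kale versions), and the datapreprocessing step name is hypothetical.

# Illustrative cell metadata; the exact Kale tag format may vary by version.
cell_metadata = {
    "tags": [
        "block:featureengineering",  # this cell belongs to the featureengineering step
        "prev:datapreprocessing",    # that step must run after datapreprocessing
    ]
}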

Click the Compile and Run button.

bde5cef34f00e258.png

Watch the progress of the snapshot.

9408f46abb2493f5.png

Watch the progress of the Pipeline Run.

9edbde68032f5e4b.png

Click the link to go to the Kubeflow Pipelines UI and view the run.

a81646a22584e1b9.png

Wait for it to complete.

44bee7dc0d24ec21.png

d377b6d574a4970.png

Congratulations! You just ran an end-to-end Kubeflow Pipeline starting from your notebook!

5. Reproducibility with Volume Snapshots

Examine the results

Have a look at the logs of the results step, the second-to-last step of the pipeline. Notice that all the predictors show a score of 100%. An experienced data scientist would immediately find this suspicious: it is a good indication that the models are not generalizing, but are instead overfitting on the training data set. This is most likely caused by an issue with the data consumed by the models.

2a594032c2dd6ff6.png

Reproduce prior state

Fortunately, Rok takes care of data versioning and can reproduce the whole environment exactly as it was at the time you clicked the Compile and Run button. This way, you have a time machine for your data and code. Let's resume the state of the pipeline before training one of the models and see what is going on. Take a look at the randomforest step, then click on Artifacts.

4f25ca4560711b23.png

Follow the steps in the Markdown artifact, that is, view the snapshot in the Rok UI by clicking on the corresponding link.

e533bc781da9355a.png

Copy the Rok URL.

d155d19731b5cedd.png

Navigate to the Notebook Servers link.

aafeab01f3ef0863.png

Click on New Server.

f2265a64e8f9d094.png

Paste the Rok URL you copied previously and click the Autofill button.

9ba4d4128a3bdeea.png

Specify a name for your notebook.

7685c3bf35fc74b2.png

Make sure you have selected this image:

gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop

Click Launch to create the notebook server.

28c024bcc55cc70a.png

When the notebook server is available, click Connect to connect to it.

34955a64ae316de1.png

Note that the notebook opens at the exact cell of the pipeline step from which you spawned it.

a1f7c81f349e0364.png

In the background, Kale has resumed the Notebook's state by importing all the libraries and loading the variables from the previous steps.
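
Conceptually, this resembles serializing every variable a step produces and deserializing it before dependent code runs. The sketch below illustrates the idea with dill and a hypothetical /tmp path; it is not Kale's actual marshalling code.

import dill
import pandas as pd

train_df = pd.DataFrame({"Age": [22, 38], "Fare": [7.25, 71.28]})  # stand-in data

# End of one pipeline step: persist the variables that later steps need.
with open("/tmp/train_df.dill", "wb") as f:
    dill.dump(train_df, f)

# Start of a dependent step (or a resumed notebook): restore them first.
with open("/tmp/train_df.dill", "rb") as f:
    train_df = dill.load(f)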

Debug prior state

Add a print command to this cell:

print(acc_random_forest)

Run the active cell by pressing Shift + Return to retrain the random forest and print the score. It is 100.

e2a8a3b5465fcb5d.png

Now it's time to see if there is something strange in the training data. To explore and fix this issue, add a cell above the Random Forest markdown by selecting the previous cell and clicking the plus icon (+).

d1077f32dff9620f.png

Add the following line and run the cell to display the training set.

train_df

2854798ff01aed4e.png

Oops! The column with the training labels ("Survived") has mistakenly been included as an input feature! The model has learned to focus on the "Survived" column and ignore the rest, polluting the input. This column exactly matches the prediction target and is not present at prediction time, so it needs to be removed from the training dataset to force the model to learn from the other features.
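
To see why this breaks the evaluation, here is a minimal, self-contained illustration with synthetic data (not the Titanic set): once the target sneaks into the features, a random forest scores perfectly even on held-out rows, and the score collapses to chance as soon as the leak is dropped.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one uninformative feature and random labels.
rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=400)})
y = pd.Series(rng.integers(0, 2, size=400), name="Survived")

leaky = X.assign(Survived=y)  # the target mistakenly included as a feature
X_tr, X_te, y_tr, y_te = train_test_split(leaky, y, random_state=0)

clf = RandomForestClassifier(random_state=0)
print(clf.fit(X_tr, y_tr).score(X_te, y_te))  # 1.0 -- the model just reads the answer

X_tr_clean = X_tr.drop("Survived", axis=1)
X_te_clean = X_te.drop("Survived", axis=1)
print(clf.fit(X_tr_clean, y_tr).score(X_te_clean, y_te))  # ~0.5 once the leak is gone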

Add a bugfix

To remove this column, edit the cell to add this command:

train_df.drop('Survived', axis=1, inplace=True)
train_df

9e76c16a862b566.png

Enable Kale and ensure that the cell that removes the Survived labels is part of the featureengineering pipeline step (it should have the same outline color).

Run the pipeline again by clicking on the Compile and Run button.

Click the link to go to the Kubeflow Pipelines UI and view the run.

Wait for the results step to complete and view the logs to see the final results. You now have realistic prediction scores!

8c6a9676b49e5be8.png

6. Clean up

Destroy the MiniKF VM

Navigate to Deployment Manager in the GCP Console and delete the minikf-1 deployment.

7. Congratulations

Congratulations, you have successfully run an end-to-end data science workflow using Kubeflow (MiniKF), Kale, and Rok!
