From Notebook to Kubeflow Pipelines with HP Tuning: A Data Science Journey

Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

What does a Kubeflow deployment look like?

A Kubeflow deployment is:

  • Portable - Works on any Kubernetes cluster, whether it lives on Google Cloud Platform (GCP), on premises, or across providers.
  • Scalable - Can use fluctuating resources and is constrained only by the resources allocated to the Kubernetes cluster.
  • Composable - Enables you to configure independent steps into a full ML workflow, choosing from a curated set of ML frameworks and libraries.

Kubeflow gives you the ability to organize loosely-coupled microservices as a single unit and deploy them to a variety of locations, including on a laptop, on-premises, or in the cloud.

This codelab walks you through creating your own Kubeflow deployment using MiniKF, then running a Kubeflow Pipelines workflow with hyperparameter tuning to train and serve a model. You do all that from inside a Jupyter Notebook.

What you'll build

In this codelab, you will build a complex data science pipeline with hyperparameter tuning on Kubeflow Pipelines, without using any CLI commands or SDKs. You don't need to have any Kubernetes or Docker knowledge. Upon completion, your infrastructure will contain:

  • A MiniKF (Mini Kubeflow) VM that automatically installs:
      • Kubernetes (using Minikube)
      • Kubeflow
      • Kale, a tool to convert general-purpose Jupyter Notebooks to Kubeflow Pipelines workflows (GitHub)
      • Arrikto Rok for data versioning and reproducibility

What you'll learn

  • How to install Kubeflow with MiniKF
  • How to convert your Jupyter Notebooks to Kubeflow Pipelines without using any CLI commands or SDKs
  • How to run Kubeflow Pipelines with hyperparameter tuning from inside a notebook with the click of a button
  • How to automatically version your data in a notebook and in every pipeline step

What you'll need

  • An active GCP project for which you have Owner permissions

This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Concepts and code blocks that are not directly relevant are glossed over, and ready-made blocks are provided for you to simply copy and paste.

Set up your GCP project

Follow the steps below to create a GCP project or configure your existing GCP project. If you plan to use an existing GCP project, make sure that the project meets the minimum requirements described below. The first step is to open the resource manager in the GCP Console.

Open the GCP Resource Manager

Create a new project or select an existing project.

Check the following minimum requirements:

  • Make sure that you have the Owner role for the project.
  • Make sure that billing is enabled for your project.
  • If you are using the GCP Free Tier or the 12-month trial period with $300 credit, note that you can't run the default GCP installation of MiniKF, because the free tier does not offer enough resources. You need to upgrade to a paid account.

For more help with setting up a GCP project, see the GCP documentation.

After setting up your GCP project, go directly to the instructions for installing MiniKF.

Open your pre-allocated GCP project

To open your pre-allocated GCP project, click the button below to visit the GCP Console and open the Home panel, found in the hamburger menu at the top left. If the screen is empty, click on Yes at the prompt to create a dashboard.

Open the GCP Console

If the project is not already selected, click Select a project.

Select your project. You should only have one.

Create a Compute instance with MiniKF

In the GCP Marketplace, search for "MiniKF".

Open the GCP Marketplace

Select the MiniKF virtual machine by Arrikto.

Click the LAUNCH button and select your project.

In the Configure & Deploy window, choose a name and a zone for your MiniKF instance and leave the remaining options at their defaults. Then click the Deploy button.

Wait for the MiniKF Compute instance to boot up.

Log in to MiniKF

When the MiniKF VM is up, connect and log in by clicking on the SSH button. Follow the on-screen instructions to run the command minikf, which will start the deployment of Minikube, Kubeflow, and Rok. This will take a few minutes to complete.

Log in to Kubeflow

When the installation is complete and all pods are ready, visit the MiniKF dashboard and log in to Kubeflow using the MiniKF username and password.

Log in to Rok

After logging in to Kubeflow, open the left menu by clicking on the hamburger icon. Click on Snapshots and log in to Rok using the MiniKF username and password.

Congratulations! You have successfully deployed MiniKF on GCP. You can now create notebooks, write your ML code, run Kubeflow Pipelines, and use Rok for data versioning and reproducibility.

In this section, you will run the Dog Breed Identification example, a project from the Udacity AI Nanodegree. Given an image of a dog, the final model will estimate the dog's breed.

Create a notebook server in your Kubeflow cluster

Navigate to the Notebooks link on the Kubeflow central dashboard.

Click on New Server.

Specify a name for your notebook server.

Make sure you have selected the following Docker image (note that the image tag may differ):

gcr.io/arrikto/jupyter-kale:f20978e

Add a new, empty data volume of size 5GB and name it data.

Click on Launch to create the notebook server.

When the notebook server is available, click on Connect to connect to it.

Download the data and notebook

A new tab will open up with the JupyterLab landing page. Create a new terminal in JupyterLab.

In the terminal window, run these commands to navigate to the data folder and download the notebook and the data that you will use for the remainder of the lab:

cd data/
git clone https://github.com/kubeflow-kale/kale

The cloned repository contains a series of curated examples with data and annotated notebooks.

In the sidebar, navigate to the folder data/kale/examples/dog-breed-classification/ and open the notebook dog-breed.ipynb.

Explore the ML code of the Dog Breed Identification example

For the time being, don't run the cells that download the datasets, because you are going to use some smaller datasets included in the repository that you just cloned. If you are running this example at your own pace from home, feel free to download the datasets.

Run the imports cell to import all the necessary libraries. Note that the cell fails because a library is missing.

Normally, you would have to build a new Docker image that includes any newly installed libraries in order to run this notebook as a Kubeflow pipeline. Fortunately, Rok and Kale make sure that any libraries you install during development find their way into your pipeline: Rok snapshots the notebook's volumes, and Kale mounts those snapshotted volumes into the pipeline steps.
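
This means the fix is just a regular notebook cell, with no image rebuild involved. A minimal sketch of such a cell follows; the package name is illustrative, since the actual missing library is shown in the notebook itself:

# Install the missing dependency from inside the notebook; the package
# name below is illustrative. Rok's snapshots carry it into the pipeline.
!pip install --user tqdm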

Run the next cell to install the missing library.

Restart the notebook kernel by clicking on the Restart icon.

Run the imports cell again with the correct libraries installed and watch it succeed.

Convert your notebook to a pipeline in Kubeflow Pipelines

Open the Kale deployment panel by clicking on the Kubeflow icon in the left pane of the notebook.

Enable Kale by clicking on the slider in the Kale Deployment Panel.

Explore the per-cell dependencies within the notebook. Multiple notebook cells can be part of a single pipeline step, as indicated by the color bars on the left of the cells, and a pipeline step may depend on previous steps, as indicated by the depends on labels above the cells. For example, several cells marked with the same color belong to the same pipeline step, which in turn depends on an earlier step.
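
Under the hood, Kale records these step assignments as tags in each cell's notebook metadata. A minimal sketch of what such metadata looks like (the tag scheme follows Kale's documentation, but exact tag names can vary between Kale versions, and the step names here are illustrative):

# Hypothetical cell metadata, shown as a Python dict for readability.
cell_metadata = {
    "tags": [
        "block:cnn_resnet50",  # this cell belongs to the cnn_resnet50 pipeline step
        "prev:data_loading",   # ...and that step depends on the data_loading step
    ]
}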

Click on the Compile and Run button.

Now Kale takes over: it converts your notebook to a Kubeflow Pipelines pipeline. Because Kale integrates with Rok to take snapshots of the current notebook's data volume, you can watch the progress of the snapshot. Rok takes care of data versioning and of reproducing the whole environment as it was when you clicked the Compile and Run button. This way, you have a time machine for your data and code, and your pipeline runs in the same environment where you developed your code, without the need to build new Docker images.
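
Conceptually, what Kale uploads is an ordinary Kubeflow Pipelines definition. A greatly simplified sketch of such a generated pipeline, written with the KFP SDK (step names and commands are illustrative, and the real output also mounts the Rok-snapshotted volumes into every step):

import kfp.dsl as dsl

@dsl.pipeline(name="dog-breed", description="Pipeline generated from dog-breed.ipynb")
def dog_breed_pipeline():
    # Each step runs the code of its notebook cells inside the same image
    # you used for development.
    load = dsl.ContainerOp(
        name="data-loading",
        image="gcr.io/arrikto/jupyter-kale:f20978e",
        command=["python", "-c", "..."],  # placeholder for the step's cell code
    )
    train = dsl.ContainerOp(
        name="cnn-resnet50",
        image="gcr.io/arrikto/jupyter-kale:f20978e",
        command=["python", "-c", "..."],
    )
    train.after(load)  # the dependency Kale inferred from the cell tags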

The pipeline was compiled and uploaded to Kubeflow Pipelines. Now click the link to go to the Kubeflow Pipelines UI and view the run.

The Kubeflow Pipelines UI opens in a new tab. Wait for the run to finish.

Congratulations! You just ran an end-to-end pipeline in Kubeflow Pipelines, starting from your notebook!

Examine the results

Take a look at the logs of the cnn-from-scratch step. (Click on the step in the graph on the Kubeflow Pipelines UI, then click on the Logs tab.) This is the step where you trained a convolutional neural network (CNN) from scratch. Notice that the trained model has a very low accuracy and, on top of that, this step took a long time to complete.

Take a look at the logs of the cnn-vgg16 step. In this step, you used transfer learning on the pre-trained VGG-16 model, a neural network trained by the Visual Geometry Group (VGG). The accuracy is much higher than that of the previous model, but you can still do better.

Now, take a look at the logs of the cnn-resnet50 step. In this step, you used transfer learning on the pre-trained ResNet-50 model. The accuracy is much higher. This is therefore the model you will use for the rest of this codelab.
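
To see why the transfer-learning steps outperform the CNN trained from scratch, it helps to recall what transfer learning does: it reuses a convolutional base pre-trained on ImageNet and trains only a small classification head on top. A minimal Keras sketch of the idea (an illustration of the technique, not the notebook's exact code; the class count is an assumption):

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Sequential

num_breeds = 133  # number of dog-breed classes in the dataset (assumption)

base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional base

model = Sequential([
    base,
    GlobalAveragePooling2D(),                 # pool the base's feature maps
    Dense(num_breeds, activation="softmax"),  # train only this small head
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

Training only the small head on top of features that already encode general visual concepts converges faster and reaches higher accuracy than training every layer from scratch on a modest dataset.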

Hyperparameter tuning

Go back to the notebook server in your Kubeflow UI and open the notebook named dog-breed-katib.ipynb (at path data/kale/examples/dog-breed-classification/). In this notebook, you are going to run hyperparameter tuning experiments on the ResNet-50 model using Katib. Notice that the notebook has a cell near the beginning that declares the pipeline parameters.
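
Kale treats that cell specially: plain top-level assignments become pipeline parameters, which is what later allows Katib to vary them per trial. A minimal sketch of such a cell (the names and values here are illustrative, not the notebook's exact ones):

# Cell tagged as pipeline parameters: each assignment becomes a
# tunable pipeline parameter.
epochs = 1
learning_rate = 0.001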

In the left pane of the notebook, enable HP Tuning with Katib to run hyperparameter tuning.

Then click on Set up Katib Job to configure Katib.

Define the search space for each parameter, and define a goal.

Click on the Compile and Run Katib Job button.

Watch the progress of the Katib experiment.

Click on View to see the Katib experiment.

Click on Done to see the runs in Kubeflow Pipelines (KFP).

On the Katib experiment page, you will see the new trials.

And in the KFP UI, you will see the new runs.

Let's unpack what just happened. Previously, Kale produced a single pipeline run from your notebook; now it creates multiple pipeline runs, each fed with a different combination of arguments.

Katib is Kubeflow's component for running general-purpose hyperparameter tuning jobs. Katib does not know anything about the jobs that it is actually running (called trials in Katib jargon). All that Katib cares about is the search space, the optimization algorithm, and the goal. Katib supports running simple Jobs (that is, Pods) as trials, but Kale implements a shim so that the trials actually run pipelines in Kubeflow Pipelines, and then collects the metrics from the pipeline runs.
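
To make the search space, algorithm, and goal concrete, here is a sketch of the kind of specification a Katib experiment boils down to, written as a Python dict (the field names follow the Katib Experiment resource; the parameter names, values, and counts are illustrative):

# Illustrative Katib experiment spec; in practice Kale generates this
# for you from the choices made in its panel.
katib_experiment = {
    "parallelTrialCount": 2,    # trials (pipeline runs) launched at once
    "maxTrialCount": 12,        # total combinations to try
    "objective": {
        "type": "maximize",
        "goal": 0.95,
        "objectiveMetricName": "accuracy",  # collected from each pipeline run
    },
    "algorithm": {"algorithmName": "grid"},
    "parameters": [
        {"name": "epochs", "parameterType": "int",
         "feasibleSpace": {"min": "1", "max": "5"}},
        {"name": "learning_rate", "parameterType": "double",
         "feasibleSpace": {"min": "0.0001", "max": "0.01"}},
    ],
}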

As the Katib experiment is producing trials, you will see more trials in the Katib UI.

And more runs in the KFP UI.

When the Katib experiment is completed, you can view all the trials in the Katib UI.

And all the runs in the KFP UI.

If you go back to the notebook, you will see an info button right next to the Katib experiment inside the Kale panel.

Click on it and you will see the best result and the parameters that produced it.

Destroy the MiniKF VM

Navigate to Deployment Manager in the GCP Console and delete the minikf-on-gcp deployment.
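
If you prefer the command line for cleanup, the equivalent is the following command (assuming your deployment kept the minikf-on-gcp name):

gcloud deployment-manager deployments delete minikf-on-gcp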

Congratulations, you have successfully run an end-to-end data science workflow using Kubeflow (MiniKF), Kale, and Rok!

What's next?

Join the Kubeflow Community, and explore the Kubeflow documentation for further reading.