Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
What does a Kubeflow deployment look like?
A Kubeflow deployment is:
- Portable - Works on any Kubernetes cluster, whether it lives on Google Cloud Platform (GCP), on premises, or across providers.
- Scalable - Can use fluctuating resources and is constrained only by the amount of resources allocated to the Kubernetes cluster.
- Composable - Enables you to configure independent steps into a full ML workflow, choosing from a curated set of ML frameworks and libraries.
Kubeflow gives you the ability to organize loosely-coupled microservices as a single unit and deploy them to a variety of locations, including on a laptop, on-premises, or in the cloud.
This codelab walks you through creating your own Kubeflow deployment using MiniKF, then running a Kubeflow Pipelines workflow with hyperparameter tuning to train and serve a model. You do all that from inside a Jupyter Notebook.
What you'll build
In this codelab, you will build a complex data science pipeline with hyperparameter tuning on Kubeflow Pipelines, without using any CLI commands or SDKs. You don't need to have any Kubernetes or Docker knowledge. Upon completion, your infrastructure will contain:
- A MiniKF (Mini Kubeflow) VM that automatically installs:
  - Kubernetes (using Minikube)
  - Kale, a tool to convert general purpose Jupyter Notebooks to Kubeflow Pipelines workflows (GitHub)
  - Arrikto Rok for data versioning and reproducibility
What you'll learn
- How to install Kubeflow with MiniKF
- How to convert your Jupyter Notebooks to Kubeflow Pipelines without using any CLI commands or SDKs
- How to run Kubeflow Pipelines with hyperparameter tuning from inside a notebook with the click of a button
- How to automatically version your data in a notebook and in every pipeline step
What you'll need
- An active GCP project for which you have Owner permissions
This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Concepts that are not directly relevant are glossed over, and the corresponding code blocks are provided for you to copy and paste.
Set up your GCP project
Follow the steps below to create a GCP project or configure your existing GCP project. If you plan to use an existing GCP project, make sure that the project meets the minimum requirements described below. The first step is to open the resource manager in the GCP Console.
Create a new project or select an existing project:
Check the following minimum requirements:
- Make sure that you have the owner role for the project.
- Make sure that billing is enabled for your project.
- If you are using the GCP Free Tier or the 12-month trial period with $300 credit, note that you can't run the default GCP installation of MiniKF, because the free tier does not offer enough resources. You need to upgrade to a paid account.
For more help with setting up a GCP project, see the GCP documentation.
After setting up your GCP project, go directly to the instructions for installing MiniKF.
Open your pre-allocated GCP project
To open your pre-allocated GCP project, click the button below to visit the GCP Console and open the Home panel, found in the hamburger menu at the top left. If the screen is empty, click on Yes at the prompt to create a dashboard.
If the project is not already selected, click Select a project:
Select your project. You should only have one:
Create a Compute instance including MiniKF
In the GCP Marketplace, search for "MiniKF".
Select the MiniKF virtual machine by Arrikto:
Click the LAUNCH button and select your project:
In the Configure & Deploy window, choose a name and a zone for your MiniKF instance and leave the default options. Then click on the Deploy button:
Wait for the MiniKF Compute instance to boot up:
Log in to MiniKF
When the MiniKF VM is up, connect and log in by clicking on the SSH button. Follow the on-screen instructions to run the command minikf, which will start the deployment of Minikube, Kubeflow, and Rok. This will take a few minutes to complete.
Log in to Kubeflow
When installation is complete and all pods are ready, visit the MiniKF dashboard. Log in to Kubeflow using the MiniKF username and password:
Chrome users will see this screen:
Firefox users will see this screen:
Safari users will see this screen:
Log in to Rok
After logging in to Kubeflow, open the left menu by clicking on the hamburger icon. Click on Snapshots and log in to Rok using the MiniKF username and password.
Congratulations! You have successfully deployed MiniKF on GCP. You can now create notebooks, write your ML code, run Kubeflow Pipelines, and use Rok for data versioning and reproducibility.
In this section, you will run the Dog Breed Identification example, a project in the Udacity AI Nanodegree. Given an image of a dog, the final model will provide an estimate of the dog's breed.
Create a notebook server in your Kubeflow cluster
Navigate to the Notebooks link on the Kubeflow central dashboard.
Click on New Server.
Specify a name for your notebook server.
Make sure you have selected the following Docker image (Note that the image tag may differ):
Add a new, empty data volume of size 5 GB and name it data.
Click on Launch to create the notebook server.
When the notebook server is available, click on Connect to connect to it.
Download the data and notebook
A new tab will open up with the JupyterLab landing page. Create a new terminal in JupyterLab.
In the terminal window, run these commands to navigate to the data folder and download the notebook and the data that you will use for the remainder of the lab:
cd data/
git clone https://github.com/kubeflow-kale/kale
The cloned repository contains a series of curated examples with data and annotated notebooks.
In the sidebar, navigate to the folder data/kale/examples/dog-breed-classification/ and open the notebook
Explore the ML code of the Dog Breed Identification example
For the time being, don't run the cells that download the datasets, because you are going to use some smaller datasets included in the repository that you just cloned. If you are running this example at your own pace from home, feel free to download the datasets.
Run the imports cell to import all the necessary libraries. Note that the code fails because a library is missing:
Normally, you would need to build a new Docker image that includes the newly installed libraries before you could run this notebook as a Kubeflow pipeline. Fortunately, Rok and Kale make sure that any libraries you install during development find their way into your pipeline: Rok snapshots the notebook's volumes, and Kale mounts those snapshotted volumes into the pipeline steps.
Run the next cell to install the missing library:
Restart the notebook kernel by clicking on the Restart icon:
Run the imports cell again with the correct libraries installed and watch it succeed.
Convert your notebook to a pipeline in Kubeflow Pipelines
Open the Kale Deployment Panel by clicking on the Kubeflow icon in the left pane of the notebook:
Enable Kale by clicking on the slider in the Kale Deployment Panel:
Explore the per-cell dependencies within the notebook. See how multiple notebook cells can be part of a single pipeline step, as indicated by color bars on the left of the cells, and how a pipeline step may depend on previous steps, as indicated by depends on labels above the cells. For example, the image below shows multiple cells that are part of the same pipeline step. They have the same red color and they depend on a previous pipeline step.
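The grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Kale's implementation: the cell list, the step names, and the helper functions are hypothetical, and the cell sources are treated as opaque strings.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical notebook cells: (step name, steps it depends on, cell source).
# The names and sources are illustrative, not taken from the example notebook.
cells = [
    ("load_data", [], "data = load()"),
    ("preprocess", ["load_data"], "clean = scrub(data)"),
    ("preprocess", ["load_data"], "features = extract(clean)"),
    ("train", ["preprocess"], "model = fit(features)"),
]


def group_cells_into_steps(cells):
    """Merge cells that share a step name and collect each step's dependencies."""
    sources = {}
    depends_on = {}
    for step, prev_steps, source in cells:
        sources.setdefault(step, []).append(source)
        depends_on.setdefault(step, set()).update(prev_steps)
    return sources, depends_on


def execution_order(depends_on):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


sources, depends_on = group_cells_into_steps(cells)
order = execution_order(depends_on)
```

Here the two preprocess cells end up in one step, and that step is scheduled only after load_data, which mirrors the color bars and depends on labels in the notebook.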
Click on the Compile and Run button:
Kale now takes over and builds your notebook by converting it to a Kubeflow Pipelines pipeline. Because Kale integrates with Rok to take snapshots of the current notebook's data volume, you can also watch the progress of the snapshot. Rok takes care of data versioning and of reproducing the whole environment as it was when you clicked the Compile and Run button. This way, you have a time machine for your data and code, and your pipeline runs in the same environment in which you developed your code, without needing to build new Docker images.
The pipeline was compiled and uploaded to Kubeflow Pipelines. Now click the link to go to the Kubeflow Pipelines UI and view the run.
The Kubeflow Pipelines UI opens in a new tab. Wait for the run to finish.
Congratulations! You just ran an end-to-end pipeline in Kubeflow Pipelines, starting from your notebook!
Examine the results
Take a look at the logs of the cnn-from-scratch step. (Click on the step in the graph on the Kubeflow Pipelines UI, then click on the Logs tab.) This is the step where you trained a convolutional neural network (CNN) from scratch. Notice that the trained model has a very low accuracy and, on top of that, this step took a long time to complete.
Take a look at the logs of the cnn-vgg16 step. In this step, you used transfer learning on the pre-trained VGG-16 model, a neural network trained by the Visual Geometry Group (VGG). The accuracy is much higher than that of the previous model, but we can still do better.
Now, take a look at the logs of the cnn-resnet50 step. In this step, you used transfer learning on the pre-trained ResNet-50 model. The accuracy is much higher. This is therefore the model you will use for the rest of this codelab.
Go back to the notebook server in your Kubeflow UI, and open the notebook named dog-breed-katib.ipynb (at path data/kale/examples/dog-breed-classification/). In this notebook, you are going to run some hyperparameter tuning experiments on the ResNet-50 model, using Katib. Notice that you have one cell at the beginning of the notebook to declare parameters:
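As a rough sketch of what a parameters cell can look like and how a tool could read it, the snippet below assumes that parameters are declared as simple literal assignments. The parameter names and values (LR, BATCH_SIZE, OPTIMIZER) and the read_parameters helper are hypothetical, not taken from the notebook or from Kale.

```python
import ast

# Hypothetical contents of a parameters cell; names and values are illustrative.
PARAMETERS_CELL = """
LR = 0.001
BATCH_SIZE = 32
OPTIMIZER = "adam"
"""


def read_parameters(cell_source):
    """Return {name: value} for every simple literal assignment in the cell."""
    params = {}
    for node in ast.parse(cell_source).body:
        # Only accept statements of the form NAME = <literal>.
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        ):
            # literal_eval raises ValueError if the right-hand side is not a literal.
            params[node.targets[0].id] = ast.literal_eval(node.value)
    return params


params = read_parameters(PARAMETERS_CELL)
```

Keeping the parameters as plain literals in a single cell is what makes it possible for a tuning tool to substitute different values for each run.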
In the left pane of the notebook, enable HP Tuning with Katib to run hyperparameter tuning:
Then click on Set up Katib Job to configure Katib:
Define the search space for each parameter, and define a goal:
Click on the Compile and Run Katib Job button:
Watch the progress of the Katib experiment:
Click on View to see the Katib experiment:
Click on Done to see the runs in Kubeflow Pipelines (KFP):
In the Katib experiment page you will see the new trials:
And in the KFP UI you will see the new runs:
Let's unpack what just happened. Previously, Kale produced a pipeline run from a notebook and now it is creating multiple pipeline runs, where each one is fed with a different combination of arguments.
Katib is Kubeflow's component for running general purpose hyperparameter tuning jobs. Katib does not know anything about the jobs that it is actually running (called trials in Katib jargon). All that Katib cares about is the search space, the optimization algorithm, and the goal. Katib supports running simple Jobs (that is, Pods) as trials, but Kale implements a shim to have the trials actually run pipelines in Kubeflow Pipelines, and then collects the metrics from the pipeline runs.
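To make the three ingredients concrete, here is a minimal sketch of a random search loop in plain Python. It is not Katib's API or implementation: the search space format, the function names, and the toy objective are assumptions made for illustration, and the objective stands in for a full pipeline run that reports a metric.

```python
import random

# Hypothetical search space: one numeric range and one list of choices.
SEARCH_SPACE = {
    "lr": ("double", 0.0001, 0.1),
    "optimizer": ("categorical", ["adam", "sgd", "rmsprop"]),
}


def sample(space, rng):
    """Draw one combination of parameter values from the search space."""
    params = {}
    for name, spec in space.items():
        if spec[0] == "double":
            params[name] = rng.uniform(spec[1], spec[2])
        else:
            params[name] = rng.choice(spec[1])
    return params


def run_experiment(space, objective, goal, max_trials, seed=0):
    """Run trials until the metric reaches the goal or the trial budget is spent."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_trials):
        params = sample(space, rng)
        metric = objective(params)  # the loop never inspects what the trial does
        trials.append({"params": params, "metric": metric})
        if metric >= goal:
            break
    return trials


def toy_objective(params):
    """Stand-in for a pipeline run; returns a made-up score, not a real accuracy."""
    bonus = 0.1 if params["optimizer"] == "adam" else 0.0
    return 0.8 - abs(params["lr"] - 0.01) + bonus


trials = run_experiment(SEARCH_SPACE, toy_objective, goal=0.95, max_trials=5)
```

The loop treats the objective as a black box, which is why the same experiment logic can drive a simple Job or, with a shim, a whole pipeline run.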
As the Katib experiment is producing trials, you will see more trials in the Katib UI:
And more runs in the KFP UI:
When the Katib experiment is completed, you can view all the trials in the Katib UI:
And all the runs in the KFP UI:
If you go back to the Notebook, you will see an info button right next to the Katib experiment inside the Kale panel:
Click on it and you will see the best result and the parameters that produced it:
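Conceptually, the best result is the trial whose metric is closest to the objective. The sketch below shows that selection over a list of trial records; the record layout, the parameter names, and the metric values are made up for illustration and are not results from this codelab.

```python
def best_trial(trials, maximize=True):
    """Return the trial record with the best metric value."""
    if not trials:
        raise ValueError("no trials to choose from")
    pick = max if maximize else min
    return pick(trials, key=lambda trial: trial["metric"])


# Hypothetical trial records; the values are illustrative only.
example_trials = [
    {"params": {"lr": 0.05, "optimizer": "sgd"}, "metric": 0.71},
    {"params": {"lr": 0.01, "optimizer": "adam"}, "metric": 0.84},
    {"params": {"lr": 0.002, "optimizer": "rmsprop"}, "metric": 0.79},
]

best = best_trial(example_trials)
```

Set maximize=False when the objective is a value to minimize, such as a loss.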
Destroy the MiniKF VM
Navigate to Deployment Manager in the GCP Console and delete the
Congratulations, you have successfully run an end-to-end data science workflow using Kubeflow (MiniKF), Kale, and Rok!
Join the Kubeflow Community:
- Kubeflow Slack
- Weekly community call, Slack, and other community details