1. Introduction
Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
What does a Kubeflow deployment look like?
A Kubeflow deployment is:
- Portable - Works on any Kubernetes cluster, whether it lives on Google Cloud Platform (GCP), on premises, or across providers.
- Scalable - Can utilize fluctuating resources and is constrained only by the resources allocated to the Kubernetes cluster.
- Composable - Lets you pick the ML components you need (notebooks, pipelines, training, serving) and combine them into a workflow, deploying only the pieces you use.
It is a means of organizing loosely-coupled microservices as a single unit and deploying them to a variety of locations, whether that's a laptop or the cloud.
This codelab will walk you through creating your own Kubeflow deployment using MiniKF, and running a Kubeflow Pipelines workflow from inside a Jupyter Notebook.
What you'll build
In this codelab, you will build a complex data science pipeline with Kubeflow Pipelines, without using any CLI commands or SDKs. You don't need to have any Kubernetes or Docker knowledge. Upon completion, your infrastructure will contain:
- A MiniKF (Mini Kubeflow) VM that automatically installs:
- Kubernetes (using Minikube)
- Kubeflow
- Kale, a tool that converts general-purpose Jupyter Notebooks to Kubeflow Pipelines workflows (see the Kale GitHub repository)
- Arrikto Rok for data versioning and reproducibility
What you'll learn
- How to install Kubeflow with MiniKF
- How to convert your Jupyter Notebooks to Kubeflow Pipelines without using any CLI commands or SDKs
- How to run Kubeflow Pipelines from inside a Notebook with the click of a button
- How to automatically version your data in a Notebook and in every pipeline step
What you'll need
- An active GCP project for which you have Owner permissions
This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.
2. Set up the environment
Select your GCP project
To find your project ID, visit the GCP Console's Home panel, found in the hamburger menu at the top left. If the screen is empty, click on Yes at the prompt to create a dashboard.
If the project is not already selected, click Select a project and choose your project (you should have only one).
3. Install MiniKF
Create a Compute instance
In the GCP Marketplace, search for "MiniKF".
Select the MiniKF virtual machine by Arrikto.
Click the Launch on Compute Engine button and select your project.
In the Configure & Deploy window, choose a name for your MiniKF instance and leave the default options. Then click the Deploy button.
Wait for the MiniKF Compute instance to boot.
Log in to MiniKF
When the MiniKF VM is up, connect and log in by clicking the SSH button. Follow the on-screen instructions to run the command minikf, which starts the deployment of Minikube, Kubeflow, and Rok. This takes a few minutes to complete.
Log in to Kubeflow
Once installation is complete and all pods are ready, visit the MiniKF dashboard. Log in to Kubeflow using the MiniKF username and password.
The login screen looks slightly different depending on your browser (Chrome, Firefox, or Safari).
Log in to Rok
After logging in to Kubeflow, open the left menu by clicking on the hamburger icon. Navigate to the Snapshot Store and log in to Rok using the MiniKF username and password.
Congratulations! You have successfully deployed MiniKF on GCP! You can now create Notebooks, write your ML code, and run Kubeflow Pipelines. Use Rok for data versioning and reproducibility.
4. Run a Pipeline from inside your Notebook
In this section, you will run the Titanic example, based on the Kaggle competition that challenges you to predict which passengers survived the Titanic shipwreck.
Create a Notebook Server
Navigate to the Notebook Servers link on the Kubeflow central dashboard.
Click on New Server.
Specify a name for your Notebook Server.
Make sure you have selected this image:
gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop
Add a new, empty Data Volume of size 5GB and name it data.
Click Launch to create the notebook server.
When the notebook server is available, click Connect to connect to it.
Download the data and notebook
A new tab will open up with the JupyterLab landing page. Create a new Terminal in JupyterLab.
In the Terminal window, run these commands to navigate to the data folder and download the notebook and the data that you will use for the remainder of the lab.
cd data/
git clone -b kubecon-workshop https://github.com/kubeflow-kale/examples
This repository contains a series of curated examples with data and annotated Notebooks. Navigate to the folder data/examples/titanic-ml-dataset/ in the sidebar and open the notebook titanic_dataset_ml.ipynb.
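If you want a quick look at the data before stepping through the notebook, you can load it from a notebook cell. This is only an optional sketch: the file names below are an assumption based on the standard Kaggle Titanic layout, so check the cloned folder for the actual CSV names.
import pandas as pd

# Hypothetical file names - adjust to whatever CSVs the cloned example actually contains.
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print(train_df.shape, test_df.shape)
train_df.head()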
Explore the ML code of the Titanic challenge
Run the notebook step-by-step. Note that the code fails because a library is missing.
Go back to the Terminal and install the missing library.
pip3 install --user seaborn
Restart the notebook kernel by clicking on the Refresh icon.
Run the cell again with the correct libraries installed and watch it succeed.
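If you want to double-check that the newly installed package is visible to the restarted kernel, an optional one-cell sanity check is enough:
import seaborn as sns

# If the package installed correctly and the kernel was restarted,
# this prints a version string instead of raising ImportError.
print(sns.__version__)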
Convert your notebook to a Kubeflow Pipeline
Enable Kale by clicking on the Kubeflow icon in the left pane.
Explore per-cell dependencies. See how multiple cells can be part of a single pipeline step, and how a pipeline step may depend on previous steps.
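If you are curious where these step assignments live, Kale records its annotations in the notebook file itself; assuming they are stored as cell metadata tags (worth double-checking against the Kale documentation), a short snippet run from a Terminal in the notebook's folder lists whatever tags are present:
import json

# Print the metadata tags carried by each cell of the notebook.
# Only cells that actually have tags are shown.
with open('titanic_dataset_ml.ipynb') as f:
    nb = json.load(f)

for index, cell in enumerate(nb['cells']):
    tags = cell.get('metadata', {}).get('tags', [])
    if tags:
        print(index, tags)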
Click the Compile and Run button.
Watch the progress of the snapshot.
Watch the progress of the Pipeline Run.
Click the link to go to the Kubeflow Pipelines UI and view the run.
Wait for it to complete.
Congratulations! You just ran an end-to-end Kubeflow Pipeline starting from your notebook!
5. Reproducibility with Volume Snapshots
Examine the results
Have a look at the logs for the second-to-last pipeline step Results. Notice that all the predictors show a score of 100%. An experienced data scientist would immediately find this suspicious. This is a good indication that our models are not generalizing and are instead overfitting on the training data set. This is likely caused by an issue with the data consumed by the models.
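To see why a perfect score usually points to label leakage rather than a great model, here is a minimal self-contained sketch using synthetic data and plain scikit-learn (not the lab's notebook). It scores a random forest twice: once with the label accidentally included as a feature, and once without it.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data: two noisy features and a target derived from one of them.
rng = np.random.RandomState(0)
X = pd.DataFrame({'feature_a': rng.randn(500), 'feature_b': rng.randn(500)})
y = (X['feature_a'] + rng.randn(500) > 0).astype(int)

# Leaky version: the target sneaks into the feature matrix.
X_leaky = X.copy()
X_leaky['label'] = y

model = RandomForestClassifier(n_estimators=100, random_state=0)
print('with leaked label:', cross_val_score(model, X_leaky, y, cv=5).mean())  # close to 1.0 - suspiciously perfect
print('without the label:', cross_val_score(model, X, y, cv=5).mean())        # a realistic, lower score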
Reproduce prior state
Fortunately, Rok takes care of data versioning and reproduces the whole environment as it was at the time you clicked the Compile and Run button. This way, you have a time machine for your data and code. Let's resume the state of the pipeline as it was before training one of the models and see what is going on. Take a look at the randomforest step, then click on Artifacts.
Follow the steps in the Markdown, i.e. view the snapshot in the Rok UI by clicking on the corresponding link.
Copy the Rok URL.
Navigate to the Notebook Servers link.
Click on New Server.
Paste the Rok URL you copied previously and click the Autofill button.
Specify a name for your notebook.
Make sure you have selected this image:
gcr.io/arrikto-public/tensorflow-1.14.0-notebook-cpu:kubecon-workshop
Click Launch to create the notebook server.
When the notebook server is available, click Connect to connect to it.
Note that the notebook opens at the exact cell of the pipeline step you spawned it from.
In the background, Kale has resumed the Notebook's state by importing all the libraries and loading the variables from the previous steps.
Debug prior state
Add a print command to this cell:
print(acc_random_forest)
Run the active cell by pressing Shift + Return to retrain the random forest and print the score. It is 100.
Now it's time to see if there is something strange in the training data. To explore and fix this issue, add a cell above the Random Forest markdown by selecting the previous cell and clicking the plus icon (+).
Add the following text and execute the cell to print the training set.
train_df
Oops! The column with the training labels ("Survived") has mistakenly been included as an input feature! The model has learned to focus on the "Survived" column and ignore the rest, polluting the input. This column exactly matches what the model is asked to predict and is not present at prediction time, so it needs to be removed from the training dataset so that the model learns from the other features.
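A quick way to confirm the leak from the same cell is to check whether the label column is still part of the training features:
# The label column should not be among the input features; the fact that it is, is the bug.
print('Survived' in train_df.columns)
print(train_df.columns.tolist())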
Add a bugfix
To remove this column, edit the cell to add this command:
train_df.drop('Survived', axis=1, inplace=True)
train_df
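As an optional sanity check before re-running the pipeline, you can assert that the label is gone:
# Raises an AssertionError if the label column is still among the features.
assert 'Survived' not in train_df.columns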
Enable Kale and ensure that the cell that removes the Survived labels is part of the featureengineering pipeline step (it should have the same outline color).
Run the pipeline again by clicking on the Compile and Run button.
Click the link to go to the Kubeflow Pipelines UI and view the run.
Wait for the results step to complete and view the logs to see the final results. You now have realistic prediction scores!
6. Clean up
Destroy the MiniKF VM
Navigate to Deployment Manager in the GCP Console and delete the minikf-1 deployment.
7. Congratulations
Congratulations, you have successfully run an end-to-end data science workflow using Kubeflow (MiniKF), Kale, and Rok!
What's next?
Join the Kubeflow Community:
- github.com/kubeflow
- Kubeflow Slack
- kubeflow-discuss@googlegroups.com
- Community call on Tuesdays