Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

What does a Kubeflow deployment look like?

A Kubeflow deployment is:

It is a means of organizing loosely-coupled microservices as a single unit and deploying them to a variety of locations, whether that's a laptop or the cloud. This codelab will walk you through creating your own Kubeflow deployment.

What you'll build

In this codelab, you will build a web app that summarizes GitHub issues using Kubeflow Pipelines to train and serve a model. It is based on the walkthrough provided in the Kubeflow Examples repo. Upon completion, your infrastructure will contain:

What you'll learn

The pipeline you will build trains a Tensor2Tensor model on GitHub issue data, learning to predict issue titles from issue bodies. It then exports the trained model and deploys the exported model using Tensorflow Serving. The final step in the pipeline launches a web app, which interacts with the TF-Serving instance in order to get model predictions.

What you'll need

This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.

Cloud Shell

Visit the GCP Console in the browser and log in with your project credentials:

Open the GCP Console

Then click the "Activate Cloud Shell" icon in the top right of the console to start up a Cloud Shell.

Set your GitHub token

This codelab calls the GitHub API to retrieve publicly available data. To prevent rate-limiting, especially at events where a large number of anonymized requests are sent to the GitHub APIs, set up an access token with no permissions. This is simply to authorize you as an individual rather than anonymous user.

  1. Navigate to https://github.com/settings/tokens and generate a new token with no scopes.
  2. Save it somewhere safe. If you lose it, you will need to delete and create a new one.
  3. Set the GITHUB_TOKEN environment variable:
export GITHUB_TOKEN=<token>

Set your GCP project ID and cluster name

To find your project ID, visit the GCP Console's Home panel. If the screen is empty, click on Yes at the prompt to create a dashboard.

In the Cloud Shell terminal, run these commands to set the cluster name and project ID. We'll indicate which zone to use at the workshop.

export DEPLOYMENT_NAME=kubeflow
export PROJECT_ID=<your_project_id>
export ZONE=<your-zone>
gcloud config set project ${PROJECT_ID}
gcloud config set compute/zone ${ZONE}

Create a storage bucket

Create a Cloud Storage bucket for storing pipeline files. Fill in a new, unique bucket name and issue the "mb" (make bucket) command:

export BUCKET_NAME=kubeflow-${PROJECT_ID}
gsutil mb gs://${BUCKET_NAME}

Alternatively, you can create a bucket via the GCP Console.

Install the Kubeflow Pipelines SDK

Run the following command to install the Kubeflow Pipelines SDK:

pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.7/kfp.tar.gz --upgrade

Pin useful dashboards

In the GCP console, pin the Kubernetes Engine and Storage dashboards for easier access.

Create a cluster

Create a managed Kubernetes cluster on Kubernetes Engine by visiting the Kubeflow Click-to-Deploy site in your browser and signing in with your GCP account:

Open Kubeflow Click-to-Deploy

Fill in the following values in the resulting form:

Set up kubectl to use your new cluster's credentials

When the cluster has been instantiated, connect your environment to the Kubernetes Engine cluster by running the following command in your Cloud Shell:

gcloud container clusters get-credentials ${DEPLOYMENT_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE}

This configures your kubectl context so that you can interact with your cluster. To verify the connection, run the following command:

kubectl get nodes -o wide

You should see two nodes listed, both with a status of "Ready", and other information about node age, version, external IP address, OS image, kernel version, and container runtime.

Add a GPU node pool to the cluster

Run the following command to create the accelerator node pool:

gcloud container node-pools create accel \
  --project ${PROJECT_ID} \
  --zone ${ZONE} \
  --cluster ${DEPLOYMENT_NAME} \
  --accelerator type=nvidia-tesla-k80,count=4 \
  --num-nodes 1 \
  --machine-type n1-highmem-8 \
  --disk-size=220GB \
  --scopes cloud-platform \
  --verbosity error

Install Nvidia drivers on these nodes by applying a daemonset:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml

View the Kubeflow central dashboard

Once the cluster setup is complete, port-forward to view the Kubeflow central dashboard. Click the "Cloud Shell" button in the launcher:

This will open a tab showing your new cluster's services details. Click the Port Forwarding button towards the bottom of the page, then click the Run in Cloud Shell button in the resulting popup window.

A Cloud Shell window will start up, with the command to port-forward pre-populated at the prompt. Hit ‘return' to actually run the command in the Cloud Shell, then click the Open in web preview button that will appear in the services page.

This will launch the Kubeflow dashboard in a new browser tab.

Pipeline description

The pipeline you will run has three steps:

  1. It starts by training a Tensor2Tensor model using preprocessed data. (More accurately, this step starts from an existing model checkpoint, then trains for a few more hundred steps-- it would take too long to fully train it). When it finishes, it exports the model in a form suitable for serving by TensorFlow serving.
  2. The next step in the pipeline deploys a TensorFlow-serving instance using that model.
  3. The last step launches a web app for interacting with the served model to retrieve predictions.

Download and compile the pipeline

To download the script containing the pipeline definition, execute this command:

curl -O https://raw.githubusercontent.com/amygdala/code-snippets/master/ml/kubeflow-pipelines/samples/kubeflow-tf/gh_summ.py

Compile the pipeline definition file by running it:

python3 gh_summ.py

You will see the file gh_summ.py.tar.gz appear as a result.

Upload the compiled pipeline

From the Kubeflow dashboard, click the Pipeline Dashboard link to navigate to the Kubeflow Pipelines web UI. Click on Upload pipeline, and select Import by URL. Copy, then paste in the following URL, which points to the same pipeline that you just compiled. (It's a few extra steps to upload a file from Cloud Shell, so we're taking a shortcut).

https://github.com/amygdala/code-snippets/raw/master/ml/kubeflow-pipelines/samples/kubeflow-tf/gh_summ.py.tar.gz

Give the pipeline a name (e.g. gh_summ).

Run the pipeline

Click on the uploaded pipeline in the list —this lets you view the pipeline's static graph— then click on Start an experiment to create a new Experiment using the pipeline.

Give the Experiment a name (e.g. the same name as the pipeline, gh_summ), then click Next to create it.

An Experiment is composed of multiple Runs. In Cloud Shell, execute these commands to gather the values to enter into the UI as parameters for the first Run:

gcloud config get-value project
echo ${GITHUB_TOKEN}
echo "gs://${BUCKET_NAME}/codelab"

Give the Run a name (e.g. gh_summ-1) and fill in three parameter fields:

After filling in the fields, click Create.

Once the pipeline run is launched, you can click on an individual step in the run to get more information about it, including viewing its pod logs.

View the pipeline definition

While the pipeline is running, take a closer look at how it is put together and what it is doing.

View TensorBoard

The first step in the pipeline performs training and generates a model. Once this step is complete, view Artifacts and click the blue Start TensorBoard button, then once it's ready, click Open Tensorboard.

View the web app and make some predictions

The last step in the pipeline deploys a web app, which provides a UI for querying the trained model - served via TF Serving - to make predictions. After the pipeline completes, connect to the web app by visiting the Kubeflow central dashboard page, and appending /webapp/ at the end of the URL. (The trailing slash is required).

You should see something like this:

Click the Populate Random Issue button to retrieve a block of text. Click on Generate TItle to call the trained model and display a prediction.

If you have trouble setting up a GPU node pool or running the training pipeline

If you have any trouble running the training pipeline, or if you had any issues setting up a GPU node pool, try this shorter pipeline. It uses an already-exported TensorFlow model, skips the training step, and takes only a minute or so to run. Download the Python pipeline definition here:

https://raw.githubusercontent.com/amygdala/code-snippets/master/ml/kubeflow-pipelines/samples/kubeflow-tf/gh_summ_serve.py

or the compiled version of the pipeline here:

https://github.com/amygdala/code-snippets/blob/master/ml/kubeflow-pipelines/samples/kubeflow-tf/gh_summ_serve.py.tar.gz?raw=true

Create a JupyterHub instance

You can also interactively define and run Kubeflow Pipelines from a Jupyter notebook. To create a notebook, navigate to the JupyterHub link on the central Kubeflow dashboard.

The first time you visit JupyterHub, you'll first be asked to log in. You can use any username/password you like (remember what it was).
Then you will be prompted to spawn an instance. Select the TensorFlow 1.12 CPU image from the pulldown menu as shown below. Then click the Spawn button, which generates a new pod in your cluster.

Download a notebook

Once JupyterHub becomes available, open a terminal.

In the Terminal window, run:

cd work
curl -O https://raw.githubusercontent.com/amygdala/code-snippets/master/ml/kubeflow-pipelines/samples/kubeflow-tf/pipelines-kubecon.ipynb

Return to the JupyterHub home screen, navigate to the work folder, and open the notebook you just downloaded.

Execute the notebook

In the Setup section, find the second command cell (starts with import kfp). Fill in your own values for the environment variables WORKING_DIR, PROJECT_NAME, and GITHUB_TOKEN, then execute the notebook one step at a time.

Follow the instructions in the notebook for the remainder of the lab.

Destroy the cluster

To remove all resources created by Click-to-Deploy, navigate to Deployment Manager in the GCP Console and delete the $DEPLOYMENT_NAME deployment.

Remove the GitHub token

Navigate to https://github.com/settings/tokens and remove the generated token.