As datasets continue to expand and models grow more complex, distributing machine learning (ML) workloads across multiple nodes is becoming more attractive. Unfortunately, breaking up and distributing a workload can add both computational overhead and a great deal more complexity to the system. Data scientists should be able to focus on ML problems, not DevOps.

Fortunately, distributed workloads are becoming easier to manage, thanks to Kubernetes. Kubernetes is a mature, production-ready platform that gives developers a simple API to deploy programs to a cluster of machines as if they were a single piece of hardware. Using Kubernetes, computational resources can be added or removed as desired, and the same cluster can be used to both train and serve ML models.

This codelab will serve as an introduction to Kubeflow, an open-source project which aims to make running ML workloads on Kubernetes simple, portable and scalable. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. It also extends the Kubernetes API by adding new Custom Resource Definitions (CRDs) to your cluster, so machine learning workloads can be treated as first-class citizens by Kubernetes.

What You'll Build

This codelab will describe how to train and serve a TensorFlow model, and then how to deploy a web interface to allow users to interact with the model over the public internet. You will build a classic handwritten digit recognizer using the MNIST dataset.

The purpose of this codelab is to give a brief overview of how to interact with Kubeflow. To keep things simple, the model we'll deploy will use CPU-only training, and only make use of a single node for training. Kubeflow's documentation has more information when you are ready to explore further.

What You'll Learn

What You'll Need

Downloading the Project Files

The first step is to download a copy of the Kubeflow examples repository, which hosts the code we will be deploying. This codelab can be completed on a local machine, or through Google Cloud Shell:

Download in Google Cloud Shell

Download locally

Enabling Boost Mode (Cloud Shell Only)

If you are running this codelab in Cloud Shell, you'll need to enable Boost Mode for ksonnet to run properly. It can be enabled through the settings dropdown.

Setting Environment Variables

Before we can start, we should set up a few environment variables that we will use throughout the codelab. The first is the project ID, which denotes which GCP project we will be using:

# available project ids can be listed with the following command:
# gcloud projects list

gcloud config set project $PROJECT_ID

We also need to provide the zone we want to use.


Next, we will set the Kubeflow deployment name. For this codelab, we will simply use "mnist-deployment".


We will be working out of the "mnist" directory of the repository, so change to the proper directory and set an environment variable:

cd ~/examples/mnist
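
The remaining steps assume these variables are set in your current shell session. As a minimal sketch, the following check reports any that are missing before you continue. The require_vars helper is hypothetical and not part of the codelab; the variable names passed to it should be the ones used by the commands in this codelab.

```shell
# Hypothetical helper (not part of the codelab): report every variable
# named in the arguments that is unset or empty, and fail if any is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running require_vars PROJECT_ID ZONE DEPLOYMENT_NAME WORKING_DIR prints one line per missing variable and returns a non-zero status if any of them is unset.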

Enabling the API

Before using Google Kubernetes Engine (GKE), you must enable the API for your project through the Google Cloud Platform Console.

Setting up a Kubeflow Cluster

The simplest way to deploy a Kubeflow-enabled cluster is through the Kubeflow Click to Deploy web interface. Simply enter your project ID, deployment name, and zone, and then press the "Create Deployment" button. All necessary resources will be provisioned automatically.

Open Kubeflow Click-to-Deploy

Fill in the following values in the resulting form:

The resources created here will be controlled by the GCP Deployment Manager. Here, you can see the current status of the deployment and manage everything in one place. It may take up to 10 minutes before the cluster is ready to use.

When the cluster is fully set up, you can connect your local kubectl session to it. The following command should output "kubeconfig entry generated" when run successfully, letting you know that your GCP credentials were added to your kubeconfig:

gcloud container clusters get-credentials \
    $DEPLOYMENT_NAME --zone $ZONE --project $PROJECT_ID
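
If you would rather script this check than read the output by eye, one possible approach is to wrap the command and look for the success message. This is a sketch only: connect_cluster is a hypothetical helper, and it assumes the message text matches what is described above.

```shell
# Hypothetical wrapper around the get-credentials command shown above.
# It succeeds only if the command exits cleanly and prints the expected message.
connect_cluster() {
  # gcloud may print status messages on stderr, so capture both streams.
  if ! output=$(gcloud container clusters get-credentials \
      "$DEPLOYMENT_NAME" --zone "$ZONE" --project "$PROJECT_ID" 2>&1); then
    echo "$output" >&2
    return 1
  fi
  case "$output" in
    *"kubeconfig entry generated"*) echo "connected to $DEPLOYMENT_NAME" ;;
    *) echo "unexpected output: $output" >&2; return 1 ;;
  esac
}
```

Calling connect_cluster after the deployment finishes prints a short confirmation, or echoes the gcloud output to stderr if something went wrong.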

You should now be able to interact with your cluster through the kubectl command. Switch to the kubeflow namespace to see the resources that were pre-installed on the Kubeflow cluster:

kubectl config set-context $(kubectl config current-context) --namespace=kubeflow

If you query the resources running on your cluster, you should see that Kubeflow has automatically provisioned everything you'll need to run training and serving jobs.

kubectl get all

For more information on the components that you see listed here, check out the official documentation.

Creating a ksonnet Project

Kubeflow makes use of ksonnet to help manage deployments. ksonnet is a templating engine that acts as another layer on top of kubectl. While Kubernetes is typically managed with static YAML files, ksonnet allows you to create parameters that can be swapped out for different environments, which is a useful feature for complex machine learning workloads.

If you don't have ksonnet's ks command installed, download it and add it to your path (if it's already installed on your system, you can skip this step):

# download ksonnet for linux (including Cloud Shell)
# for macOS, use ks_0.13.0_darwin_amd64

# download tar of ksonnet
wget --no-check-certificate \$KS_VER.tar.gz

# unpack file
tar -xvf $KS_VER.tar.gz

# add ks command to path
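
To confirm that ks and the other command-line tools used in this codelab are reachable from your shell, a small check like the following can help. The missing_tools helper is a hypothetical name introduced here for illustration; it relies only on the standard command -v lookup.

```shell
# Hypothetical helper: print the name of each tool that is not on the PATH.
# Prints nothing when every tool is found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running missing_tools ks kubectl gcloud gsutil docker lists only the tools that still need to be installed or added to your path.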

Ksonnet resources are managed in a single project directory, just like git. To create our ksonnet project directory, we will use ks init:

ks init $KS_NAME

If you look inside the new my_ksonnet_app project directory, you should see an app.yaml file, along with four directories. One directory is environments, which was automatically populated with information about how to attach to your Kubernetes cluster. You can list information about the default environment with the following command:

ks env list

Another folder within your ksonnet project is components, which holds a set of jsonnet files that represent Kubernetes resources that can be deployed to the cluster. For now it is mostly empty. For the purpose of the codelab, we will add some pre-written components to train and serve a TensorFlow model:

cp $WORKING_DIR/ks_app/components/* $WORKING_DIR/$KS_NAME/components

You will now have a number of ksonnet components that are ready to be customized and deployed. You can list them using the ks command:

ks component list

Now, add some Kubeflow resources to your local ksonnet project:

ks registry add kubeflow \${VERSION}/kubeflow
ks pkg install kubeflow/tf-serving@${VERSION}

The code for our TensorFlow project can be found in the examples repository. It defines a fairly straightforward TensorFlow training program, with no special modifications for Kubeflow. After training is complete, it will attempt to upload the trained model to a path we provide. For the purpose of this codelab, we will create and use a Google Cloud Storage (GCS) bucket to hold the trained model.

Setting up a Storage Bucket

Our next step is to create a storage bucket on Google Cloud Storage to hold our trained model. Note that the name you choose for your bucket must be unique across all of GCS.

# bucket name can be anything, but must be unique across all projects

# create the GCS bucket
gsutil mb gs://$BUCKET_NAME/
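
Besides being globally unique, a bucket name has to follow the GCS naming format. As a rough local sanity check before calling gsutil mb, the sketch below tests a few of the basic rules (3 to 63 characters; lowercase letters, digits, dots, underscores and dashes only; starting and ending with a letter or digit). The valid_bucket_name helper is hypothetical, and GCS enforces further rules that it does not cover.

```shell
# Hypothetical helper: succeed only if the name passes a few basic GCS rules.
# The body runs in a subshell, so "exit" only ends the function call.
valid_bucket_name() (
  LC_ALL=C   # make the a-z ranges below mean ASCII lowercase only
  name=$1
  # Between 3 and 63 characters long.
  [ "${#name}" -ge 3 ] || exit 1
  [ "${#name}" -le 63 ] || exit 1
  case "$name" in
    *[!a-z0-9._-]*) exit 1 ;;   # only lowercase letters, digits, dot, underscore, dash
    [!a-z0-9]*) exit 1 ;;       # must start with a letter or digit
    *[!a-z0-9]) exit 1 ;;       # must end with a letter or digit
  esac
  exit 0
)
```

A passing check does not guarantee that the name is available; gsutil mb will still report an error if another project already owns it.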

Building the Container

To deploy our code to Kubernetes, we have to first build our local project into a container:

//set the path on GCR you want to push the image to$PROJECT_ID/kubeflow-train

# build the tensorflow model into a container
# container is tagged with its eventual path on GCR, but it stays local for now
docker build $WORKING_DIR -t $TRAIN_PATH -f $WORKING_DIR/Dockerfile.model

Now, test the new container image locally to make sure everything is working as expected:

docker run -it $TRAIN_PATH

You should see training logs start appearing in your console:

If you're seeing logs, that means training is working and you can terminate the container with Ctrl+c. Now that you know that the container can run locally, you can safely upload it to Google Container Registry (GCR) so you can run it on your cluster.

# allow docker to access our GCR registry
gcloud auth configure-docker --quiet

# push container to GCR
docker push $TRAIN_PATH

You should now see your new container listed on the GCR console.

Training on the Cluster

Finally, we can run the training job on the cluster. We can do this using the train component we added to our ksonnet project earlier. Before we can deploy it, we must set some parameters to point to our training image and storage bucket:

# set the parameters for this job
ks param set train image $TRAIN_PATH
ks param set train name "my-train-1"
ks param set train modelDir gs://${BUCKET_NAME}
ks param set train exportDir gs://${BUCKET_NAME}/export

One thing to keep in mind is that our Python training code needs permission to read from and write to the storage bucket we set up. Kubeflow solves this by creating a service account within your project as a part of the deployment. You can verify this by listing your service accounts:

gcloud --project=$PROJECT_ID iam service-accounts list | grep $DEPLOYMENT_NAME

This service account should be automatically granted the right permissions to read and write to our storage bucket. Kubeflow also added a Kubernetes secret called "user-gcp-sa" to our cluster, containing the credentials needed to authenticate as this service account within our cluster:

kubectl describe secret user-gcp-sa

To access our storage bucket from inside our train container, we just need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the JSON file contained in the secret. Luckily, the train.jsonnet component is already set up to do this for us; we just have to set two more parameters:

ks param set train secret user-gcp-sa=/var/secrets
ks param set train envVariables \

Now that all the parameters are set, we can deploy the training job to the cluster:

ks apply default -c train

After applying the component, there should be a new TFJob on the cluster, running in a pod called my-train-1-chief-0. You can use kubectl to query some information about the job, including its current state:

kubectl describe tfjob

For even more information, you can retrieve the Python logs from the pod that's running the container itself (after the container has finished initializing):

kubectl logs -f my-train-1-chief-0

When training is complete, you should see the model data pushed into your GCS bucket.
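
Training can take a while, so you may prefer to poll for the result rather than watch the logs. The sketch below is a generic retry helper, not something provided by Kubeflow; wait_for is a hypothetical name, and the timeout values in the usage note are illustrative.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout passes.
# Usage: wait_for <timeout_seconds> <interval_seconds> <command> [args...]
wait_for() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  # Run the command quietly; only its exit status matters here.
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

For example, wait_for 1800 30 gsutil ls gs://$BUCKET_NAME/export would retry every 30 seconds for up to 30 minutes, assuming gsutil ls fails while the export path is still empty.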

Now that you have a trained model, it's time to put it in a server so it can be used to handle requests. To do this, we'll use two more components from the repository, called mnist-deploy-gcp and mnist-service.

The mnist-deploy-gcp component contains a TensorFlow Serving implementation. We simply need to point the component to our GCS bucket where the model data is stored, and it will spin up a server to handle requests. Unlike the tf-job, no custom container is required for the server process. Instead, all the information the server needs is stored in the model file:

ks param set mnist-deploy-gcp modelBasePath gs://${BUCKET_NAME}/export
ks param set mnist-deploy-gcp modelName mnist
ks apply default -c mnist-deploy-gcp

To verify the server started successfully, you can check its logs. You should see that it found your bucket and is waiting for requests:

kubectl logs -l app=mnist

Although we now have a server running as a deployment within the cluster, it's inaccessible to other pods without adding an associated service. We can do so by deploying the mnist-service component, which simply creates a ClusterIP service associated with the mnist-deploy-gcp deployment.

ks apply default -c mnist-service

If you describe the new service, you'll see it's listening for connections within the cluster on port 9000:

kubectl describe service mnist-service
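
If you want to read the port in a script instead of scanning the describe output, kubectl's jsonpath output can extract it. This is a sketch: service_port and expect_port are hypothetical helper names, and it assumes the port of interest is the first one listed on the service.

```shell
# Hypothetical helper: print the first port exposed by a service.
service_port() {
  kubectl get service "$1" -o jsonpath='{.spec.ports[0].port}'
}

# Hypothetical helper: fail with a message if the port is not the expected one.
expect_port() {
  actual=$(service_port "$1") || return 1
  if [ "$actual" != "$2" ]; then
    echo "$1 listens on port '$actual', expected '$2'" >&2
    return 1
  fi
}
```

With the service deployed as described, expect_port mnist-service 9000 should return successfully.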

Now that we have a trained model in our bucket, and a TensorFlow server hosting it, we can deploy the final piece of our system: a web interface to interact with our model. The code for this is stored in the web-ui directory of the repository.

The web page for this task is fairly basic; it consists of a simple Flask server hosting HTML/CSS/JavaScript files. The Flask script makes use of a helper module, which contains the following Python code to interact with the TensorFlow server through gRPC:

import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2 as psp

# create gRPC stub
channel = implementations.insecure_channel(server_host, server_port)
stub = psp.beta_create_PredictionService_stub(channel)

# build request
request = predict_pb2.PredictRequest()
request.model_spec.name = server_name
request.model_spec.signature_name = 'serving_default'
# input_key is an assumed placeholder for the model's input tensor name
request.inputs[input_key].CopyFrom(
    tf.contrib.util.make_tensor_proto(image, shape=image.shape))

# retrieve results
result = stub.Predict(request, timeout)
resultVal = result.outputs["classes"].int_val[0]
scores = result.outputs['predictions'].float_val
version = result.outputs["classes"].int_val[0]

Building the Container

Like in the training step, we have to build a container from our code before we can deploy it on the cluster:

// set the path on GCR you want to push the image to$PROJECT_ID/kubeflow-web-ui

# build the web-ui directory
docker build $WORKING_DIR/web-ui -t $UI_PATH

# allow docker to access our GCR registry
gcloud auth configure-docker --quiet

# push the container to GCR
docker push $UI_PATH

Set parameters and deploy to the cluster:

# set parameters
ks param set web-ui image $UI_PATH
ks param set web-ui type LoadBalancer

# apply to cluster
ks apply default -c web-ui

Accessing the UI

The web-ui service is deployed using the type LoadBalancer, unlike our previous mnist-service, which was ClusterIP. This means that while mnist-service is only accessible to other pods within our cluster, web-ui is exposed to the public internet.

You can find the IP address assigned to the service using kubectl:

kubectl get service web-ui

If you enter the IP address in a web browser, you should be presented with the web interface.
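
The external IP may not be assigned immediately after the service is created. As a rough sketch, the following polls until one appears. The helper names, retry count, and delay are assumptions made for illustration, not part of the codelab.

```shell
# Hypothetical helper: print the external IP of a LoadBalancer service,
# or nothing if none has been assigned yet.
external_ip() {
  kubectl get service "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null
}

# Hypothetical helper: poll until the IP appears.
# Usage: wait_for_ip <service> [tries] [delay_seconds]
wait_for_ip() {
  service=$1
  tries=${2:-30}
  delay=${3:-10}
  while [ "$tries" -gt 0 ]; do
    ip=$(external_ip "$service")
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  echo "no external IP assigned to $service" >&2
  return 1
}
```

Running wait_for_ip web-ui prints the address once it is available, which you can then open in a browser.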

Keep in mind that the web interface doesn't do much on its own; it's simply a basic HTML/JS wrapper around the TensorFlow Serving component, which performs the actual predictions. To emphasize this, the web interface allows you to manually connect with the serving instance located in the cluster. It has three fields:

Model Name: mnist

Server Address: mnist-service

Port: 9000

These three fields uniquely define your model server. If you deploy multiple serving components, you should be able to switch between them using the web interface. Feel free to experiment.

When you're done with the codelab, it's a good idea to remove the resources you created to avoid any charges:

Delete the cluster and other resources provisioned by Kubeflow:

gcloud deployment-manager deployments delete $DEPLOYMENT_NAME

Delete the Google Cloud Storage bucket:

gsutil rm -r gs://$BUCKET_NAME

Delete the container images uploaded to Google Container Registry:

# find the digest id for each container image
gcloud container images list-tags$PROJECT_ID/kubeflow-train
gcloud container images list-tags$PROJECT_ID/kubeflow-web-ui

# delete each image
gcloud container images delete$PROJECT_ID/kubeflow-web-ui:$DIGEST_ID
gcloud container images delete$PROJECT_ID/kubeflow-train:$DIGEST_ID

Resources can also be deleted directly through the Google Cloud Console UI.