As datasets continue to expand and models grow more complex, distributing machine learning (ML) workloads across multiple nodes is becoming more attractive. Unfortunately, breaking up and distributing a workload can add both computational overhead and a great deal more complexity to the system. Data scientists should be able to focus on ML problems, not DevOps.

Fortunately, distributed workloads are becoming easier to manage, thanks to Kubernetes. Kubernetes is a mature, production-ready platform that gives developers a simple API to deploy programs to a cluster of machines as if they were a single piece of hardware. Using Kubernetes, computational resources can be added or removed as desired, and the same cluster can be used to both train and serve ML models.

This codelab will serve as an introduction to Kubeflow, an open-source project which aims to make running ML workloads on Kubernetes simple, portable and scalable. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. It also extends the Kubernetes API by adding new Custom Resource Definitions (CRDs) to your cluster, so machine learning workloads can be treated as first-class citizens by Kubernetes.

What You'll Build

This codelab will describe how to train and serve a TensorFlow model, and then how to deploy a web interface to allow users to interact with the model over the public internet. You will build a classic handwritten digit recognizer using the MNIST dataset.

The purpose of this codelab is to give a brief overview of how to interact with Kubeflow. To keep things simple, we use CPU-only training and only make use of a single node. Kubeflow's user guide has more information when you are ready to explore further.

What You'll Learn

What You'll Need

Downloading the Project Files

The first step is to download a copy of the code we will be deploying. This codelab can be completed on a local machine, or through the Google Cloud Shell:

Download in Google Cloud Shell

Download locally

Enabling the API

Before using Google Kubernetes Engine (GKE), you must enable the API for your project through the Google Cloud Platform Console.

Setting up a Cluster

Next, we have to set up a Kubernetes cluster for our project through GKE. The following command will create a new cluster named "kubeflow-codelab", located in the zone us-central1-a:

# list all projects for your Google Cloud Platform account
gcloud projects list

# set your active GCP project
gcloud config set project $PROJECT_ID

# create a cluster
gcloud container clusters create kubeflow-codelab \
      --zone us-central1-a --machine-type n1-standard-2

Connect our local environment to the cluster so we can interact with it locally using the Kubernetes CLI tool, kubectl:

gcloud container clusters get-credentials kubeflow-codelab --zone us-central1-a

Change the permissions on the cluster to allow Kubeflow to run properly:

kubectl create clusterrolebinding default-admin \
      --clusterrole=cluster-admin --user=$(gcloud config get-value account)

Now, you should have a running cluster in the cloud ready to run your code! You should be able to interact with the cluster using kubectl, or through the Google Cloud Console web interface.

Creating a ksonnet Project

Kubeflow makes use of ksonnet to help manage deployments. ksonnet acts as another layer on top of kubectl. While Kubernetes is typically managed with static YAML files, ksonnet adds a further abstraction that is closer to standard OOP objects. Resources are managed as prototypes with empty parameters, which can be instantiated into components by defining values for the parameters. This system makes it easier to deploy slightly different resources to different clusters at the same time, making it easy to maintain different environments for staging and production. Components in ksonnet can either be exported as standard Kubernetes YAML files with ks show, or they can be directly deployed to the cluster with ks apply.

If you don't have ksonnet's ks command installed, download it and add it to your path (if it's already installed on your system, you can skip this step):

# to download ksonnet for linux (including Cloud Shell)
# (the version shown is an example; check ksonnet's releases page for the latest)
KS_VER=ks_0.13.1_linux_amd64

# to download ksonnet for macOS
# KS_VER=ks_0.13.1_darwin_amd64

# download tar of ksonnet
wget https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/$KS_VER.tar.gz

# unpack file
tar -xvf $KS_VER.tar.gz

# add ks command to path
PATH=$PATH:$(pwd)/$KS_VER
Ksonnet resources are managed in a single project directory, much like a git repository. To create our ksonnet project directory, we will use ks init:

ks init ksonnet-kubeflow
cd ksonnet-kubeflow

If you look inside the new ksonnet-kubeflow project directory, you should see an app.yaml file, along with four directories. The most important of these are components, which holds the resources that are deployed to the cluster, and environments, which holds information about the clusters associated with the project.

The next step is to add our cluster as an available ksonnet environment:

ks env add cloud

This command will look at our local Kubernetes configuration, and use that to find our cluster. If you now look in the environments directory, it should have a new "cloud" environment that points to our GKE cluster.

Adding Kubeflow Packages

Ksonnet is a generic tool for deploying software to Kubernetes. Now, we need to add Kubeflow to our ksonnet project. This is managed by installing a ksonnet package. First, we have to add the Kubeflow repository to our project, and then we can pull the Kubeflow packages down into our local project:

ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow/core@${VERSION}
ks pkg install kubeflow/tf-serving@${VERSION}
ks pkg install kubeflow/tf-job@${VERSION}

After running these commands, kubeflow packages will be installed into your local ksonnet project. You should see the installed packages listed in your app.yaml file, and in the vendor directory. Now, we can apply the standard Kubeflow components to our cluster:

# generate the kubeflow-core component from its prototype
ks generate core kubeflow-core --name=kubeflow-core --cloud=gke

# apply component to our cluster
ks apply cloud -c kubeflow-core

If you now query your cluster with kubectl, you should see a list of resources provisioned by kubeflow-core.

kubectl get all 

The code for our TensorFlow project can be found in the kubeflow-introduction/tensorflow-model folder. It contains a Python file with the TensorFlow code, and a Dockerfile to build it into a container image. The Python file defines a fairly straightforward program. First, it defines a simple feed-forward neural network with two hidden layers. Next, it defines tensor ops to train and evaluate the model's weights. Finally, after a number of training cycles, it saves the trained model to a Google Cloud Storage (GCS) bucket. Of course, before we can use the storage bucket, we need to create it.
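To make the architecture concrete, here is a minimal NumPy sketch of the forward pass of a feed-forward network with two hidden layers. This is not the codelab's TensorFlow code; the layer sizes and activations here are illustrative assumptions only.

```python
import numpy as np

# Hypothetical dimensions: 784-pixel MNIST input, two hidden layers, 10 digit classes.
# The real model is written in TensorFlow; this sketch only illustrates the shape
# of a two-hidden-layer feed-forward network.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (784, 128)), np.zeros(128)
W2, b2 = rng.normal(0, 0.1, (128, 64)), np.zeros(64)
W3, b3 = rng.normal(0, 0.1, (64, 10)), np.zeros(10)

def forward(x):
    h1 = np.maximum(0, x @ W1 + b1)           # hidden layer 1 (ReLU)
    h2 = np.maximum(0, h1 @ W2 + b2)          # hidden layer 2 (ReLU)
    logits = h2 @ W3 + b3                     # output layer
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # softmax over the 10 digit classes

probs = forward(rng.normal(size=(1, 784)))
print(probs.shape)  # (1, 10)
```

Training then amounts to adjusting the weight matrices to minimize a loss over labeled MNIST examples, which the TensorFlow ops in the real program take care of.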

Setting up a Storage Bucket

Our next step is to create two things: the storage bucket that will hold our trained model, and a key file that can give our container access to it. The following command will create the bucket. Note that the name you choose must be unique across all of GCS.

# choose a unique name for your model's bucket
# (the name below is only an example; bucket names are globally unique)
BUCKET_NAME=kubeflow-codelab-$(gcloud config get-value project)

# create the GCS bucket
gsutil mb gs://$BUCKET_NAME/

Now that we have a bucket, it's time to create our key. First, create a service account: a virtual user in our GCP account with restricted access, which will write data to the bucket on our behalf:

# move back to the kubeflow-introduction project directory
cd ..

# create service account
gcloud iam service-accounts create kubeflow-codelab --display-name kubeflow-codelab

Now, grant the service account permissions to read/write to the bucket:

# get the email associated with the new service account
IAM_EMAIL=$(gcloud iam service-accounts list \
      --filter='displayName:kubeflow-codelab' --format='value(email)')

# allow the service account to upload data to the bucket
gsutil acl ch -u $IAM_EMAIL:O gs://$BUCKET_NAME

Finally, we can download the key file that lets us authenticate as the service account:

gcloud iam service-accounts keys create ./tensorflow-model/key.json \
      --iam-account $IAM_EMAIL

After running the previous command, you should see a new file called key.json in the tensorflow-model directory. When we build the container, key.json will be added in alongside the Python code, allowing the TensorFlow program to authenticate to GCS and save model data to your bucket.

Building the Container

To deploy our code to Kubernetes, we have to first build our local project into a container:

# the version number associated with this model
# here, we are tagging each run with the current unix timestamp
VERSION_TAG=$(date +%s)

# set the path on GCR you want to push the image to
TRAIN_PATH=gcr.io/$PROJECT_ID/kubeflow-train:$VERSION_TAG

# build the tensorflow-model directory
# the container is tagged with its eventual path on GCR, but it stays local for now
docker build -t $TRAIN_PATH ./tensorflow-model \
      --build-arg version=$VERSION_TAG --build-arg bucket=$BUCKET_NAME

If everything went well, your program should be encapsulated in a new container. First, let's test it locally to make sure everything is working:

docker run -it $TRAIN_PATH

You should see training logs start appearing in your console.

If you're seeing logs, that means training is working and you can terminate the container with Ctrl+c. Now that you know that the container can run locally, you can safely upload it to Google Container Registry (GCR) so you can run it on your cluster.

# allow docker to access our GCR registry
gcloud auth configure-docker

# push container to GCR
docker push $TRAIN_PATH

You should now see your new container listed on the GCR console. If you make changes to your code and want to push up a new version, simply update the VERSION_TAG and follow through the previous steps again.

Training on the Cluster

Finally, we can run the training job on the cluster. Our first step is to generate a ksonnet component out of the corresponding tf-job prototype.

# move back into ksonnet directory
cd ksonnet-kubeflow

# generate component from prototype
ks generate tf-job train

You should now see a new file in the ksonnet-kubeflow/components directory. Now, we can customize the component's parameters to point to our container on GCR as the training image:

# set the parameters for this job
ks param set train image $TRAIN_PATH
ks param set train name "train-$VERSION_TAG"

Apply the container to the cluster:

ks apply cloud -c train

After applying the component, there should be a new tf-job on the cluster called "train". You can use kubectl to query some information about the job, including its current state.

kubectl describe tfjob

For even more information, you can retrieve the python logs from the pod that's running the container itself (after the container has finished initializing):

POD_NAME=$(kubectl get pods --selector=tf_job_name=train-$VERSION_TAG \
      --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')

kubectl logs -f $POD_NAME

If you prefer to use a GUI, the same Python logs can be accessed through the GKE Cloud Console page.

When training is complete, you should see the model data pushed into your GCS bucket, tagged with the same version number as the container that generated it.

Of course, it's likely that you will need to train more than once. Kubeflow gives us a simple deploy pipeline we can use to train new versions of our model repeatedly. When you have a new version to push, you simply build a new container (with a new version tag), modify your tf-job parameters, and re-apply it to the cluster. You don't need to regenerate the tf-job component every time, just set the parameters to point to the new version of your container. New model versions will appear in appropriately tagged directories inside the GCS bucket.

Now that you have a trained model, it's time to put it in a server so it can be used to handle requests. This task is handled by the tf-serving prototype, which is the Kubeflow implementation of TensorFlow Serving. Unlike the tf-job, no custom container is required for the server process. Instead, all the information the server needs is stored in the model file. We simply need to point the server component to our GCS bucket where the model data is stored, and it will spin up to handle requests:

# create a ksonnet component from the prototype
ks generate tf-serving serve --name=mnist-serve

# set the parameters and apply to the cluster
ks param set serve modelPath gs://$BUCKET_NAME/
ks apply cloud -c serve

One interesting detail to note is that you don't need to specify a VERSION_TAG, even though you may have multiple versions of your model saved in your bucket. Instead, the serving component will pick up the most recent version and serve it.
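To illustrate what "most recent" means here, the sketch below shows the selection rule for timestamp-tagged version directories like the ones our training job writes. The directory names are made up for illustration, and TensorFlow Serving performs this selection internally.

```python
# Hypothetical version directories, named with unix timestamps as in our
# training job. The newest (numerically largest) timestamp is served.
versions = ["1528149921", "1528155793", "1528160012"]
latest = max(versions, key=int)
print(latest)  # 1528160012
```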

Like during training, you can check the logs of the running server pod to ensure everything is working as expected:

POD_NAME=$(kubectl get pods --selector=app=mnist-serve \
      --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')

kubectl logs $POD_NAME

Finally, we can deploy the final piece of our system: a web interface that can interact with our trained model server. This code is stored in the kubeflow-introduction/web-ui directory.

The web page for this task is fairly basic; it consists of a simple Flask server hosting HTML/CSS/JavaScript files. The Flask server makes use of a Python file containing a function that directly interacts with the TensorFlow server.
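As a rough illustration of what such a client-side function might do, the sketch below flattens a 28x28 grayscale digit into the float vector an MNIST model typically expects. The actual request to TensorFlow Serving is omitted, and the function name and scaling here are assumptions, not the codelab's code.

```python
import numpy as np

# Hypothetical preprocessing step, mirroring what a client helper might do
# before calling the model server: normalize a 28x28 grayscale image and
# flatten it into the (1, 784) float vector the MNIST model expects.
# Sending the request to TensorFlow Serving is omitted here.
def to_model_input(image_28x28):
    arr = np.asarray(image_28x28, dtype=np.float32)
    assert arr.shape == (28, 28)
    return (arr / 255.0).reshape(1, 784)  # scale pixels to [0, 1] and flatten

x = to_model_input(np.zeros((28, 28)))
print(x.shape)  # (1, 784)
```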

Like in the training step, we first have to build a container from our code:

# move back to the kubeflow-introduction project directory
cd ..

# set the path on GCR you want to push the image to
UI_PATH=gcr.io/$PROJECT_ID/kubeflow-web-ui

# build the web-ui directory
docker build -t $UI_PATH ./web-ui

# allow docker to access our GCR registry
gcloud auth configure-docker

# push the container to GCR
docker push $UI_PATH

To deploy the web UI to the cluster, you must again create a ksonnet component. This time, we will use the ksonnet built-in deployed-service prototype. This component will create the deployment and LoadBalancer for us so we can connect with our flask server from outside the cluster.

# move back into ksonnet project directory
cd ksonnet-kubeflow

# generate the component from its prototype
ks generate deployed-service web-ui --name=web-ui --image=$UI_PATH \
      --type=LoadBalancer --containerPort=5000 --servicePort=80 

# apply component to our cluster
ks apply cloud -c web-ui

Now, there should be a new web UI running in the cluster. To access it through your web browser, you must find the external IP address of the service, and paste it into your URL bar. Note that it may take a couple minutes for the IP address to appear.

kubectl get service web-ui

To use the UI, we must enter the details of the TensorFlow server running in the cluster. We need three things: a name, an address, and a port.

If you punch these values into the web UI, it should find your server in your cluster, and display classification results.

When you're done with the codelab, it's a good idea to remove the resources you created to avoid any charges:

To delete the cluster:

gcloud container clusters delete kubeflow-codelab --zone us-central1-a

To delete the Google Cloud Storage bucket:

gsutil rm -r gs://$BUCKET_NAME

To delete the Service Account:

gcloud iam service-accounts delete $IAM_EMAIL

To delete the container images uploaded to Google Container Registry:

# find the digest id for each container image
gcloud container images list-tags gcr.io/$PROJECT_ID/kubeflow-train
gcloud container images list-tags gcr.io/$PROJECT_ID/kubeflow-web-ui

# delete each image
gcloud container images delete gcr.io/$PROJECT_ID/kubeflow-web-ui:$DIGEST_ID
gcloud container images delete gcr.io/$PROJECT_ID/kubeflow-train:$DIGEST_ID

Resources can also be deleted directly through the Google Cloud Console UI.