As datasets continue to expand and models grow more complex, distributing machine learning (ML) workloads across multiple nodes is becoming more attractive. Unfortunately, breaking up and distributing a workload can add both computational overhead and a great deal more complexity to the system. Data scientists should be able to focus on ML problems, not DevOps.

Fortunately, distributed workloads are becoming easier to manage, thanks to Kubernetes. Kubernetes is a mature, production-ready platform that gives developers a simple API to deploy programs to a cluster of machines as if they were a single piece of hardware. Using Kubernetes, computational resources can be added or removed as desired, and the same cluster can be used to both train and serve ML models.

This codelab will serve as an introduction to Kubeflow, an open-source project which aims to make running ML workloads on Kubernetes simple, portable and scalable. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. It also extends the Kubernetes API by adding new Custom Resource Definitions (CRDs) to your cluster, so machine learning workloads can be treated as first-class citizens by Kubernetes.

What You'll Build

This codelab will describe how to train and serve a TensorFlow model, and then how to deploy a web interface to allow users to interact with the model over the public internet. You will build a classic handwritten digit recognizer using the MNIST dataset.

The purpose of this codelab is to get a brief overview of how to interact with Kubeflow. To keep things simple, the model we'll deploy will use CPU-only training, and only make use of a single node for training. Kubeflow's documentation has more information when you are ready to explore further.

What You'll Learn

What You'll Need

Downloading the Project Files

The first step is to download a copy of the Kubeflow examples repository, which hosts the code we will be deploying. This codelab can be completed on a local machine or through Google Cloud Shell.

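From either environment, you can fetch the code by cloning the repository with git:

git clone https://github.com/kubeflow/examples.git
cd ./examples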

Setting Environment Variables

Before we can start, we should set up a few environment variables we will be using over the course of the codelab. The first is the project ID, which denotes which GCP project we will be using:

# available project IDs can be listed with the following command:
# gcloud projects list
PROJECT_ID=<YOUR_CHOSEN_PROJECT_ID>

gcloud config set project $PROJECT_ID

We also need to provide the zone we want to use:

ZONE=us-central1-c
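Any zone where GKE is available will work; you can list your options with:

gcloud compute zones list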

Next, we will set the Kubeflow deployment name. For this codelab, we will simply use "mnist-deployment":

DEPLOYMENT_NAME=mnist-deployment

We will be working out of the "mnist" directory of the repository, so change to the proper directory and set an environment variable:

cd ./mnist
WORKING_DIR=$(pwd)

Installing Kustomize

Kubeflow uses a tool called Kustomize to manage deployments. Kustomize lets us set up our application so that the same code can be deployed across different environments.
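As a quick illustration of how this works (a sketch, not the actual file used in this repository), a kustomization.yaml file lists the base manifests to deploy, plus environment-specific parameters to generate alongside them:

# kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# the manifests that make up the application
resources:
- deployment.yaml

# environment-specific parameters, generated as a ConfigMap
configMapGenerator:
- name: mnist-map-training
  literals:
  - trainSteps=200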

You can install Kustomize into your Cloud Shell instance by downloading the binary from GitHub:

mkdir $WORKING_DIR/bin
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v2.0.3/kustomize_2.0.3_linux_amd64 \
    -O $WORKING_DIR/bin/kustomize
chmod +x $WORKING_DIR/bin/kustomize
PATH=$PATH:${WORKING_DIR}/bin
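You can verify the installation by asking the binary for its version:

kustomize version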

To install Kustomize on other platforms, check out its installation guide.

Enabling the API

Before using Google Kubernetes Engine (GKE), you must enable the API for your project through the Google Cloud Platform Console.
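If you prefer the command line, you can enable the same API with gcloud:

gcloud services enable container.googleapis.com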

Setting up a Kubeflow Cluster

The simplest way to deploy a Kubeflow-enabled cluster is through the Kubeflow Click-to-Deploy web interface at deploy.kubeflow.cloud. Simply enter your project ID, deployment name, and zone, then press the "Create Deployment" button. All necessary resources will be provisioned automatically.

Open Kubeflow Click-to-Deploy

Fill in the following values in the resulting form:

Project: the GCP project ID you set earlier ($PROJECT_ID)

Deployment name: mnist-deployment ($DEPLOYMENT_NAME)

Zone: us-central1-c ($ZONE)

The resources created here will be controlled by the GCP Deployment Manager, where you can see the current status of the deployment and manage everything in one place. It may take up to 10 minutes before the cluster is ready to use.

When the cluster is fully set up, you can connect your local kubectl session to it. The following command should output "kubeconfig entry generated" when run successfully, letting you know that your GCP credentials were added to your kubeconfig.

gcloud container clusters get-credentials \
    $DEPLOYMENT_NAME --zone $ZONE --project $PROJECT_ID

You should now be able to interact with your cluster through the kubectl command. Switch to the kubeflow namespace to see the resources that were pre-installed on the Kubeflow cluster:

kubectl config set-context $(kubectl config current-context) --namespace=kubeflow

If you query the resources running on your cluster, you should see that Kubeflow has automatically provisioned everything you'll need to run training and serving jobs.

kubectl get all

For more information on the components that you see listed here, check out the official documentation.

Training the Model

The code for our TensorFlow project can be found in the model.py file in the examples repository. model.py defines a fairly straightforward TensorFlow training program, with no special modifications for Kubeflow. After training is complete, it will attempt to upload the trained model to a path we provide. For the purpose of this codelab, we will create and use a Google Cloud Storage (GCS) bucket to hold the trained model.

Setting up a Storage Bucket

Our next step is to create a storage bucket on Google Cloud Storage to hold our trained model. Note that the name you choose for your bucket must be unique across all of GCS.

# bucket name can be anything, but must be unique across all projects
BUCKET_NAME=${DEPLOYMENT_NAME}-${PROJECT_ID}

# create the GCS bucket
gsutil mb gs://${BUCKET_NAME}/
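You can confirm the bucket was created by listing it:

gsutil ls -b gs://${BUCKET_NAME}/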

Building the Container

To deploy our code to Kubernetes, we first have to build our local project into a container image:

# set the path on GCR you want to push the image to
IMAGE_PATH=us.gcr.io/$PROJECT_ID/kubeflow-train

# build the TensorFlow model into a container image
# the image is tagged with its eventual path on GCR, but it stays local for now
docker build $WORKING_DIR -t $IMAGE_PATH -f $WORKING_DIR/Dockerfile.model

Now, test the new container image locally to make sure everything is working as expected:

docker run -it $IMAGE_PATH

You should see training logs start appearing in your console.

If you see logs, training is working, and you can terminate the container with Ctrl+C. Now that you know the container runs locally, you can safely upload it to Google Container Registry (GCR) so you can run it on your cluster.

# allow docker to access our GCR registry
gcloud auth configure-docker --quiet

# push the container to GCR
docker push $IMAGE_PATH

You should now see your new container listed on the GCR console.
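You can also verify the push from the command line by listing the image's tags:

gcloud container images list-tags $IMAGE_PATH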

Training on the Cluster

Finally, we can run the training job on the cluster. First, move into the training directory:

cd $WORKING_DIR/training/GCS

If you look around this directory, you will see a number of YAML files. We can now use kustomize to configure the manifests. First, set a unique name for the training run:

kustomize edit add configmap mnist-map-training \
    --from-literal=name=my-train-1

Next, set some default training parameters (number of training steps, batch size, and learning rate):

kustomize edit add configmap mnist-map-training \
    --from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training \
    --from-literal=batchSize=100
kustomize edit add configmap mnist-map-training \
    --from-literal=learningRate=0.01
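Each kustomize edit command records its change in the kustomization.yaml file in the current directory, so you can review the parameters set so far at any point:

cat kustomization.yaml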

Now, configure the manifests to use our custom bucket and training image:

kustomize edit set image training-image=${IMAGE_PATH}:latest
kustomize edit add configmap mnist-map-training \
    --from-literal=modelDir=gs://${BUCKET_NAME}/
kustomize edit add configmap mnist-map-training \
    --from-literal=exportDir=gs://${BUCKET_NAME}/export

One thing to keep in mind is that our Python training code needs permission to read and write to the storage bucket we set up. Kubeflow solves this by creating a service account within your project as part of the deployment. You can verify this by listing your service accounts:

gcloud --project=$PROJECT_ID iam service-accounts list | grep $DEPLOYMENT_NAME

This service account should be automatically granted the right permissions to read and write to our storage bucket. Kubeflow also added a Kubernetes secret called "user-gcp-sa" to our cluster, containing the credentials needed to authenticate as this service account within our cluster:

kubectl describe secret user-gcp-sa

To access our storage bucket from inside our training container, we just need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the JSON file contained in the secret. We can do this by setting a few more Kustomize parameters:

kustomize edit add configmap mnist-map-training \
    --from-literal=secretName=user-gcp-sa
kustomize edit add configmap mnist-map-training \
    --from-literal=secretMountPath=/var/secrets
kustomize edit add configmap mnist-map-training \
    --from-literal=GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json

Now that all the parameters are set, we can use Kustomize to build the new customized YAML file:

kustomize build . 

We can pipe this YAML manifest to kubectl to deploy the training job to the cluster:

kustomize build . | kubectl apply -f -

After applying the component, there should be a new TFJob on the cluster named my-train-1, with a chief pod called my-train-1-chief-0. You can use kubectl to query some information about the job, including its current state:

kubectl describe tfjob

For even more information, you can retrieve the Python logs from the pod that's running the container itself (after the container has finished initializing):

kubectl logs -f my-train-1-chief-0
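If you'd rather block until training finishes, kubectl can wait on the TFJob's Succeeded status condition (assuming kubectl 1.11 or later, which supports the wait command):

kubectl wait --for=condition=Succeeded tfjob/my-train-1 --timeout=15m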

When training is complete, you can inspect your bucket's contents using gsutil. You should see the exported model data:

gsutil ls -r gs://${BUCKET_NAME}/export

Alternatively, you can check the contents of your bucket through the GCP Cloud Console.

Serving the Model

Now that you have a trained model, it's time to put it in a server so it can be used to handle requests. To do this, we'll use some files in the "serving/GCS" directory:

cd $WORKING_DIR/serving/GCS

The Kubeflow manifests in this directory contain a TensorFlow Serving implementation. We simply need to point the component at our GCS bucket where the model data is stored, and it will spin up a server to handle requests. Unlike the training job, no custom container is required for the server process. Instead, all the information the server needs is stored in the model file.
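To see why, note that a stock TensorFlow Serving binary can serve any exported SavedModel you point it at. For instance, outside of Kubernetes you could serve a local copy of the export directory with the official Docker image (illustrative only; /path/to/export is a placeholder for wherever you copied the model):

docker run -p 8500:8500 \
    -v /path/to/export:/models/mnist \
    -e MODEL_NAME=mnist \
    tensorflow/serving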

Like before, we will set some parameters to customize the deployment for our use. First we'll set the name for the service:

kustomize edit add configmap mnist-map-serving \
    --from-literal=name=mnist-service

Next, point the server at the trained model in our GCS bucket:

kustomize edit add configmap mnist-map-serving \
    --from-literal=modelBasePath=gs://${BUCKET_NAME}/export

Now, deploy the server to the cluster:

kustomize build . | kubectl apply -f -

If you describe the new service, you'll see it's listening for connections within the cluster on port 9000:

kubectl describe service mnist-service

The Web Interface

Now that we have a trained model in our bucket and a TensorFlow server hosting it, we can deploy the final piece of our system: a web interface to interact with our model. The code for this is stored in the web-ui directory of the repository.

The web page for this task is fairly basic; it consists of a simple Flask server hosting HTML/CSS/JavaScript files. The Flask script makes use of mnist_client.py, which contains the following Python code to interact with the TensorFlow server through gRPC:

import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2 as psp

# create a gRPC stub connected to the serving instance
# (server_host, server_port, server_name, image, and timeout
# are supplied by the surrounding script)
channel = implementations.insecure_channel(server_host, server_port)
stub = psp.beta_create_PredictionService_stub(channel)

# build a request against the model's default serving signature,
# packing the image into a tensor proto
request = predict_pb2.PredictRequest()
request.model_spec.name = server_name
request.model_spec.signature_name = 'serving_default'
request.inputs['x'].CopyFrom(
    tf.contrib.util.make_tensor_proto(image, shape=image.shape))

# retrieve results: the predicted class and per-class scores
result = stub.Predict(request, timeout)
resultVal = result.outputs["classes"].int_val[0]
scores = result.outputs['predictions'].float_val
version = result.outputs["classes"].int_val[0]

Deploying the Web UI

Change to the directory containing the front-end manifests:

cd $WORKING_DIR/front

Unlike the other steps, this manifest requires no customization. It can be applied directly:

kustomize build . | kubectl apply -f -

The service added is of type ClusterIP, meaning it can't be accessed from outside the cluster. To load the web UI in your browser, you have to set up a direct connection to the cluster:

kubectl port-forward svc/web-ui 8080:80

Now, in your Cloud Shell interface, press the web preview button and select "Preview on port 8080" to open the web interface in your browser.

You should now see the MNIST Web UI.

Keep in mind that the web interface doesn't do much on its own; it's simply a basic HTML/JS wrapper around the TensorFlow Serving component, which performs the actual predictions. To emphasize this, the web interface allows you to manually connect with the serving instance located in the cluster. It has three fields:

Model Name: mnist

Server Address: mnist-service

Port: 9000

These three fields uniquely define your model server. If you deploy multiple serving components, you should be able to switch between them using the web interface. Feel free to experiment!
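For example, editing the name literal in the serving/GCS kustomization.yaml (say, to mnist-service-2) and re-applying the build output should stand up a second, independently addressable server, assuming the serving manifests derive their resource names from that parameter:

# illustrative sketch: rename the service parameter and redeploy
cd $WORKING_DIR/serving/GCS
sed -i 's/name=mnist-service/name=mnist-service-2/' kustomization.yaml
kustomize build . | kubectl apply -f -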

Cleaning Up

When you're done with the codelab, it's a good idea to remove the resources you created to avoid unnecessary charges.

Delete the cluster and other resources provisioned by Kubeflow:

gcloud deployment-manager deployments delete $DEPLOYMENT_NAME

Delete the Google Cloud Storage bucket:

gsutil rm -r gs://$BUCKET_NAME

Delete the container image uploaded to Google Container Registry:

gcloud container images delete us.gcr.io/$PROJECT_ID/kubeflow-train

Resources can also be deleted directly through the Google Cloud Console UI.