Kubeflow Pipelines - GitHub Issue Summarization

1. Introduction

Kubeflow is a Machine Learning toolkit for Kubernetes. The project is dedicated to making deployments of Machine Learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

A machine learning workflow can involve many steps with dependencies on each other, from data preparation and analysis, to training, to evaluation, to deployment, and more. It's hard to compose and track these processes in an ad-hoc manner—for example, in a set of notebooks or scripts—and things like auditing and reproducibility become increasingly problematic. Kubeflow Pipelines (KFP) helps solve these issues by providing a way to deploy robust, repeatable machine learning pipelines along with monitoring, auditing, version tracking, and reproducibility. Cloud AI Pipelines makes it easy to set up a KFP installation.

What you'll build

In this codelab, you will build a web app that summarizes GitHub issues using Kubeflow Pipelines to train and serve a model. It is based on an example in the Kubeflow Examples repo. Upon completion, your infrastructure will contain a GKE cluster with a Kubeflow Pipelines installation, a pipeline run that trains and serves a Tensor2Tensor model, and a web app that queries the served model for predictions.

The pipeline you will build trains a Tensor2Tensor model on GitHub issue data, learning to predict issue titles from issue bodies. It then exports the trained model and deploys it using TensorFlow Serving. The final step in the pipeline launches a web app, which interacts with the TF-Serving instance to get model predictions.

What you'll learn

  • How to install Kubeflow Pipelines on a GKE cluster
  • How to build and run ML workflows using Kubeflow Pipelines
  • How to define and run pipelines from an AI Platform Notebook

What you'll need

  • A Google Cloud Platform (GCP) project (your own, or a temporary codelab account)

2. Setup

Cloud Shell

Visit the GCP Console in the browser and log in with your project credentials:

Click "Select a project" if needed, so that you're working with your codelab project.

Then click the "Activate Cloud Shell" icon in the top right of the console to start up a Cloud Shell.

When you start up the Cloud Shell, it will tell you the name of the project it's set to use. Check that this setting is correct.

To find your project ID, visit the GCP Console's Home panel. If the screen is empty, click on "Yes" at the prompt to create a dashboard.

Then, in the Cloud Shell terminal, run these commands if necessary to configure gcloud to use the correct project:

export PROJECT_ID=<your_project_id>
gcloud config set project ${PROJECT_ID}

Create a storage bucket

Create a Cloud Storage bucket for storing pipeline files. Bucket names must be globally unique, so it is convenient to define a bucket name that includes your project ID. Create the bucket using the gsutil mb (make bucket) command:

export PROJECT_ID=<your_project_id>
export BUCKET_NAME=kubeflow-${PROJECT_ID}
gsutil mb gs://${BUCKET_NAME}

Alternatively, you can create a bucket via the GCP Console.

Optional: Create a GitHub token

This codelab calls the GitHub API to retrieve publicly available data. To prevent rate-limiting, especially at events where a large number of anonymous requests are sent to the GitHub APIs, set up an access token with no permissions. This simply authorizes you as an individual rather than an anonymous user.

  1. Navigate to https://github.com/settings/tokens and generate a new token with no scopes.
  2. Save it somewhere safe. If you lose it, you will need to delete it and create a new one.

If you skip this step, the lab will still work – you will just be a bit more limited in your options for generating input data to test your model.
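
For reference, here is a hedged sketch of how such a token authenticates an otherwise-anonymous GitHub API request. The rate_limit endpoint and "token" Authorization header are standard GitHub API conventions, not part of this codelab's code:

import requests  # assumes the requests package is installed

# Replace with the token you generated; anonymous requests get a much lower quota.
resp = requests.get('https://api.github.com/rate_limit',
                    headers={'Authorization': 'token YOUR_GITHUB_TOKEN'})
print(resp.json()['resources']['core'])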

Optional: Pin useful dashboards

In the GCP console, pin the Kubernetes Engine and Storage dashboards for easier access.

Create an AI Platform Pipelines (Hosted Kubeflow Pipelines) installation

Follow the instructions in the "Before you begin" and "Set up your instance" sections here to set up a GKE instance with KFP installed. Be sure to check the Allow access to the following Cloud APIs box as indicated in the documentation. (If you don't, the example pipeline won't run successfully.) Leave the installation namespace as default.

You'll need to pick a zone that supports NVIDIA Tesla K80 GPUs; us-central1-a or us-central1-c are reasonable defaults.

Note the GKE cluster name and zone listed for your installation in the AI Pipelines dashboard once installation is complete, and for convenience set environment variables to these values:

export ZONE=<your zone>
export CLUSTER_NAME=<your cluster name>

Set up kubectl to use your new GKE cluster's credentials

After the GKE cluster has been created, configure kubectl to use the credentials of the new cluster by running the following command in your Cloud Shell:

gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --project ${PROJECT_ID} \
  --zone ${ZONE}

Alternatively, click on the name of the cluster in the AI Pipelines dashboard to visit its GKE page, then click "Connect" at the top of the page. From the popup, paste the command into your Cloud Shell.

This configures your kubectl context so that you can interact with your cluster. To verify the config, run the following command:

kubectl get nodes -o wide

You should see nodes listed with a status of "Ready", and other information about node age, version, external IP address, OS image, kernel version, and container runtime.

Configure the cluster to install the NVIDIA driver on GPU-enabled node pools

Next, we'll apply a DaemonSet to the cluster, which will install the NVIDIA driver on any GPU-enabled cluster nodes:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Then run the following command, which gives the KFP components permission to create new Kubernetes resources:

kubectl create clusterrolebinding sa-admin --clusterrole=cluster-admin --serviceaccount=kubeflow:pipeline-runner

Create a GPU node pool

Then, we'll set up a GPU node pool with a size of 1:

gcloud container node-pools create gpu-pool \
    --cluster=${CLUSTER_NAME} \
    --zone ${ZONE} \
    --num-nodes=1 \
    --machine-type n1-highmem-8 \
    --scopes cloud-platform --verbosity error \
    --accelerator=type=nvidia-tesla-k80,count=1

3. Run a pipeline from the Pipelines dashboard

Open the Pipelines dashboard

In the Cloud Console, visit the Pipelines panel if you're not already there. Then click on "OPEN PIPELINES DASHBOARD" for your installation, and click on Pipelines in the left menu bar. If you get a load error, refresh the tab. You should then see the Pipelines list page.

Pipeline description

The pipeline you will run has several steps (see the Appendix of this codelab for details):

  1. An existing model checkpoint is copied to your bucket.
  2. A Tensor2Tensor model is trained using preprocessed data.
      • Training starts from the existing model checkpoint copied in the first step, then trains for a few hundred more steps. (It would take too long to fully train it during the codelab.)
      • When training finishes, the pipeline step exports the model in a form suitable for serving by TensorFlow Serving.
  3. A TensorFlow Serving instance is deployed using that model.
  4. A web app is launched for interacting with the served model to retrieve predictions.

Download and compile the pipeline

In this section, we'll see how to compile a pipeline definition. The first thing we need to do is install the KFP SDK. Run the following in the Cloud Shell:

pip3 install -U kfp
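
Optionally, you can confirm the SDK is importable from Python (the kfp package exposes a version string; this is a quick sanity check, not required for the lab):

import kfp
print(kfp.__version__)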

To download the pipeline definition file, execute this command from the Cloud Shell:

curl -O https://raw.githubusercontent.com/amygdala/kubeflow-examples/ghsumm/github_issue_summarization/pipelines/example_pipelines/gh_summ_hosted_kfp.py

Then compile the pipeline definition file by running it like this:

python3 gh_summ_hosted_kfp.py

You will see the file gh_summ_hosted_kfp.py.tar.gz appear as a result.
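
Under the hood, running the script invokes the KFP compiler on the pipeline function. Here's a minimal sketch of that pattern, assuming the KFP v1 SDK (the actual file may differ in its details):

import kfp.compiler as compiler

# gh_summ is the pipeline function decorated with @dsl.pipeline (see the Appendix).
if __name__ == '__main__':
  compiler.Compiler().compile(gh_summ, __file__ + '.tar.gz')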

Upload the compiled pipeline

In the Kubeflow Pipelines web UI, click on Upload pipeline, and select Import by URL. Copy, then paste in the following URL, which points to the same pipeline that you just compiled. (It's a few extra steps to upload a file from Cloud Shell, so we're taking a shortcut).

https://storage.googleapis.com/aju-dev-demos-codelabs/KF/compiled_pipelines/gh_summ_hosted_kfp.py.tar.gz

Give the pipeline a name (e.g. gh_summ).

Run the pipeline

Click on the uploaded pipeline in the list — this lets you view the pipeline's static graph — then click on Create experiment to create a new Experiment using the pipeline. An Experiment is a way to group together semantically related runs.

Give the Experiment a name (e.g. the same name as the pipeline, gh_summ), then click Next to create.

This will bring up a page where you can enter the parameters for a Run and start it off.

You may want to execute the following commands in Cloud Shell to help fill in the parameters.

gcloud config get-value project
echo "gs://${BUCKET_NAME}/codelab"

The Run name will be auto-filled, but you can give it a different name if you like.

Then fill in three parameter fields:

  • project
  • (optional) github-token
  • working-dir

For the working-dir, enter some path under the GCS bucket you created. Include the gs:// prefix. For the github-token field, enter either the token that you optionally generated earlier, or leave the placeholder string as is if you did not generate a token.

After filling in the fields, click Start, then click on the listed run to view its details. While a given pipeline step is running, you can click on it to get more information about it, including viewing its pod logs. (You can also view the logs for a pipeline step via the link to its Cloud Logging (Stackdriver) logs, even if the cluster node has been torn down).

View the pipeline definition

While the pipeline is running, you may want to take a closer look at how it is put together and what it is doing. There is more detail in the Appendix section of the codelab.

View model training information in TensorBoard

Once the training step is complete, select its Visualizations tab and click the blue Start TensorBoard button, then once it's ready, click Open TensorBoard.

Explore the Artifacts and Executions dashboard

Kubeflow Pipelines automatically logs metadata about the pipeline steps as a pipeline executes. Both Artifact and Execution information is recorded. Click these entries in the left nav bar of the dashboard to explore further.

For Artifacts, you can view both an overview panel and a Lineage Explorer panel.

Bring up the web app created by the pipeline and make some predictions

The last step in the pipeline deploys a web app, which provides a UI for querying the trained model — served via TF Serving — to make predictions.

After the pipeline completes, connect to the web app by port-forwarding to its service (we're port-forwarding because, for this codelab, the webapp service is not set up to have an external endpoint).

Find the service name by running this command in the Cloud Shell:

kubectl get services

Look for a service name of the form ghsumm-*-webappsvc in the list.

Then, in the Cloud Shell, port-forward to that service as follows, changing the following command to use the name of your webappsvc:

kubectl port-forward svc/ghsumm-xxxxx-webappsvc 8080:80

Once port-forwarding is running, click on the "preview" icon above the Cloud Shell pane, and in the dropdown, click "Preview on port 8080".

The web app should come up in a new tab.

Click the Populate Random Issue button to retrieve a block of text. Click on Generate Title to call the trained model and display a prediction.

If your pipeline parameters included a valid GitHub token, you can alternately try entering a GitHub URL in the second field, then clicking "Generate Title". If you did not set up a valid GitHub token, use only the "Populate Random Issue" field.

4. Run a pipeline from an AI Platform Notebook

You can also interactively define and run Kubeflow Pipelines from a Jupyter notebook using the KFP SDK. AI Platform Notebooks, which we'll use for this codelab, makes this very straightforward.
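
For instance, with the KFP v1 SDK you can connect a client to your installation and launch runs directly from notebook cells. A minimal sketch, assuming the KFP v1 client API (the host value below is hypothetical; copy yours from the opened Pipelines dashboard URL, and use the same parameter names you filled in earlier):

import kfp

client = kfp.Client(host='https://xxxx-dot-us-central1.pipelines.googleusercontent.com')
exp = client.create_experiment(name='gh_summ_notebook')
run = client.run_pipeline(exp.id, 'gh_summ notebook run',
                          pipeline_package_path='gh_summ_hosted_kfp.py.tar.gz',
                          params={'project': 'your-project-id',
                                  'working-dir': 'gs://your-bucket/codelab'})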

Create a notebook instance

We'll create a notebook instance from the Cloud Shell using the Notebooks API. (Alternatively, you can create a notebook via the Cloud Console. See the documentation for more information.)

Set the following environment variables in the Cloud Shell:

export INSTANCE_NAME="kfp-ghsumm"
export VM_IMAGE_PROJECT="deeplearning-platform-release"
export VM_IMAGE_FAMILY="tf2-2-3-cpu"
export MACHINE_TYPE="n1-standard-4"
export LOCATION="us-central1-c"

Then, from the Cloud Shell, run the command to create the notebook instance:

gcloud beta notebooks instances create $INSTANCE_NAME \
  --vm-image-project=$VM_IMAGE_PROJECT \
  --vm-image-family=$VM_IMAGE_FAMILY \
  --machine-type=$MACHINE_TYPE --location=$LOCATION

When you first run this command, you may be asked to enable the Notebooks API for your project. Reply "y" if so.

After a few minutes, your notebook server will be up and running. You can see your Notebook instances listed in the Cloud Console.

Upload the codelab notebook

After the notebook instance is created, click this link to upload the codelab's Jupyter notebook. Select the notebook instance to use. The notebook will be automatically opened.

Execute the notebook

Follow the instructions in the notebook for the remainder of the lab. Note that in the "Setup" part of the notebook, you will need to fill in your own values before running the rest of the notebook.

(If you're using your own project, don't forget to return and do the "Clean up" section of this lab).

5. Clean up

You don't need to do this if you're using a temporary codelab account, but you may wish to take down your Pipelines installation and Notebook if you're using your own project.

Take down the Pipelines GKE cluster

You can delete the Pipelines cluster from the Cloud Console. (You have the option of just deleting the Pipelines installation if you want to reuse the GKE cluster).

Delete the AI Notebook instance

If you ran the "Notebook" part of the codelab, you can DELETE or STOP the notebook instance from the Cloud Console.

Optional: Remove the GitHub token

Navigate to https://github.com/settings/tokens and remove the generated token.

6. Appendices

A look at the code

Defining the pipeline

The pipeline used in this codelab is defined here.

Let's take a look at how it is defined, as well as how its components (steps) are defined. We'll cover some highlights, but see the documentation for more details.

Kubeflow Pipeline steps are container-based. When you're building a pipeline, you can use pre-built components, with already-built container images, or build your own components. For this codelab, we've built our own.
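
(As an aside, for simple steps the KFP SDK can also build a component directly from a Python function. The following lightweight-component sketch is for illustration only; this codelab's components are custom container images instead:)

import kfp.components as comp

def add(a: float, b: float) -> float:
  """A trivial step body that the SDK packages into a container-based component."""
  return a + b

add_op = comp.func_to_container_op(add)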

Four of the pipeline steps are defined from reusable components, accessed via their component definition files. In this first code snippet, we're accessing these component definition files via their URL, and using these definitions to create 'ops' that we'll use to create a pipeline step.

import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp

...

copydata_op = comp.load_component_from_url(
  'https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/components/t2t/datacopy_component.yaml'
  )

train_op = comp.load_component_from_url(
  'https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/components/t2t/train_component.yaml'
  )

Below is one of the component definitions, for the training op, in YAML format. You can see that its inputs, outputs, container image, and container entrypoint args are defined.

name: Train T2T model
description: |
  A Kubeflow Pipeline component to train a Tensor2Tensor
  model
metadata:
  labels:
    add-pod-env: 'true'
inputs:
  - name: train_steps
    description: '...'
    type: Integer
    default: 2019300
  - name: data_dir
    description: '...'
    type: GCSPath
  - name: model_dir
    description: '...'
    type: GCSPath
  - name: action
    description: '...'
    type: String
  - name: deploy_webapp
    description: '...'
    type: String
outputs:
  - name: launch_server
    description: '...'
    type: String
  - name: train_output_path
    description: '...'
    type: GCSPath
  - name: MLPipeline UI metadata
    type: UI metadata
implementation:
  container:
    image: gcr.io/google-samples/ml-pipeline-t2ttrain:v3ap
    args: [
      --data-dir, {inputValue: data_dir},
      --action, {inputValue: action},
      --model-dir, {inputValue: model_dir},
      --train-steps, {inputValue: train_steps},
      --deploy-webapp, {inputValue: deploy_webapp},
      --train-output-path, {outputPath: train_output_path}
    ]
    env:
      KFP_POD_NAME: "{{pod.name}}"
    fileOutputs:
      launch_server: /tmp/output
      MLPipeline UI metadata: /mlpipeline-ui-metadata.json

You can also define a pipeline step via the dsl.ContainerOp constructor, as we will see below.

Below is the bulk of the pipeline definition. We're defining the pipeline inputs (and their default values). Then we define the pipeline steps. For most steps we're using the 'ops' defined above, but we're also defining a 'serve' step inline via ContainerOp, specifying the container image and entrypoint arguments directly.

You can see that the train, log_model, and serve steps are accessing outputs from previous steps as inputs. You can read more about how this is specified here.

@dsl.pipeline(
 name='Github issue summarization',
 description='Demonstrate Tensor2Tensor-based training and TF-Serving'
)
def gh_summ(  #pylint: disable=unused-argument
 train_steps: 'Integer' = 2019300,
 project: str = 'YOUR_PROJECT_HERE',
 github_token: str = 'YOUR_GITHUB_TOKEN_HERE',
 working_dir: 'GCSPath' = 'gs://YOUR_GCS_DIR_HERE',
 checkpoint_dir: 'GCSPath' = 'gs://aju-dev-demos-codelabs/kubecon/model_output_tbase.bak2019000/',
 deploy_webapp: str = 'true',
 data_dir: 'GCSPath' = 'gs://aju-dev-demos-codelabs/kubecon/t2t_data_gh_all/'
 ):
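 # COPY_ACTION and TRAIN_ACTION are string constants defined earlier in the full pipeline file (elided here).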


 copydata = copydata_op(
   data_dir=data_dir,
   checkpoint_dir=checkpoint_dir,
   model_dir='%s/%s/model_output' % (working_dir, dsl.RUN_ID_PLACEHOLDER),
   action=COPY_ACTION,
   )


 train = train_op(
   data_dir=data_dir,
   model_dir=copydata.outputs['copy_output_path'],
   action=TRAIN_ACTION, train_steps=train_steps,
   deploy_webapp=deploy_webapp
   )

 serve = dsl.ContainerOp(
     name='serve',
     image='gcr.io/google-samples/ml-pipeline-kubeflow-tfserve:v6',
     arguments=["--model_name", 'ghsumm-%s' % (dsl.RUN_ID_PLACEHOLDER,),
         "--model_path", train.outputs['train_output_path']
         ]
     )

 train.set_gpu_limit(1)

Note that we're requiring the 'train' step to run on a node in the cluster that has at least 1 GPU available.

  train.set_gpu_limit(1)

The final step in the pipeline — also defined inline — is conditional: it runs after the 'serve' step finishes, only if the training step's launch_server output is the string 'true'. It launches the 'prediction web app' that we used to request issue summaries from the trained T2T model.

 with dsl.Condition(train.outputs['launch_server'] == 'true'):
   webapp = dsl.ContainerOp(
       name='webapp',
       image='gcr.io/google-samples/ml-pipeline-webapp-launcher:v1',
       arguments=["--model_name", 'ghsumm-%s' % (dsl.RUN_ID_PLACEHOLDER,),
           "--github_token", github_token]

       )
   webapp.after(serve)

The component container image definitions

The Kubeflow Pipelines documentation describes some best practices for building your own components. As part of this process, you will need to define and build a container image. You can see the component steps for this codelab's pipeline here. The Dockerfile definitions are in the containers subdirectories, e.g. here.

Use preemptible VMs with GPUs for training

Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. The pricing of preemptible VMs is lower than that of standard Compute Engine VMs.

With Google Kubernetes Engine (GKE), it is easy to set up a cluster or node pool that uses preemptible VMs. You can set up such a node pool with GPUs attached to the preemptible instances. These work the same as regular GPU-enabled nodes, but the GPUs persist only for the life of the instance.

You can set up a preemptible, GPU-enabled node pool for your cluster by running a command similar to the following, substituting your cluster name and zone, and adjusting the accelerator type and count according to your requirements. You can optionally define the node pool to autoscale based on current workloads.

gcloud container node-pools create preemptible-gpu-pool \
    --cluster=<your-cluster-name> \
    --zone <your-cluster-zone> \
    --enable-autoscaling --max-nodes=4 --min-nodes=0 \
    --machine-type n1-highmem-8 \
    --preemptible \
    --node-taints=preemptible=true:NoSchedule \
    --scopes cloud-platform --verbosity error \
    --accelerator=type=nvidia-tesla-k80,count=4

You can also set up a node pool via the Cloud Console.

Defining a Kubeflow Pipeline that uses the preemptible GKE nodes

If you're running Kubeflow on GKE, it is now easy to define and run Kubeflow Pipelines in which one or more pipeline steps (components) run on preemptible nodes, reducing the cost of running a job. For preemptible VMs to give correct results, the steps that you identify as preemptible should either be idempotent (that is, running the step multiple times has the same result) or should checkpoint work so that the step can pick up where it left off if interrupted.

When you're defining a Kubeflow Pipeline, you can indicate that a given step should run on a preemptible node by modifying the op like this:

your_pipelines_op.apply(gcp.use_preemptible_nodepool())

See the documentation for details.

You'll presumably also want to retry the step some number of times if the node is preempted. You can do this as follows — here, we're specifying 5 retries.

your_pipelines_op.set_gpu_limit(1).apply(gcp.use_preemptible_nodepool()).set_retry(5)

Try editing the Kubeflow pipeline we used in this codelab to run the training step on a preemptible VM.

Change the following line in the pipeline specification to additionally use a preemptible node pool (make sure you have created one as indicated above), and to retry 5 times:

  train.set_gpu_limit(1)
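
Combining the snippets above, the edited line would look like this:

  train.set_gpu_limit(1).apply(gcp.use_preemptible_nodepool()).set_retry(5)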

Then, recompile the pipeline, upload the new version (give it a new name), and then run the new version of the pipeline.