Prototype to Production: Training custom models with Vertex AI

1. Overview

In this lab, you'll use Vertex AI to run a custom training job.

This lab is part of the Prototype to Production video series. You'll build an image classification model using the Flowers dataset. You can watch the accompanying video to learn more.

What you learn

You'll learn how to:

  • Create a Vertex AI Workbench managed notebook
  • Configure and launch a custom training job from the Vertex AI UI
  • Configure and launch a custom training job with the Vertex AI Python SDK

The total cost to run this lab on Google Cloud is about $1.

2. Intro to Vertex AI

This lab uses the newest AI product offering available on Google Cloud. Vertex AI integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible via separate services. The new offering combines both into a single API, along with other new products. You can also migrate existing projects to Vertex AI.

Vertex AI includes many different products to support end-to-end ML workflows. This lab will focus on the products highlighted below: Training and Workbench.

Vertex product overview

3. Set up your environment

You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the instructions here.

Step 1: Enable the Compute Engine API

Navigate to Compute Engine and select Enable if it isn't already enabled.

Step 2: Enable the Artifact Registry API

Navigate to Artifact Registry and select Enable if it isn't already enabled. You'll use this to store the container for your custom training job.

Step 3: Enable the Vertex AI API

Navigate to the Vertex AI section of your Cloud Console and click Enable Vertex AI API.

Vertex AI dashboard

Step 4: Create a Vertex AI Workbench instance

From the Vertex AI section of your Cloud Console, click on Workbench:

Vertex AI menu

Enable the Notebooks API if it isn't already.

Once enabled, click MANAGED NOTEBOOKS.

Then select NEW NOTEBOOK.

Give your notebook a name, and under Permission, select Service account.

Select Advanced Settings.

Under Security select "Enable terminal" if it is not already enabled.

You can leave all of the other advanced settings as is.

Next, click Create. The instance will take a couple of minutes to be provisioned.

Once the instance has been created, select OPEN JUPYTERLAB.

4. Containerize training application code

You'll submit this training job to Vertex AI by putting your training application code in a Docker container and pushing this container to Google Artifact Registry. Using this approach, you can train a model built with any framework.

To start, from the Launcher menu, open a Terminal window in your notebook instance:

Open terminal in notebook

Step 1: Create a Cloud Storage bucket

In this training job, you'll export the trained TensorFlow model to a Cloud Storage bucket. You'll also store the data for training in a Cloud Storage bucket.

From your Terminal, run the following to define an env variable for your project, making sure to replace your-cloud-project with the ID of your project:

PROJECT_ID='your-cloud-project'

Next, run the following in your Terminal to create a new bucket in your project.

BUCKET="gs://${PROJECT_ID}-bucket"
gsutil mb -l us-central1 $BUCKET
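
If you prefer to work from Python rather than the terminal, a roughly equivalent sketch using the Cloud Storage client library looks like the following. This assumes the google-cloud-storage package is available in your notebook environment, and it uses the same placeholder project and bucket names, which you should replace with your own.

from google.cloud import storage

# Create the bucket from Python instead of gsutil.
# Replace the placeholders with your own project ID and bucket name.
client = storage.Client(project='your-cloud-project')
client.create_bucket('your-cloud-project-bucket', location='us-central1')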

Step 2: Copy data to Cloud Storage bucket

We need to get our flowers dataset into Cloud Storage. For demonstration purposes, you'll first download the dataset to this Workbench instance, and then copy it to a bucket.

Download and untar the data.

wget https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
tar xvzf flower_photos.tgz

Then copy it to the bucket you just created. We add the -r because we want to copy the entire directory, and -m to perform a multi-processing copy, which will speed things up.

gsutil -m cp -r flower_photos $BUCKET
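
If you'd like to double-check the copy from Python, an optional sketch using the same Cloud Storage client library (again assuming google-cloud-storage is installed) lists a few of the uploaded objects:

from google.cloud import storage

# List a handful of the copied image files to confirm the upload.
client = storage.Client(project='your-cloud-project')
for blob in client.list_blobs('your-cloud-project-bucket',
                              prefix='flower_photos/', max_results=5):
    print(blob.name)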

Step 3: Write training code

Create a new directory called flowers and cd into it:

mkdir flowers
cd flowers

Run the following to create a directory for the training code and a Python file where you'll add the code.

mkdir trainer
touch trainer/task.py

You should now have the following in your flowers/ directory:

+ trainer/
    + task.py

For more details on how to structure your training application code, check out the docs.

Next, open the task.py file you just created and copy the code below.

You'll need to replace {your-gcs-bucket} with the name of the Cloud Storage bucket you just created.

Through the Cloud Storage FUSE tool, training jobs on Vertex AI Training can access data on Cloud Storage as files in the local file system. When you start a custom training job, the job sees a directory /gcs which contains all your Cloud Storage buckets as subdirectories. That's why the data paths in the training code start with /gcs.

import tensorflow as tf
import numpy as np
import os

## Replace {your-gcs-bucket} !!
BUCKET_ROOT='/gcs/{your-gcs-bucket}'

# Define variables
NUM_CLASSES = 5
EPOCHS=10
BATCH_SIZE = 32

IMG_HEIGHT = 180
IMG_WIDTH = 180

DATA_DIR = f'{BUCKET_ROOT}/flower_photos'

def create_datasets(data_dir, batch_size):
  '''Creates train and validation datasets.'''
  
  train_dataset = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=batch_size)

  validation_dataset = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=batch_size)

  train_dataset = train_dataset.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
  validation_dataset = validation_dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

  return train_dataset, validation_dataset


def create_model():
  '''Creates model.'''

  model = tf.keras.Sequential([
    tf.keras.layers.Resizing(IMG_HEIGHT, IMG_WIDTH),
    tf.keras.layers.Rescaling(1./255, input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
  ])
  return model

# CREATE DATASETS
train_dataset, validation_dataset = create_datasets(DATA_DIR, BATCH_SIZE)

# CREATE/COMPILE MODEL
model = create_model()
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])

# TRAIN MODEL
history = model.fit(
  train_dataset,
  validation_data=validation_dataset,
  epochs=EPOCHS
)

# SAVE MODEL
model.save(f'{BUCKET_ROOT}/model_output')
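
Note that the /gcs paths above exist only inside the Vertex AI training job, where Cloud Storage FUSE mounts your buckets; the same locations are reachable from anywhere else through their gs:// URIs. As an optional sanity check after the training job finishes (not part of task.py), you could load the exported model back in your notebook, assuming your kernel's TensorFlow can read gs:// paths (the Workbench TensorFlow kernels can):

import tensorflow as tf

# The training job wrote to /gcs/{your-gcs-bucket}/model_output, which is the
# same location as the gs:// URI below. Replace the bucket name placeholder.
model = tf.keras.models.load_model('gs://{your-gcs-bucket}/model_output')
model.summary()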

Step 4: Create a Dockerfile

To containerize your code, you'll need to create a Dockerfile. In the Dockerfile you'll include the commands needed to build the image: it will pull in a base image with the necessary libraries and set up the entry point for the training code.

From your Terminal, create an empty Dockerfile in the root of your flowers directory:

touch Dockerfile

You should now have the following in your flowers/ directory:

+ Dockerfile
+ trainer/
    + task.py

Open the Dockerfile and copy the following into it:

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-8

WORKDIR /

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

Let's review the commands in this file.

The FROM command specifies the base image, which is the parent image that the image you create will be built on. As the base image, you'll use the TensorFlow Enterprise 2.8 GPU Deep Learning Container image. The Deep Learning Containers on Google Cloud come with many common ML and data science frameworks pre-installed.

The WORKDIR command specifies the directory on the image where subsequent instructions are run.

The COPY command copies the trainer code into the Docker image. Note that in this example we only have one Python file in our trainer directory, but in a more realistic example you would likely have additional files: perhaps one called data.py that handles data preprocessing, one called model.py that contains just the model code, and so on. For more complex training code, check out the Python docs on packaging Python projects.

If you wanted to add any additional libraries, you could use the RUN command to pip install (ex: RUN pip install -r requirements.txt). But we don't need anything additional for our example here.

Lastly, the ENTRYPOINT command sets up the entry point to invoke the trainer. This is what will run when we start our training job. In our case that is executing the task.py file.

You can learn more about writing Dockerfiles for Vertex AI Training here.

Step 5: Build the container

From the terminal of your Workbench notebook, run the following to define an env variable for your project, making sure to replace your-cloud-project with the ID of your project:

PROJECT_ID='your-cloud-project'

Create a repo in Artifact Registry:

REPO_NAME='flower-app'

gcloud artifacts repositories create $REPO_NAME --repository-format=docker \
--location=us-central1 --description="Docker repository"

Define a variable with the URI of your container image in Google Artifact Registry:

IMAGE_URI=us-central1-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/flower_image:latest

Configure Docker:

gcloud auth configure-docker \
    us-central1-docker.pkg.dev

Then, build the container by running the following from the root of your flowers directory:

docker build ./ -t $IMAGE_URI

Lastly, push it to Artifact Registry:

docker push $IMAGE_URI

With the container pushed to Artifact Registry, you're now ready to kick off the training job.

5. Run a custom training job on Vertex AI

This lab uses custom training via a custom container on Google Artifact Registry, but you can also run a training job using Vertex AI's pre-built containers.

To start, navigate to the Training section in the Vertex AI section of your Cloud console:

train menu

Step 1: Configure training job

Click Create to enter the parameters for your training job.

  • Under Dataset, select No managed dataset
  • Then select Custom training (advanced) as your training method and click Continue.
  • Select Train new model, then enter flowers-model (or whatever you'd like to call your model) for the Model name
  • Click Continue

In the Container settings step, select Custom container:

Custom container option

In the first box (Container image), enter the value of your IMAGE_URI variable from the previous section. It should be: us-central1-docker.pkg.dev/{PROJECT_ID}/flower-app/flower_image:latest, with your own project ID. Leave the rest of the fields blank and click Continue.

Skip the Hyperparameters step by clicking Continue again.

Step 2: Configure compute cluster

Configure Worker pool 0 as follows:

worker_pool_0

You'll skip step 6 for now and configure the prediction container in the next lab in this series.

Click START TRAINING to kick off the training job. In the Training section of your console under the TRAINING PIPELINES tab you'll see your newly launched job:

Training jobs

🎉 Congratulations! 🎉

You've learned how to use Vertex AI to:

  • Launch a custom training job for training code provided in a custom container. You used a TensorFlow model in this example, but you can train a model built with any framework using custom or pre-built containers.

To learn more about different parts of Vertex AI, check out the documentation.

6. [Optional] Use the Vertex AI Python SDK

The previous section showed how to launch the training job via the UI. In this section, you'll see an alternative way to submit the training job by using the Vertex AI Python SDK.

Return to your notebook instance and create a TensorFlow 2 notebook from the Launcher.

Import the Vertex AI SDK.

from google.cloud import aiplatform
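
Optionally, you can set project, region, and staging bucket defaults up front with aiplatform.init. This is just a sketch with placeholder values; substitute your own project ID and the bucket you created earlier.

aiplatform.init(project='your-cloud-project',
                location='us-central1',
                staging_bucket='gs://your-cloud-project-bucket')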

Then create a CustomContainerTrainingJob. You'll need to replace {PROJECT_ID} in the container_uri with your project ID, and replace {BUCKET} in staging_bucket with the bucket you created earlier.

my_job = aiplatform.CustomContainerTrainingJob(display_name='flower-sdk-job',
                                               container_uri='us-central1-docker.pkg.dev/{PROJECT_ID}/flower-app/flower_image:latest',
                                               staging_bucket='gs://{BUCKET}')

Then, run the job.

my_job.run(replica_count=1,
           machine_type='n1-standard-8',
           accelerator_type='NVIDIA_TESLA_V100',
           accelerator_count=1)

For demonstration purposes, this job is configured to run on a larger machine than the one used in the previous section, and with a GPU. If you don't specify machine_type, accelerator_type, or accelerator_count, the job will run by default on an n1-standard-4.
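
For comparison, a minimal call that relies on those defaults (a single replica on an n1-standard-4 with no accelerator) would look like this:

my_job.run(replica_count=1)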

In the Training section of your console under the CUSTOM JOBS tab you'll see your training job.

7. Cleanup

Because Vertex AI Workbench managed notebooks have an idle shutdown feature, we don't need to worry about shutting the instance down. If you would like to manually shut down the instance, click the Stop button on the Vertex AI Workbench section of the console. If you'd like to delete the notebook entirely, click the Delete button.

Stop instance

To delete the storage bucket, use the Navigation menu in your Cloud Console to browse to Storage, select your bucket, and click Delete:

Delete storage