Vertex AI: Use autopackaging to fine tune Bert with Hugging Face on Vertex AI Training

1. Overview

In this lab, you'll learn how to run a custom training job on Vertex AI Training with the autopackaging feature. Custom training jobs on Vertex AI use containers. If you do not want to build your own image, you can use autopackaging, which will build a custom Docker image based on your code, push the image to Container Registry, and start a CustomJob based on the image.

What you learn

You'll learn how to:

  • Containerize and run training code locally
  • Submit training jobs to Vertex AI Training with autopackaging

The total cost to run this lab on Google Cloud is about $2.

2. Use Case Overview

Using libraries from Hugging Face, you'll fine-tune a BERT model on the IMDB dataset. The model will predict whether a movie review is positive or negative. The dataset will be downloaded from the Hugging Face datasets library, and the BERT model from the Hugging Face transformers library.
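If you'd like to peek at the data before writing the training code, here's a minimal sketch you could run in a notebook cell once your environment is set up. The exact printed output depends on your datasets version, but the label convention (0 = negative, 1 = positive) is part of the IMDB dataset:

from datasets import load_dataset

# Download the IMDB dataset; it ships with 'train', 'test', and 'unsupervised' splits.
raw_datasets = load_dataset('imdb')
print(raw_datasets)

# Each example is a review 'text' with a 'label': 0 = negative, 1 = positive.
example = raw_datasets['train'][0]
print(example['label'], example['text'][:200])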

3. Intro to Vertex AI

This lab uses the newest AI product offering available on Google Cloud. Vertex AI integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible via separate services. The new offering combines both into a single API, along with other new products. You can also migrate existing projects to Vertex AI. If you have any feedback, please see the support page.

Vertex AI includes many different products to support end-to-end ML workflows. This lab will focus on Training and Workbench.

Vertex product overview

4. Set up your environment

You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the instructions here.
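If you prefer working from the command line, you can also create a project and set it as your active configuration with gcloud. The project ID below is a placeholder; you'll still need to attach a billing account, which is easiest to do in the Cloud Console.

gcloud projects create your-cloud-project
gcloud config set project your-cloud-project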

Step 1: Enable the Compute Engine API

Navigate to Compute Engine and select Enable if it isn't already enabled.

Step 2: Enable the Vertex AI API

Navigate to the Vertex AI section of your Cloud Console and click Enable Vertex AI API.

Vertex AI dashboard

Step 3: Enable the Container Registry API

Navigate to the Container Registry and select Enable if it isn't already. You'll use this to create a container for your custom training job.
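If you prefer the gcloud CLI, you can enable the APIs from Steps 1 through 3 from a terminal instead. The service names below are the standard ones for Compute Engine, Vertex AI, and Container Registry:

gcloud services enable compute.googleapis.com
gcloud services enable aiplatform.googleapis.com
gcloud services enable containerregistry.googleapis.com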

Step 4: Create a Vertex AI Workbench instance

From the Vertex AI section of your Cloud Console, click on Workbench:

Vertex AI menu

From there, click MANAGED NOTEBOOKS:

Notebooks_UI

Then select NEW NOTEBOOK.

new_notebook

Give your notebook a name, and then click Advanced Settings.

create_notebook

Under Advanced Settings, enable idle shutdown and set the number of minutes to 60. This means your notebook will shut down automatically when it's not in use, so you don't incur unnecessary costs.

idle_timeout

You can leave all of the other advanced settings as is.

Next, click Create.

Once the instance has been created, select Open JupyterLab.

open_jupyterlab

The first time you use a new instance, you'll be asked to authenticate.

authenticate

5. Write training code

To start, from the Launcher menu, open a Terminal window in your notebook instance:

launcher_terminal

Create a new directory called autopkg-codelab and cd into it.

mkdir autopkg-codelab
cd autopkg-codelab

From your Terminal, run the following to create a directory for the training code and a Python file where you'll add the code:

mkdir trainer
touch trainer/task.py

You should now have the following in your autopkg-codelab/ directory:

+ trainer/
    + task.py

Next, open the task.py file you just created and copy the code below.

import argparse

import tensorflow as tf
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification

CHECKPOINT = "bert-base-cased"

def get_args():
  '''Parses args.'''

  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--epochs',
      required=False,
      default=3,
      type=int,
      help='number of epochs')
  parser.add_argument(
      '--job_dir',
      required=True,
      type=str,
      help='bucket to store saved model, include gs://')
  args = parser.parse_args()
  return args


def create_datasets():
    '''Creates a tf.data.Dataset for train and evaluation.'''

    raw_datasets = load_dataset('imdb')
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    tokenized_datasets = raw_datasets.map((lambda examples: tokenize_function(examples, tokenizer)), batched=True)

    # To speed up training, we use only a portion of the data.
    # Use full_train_dataset and full_eval_dataset if you want to train on all the data.
    small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))
    full_train_dataset = tokenized_datasets['train']
    full_eval_dataset = tokenized_datasets['test']

    tf_train_dataset = small_train_dataset.remove_columns(['text']).with_format("tensorflow")
    tf_eval_dataset = small_eval_dataset.remove_columns(['text']).with_format("tensorflow")

    train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
    train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
    train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

    eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
    eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
    eval_tf_dataset = eval_tf_dataset.batch(8)

    return train_tf_dataset, eval_tf_dataset


def tokenize_function(examples, tokenizer):
    '''Tokenizes text examples.'''

    return tokenizer(examples['text'], padding='max_length', truncation=True)


def main():
    args = get_args()
    train_tf_dataset, eval_tf_dataset = create_datasets()
    model = TFAutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

    model.compile(
        # A small learning rate is standard practice when fine-tuning BERT.
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )

    model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=args.epochs)
    model.save(f'{args.job_dir}/model_output')


if __name__ == "__main__":
    main()

A few things to note about the code:

  • CHECKPOINT is the model we want to fine-tune. In this case, we use BERT.
  • The TFAutoModelForSequenceClassification method will load the specified language model architecture + weights in TensorFlow and add a classification head on top with randomly initialized weights. In this case, we have a binary classification problem (positive or negative), so we specify num_labels=2 for this classifier (see the short inference sketch below).
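To make the classifier's output concrete, here is a short, self-contained sketch of how a model created this way turns a review into a positive/negative prediction. It loads a fresh model, so the classification head is still randomly initialized; with the fine-tuned weights produced by task.py the prediction becomes meaningful. The review text is made up for illustration:

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)

# Tokenize a single review and run it through the model.
inputs = tokenizer('A genuinely moving film with terrific performances.',
                   padding='max_length', truncation=True, return_tensors='tf')

logits = model(inputs).logits                 # shape (1, 2): one score per class
probs = tf.nn.softmax(logits, axis=-1)
print(int(tf.argmax(probs, axis=-1)[0]))      # 0 = negative, 1 = positive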

6. Containerize and run training code locally

You can use the gcloud ai custom-jobs local-run command to build a Docker container image based on your training code and run the image as a container on your local machine. Running a container locally executes your training code in a similar way to how it runs on Vertex AI Training, and can help you debug problems with your code before you perform custom training on Vertex AI.

In our training job, we'll export our trained model to a Cloud Storage Bucket. From your Terminal, run the following to define an env variable for your project, making sure to replace your-cloud-project with the ID of your project:

PROJECT_ID='your-cloud-project'

Then, create a bucket. If you have an existing bucket, feel free to use that instead.

BUCKET_NAME="gs://${PROJECT_ID}-bucket"
gsutil mb -l us-central1 $BUCKET_NAME

When we run the custom training job on Vertex AI Training, we'll make use of a GPU. But since we did not provision our Workbench instance with GPUs, we'll use a CPU-based image for local testing. In this example, we use a Vertex AI Training pre-built container.

Run the following to set the URI of a Docker image to use as the base of the container.

BASE_CPU_IMAGE=us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest

Then set a name for the resulting Docker image built by the local run command.

OUTPUT_IMAGE=$PROJECT_ID-local-package-cpu:latest

Our training code uses the Hugging Face datasets and transformers libraries. These libraries are not included in the image we have selected as our base image, so we will need to provide them as requirements. To do this, we will create a requirements.txt file in our autopkg-codelab directory.

Ensure you are in the autopkg-codelab directory and type the following in your terminal.

touch requirements.txt

You should now have the following in your autopkg-codelab directory:

+ requirements.txt
+ trainer/
    + task.py

Open up the requirements file and paste in the following

datasets==1.18.2
transformers==4.16.2

Finally, execute the gcloud ai custom-jobs local-run command to kick off training on our Workbench managed notebooks instance.

gcloud ai custom-jobs local-run \
--executor-image-uri=$BASE_CPU_IMAGE \
--python-module=trainer.task \
--output-image-uri=$OUTPUT_IMAGE \
-- \
--job_dir=$BUCKET_NAME

You should see the Docker image being built. The dependencies we added to the requirements.txt file will be pip installed. This may take a few minutes to complete the first time you execute this command. Once the image is built, the task.py file will start running and you'll see the model training. You should see something like this:

local_training

Because we are not using a GPU locally, model training will take a long time. You can press Ctrl+C to cancel local training instead of waiting for the job to complete.

Note that if you want to do further testing, you can also run the image built above directly, without repackaging:

gcloud beta ai custom-jobs local-run \
--executor-image-uri=$OUTPUT_IMAGE \
-- \
--job_dir=$BUCKET_NAME \
--epochs=1

7. Create a custom job

Now that we have tested out local mode, we'll use the autopackaging feature to launch our custom training job on Vertex AI Training. With a single command, this feature will:

  • Build a custom Docker image based on your code.
  • Push the image to Container Registry.
  • Start a CustomJob based on the image.

Return to the terminal and cd up one level, so that you're in the directory above autopkg-codelab. Your directory structure should look like this:

+ autopkg-codelab
  + requirements.txt
  + trainer/
      + task.py

Specify the Vertex AI Training pre-built TensorFlow GPU image as the base image for the custom training job.

BASE_GPU_IMAGE=us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-7:latest

Next, execute the gcloud ai custom-jobs create command. First, this command will build a custom Docker image based on the training code. The base image is the Vertex AI Training pre-built container we set as BASE_GPU_IMAGE. The autopackaging feature will then pip install the datasets and transformers libraries as specified in our requirements.txt file.

gcloud ai custom-jobs create \
--region=us-central1 \
--display-name=fine_tune_bert \
--args=--job_dir=$BUCKET_NAME \
--worker-pool-spec=machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=autopkg-codelab,python-module=trainer.task

Let's take a look at the worker-pool-spec argument. This defines the worker pool configuration used by the custom job. You can specify multiple worker pool specs in order to create a custom job with multiple worker pools for distributed training. In this example, we only specify a single worker pool, as our training code is not configured for distributed training.

Here are some of the key fields of this spec:

  • machine-type (Required): The type of the machine. Click here for supported types.
  • replica-count: The number of worker replicas to use for this worker pool; the default value is 1.
  • accelerator-type: The type of GPUs. Click here for supported types. In this example, we specified one NVIDIA Tesla V100 GPU.
  • accelerator-count: The number of GPUs for each VM in the worker pool to use; the default value is 1 (see the example after this list).
  • executor-image-uri: The URI of a container image that will run the provided package. This is set to our base image.
  • local-package-path: The local path of a folder that contains training code.
  • python-module: The Python module name to run within the provided package.
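For example, a hypothetical variation that requests two V100 GPUs per replica would add accelerator-count to the spec. This is a sketch only; the training code in this lab doesn't use a distribution strategy, so it wouldn't take advantage of the second GPU:

--worker-pool-spec=machine-type=n1-standard-8,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,accelerator-count=2,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=autopkg-codelab,python-module=trainer.task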

Similar to when you ran the local command, you'll see the Docker image being built and then the training job kick off. However, instead of seeing the training output, you'll see the following message confirming that your training job has launched. Note that the first time you run the custom-jobs create command, it may take a few minutes for the image to be built and pushed.

training_started

Return to the Vertex AI Training section of your Cloud Console, and under CUSTOM JOBS you should see your job running.

training_job
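You can also monitor the job from your terminal with gcloud. Replace JOB_ID with the ID of your job, which appears in the output of the create command and in the list output below:

gcloud ai custom-jobs list --region=us-central1
gcloud ai custom-jobs describe JOB_ID --region=us-central1
gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1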

The job will take around 20 minutes to complete.

Once complete, you should see the following saved model artifacts in the model_output directory in your bucket.

model_output
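You can also verify the artifacts from your terminal by listing the output directory:

gsutil ls $BUCKET_NAME/model_output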

🎉 Congratulations! 🎉

You've learned how to use Vertex AI to:

  • Containerize and run training code locally
  • Submit training jobs to Vertex AI Training with autopackaging

To learn more about different parts of Vertex AI, check out the documentation.

8. Cleanup

Because we configured the notebook to time out after 60 idle minutes, we don't need to worry about shutting the instance down. If you would like to manually shut down the instance, click the Stop button on the Vertex AI Workbench section of the console. If you'd like to delete the notebook entirely, click the Delete button.

delete

To delete the Storage Bucket, use the Navigation menu in your Cloud Console to browse to Storage, select your bucket, and click Delete:

Delete storage
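Alternatively, you can delete the bucket and everything in it from your terminal. Note that this permanently removes the saved model artifacts:

gsutil rm -r $BUCKET_NAME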