1. Overview
In this lab, you'll learn how to run a custom training job on Vertex AI Training with the autopackaging feature. Custom training jobs on Vertex AI use containers. If you do not want to build your own image, you can use autopackaging, which will build a custom Docker image based on your code, push the image to Container Registry, and start a CustomJob based on the image.
What you learn
You'll learn how to:
- Use local mode to test your code.
- Configure and launch a custom training job with autopackaging.
The total cost to run this lab on Google Cloud is about $2.
2. Use Case Overview
Using libraries from Hugging Face, you'll fine-tune a BERT model on the IMDB dataset. The model will predict whether a movie review is positive or negative. The dataset will be downloaded from the Hugging Face datasets library, and the BERT model from the Hugging Face transformers library.
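To get a feel for the data before diving in, here is a minimal sketch (separate from the lab's training code) that uses the same datasets library to download IMDB and inspect one example. The label field is 0 for a negative review and 1 for a positive one:
from datasets import load_dataset

# Download the IMDB movie review dataset from the Hugging Face hub.
raw_datasets = load_dataset('imdb')

# Each example has a 'text' field (the review) and a 'label' field
# (0 = negative, 1 = positive).
example = raw_datasets['train'][0]
print(example['text'][:200])
print('label:', example['label'])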
3. Intro to Vertex AI
This lab uses the newest AI product offering available on Google Cloud. Vertex AI integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible via separate services. The new offering combines both into a single API, along with other new products. You can also migrate existing projects to Vertex AI. If you have any feedback, please see the support page.
Vertex AI includes many different products to support end-to-end ML workflows. This lab will focus on Training and Workbench.
4. Set up your environment
You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the instructions here.
Step 1: Enable the Compute Engine API
Navigate to Compute Engine and select Enable if it isn't already enabled.
Step 2: Enable the Vertex AI API
Navigate to the Vertex AI section of your Cloud Console and click Enable Vertex AI API.
Step 3: Enable the Container Registry API
Navigate to Container Registry and select Enable if it isn't already enabled. You'll use this to create a container for your custom training job.
Step 4: Create a Vertex AI Workbench instance
From the Vertex AI section of your Cloud Console, click on Workbench:
From there, click MANAGED NOTEBOOKS:
Then select NEW NOTEBOOK.
Give your notebook a name, and then click Advanced Settings.
Under Advanced Settings, enable idle shutdown and set the number of minutes to 60. This means your notebook will shut down automatically when not in use, so you don't incur unnecessary costs.
You can leave all of the other advanced settings as is.
Next, click Create.
Once the instance has been created, select Open JupyterLab.
The first time you use a new instance, you'll be asked to authenticate.
5. Write training code
To start, from the Launcher menu, open a Terminal window in your notebook instance:
Create a new directory called autopkg-codelab and cd into it.
mkdir autopkg-codelab
cd autopkg-codelab
From your Terminal, run the following to create a directory for the training code and a Python file where you'll add the code:
mkdir trainer
touch trainer/task.py
You should now have the following in your autopkg-codelab/ directory:
+ trainer/
    + task.py
Next, open the task.py file you just created and copy the code below.
import argparse

import tensorflow as tf
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification

CHECKPOINT = "bert-base-cased"


def get_args():
    '''Parses args.'''

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--epochs',
        required=False,
        default=3,
        type=int,
        help='number of epochs')
    parser.add_argument(
        '--job_dir',
        required=True,
        type=str,
        help='bucket to store saved model, include gs://')
    args = parser.parse_args()
    return args


def create_datasets():
    '''Creates a tf.data.Dataset for train and evaluation.'''

    raw_datasets = load_dataset('imdb')
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    tokenized_datasets = raw_datasets.map(
        (lambda examples: tokenize_function(examples, tokenizer)), batched=True)

    # To speed up training, we use only a portion of the data.
    # Use full_train_dataset and full_eval_dataset if you want to train on all the data.
    small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
    small_eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))
    full_train_dataset = tokenized_datasets['train']
    full_eval_dataset = tokenized_datasets['test']

    tf_train_dataset = small_train_dataset.remove_columns(['text']).with_format("tensorflow")
    tf_eval_dataset = small_eval_dataset.remove_columns(['text']).with_format("tensorflow")

    train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
    train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
    train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

    eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
    eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
    eval_tf_dataset = eval_tf_dataset.batch(8)

    return train_tf_dataset, eval_tf_dataset


def tokenize_function(examples, tokenizer):
    '''Tokenizes text examples.'''

    return tokenizer(examples['text'], padding='max_length', truncation=True)


def main():
    args = get_args()
    train_tf_dataset, eval_tf_dataset = create_datasets()
    model = TFAutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
    model.compile(
        # A small learning rate is standard when fine-tuning a pretrained model.
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )
    model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=args.epochs)
    model.save(f'{args.job_dir}/model_output')


if __name__ == "__main__":
    main()
A few things to note about the code:
- CHECKPOINT is the model we want to fine-tune. In this case, we use BERT.
- The TFAutoModelForSequenceClassification method will load the specified language model architecture and weights in TensorFlow, and add a classification head on top with randomly initialized weights. In this case, we have a binary classification problem (positive or negative), so we specify num_labels=2 for this classifier. The sketch after this list shows the head in action.
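To make that concrete, here is a small sketch (not part of the lab's training code) that loads the model and tokenizer, runs a single review through the not-yet-trained classification head, and prints the shape of the resulting logits:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

CHECKPOINT = 'bert-base-cased'

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = TFAutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# Tokenize one review into the tensors the model expects.
inputs = tokenizer('This movie was fantastic!', return_tensors='tf')

# logits has shape (batch_size, num_labels) == (1, 2). Because the
# classification head is randomly initialized, these scores are
# meaningless until the model is fine-tuned.
outputs = model(inputs)
print(outputs.logits.shape)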
6. Containerize and run training code locally
You can use the gcloud ai custom-jobs local-run command to build a Docker container image based on your training code and run the image as a container on your local machine. Running a container locally executes your training code in a similar way to how it runs on Vertex AI Training, and can help you debug problems with your code before you perform custom training on Vertex AI.
In our training job, we'll export our trained model to a Cloud Storage Bucket. From your Terminal, run the following to define an env variable for your project, making sure to replace your-cloud-project with the ID of your project:
PROJECT_ID='your-cloud-project'
Then, create a bucket. If you have an existing bucket, feel free to use that instead.
BUCKET_NAME="gs://${PROJECT_ID}-bucket"
gsutil mb -l us-central1 $BUCKET_NAME
When we run the custom training job on Vertex AI Training, we'll make use of a GPU. But since our Workbench instance was not provisioned with GPUs, we'll use a CPU-based image for local testing. In this example, we use a Vertex AI Training pre-built container.
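If you want to confirm that your notebook instance is CPU only, one quick check from a Python cell is the following (a general TensorFlow utility, not specific to this lab):
import tensorflow as tf

# Prints an empty list on a CPU-only instance like the one in this lab.
print(tf.config.list_physical_devices('GPU'))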
Run the following to set the URI of a Docker image to use as the base of the container.
BASE_CPU_IMAGE=us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-7:latest
Then set a name for the resulting Docker image built by the local run command.
OUTPUT_IMAGE=$PROJECT_ID-local-package-cpu:latest
Our training code uses the Hugging Face datasets and transformers libraries. These libraries are not included in the image we have selected as our base image, so we will need to provide them as requirements. To do this, we will create a requirements.txt file in our autopkg-codelab directory.

Ensure you are in the autopkg-codelab directory and type the following in your terminal.
touch requirements.txt
You should now have the following in your autopkg-codelab directory:
+ requirements.txt
+ trainer/
    + task.py
Open up the requirements file and paste in the following:
datasets==1.18.2
transformers==4.16.2
Finally, execute the gcloud ai custom-jobs local-run command to kick off training on our Workbench instance.
gcloud ai custom-jobs local-run \
--executor-image-uri=$BASE_CPU_IMAGE \
--python-module=trainer.task \
--output-image-uri=$OUTPUT_IMAGE \
-- \
--job_dir=$BUCKET_NAME
You should see the Docker image being built. The dependencies we added to the requirements.txt file will be pip installed. This may take a few minutes to complete the first time you execute this command. Once the image is built, the task.py file will start running and you'll see the model training. You should see something like this:
Because we are not using a GPU locally, model training will take a long time. Instead of waiting for the job to complete, you can press Ctrl+C to cancel local training.
Note that if you want to do further testing, you can also run the image built above directly, without repackaging:
gcloud beta ai custom-jobs local-run \
--executor-image-uri=$OUTPUT_IMAGE \
-- \
--job_dir=$BUCKET_NAME \
--epochs=1
7. Create a custom job
Now that we have tested out local mode, we'll use the autopackaging feature to launch our custom training job on Vertex AI Training. With a single command, this feature will:
- Build a custom Docker image based on your code.
- Push the image to Container Registry.
- Start a CustomJob based on the image.
Return to the terminal and cd up one level above your autopkg-codelab directory.
+ autopkg-codelab
    + requirements.txt
    + trainer/
        + task.py
Specify the Vertex AI Training pre-built TensorFlow GPU image as the base image for the custom training job.
BASE_GPU_IMAGE=us-docker.pkg.dev/vertex-ai/training/tf-gpu.2-7:latest
Next, execute the gcloud ai custom-jobs create command. First, this command will build a custom Docker image based on the training code. The base image is the Vertex AI Training pre-built container we set as BASE_GPU_IMAGE. The autopackaging feature will then pip install the datasets and transformers libraries as specified in our requirements.txt file.
gcloud ai custom-jobs create \
--region=us-central1 \
--display-name=fine_tune_bert \
--args=--job_dir=$BUCKET_NAME \
--worker-pool-spec=machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=autopkg-codelab,python-module=trainer.task
Let's take a look at the worker-pool-spec argument. This defines the worker pool configuration used by the custom job. You can specify multiple worker pool specs in order to create a custom job with multiple worker pools for distributed training. In this example, we only specify a single worker pool, as our training code is not configured for distributed training. (For a rough idea of what adapting the code for distribution could look like, see the sketch after the field list below.)
Here are some of the key fields of this spec:
- machine-type (Required): The type of the machine. Click here for supported types.
- replica-count: The number of worker replicas to use for this worker pool; by default the value is 1.
- accelerator-type: The type of GPUs. Click here for supported types. In this example, we specified one NVIDIA Tesla V100 GPU.
- accelerator-count: The number of GPUs for each VM in the worker pool to use; by default the value is 1.
- executor-image-uri: The URI of a container image that will run the provided package. This is set to our base image.
- local-package-path: The local path of a folder that contains training code.
- python-module: The Python module name to run within the provided package.
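As promised above, here is a rough, hypothetical sketch of how task.py could be adapted for distributed training across multiple worker pools. It assumes TensorFlow's MultiWorkerMirroredStrategy; Vertex AI populates the TF_CONFIG environment variable that the strategy reads to discover its peers:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

CHECKPOINT = 'bert-base-cased'

# Vertex AI sets TF_CONFIG on each replica so the strategy can discover
# the other workers in the job.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Create and compile the model inside the strategy scope so its
    # variables are mirrored across workers.
    model = TFAutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )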
Similar to when you ran the local-run command, you'll see the Docker image being built and then the training job kick off. But instead of seeing the output of the training job, you'll see the following message confirming that your training job has launched. Note that the first time you run the custom-jobs create command, it may take a few minutes for the image to be built and pushed.
Return to the Vertex AI Training section of the cloud console and under CUSTOM JOBS you should see your job running.
The job will take around 20 minutes to complete.
Once complete, you should see the following saved model artifacts in the model_output directory in your bucket.
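If you'd like to sanity check the exported model, here is a hedged sketch: it assumes your environment can read gs:// paths and that the SavedModel reloads cleanly with Keras (substitute your own bucket name for the hypothetical path below):
import tensorflow as tf
from transformers import AutoTokenizer

# Hypothetical path: replace with your bucket. This is where task.py
# saved the fine-tuned model.
MODEL_DIR = 'gs://your-cloud-project-bucket/model_output'

model = tf.keras.models.load_model(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

inputs = tokenizer('I loved this movie!', return_tensors='tf')
outputs = model(dict(inputs))

# Depending on how the SavedModel is restored, the output may be a dict
# or an object with a .logits attribute.
logits = outputs['logits'] if isinstance(outputs, dict) else outputs.logits
print('positive' if int(tf.argmax(logits, axis=1)[0]) == 1 else 'negative')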
🎉 Congratulations! 🎉
You've learned how to use Vertex AI to:
- Containerize and run training code locally
- Submit training jobs to Vertex AI Training with autopackaging
To learn more about different parts of Vertex AI, check out the documentation.
8. Cleanup
Because we configured the notebook to time out after 60 idle minutes, we don't need to worry about shutting the instance down. If you would like to manually shut down the instance, click the Stop button on the Vertex AI Workbench section of the console. If you'd like to delete the notebook entirely, click the Delete button.
To delete the Storage Bucket, use the Navigation menu in your Cloud Console to browse to Storage, select your bucket, and click Delete.