AI Platform (Unified): Training and serving a custom model

In this lab, you will use AI Platform (Unified) to train and serve a TensorFlow model using code in a custom container. AI Platform (Unified) is the newest AI offering on Google Cloud and is currently in preview.

While we're using TensorFlow for the model code here, you could easily replace it with another framework.

What you learn

You'll learn how to:

  • Build and containerize model training code in AI Platform Notebooks
  • Submit a custom model training job to AI Platform
  • Deploy your trained model to an endpoint, and use that endpoint to get predictions

The total cost to run this lab on Google Cloud is about $1.

This lab uses the newest AI product offering available on Google Cloud. AI Platform (Unified) is currently in preview, and integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible via separate services. The new offering combines both into a unified API, along with other new products. During the preview, we encourage you to use AI Platform (Unified) for any new projects. You can also migrate existing projects to the unified platform. If you have any feedback, please see the support page.

You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the instructions here.

Step 1: Enable the Compute Engine API

Navigate to Compute Engine and select Enable if it isn't already enabled. You'll need this to create your notebook instance.

Step 2: Enable the AI Platform API

Navigate to the AI Platform (Unified) section of your Cloud Console and click Enable AI Platform API.

AI Platform Unified dashboard

Step 3: Create an AI Platform Notebooks instance

From the AI Platform (Unified) section of your Cloud Console, click on Notebooks:

AI Platform Unified menu

From there, select New Instance. Then select the TensorFlow Enterprise 2.3 instance type without GPUs:

TFE instance

Use the default options and then click Create. Once the instance has been created, select Open JupyterLab:

Open CAIP Notebook

The model we'll be training and serving in this lab is built upon this tutorial from the TensorFlow docs. The tutorial uses the Auto MPG dataset from the UCI Machine Learning Repository to predict the fuel efficiency of a vehicle.

We'll submit this training job to AI Platform by putting our training code in a Docker container and pushing this container to Google Container Registry. Using this approach, we can train a model built with any framework.

To start, from the Launcher menu, open a Terminal window in your notebook instance:

Open terminal in notebook

Create a new directory called mpg and cd into it:

mkdir mpg
cd mpg

Step 1: Create a Dockerfile

Our first step in containerizing our code is to create a Dockerfile. In our Dockerfile we'll include all the commands needed to run our image. It'll install all the libraries we're using and set up the entry point for our training code. From your Terminal, create an empty Dockerfile:

touch Dockerfile

Open the Dockerfile and copy the following into it:

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-1
WORKDIR /

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.train"]

This Dockerfile uses the Deep Learning Container TensorFlow Enterprise 2.1 Docker image. The Deep Learning Containers on Google Cloud come with many common ML and data science frameworks pre-installed. The one we're using includes TF Enterprise 2.1, Pandas, Scikit-learn, and others. After downloading that image, this Dockerfile sets up the entrypoint for our training code. We haven't created these files yet – in the next step, we'll add the code for training and exporting our model.

Step 2: Create a Cloud Storage bucket

In our training job, we'll export our trained TensorFlow model to a Cloud Storage Bucket. AI Platform will use this to read our exported model assets and deploy the model. From your Terminal, run the following to create a Cloud Storage Bucket, making sure to replace the PROJECT_NAME variable with the name of the Cloud project you're using for this codelab. The -l (location) flag is important since this needs to be in the same region where you deploy a model endpoint later in the tutorial.

PROJECT_NAME='your-cloud-project'
BUCKET_NAME="gs://${PROJECT_NAME}-bucket"
gsutil mb -l us-central1 $BUCKET_NAME

Ensure your bucket was created correctly by navigating to the Storage section of your console. Keep track of your bucket name since you'll be using it in the next step.
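
If you'd like to check the bucket from a notebook cell instead, here's a minimal sketch using the google-cloud-storage client (typically pre-installed on AI Platform Notebooks); your-cloud-project-bucket below is a placeholder for the bucket name you just created:

from google.cloud import storage

# Placeholder bucket name (without the gs:// prefix); replace with your own.
BUCKET_NAME = 'your-cloud-project-bucket'

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

# The location should print as US-CENTRAL1 to match the region used for the
# model endpoint later in this lab.
print(bucket.name, bucket.location)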

Step 3: Add model training code

From your Terminal, run the following to create a directory for our training code and a Python file where we'll add the code:

mkdir trainer
touch trainer/train.py

You should now have the following in your mpg/ directory:

+ Dockerfile
+ trainer/
    + train.py

Next, open the train.py file you just created and copy the code below (this is adapted from the tutorial in the TensorFlow docs).

Once you've copied the code in, update the BUCKET variable with the name of the Storage bucket you created in the previous step (look for the TODO comment):

import numpy as np
import pandas as pd
import pathlib
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

"""## The Auto MPG dataset

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/).

### Get the data
First download the dataset.
"""

dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

"""Import it using pandas"""

column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset.tail()

# TODO: replace `your-gcs-bucket` with the name of the Storage bucket you created earlier
BUCKET = 'gs://your-gcs-bucket'

"""### Clean the data

The dataset contains a few unknown values.
"""

dataset.isna().sum()

"""To keep this initial tutorial simple drop those rows."""

dataset = dataset.dropna()

"""The `"Origin"` column is really categorical, not numeric. So convert that to a one-hot:"""

dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
dataset.tail()

"""### Split the data into train and test

Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our model.
"""

train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)

"""### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the training set.

Also look at the overall statistics:
"""

train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
train_stats

"""### Split features from labels

Separate the target value, or "label", from the features. This label is the value that you will train the model to predict.
"""

train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

"""### Normalize the data

Look again at the `train_stats` block above and note how different the ranges of each feature are.

It is good practice to normalize features that use different scales and ranges. Although the model *might* converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

Note: Although we intentionally generate these statistics from only the training dataset, these statistics will also be used to normalize the test dataset. We need to do that to project the test dataset into the same distribution that the model has been trained on.
"""

def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

"""This normalized data is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model, along with the one-hot encoding that we did earlier.  That includes the test set as well as live data when the model is used in production.

## The model

### Build the model

Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on.
"""

def build_model():
  model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

model = build_model()

"""### Inspect the model

Use the `.summary` method to print a simple description of the model
"""

model.summary()

"""Now try out the model. Take a batch of `10` examples from the training data and call `model.predict` on it.

It seems to be working, and it produces a result of the expected shape and type.

### Train the model

Train the model for 1000 epochs, and record the training and validation accuracy in the `history` object.

Visualize the model's training progress using the stats stored in the `history` object.

This graph shows little improvement, or even degradation in the validation error after about 100 epochs. Let's update the `model.fit` call to automatically stop training when the validation score doesn't improve. We'll use an *EarlyStopping callback* that tests a training condition for  every epoch. If a set amount of epochs elapses without showing improvement, then automatically stop the training.

You can learn more about this callback [here](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping).
"""

model = build_model()

EPOCHS = 1000

# The patience parameter is the number of epochs to check for improvement
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

early_history = model.fit(normed_train_data, train_labels, 
                    epochs=EPOCHS, validation_split = 0.2, 
                    callbacks=[early_stop])


# Export model and save to GCS
model.save(BUCKET + '/mpg/model')
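
The final line above writes a TensorFlow SavedModel directly to your Cloud Storage bucket. Once a training run has completed, you can optionally load the exported model back in a notebook to sanity-check it; a minimal sketch, where gs://your-gcs-bucket is the same placeholder used in the training code:

import numpy as np
import tensorflow as tf

# Load the exported SavedModel straight from GCS (placeholder bucket name).
loaded = tf.keras.models.load_model('gs://your-gcs-bucket/mpg/model')
loaded.summary()

# The model expects 9 normalized feature values per example, so a zero batch
# of shape (1, 9) is enough to confirm the serving signature works.
print(loaded.predict(np.zeros((1, 9))))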

Step 4: Build and test the container locally

From your Terminal, define a variable with the URI of your container image in Google Container Registry. In the URI, replace your-cloud-project with the name of your Cloud project.

export IMAGE_URI=gcr.io/your-cloud-project/mpg:v1

Then, build the container by running the following from the root of your mpg directory:

docker build ./ -t $IMAGE_URI

Run the container within your notebook instance to ensure it's working correctly:

docker run $IMAGE_URI

The model should finish training in 1-2 minutes; as it runs you'll see the training logs, including the validation loss and error metrics (exact values may vary). When you've finished running the container locally, push it to Google Container Registry:

docker push $IMAGE_URI

With our container pushed to Container Registry, we're now ready to kick off a custom model training job.
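
In the next section we'll configure and launch the training job from the Cloud console, which also handles uploading the resulting model with a pre-built serving container. For reference, a bare custom training job using this container could also be submitted from your notebook with the AI Platform (Unified) Python client; here's a minimal sketch, assuming the google-cloud-aiplatform SDK (installed later in this lab) and placeholder project and image names. Note that this submits only the training job, without the model upload step the console flow adds:

from google.cloud import aiplatform

# Placeholders for illustration; substitute your own values.
PROJECT_ID = 'your-cloud-project'
REGION = 'us-central1'
IMAGE_URI = 'gcr.io/your-cloud-project/mpg:v1'

client = aiplatform.gapic.JobServiceClient(
    client_options={'api_endpoint': f'{REGION}-aiplatform.googleapis.com'}
)

custom_job = {
    'display_name': 'mpg-custom-job',
    'job_spec': {
        'worker_pool_specs': [
            {
                'machine_spec': {'machine_type': 'n1-standard-4'},
                'replica_count': 1,
                'container_spec': {'image_uri': IMAGE_URI},
            }
        ]
    },
}

job = client.create_custom_job(
    parent=f'projects/{PROJECT_ID}/locations/{REGION}', custom_job=custom_job
)
print(job.display_name, job.state)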

AI Platform (Unified) gives you two options for training models:

  • AutoML: Train high-quality models with minimal effort and ML expertise.
  • Custom training: Run your custom training applications in the cloud using one of Google Cloud's pre-built containers or use your own.

In this lab, we're using custom training via our own custom container on Google Container Registry. To start, navigate to the Training section in the AI Platform (Unified) section of your Cloud console:

Training menu

Step 1: Kick off the training job

Click Create to enter the parameters for your training job. In the first step (Define your model):

Enter mpg (or whatever you'd like to call your model) for Model name

Under Dataset, select No managed dataset

Click Continue

In Step 2, select Custom training (advanced) as your training method and click Continue.

In the Container settings step, select Custom container:

Custom container option

In the first box (Container image on GCR), enter the value of your IMAGE_URI variable above. It should be: gcr.io/your-cloud-project/mpg:v1, with your own project name. Leave the rest of the fields blank and click Continue.

We won't use hyperparameter tuning in this tutorial, so leave the Enable hyperparameter tuning box unchecked and click Continue.

In Compute and pricing, leave the selected region as-is and select the following machine type:

Machine type

Because the model in this demo trains quickly, we're using a smaller machine type.

Under Inference settings, select Pre-built container:

Pre-built for inference

Leave the default settings for the pre-built container as is. Under model directory, enter your GCS bucket with the mpg subdirectory. This is the path in your model training script where you export your trained model. It should look like:

Model output path

AI Platform will look in this location when it deploys your model. Now you're ready for training! Click Start training to kick off the training job. In the Training section of your console, you should see two jobs created:

Training jobs

To view the logs for your job, click on the one titled mpg-custom-job.

When we set up our training job, we specified where AI Platform (Unified) should look for our exported model assets. As part of our training pipeline, AI Platform (Unified) will create a model resource based on this asset path. The model resource itself isn't a deployed model, but once you have a model you're ready to deploy it to an endpoint. To learn more about Models and Endpoints in AI Platform (Unified), check out the documentation.
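
If you'd like to inspect the model resource from your notebook rather than the console, here's a minimal sketch using the same SDK installed later in this lab (your-cloud-project is a placeholder):

from google.cloud import aiplatform

client = aiplatform.gapic.ModelServiceClient(
    client_options={'api_endpoint': 'us-central1-aiplatform.googleapis.com'}
)

# Placeholder project ID; each model has a display name plus a full resource
# name whose final segment is the model ID.
for model in client.list_models(parent='projects/your-cloud-project/locations/us-central1'):
    print(model.display_name, model.name)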

In this step we'll create an endpoint for our trained model. We can use this to get predictions on our model via the AI Platform (Unified) API.

Step 1: Deploy endpoint

When your training job completes, click on the job named mpg:

Completed jobs

When your training job ran, AI Platform created a model resource for you. To use this model, you need to deploy it to an endpoint (a model can be deployed to more than one endpoint). Next, click Deploy to endpoint.

Select Create new endpoint and give it a name, like v1. Leave Traffic split at 100 and enter 1 for Minimum number of compute nodes. Under Machine type, select n1-standard-2 (or any machine type you'd like). Then click Continue and then Deploy.
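
The console flow above is all you need for this lab. For reference, the same endpoint creation and deployment can also be scripted with the SDK; here's a minimal sketch, assuming the google-cloud-aiplatform client used later in this lab and a placeholder model resource name (YOUR_MODEL_ID is the ID of the mpg model created by your training pipeline):

from google.cloud import aiplatform

# Placeholders for illustration; substitute your own values.
PROJECT_ID = 'your-cloud-project'
REGION = 'us-central1'
MODEL_NAME = f'projects/{PROJECT_ID}/locations/{REGION}/models/YOUR_MODEL_ID'

client = aiplatform.gapic.EndpointServiceClient(
    client_options={'api_endpoint': f'{REGION}-aiplatform.googleapis.com'}
)

# Create the endpoint (a long-running operation).
endpoint = client.create_endpoint(
    parent=f'projects/{PROJECT_ID}/locations/{REGION}',
    endpoint={'display_name': 'v1'},
).result()

# Deploy the model to the endpoint with one n1-standard-2 node and 100% of traffic.
client.deploy_model(
    endpoint=endpoint.name,
    deployed_model={
        'model': MODEL_NAME,
        'display_name': 'mpg',
        'dedicated_resources': {
            'machine_spec': {'machine_type': 'n1-standard-2'},
            'min_replica_count': 1,
        },
    },
    traffic_split={'0': 100},
).result()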

Deploying the endpoint will take a few minutes, and you'll get an email when the deploy completes. When the endpoint has finished deploying, you'll see the following, which shows one endpoint deployed under your Model resource:

Deploy to endpoint

Step 2: Get predictions on the deployed model

We'll get predictions on our trained model from a Python notebook, using the AI Platform Python API. Go back to your notebook instance, and create a Python 3 notebook from the Launcher:

Open notebook

In your notebook, run the following in a cell to install the AI Platform (Unified) SDK:

!pip3 install https://storage.googleapis.com/google-cloud-aiplatform/libraries/python/0.2.0/google-cloud-aiplatform-0.2.0.tar.gz

In your notebook, add and run a cell with the following function. This function will make a prediction request to our deployed endpoint:

from typing import Dict

from google.cloud import aiplatform
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

def predict_sample(instance_dict: Dict, project: str, endpoint_id: str):
    client_options = {
        "api_endpoint": "us-central1-prediction-aiplatform.googleapis.com"
    }
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)
    location = "us-central1"
    name = "projects/{project}/locations/{location}/endpoints/{endpoint}".format(
        project=project, location=location, endpoint=endpoint_id
    )
    parameters_dict = {}
    parameters = json_format.ParseDict(parameters_dict, Value())
    instance = json_format.ParseDict(instance_dict, Value())
    instances = [instance]
    response = client.predict(endpoint=name, instances=instances, parameters=parameters)
    print(response)

Below that, add and run a new cell with the following example from our test set:

test_mpg = [1.4838871833555929,
 1.8659883497083019,
 2.234620276849616,
 1.0187816540094903,
 -2.530890710602246,
 -1.6046416850441676,
 -0.4651483719733302,
 -0.4952254087173721,
 0.7746763768735953]

This example already has normalized values, which is the format our model is expecting. Next, click on Sample Request for the endpoint you just deployed:

Sample request

Click on the Python tab, and copy the first string in the request. This contains your project and endpoint IDs:

Endpoint string
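
If you prefer to look up the endpoint ID from your notebook instead of copying it from the console, here's a minimal sketch using the same SDK (your-project-id is a placeholder; the endpoint ID is the final segment of each resource name):

from google.cloud import aiplatform

client = aiplatform.gapic.EndpointServiceClient(
    client_options={'api_endpoint': 'us-central1-aiplatform.googleapis.com'}
)

# Placeholder project ID; each endpoint name ends in .../endpoints/ENDPOINT_ID.
for endpoint in client.list_endpoints(parent='projects/your-project-id/locations/us-central1'):
    print(endpoint.display_name, endpoint.name)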

Finally, make a prediction request to your endpoint. In the code below, replace your-endpoint-str with the endpoint ID you copied above and your-project-id with the ID of your cloud project:

predict_sample(
  test_mpg,
  "your-project-id",
  "your-endpoint-str"
)

You should see a prediction value close to 15.8. The actual value for this example is 15 miles per gallon.

🎉 Congratulations! 🎉

You've learned how to use AI Platform (Unified) to:

  • Train a model by providing the training code in a custom container. You used a TensorFlow model in this example, but you can train a model built with any framework using custom containers.
  • Deploy a TensorFlow model using a pre-built container as part of the same workflow you used for training.
  • Create a model endpoint and generate a prediction.

To learn more about different parts of AI Platform (Unified), check out the documentation.

If you'd like to continue using the notebook you created in this lab, it is recommended that you turn it off when not in use. From the Notebooks UI in your Cloud Console, select the notebook and then select Stop:

Stop instance

If you'd like to delete the notebook entirely, simply click the Delete button in the top right.

To delete the endpoint you deployed, navigate to the Endpoints section of your AI Platform (Unified) console and click the delete icon:

Delete endpoint

To delete the Storage Bucket, using the Navigation menu in your Cloud Console, browse to Storage, select your bucket, and click Delete:

Delete storage