Vertex AI: Training and deploying a custom model

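1. Overview

In this lab, you will use Vertex AI to train and deploy a TensorFlow model using code in a custom container.

Although we use TensorFlow for the model code here, you could easily replace it with another framework.

What you learn

You'll learn how to:

Build and containerize model training code in Vertex Workbench
Submit a custom model training job to Vertex AI
Deploy your trained model to an endpoint, and use that endpoint to get predictions

The total cost to run this lab on Google Cloud is about $1.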

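2. Intro to Vertex AI

This lab uses the newest AI product offering available on Google Cloud. Vertex AI integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible via separate services. The new offering combines both into a single API, along with other new products. You can also migrate existing projects to Vertex AI. If you have any feedback, please see the support page.

Vertex AI includes many different products to support end-to-end ML workflows. This lab focuses on the products highlighted below: Training, Prediction, and Workbench.

[Image: Vertex product overview]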

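3. Set up your environment

You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the instructions here.

Step 1: Enable the Compute Engine API

Navigate to Compute Engine and select Enable if it isn't already enabled. You'll need this to create your notebook instance.

Step 2: Enable the Vertex AI API

Navigate to the Vertex AI section of your Cloud Console and click Enable Vertex AI API.

[Image: Vertex AI dashboard]

Step 3: Enable the Container Registry API

Navigate to Container Registry and select Enable if it isn't already enabled. You'll use this to create a container for your custom training job.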

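Step 4: Create a Vertex AI Workbench instance

From the Vertex AI section of your Cloud Console, click on Workbench:

[Image: Vertex AI menu]

From there, within User-Managed Notebooks, click New Notebook:

[Image: Create new notebook]

Then select the latest version of the TensorFlow Enterprise (with LTS) instance type without GPUs:

[Image: TFE instance]

Use the default options and then click Create.

The model we'll be training and serving in this lab is built upon this tutorial from the TensorFlow docs. The tutorial uses the Auto MPG dataset from Kaggle to predict the fuel efficiency of a vehicle.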

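4. Containerize training code

We'll submit this training job to Vertex by putting our training code in a Docker container and pushing this container to Google Container Registry. Using this approach, we can train a model built with any framework.

To start, from the Launcher menu, open a Terminal window in your notebook instance:

[Image: Open terminal in notebook]

Create a new directory called mpg and cd into it: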

mkdir mpg
cd mpg

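Step 1: Create a Dockerfile

Our first step in containerizing our code is to create a Dockerfile. In the Dockerfile we'll include all the commands needed to run the image. It will install all the libraries we're using and set up the entry point for our training code. From your terminal, create an empty Dockerfile: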

touch Dockerfile

Open the Dockerfile and copy the following into it:

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-6
WORKDIR /

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.train"]

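This Dockerfile uses the Deep Learning Container TensorFlow Enterprise 2.6 Docker image (the tf2-cpu.2-6 image named in the FROM line). The Deep Learning Containers on Google Cloud come with many common ML and data science frameworks pre-installed. The one we're using includes TF Enterprise 2.6, Pandas, Scikit-learn, and others. After pulling that image, this Dockerfile copies in our training code and sets up its entrypoint. We haven't created these files yet; in a later step, we'll add the code for training and exporting our model.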
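
The ENTRYPOINT above runs `python -m trainer.train` from the image's working directory (`/`), so the module name has to line up with the file that `COPY trainer /trainer` places in the image. The following is a minimal sketch of that mapping. The helper `entrypoint_module_path` is hypothetical (it is not part of Docker or Vertex AI), and it assumes an exec-form ENTRYPOINT and a plain module rather than a package with `__main__.py`.

```python
import json

DOCKERFILE = """\
FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-6
WORKDIR /
COPY trainer /trainer
ENTRYPOINT ["python", "-m", "trainer.train"]
"""


def entrypoint_module_path(dockerfile_text):
    """Return the source file implied by an exec-form `python -m <module>` ENTRYPOINT."""
    for line in dockerfile_text.splitlines():
        if line.startswith("ENTRYPOINT"):
            # An exec-form ENTRYPOINT is written as a JSON array of arguments.
            args = json.loads(line[len("ENTRYPOINT"):].strip())
            module = args[args.index("-m") + 1]
            # Dots in a module name correspond to directory separators.
            return module.replace(".", "/") + ".py"
    raise ValueError("no ENTRYPOINT instruction found")


module_path = entrypoint_module_path(DOCKERFILE)
print(module_path)
```

The printed path is relative to the working directory, which is why the training script created later has to live at `trainer/train.py` inside the `mpg` directory.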

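Step 2: Create a Cloud Storage bucket

In our training job, we'll export our trained TensorFlow model to a Cloud Storage bucket. Vertex will use this bucket to read the exported model assets and deploy the model. From your terminal, run the following to define an env variable for your project, making sure to replace your-cloud-project with the ID of your project: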

PROJECT_ID='your-cloud-project'

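Next, run the following in your terminal to create a new bucket in your project. The -l (location) flag is important because the bucket needs to be in the same region where you deploy the model endpoint later in the tutorial: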

BUCKET_NAME="gs://${PROJECT_ID}-bucket"
gsutil mb -l us-central1 $BUCKET_NAME
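
For readers who prefer to script this step, here is a minimal Python sketch that derives the same bucket URI and assembles the same `gsutil mb` command. The helper names (`bucket_uri`, `make_bucket_command`) are hypothetical, the region is assumed to be us-central1 as in the command above, and nothing is executed; the command is only built and printed.

```python
import shlex

REGION = "us-central1"  # assumed: the region used for both the bucket and the endpoint


def bucket_uri(project_id):
    """Mirror the shell line BUCKET_NAME="gs://${PROJECT_ID}-bucket"."""
    return f"gs://{project_id}-bucket"


def make_bucket_command(project_id, region=REGION):
    """Build the `gsutil mb` invocation as an argument list (nothing is executed)."""
    return ["gsutil", "mb", "-l", region, bucket_uri(project_id)]


command = make_bucket_command("your-cloud-project")
print(shlex.join(command))
```

Keeping the region in one constant makes it harder to create the bucket in a different region from the one used for the endpoint later on.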

Step 3: Add model training code

From your terminal, run the following to create a directory for the training code and a Python file where we'll add the code:

mkdir trainer
touch trainer/train.py

You should now have the following in your mpg/ directory:

+ Dockerfile
+ trainer/
    + train.py

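Next, open the train.py file you just created and copy in the code below (it is adapted from the tutorial in the TensorFlow docs).

In the file, update the BUCKET variable (defined just after the dataset is loaded) with the name of the Storage bucket you created in the previous step: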

import numpy as np
import pandas as pd
import pathlib
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

"""## The Auto MPG dataset

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/).

### Get the data
First download the dataset.
"""

dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

"""Import it using pandas"""

column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset.tail()

# TODO: replace `your-gcs-bucket` with the name of the Storage bucket you created earlier
BUCKET = 'gs://your-gcs-bucket'

"""### Clean the data

The dataset contains a few unknown values.
"""

dataset.isna().sum()

"""To keep this initial tutorial simple drop those rows."""

dataset = dataset.dropna()

"""The `"Origin"` column is really categorical, not numeric. So convert that to a one-hot:"""

dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
dataset.tail()

"""### Split the data into train and test

Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our model.
"""

train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)

"""### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the training set.

Also look at the overall statistics:
"""

train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
train_stats

"""### Split features from labels

Separate the target value, or "label", from the features. This label is the value that you will train the model to predict.
"""

train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

"""### Normalize the data

Look again at the `train_stats` block above and note how different the ranges of each feature are.

It is good practice to normalize features that use different scales and ranges. Although the model *might* converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

Note: Although we intentionally generate these statistics from only the training dataset, these statistics will also be used to normalize the test dataset. We need to do that to project the test dataset into the same distribution that the model has been trained on.
"""

def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

"""This normalized data is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model, along with the one-hot encoding that we did earlier.  That includes the test set as well as live data when the model is used in production.

## The model

### Build the model

Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on.
"""

def build_model():
  model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

model = build_model()

"""### Inspect the model

Use the `.summary` method to print a simple description of the model
"""

model.summary()

"""Now try out the model. Take a batch of `10` examples from the training data and call `model.predict` on it.

It seems to be working, and it produces a result of the expected shape and type.

### Train the model

Train the model for 1000 epochs, and record the training and validation accuracy in the `history` object.

Visualize the model's training progress using the stats stored in the `history` object.

This graph shows little improvement, or even degradation in the validation error after about 100 epochs. Let's update the `model.fit` call to automatically stop training when the validation score doesn't improve. We'll use an *EarlyStopping callback* that tests a training condition for  every epoch. If a set amount of epochs elapses without showing improvement, then automatically stop the training.

You can learn more about this callback [here](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping).
"""

model = build_model()

EPOCHS = 1000

# The patience parameter is the amount of epochs to check for improvement
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

early_history = model.fit(normed_train_data, train_labels, 
                    epochs=EPOCHS, validation_split = 0.2, 
                    callbacks=[early_stop])


# Export model and save to GCS
model.save(BUCKET + '/mpg/model')
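
The `patience=10` setting in the training script controls how long training continues without an improvement in `val_loss`. The sketch below illustrates the idea in plain Python. It is a simplified model of the behaviour, not the Keras implementation (which also supports options such as `min_delta` and `restore_best_weights`), and the function name `early_stop_epoch` and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # A new best validation loss resets the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stalls after the third epoch.
losses = [5.0, 4.0, 3.0, 3.1, 3.2, 3.3]
stopped_at = early_stop_epoch(losses, patience=2)
print(stopped_at)
```

With a patience of 2, the run stops at the second consecutive epoch that fails to beat the best loss, so the remaining epochs of the 1000 requested are never executed.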

Step 4: Build and test the container locally

From your terminal, define a variable with the URI of your container image in Google Container Registry:

IMAGE_URI="gcr.io/$PROJECT_ID/mpg:v1"

Then, build the container by running the following from the root of your mpg directory:

docker build ./ -t $IMAGE_URI

Run the container within your notebook instance to ensure it's working correctly:

docker run $IMAGE_URI

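The model should finish training in 1-2 minutes. Because this is a regression model compiled with the mae and mse metrics, the log reports validation loss and error values rather than an accuracy percentage, and the exact numbers will vary between runs. Once you've finished running the container locally, push it to Google Container Registry: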

docker push $IMAGE_URI

Now that the container is pushed to Container Registry, we're ready to kick off a custom model training job.

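5. Run a training job on Vertex AI

Vertex AI gives you two options for training models:

AutoML: Train high-quality models with minimal effort and ML expertise.
Custom training: Run your custom training applications in the cloud using one of Google Cloud's pre-built containers or your own.

In this lab, we're using custom training via our own custom container on Google Container Registry. To start, navigate to the Models section in the Vertex section of your Cloud console:

[Image: Vertex menu]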

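Step 1: Kick off the training job

Click Create to enter the parameters for your training job and deployed model:

Under Dataset, select No managed dataset
Then select Custom training (advanced) as your training method and click Continue.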

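In the next step, enter mpg (or whatever you'd like to call your model) for Model name. Then select Custom container:

[Image: Custom container option]

In the Container image text box, click Browse and find the Docker image you just uploaded to Container Registry. Leave the rest of the fields blank and click Continue.

We won't use hyperparameter tuning in this tutorial, so leave the Enable hyperparameter tuning box unchecked and click Continue.

In Compute and pricing, leave the selected region as-is and choose n1-standard-4 as your machine type:

[Image: Machine type]

Leave the accelerator fields blank and select Continue. Because the model in this demo trains quickly, a smaller machine type is sufficient.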

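Under the Prediction container step, select Pre-built container and then select TensorFlow 2.6.

Leave the default settings for the pre-built container as is. Under Model directory, enter your GCS bucket with the mpg subdirectory. This matches the export path in your model training script, which saves the trained model to the model/ folder beneath it:

[Image: Prediction settings]

Vertex will look in this location when it deploys your model. Now you're ready for training! Click Start training to kick off the training job. In the Training section of your console, you'll see something like this:

[Image: Training jobs]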
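
The console steps above can also be expressed with the Vertex AI SDK for Python. The following is a rough sketch under stated assumptions, not the method this lab uses: the project ID and bucket name are the placeholders from earlier steps, the serving container URI is assumed to be the pre-built TensorFlow 2.6 CPU prediction image, and `submit_training_job` is a hypothetical wrapper. The SDK import sits inside the function, so the settings can be inspected without the SDK or credentials.

```python
TRAINING_SETTINGS = {
    "display_name": "mpg",
    # Placeholder image URI, matching IMAGE_URI from the container step.
    "container_uri": "gcr.io/your-cloud-project/mpg:v1",
    # Assumed URI of the pre-built TensorFlow 2.6 CPU prediction container.
    "serving_container": "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-6:latest",
    # Placeholder bucket; corresponds to the "Model directory" field.
    "base_output_dir": "gs://your-cloud-project-bucket/mpg",
    "machine_type": "n1-standard-4",
}


def submit_training_job(settings, project, location="us-central1"):
    """Submit the custom container job; needs google-cloud-aiplatform and credentials."""
    from google.cloud import aiplatform  # lazy import: only needed when submitting

    aiplatform.init(project=project, location=location)
    job = aiplatform.CustomContainerTrainingJob(
        display_name=settings["display_name"],
        container_uri=settings["container_uri"],
        model_serving_container_image_uri=settings["serving_container"],
    )
    # Vertex AI looks for the exported model under <base_output_dir>/model.
    return job.run(
        model_display_name=settings["display_name"],
        base_output_dir=settings["base_output_dir"],
        replica_count=1,
        machine_type=settings["machine_type"],
    )


# The path the training script must save to for the model upload to find it.
expected_export_path = TRAINING_SETTINGS["base_output_dir"] + "/model"
print(expected_export_path)
```

The printed path has the same shape as the `BUCKET + '/mpg/model'` argument passed to `model.save` in train.py, which is what lets Vertex locate the exported model.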

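6. Deploy a model endpoint

When we set up our training job, we specified where Vertex AI should look for our exported model assets. As part of our training pipeline, Vertex will create a model resource based on this asset path. The model resource itself isn't a deployed model, but once you have a model you're ready to deploy it to an endpoint. To learn more about Models and Endpoints in Vertex AI, check out the documentation.

In this step we'll create an endpoint for our trained model. We can use this to get predictions on our model via the Vertex AI API.

Step 1: Deploy endpoint

When your training job completes, you should see a model named mpg (or whatever you named it) in the Models section of your console:

[Image: Completed jobs]

When your training job ran, Vertex created a model resource for you. In order to use this model, you need to deploy an endpoint. You can have many endpoints per model. Click on the model and then click Deploy to endpoint.

Select Create new endpoint and give it a name, like v1. Leave Standard selected for Access and then click Continue.

Leave Traffic split at 100 and enter 1 for Minimum number of compute nodes. Under Machine type, select n1-standard-2 (or any machine type you'd like). Leave the rest of the defaults selected and then click Continue. We won't enable monitoring for this model, so next click Deploy to kick off the endpoint deployment.

Deploying the endpoint will take 10-15 minutes, and you'll get an email when the deployment completes. When the endpoint has finished deploying, you'll see the following, which shows one endpoint deployed under your Model resource:

[Image: Deploy to endpoint]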
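
The same deployment choices (traffic split 100, one compute node, n1-standard-2) can be sketched with the Vertex AI SDK for Python. This is an illustrative outline rather than the lab's method: `check_deploy_settings` and `deploy_model` are hypothetical helpers, and the SDK is imported lazily so the sanity checks run without credentials.

```python
DEPLOY_SETTINGS = {
    "machine_type": "n1-standard-2",  # machine type chosen in the console
    "min_replica_count": 1,           # "Minimum number of compute nodes"
    "traffic_percentage": 100,        # "Traffic split"
}


def check_deploy_settings(settings):
    """Basic sanity checks before any deployment request is sent."""
    if not 0 < settings["traffic_percentage"] <= 100:
        raise ValueError("traffic_percentage must be in (0, 100]")
    if settings["min_replica_count"] < 1:
        raise ValueError("min_replica_count must be at least 1")
    return True


def deploy_model(model_resource_name, settings=None):
    """Deploy the trained model to a new endpoint; needs the SDK and credentials."""
    from google.cloud import aiplatform  # lazy import: not needed for the checks above

    settings = dict(DEPLOY_SETTINGS if settings is None else settings)
    check_deploy_settings(settings)
    model = aiplatform.Model(model_name=model_resource_name)
    # With no endpoint argument, deploy() creates a new endpoint for the model.
    return model.deploy(**settings)


settings_ok = check_deploy_settings(DEPLOY_SETTINGS)
print(settings_ok)
```

Calling `deploy_model` with the model's resource name would start the same 10-15 minute deployment described above.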

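Step 2: Get predictions on the deployed model

We'll get predictions on our trained model from a Python notebook, using the Vertex Python API. Go back to your notebook instance, and create a Python 3 notebook from the Launcher:

[Image: Open notebook]

In your notebook, run the following in a cell to install the Vertex AI SDK: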

!pip3 install google-cloud-aiplatform --upgrade --user

Then add a cell in your notebook to import the SDK and create a reference to the endpoint you just deployed:

from google.cloud import aiplatform

endpoint = aiplatform.Endpoint(
    endpoint_name="projects/YOUR-PROJECT-NUMBER/locations/us-central1/endpoints/YOUR-ENDPOINT-ID"
)

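You'll need to replace two values in the endpoint_name string above with your project number and endpoint ID. You can find your project number by navigating to your project dashboard and getting the Project Number value.

You can find your endpoint ID in the Endpoints section of the console here:

[Image: Find endpoint ID]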
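
Because the endpoint_name string has a fixed shape, it can be assembled and checked before it is passed to the SDK. The sketch below uses hypothetical helper names and made-up numeric IDs purely for illustration; substitute your own project number and endpoint ID.

```python
import re

ENDPOINT_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/endpoints/(?P<endpoint>[^/]+)$"
)


def endpoint_resource_name(project_number, endpoint_id, location="us-central1"):
    """Assemble the resource name expected by aiplatform.Endpoint(endpoint_name=...)."""
    return f"projects/{project_number}/locations/{location}/endpoints/{endpoint_id}"


def parse_endpoint_resource_name(name):
    """Split a resource name into its parts, rejecting malformed input."""
    match = ENDPOINT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not an endpoint resource name: {name!r}")
    return match.groupdict()


# Made-up IDs for illustration only.
name = endpoint_resource_name("123456789", "987654321")
parts = parse_endpoint_resource_name(name)
print(name)
print(parts)
```

Parsing the string back is a cheap way to catch a leftover YOUR-PROJECT-NUMBER placeholder or a missing path segment before making a prediction request.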

Finally, make a prediction to your endpoint by copying and running the code below in a new cell:

test_mpg = [1.4838871833555929,
 1.8659883497083019,
 2.234620276849616,
 1.0187816540094903,
 -2.530890710602246,
 -1.6046416850441676,
 -0.4651483719733302,
 -0.4952254087173721,
 0.7746763768735953]

response = endpoint.predict([test_mpg])

print('API response: ', response)

print('Predicted MPG: ', response.predictions[0][0])

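This example already has normalized values, which is the format our model is expecting.

Run this cell, and you should see a prediction output of around 16 miles per gallon.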
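
The instance sent above was normalized ahead of time. For raw feature values, the same training-set mean and standard deviation used by `norm()` in train.py would have to be applied before calling the endpoint. Here is a minimal sketch of that step in plain Python, using two made-up feature columns rather than the real Auto MPG statistics; `fit_stats` and `normalize` are hypothetical helpers.

```python
from statistics import mean, stdev


def fit_stats(rows):
    """Per-column mean and sample standard deviation (ddof=1, matching pandas' default)."""
    columns = list(zip(*rows))
    return [mean(col) for col in columns], [stdev(col) for col in columns]


def normalize(row, means, stds):
    """Apply (x - mean) / std column by column, as norm() does in train.py."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


# Made-up training rows with two features (for example cylinders and weight).
train_rows = [[4.0, 2000.0], [6.0, 3000.0], [8.0, 4000.0]]
means, stds = fit_stats(train_rows)

# A new raw instance is scaled with the training statistics, never its own.
normalized = normalize([6.0, 3500.0], means, stds)
print(normalized)
```

In practice the statistics would be computed once from the training set and stored alongside the model, so that every prediction request is scaled the same way.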

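🎉 Congratulations! 🎉

You've learned how to use Vertex AI to:

Train a model by providing the training code in a custom container. You used a TensorFlow model in this example, but you can train a model built with any framework using custom containers.
Deploy a TensorFlow model using a pre-built container as part of the same workflow you used for training.
Create a model endpoint and generate a prediction.

To learn more about the different parts of Vertex, check out the documentation.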

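7. Cleanup

If you'd like to continue using the notebook you created in this lab, it is recommended that you turn it off when not in use. From the Workbench UI in your Cloud Console, select the notebook and then select Stop.

If you'd like to delete the notebook entirely, click the Delete button in the top right.

To delete the endpoint you deployed, navigate to the Endpoints section of your Vertex AI console, click on the endpoint you created, and then select Undeploy model from endpoint:

[Image: Delete endpoint]

To delete the Storage bucket, use the Navigation menu in your Cloud Console to browse to Storage, select your bucket, and click Delete:

[Image: Delete storage]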
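
The endpoint cleanup can also be scripted. The sketch below shows the order of operations (undeploy first, then delete) against a stand-in object so it can be run as a dry run; `RecordingEndpoint` and `teardown_endpoint` are hypothetical names. The method names mirror `undeploy_all()` and `delete()` on the SDK's `aiplatform.Endpoint`, so a real endpoint object could be passed in instead.

```python
class RecordingEndpoint:
    """Stand-in for a dry run; records calls instead of contacting Vertex AI."""

    def __init__(self):
        self.calls = []

    def undeploy_all(self):
        self.calls.append("undeploy_all")

    def delete(self):
        self.calls.append("delete")


def teardown_endpoint(endpoint):
    """Undeploy every model from the endpoint, then delete the endpoint itself."""
    endpoint.undeploy_all()
    endpoint.delete()


dry_run = RecordingEndpoint()
teardown_endpoint(dry_run)
print(dry_run.calls)
```

Models are undeployed before the endpoint is deleted because a deployed model keeps its compute nodes running, and therefore billed, until it is removed from the endpoint.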