Build and deploy a model with Vertex AI

Last updated: January 24, 2022

Written by a Googler

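1. Overview

In this lab, you will learn how to use Vertex AI, Google Cloud's newly announced managed ML platform, to build end-to-end ML workflows. You will learn how to go from raw data to a deployed model, and you will leave this workshop ready to develop and productionize your own ML projects with Vertex AI. In this lab, we use Cloud Shell to build a custom Docker image that demonstrates custom containers for training with Vertex AI.

Although we use TensorFlow for the model code here, you could easily replace it with another framework.

What you learn

You will learn how to:

Build and containerize model training code using Cloud Shell
Submit a custom model training job to Vertex AI
Deploy your trained model to an endpoint, and use that endpoint to get predictions

The total cost to run this lab on Google Cloud is about $2.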

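2. Intro to Vertex AI

This lab uses the newest AI product offering available on Google Cloud. Vertex AI integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible through separate services. The new offering combines both into a single API, along with other new products. You can also migrate existing projects to Vertex AI. If you have any feedback, please see the support page.

Vertex includes many different tools to help you with each stage of the ML workflow, as you can see in the diagram below. We'll focus on Vertex Training and Prediction, highlighted below.

Vertex services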

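3. Set up your environment

Self-paced environment setup

Remember the project ID, a name that is unique across all Google Cloud projects (the name shown above has already been taken and will not work for you, sorry!).

Next, you need to enable billing in the Cloud console in order to use Google Cloud resources.

Running through this codelab shouldn't cost much, if anything at all. Be sure to follow the instructions in the "Cleanup" section, which explains how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300 free trial program.

Step 1: Start Cloud Shell

In this lab you will work in a Cloud Shell session, which is a command interpreter hosted by a virtual machine running in Google's cloud. You could just as easily run this section locally on your own computer, but using Cloud Shell gives everyone a reproducible experience in a consistent environment. After the lab, you're welcome to retry this section on your own computer.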

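Authorize Cloud Shell

Activate Cloud Shell

In the upper-right corner of the Cloud console, click the button below to activate Cloud Shell:

Activate Cloud Shell

If you've never started Cloud Shell before, you're presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again). Here's what that one-time screen looks like:

Cloud Shell setup

It should only take a few moments to provision and connect to Cloud Shell.

Cloud Shell initialization

This virtual machine is loaded with all the development tools you need. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated: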

gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

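Cloud Shell has a few environment variables, including GOOGLE_CLOUD_PROJECT, which contains the name of our current Cloud project. We'll use it in various places throughout this lab. You can see it by running: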

echo $GOOGLE_CLOUD_PROJECT

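Step 2: Enable APIs

In later steps, you'll see where these services are needed (and why), but for now, run this command to give your project access to the Compute Engine, Container Registry, and Vertex AI services: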

gcloud services enable compute.googleapis.com         \
                       containerregistry.googleapis.com  \
                       aiplatform.googleapis.com

This should produce a success message similar to this one:

Operation "operations/acf.cc11852d-40af-47ad-9d59-477a12847c9e" finished successfully.

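Step 3: Create a Cloud Storage bucket

To run a training job on Vertex AI, we'll need a storage bucket to store our saved model assets. Run the following commands in your Cloud Shell terminal to create a bucket: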

BUCKET_NAME=gs://$GOOGLE_CLOUD_PROJECT-bucket
gsutil mb -l us-central1 $BUCKET_NAME
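
The same naming convention can be reproduced in Python, which may be handy if a later script needs the bucket URI. This is a minimal sketch, assuming the GOOGLE_CLOUD_PROJECT variable described above is set; the helper name bucket_uri and the fallback project ID are made up for this illustration.

```python
import os


def bucket_uri(project_id: str) -> str:
    """Mirror the shell line BUCKET_NAME=gs://$GOOGLE_CLOUD_PROJECT-bucket."""
    if not project_id:
        raise ValueError("project_id is empty; is GOOGLE_CLOUD_PROJECT set?")
    return f"gs://{project_id}-bucket"


if __name__ == "__main__":
    # Cloud Shell exports GOOGLE_CLOUD_PROJECT; the fallback is only a placeholder.
    project = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-project-id")
    print(bucket_uri(project))
```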

Step 4: Alias Python 3

The code in this lab uses Python 3. To ensure you use Python 3 when you run the scripts you'll create in this lab, create an alias by running the following in Cloud Shell:

alias python=python3

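The model we'll be training and serving in this lab is built upon a tutorial from the TensorFlow docs. The tutorial uses the Auto MPG dataset from Kaggle to predict the fuel efficiency of a vehicle.

4. Containerize training code

We'll submit this training job to Vertex by putting our training code in a Docker container and pushing this container to Google Container Registry. Using this approach, we can train a model built with any framework.

Step 1: Set up files

To start, run the following commands in the Cloud Shell terminal to create the files we'll need for our Docker container: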

mkdir mpg
cd mpg
touch Dockerfile
mkdir trainer
touch trainer/train.py

You should now have an mpg/ directory that looks like this:

+ Dockerfile
+ trainer/
    + train.py

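To view and edit these files, we'll use Cloud Shell's built-in code editor. You can switch back and forth between the editor and the terminal by clicking the button on the top-right menu bar in Cloud Shell:

Switch to the editor in Cloud Shell

Step 2: Create a Dockerfile

To containerize our code, we'll first create a Dockerfile. The Dockerfile contains all the commands needed to run the image. It installs all the libraries we're using and sets up the entry point for our training code.

From the Cloud Shell file editor, open your mpg/ directory and then double-click to open the Dockerfile:

Open the Dockerfile

Then copy the following into this file: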

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3
WORKDIR /

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.train"]

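This Dockerfile uses the Deep Learning Container TensorFlow Enterprise 2.3 Docker image. The Deep Learning Containers on Google Cloud come with many common ML and data science frameworks pre-installed. The one we're using includes TF Enterprise 2.3, Pandas, Scikit-learn, and others. After downloading that image, this Dockerfile sets up the entry point for our training code, which we'll add in the next step.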
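
To see why the entry point is written as a module path (trainer.train) rather than a file path, the sketch below recreates the container layout in a temporary directory and runs the same command. It is an illustration only: the temporary directory stands in for the image's / working directory, and the one-line train.py is a stand-in for the real training code.

```python
import pathlib
import subprocess
import sys
import tempfile


def run_entrypoint() -> str:
    """Create <workdir>/trainer/train.py and run `python -m trainer.train` there."""
    with tempfile.TemporaryDirectory() as workdir:
        trainer = pathlib.Path(workdir) / "trainer"
        trainer.mkdir()
        # Stand-in for the real training script.
        (trainer / "train.py").write_text('print("training started")\n')
        # With -m, Python puts the current directory on sys.path, so the
        # trainer/ folder inside it resolves as the package "trainer".
        result = subprocess.run(
            [sys.executable, "-m", "trainer.train"],
            cwd=workdir,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(run_entrypoint())
```

This mirrors the combination of WORKDIR / and COPY trainer /trainer in the Dockerfile: the command is launched from the directory that contains the trainer folder.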

Step 3: Add model training code

From the Cloud Shell editor, next open the train.py file and copy the code below (this is adapted from the tutorial in the TensorFlow docs).

# This will be replaced with your bucket name after running the `sed` command in the tutorial
BUCKET = "BUCKET_NAME"

import numpy as np
import pandas as pd
import pathlib
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)

"""## The Auto MPG dataset

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/).

### Get the data
First download the dataset.
"""

"""Import it using pandas"""

dataset_path = "https://storage.googleapis.com/io-vertex-codelab/auto-mpg.csv"
dataset = pd.read_csv(dataset_path, na_values = "?")

dataset.tail()

"""### Clean the data

The dataset contains a few unknown values.
"""

dataset.isna().sum()

"""To keep this initial tutorial simple drop those rows."""

dataset = dataset.dropna()

"""The `"origin"` column is really categorical, not numeric. So convert that to a one-hot:"""

dataset['origin'] = dataset['origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
dataset.tail()

"""### Split the data into train and test

Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our model.
"""

train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)

"""### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the training set.

Also look at the overall statistics:
"""

train_stats = train_dataset.describe()
train_stats.pop("mpg")
train_stats = train_stats.transpose()
train_stats

"""### Split features from labels

Separate the target value, or "label", from the features. This label is the value that you will train the model to predict.
"""

train_labels = train_dataset.pop('mpg')
test_labels = test_dataset.pop('mpg')

"""### Normalize the data

Look again at the `train_stats` block above and note how different the ranges of each feature are.

It is good practice to normalize features that use different scales and ranges. Although the model *might* converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

Note: Although we intentionally generate these statistics from only the training dataset, these statistics will also be used to normalize the test dataset. We need to do that to project the test dataset into the same distribution that the model has been trained on.
"""

def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

"""This normalized data is what we will use to train the model.

Caution: The statistics used to normalize the inputs here (mean and standard deviation) need to be applied to any other data that is fed to the model, along with the one-hot encoding that we did earlier.  That includes the test set as well as live data when the model is used in production.

## The model

### Build the model

Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model, later on.
"""

def build_model():
  model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

model = build_model()

"""### Inspect the model

Use the `.summary` method to print a simple description of the model
"""

model.summary()

"""Now try out the model. Take a batch of `10` examples from the training data and call `model.predict` on it.

It seems to be working, and it produces a result of the expected shape and type.

### Train the model

Train the model for 1000 epochs, and record the training and validation accuracy in the `history` object.

Visualize the model's training progress using the stats stored in the `history` object.

This graph shows little improvement, or even degradation in the validation error after about 100 epochs. Let's update the `model.fit` call to automatically stop training when the validation score doesn't improve. We'll use an *EarlyStopping callback* that tests a training condition for  every epoch. If a set amount of epochs elapses without showing improvement, then automatically stop the training.

You can learn more about this callback [here](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping).
"""

model = build_model()

EPOCHS = 1000

# The patience parameter is the amount of epochs to check for improvement
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

early_history = model.fit(normed_train_data, train_labels, 
                    epochs=EPOCHS, validation_split = 0.2, 
                    callbacks=[early_stop])


# Export model and save to GCS
model.save(BUCKET + '/mpg/model')
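
The patience logic behind the EarlyStopping callback can be illustrated without TensorFlow. The following is a simplified sketch, not Keras's actual implementation and not part of train.py: it assumes a plain list of per-epoch validation losses and stops once patience consecutive epochs pass without a new best value. The function name and the sample losses are invented for the example.

```python
def epochs_run(val_losses, patience=10):
    """Return how many epochs run before a patience-based stop."""
    best = float("inf")
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch  # training stops after this epoch
    return len(val_losses)


if __name__ == "__main__":
    sample = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.4]
    print(epochs_run(sample, patience=3))
    print(epochs_run(sample, patience=10))
```

With a small patience the run ends before the late improvement in the sample is ever reached, which is the trade-off the patience=10 setting in train.py is balancing.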

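Once you've copied the training code above into the mpg/trainer/train.py file, return to the terminal in Cloud Shell and run the following command to add your own bucket name to the file: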

sed -i "s|BUCKET_NAME|$BUCKET_NAME|g" trainer/train.py
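
If you repeat the lab on a machine where this sed invocation is not available, the same in-place substitution can be done in Python. This is a sketch under that assumption; replace_placeholder is a hypothetical helper, not part of the codelab.

```python
from pathlib import Path


def replace_placeholder(path, placeholder, value):
    """Replace every occurrence of placeholder in the file, like sed -i 's|a|b|g'.

    Returns the number of occurrences that were replaced.
    """
    file_path = Path(path)
    text = file_path.read_text()
    count = text.count(placeholder)
    file_path.write_text(text.replace(placeholder, value))
    return count
```

For example, replace_placeholder("trainer/train.py", "BUCKET_NAME", "gs://my-project-id-bucket") would rewrite the BUCKET line, where the bucket URI shown is a placeholder for your own.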

Step 4: Build and test the container locally

From your terminal, run the following to define a variable with the URI of your container image in Google Container Registry:

IMAGE_URI="gcr.io/$GOOGLE_CLOUD_PROJECT/mpg:v1"

Then, build the container by running the following from the root of your mpg directory:

docker build ./ -t $IMAGE_URI

Once you've built the container, push it to Google Container Registry:

docker push $IMAGE_URI

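To verify that your image was pushed to Container Registry, navigate to the Container Registry section of your console. You should see something like this:

Container Registry preview

With our container pushed to Container Registry, we're now ready to kick off a custom model training job.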

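5. Run a training job on Vertex AI

Vertex gives you two options for training models:

AutoML: Train high-quality models with minimal effort and ML expertise.
Custom training: Run your custom training applications in the cloud using one of Google Cloud's pre-built containers or your own.

In this lab, we use custom training via our own custom container on Google Container Registry. To start, navigate to the Training section in the Vertex section of your Cloud console: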

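Step 1: Kick off the training job

Click Create to enter the parameters for your training job and deployed model:

Under Dataset, select No managed dataset
Then select Custom training (advanced) as your training method and click Continue.
Enter mpg (or whatever you'd like to call your model) for Model name
Click Continue

In the Container settings step, select Custom container:

Custom container option

In the first box (Container image), click Browse and find the container you just pushed to Container Registry. It should look something like this:

Find container

Leave the rest of the fields blank and click Continue.

We won't use hyperparameter tuning in this tutorial, so leave the Enable hyperparameter tuning box unchecked and click Continue.

In Compute and pricing, leave the selected region as-is and choose n1-standard-4 as your machine type:

Machine type

Because the model in this demo trains quickly, we're using a smaller machine type.

Under the Prediction container step, select No prediction container:

No prediction container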
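
The console steps above can also be expressed with the Vertex AI SDK for Python. The sketch below is an alternative resting on several assumptions, not part of the original lab: it presumes google-cloud-aiplatform is installed (the lab installs it in the next section) and authenticated, that the bucket and image from the earlier steps exist, and that us-central1 is the region. The helper names are hypothetical; aiplatform.init, CustomContainerTrainingJob, and its run method are SDK calls.

```python
import os


def training_job_settings(project_id):
    """Collect the values chosen in the console steps above."""
    return {
        "display_name": "mpg",
        "container_uri": f"gcr.io/{project_id}/mpg:v1",
        "staging_bucket": f"gs://{project_id}-bucket",
        "location": "us-central1",  # assumed region, matching the bucket
        "machine_type": "n1-standard-4",
    }


def submit_training_job(project_id):
    """Submit the custom container job; needs the SDK and credentials."""
    # Imported lazily so training_job_settings works without the SDK.
    from google.cloud import aiplatform

    cfg = training_job_settings(project_id)
    aiplatform.init(
        project=project_id,
        location=cfg["location"],
        staging_bucket=cfg["staging_bucket"],
    )
    job = aiplatform.CustomContainerTrainingJob(
        display_name=cfg["display_name"],
        container_uri=cfg["container_uri"],
    )
    # No serving container is given, mirroring "No prediction container".
    job.run(replica_count=1, machine_type=cfg["machine_type"])


if __name__ == "__main__":
    project = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-project-id")
    print(training_job_settings(project))
```

Running the file only prints the settings; calling submit_training_job is what would start a billable job.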

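6. Deploy a model endpoint

In this step we'll create an endpoint for our trained model. We can use it to get predictions from our model via the Vertex AI API. To do this, we've made a version of the exported trained model assets available in a public GCS bucket.

In an organization, it's common to have one team or individual in charge of building the model and a different team responsible for deploying it. The steps here show you how to take a model that has already been trained and deploy it for predictions.

Here we'll use the Vertex AI SDK to create a model, deploy it to an endpoint, and get a prediction.

Step 1: Install Vertex SDK

From your Cloud Shell terminal, run the following to install the Vertex AI SDK: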

pip3 install google-cloud-aiplatform --upgrade --user

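We can use this SDK to interact with many different parts of Vertex.

Step 2: Create model and deploy endpoint

Next we'll create a Python file and use the SDK to create a model resource and deploy it to an endpoint. From the file editor in Cloud Shell, select File and then New File:

New file in Cloud Shell

Name the file deploy.py. Open this file in your editor and copy the following code: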

from google.cloud import aiplatform

# Create a model resource from public model assets
model = aiplatform.Model.upload(
    display_name="mpg-imported",
    artifact_uri="gs://io-vertex-codelab/mpg-model/",
    serving_container_image_uri="gcr.io/cloud-aiplatform/prediction/tf2-cpu.2-3:latest"
)

# Deploy the above model to an endpoint
endpoint = model.deploy(
    machine_type="n1-standard-4"
)

Next, navigate back to the terminal in Cloud Shell, cd back into your root directory, and run the Python script you just created:

cd ..
python3 deploy.py | tee deploy-output.txt

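You'll see updates logged to your terminal as resources are being created. This will take 10-15 minutes to run. To make sure it's working correctly, navigate to the Models section of your console in Vertex AI:

Models in Vertex console

Click mpg-imported and you should see your endpoint for that model being created:

Pending endpoint

In your Cloud Shell terminal, you'll see something like the following log when your endpoint deployment has completed: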

Endpoint model deployed. Resource name: projects/your-project-id/locations/us-central1/endpoints/your-endpoint-id

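You'll use this in the next step to get a prediction from your deployed endpoint.

Step 3: Get predictions on the deployed endpoint

In your Cloud Shell editor, create a new file called predict.py:

Create predict file

Open predict.py and paste the following code into it: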

from google.cloud import aiplatform

endpoint = aiplatform.Endpoint(
    endpoint_name="ENDPOINT_STRING"
)

# A test example we'll send to our model for prediction
test_mpg = [1.4838871833555929,
 1.8659883497083019,
 2.234620276849616,
 1.0187816540094903,
 -2.530890710602246,
 -1.6046416850441676,
 -0.4651483719733302,
 -0.4952254087173721,
 0.7746763768735953]

response = endpoint.predict([test_mpg])

print('API response: ', response)

print('Predicted MPG: ', response.predictions[0][0])

Next, go back to your terminal and enter the following to replace ENDPOINT_STRING in the predict file with your own endpoint:

ENDPOINT=$(cat deploy-output.txt | sed -nre 's:.*Resource name\: (.*):\1:p' | tail -1)
sed -i "s|ENDPOINT_STRING|$ENDPOINT|g" predict.py
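
The first sed expression above keeps whatever follows "Resource name: " on the last matching line of deploy-output.txt. For readers less familiar with sed, here is a rough Python equivalent; it assumes the log format shown earlier and uses a made-up function name. The first line of the sample log is illustrative only.

```python
import re


def last_resource_name(log_text):
    """Return the text after 'Resource name: ' on the last matching line, or None."""
    matches = re.findall(r"Resource name: (.*)", log_text)
    return matches[-1].strip() if matches else None


if __name__ == "__main__":
    sample_log = (
        "Model created. Resource name: projects/your-project-id/locations/us-central1/models/your-model-id\n"
        "Endpoint model deployed. Resource name: projects/your-project-id/locations/us-central1/endpoints/your-endpoint-id\n"
    )
    print(last_resource_name(sample_log))
```

Taking the last match matters because earlier log lines can also contain a resource name (for the model, for instance), and only the endpoint's name is wanted in predict.py.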

Now it's time to run the predict.py file to get a prediction from our deployed model endpoint:

python3 predict.py

You should see the API's response logged, along with the predicted fuel efficiency for our test prediction.
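
The nine numbers in test_mpg are already one-hot encoded and normalized, because the endpoint expects inputs prepared the same way as the training data in train.py. The sketch below shows that preparation for a single raw example in plain Python. Everything specific in it is an assumption for illustration: only two numeric features are included for brevity, and the column names, their order, the mean and standard deviation values, and the sample car are hypothetical, not the statistics computed by train.py.

```python
ORIGINS = ["Europe", "Japan", "USA"]  # assumed one-hot column order

# Hypothetical (mean, std) per column; real values come from train_stats in train.py.
STATS = {
    "cylinders": (5.5, 1.7),
    "weight": (3000.0, 800.0),
    "Europe": (0.2, 0.4),
    "Japan": (0.2, 0.4),
    "USA": (0.6, 0.5),
}


def prepare(example, stats=STATS):
    """One-hot encode origin, then z-score every column, as train.py does."""
    row = {k: float(v) for k, v in example.items() if k != "origin"}
    for name in ORIGINS:
        row[name] = 1.0 if example["origin"] == name else 0.0
    # train.py normalizes the one-hot columns too, so they do not stay 0 or 1.
    return [(row[col] - mean) / std for col, (mean, std) in stats.items()]


if __name__ == "__main__":
    car = {"cylinders": 5.5, "weight": 3800.0, "origin": "USA"}
    print(prepare(car))
```

This also explains why the last three values in test_mpg are fractions rather than 0 or 1: the origin columns pass through the same normalization as the numeric ones.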

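🎉 Congratulations! 🎉

You've learned how to use Vertex AI to:

Train a model by providing the training code in a custom container. You used a TensorFlow model in this example, but you can train a model built with any framework using custom containers.
Deploy a TensorFlow model using a pre-built container as part of the same workflow you used for training.
Create a model endpoint and generate a prediction.

To learn more about the different parts of Vertex AI, check out the documentation. If you'd like to see the results of the training job you started in section 5, navigate to the Training section of your Vertex console.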

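7. Cleanup

To delete the endpoint you deployed, navigate to the Endpoints section of your Vertex console and click the delete icon:

Delete endpoint

To delete the storage bucket, use the navigation menu in your Cloud console to browse to Storage, select your bucket, and click Delete:

Delete storage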
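
The endpoint can also be removed from code instead of the console. This is a minimal sketch, assuming an Endpoint object like the one created in predict.py; the SDK's Endpoint class provides undeploy_all and delete, while the wrapper function is hypothetical.

```python
def teardown_endpoint(endpoint):
    """Undeploy every model from the endpoint, then delete the endpoint."""
    # Models are undeployed first; an endpoint that still serves a model
    # is not expected to be deletable.
    endpoint.undeploy_all()
    endpoint.delete()


# Possible usage (requires the SDK and credentials; ENDPOINT_STRING is the
# same placeholder used in predict.py):
#
#   from google.cloud import aiplatform
#   teardown_endpoint(aiplatform.Endpoint(endpoint_name="ENDPOINT_STRING"))
```

This only covers the endpoint; the uploaded model resource and the storage bucket still need to be deleted separately.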
