From Prototype to Production: Hyperparameter Tuning

Last updated: Aug 25, 2022

Written by Nikita Namjoshi

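1. Overview

In this lab, you'll use Vertex AI to run a hyperparameter tuning job on Vertex AI Training.

This lab is part of the Prototype to Production video series. Be sure to complete the previous lab before trying this one. You can watch the accompanying video series to learn more.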

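What you learn

You'll learn how to:

Modify training application code for automated hyperparameter tuning
Configure and launch a hyperparameter tuning job with the Vertex AI Python SDK

The total cost to run this lab on Google Cloud is about $1 USD.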

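2. Intro to Vertex AI

This lab uses the newest AI product offering available on Google Cloud. Vertex AI integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible via separate services. The new offering combines both into a single API, along with other new products. You can also migrate existing projects to Vertex AI.

Vertex AI includes many different products to support end-to-end ML workflows. This lab focuses on two of them: Training and Workbench.

Vertex product overview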

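3. Set up your environment

Complete the steps in the Training custom models with Vertex AI lab to set up your environment.

4. Containerize training application code

You'll submit this training job to Vertex AI by putting your training application code in a Docker container and pushing that container to Google Artifact Registry. With this approach, you can train and tune a model built with any framework.

To start, open a terminal window from the Launcher menu of the Workbench notebook you created in the previous labs.

Open terminal in notebook

Step 1: Write training code

Create a new directory called flowers-hptune and cd into it: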

mkdir flowers-hptune
cd flowers-hptune

Run the following to create a directory for the training code and a Python file where you'll add the code below.

mkdir trainer
touch trainer/task.py

You should now have the following in your flowers-hptune/ directory:

+ trainer/
    + task.py

Next, open the task.py file you just created and copy in the code below.

You'll need to replace {your-gcs-bucket} in BUCKET_ROOT with the Cloud Storage bucket where you stored the flowers dataset in Lab 1.

import tensorflow as tf
import numpy as np
import os
import hypertune
import argparse

## Replace {your-gcs-bucket} !!
BUCKET_ROOT='/gcs/{your-gcs-bucket}'

# Define variables
NUM_CLASSES = 5
EPOCHS=10
BATCH_SIZE = 32

IMG_HEIGHT = 180
IMG_WIDTH = 180

DATA_DIR = f'{BUCKET_ROOT}/flower_photos'

def get_args():
  '''Parses args. Must include all hyperparameters you want to tune.'''

  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--learning_rate',
      required=True,
      type=float,
      help='learning rate')
  parser.add_argument(
      '--momentum',
      required=True,
      type=float,
      help='SGD momentum value')
  parser.add_argument(
      '--num_units',
      required=True,
      type=int,
      help='number of units in last hidden layer')
  args = parser.parse_args()
  return args

def create_datasets(data_dir, batch_size):
  '''Creates train and validation datasets.'''

  train_dataset = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=batch_size)

  validation_dataset = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=batch_size)

  train_dataset = train_dataset.cache().shuffle(1000).prefetch(buffer_size=tf.data.AUTOTUNE)
  validation_dataset = validation_dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

  return train_dataset, validation_dataset


def create_model(num_units, learning_rate, momentum):
  '''Creates model.'''

  model = tf.keras.Sequential([
    tf.keras.layers.Resizing(IMG_HEIGHT, IMG_WIDTH),
    tf.keras.layers.Rescaling(1./255, input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_units, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')
  ])

  model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])
  
  return model

def main():
  args = get_args()
  train_dataset, validation_dataset = create_datasets(DATA_DIR, BATCH_SIZE)
  model = create_model(args.num_units, args.learning_rate, args.momentum)
  history = model.fit(train_dataset, validation_data=validation_dataset, epochs=EPOCHS)

  # DEFINE METRIC
  hp_metric = history.history['val_accuracy'][-1]

  hpt = hypertune.HyperTune()
  hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='accuracy',
      metric_value=hp_metric,
      global_step=EPOCHS)


if __name__ == "__main__":
    main()
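
To see how a trial's hyperparameter values reach the script, the argument parsing can be exercised on its own. This is a minimal sketch that mirrors get_args() and assumes each trial's values arrive as command-line flags named after the hyperparameters. The sample values are made up for illustration, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Builds a parser with one flag per tunable hyperparameter (mirrors get_args())."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', required=True, type=float)
    parser.add_argument('--momentum', required=True, type=float)
    parser.add_argument('--num_units', required=True, type=int)
    return parser


# Hypothetical values for a single trial; a real trial receives its own values.
sample_argv = ['--learning_rate=0.01', '--momentum=0.9', '--num_units=128']
args = build_parser().parse_args(sample_argv)

# The parsed values are typed, so they can be passed straight to create_model().
print(args.learning_rate, args.momentum, args.num_units)
```

Because every flag is marked required=True, a trial that omits one of them fails at startup instead of silently training with a default value.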

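Before you build the container, let's take a deeper look at the code. A few components are specific to using the hyperparameter tuning service.

The script imports the hypertune library.
The function get_args() defines a command-line argument for each hyperparameter you want to tune. In this example, the hyperparameters that will be tuned are the learning rate, the momentum value in the optimizer, and the number of units in the last hidden layer of the model, but feel free to experiment with others. The value passed in those arguments is then used to set the corresponding hyperparameter in the code.
At the end of the main() function, the hypertune library is used to define the metric you want to optimize. In TensorFlow, the Keras model.fit method returns a History object. The History.history attribute is a record of training loss values and metrics values at successive epochs. If you pass validation data to model.fit, the History.history attribute will include validation loss and metrics values as well. For example, if you trained a model for three epochs with validation data and provided accuracy as a metric, the History.history attribute would look similar to the following dictionary.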

{
 "accuracy": [
   0.7795261740684509,
   0.9471358060836792,
   0.9870933294296265
 ],
 "loss": [
   0.6340447664260864,
   0.16712145507335663,
   0.04546636343002319
 ],
 "val_accuracy": [
   0.3795261740684509,
   0.4471358060836792,
   0.4870933294296265
 ],
 "val_loss": [
   2.044623374938965,
   4.100203514099121,
   3.0728273391723633
 ]
}

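If you want the hyperparameter tuning service to discover the values that maximize the model's validation accuracy, you define the metric as the last entry (index EPOCHS - 1) of the val_accuracy list. Then, pass this metric to an instance of HyperTune. You can pick whatever string you like for the hyperparameter_metric_tag, but you'll need to use the same string again later when you kick off the hyperparameter tuning job.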
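
The metric selection described above can be tried on a plain dictionary without running any training. The sketch below reuses the example values shown earlier. The best-epoch alternative is an assumption for comparison only and is not what task.py does.

```python
# Example History.history-style record, using the values shown above.
history = {
    "accuracy": [0.7795261740684509, 0.9471358060836792, 0.9870933294296265],
    "val_accuracy": [0.3795261740684509, 0.4471358060836792, 0.4870933294296265],
}

epochs = len(history["val_accuracy"])

# The last entry and the entry at index epochs - 1 are the same value.
final_val_accuracy = history["val_accuracy"][-1]
assert final_val_accuracy == history["val_accuracy"][epochs - 1]

# An alternative (not used in task.py) is to report the best epoch instead.
best_val_accuracy = max(history["val_accuracy"])

print(final_val_accuracy, best_val_accuracy)
```

In this example the two values coincide because validation accuracy improves at every epoch. They can differ when the model starts to overfit in later epochs.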

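Step 2: Create a Dockerfile

To containerize your code, you'll need to create a Dockerfile. The Dockerfile includes all the commands needed to build and run the image. It installs all the necessary libraries and sets up the entry point for the training code.

From your terminal, create an empty Dockerfile in the root of your flowers-hptune directory: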

touch Dockerfile

You should now have the following in your flowers-hptune/ directory:

+ Dockerfile
+ trainer/
    + task.py

Open the Dockerfile and copy the following into it. You'll notice that this is almost identical to the Dockerfile used in the first lab, except that it now installs the cloudml-hypertune library.

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-8

WORKDIR /

# Installs hypertune library
RUN pip install cloudml-hypertune

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

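Step 3: Build the container

From your terminal, run the following to define an env variable for your project, making sure to replace your-cloud-project with the ID of your project: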

PROJECT_ID='your-cloud-project'

Define a repo in Artifact Registry. We'll use the repo created in the first lab.

REPO_NAME='flower-app'

Define a variable with the URI of your container image in Google Artifact Registry:

IMAGE_URI=us-central1-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/flower_image_hptune:latest

Configure Docker:

gcloud auth configure-docker \
    us-central1-docker.pkg.dev

Then, build the container by running the following from the root of your flowers-hptune directory:

docker build ./ -t $IMAGE_URI

Finally, push it to Artifact Registry:

docker push $IMAGE_URI

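With the container pushed to Artifact Registry, you're now ready to kick off the training job.

5. Run hyperparameter tuning job with the SDK

In this section, you'll learn how to configure and submit the hyperparameter tuning job by using the Vertex AI Python SDK.

From the Launcher, create a TensorFlow 2 notebook.

Import the Vertex AI SDK.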

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

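To launch the hyperparameter tuning job, you first need to define the worker_pool_specs, which specifies the machine type and the Docker image. The following spec defines one machine with one NVIDIA Tesla V100 GPU.

You'll need to replace {PROJECT_ID} in the image_uri with your project ID.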

# The spec of the worker pools including machine type and Docker image
# Be sure to replace PROJECT_ID in the `image_uri` with your project.

worker_pool_specs = [{
    "machine_spec": {
        "machine_type": "n1-standard-4",
        "accelerator_type": "NVIDIA_TESLA_V100",
        "accelerator_count": 1
    },
    "replica_count": 1,
    "container_spec": {
        "image_uri": "us-central1-docker.pkg.dev/{PROJECT_ID}/flower-app/flower_image_hptune:latest"
    }
}]
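
Editing the URI string by hand is easy to get wrong, so one option is to build it from a variable. The sketch below is an assumption rather than part of the lab: make_worker_pool_specs is a hypothetical helper, and its defaults simply repeat the repo and image names used earlier.

```python
def make_worker_pool_specs(project_id, repo_name="flower-app",
                           image_name="flower_image_hptune", tag="latest"):
    """Hypothetical helper: builds the worker pool spec for a given project ID."""
    image_uri = (
        f"us-central1-docker.pkg.dev/{project_id}/{repo_name}/{image_name}:{tag}"
    )
    return [{
        "machine_spec": {
            "machine_type": "n1-standard-4",
            "accelerator_type": "NVIDIA_TESLA_V100",
            "accelerator_count": 1,
        },
        "replica_count": 1,
        "container_spec": {"image_uri": image_uri},
    }]


# Placeholder project ID; replace with your own.
PROJECT_ID = "your-cloud-project"
worker_pool_specs = make_worker_pool_specs(PROJECT_ID)
print(worker_pool_specs[0]["container_spec"]["image_uri"])
```

The resulting list has the same shape as the literal spec above, so it can be passed to the job in the same way.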

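Next, define the parameter_spec, which is a dictionary specifying the parameters you want to optimize. The dictionary key is the string you assigned to the command-line argument for each hyperparameter, and the dictionary value is the parameter specification.

For each hyperparameter, you need to define the type as well as the bounds for the values that the tuning service will try. Hyperparameters can be of type Double, Integer, Categorical, or Discrete. If you select the type Double or Integer, you'll need to provide a minimum and maximum value. If you select Categorical or Discrete, you'll need to provide the values. For the Double and Integer types, you'll also need to provide the scaling value. To learn more about how to pick the best scale, see this video.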

# Dictionary representing parameters to optimize.
# The dictionary key is the parameter_id, which is passed into your training
# job as a command line argument,
# and the dictionary value is the parameter specification.
parameter_spec = {
    "learning_rate": hpt.DoubleParameterSpec(min=0.001, max=1, scale="log"),
    "momentum": hpt.DoubleParameterSpec(min=0, max=1, scale="linear"),
    "num_units": hpt.DiscreteParameterSpec(values=[64, 128, 512], scale=None)
}
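
To get a feel for why the learning rate uses a log scale while momentum uses a linear one, it can help to compare how the two scales spread values over the same range. The sketch below is only an illustration of the scaling idea using random draws. It is not how the tuning service picks values, and the helper names are hypothetical.

```python
import math
import random


def sample_linear(low, high, rng):
    """Draws a value uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log(low, high, rng):
    """Draws a value uniformly in log space between low and high (low > 0)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
linear_draws = [sample_linear(0.001, 1, rng) for _ in range(10000)]
log_draws = [sample_log(0.001, 1, rng) for _ in range(10000)]

# Fraction of draws that land in the smallest decade of the range.
frac_linear = sum(v < 0.01 for v in linear_draws) / len(linear_draws)
frac_log = sum(v < 0.01 for v in log_draws) / len(log_draws)
print(f"below 0.01: linear={frac_linear:.3f}, log={frac_log:.3f}")
```

On a linear scale only about 1% of the range lies below 0.01, whereas on a log scale each decade between 0.001 and 1 gets roughly a third of the draws. That is why a log scale suits parameters whose useful values span several orders of magnitude.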

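The final spec to define is metric_spec, which is a dictionary representing the metric to optimize. The dictionary key is the hyperparameter_metric_tag that you set in your training application code, and the value is the optimization goal.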

# Dictionary representing metric to optimize.
# The dictionary key is the metric_id, which is reported by your training job,
# And the dictionary value is the optimization goal of the metric.
metric_spec={'accuracy':'maximize'}

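Once the specs are defined, you'll create a CustomJob, which is the common spec that will be used to run your job on each of the hyperparameter tuning trials.

You'll need to replace {YOUR_BUCKET} with the bucket you created previously.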

# Replace YOUR_BUCKET
my_custom_job = aiplatform.CustomJob(display_name='flowers-hptune-job',
                              worker_pool_specs=worker_pool_specs,
                              staging_bucket='gs://{YOUR_BUCKET}')

Then, create and run the HyperparameterTuningJob.

hp_job = aiplatform.HyperparameterTuningJob(
    display_name='flowers-hptune-job',
    custom_job=my_custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=15,
    parallel_trial_count=3)

hp_job.run()

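There are a few arguments to note:

max_trial_count: You'll need to put an upper bound on the number of trials the service will run. More trials generally lead to better results, but there is a point of diminishing returns, after which additional trials have little or no effect on the metric you're trying to optimize. It is a best practice to start with a smaller number of trials and get a sense of how impactful your chosen hyperparameters are before scaling up.
parallel_trial_count: If you use parallel trials, the service provisions multiple training processing clusters. Increasing the number of parallel trials reduces the amount of time the hyperparameter tuning job takes to run. However, it can reduce the effectiveness of the job overall, because the default tuning strategy uses results of previous trials to inform the assignment of values in subsequent trials.
search_algorithm: You can set the search algorithm to grid, random, or default (None). The default option applies Bayesian optimization to search the space of possible hyperparameter values and is the recommended algorithm. This argument is not set in the call above, so the default is used. You can learn more about this algorithm here.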
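
The trade-off between max_trial_count and parallel_trial_count can be made concrete with a little arithmetic. The sketch below assumes, as a simplification, that trials run in full batches of equal duration. The service schedules trials in its own way, so treat the numbers as a rough guide only. The function name is hypothetical.

```python
import math


def sequential_rounds(max_trial_count, parallel_trial_count):
    """Rough number of back-to-back batches needed to finish all trials."""
    return math.ceil(max_trial_count / parallel_trial_count)


# Compare a few parallelism settings for the 15 trials used in this lab.
rounds_by_parallelism = {
    parallel: sequential_rounds(15, parallel) for parallel in (1, 3, 5, 15)
}
for parallel, rounds in rounds_by_parallelism.items():
    print(f"parallel_trial_count={parallel}: about {rounds} round(s)")
```

With the settings in this lab (15 trials, 3 in parallel), that comes to about 5 rounds. At the extreme of 15 parallel trials there is a single round, so no trial can benefit from the results of an earlier one.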

You can track the progress of your job in the console.
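
Once the job has finished, the results can also be compared from the notebook. The sketch below is an assumption-heavy illustration: it presumes that every trial completed and that each trial object exposes final_measurement.metrics (entries with metric_id and value), which is my understanding of what hp_job.trials returns and should be checked against the SDK reference. The helper names and the stand-in trials, including their values, are hypothetical and exist only to exercise the function.

```python
from types import SimpleNamespace


def best_trial(trials, metric_id="accuracy", maximize=True):
    """Returns the trial with the best final value for metric_id."""
    def final_value(trial):
        for metric in trial.final_measurement.metrics:
            if metric.metric_id == metric_id:
                return metric.value
        raise KeyError(metric_id)

    chooser = max if maximize else min
    return chooser(trials, key=final_value)


def make_stand_in_trial(trial_id, accuracy):
    """Hypothetical stand-in that mimics the assumed shape of a trial object."""
    metric = SimpleNamespace(metric_id="accuracy", value=accuracy)
    return SimpleNamespace(
        id=trial_id,
        final_measurement=SimpleNamespace(metrics=[metric]),
    )


# Made-up values for demonstration; they are not results from a real job.
stand_in_trials = [
    make_stand_in_trial("1", 0.41),
    make_stand_in_trial("2", 0.63),
    make_stand_in_trial("3", 0.57),
]
best = best_trial(stand_in_trials)
print(best.id, best.final_measurement.metrics[0].value)
```

With a real job, the call would be best_trial(hp_job.trials), using the same metric tag that was set in metric_spec.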
