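How to fine-tune a large language model using Cloud Run jobs

1. Introduction

Overview

In this codelab, you'll use Cloud Run jobs to fine-tune a Gemma model, then serve the result on Cloud Run using vLLM.

For this codelab, you'll use a text-to-SQL dataset so that the LLM answers with a SQL query when it is asked a question in natural language.

What you'll learn

How to fine-tune a model using Cloud Run jobs with a GPU
How to serve a model on Cloud Run with vLLM
How to use a Direct VPC configuration for a GPU job to speed up model upload and serving

2. Before you begin

Enable APIs

Before you start this codelab, enable the following APIs by running: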

gcloud services enable run.googleapis.com \
    compute.googleapis.com \
    cloudbuild.googleapis.com \
    secretmanager.googleapis.com \
    artifactregistry.googleapis.com

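GPU quota

Request a quota increase for a supported region. The quota is nvidia_l4_gpu_allocation_no_zonal_redundancy, listed under the Cloud Run Admin API.

Note: If you are using a new project, it may take a few minutes after you enable the API for the quota to appear on the quotas page.

Hugging Face

This codelab uses a model hosted on Hugging Face. To get this model, request a Hugging Face user access token with "Read" permission. You'll refer to this value later as YOUR_HF_TOKEN.

You also need to agree to the usage terms before you can use the model: https://huggingface.co/google/gemma-2b

3. Setup and requirements

Set up the following resources:

An IAM service account and the associated IAM permissions
A Secret Manager secret to store your Hugging Face token
A Cloud Storage bucket to store your fine-tuned model
An Artifact Registry repository to store the image you build to fine-tune your model

Set the environment variables for this codelab. Several variables are pre-populated for you. Specify your project ID, region, and Hugging Face token.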

export PROJECT_ID=<YOUR_PROJECT_ID>
export REGION=<YOUR_REGION>
export HF_TOKEN=<YOUR_HF_TOKEN>

export AR_REPO=codelab-finetuning-jobs
export IMAGE_NAME=finetune-to-gcs
export JOB_NAME=finetuning-to-gcs-job
export BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
export SECRET_ID=HF_TOKEN
export SERVICE_ACCOUNT="finetune-job-sa"
export SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
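
The derived names above depend on PROJECT_ID being set before the later exports run. As a minimal sketch, the Python below mirrors how the bucket name and service account address are composed and reports which of the three user-supplied variables are still empty. The helper names derive_names and missing_vars are hypothetical and are not part of the codelab.

```python
import os


def derive_names(project_id: str, service_account: str = "finetune-job-sa") -> dict:
    """Mirror the shell exports that build names from PROJECT_ID."""
    if not project_id:
        raise ValueError("project_id must not be empty")
    return {
        # Same pattern as: export BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
        "BUCKET_NAME": f"{project_id}-codelab-finetuning-jobs",
        # Same pattern as: $SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
        "SERVICE_ACCOUNT_ADDRESS": f"{service_account}@{project_id}.iam.gserviceaccount.com",
    }


def missing_vars(required=("PROJECT_ID", "REGION", "HF_TOKEN"), env=None) -> list:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling missing_vars() with no arguments checks the current process environment; an empty list means all three values are present.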

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Service account for fine-tuning codelab"

Use Secret Manager to store the Hugging Face access token:

gcloud secrets create $SECRET_ID \
      --replication-policy="automatic"

printf '%s' "$HF_TOKEN" | gcloud secrets versions add $SECRET_ID --data-file=-

Grant the Secret Manager Secret Accessor role to the service account:

gcloud secrets add-iam-policy-binding $SECRET_ID \
  --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role='roles/secretmanager.secretAccessor'

Create a bucket to host your fine-tuned model:

gcloud storage buckets create -l $REGION gs://$BUCKET_NAME

Grant the service account access to the bucket:

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
  --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role=roles/storage.objectAdmin

Create an Artifact Registry repository to store the container image:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for finetuning using CR jobs" \
    --project=$PROJECT_ID

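4. Create the Cloud Run job image

In the next step, you'll create the code that does the following:

Imports the Gemma model from Hugging Face
Fine-tunes the model with a dataset from Hugging Face. The job uses a single L4 GPU for fine-tuning.
Uploads the fine-tuned model, called new_model, to your Cloud Storage bucket

Create a directory for your fine-tuning job code.

mkdir codelab-finetuning-job
cd codelab-finetuning-job

Create a file called finetune.py: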

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, PeftModel

from trl import SFTTrainer

# Cloud Storage bucket to upload the model
bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

# The model that you want to train from the Hugging Face hub
model_name = os.getenv("MODEL_NAME", "google/gemma-2b")

# The instruction dataset to use
dataset_name = "b-mc2/sql-create-context"

# Fine-tuned model name
new_model = os.getenv("NEW_MODEL", "gemma-2b-sql")

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = int(os.getenv("LORA_R", "4"))

# Alpha parameter for LoRA scaling
lora_alpha = int(os.getenv("LORA_ALPHA", "8"))

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = True
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = int(os.getenv("TRAIN_BATCH_SIZE", "1"))

# Batch size per GPU for evaluation
per_device_eval_batch_size = int(os.getenv("EVAL_BATCH_SIZE", "2"))

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = int(os.getenv("GRADIENT_ACCUMULATION_STEPS", "1"))

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 0

# Log every X updates steps
logging_steps = int(os.getenv("LOGGING_STEPS", "50"))

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = int(os.getenv("MAX_SEQ_LENGTH", "512"))

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the current CUDA device
device_map = {'':torch.cuda.current_device()}

# Number of examples to sample from the dataset; set to -1 to use all of it
limit = int(os.getenv("DATASET_LIMIT", "5000"))

dataset = load_dataset(dataset_name, split="train")
if limit != -1:
    dataset = dataset.shuffle(seed=42).select(range(limit))


def transform(data):
    question = data['question']
    context = data['context']
    answer = data['answer']
    template = "Question: {question}\nContext: {context}\nAnswer: {answer}"
    return {'text': template.format(question=question, context=context, answer=answer)}


transformed = dataset.map(transform)

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the tokenizer for the base model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=transformed,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()

trainer.model.save_pretrained(new_model)

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Save to the path where the Cloud Storage bucket is mounted (see the job's volumeMounts)

file_path_to_save_the_model = '/finetune/new_model'
model.save_pretrained(file_path_to_save_the_model)
tokenizer.save_pretrained(file_path_to_save_the_model)
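
As a rough illustration of why a LoRA run fits on a single L4 GPU, the sketch below counts the adapter parameters that LoraConfig adds for q_proj and v_proj. The layer shapes are assumptions about the Gemma 2B architecture (18 layers, a 2048-to-2048 q_proj and a 2048-to-256 v_proj) and are not stated in this codelab, so check them against the model's config before relying on the totals. The helper names are hypothetical.

```python
def lora_trainable_params(r: int, shapes: list) -> int:
    """Count LoRA parameters for a list of (in_features, out_features) layers.

    Each adapted linear layer gains two matrices: A with shape (r, in_features)
    and B with shape (out_features, r), so it adds r * (in + out) parameters.
    """
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Standard LoRA scaling factor applied to the adapter output."""
    return lora_alpha / r


# ASSUMPTION: Gemma 2B layer shapes, not taken from this codelab.
ASSUMED_NUM_LAYERS = 18
ASSUMED_SHAPES_PER_LAYER = [(2048, 2048), (2048, 256)]  # q_proj, v_proj
ALL_SHAPES = ASSUMED_SHAPES_PER_LAYER * ASSUMED_NUM_LAYERS

print(lora_trainable_params(4, ALL_SHAPES))  # script default LORA_R=4 -> 460800
print(lora_scaling(8, 4))                    # script defaults LORA_ALPHA=8, LORA_R=4 -> 2.0
```

Under these assumed shapes, doubling LORA_R doubles the adapter size, while keeping LORA_ALPHA at twice LORA_R leaves the scaling factor unchanged.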

Create a requirements.txt file:

accelerate==0.34.2
bitsandbytes==0.45.5
datasets==2.19.1
transformers==4.51.3
peft==0.11.1
trl==0.8.6
torch==2.3.0

Create a Dockerfile:

FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt /requirements.txt

RUN pip3 install -r requirements.txt --no-cache-dir

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED=1

CMD python3 /finetune.py --device cuda

Build the container and push it to your Artifact Registry repository:

gcloud builds submit \
  --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
  --region $REGION

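5. Deploy and execute the job

In this step, you'll create the YAML configuration for your job. It uses Direct VPC egress for faster uploads to Google Cloud Storage.

Note that this file contains variables that you'll fill in during a later step.

Create a file called finetune-job.yaml.tmpl: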

apiVersion: run.googleapis.com/v1
kind: Job
metadata:
  name: $JOB_NAME
  labels:
    cloud.googleapis.com/location: $REGION
  annotations:
    run.googleapis.com/launch-stage: ALPHA
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/execution-environment: gen2
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      parallelism: 1
      taskCount: 1
      template:
        spec:
          serviceAccountName: $SERVICE_ACCOUNT_ADDRESS
          containers:
          - name: $IMAGE_NAME
            image: $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME
            env:
            - name: MODEL_NAME
              value: "google/gemma-2b"
            - name: NEW_MODEL
              value: "gemma-2b-sql-finetuned"
            - name: BUCKET_NAME
              value: "$BUCKET_NAME"
            - name: LORA_R
              value: "8"
            - name: LORA_ALPHA
              value: "16"
            - name: GRADIENT_ACCUMULATION_STEPS
              value: "2"
            - name: DATASET_LIMIT
              value: "1000"
            - name: LOGGING_STEPS
              value: "5"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: 'latest'
                  name: HF_TOKEN
            resources:
              limits:
                cpu: 8000m
                nvidia.com/gpu: '1'
                memory: 32Gi
            volumeMounts:
            - mountPath: /finetune/new_model
              name: finetuned_model
          volumes:
          - name: finetuned_model
            csi:
              driver: gcsfuse.run.googleapis.com
              readOnly: false
              volumeAttributes:
                bucketName: $BUCKET_NAME
          maxRetries: 3
          timeoutSeconds: '3600'
          nodeSelector:
            run.googleapis.com/accelerator: nvidia-l4
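
To make the training-related values in this configuration concrete, here is a minimal sketch of the step arithmetic. It assumes the job keeps the finetune.py defaults that the YAML does not override (TRAIN_BATCH_SIZE of 1 and one epoch), and that leftover batches at the end of an epoch are dropped from the count; the exact rounding can differ between transformers versions. The function name is hypothetical.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch_size: int,
                    grad_accum_steps: int, epochs: int = 1) -> int:
    """Estimate optimizer updates for single-GPU training."""
    # One batch per forward/backward pass.
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # Gradients are accumulated over several batches before each update.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs


# DATASET_LIMIT=1000, TRAIN_BATCH_SIZE=1 (script default), GRADIENT_ACCUMULATION_STEPS=2
steps = optimizer_steps(1000, 1, 2)
print(steps)       # 500 optimizer updates
print(steps // 5)  # LOGGING_STEPS=5 -> about 100 log entries
```

With these values the effective batch size is 2 examples per update, so raising DATASET_LIMIT lengthens the run proportionally.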

Replace the variables in the YAML with your environment variables by running the following command:
```
envsubst < finetune-job.yaml.tmpl > finetune-job.yaml
```
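
For readers who want to see what the substitution does, the sketch below performs a similar replacement in Python with string.Template. It is an illustration rather than a drop-in replacement for envsubst: envsubst silently replaces unset variables with an empty string, whereas Template.substitute raises KeyError, which can help catch a variable you forgot to export. The function name render_template is hypothetical.

```python
from string import Template


def render_template(text: str, variables: dict) -> str:
    """Replace $NAME and ${NAME} placeholders; raise KeyError if one is missing."""
    return Template(text).substitute(variables)


example = "name: $JOB_NAME\nimage: $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME"
values = {
    "JOB_NAME": "finetuning-to-gcs-job",
    "REGION": "us-central1",        # example value, substitute your own region
    "PROJECT_ID": "my-project",     # example value, substitute your own project
    "AR_REPO": "codelab-finetuning-jobs",
    "IMAGE_NAME": "finetune-to-gcs",
}
print(render_template(example, values))
```

To render the real template from the shell environment, you could pass dict(os.environ) as the variables argument.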

Create the Cloud Run job:

gcloud alpha run jobs replace finetune-job.yaml

Execute the job:

gcloud alpha run jobs execute $JOB_NAME --region $REGION --async

The job takes about 10 minutes to complete. You can check its status using the link provided in the output of the previous command.

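6. Use a Cloud Run service to serve your fine-tuned model with vLLM

In this step, you'll deploy a Cloud Run service. This configuration uses Direct VPC to access the Cloud Storage bucket over the private network for faster downloads.

Note that this file contains variables that you'll fill in during a later step.

Create a file called service.yaml.tmpl: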

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: serve-gemma-sql
  labels:
    cloud.googleapis.com/location: $REGION
  annotations:
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
spec:
  template:
    metadata:
      labels:
      annotations:
        autoscaling.knative.dev/maxScale: '1'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/gpu-zonal-redundancy-disabled: 'true'
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      containers:
      - name: serve-finetuned
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250505_0916_RC00
        ports:
        - name: http1
          containerPort: 8000
        resources:
          limits:
            cpu: 8000m
            nvidia.com/gpu: '1'
            memory: 32Gi
        volumeMounts:
        - name: fuse
          mountPath: /finetune/new_model
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=/finetune/new_model
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: 'new_model'
        - name: HF_HUB_OFFLINE
          value: '1'
      volumes:
      - name: fuse
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: $BUCKET_NAME
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4

Generate service.yaml by substituting your environment variables, including the bucket name, into the template:

envsubst < service.yaml.tmpl > service.yaml

Deploy your Cloud Run service:

gcloud alpha run services replace service.yaml

7. Test your fine-tuned model

In this step, you'll prompt your model to test the fine-tuning.

Get the service URL for your Cloud Run service:

SERVICE_URL=$(gcloud run services describe serve-gemma-sql --platform managed --region $REGION --format 'value(status.url)')

Create a prompt for your model:

USER_PROMPT="Question: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)"
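
The prompt above puts the question and the context on a single line, while finetune.py trains on text where they sit on separate lines and are followed by an Answer: field. A minimal sketch of building a prompt in the training layout follows. build_prompt is a hypothetical helper, and whether this layout produces better SQL from the deployed model is an assumption that has not been verified here.

```python
def build_prompt(question: str, context: str) -> str:
    """Format a prompt like the training text in finetune.py, leaving the answer open."""
    # Training template: "Question: {question}\nContext: {context}\nAnswer: {answer}"
    return f"Question: {question}\nContext: {context}\nAnswer:"


prompt = build_prompt(
    "What are the first name and last name of all candidates?",
    "CREATE TABLE candidates (candidate_id VARCHAR); "
    "CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)",
)
print(prompt)
```

If you send a multi-line prompt like this one, encode the request body with a JSON library so the newlines are escaped correctly.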

Use curl to call your service and prompt your model:

curl -X POST $SERVICE_URL/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -d @- <<EOF
{
    "prompt": "${USER_PROMPT}"
}
EOF
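
The same request can be built from Python using only the standard library, as in the sketch below. It assumes you obtain the identity token yourself (for example from gcloud auth print-identity-token) and pass it in; build_generate_request is a hypothetical helper name. Using json.dumps has the side benefit of escaping quotes and newlines in the prompt, which the shell here-document does not do.

```python
import json
import urllib.request


def build_generate_request(service_url: str, prompt: str,
                           identity_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the service's /generate endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/generate",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {identity_token}",
        },
    )


# Example with placeholder values; nothing is sent over the network here.
request = build_generate_request("https://example.invalid/", "Question: ...", "TOKEN")
print(request.full_url)
```

To send the request, pass it to urllib.request.urlopen and read the response body.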

You should see a response similar to the following:

{"predictions":["Prompt:\nQuestion: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)\nOutput:\n CREATE TABLE people_to_candidates (candidate_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people (person_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people_to_candidates (person_id VARCHAR, candidate_id"]}
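
The response echoes the prompt before the generated text. As a minimal sketch, assuming the response keeps the shape shown above (a predictions list whose first entry contains an "Output:" marker), the helper below extracts only the generated part. extract_output is a hypothetical name, and the response layout may differ with other serving images or versions.

```python
import json


def extract_output(response_text: str, marker: str = "Output:\n") -> str:
    """Return the text after the marker in the first prediction."""
    payload = json.loads(response_text)
    prediction = payload["predictions"][0]
    # partition() returns an empty separator when the marker is absent.
    _, found, tail = prediction.partition(marker)
    return tail.strip() if found else prediction.strip()


sample = json.dumps({"predictions": ["Prompt:\nQuestion: ...\nOutput:\n SELECT 1"]})
print(extract_output(sample))
```

If the marker is missing, the helper falls back to returning the whole prediction so that no text is lost.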

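8. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run documentation.

What we've covered

How to fine-tune a model using Cloud Run jobs with a GPU
How to serve a model on Cloud Run with vLLM
How to use a Direct VPC configuration for a GPU job to speed up model upload and serving

9. Clean up

To avoid inadvertent charges (for example, if the Cloud Run service is invoked more times than your monthly Cloud Run invocation allotment in the free tier), you can delete the Cloud Run service you created in Step 6.

To delete the Cloud Run service, go to the Cloud Run console (https://console.cloud.google.com/run) and delete the serve-gemma-sql service.

To delete the entire project, go to Manage Resources, select the project you used for this codelab, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.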