How to fine-tune an LLM with Cloud Run jobs

About this codelab

Last updated: Mar 21, 2025
Written by: a Googler

1. Introduction

Overview

In this codelab, you'll fine-tune the gemma-2b model on a text-to-SQL dataset so that the LLM answers natural-language questions with SQL queries. You'll then serve the fine-tuned model with vLLM on Cloud Run.

What you'll learn

  • How to fine-tune using Cloud Run jobs with GPUs
  • How to use a Direct VPC configuration for the GPU job for faster model uploading and serving

2. Before you begin

To use the GPU feature, you must request a quota increase in a supported region. The required quota is nvidia_l4_gpu_allocation_no_zonal_redundancy, under the Cloud Run Admin API. Click here to go directly to the quota request page.
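Depending on how your project is set up, you may also need the APIs used in this codelab enabled. A minimal sketch, assuming you want to enable them all at once:

gcloud services enable run.googleapis.com \
    cloudbuild.googleapis.com \
    secretmanager.googleapis.com \
    artifactregistry.googleapis.com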

3. Setup and requirements

Set the environment variables used throughout this codelab.

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>
HF_TOKEN=<YOUR_HF_TOKEN>

AR_REPO=codelab-finetuning-jobs
IMAGE_NAME=finetune-to-gcs
JOB_NAME=finetuning-to-gcs-job
BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
SECRET_ID=HF_TOKEN
SERVICE_ACCOUNT="finetune-job-sa"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com

Run the following command to create the service account:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
    --display-name="Cloud Run job to access HF_TOKEN Secret ID"

Store your Hugging Face access token with Secret Manager.

For more information on creating and using secrets, see the Secret Manager documentation.

gcloud secrets create $SECRET_ID \
    --replication-policy="automatic"

printf $HF_TOKEN | gcloud secrets versions add $SECRET_ID --data-file=-

You'll see output similar to:

Created secret [HF_TOKEN].
Created version [1] of the secret [HF_TOKEN].
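As an optional sanity check, you can confirm the token was stored (note that this prints the secret value to your terminal):

gcloud secrets versions access latest --secret=$SECRET_ID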

Grant the Secret Manager Secret Accessor role to the service account you just created:

gcloud secrets add-iam-policy-binding $SECRET_ID \
    --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role='roles/secretmanager.secretAccessor'

Create a bucket to host the fine-tuned model:

gsutil mb -l $REGION gs://$BUCKET_NAME

Then grant the service account access to the bucket:

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin
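If you want to double-check the binding, you can inspect the bucket's IAM policy (an optional check):

gcloud storage buckets get-iam-policy gs://$BUCKET_NAME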

Create an Artifact Registry repository for the job:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for finetuning using CR jobs" \
    --project=$PROJECT_ID

4. Create the Cloud Run job image

In the next step, you'll create code that does the following:

  • Imports gemma-2b from Hugging Face
  • Fine-tunes gemma-2b on a text-to-SQL dataset from Hugging Face, using a single L4 GPU
  • Uploads the fine-tuned model, named new_model, to your GCS bucket

Create a directory for the fine-tuning job code.

mkdir codelab-finetuning-job
cd codelab-finetuning-job

Create a file named finetune.py:

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, PeftModel

from trl import SFTTrainer
from pathlib import Path

# GCS bucket to upload the model
bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

# The model that you want to train from the Hugging Face hub
model_name = os.getenv("MODEL_NAME", "google/gemma-2b")

# The instruction dataset to use
dataset_name = "b-mc2/sql-create-context"

# Fine-tuned model name
new_model = os.getenv("NEW_MODEL", "gemma-2b-sql")

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = int(os.getenv("LORA_R", "4"))

# Alpha parameter for LoRA scaling
lora_alpha = int(os.getenv("LORA_ALPHA", "8"))

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = True
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = int(os.getenv("TRAIN_BATCH_SIZE", "1"))

# Batch size per GPU for evaluation
per_device_eval_batch_size = int(os.getenv("EVAL_BATCH_SIZE", "2"))

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = int(os.getenv("GRADIENT_ACCUMULATION_STEPS", "1"))

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 0

# Log every X updates steps
logging_steps = int(os.getenv("LOGGING_STEPS", "50"))

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = int(os.getenv("MAX_SEQ_LENGTH", "512"))

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {'':torch.cuda.current_device()}

# Set limit to a positive number
limit = int(os.getenv("DATASET_LIMIT", "5000"))

dataset = load_dataset(dataset_name, split="train")
if limit != -1:
    dataset = dataset.shuffle(seed=42).select(range(limit))


def transform(data):
    question = data['question']
    context = data['context']
    answer = data['answer']
    template = "Question: {question}\nContext: {context}\nAnswer: {answer}"
    return {'text': template.format(question=question, context=context, answer=answer)}


transformed = dataset.map(transform)

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16")
        print("=" * 80)

# Load base model
# model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=transformed,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()

trainer.model.save_pretrained(new_model)

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Push to HF
# model.push_to_hub(new_model, check_pr=True)
# tokenizer.push_to_hub(new_model, check_pr=True)

# push to GCS

file_path_to_save_the_model = '/finetune/new_model'
model.save_pretrained(file_path_to_save_the_model)
tokenizer.save_pretrained(file_path_to_save_the_model)

Create a requirements.txt file:

accelerate==0.30.1
bitsandbytes==0.43.1
datasets==2.19.1
transformers==4.41.0
peft==0.11.1
trl==0.8.6
torch==2.3.0

Create a Dockerfile:

FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt /requirements.txt

RUN pip3 install -r requirements.txt --no-cache-dir

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED 1

CMD python3 /finetune.py --device cuda
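If you'd like to iterate on the image locally before pushing, you could build it with Docker first (optional; assumes Docker is installed on your machine):

docker build -t finetune-to-gcs .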

Build the container into your Artifact Registry repository:

gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME
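When the build finishes, you can optionally confirm the image is in the repository:

gcloud artifacts docker images list $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO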

5. Deploy and run the job

In this step, you'll create the job's YAML configuration, which uses Direct VPC egress for faster uploads to Google Cloud Storage.

Note that this file contains variables you'll update in a later step.

First, create a file named finetune-job.yaml:

apiVersion: run.googleapis.com/v1
kind: Job
metadata:
  name: finetuning-to-gcs-job
  labels:
    cloud.googleapis.com/location: us-central1
  annotations:
    run.googleapis.com/launch-stage: ALPHA
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/execution-environment: gen2
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      parallelism: 1
      taskCount: 1
      template:
        spec:
          serviceAccountName: YOUR_SERVICE_ACCOUNT_NAME@YOUR_PROJECT_ID.iam.gserviceaccount.com
          containers:
          - name: finetune-to-gcs
            image: YOUR_REGION-docker.pkg.dev/YOUR_PROJECT_ID/YOUR_AR_REPO/YOUR_IMAGE_NAME
            env:
            - name: MODEL_NAME
              value: "google/gemma-2b"
            - name: NEW_MODEL
              value: "gemma-2b-sql-finetuned"
            - name: LORA_R
              value: "8"
            - name: LORA_ALPHA
              value: "16"
            - name: TRAIN_BATCH_SIZE
              value: "1"
            - name: EVAL_BATCH_SIZE
              value: "2"
            - name: GRADIENT_ACCUMULATION_STEPS
              value: "2"
            - name: DATASET_LIMIT
              value: "1000"
            - name: MAX_SEQ_LENGTH
              value: "512"
            - name: LOGGING_STEPS
              value: "5"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  key: 'latest'
                  name: HF_TOKEN
            resources:
              limits:
                cpu: 8000m
                nvidia.com/gpu: '1'
                memory: 32Gi
            volumeMounts:
            - mountPath: /finetune/new_model
              name: finetuned_model
          volumes:
          - name: finetuned_model
            csi:
              driver: gcsfuse.run.googleapis.com
              readOnly: false
              volumeAttributes:
                bucketName: YOUR_PROJECT_ID-codelab-finetuning-jobs
          maxRetries: 3
          timeoutSeconds: '3600'
          nodeSelector:
            run.googleapis.com/accelerator: nvidia-l4

Next, run the following command to substitute the placeholders with your environment variable values:

sed -i "s/YOUR_SERVICE_ACCOUNT_NAME/$SERVICE_ACCOUNT/; s/YOUR_PROJECT_ID/$PROJECT_ID/g; s/YOUR_REGION/$REGION/; s/YOUR_AR_REPO/$AR_REPO/; s/YOUR_IMAGE_NAME/$IMAGE_NAME/" finetune-job.yaml
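A quick way to confirm the substitutions look right before creating the job:

grep -E "serviceAccountName|image:|bucketName" finetune-job.yaml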

Now create the Cloud Run job:

gcloud alpha run jobs replace finetune-job.yaml

Then execute the job. This takes approximately 10 minutes.

gcloud alpha run jobs execute $JOB_NAME --region $REGION
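While the job runs, you can monitor its executions from the CLI, and once it completes, confirm that the fine-tuned model files were written to your bucket (both checks are optional):

gcloud run jobs executions list --job $JOB_NAME --region $REGION

gcloud storage ls gs://$BUCKET_NAME/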

6. Serve the fine-tuned model with vLLM on a Cloud Run service

Create a folder for the Cloud Run service code that will serve the fine-tuned model:

cd ..
mkdir codelab-finetuning-service
cd codelab-finetuning-service

Create a service.yaml file.

This configuration uses Direct VPC to download from the GCS bucket over a private network for faster downloads.

Note that this file contains variables you'll update in a later step.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: serve-gemma2b-sql
  labels:
    cloud.googleapis.com/location: us-central1
  annotations:
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
spec:
  template:
    metadata:
      labels:
      annotations:
        autoscaling.knative.dev/maxScale: '5'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
    spec:
      containers:
      - name: serve-finetuned
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        ports:
        - name: http1
          containerPort: 8000
        resources:
          limits:
            cpu: 8000m
            nvidia.com/gpu: '1'
            memory: 32Gi
        volumeMounts:
        - name: fuse
          mountPath: /finetune/new_model
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=/finetune/new_model
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: 'new_model'
        - name: HF_HUB_OFFLINE
          value: '1'
      volumes:
      - name: fuse
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: YOUR_BUCKET_NAME
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4

Update the service.yaml file with your bucket name:

sed -i "s/YOUR_BUCKET_NAME/$BUCKET_NAME/" service.yaml

Then deploy the Cloud Run service:

gcloud alpha run services replace service.yaml
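Deployment can take a few minutes while the GPU instance starts and the model loads from the bucket. You can optionally check the rollout status with:

gcloud run services describe serve-gemma2b-sql --region $REGION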

7. Test the fine-tuned model

First, get the service URL of your Cloud Run service.

SERVICE_URL=$(gcloud run services describe serve-gemma2b-sql --platform managed --region $REGION --format 'value(status.url)')

Create a prompt for the model.

USER_PROMPT="Question: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)"

Then call the service with curl:

curl -X POST $SERVICE_URL/generate \
  -H "Content-Type: application/json" \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -d @- <<EOF
{
    "prompt": "${USER_PROMPT}",
    "temperature": 0.1,
    "top_p": 1.0,
    "max_tokens": 56
}
EOF

You'll see a response similar to:

{"predictions":["Prompt:\nQuestion: What are the first name and last name of all candidates? Context: CREATE TABLE candidates (candidate_id VARCHAR); CREATE TABLE people (first_name VARCHAR, last_name VARCHAR, person_id VARCHAR)\nOutput:\n CREATE TABLE people_to_candidates (candidate_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people (person_id VARCHAR, person_id VARCHAR) CREATE TABLE people_to_people_to_candidates (person_id VARCHAR, candidate_id"]}