
How to fine-tune an LLM using Cloud Run jobs

1. Introduction

Overview

In this codelab, you'll use Cloud Run jobs to fine-tune a Gemma 3 model, then serve the result on Cloud Run using vLLM.

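What you'll do

Using the KomeijiForce/Text2Emoji dataset, introduced in "EmojiLM: Modeling the New Emoji Language", you'll train the model to respond to a specific phrase with a specific result.

After training, the model responds to a sentence prefixed with "Translate to emoji: " by returning a series of emoji that corresponds to that sentence.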
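
The prompt convention described above can be sketched in a few lines of Python. This is only an illustration: the helper name `build_prompt` is hypothetical, and only the "Translate to emoji: " prefix comes from this codelab.

```python
# Prefix that the fine-tuned model is trained to recognize (from this codelab).
PROMPT_PREFIX = "Translate to emoji: "


def build_prompt(sentence: str) -> str:
    """Return the sentence with the prefix the model was trained on.

    build_prompt is a hypothetical helper name used for illustration only.
    """
    return f"{PROMPT_PREFIX}{sentence.strip()}"


if __name__ == "__main__":
    print(build_prompt("I love pizza"))
    # prints: Translate to emoji: I love pizza
```

Keeping the prefix identical between training and inference matters, because the model only learns the emoji behavior for inputs that look like its training examples.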

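What you'll learn

How to fine-tune a model using Cloud Run jobs with GPUs
How to deploy a model using Cloud Run and vLLM
How to use a Direct VPC configuration for a GPU job to upload and serve the model faster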

2. Before you begin

Enable APIs

Before you start this codelab, run the following command to enable the required APIs:

gcloud services enable run.googleapis.com \
    compute.googleapis.com \
    cloudbuild.googleapis.com \
    secretmanager.googleapis.com \
    artifactregistry.googleapis.com

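GPU quota

Review the GPU quota documentation to confirm how to request quota.

If you encounter a "You do not have quota for using GPUs" error, go to g.co/cloudrun/gpu-quota to confirm your quota.

Note: If you're using a new project, it can take a few minutes after you enable the APIs for the quota to appear on the quota page.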

Hugging Face

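This codelab uses a model hosted on Hugging Face. To get this model, request a Hugging Face user access token with "Read" permission. You'll refer to it later as YOUR_HF_TOKEN.

To use the gemma-3-1b-it model, you must agree to its terms of use.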

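3. Setup and requirements

Set up the following resources:

An IAM service account and its associated IAM permissions,
A Secret Manager secret to store your Hugging Face token,
A Cloud Storage bucket to store your fine-tuned model, and
An Artifact Registry repository to store the image you'll build to fine-tune your model.

Set the environment variables for this codelab. Some variables are pre-filled for you. Specify your project ID, region, and Hugging Face token.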

export PROJECT_ID=<YOUR_PROJECT_ID>
export REGION=<YOUR_REGION>
export HF_TOKEN=<YOUR_HF_TOKEN>

export NEW_MODEL=gemma-emoji
export AR_REPO=codelab-finetuning-jobs
export IMAGE_NAME=finetune-to-gcs
export JOB_NAME=finetuning-to-gcs-job
export BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
export SECRET_ID=HF_TOKEN
export SERVICE_ACCOUNT="finetune-job-sa"
export SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
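
Two of the variables above are derived from others by shell expansion. The following Python sketch mirrors that derivation so you can check the resulting names before creating any resources. The function name `derived_names` and the project ID `my-project` are placeholders for illustration, not part of the codelab.

```python
def derived_names(project_id: str, service_account: str = "finetune-job-sa") -> dict:
    """Mirror the shell expansions for BUCKET_NAME and SERVICE_ACCOUNT_ADDRESS.

    derived_names is a hypothetical helper; the naming patterns come from
    the export statements in this codelab.
    """
    return {
        "BUCKET_NAME": f"{project_id}-codelab-finetuning-jobs",
        "SERVICE_ACCOUNT_ADDRESS": (
            f"{service_account}@{project_id}.iam.gserviceaccount.com"
        ),
    }


if __name__ == "__main__":
    # "my-project" is a placeholder project ID, not a real project.
    for name, value in derived_names("my-project").items():
        print(f"{name}={value}")
```

Because the bucket name embeds the project ID, it is unlikely to collide with bucket names in other projects, which matters since Cloud Storage bucket names are globally unique.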

Run the following command to create the service account:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Service account for fine-tuning codelab"

Use Secret Manager to store the Hugging Face access token:

gcloud secrets create $SECRET_ID \
      --replication-policy="automatic"

printf '%s' "$HF_TOKEN" | gcloud secrets versions add $SECRET_ID --data-file=-

Grant your service account the Secret Manager Secret Accessor role:

gcloud secrets add-iam-policy-binding $SECRET_ID \
  --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role='roles/secretmanager.secretAccessor'
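
Later in this codelab, the job exposes this secret to the container as the HF_TOKEN environment variable, which the Hugging Face libraries read when downloading the gated model. The finetune.py script in this codelab does not validate the token itself. As an optional, hypothetical addition, a guard like the following sketch could fail fast with a clear message if the variable is missing. The function name `require_hf_token` is an assumption made for illustration.

```python
import os
from typing import Mapping


def require_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Hugging Face token from HF_TOKEN, or raise a clear error.

    require_hf_token is a hypothetical helper, not part of the codelab script.
    """
    # Strip whitespace defensively in case the secret was stored with a newline.
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; check the secret and the job's secret mapping."
        )
    return token


# Example usage at the top of a training script:
#     token = require_hf_token()
```

Failing before the model download starts avoids spending GPU time on a job that cannot authenticate.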

Create a bucket to host the fine-tuned model:

gcloud storage buckets create -l $REGION gs://$BUCKET_NAME

Grant your service account access to the bucket:

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
  --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role=roles/storage.objectAdmin

Create an Artifact Registry repository to store the container image:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for finetuning using CR jobs" \
    --project=$PROJECT_ID

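4. Create the Cloud Run job image

In the next step, you'll create code that does the following:

Imports the Gemma model from Hugging Face
Fine-tunes the model using a dataset from Hugging Face. The job uses a single L4 GPU for fine-tuning.
Uploads the fine-tuned model, named new_model, to your Cloud Storage bucket

Create a directory for the fine-tuning job code.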

mkdir codelab-finetuning-job
cd codelab-finetuning-job

Create a file named finetune.py:

# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# Cloud Storage bucket to upload the model
bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

# The model that you want to train from the Hugging Face hub
model_name = os.getenv("MODEL_NAME", "google/gemma-3-1b-it")

# The instruction dataset to use
dataset_name = "KomeijiForce/Text2Emoji"

# Fine-tuned model name
new_model = os.getenv("NEW_MODEL", "gemma-emoji")

############################ Setup ############################################

# Load the entire model onto the current CUDA device
device_map = {"": torch.cuda.current_device()}

# Limit dataset to a random selection
dataset = load_dataset(dataset_name, split="train").shuffle(seed=42).select(range(1000))

# Setup input formats: trains the model to respond to "Translate to emoji:" with emoji output.
tokenizer = AutoTokenizer.from_pretrained(model_name)

def format_to_chat(example):
    return {
        "conversations": [
            {"role": "user", "content": f"Translate to emoji: {example['text']}"},
            {"role": "assistant", "content": example["emoji"]},
        ]
    }

formatted_dataset = dataset.map(
    format_to_chat,
    batched=False,                        # Process row by row
    remove_columns=dataset.column_names,  # Optional: Keep only the new column
)

def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize=False)
    return {"text": texts}

final_dataset = formatted_dataset.map(apply_chat_template, batched=True)

############################# Config #########################################

# Load tokenizer and model with QLoRA configuration
bnb_4bit_compute_dtype = "float16"  # Compute dtype for 4-bit base models
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4",  # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,  # Nested (double) quantization for 4-bit base models; disabled here
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

############################## Train ##########################################

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,     # Alpha parameter for LoRA scaling
    lora_dropout=0.1,  # Dropout probability for LoRA layers
    r=8,               # LoRA attention dimension
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,  # Batch size per GPU for training
    gradient_accumulation_steps=2,  # Number of update steps to accumulate the gradients for
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=5,
    learning_rate=2e-4,    # Initial learning rate (AdamW optimizer)
    weight_decay=0.001,    # Weight decay to apply to all layers except bias/LayerNorm weights
    fp16=True, bf16=False, # Train in fp16 (bf16 disabled)
    max_grad_norm=0.3,     # Maximum gradient norm (gradient clipping)
    warmup_ratio=0.03,     # Ratio of steps for a linear warmup (from 0 to learning rate)
    group_by_length=True,  # Batch sequences of similar length; saves memory and speeds up training
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=final_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,  # Maximum sequence length to use
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,       # Set to True to pack multiple short examples into one input sequence
)

trainer.train()
trainer.model.save_pretrained(new_model)

################################# Save ########################################

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Save results to the path where the job mounts the Cloud Storage bucket
file_path_to_save_the_model = "/finetune/new_model"
model.save_pretrained(file_path_to_save_the_model)
tokenizer.save_pretrained(file_path_to_save_the_model)
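
To get a feel for how long training runs, the step count implied by the settings in finetune.py can be worked out by hand. The following is a rough sketch of that arithmetic, assuming a single GPU and the values from the script (1,000 examples, batch size 1, 2 gradient accumulation steps, 1 epoch, warmup ratio 0.03). The function name `training_schedule` is hypothetical, and the exact step count reported by the trainer can differ slightly between library versions.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      epochs, warmup_ratio, num_gpus=1):
    """Approximate optimizer steps and warmup steps for a training run.

    training_schedule is a hypothetical helper for illustration; it is an
    approximation, not the trainer's own bookkeeping.
    """
    # Examples consumed per optimizer update.
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    # Forward/backward passes per epoch.
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
    # One optimizer update per grad_accum_steps batches (at least one).
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch_size": effective_batch,
        "optimizer_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    print(training_schedule(1000, 1, 2, 1, 0.03))
    # {'effective_batch_size': 2, 'optimizer_steps': 500, 'warmup_steps': 15}
```

With logging_steps=5, roughly 100 loss lines would appear in the job logs over those 500 updates.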

Create a requirements.txt file:

accelerate==0.34.2
bitsandbytes==0.45.5
datasets==2.19.1
transformers==4.51.3
peft==0.11.1
trl==0.8.6
torch==2.3.0

Create a Dockerfile:

FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt /requirements.txt

RUN pip3 install -r requirements.txt --no-cache-dir

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED=1

CMD python3 /finetune.py --device cuda

Build the container image and push it to the Artifact Registry repository:

gcloud builds submit \
  --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
  --region $REGION

5. Deploy and execute the job

In this step, you'll create the job with Direct VPC egress so that data uploads to Google Cloud Storage faster.

Create the Cloud Run job:

gcloud run jobs create $JOB_NAME \
  --region $REGION \
  --image $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
  --set-env-vars BUCKET_NAME=$BUCKET_NAME \
  --set-secrets HF_TOKEN=$SECRET_ID:latest \
  --cpu 8.0 \
  --memory 32Gi \
  --gpu 1 \
  --add-volume name=finetuned_model,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount volume=finetuned_model,mount-path=/finetune/new_model \
  --service-account $SERVICE_ACCOUNT_ADDRESS
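
The two volume flags above mount the bucket at /finetune/new_model, so files that finetune.py writes under that path become objects in the bucket. The following sketch shows the expected mapping from a local path to an object URI, assuming the bucket root is mounted at the mount path. The function name `gcs_uri_for` and the bucket name `my-bucket` are placeholders for illustration.

```python
from pathlib import PurePosixPath

# Mount path taken from the --add-volume-mount flag in this codelab.
MOUNT_PATH = PurePosixPath("/finetune/new_model")


def gcs_uri_for(local_path: str, bucket_name: str) -> str:
    """Map a file path under the mount to the object URI it should appear as.

    gcs_uri_for is a hypothetical helper; it assumes the bucket root is
    mounted at MOUNT_PATH. Raises ValueError for paths outside the mount.
    """
    relative = PurePosixPath(local_path).relative_to(MOUNT_PATH)
    return f"gs://{bucket_name}/{relative.as_posix()}"


if __name__ == "__main__":
    # "my-bucket" is a placeholder bucket name.
    print(gcs_uri_for("/finetune/new_model/config.json", "my-bucket"))
    # prints: gs://my-bucket/config.json
```

Under this assumption, the model files land at the top level of the bucket rather than in a new_model folder, which is why the serving step can mount the same bucket at the same path and find them.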

Execute the job:

gcloud run jobs execute $JOB_NAME --region $REGION --async

The job takes about 10 minutes to complete. You can check its status using the link provided in the output of the previous command.

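6. Use a Cloud Run service to serve your fine-tuned model with vLLM

In this step, you'll deploy a Cloud Run service. This configuration uses Direct VPC to access the Cloud Storage bucket over a private network for faster downloads.

Deploy the Cloud Run service: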

gcloud run deploy serve-gemma-emoji \
  --image us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250601_0916_RC01 \
  --region $REGION \
  --port 8000 \
  --set-env-vars MODEL_ID=new_model,HF_HUB_OFFLINE=1 \
  --cpu 8.0 \
  --memory 32Gi \
  --gpu 1 \
  --add-volume name=finetuned_model,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount volume=finetuned_model,mount-path=/finetune/new_model \
  --service-account $SERVICE_ACCOUNT_ADDRESS \
  --max-instances 1 \
  --command python3 \
  --args="-m,vllm.entrypoints.api_server,--model=/finetune/new_model,--tensor-parallel-size=1" \
  --no-gpu-zonal-redundancy \
  --labels=dev-tutorial=codelab-tuning \
  --no-invoker-iam-check

7. Test the fine-tuned model

In this step, you'll use curl to prompt the model and test the fine-tuning.

Get the service URL of your Cloud Run service:

SERVICE_URL=$(gcloud run services describe serve-gemma-emoji \
    --region $REGION --format 'value(status.url)')

Create a prompt for the model:

USER_PROMPT="Translate to emoji: I ate a banana for breakfast, later I'm thinking of having soup!"

Use curl to call the service and prompt the model, and use jq to filter the result:

curl -s -X POST ${SERVICE_URL}/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: bearer $(gcloud auth print-identity-token)" \
-d @- <<EOF | jq ".choices[0].message.content"
{   "model": "${NEW_MODEL}",
    "messages": [{
        "role": "user",
        "content": [ { "type": "text", "text": "${USER_PROMPT}"}]
    }]
}
EOF

You should see a response similar to the following:

🍌🤔😋🥣
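
If you prefer to call the service from Python instead of curl, the same request can be sketched with the standard library. This is a minimal sketch, assuming the endpoint path, payload shape, and response shape shown in the curl and jq command above. The function names and the ID_TOKEN environment variable are hypothetical; you would populate ID_TOKEN yourself, for example from the output of gcloud auth print-identity-token.

```python
import json
import os
import urllib.request


def build_chat_request(service_url, prompt, model, identity_token):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {identity_token}",
        },
        method="POST",
    )


def extract_reply(response_body):
    """Equivalent of the jq filter .choices[0].message.content."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # SERVICE_URL comes from the earlier step; ID_TOKEN is a hypothetical
    # variable holding an identity token that you export yourself.
    service_url = os.environ.get("SERVICE_URL")
    token = os.environ.get("ID_TOKEN")
    if service_url and token:
        request = build_chat_request(
            service_url,
            "Translate to emoji: I ate a banana for breakfast",
            os.environ.get("NEW_MODEL", "gemma-emoji"),
            token,
        )
        with urllib.request.urlopen(request) as response:
            print(extract_reply(response.read()))
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.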

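8. Congratulations!

Congratulations on completing this codelab!

We recommend reviewing the Cloud Run jobs GPU documentation.

What we've covered

How to fine-tune a model using Cloud Run jobs with GPUs
How to deploy a model using Cloud Run and vLLM
How to use a Direct VPC configuration for a GPU job to upload and serve the model faster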

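9. Clean up

To avoid unexpected charges (for example, if the Cloud Run service is inadvertently invoked more times than your monthly Cloud Run invocation allotment in the free tier), you can delete the Cloud Run service you created in step 6.

To delete the Cloud Run service, go to the Cloud Run page of the Cloud console (https://console.cloud.google.com/run) and delete the serve-gemma-emoji service.

To delete the entire project, go to Manage resources, select the project you used for this codelab, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.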