How to fine-tune an LLM using Cloud Run Jobs

1. Introduction

Overview

In this codelab, you'll use Cloud Run Jobs to fine-tune a Gemma 3 model, then serve the result on Cloud Run using vLLM.

What you'll do

Train the model to respond to a given phrase with a specific kind of output, using the KomeijiForce/Text2Emoji dataset, which was created as part of EmojiLM: Modeling the New Emoji Language.

After training, the model responds to a sentence prefixed with "Translate to emoji: " with a series of emojis that correspond to that sentence.
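
To make the target behavior concrete, here is a minimal sketch of how one dataset row could be turned into a user/assistant pair. The column names text and emoji match those used by finetune.py later in this codelab; the to_conversation helper and the sample row are hypothetical and invented for illustration only.

```python
PROMPT_PREFIX = "Translate to emoji: "


def to_conversation(row: dict) -> list:
    """Turn one dataset row into a two-turn chat: user prompt, then assistant reply."""
    return [
        {"role": "user", "content": PROMPT_PREFIX + row["text"]},
        {"role": "assistant", "content": row["emoji"]},
    ]


# Hypothetical row, made up for this example (not taken from the dataset).
sample_row = {"text": "I love pizza", "emoji": "❤️🍕"}

conversation = to_conversation(sample_row)
for turn in conversation:
    print(f'{turn["role"]}: {turn["content"]}')
```

Fine-tuning on many such pairs is what teaches the model to answer the prefixed sentence with emojis only.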

What you'll learn

How to fine-tune a model using Cloud Run Jobs GPU
How to serve a model using Cloud Run with vLLM
How to use a Direct VPC configuration for a GPU job to upload and serve the model faster

2. Before you begin

Enable APIs

Before you start this codelab, enable the following APIs by running:

gcloud services enable compute.googleapis.com \
    run.googleapis.com \
    cloudbuild.googleapis.com \
    secretmanager.googleapis.com \
    artifactregistry.googleapis.com

GPU quota

Review the GPU quota documentation to confirm how to request quota.

If you run into a "You do not have quota for using GPUs" error, confirm your quota at g.co/cloudrun/gpu-quota.

Note: If you are using a new project, it may take a few minutes between enabling the API and the quota appearing on the quota page.

Hugging Face

This codelab uses a model hosted on Hugging Face. To get this model, request a Hugging Face user access token with "Read" permission. You will reference this later as YOUR_HF_TOKEN.

To use the gemma-3-1b-it model, you must accept its usage terms.

3. Setup and requirements

Set up the following resources:

An IAM service account and its associated IAM permissions,
A Secret Manager secret to store your Hugging Face token,
A Cloud Storage bucket to store your fine-tuned model, and
An Artifact Registry repository to store the image you will build to fine-tune your model.

Set the environment variables for this codelab. Several variables are pre-populated for you. Specify your project ID, region, and Hugging Face token.

export PROJECT_ID=<YOUR_PROJECT_ID>
export REGION=<YOUR_REGION>
export HF_TOKEN=<YOUR_HF_TOKEN>

export NEW_MODEL=gemma-emoji
export AR_REPO=codelab-finetuning-jobs
export IMAGE_NAME=finetune-to-gcs
export JOB_NAME=finetuning-to-gcs-job
export BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
export SECRET_ID=HF_TOKEN
export SERVICE_ACCOUNT="finetune-job-sa"
export SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
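
Two of the variables above are derived from PROJECT_ID rather than typed in. The sketch below reproduces that composition in Python so you can preview the resulting names before creating any resources. The derive_names helper and the my-project project ID are placeholders, not values from this codelab.

```python
def derive_names(project_id: str) -> dict:
    """Compose the derived values the same way the export lines do."""
    service_account = "finetune-job-sa"
    return {
        "BUCKET_NAME": f"{project_id}-codelab-finetuning-jobs",
        "SERVICE_ACCOUNT_ADDRESS": f"{service_account}@{project_id}.iam.gserviceaccount.com",
    }


# "my-project" is a placeholder project ID.
names = derive_names("my-project")
for key, value in names.items():
    print(f"{key}={value}")

# Cloud Storage bucket names without dots are limited to 63 characters.
assert len(names["BUCKET_NAME"]) <= 63
```

Prefixing the bucket name with the project ID is a common way to avoid collisions, since Cloud Storage bucket names share one global namespace.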

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Service account for fine-tuning codelab"

Use Secret Manager to store your Hugging Face access token:
```
gcloud secrets create $SECRET_ID \
      --replication-policy="automatic"

printf '%s' "$HF_TOKEN" | gcloud secrets versions add $SECRET_ID --data-file=-
```
Grant the service account the Secret Manager Secret Accessor role:
```
gcloud secrets add-iam-policy-binding $SECRET_ID \
  --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role='roles/secretmanager.secretAccessor'
```
Create a bucket that will host your fine-tuned model:
```
gcloud storage buckets create -l $REGION gs://$BUCKET_NAME
```
Grant the service account access to the bucket:
```
gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
  --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role=roles/storage.objectAdmin
```

Create an Artifact Registry repository to store the container image:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for finetuning using CR jobs" \
    --project=$PROJECT_ID

4. Create the Cloud Run job image

In the next step, you'll create the code that does the following:

Imports the Gemma model from Hugging Face
Fine-tunes the model with the dataset from Hugging Face. The job uses a single L4 GPU for fine-tuning.
Uploads the fine-tuned model, named new_model, to your Cloud Storage bucket

Create a directory for your fine-tuning job code:
```
mkdir codelab-finetuning-job
cd codelab-finetuning-job
```

Create a file called finetune.py:

# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# Cloud Storage bucket to upload the model
bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

# The model that you want to train from the Hugging Face hub
model_name = os.getenv("MODEL_NAME", "google/gemma-3-1b-it")

# The instruction dataset to use
dataset_name = "KomeijiForce/Text2Emoji"

# Fine-tuned model name
new_model = os.getenv("NEW_MODEL", "gemma-emoji")

############################ Setup ############################################

# Load the entire model on the GPU 0
device_map = {"": torch.cuda.current_device()}

# Limit dataset to a random selection
dataset = load_dataset(dataset_name, split="train").shuffle(seed=42).select(range(1000))

# Setup input formats: trains the model to respond to "Translate to emoji:" with emoji output.
tokenizer = AutoTokenizer.from_pretrained(model_name)

def format_to_chat(example):
    return {
        "conversations": [
            {"role": "user", "content": f"Translate to emoji: {example['text']}"},
            {"role": "assistant", "content": example["emoji"]},
        ]
    }

formatted_dataset = dataset.map(
    format_to_chat,
    batched=False,                        # Process row by row
    remove_columns=dataset.column_names,  # Optional: Keep only the new column
)

def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize=False)
    return {"text": texts}

final_dataset = formatted_dataset.map(apply_chat_template, batched=True)

############################# Config #########################################

# Load tokenizer and model with QLoRA configuration
bnb_4bit_compute_dtype = "float16"  # Compute dtype for 4-bit base models
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4",  # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,  # Activate nested quantization for 4-bit base models (double quantization)
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

############################## Train ##########################################

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,     # Alpha parameter for LoRA scaling
    lora_dropout=0.1,  # Dropout probability for LoRA layers
    r=8,               # LoRA attention dimension
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,  # Batch size per GPU for training
    gradient_accumulation_steps=2,  # Number of update steps to accumulate the gradients for
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=5,
    learning_rate=2e-4,    # Initial learning rate (AdamW optimizer)
    weight_decay=0.001,    # Weight decay to apply to all layers except bias/LayerNorm weights
    fp16=True, bf16=False, # Enable fp16/bf16 training
    max_grad_norm=0.3,     # Maximum gradient norm (gradient clipping)
    warmup_ratio=0.03,     # Ratio of steps for a linear warmup (from 0 to learning rate)
    group_by_length=True,  # Group sequences of similar length into batches; saves memory and speeds up training
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=final_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,  # Maximum sequence length to use
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,       # Pack multiple short examples in the same input sequence to increase efficiency
)

trainer.train()
trainer.model.save_pretrained(new_model)

################################# Save ########################################

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Save the merged model and tokenizer to the path where the job mounts the Cloud Storage bucket
file_path_to_save_the_model = "/finetune/new_model"
model.save_pretrained(file_path_to_save_the_model)
tokenizer.save_pretrained(file_path_to_save_the_model)
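
The values passed to TrainingArguments above determine how many optimizer updates the job performs. The following back-of-the-envelope calculation uses only numbers from finetune.py. It assumes a single GPU, and the rounding shown here is an approximation; the training library's exact rounding may differ slightly.

```python
import math

# Values copied from finetune.py.
num_examples = 1000                # dataset.select(range(1000))
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
num_train_epochs = 1
warmup_ratio = 0.03
logging_steps = 5

# Examples consumed per optimizer update on one GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Optimizer updates over the whole run.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Updates spent ramping the learning rate up from 0 to 2e-4.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Approximate number of loss lines to expect in the job logs.
log_lines = total_steps // logging_steps

print(f"effective batch size: {effective_batch_size}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
print(f"logged loss lines: {log_lines}")
```

If you raise the range(1000) selection or num_train_epochs, the step count and the job's run time grow proportionally.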

Create a requirements.txt file:

accelerate==0.34.2
bitsandbytes==0.45.5
datasets==2.19.1
transformers==4.51.3
peft==0.11.1
trl==0.8.6
torch==2.3.0

Create a Dockerfile:

FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt /requirements.txt

RUN pip3 install -r /requirements.txt --no-cache-dir

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED=1

CMD python3 /finetune.py --device cuda

Build the container image in your Artifact Registry repository:
```
gcloud builds submit \
  --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
  --region $REGION
```

5. Deploy and execute the job

In this step, you'll create the job with Direct VPC egress for faster uploads to Google Cloud Storage.

Create the Cloud Run job:

gcloud run jobs create $JOB_NAME \
  --region $REGION \
  --image $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
  --set-env-vars BUCKET_NAME=$BUCKET_NAME \
  --set-secrets HF_TOKEN=$SECRET_ID:latest \
  --cpu 8.0 \
  --memory 32Gi \
  --gpu 1 \
  --add-volume name=finetuned_model,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount volume=finetuned_model,mount-path=/finetune/new_model \
  --service-account $SERVICE_ACCOUNT_ADDRESS
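
Inside the job's container, the flags above surface as ordinary environment variables and a directory: --set-env-vars and --set-secrets become environment variables, and the Cloud Storage volume appears at its mount path. Below is a minimal sketch of how a script could read them, mirroring the os.getenv defaults in finetune.py. The read_job_config helper and the values in fake_env are hypothetical and used only to simulate the job environment locally.

```python
import os
from typing import Mapping

# Mount path taken from the --add-volume-mount flag above.
MOUNT_PATH = "/finetune/new_model"


def read_job_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the settings the job sees, falling back to finetune.py's defaults."""
    return {
        "bucket_name": env.get("BUCKET_NAME", "YOUR_BUCKET_NAME"),
        "model_name": env.get("MODEL_NAME", "google/gemma-3-1b-it"),
        "new_model": env.get("NEW_MODEL", "gemma-emoji"),
        # Report only whether the token is set; never print the token itself.
        "hf_token_present": bool(env.get("HF_TOKEN")),
        "output_dir": MOUNT_PATH,
    }


# Simulated environment with placeholder values.
fake_env = {
    "BUCKET_NAME": "my-project-codelab-finetuning-jobs",
    "HF_TOKEN": "hf_placeholder",
}
config = read_job_config(fake_env)
print(config)
```

Because the job command sets only BUCKET_NAME and HF_TOKEN, MODEL_NAME and NEW_MODEL fall back to the defaults hard-coded in finetune.py.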

Execute the job:

gcloud run jobs execute $JOB_NAME --region $REGION --async

This job takes approximately 10 minutes to complete. You can check its status using the link provided in the output of the last command.

6. Use a Cloud Run service to serve your fine-tuned model with vLLM

In this step, you'll deploy a Cloud Run service. This configuration uses Direct VPC to access the Cloud Storage bucket over a private network for faster downloads.

Deploy your Cloud Run service:

gcloud run deploy serve-gemma-emoji \
  --image us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250601_0916_RC01 \
  --region $REGION \
  --port 8000 \
  --set-env-vars MODEL_ID=new_model,HF_HUB_OFFLINE=1 \
  --cpu 8.0 \
  --memory 32Gi \
  --gpu 1 \
  --add-volume name=finetuned_model,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount volume=finetuned_model,mount-path=/finetune/new_model \
  --service-account $SERVICE_ACCOUNT_ADDRESS \
  --max-instances 1 \
  --command python3 \
  --args="-m,vllm.entrypoints.api_server,--model=/finetune/new_model,--tensor-parallel-size=1" \
  --no-gpu-zonal-redundancy \
  --labels=dev-tutorial=codelab-tuning \
  --no-invoker-iam-check

7. Test your fine-tuned model

In this step, you'll prompt your model with curl to test the fine-tuning.

Get the service URL for your Cloud Run service:

SERVICE_URL=$(gcloud run services describe serve-gemma-emoji \
    --region $REGION --format 'value(status.url)')

Create the prompt for your model:

USER_PROMPT="Translate to emoji: I ate a banana for breakfast, later I'm thinking of having soup!"

Call your service using curl to prompt the model, filtering the results with jq:

curl -s -X POST ${SERVICE_URL}/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: bearer $(gcloud auth print-identity-token)" \
-d @- <<EOF | jq ".choices[0].message.content"
{   "model": "${NEW_MODEL}",
    "messages": [{
        "role": "user",
        "content": [ { "type": "text", "text": "${USER_PROMPT}"}]
    }]
}
EOF

You should see a response similar to the following:

🍌🤔😋🥣
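
If you prefer to call the service from Python, the same request can be assembled with the standard library. This is a sketch, assuming the service accepts the same path, headers, and JSON body as the curl command above and returns the response shape that the jq filter expects. The build_chat_request and extract_reply helpers, the example.invalid URL, and the TOKEN value are placeholders.

```python
import json
import urllib.request


def build_chat_request(service_url: str, model: str, prompt: str, token: str) -> urllib.request.Request:
    """Assemble the POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),  # json.dumps escapes quotes in the prompt
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract_reply(body: dict) -> str:
    """Python equivalent of the jq filter .choices[0].message.content."""
    return body["choices"][0]["message"]["content"]


prompt = "Translate to emoji: I ate a banana for breakfast, later I'm thinking of having soup!"
# Placeholder URL and token; substitute your SERVICE_URL and an identity token.
request = build_chat_request("https://example.invalid", "gemma-emoji", prompt, "TOKEN")
print(request.get_method(), request.full_url)

# To actually send it:
# with urllib.request.urlopen(request) as response:
#     print(extract_reply(json.load(response)))
```

Building the body with json.dumps instead of a shell heredoc means a prompt that contains double quotes still produces valid JSON.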

8. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run Jobs GPU documentation.

What we've covered

How to fine-tune a model using Cloud Run Jobs GPU
How to serve a model using Cloud Run with vLLM
How to use a Direct VPC configuration for a GPU job to upload and serve the model faster

9. Clean up

To avoid inadvertent charges (for example, if the Cloud Run service is invoked more times than your monthly Cloud Run invocation allotment in the free tier), you can delete the Cloud Run service you created in Step 6.

To delete the Cloud Run service, go to the Cloud Run Cloud Console at https://console.cloud.google.com/run and delete the serve-gemma-emoji service.

To delete the entire project, go to Manage Resources, select the project you created in Step 2, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.