How to fine-tune an LLM using Cloud Run jobs

1. Introduction

Overview

In this codelab, you will use Cloud Run jobs to fine-tune a Gemma 3 model, then serve the result on Cloud Run using vLLM.

What you'll do

Train a model to respond to a specific phrase with a specific result using the KomeijiForce/Text2Emoji dataset, established as part of EmojiLM: Modeling the New Emoji Language.

After training, the model responds to a sentence prefixed with "Translate to emoji: " with a series of emoji corresponding to that sentence.
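
As a rough illustration of this input/output contract, the sketch below builds one prompt/response pair in the chat-message form that the training script later in this codelab uses. The helper name `to_chat_pair` is hypothetical; the example sentence and emoji reply are the ones shown in the testing step of this codelab.

```python
PROMPT_PREFIX = "Translate to emoji: "  # phrase the fine-tuned model learns to respond to


def to_chat_pair(sentence: str, emoji: str) -> list[dict]:
    """Return one training conversation: a prefixed user turn and an emoji reply."""
    return [
        {"role": "user", "content": f"{PROMPT_PREFIX}{sentence}"},
        {"role": "assistant", "content": emoji},
    ]


pair = to_chat_pair(
    "I ate a banana for breakfast, later I'm thinking of having soup!",
    "🍌🤔😋🥣",
)
for turn in pair:
    print(f"{turn['role']}: {turn['content']}")
```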

What you'll learn

How to conduct fine-tuning using Cloud Run Jobs GPU
How to serve a model using Cloud Run with vLLM
How to use Direct VPC configuration for a GPU Job for faster upload and serving of the model

2. Before you begin

Enable APIs

Before you can start using this codelab, enable the following APIs by running:

gcloud services enable run.googleapis.com \
    compute.googleapis.com \
    cloudbuild.googleapis.com \
    secretmanager.googleapis.com \
    artifactregistry.googleapis.com

GPU Quota

Review the GPU Quota documentation to confirm how to request quota.

If you encounter any "You do not have quota for using GPUs" errors, confirm your quota on g.co/cloudrun/gpu-quota.

Note: If you are using a new project, it may take a few minutes between enabling the API and having the quotas appear in the quota page.

Hugging Face

This codelab uses a model hosted on Hugging Face. To download it, create a Hugging Face user access token with "Read" permission. You will reference this token later as YOUR_HF_TOKEN.

To use the gemma-3-1b-it model, you must agree to the usage terms.

3. Setup and Requirements

Set up the following resources:

IAM service account and associated IAM permissions,
Secret Manager secret to store your Hugging Face token,
Cloud Storage bucket to store your fine-tuned model, and
Artifact Registry repository to store the image you'll build to fine-tune your model.

Set environment variables for this codelab. We pre-populated a number of variables for you. Specify your project ID, region, and Hugging Face token.

export PROJECT_ID=<YOUR_PROJECT_ID>
export REGION=<YOUR_REGION>
export HF_TOKEN=<YOUR_HF_TOKEN>

export NEW_MODEL=gemma-emoji
export AR_REPO=codelab-finetuning-jobs
export IMAGE_NAME=finetune-to-gcs
export JOB_NAME=finetuning-to-gcs-job
export BUCKET_NAME=$PROJECT_ID-codelab-finetuning-jobs
export SECRET_ID=HF_TOKEN
export SERVICE_ACCOUNT="finetune-job-sa"
export SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
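
The derived values above follow simple naming patterns. The Python sketch below reproduces them for a placeholder project so you can see what the shell variables expand to. The values `my-project` and `us-central1` are made-up examples, the function name is hypothetical, and `IMAGE_URI` is just an illustrative label for the image path used later in the build and deploy commands.

```python
def derive_names(project_id: str, region: str) -> dict[str, str]:
    """Mirror the shell exports above for a given project ID and region."""
    ar_repo = "codelab-finetuning-jobs"
    image_name = "finetune-to-gcs"
    service_account = "finetune-job-sa"
    return {
        "BUCKET_NAME": f"{project_id}-codelab-finetuning-jobs",
        "SERVICE_ACCOUNT_ADDRESS": f"{service_account}@{project_id}.iam.gserviceaccount.com",
        # Image path used later when building and deploying the job container
        "IMAGE_URI": f"{region}-docker.pkg.dev/{project_id}/{ar_repo}/{image_name}",
    }


for name, value in derive_names("my-project", "us-central1").items():
    print(f"{name}={value}")
```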

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Service account for fine-tuning codelab"

Use Secret Manager to store your Hugging Face access token:

gcloud secrets create $SECRET_ID \
      --replication-policy="automatic"

printf '%s' "$HF_TOKEN" | gcloud secrets versions add $SECRET_ID --data-file=-

Grant your service account the role of Secret Manager Secret Accessor:

gcloud secrets add-iam-policy-binding $SECRET_ID \
  --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role='roles/secretmanager.secretAccessor'

Create a bucket that will host your fine-tuned model:

gcloud storage buckets create -l $REGION gs://$BUCKET_NAME

Grant your service account access to the bucket:

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
  --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role=roles/storage.objectAdmin

Create an Artifact Registry repository to store the container image:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for finetuning using CR jobs" \
    --project=$PROJECT_ID

4. Create the Cloud Run job image

In the next step, you'll create the code that does the following:

Imports the Gemma model from Hugging Face
Performs fine-tuning on the model with the dataset from Hugging Face. The job uses a single L4 GPU for fine-tuning.
Uploads the fine-tuned model called new_model to your Cloud Storage bucket

Create a directory for your fine-tuning job code:

mkdir codelab-finetuning-job
cd codelab-finetuning-job

Create a file called finetune.py:

# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# Cloud Storage bucket to upload the model
bucket_name = os.getenv("BUCKET_NAME", "YOUR_BUCKET_NAME")

# The model that you want to train from the Hugging Face hub
model_name = os.getenv("MODEL_NAME", "google/gemma-3-1b-it")

# The instruction dataset to use
dataset_name = "KomeijiForce/Text2Emoji"

# Fine-tuned model name
new_model = os.getenv("NEW_MODEL", "gemma-emoji")

############################ Setup ############################################

# Load the entire model on the current CUDA device (GPU 0 on a single-GPU job)
device_map = {"": torch.cuda.current_device()}

# Limit dataset to a random selection
dataset = load_dataset(dataset_name, split="train").shuffle(seed=42).select(range(1000))

# Setup input formats: trains the model to respond to "Translate to emoji:" with emoji output.
tokenizer = AutoTokenizer.from_pretrained(model_name)

def format_to_chat(example):
    return {
        "conversations": [
            {"role": "user", "content": f"Translate to emoji: {example['text']}"},
            {"role": "assistant", "content": example["emoji"]},
        ]
    }

formatted_dataset = dataset.map(
    format_to_chat,
    batched=False,                        # Process row by row
    remove_columns=dataset.column_names,  # Optional: Keep only the new column
)

def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize=False)
    return {"text": texts}

final_dataset = formatted_dataset.map(apply_chat_template, batched=True)

############################# Config #########################################

# Quantization (QLoRA) configuration used to load the base model in 4-bit
bnb_4bit_compute_dtype = "float16"  # Compute dtype for 4-bit base models
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4",  # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,  # Nested (double) quantization for 4-bit base models; disabled here
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

############################## Train ##########################################

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,     # Alpha parameter for LoRA scaling
    lora_dropout=0.1,  # Dropout probability for LoRA layers
    r=8,               # LoRA attention dimension
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,  # Batch size per GPU for training
    gradient_accumulation_steps=2,  # Number of update steps to accumulate the gradients for
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=5,
    learning_rate=2e-4,    # Initial learning rate (AdamW optimizer)
    weight_decay=0.001,    # Weight decay to apply to all layers except bias/LayerNorm weights
    fp16=True, bf16=False, # Train in fp16 mixed precision (bf16 disabled)
    max_grad_norm=0.3,     # Maximum gradient norm (gradient clipping)
    warmup_ratio=0.03,     # Ratio of steps for a linear warmup (from 0 to learning rate)
    group_by_length=True,  # Group sequences of similar length into batches; saves memory and speeds up training
    lr_scheduler_type="cosine",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=final_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,  # Maximum sequence length to use
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,       # Do not pack multiple short examples into one input sequence
)

trainer.train()
trainer.model.save_pretrained(new_model)

################################# Save ########################################

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Save the merged model and tokenizer to the path where the job mounts the Cloud Storage bucket
file_path_to_save_the_model = "/finetune/new_model"
model.save_pretrained(file_path_to_save_the_model)
tokenizer.save_pretrained(file_path_to_save_the_model)
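
To get a feel for what these settings imply, the sketch below works through the arithmetic for the values used in finetune.py: 1,000 examples, a per-device batch size of 1, 2 gradient accumulation steps, and 1 epoch on a single GPU. It is a back-of-the-envelope estimate, not output from the trainer. The variable names are illustrative, and rounding the warmup step count up is an assumption about how the trainer converts the ratio into steps.

```python
import math

num_examples = 1000          # dataset.select(range(1000))
per_device_batch_size = 1    # per_device_train_batch_size
grad_accum_steps = 2         # gradient_accumulation_steps
num_gpus = 1                 # the job requests a single GPU
num_epochs = 1               # num_train_epochs
warmup_ratio = 0.03
lora_alpha, lora_r = 16, 8

# Examples consumed per optimizer update
effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
# Optimizer updates over the whole run
optimizer_steps = math.ceil(num_examples / effective_batch) * num_epochs
# Updates spent ramping the learning rate up (assumes rounding up)
warmup_steps = math.ceil(optimizer_steps * warmup_ratio)
# Standard LoRA scaling factor (alpha / r) applied to the adapter update
lora_scaling = lora_alpha / lora_r

print(effective_batch, optimizer_steps, warmup_steps, lora_scaling)
```

With logging_steps=5, this suggests roughly one log line for every 10 training examples.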

Create a requirements.txt file:

accelerate==0.34.2
bitsandbytes==0.45.5
datasets==2.19.1
transformers==4.51.3
peft==0.11.1
trl==0.8.6
torch==2.3.0

Create a Dockerfile:

FROM nvidia/cuda:12.6.2-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt /requirements.txt

RUN pip3 install -r /requirements.txt --no-cache-dir

COPY finetune.py /finetune.py

ENV PYTHONUNBUFFERED=1

CMD ["python3", "/finetune.py"]

Build the container in your Artifact Registry repository:

gcloud builds submit \
  --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
  --region $REGION

5. Deploy and execute the job

In this step, you'll create the job with direct VPC egress for faster uploads to Google Cloud Storage.

Create the Cloud Run Job:

gcloud run jobs create $JOB_NAME \
  --region $REGION \
  --image $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/$IMAGE_NAME \
  --set-env-vars BUCKET_NAME=$BUCKET_NAME \
  --set-secrets HF_TOKEN=$SECRET_ID:latest \
  --cpu 8.0 \
  --memory 32Gi \
  --gpu 1 \
  --add-volume name=finetuned_model,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount volume=finetuned_model,mount-path=/finetune/new_model \
  --service-account $SERVICE_ACCOUNT_ADDRESS
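
The --set-secrets flag exposes the secret to the container as the HF_TOKEN environment variable, which the Hugging Face libraries are expected to pick up when finetune.py downloads the gated Gemma model. If you want the job to fail fast with a clear message when that variable is missing, a check along these lines could go at the top of the script. This is an optional sketch, not part of the codelab's finetune.py, and `require_env` is a hypothetical helper.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


try:
    token = require_env("HF_TOKEN")
    # Never print the token itself; its length is enough to confirm it was injected.
    print(f"HF_TOKEN is set ({len(token)} characters)")
except RuntimeError as err:
    print(f"Configuration problem: {err}")
```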

Execute the job:

gcloud run jobs execute $JOB_NAME --region $REGION --async

The job takes around 10 minutes to complete. You can check its status using the link provided in the output of the last command.
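
Once the execution finishes, the bucket should contain the merged model and tokenizer files written by save_pretrained. The sketch below is one way to sanity-check a listing of object names, for example the output of `gcloud storage ls gs://$BUCKET_NAME`. The set of required files is an assumption based on what recent transformers versions typically write (config.json, tokenizer_config.json, and at least one .safetensors weights file). `missing_model_files` is a hypothetical helper, and the sample listing is made up for illustration.

```python
REQUIRED_FILES = {"config.json", "tokenizer_config.json"}  # assumed minimum set


def missing_model_files(object_names: list[str]) -> set[str]:
    """Return the expected model files that are absent from a bucket listing."""
    # Keep only the file name so full gs:// URLs and bare names both work
    basenames = {name.rstrip("/").rsplit("/", 1)[-1] for name in object_names}
    missing = REQUIRED_FILES - basenames
    if not any(name.endswith(".safetensors") for name in basenames):
        missing.add("*.safetensors")
    return missing


# Made-up listing for illustration; substitute the real output for your bucket
listing = [
    "gs://example-bucket/config.json",
    "gs://example-bucket/model.safetensors",
    "gs://example-bucket/tokenizer_config.json",
]
print(missing_model_files(listing) or "all expected files present")
```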

6. Use a Cloud Run service to serve your fine-tuned model with vLLM

In this step, you will deploy a Cloud Run service. This configuration uses Direct VPC to access the Cloud Storage bucket over a private network for faster downloads.

Deploy your Cloud Run Service:

gcloud run deploy serve-gemma-emoji \
  --image us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250601_0916_RC01 \
  --region $REGION \
  --port 8000 \
  --set-env-vars MODEL_ID=new_model,HF_HUB_OFFLINE=1 \
  --cpu 8.0 \
  --memory 32Gi \
  --gpu 1 \
  --add-volume name=finetuned_model,type=cloud-storage,bucket=$BUCKET_NAME \
  --add-volume-mount volume=finetuned_model,mount-path=/finetune/new_model \
  --service-account $SERVICE_ACCOUNT_ADDRESS \
  --max-instances 1 \
  --command python3 \
  --args="-m,vllm.entrypoints.api_server,--model=/finetune/new_model,--tensor-parallel-size=1" \
  --no-gpu-zonal-redundancy \
  --labels=dev-tutorial=codelab-tuning \
  --no-invoker-iam-check
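
Both the job and the service mount the same bucket at /finetune/new_model, so the files the job wrote become visible to vLLM at the path passed to --model. As a sketch of how a path under a Cloud Storage volume mount corresponds to an object in the bucket (the helper name and the `example-bucket` name are hypothetical):

```python
from pathlib import PurePosixPath

MOUNT_PATH = PurePosixPath("/finetune/new_model")  # mount-path used by the job and the service


def to_gcs_uri(local_path: str, bucket: str) -> str:
    """Map a file under the volume mount to its gs:// object URI."""
    # relative_to raises ValueError if the path is outside the mount
    relative = PurePosixPath(local_path).relative_to(MOUNT_PATH)
    return f"gs://{bucket}/{relative}"


print(to_gcs_uri("/finetune/new_model/config.json", "example-bucket"))
```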

7. Test your fine-tuned model

In this step, you will use curl to prompt your model and test the fine-tuning.

Get the service URL for your Cloud Run service:

SERVICE_URL=$(gcloud run services describe serve-gemma-emoji \
    --region $REGION --format 'value(status.url)')

Create a prompt for your model:

USER_PROMPT="Translate to emoji: I ate a banana for breakfast, later I'm thinking of having soup!"

Call your service using curl to prompt your model, filtering the results with jq:

curl -s -X POST ${SERVICE_URL}/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
-d @- <<EOF | jq ".choices[0].message.content"
{   "model": "${NEW_MODEL}",
    "messages": [{
        "role": "user",
        "content": [ { "type": "text", "text": "${USER_PROMPT}"}]
    }]
}
EOF

You should see a response similar to the following:

🍌🤔😋🥣
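
If you prefer to call the service from Python instead of curl, the sketch below sends the same request body using only the standard library. It assumes the service exposes the same /v1/chat/completions endpoint and response shape used in the curl command above. The function names are hypothetical, and the network call only runs if you invoke `send_prompt` yourself with your service URL.

```python
import json
import urllib.request


def build_body(model: str, prompt: str) -> dict:
    """Build the same JSON payload as the curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]},
        ],
    }


def extract_reply(response: dict) -> str:
    """Equivalent of the jq filter .choices[0].message.content."""
    return response["choices"][0]["message"]["content"]


def send_prompt(service_url: str, model: str, prompt: str, token: str = "") -> str:
    """POST the prompt to the service and return the model's reply."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(
        f"{service_url}/v1/chat/completions",
        data=json.dumps(build_body(model, prompt)).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

For example, passing the value of SERVICE_URL, the model name from NEW_MODEL, and the USER_PROMPT string to `send_prompt` should return the same kind of emoji reply as the curl command.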

8. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run Jobs GPU documentation.

What we've covered

How to conduct fine-tuning using Cloud Run Jobs GPU
How to serve a model using Cloud Run with vLLM
How to use Direct VPC configuration for a GPU Job for faster upload and serving of the model

9. Clean up

To avoid inadvertent charges (for example, if the Cloud Run service is invoked more often than your monthly free-tier invocation allowance covers), delete the Cloud Run service you created in Step 6.

To delete the Cloud Run service, go to the Cloud Run Cloud Console at https://console.cloud.google.com/run and delete the serve-gemma-emoji service.

To delete the entire project, go to Manage Resources, select the project you used for this codelab, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.