Fine-tune Gemini on Vertex AI

1. Introduction

In this lab, you'll work through the complete supervised fine-tuning workflow for a Google Gemini model, adapting it to a specific task: article summarization. Large language models are powerful general-purpose tools, but fine-tuning them on a high-quality dataset of examples can improve their consistency, quality, and efficiency on a target task.

You'll use Gemini 2.5 Flash, a lightweight and cost-efficient model, and perform the fine-tuning using Vertex AI.

Architecture Overview

Here's what we'll build:

  • Cloud Shell: Your development environment.
  • Cloud Storage: Stores training/validation data in JSONL format.
  • Vertex AI Training: Manages the fine-tuning job.
  • Vertex AI Endpoint: Hosts your fine-tuned model.

What you'll learn

  • Prepare high-quality datasets for supervised fine-tuning.
  • Configure and launch fine-tuning jobs using the Vertex AI SDK for Python.
  • Evaluate models using automated metrics (ROUGE scores).
  • Compare base and fine-tuned models to quantify improvements.

2. Project setup

Google Account

If you don't already have a personal Google Account, create one now.

Use a personal account instead of a work or school account.

Sign in to the Google Cloud Console

Sign in to the Google Cloud Console using a personal Google account.

Enable Billing

Redeem $5 Google Cloud credits (optional)

To run this workshop, you need a billing account with available credit. If you plan to use your own billing account, you can skip this step.

  1. Click this link and sign in with a personal Google account. You will see a page prompting you to claim your credits.
  2. Click the CLICK HERE TO ACCESS YOUR CREDITS button. This takes you to a page where you set up your billing profile.
  3. Click Confirm.

You are now connected to a Google Cloud Platform Trial Billing Account.


Create a project (optional)

If you do not have a current project you'd like to use for this lab, create a new project here.

3. Open Cloud Shell Editor

  1. Click this link to navigate directly to Cloud Shell Editor
  2. If prompted to authorize at any point today, click Authorize to continue.
  3. If the terminal doesn't appear at the bottom of the screen, open it:
    • Click View
    • Click Terminal
  4. In the terminal, set your project with this command:
    gcloud config set project [PROJECT_ID]
    
    • Example:
      gcloud config set project lab-project-id-example
      
    • If you can't remember your project ID, you can list all your project IDs with:
      gcloud projects list
      
  5. You should see this message:
    Updated property [core/project].
    

4. Enable APIs

To use Vertex AI and other services, you need to enable the necessary APIs in your Google Cloud project.

  1. In the terminal, enable the APIs:
    • Vertex AI API (aiplatform.googleapis.com): Enables the use of Vertex AI for fine-tuning and serving models.
    • Cloud Storage API (storage.googleapis.com): Enables the storage of datasets and model artifacts.
    gcloud services enable aiplatform.googleapis.com \
        storage.googleapis.com
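
    Optionally, you can confirm that both APIs are enabled:
    gcloud services list --enabled | grep -E "aiplatform|storage"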
    

5. Set up the project environment

Create a working directory

  1. In the terminal, create a directory for your project and navigate into it.
    mkdir gemini-finetuning
    cd gemini-finetuning
    

Set up environment variables

  1. In the terminal, define the environment variables for your project. We will create an env.sh file to store these variables so they can be easily reloaded if your session disconnects.
    cat <<EOF > env.sh
    export PROJECT_ID=\$(gcloud config get-value project)
    export REGION="us-central1"
    export BUCKET_NAME="\${PROJECT_ID}-gemini-tuning"
    EOF
    
    source env.sh
    

Create a Cloud Storage Bucket

  1. In the terminal, create a bucket to store your dataset and model artifacts.
    gcloud storage buckets create gs://$BUCKET_NAME --project=$PROJECT_ID --location=$REGION
    

Set up Virtual Environment

  1. We will use uv to manage our Python environment. In the terminal, run:
    uv venv .venv
    source .venv/bin/activate
    
  2. In the terminal, install the required Python packages.
    uv pip install google-cloud-aiplatform rouge-score matplotlib pandas tqdm
    

6. Prepare the training data

Quality data is the foundation of successful fine-tuning. You will use the WikiLingua dataset, transform it into the specific JSONL format that Gemini requires, and upload it to your storage bucket.
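
Each line of the converted file is a standalone JSON object in the contents/parts structure that Gemini tuning expects (this is the same shape the conversion script below produces; the article and summary text here are illustrative):

{"contents": [{"role": "user", "parts": [{"text": "Summarize the following article: ..."}]}, {"role": "model", "parts": [{"text": "A short reference summary."}]}]}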

  1. In the terminal, create a file named prepare_data.py.
    cloudshell edit prepare_data.py
    
  2. Paste the following code into prepare_data.py.
    import json
    import os
    import subprocess

    from google.cloud import storage
    
    # Configuration
    BUCKET_NAME = os.environ["BUCKET_NAME"]
    PROJECT_ID = os.environ["PROJECT_ID"]
    
    def download_data():
        print("Downloading WikiLingua dataset...")
        # Using gsutil to copy from public bucket
        subprocess.run(["gsutil", "cp", "gs://github-repo/generative-ai/gemini/tuning/summarization/wikilingua/*", "."], check=True)
    
    def convert_to_gemini_format(input_file, output_file, max_samples=1000):
        print(f"Converting {input_file} to Gemini format (first {max_samples} samples)...")
        converted_data = []
        with open(input_file, 'r') as f:
            for i, line in enumerate(f):
                if i >= max_samples:
                    break
                obj = json.loads(line)
                messages = obj.get("messages", [])
    
                # Convert messages to Gemini 2.5 format
                # Input: {"messages": [{"role": "user", "content": "..."}, {"role": "model", "content": "..."}]}
                # Output: {"contents": [{"role": "user", "parts": [{"text": "..."}]}, {"role": "model", "parts": [{"text": "..."}]}]}
    
                contents = []
                for msg in messages:
                    role = msg["role"]
                    content = msg["content"]
                    contents.append({
                        "role": role,
                        "parts": [{"text": content}]
                    })
    
                converted_data.append({"contents": contents})
    
        with open(output_file, 'w') as f:
            for item in converted_data:
                f.write(json.dumps(item) + "\n")
    
        print(f"Saved {len(converted_data)} examples to {output_file}")
    
    def upload_to_gcs(local_file, destination_blob_name):
        print(f"Uploading {local_file} to gs://{BUCKET_NAME}/{destination_blob_name}...")
        storage_client = storage.Client(project=PROJECT_ID)
        bucket = storage_client.bucket(BUCKET_NAME)
        blob = bucket.blob(destination_blob_name)
        blob.upload_from_filename(local_file)
        print("Upload complete.")
    
    def main():
        download_data()
    
        # Process Training Data
        convert_to_gemini_format("sft_train_samples.jsonl", "train_gemini.jsonl")
        upload_to_gcs("train_gemini.jsonl", "datasets/train/train_gemini.jsonl")
    
        # Process Validation Data
        convert_to_gemini_format("sft_val_samples.jsonl", "val_gemini.jsonl")
        upload_to_gcs("val_gemini.jsonl", "datasets/val/val_gemini.jsonl")
    
        print("Data preparation complete!")
    
    if __name__ == "__main__":
        main()
    
  3. In the terminal, run the data preparation script.
    python prepare_data.py
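
  4. Optionally, verify that both files landed in your bucket (the paths match those created by the script):
    gcloud storage ls --recursive gs://$BUCKET_NAME/datasets/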
    

7. Establish baseline performance

Before fine-tuning, you need a benchmark. You will measure how well the base gemini-2.5-flash model performs on the summarization task using ROUGE scores.
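
As a quick refresher, ROUGE measures n-gram overlap between a generated summary and a reference summary. A minimal, self-contained example using the same rouge-score library the evaluation script relies on (the two strings are made up for illustration):

from rouge_score import rouge_scorer

# rouge1/rouge2 count unigram/bigram overlap; rougeL uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The cat sat on the mat.",         # reference summary
    "A cat was sitting on the mat.",   # generated summary
)
for name, score in scores.items():
    print(f"{name}: f1={score.fmeasure:.3f}")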

  1. In the terminal, create a file named evaluate.py.
    cloudshell edit evaluate.py
    
  2. Paste the following code into evaluate.py.
    import argparse
    import json
    import os
    import subprocess
    import time

    import matplotlib.pyplot as plt
    import pandas as pd
    from google.cloud import aiplatform
    from rouge_score import rouge_scorer
    from tqdm import tqdm
    from vertexai.generative_models import GenerativeModel, GenerationConfig, HarmCategory, HarmBlockThreshold
    
    # Configuration
    PROJECT_ID = os.environ["PROJECT_ID"]
    REGION = os.environ["REGION"]
    
    aiplatform.init(project=PROJECT_ID, location=REGION)
    
    def evaluate(model_name, test_file, max_samples=50, output_json="results.json"):
        print(f"Evaluating model: {model_name}")
    
        # Load Test Data
        test_df = pd.read_csv(test_file)
        test_df = test_df.head(max_samples)
    
        model = GenerativeModel(model_name)
    
        safety_settings = {
            HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
            HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_ONLY_HIGH,
            HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
            HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_ONLY_HIGH,
        }
    
        generation_config = GenerationConfig(
            temperature=0.1,
            max_output_tokens=1024,
        )
    
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        results = []
    
        for index, row in tqdm(test_df.iterrows(), total=len(test_df)):
            input_text = row['input_text']
            reference_summary = row['output_text']
    
            try:
                response = model.generate_content(
                    input_text,
                    generation_config=generation_config,
                    safety_settings=safety_settings
                )
                generated_summary = response.text
    
                scores = scorer.score(reference_summary, generated_summary)
    
                results.append({
                    "generated": generated_summary,
                    "reference": reference_summary,
                    "rouge1": scores['rouge1'].fmeasure,
                    "rouge2": scores['rouge2'].fmeasure,
                    "rougeL": scores['rougeL'].fmeasure
                })
            except Exception as e:
                print(f"Error processing example {index}: {e}")
                # Sleep briefly to avoid quota issues if hitting limits
                time.sleep(1)
    
        # Save results
        with open(output_json, 'w') as f:
            json.dump(results, f, indent=2)
    
        return pd.DataFrame(results)
    
    def plot_results(df, title, filename):
        os.makedirs("plots", exist_ok=True)
    
        metrics = ['rouge1', 'rouge2', 'rougeL']
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    
        for i, metric in enumerate(metrics):
            axes[i].hist(df[metric], bins=10, alpha=0.7, color='skyblue', edgecolor='black')
            axes[i].set_title(f'{metric} Distribution')
            axes[i].set_xlabel('Score')
            axes[i].set_ylabel('Count')
    
        plt.suptitle(title)
        plt.tight_layout()
        plt.savefig(f"plots/{filename}")
        print(f"Plot saved to plots/{filename}")
    
    def compare_results(baseline_file, tuned_file):
        with open(baseline_file, 'r') as f:
            baseline_data = pd.DataFrame(json.load(f))
        with open(tuned_file, 'r') as f:
            tuned_data = pd.DataFrame(json.load(f))
    
        print("\n--- Comparison ---")
        metrics = ['rouge1', 'rouge2', 'rougeL']
        for metric in metrics:
            base_mean = baseline_data[metric].mean()
            tuned_mean = tuned_data[metric].mean()
            diff = tuned_mean - base_mean
            print(f"{metric}: Base={base_mean:.4f}, Tuned={tuned_mean:.4f}, Diff={diff:+.4f}")
    
        # Comparative Plot
        os.makedirs("plots", exist_ok=True)
        comparison_df = pd.DataFrame({
            'Metric': metrics,
            'Baseline': [baseline_data[m].mean() for m in metrics],
            'Tuned': [tuned_data[m].mean() for m in metrics]
        })
    
        comparison_df.plot(x='Metric', y=['Baseline', 'Tuned'], kind='bar', figsize=(10, 6))
        plt.title('Baseline vs Tuned Model Performance')
        plt.ylabel('Average Score')
        plt.xticks(rotation=0)
        plt.tight_layout()
        plt.savefig("plots/comparison.png")
        print("Comparison plot saved to plots/comparison.png")
    
    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--model", type=str, default="gemini-2.5-flash", help="Model resource name")
        parser.add_argument("--baseline", type=str, help="Path to baseline results json for comparison")
        parser.add_argument("--output", type=str, default="results.json", help="Output file for results")
        args = parser.parse_args()
    
        # Ensure test data exists (it was downloaded in prepare_data step)
        if not os.path.exists("sft_test_samples.csv"):
            # Fallback download if needed
            subprocess.run(["gsutil", "cp", "gs://github-repo/generative-ai/gemini/tuning/summarization/wikilingua/sft_test_samples.csv", "."], check=True)
    
        df = evaluate(args.model, "sft_test_samples.csv", output_json=args.output)
    
        print("\n--- Results Summary ---")
        print(df.describe())
    
        plot_filename = "baseline_dist.png" if not args.baseline else "tuned_dist.png"
        plot_results(df, f"ROUGE Scores - {args.model}", plot_filename)
    
        if args.baseline:
            compare_results(args.baseline, args.output)
    
    if __name__ == "__main__":
        main()
    
  3. In the terminal, run the baseline evaluation.
    python evaluate.py --model "gemini-2.5-flash" --output "baseline.json"
    
    This will generate a baseline.json file and a plot in plots/baseline_dist.png.
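
    If you want to re-inspect the numbers later without re-running the evaluation, you can load the results into pandas (a convenience one-liner, not part of the lab scripts):
    python -c "import pandas as pd; print(pd.read_json('baseline.json')[['rouge1','rouge2','rougeL']].mean())"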

8. Configure and launch fine-tuning

Now you will launch a managed fine-tuning job on Vertex AI.

  1. In the terminal, create a file named tune.py.
    cloudshell edit tune.py
    
  2. Paste the following code into tune.py.
    import os
    import time

    from google.cloud import aiplatform
    from vertexai.preview.tuning import sft
    
    # Configuration
    PROJECT_ID = os.environ["PROJECT_ID"]
    REGION = os.environ["REGION"]
    BUCKET_NAME = os.environ["BUCKET_NAME"]
    
    aiplatform.init(project=PROJECT_ID, location=REGION)
    
    def train():
        print("Launching fine-tuning job...")
    
        sft_tuning_job = sft.train(
            source_model="gemini-2.5-flash",  # Base Gemini model to tune
            train_dataset=f"gs://{BUCKET_NAME}/datasets/train/train_gemini.jsonl",
            validation_dataset=f"gs://{BUCKET_NAME}/datasets/val/val_gemini.jsonl",
            epochs=1, # Keep it short for the lab
            adapter_size=4,
            learning_rate_multiplier=1.0,
            tuned_model_display_name="gemini-2.5-flash-wikilingua",
        )
    
        print(f"Job started: {sft_tuning_job.resource_name}")
        print("Waiting for job to complete... (this may take ~45 minutes)")
    
        # Wait for the job to complete
        while not sft_tuning_job.has_ended:
            time.sleep(60)
            sft_tuning_job.refresh()
            print(f"Status: {sft_tuning_job.state.name}")
    
        print("Job completed!")
        print(f"Tuned Model Endpoint: {sft_tuning_job.tuned_model_endpoint_name}")
        return sft_tuning_job.tuned_model_endpoint_name
    
    if __name__ == "__main__":
        train()
    
  3. In the terminal, run the fine-tuning script.
    python tune.py
    
    Note: This process can take ~45 minutes. You can monitor the job in the Vertex AI Console.

9. Understand the training code

While your job is running, let's take a closer look at the tune.py script to understand how the fine-tuning works.

Managed Supervised Fine-Tuning

The script uses the sft.train method from vertexai.preview.tuning to submit a managed tuning job. This abstracts away the complexity of provisioning infrastructure, distributing the training, and managing checkpoints.

sft_tuning_job = sft.train(
    source_model="gemini-2.5-flash",
    train_dataset=f"gs://{BUCKET_NAME}/datasets/train/train_gemini.jsonl",
    # ...
)

LoRA Configuration

Instead of manually defining a LoraConfig like you might in open-source frameworks, Vertex AI simplifies this into a few key parameters:

  • adapter_size: This parameter (set to 4 in our script) controls the rank of the LoRA adapters. A larger size allows the model to learn more complex adaptations but increases the number of trainable parameters.
  • epochs: We set this to 1 for this lab to keep training short (the job still takes roughly 45 minutes end to end). In a production scenario, you might increase this to let the model learn more deeply from your data, though you should watch out for overfitting.

Model Selection

We explicitly specify source_model="gemini-2.5-flash". Vertex AI supports tuning several Gemini models; in production, pinning an exact model version (rather than a floating alias) helps keep your pipeline stable and reproducible.
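
Because tuning jobs are long-running, it's also useful to know you can reattach to a job from a fresh Python session instead of keeping tune.py alive. A minimal sketch, assuming the resource name that tune.py printed (the ID below is a placeholder):

import os

import vertexai
from vertexai.preview.tuning import sft

vertexai.init(project=os.environ["PROJECT_ID"], location=os.environ["REGION"])

# Rehydrate a handle to the running job from its resource name (placeholder ID).
job = sft.SupervisedTuningJob(
    "projects/123456789/locations/us-central1/tuningJobs/4567890123456789012"
)
job.refresh()
print(job.state, job.has_ended)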

10. Compare models

Once the fine-tuning job is complete, you can compare the performance of your new model against the baseline.

  1. Get your tuned model endpoint. It was printed at the end of the tune.py script. It will look something like projects/.../locations/.../endpoints/....
  2. Run the evaluation script again, this time passing your tuned model and the baseline results for comparison.
    # Replace [YOUR_TUNED_MODEL_ENDPOINT] with the actual endpoint name
    export TUNED_MODEL="projects/[YOUR_PROJECT_ID]/locations/[YOUR_REGION]/endpoints/[YOUR_ENDPOINT_ID]"
    
    python evaluate.py --model "$TUNED_MODEL" --baseline "baseline.json" --output "tuned.json"
    
  3. View the results. The script will output a comparison of ROUGE scores and generate a plots/comparison.png chart showing the improvement. You can view the plots by opening the plots folder in the Cloud Shell Editor. For a quick manual spot-check of the tuned endpoint, see the sketch below.
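
You can also query the tuned endpoint interactively: GenerativeModel accepts a tuned endpoint resource name in place of a model ID. A quick sketch (assumes TUNED_MODEL is exported as above; the prompt text is a placeholder):

import os

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project=os.environ["PROJECT_ID"], location=os.environ["REGION"])

# Point GenerativeModel at the tuned endpoint instead of a base model ID.
model = GenerativeModel(os.environ["TUNED_MODEL"])
response = model.generate_content("Summarize the following article: <paste an article here>")
print(response.text)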

11. Clean up

To avoid incurring charges, delete the resources you created.

  1. In the terminal, delete the Cloud Storage bucket and the tuned model.
    gcloud storage rm -r gs://$BUCKET_NAME
    # Note: You can delete the model endpoint from the Vertex AI Console
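
If you prefer to delete the endpoint from code instead of the console, here is a sketch using the same SDK (assumes TUNED_MODEL still holds the endpoint resource name from the previous step; force=True undeploys any models first):

import os

from google.cloud import aiplatform

aiplatform.init(project=os.environ["PROJECT_ID"], location=os.environ["REGION"])

# force=True undeploys the tuned model from the endpoint before deleting it.
endpoint = aiplatform.Endpoint(os.environ["TUNED_MODEL"])
endpoint.delete(force=True)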
    

12. Congratulations!

You have successfully fine-tuned Gemini 2.5 Flash on Vertex AI!

Recap

In this lab, you:

  • Prepared a dataset in JSONL format for Gemini fine-tuning.
  • Established a baseline using the base Gemini 2.5 Flash model.
  • Launched a supervised fine-tuning job on Vertex AI.
  • Evaluated and compared the fine-tuned model against the baseline.

What's next

This lab is part of the Production-Ready AI with Google Cloud Learning Path.

Explore the full curriculum to bridge the gap from prototype to production.

Share your progress with the hashtag #ProductionReadyAI.