Fine-tune Open Source LLMs on Google Cloud

1. Introduction

In this lab, you learn to build a complete, production-grade fine-tuning pipeline for Llama 2, a popular open-source language model, using Google Kubernetes Engine (GKE). You'll learn about architectural decisions, common trade-offs, and components that mirror real-world Machine Learning Operations (MLOps) workflows.

You'll provision a GKE cluster, build a containerized training pipeline using LoRA (Low-Rank Adaptation), and run your training job on GKE.

Architecture Overview

Here's what we'll build today:

Architecture showing GKE cluster with GPU nodes, GCS bucket, and model serving

The architecture includes:

  • GKE Cluster: Manages our compute resources
  • GPU Node Pool: 1x L4 GPU (Spot) for training
  • GCS Bucket: Stores models and datasets
  • Workload Identity: Secure access between K8s and GCS

What you'll learn

  • Provision and configure a GKE cluster with features optimized for ML workloads.
  • Implement secure, keyless access from GKE to other Google Cloud services using Workload Identity.
  • Build a containerized training pipeline using Docker.
  • Fine-tune an open-source model efficiently using Parameter-Efficient Fine-Tuning (PEFT) with LoRA.

2. Project setup

Google Account

If you don't already have a personal Google Account, you must create one.

Use a personal account instead of a work or school account.

Sign in to the Google Cloud Console

Sign in to the Google Cloud Console using a personal Google account.

Create a project (optional)

If you do not have a current project you'd like to use for this lab, create a new project here.

3. Open Cloud Shell Editor

  1. Click this link to navigate directly to Cloud Shell Editor
  2. If prompted to authorize at any point today, click Authorize to continue.
  3. If the terminal doesn't appear at the bottom of the screen, open it:
    • Click View
    • Click Terminal
  4. In the terminal, set your project with this command:
    gcloud config set project [PROJECT_ID]
    
    • Example:
      gcloud config set project lab-project-id-example
      
    • If you can't remember your project ID, you can list all your project IDs with:
      gcloud projects list
      
  5. You should see this message:
    Updated property [core/project].
    

4. Enable APIs

To use GKE and other services, you need to enable the necessary APIs in your Google Cloud project.

  1. In the terminal, enable the APIs:
    gcloud services enable container.googleapis.com \
        artifactregistry.googleapis.com \
        cloudbuild.googleapis.com \
        iam.googleapis.com \
        compute.googleapis.com \
        iamcredentials.googleapis.com \
        storage.googleapis.com
    

Introducing the APIs

  • Google Kubernetes Engine API (container.googleapis.com) allows you to create and manage the GKE cluster that runs your application.
  • Artifact Registry API (artifactregistry.googleapis.com) provides a secure, private repository to store your container images.
  • Cloud Build API (cloudbuild.googleapis.com) is used by the gcloud builds submit command to build your container image in the cloud.
  • IAM API (iam.googleapis.com) allows you to manage access control and identity for your Google Cloud resources.
  • Compute Engine API (compute.googleapis.com) provides secure and customizable virtual machines that run on Google's infrastructure.
  • IAM Service Account Credentials API (iamcredentials.googleapis.com) allows creating short-lived credentials for service accounts.
  • Cloud Storage API (storage.googleapis.com) allows you to store and retrieve data in the cloud, used here for model and dataset storage.
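
If you want a quick confirmation that the APIs are enabled before moving on, you can list the enabled services and filter for the ones above (optional; the grep filter is just a convenience):

gcloud services list --enabled | grep -E 'container|artifactregistry|cloudbuild|iam|compute|storage'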

5. Set up the project environment

Create a working directory

  1. In the terminal, create a directory for your project and navigate into it.
    mkdir llama-finetuning
    cd llama-finetuning
    

Set up environment variables

  1. In the terminal, create a file named env.sh to store your environment variables. This ensures you can easily reload them if your session disconnects.
    cat <<EOF > env.sh
    export PROJECT_ID=$(gcloud config get-value project)
    export CLUSTER_NAME="ml-gke"
    export GPU_NODE_POOL_NAME="gpu-pool"
    export MACHINE_TYPE="e2-standard-4"
    export GPU_MACHINE_TYPE="g2-standard-16"
    export GPU_TYPE="nvidia-l4"
    export GPU_COUNT=1
    export REGION="asia-southeast1"
    export NODE_LOCATIONS="asia-southeast1-a,asia-southeast1-b"
    EOF
    
  2. Source the file to load the variables into your current session:
    source env.sh
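
As a quick sanity check, you can print a couple of the variables to confirm they loaded into your session:

echo "Project: ${PROJECT_ID}"
echo "Cluster: ${CLUSTER_NAME} (${REGION})"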
    

6. Provision the GKE Cluster

  1. In the terminal, create the GKE cluster with a default node pool. This will take about 5 minutes.
    gcloud container clusters create $CLUSTER_NAME \
        --project=$PROJECT_ID \
        --region=$REGION \
        --release-channel=rapid \
        --machine-type=$MACHINE_TYPE \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --addons=GcsFuseCsiDriver,HttpLoadBalancing \
        --enable-image-streaming \
        --enable-ip-alias \
        --num-nodes=1 \
        --enable-autoscaling \
        --min-nodes=1 \
        --max-nodes=3
    
  2. Next, add a GPU node pool to the cluster. This node pool will be used for training the model.
    gcloud container node-pools create $GPU_NODE_POOL_NAME \
        --project=$PROJECT_ID \
        --cluster=$CLUSTER_NAME \
        --region=$REGION \
        --machine-type=$GPU_MACHINE_TYPE \
        --accelerator type=$GPU_TYPE,count=$GPU_COUNT,gpu-driver-version=latest \
        --ephemeral-storage-local-ssd=count=1 \
        --enable-autoscaling \
        --enable-image-streaming \
        --num-nodes=0 \
        --min-nodes=0 \
        --max-nodes=1 \
        --location-policy=ANY \
        --node-taints=nvidia.com/gpu=present:NoSchedule \
        --node-locations=$NODE_LOCATIONS \
        --spot
    
  3. Finally, get the credentials for your new cluster and verify that you can connect to it.
    gcloud container clusters get-credentials $CLUSTER_NAME --region=$REGION
    
    kubectl get nodes
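
(Optional) You can also confirm that both node pools exist. The GPU pool was created with zero nodes and autoscaling enabled, so you won't see a GPU node in kubectl get nodes until a GPU workload is scheduled later in the lab:

gcloud container node-pools list --cluster=$CLUSTER_NAME --region=$REGION
kubectl get nodes -L cloud.google.com/gke-accelerator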
    

7. Configure Hugging Face access

With your infrastructure ready, you now need to set up the credentials your project uses to access the model and data. In this task, you will first get a Hugging Face token.

Get a Hugging Face Token

  1. If you don't have a Hugging Face account, navigate to huggingface.co/join in a new browser tab and complete the registration process.
  2. Once registered and logged in, navigate to huggingface.co/meta-llama/Llama-2-7b-hf.
  3. Read the license terms and click the button to accept them.
  4. Navigate to your Hugging Face access tokens page at huggingface.co/settings/tokens.
  5. Click New token.
  6. For Role, select Read.
  7. For Name, enter a descriptive name (e.g., finetuning-lab).
  8. Click Create a token.
  9. Copy the generated token to your clipboard. You will need it in the next step.

Update Environment Variables

Now, let's add your Hugging Face token and a name for your GCS bucket to your env.sh file. Replace [your-hf-token] with the token you just copied.

  1. In the terminal, append the new variables to env.sh and reload them:
    cat <<EOF >> env.sh
    export HF_TOKEN="[your-hf-token]"
    export BUCKET_NAME="\${PROJECT_ID}-llama-fine-tuning"
    EOF
    
    source env.sh
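
To confirm the token was picked up without printing it to the terminal, you can check that the variable is non-empty and no longer the placeholder:

if [ -n "$HF_TOKEN" ] && [ "$HF_TOKEN" != "[your-hf-token]" ]; then
  echo "HF_TOKEN is set."
else
  echo "HF_TOKEN is missing or still the placeholder; edit env.sh and run 'source env.sh' again."
fi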
    

8. Configure Workload Identity

Next, you will set up Workload Identity, which is the recommended way to allow applications running on GKE to access Google Cloud services without needing to manage static service account keys. You can learn more in the Workload Identity documentation.

  1. First, create a Google Service Account (GSA). In the terminal, run:
    cat <<EOF >> env.sh
    export GSA_NAME="llama-fine-tuning"
    EOF
    source env.sh
    
    gcloud iam service-accounts create $GSA_NAME \
      --display-name="Llama Fine-tuning Service Account"
    
  2. Next, create the GCS bucket and grant the GSA permissions to access it:
    gcloud storage buckets create gs://$BUCKET_NAME --project=$PROJECT_ID --location=$REGION
    gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
        --member=serviceAccount:${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role=roles/storage.admin
    
  3. Now, create a Kubernetes Service Account (KSA):
    cat <<EOF >> env.sh
    export KSA_NAME="llama-workload-sa"
    export NAMESPACE="ml-workloads"
    EOF
    source env.sh
    
    kubectl create namespace $NAMESPACE
    kubectl create serviceaccount $KSA_NAME --namespace $NAMESPACE
    
  4. Finally, create the IAM policy binding between the GSA and KSA:
    gcloud iam service-accounts add-iam-policy-binding ${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/iam.workloadIdentityUser \
        --member "serviceAccount:${PROJECT_ID}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"
    
    kubectl annotate serviceaccount $KSA_NAME --namespace $NAMESPACE \
        iam.gke.io/gcp-service-account=${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
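
(Optional) To verify the Workload Identity setup, you can inspect both sides of the binding: the GSA's IAM policy should list the KSA as a workloadIdentityUser, and the KSA should carry the iam.gke.io/gcp-service-account annotation:

gcloud iam service-accounts get-iam-policy ${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
kubectl get serviceaccount $KSA_NAME --namespace $NAMESPACE -o yaml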
    

9. Stage the base model

In production ML pipelines, large models like Llama 2 (~13 GB) are typically pre-staged in Cloud Storage rather than downloaded during training. This approach provides better reliability, faster access, and avoids network issues. Google Cloud provides pre-downloaded versions of popular models in public GCS buckets, which you will use for this lab.

  1. First, let's verify you can access the Google-provided Llama 2 model:
    gcloud storage ls gs://vertex-model-garden-public-us-central1/llama2/llama2-7b-hf/
    
  2. Copy the Llama 2 model from this public bucket to your own project's bucket using the gcloud storage command. This transfer uses Google's high-speed internal network and should only take a minute or two.
    gcloud storage cp -r gs://vertex-model-garden-public-us-central1/llama2/llama2-7b-hf \
      gs://${BUCKET_NAME}/llama2-7b/
    
  3. Verify the model files were copied correctly by listing the contents of your bucket.
    gcloud storage ls --recursive --long gs://${BUCKET_NAME}/llama2-7b/llama2-7b-hf/
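
(Optional) If you also want to confirm the total size of the staged model (roughly 13 GB, as noted above), gcloud storage du can summarize it; this assumes a reasonably recent gcloud CLI:

gcloud storage du --summarize gs://${BUCKET_NAME}/llama2-7b/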
    

10. Prepare the training code

Now you will build the containerized application that fine-tunes the model. This task uses LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) technique that dramatically reduces memory requirements by only training small "adapter" layers instead of the entire model.

Now, create the Python script for the training pipeline.

  1. In the terminal, run the following command to open the train.py file:
    cloudshell edit train.py
    
  2. Paste the following code into the train.py file:
#!/usr/bin/env python3
"""Fine-tune Llama 2 with LoRA on American Stories dataset """

import os
import torch
import logging

from pathlib import Path
from datasets import load_dataset, concatenate_datasets
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from peft import get_peft_model, LoraConfig

os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["NCCL_DEBUG"] = "INFO"

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class SimpleTextDataset(torch.utils.data.Dataset):
    def __init__(self, input_ids, attention_mask):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
    
    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
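            # For causal language modeling, labels are a copy of input_ids;
            # the model shifts them internally so each position predicts the next token.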
            'labels': self.input_ids[idx].clone()
        }

def get_lora_config():
    config = {
       "r": 16,
       "lora_alpha": 32,
       "target_modules": [
          "q_proj", "k_proj", "v_proj", "o_proj",
          "gate_proj", "up_proj", "down_proj"
       ],
       "lora_dropout": 0.05,
       "task_type": "CAUSAL_LM",
    }
    return LoraConfig(**config)

def load_model_and_tokenizer(model_path):
    logger.info("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
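    # Left padding keeps real tokens at the right edge of each sequence, which is what
    # a decoder-only model expects when generating text in the inference test below.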
    tokenizer.padding_side = "left"
    
    logger.info("Loading model...")

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
        use_cache=False
    )
    
    return model, tokenizer

def prepare_dataset(tokenizer, max_length=512):
    logger.info("Loading American Stories dataset...")
    dataset = load_dataset(
        "dell-research-harvard/AmericanStories",
        "subset_years",
        year_list=["1809", "1810", "1811", "1812", "1813", "1814", "1815"],
        trust_remote_code=True
    )
    
    all_articles = []
    for year_data in dataset.values():
        all_articles.extend(year_data["article"])
    
    logger.info(f"Total articles collected: {len(all_articles)}")
    
    batch_size = 1000
    all_input_ids = []
    all_attention_masks = []
    
    for i in range(0, len(all_articles), batch_size):
        batch_articles = all_articles[i:i+batch_size]
        logger.info(f"Processing batch {i//batch_size + 1}/{(len(all_articles) + batch_size - 1)//batch_size}")
        
        encodings = tokenizer(
            batch_articles,
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )
        
        all_input_ids.append(encodings['input_ids'])
        all_attention_masks.append(encodings['attention_mask'])
    
    # Concatenate all batches
    input_ids = torch.cat(all_input_ids, dim=0)
    attention_mask = torch.cat(all_attention_masks, dim=0)
    
    logger.info(f"Total tokenized examples: {len(input_ids)}")
    
    # Create simple dataset
    dataset = SimpleTextDataset(input_ids, attention_mask)
    
    return dataset

def train_model(model, tokenizer, train_dataset, output_dir):
    logger.info(f"Train dataset size: {len(train_dataset)}")
    
    n_gpus = torch.cuda.device_count()
    logger.info(f"Available GPUs: {n_gpus}")
    
    # For multi-GPU, we can increase batch size
    per_device_batch_size = 2 if n_gpus > 1 else 1
    gradient_accumulation_steps = 2 if n_gpus > 1 else 4
    
    # Training for 250 steps
    max_steps = 250
    
    training_args = TrainingArguments(
        output_dir=output_dir,
        max_steps=max_steps,
        per_device_train_batch_size=per_device_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=2e-4,
        warmup_steps=20,
        fp16=True,
        gradient_checkpointing=True,
        logging_steps=10,
        evaluation_strategy="no",
        save_strategy="no",
        optim="adamw_torch",
        ddp_find_unused_parameters=False,
        dataloader_num_workers=0,
        remove_unused_columns=False,
        report_to=[],
        disable_tqdm=False,
        logging_first_step=True,
    )
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm=False,
            pad_to_multiple_of=8
        )
    )
    
    # Train
    logger.info(f"Starting training on {n_gpus} GPU(s)...")
    logger.info(f"Training for {max_steps} steps - approximately {(max_steps * 2.5 / 60):.1f} minutes")
    
    try:
        trainer.train()
        logger.info("Training completed successfully!")
    except Exception as e:
        logger.error(f"Training error: {e}")
        logger.info("Attempting to save model despite error...")
    
    logger.info("Saving model...")
    try:
        trainer.save_model(output_dir)
        tokenizer.save_pretrained(output_dir)
        logger.info(f"Model saved successfully to {output_dir}!")

        if not output_dir.startswith("/gcs-mount"):
            logger.info("Copying artifacts to GCS bucket...")
            gcs_target = "/gcs-mount/llama2-7b-american-stories"
            os.makedirs(gcs_target, exist_ok=True)
            return_code = os.system(f"cp -r {output_dir}/* {gcs_target}/")
            if return_code != 0:
                raise RuntimeError(f"Failed to copy model to GCS: cp command returned {return_code}")
            logger.info(f"Copied to {gcs_target}")

    except Exception as e:
        logger.error(f"Error saving model or copying to GCS: {e}")
        raise

def run_inference(model, tokenizer):
    logger.info("Running inference test...")
    
    prompt = "The year was 1812, and the"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    model.eval()
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=50, 
            do_sample=True, 
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id
        )
    
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    logger.info("-" * 50)
    logger.info(f"Input Prompt: {prompt}")
    logger.info(f"Generated Text: {generated_text}")
    logger.info("-" * 50)

def main():
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            logger.info(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    
    model_path = os.getenv('MODEL_PATH', '/gcs-mount/llama2-7b/llama2-7b-hf')
    output_path = os.getenv('OUTPUT_PATH', '/gcs-mount/llama2-7b-american-stories')
    
    # Load model and tokenizer
    model, tokenizer = load_model_and_tokenizer(model_path)
    model.enable_input_require_grads()

    # Apply LoRA
    logger.info("Applying LoRA configuration...")
    lora_config = get_lora_config()
    model = get_peft_model(model, lora_config)
    model.train()
    
    # Prepare dataset
    train_dataset = prepare_dataset(tokenizer)
    
    # Train
    train_model(model, tokenizer, train_dataset, output_path)
    
    # Run Inference
    run_inference(model, tokenizer)
    
    logger.info("Training and inference complete!")

if __name__ == "__main__":
    main()

11. Understand the training code

The train.py script orchestrates the fine-tuning process. Let's break down its key components.

Configuration

The script uses LoraConfig to define the Low-Rank Adaptation settings. LoRA significantly reduces the number of trainable parameters, allowing you to fine-tune large models on smaller GPUs.

def get_lora_config():
    config = {
       "r": 16,
       "lora_alpha": 32,
       "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", ...],
       "lora_dropout": 0.05,
       "task_type": "CAUSAL_LM",
    }
    return LoraConfig(**config)

Prepare Dataset

The prepare_dataset function loads the "American Stories" dataset and processes it into tokenized chunks. It uses a custom SimpleTextDataset to handle the input tensors efficiently.

def prepare_dataset(tokenizer, max_length=512):
    dataset = load_dataset("dell-research-harvard/AmericanStories", ...)
    # ... tokenization logic ...
    return SimpleTextDataset(input_ids, attention_mask)

Train

The train_model function sets up the Trainer with specific arguments optimized for this workload. Key parameters include:

  • gradient_accumulation_steps: Helps simulate a larger batch size without increasing memory usage.
  • fp16=True: Uses mixed precision training to reduce memory and increase speed.
  • gradient_checkpointing=True: Saves memory by recomputing activations during the backward pass instead of storing them.
  • optim="adamw_torch": Uses the standard AdamW optimizer implementation from PyTorch.

training_args = TrainingArguments(
    per_device_train_batch_size=per_device_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    fp16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
)

Inference

The run_inference function performs a quick test of the fine-tuned model using a sample prompt. It ensures the model is in evaluation mode and generates text to verify that the adapters are working correctly.

def run_inference(model, tokenizer):
    prompt = "The year was 1812, and the"
    # ... generation logic ...
    logger.info(f"Generated Text: {generated_text}")

12. Containerize the application

Now, build the training container image using Docker and push it to Google Artifact Registry.

  1. In the terminal, run the following command to open the Dockerfile file:
    cloudshell edit Dockerfile
    
  2. Paste the following code into the Dockerfile file:
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

WORKDIR /app

# Install required packages
RUN pip install --no-cache-dir \
    transformers==4.46.0 \
    datasets==3.1.0 \
    pyarrow==15.0.0 \
    peft==0.13.2 \
    accelerate==1.1.0 \
    tensorboard==2.18.0 \
    nvidia-ml-py==12.535.161 \
    scipy==1.13.1

# Copy training scripts
COPY train.py /app/

# Run training
CMD ["python", "train.py"]

Build and push the container

  1. Create the Artifact Registry repository:
    gcloud artifacts repositories create gke-finetune \
        --repository-format=docker \
        --location=$REGION \
        --description="Docker repository for Llama fine-tuning"
    
  2. Build and push the image using Cloud Build:
    gcloud builds submit --tag ${REGION}-docker.pkg.dev/${PROJECT_ID}/gke-finetune/llama-trainer:latest .
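
When the build finishes, you can confirm the image landed in the repository (optional):

gcloud artifacts docker images list ${REGION}-docker.pkg.dev/${PROJECT_ID}/gke-finetune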
    

13. Deploy the fine-tuning job

  1. Create the Kubernetes job manifest to start the fine-tuning job. In the terminal, run:
    cloudshell edit training_job.yaml
    
  2. Paste the following code into the training_job.yaml file:
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-fine-tuning
  namespace: ml-workloads
spec:
  template:
    metadata:
      annotations:
        gke-gcsfuse/volumes: "true"
        gke-gcsfuse/memory-limit: "4Gi"
    spec:
      serviceAccountName: llama-workload-sa
      restartPolicy: OnFailure
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
        - key: cloud.google.com/gke-spot
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4

      containers:
        - name: training
          image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/gke-finetune/llama-trainer:latest
          env:
            - name: MODEL_PATH
              value: "/gcs-mount/llama2-7b/llama2-7b-hf"
            - name: OUTPUT_PATH
              value: "/tmp/llama2-7b-american-stories"
            - name: NCCL_DEBUG
              value: "INFO"
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "32Gi"
            limits:
              nvidia.com/gpu: 1

          volumeMounts:
            - name: gcs-fuse
              mountPath: /gcs-mount
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: gcs-fuse
          csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: ${BUCKET_NAME}
              mountOptions: "implicit-dirs"
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi

  3. Finally, apply the Kubernetes job manifest to start the fine-tuning job on your GKE cluster.
    envsubst < training_job.yaml | kubectl apply -f -
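
The job first has to scale up a GPU node from the Spot pool, which can take a few minutes; the pod stays in Pending until that node is ready. You can watch progress from the terminal (press Ctrl+C to stop watching):

kubectl get jobs --namespace $NAMESPACE
kubectl get pods --namespace $NAMESPACE --watch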
    

14. Monitor the training job

You can monitor the progress of your training job in the Google Cloud Console.

  1. Go to the Kubernetes Engine > Workloads page.
  2. Click on the llama-fine-tuning job to view its details.
  3. The Details tab is displayed by default. You can see the GPU utilization metrics in the Resources section.
  4. Click on the Logs tab to view the training logs. You should see the training progress, including the loss and learning rate.
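
If you prefer the terminal, you can stream the same logs with kubectl; the job name and namespace below match the manifest you applied:

kubectl logs --follow job/llama-fine-tuning --namespace ml-workloads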

15. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the GKE cluster

gcloud container clusters delete $CLUSTER_NAME --region $REGION --quiet

Delete the Artifact Registry repository

gcloud artifacts repositories delete gke-finetune --location $REGION --quiet

Delete the GCS bucket

gcloud storage rm -r gs://${BUCKET_NAME}
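
Delete the service account (optional)

Deleting the cluster removes the namespace and Kubernetes Service Account, but the Google Service Account created for Workload Identity remains. If env.sh is still sourced in your session, you can remove it with:

gcloud iam service-accounts delete ${GSA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com --quiet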

16. Congratulations!

You have successfully fine-tuned an open-source LLM on GKE!

Recap

In this lab, you:

  • Provisioned a GKE cluster with GPU acceleration.
  • Configured Workload Identity for secure access to Google Cloud services.
  • Containerized a PyTorch training job using Docker and Artifact Registry.
  • Deployed a fine-tuning job using LoRA to adapt Llama 2 to a new dataset.

What's next

Google Cloud Learning Path

This lab is part of the Production-Ready AI with Google Cloud Learning Path. Explore the full curriculum to bridge the gap from prototype to production.

Share your progress with the hashtag #ProductionReadyAI.