Run inference using a Gemma model on Cloud Run with RTX 6000 Pro GPU

1. Introduction

Overview

What you'll learn

  • How to deploy a Gemma model on a Cloud Run RTX 6000 Pro GPU
  • How to download a model concurrently from Cloud Storage during container startup

2. Setup and Requirements

Set environment variables that will be used throughout this codelab:

export PROJECT_ID=<YOUR_PROJECT_ID>

export REGION=europe-west4
export SERVICE_NAME=gemma-rtx-codelab

# set the project
gcloud config set project $PROJECT_ID

Enable the APIs needed for this codelab:

gcloud services enable artifactregistry.googleapis.com \
        cloudbuild.googleapis.com \
        run.googleapis.com \
        compute.googleapis.com

Create a folder for the codelab

mkdir codelab-rtx
cd codelab-rtx

Enable Private Google Access on your subnet to optimize ML model loading from Cloud Storage. You can learn more in the GPU best practices documentation section on loading models from Cloud Storage.

gcloud compute networks subnets update default \
  --region=$REGION \
  --enable-private-ip-google-access

3. Set Up Cloud Storage

First, create a Cloud Storage bucket to store the model weights.

Create a unique bucket

# Generate a unique bucket name
export MODEL_BUCKET="${PROJECT_ID}-rtx-codelab-$(python3 -c 'import uuid; print(str(uuid.uuid4())[:8])')"
echo "Bucket name: $MODEL_BUCKET"

# Create the regional bucket
gcloud storage buckets create gs://$MODEL_BUCKET \
    --location=$REGION \
    --uniform-bucket-level-access
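As a quick local sanity check before the create call, you can verify the generated name against Cloud Storage's bucket-naming rules. The regex below is a simplified version of the real constraints, and `CANDIDATE` is a stand-in for your `$MODEL_BUCKET` value:

```shell
# Simplified local check of Cloud Storage bucket-naming rules:
# 3-63 characters; lowercase letters, digits, and dashes;
# must start and end with a letter or digit.
CANDIDATE="my-project-rtx-codelab-1a2b3c4d"   # example value
if [[ "$CANDIDATE" =~ ^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$ ]]; then
  RESULT="valid"
else
  RESULT="invalid"
fi
echo "bucket name $CANDIDATE is $RESULT"
```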

4. Retrieve Model Weights

Next, download the Gemma 3 model to a local directory and then upload it to your Cloud Storage bucket.

Install Ollama

You can run this command to install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Download Model

Create a directory for the downloaded model.

mkdir model-weights

You will use two terminal tabs for this process: one to run the Ollama server and another to pull the model.

Terminal 1 (Server):

Start the server, pointing it at the directory where the model weights will be stored. This command continues running in the foreground.

OLLAMA_MODELS=$(pwd)/model-weights ollama serve

Terminal 2 (Client): Open a new terminal tab and download the model. The client automatically talks to the running server.

# note if you wish to use a larger model, you can change this to gemma3:27b
ollama pull gemma3:1b

Back in Terminal 1: Once the download in Terminal 2 is complete, return to Terminal 1 and press Ctrl+C to stop the server.

Upload to Cloud Storage

Now upload the weights to your bucket. gcloud storage automatically handles parallel uploads for speed.

gcloud storage cp -r ./model-weights/* gs://$MODEL_BUCKET/

(Optional) Cleanup Local Weights

Since the model is now in Cloud Storage, remove the local copy.

rm -rf model-weights

5. Create the Service

First, create a folder for the service.

mkdir rtx-service
cd rtx-service

Create a Dockerfile with the following content:

FROM ollama/ollama:latest

# Install Google Cloud CLI
RUN apt-get update && apt-get install -y curl gnupg && \
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
    apt-get update && apt-get install -y google-cloud-cli && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Copy and set up the startup script
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

# Start using the entrypoint script
ENTRYPOINT ["/entrypoint.sh"]

Create a file called entrypoint.sh with the following contents:

#!/bin/bash
set -e

# Ensure the OLLAMA_MODELS directory exists
mkdir -p "$OLLAMA_MODELS"

# Download model weights from GCS if MODEL_BUCKET is set
if [ -n "$MODEL_BUCKET" ]; then
  echo "Downloading model weights from gs://$MODEL_BUCKET..."
  # gcloud storage handles concurrent downloads automatically
  gcloud storage cp -r "gs://$MODEL_BUCKET/*" "$OLLAMA_MODELS/"
else
  echo "MODEL_BUCKET not set. Skipping download."
fi

# Start Ollama
exec ollama serve
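Before building the image, you can dry-run the script's branching logic locally without calling gcloud. This is a minimal sketch that only exercises the shell logic, using a throwaway temp directory as a stand-in for /models:

```shell
# Dry run of the entrypoint's branch logic; no gcloud call is made.
export OLLAMA_MODELS="$(mktemp -d)/models"
unset MODEL_BUCKET

# Same directory setup as the entrypoint
mkdir -p "$OLLAMA_MODELS"

# Same branch as the entrypoint, but only recording the message
if [ -n "$MODEL_BUCKET" ]; then
  MSG="Downloading model weights from gs://$MODEL_BUCKET..."
else
  MSG="MODEL_BUCKET not set. Skipping download."
fi
echo "$MSG"
```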

6. Deploy to Cloud Run

In this section, you will deploy the service using gcloud run deploy. This command will build your container from source and deploy it to Cloud Run with the necessary GPU and network configurations.

Create Service Account

Create a dedicated service account for this application and grant it only the required permissions.

# Create a dedicated service account
gcloud iam service-accounts create rtx-codelab-identity \
    --display-name="RTX Codelab Identity"

# Grant permission to read from the model bucket
gcloud storage buckets add-iam-policy-binding gs://$MODEL_BUCKET \
    --member="serviceAccount:rtx-codelab-identity@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"

# Grant access to the Compute Engine network for the Cloud Run service identity
gcloud projects add-iam-policy-binding $PROJECT_ID \
     --member="serviceAccount:rtx-codelab-identity@$PROJECT_ID.iam.gserviceaccount.com" \
     --role="roles/compute.networkUser"

# Grant access to the Compute Engine network for the Cloud Run service agent
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:service-$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')@serverless-robot-prod.iam.gserviceaccount.com" \
    --role="roles/compute.networkUser"
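The member string in the last command is the Cloud Run service agent, whose email is derived from your project number. A small illustration, using a made-up project number (the command above looks up your real one with `gcloud projects describe`):

```shell
# Illustration only: how the Cloud Run service agent email is composed.
# 123456789012 is a made-up project number.
PROJECT_NUMBER="123456789012"
AGENT="service-${PROJECT_NUMBER}@serverless-robot-prod.iam.gserviceaccount.com"
echo "$AGENT"
```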

Network Configuration

For optimal performance when downloading large models, use Direct VPC egress. This lets the container reach Cloud Storage over Google's private network, bypassing the public internet and NAT gateways. The following flags are used in the gcloud run deploy command:

  • --network: Connects to the default VPC (ensure this network exists and has a subnet in your region with Private Google Access enabled).
  • --subnet: The specific subnet in your region (usually default if using the default network).
  • --vpc-egress: Set to all-traffic to force all egress traffic through the VPC.

Deployment Command

gcloud beta run deploy $SERVICE_NAME \
    --source . \
    --region $REGION \
    --project $PROJECT_ID \
    --no-allow-unauthenticated \
    --port 11434 \
    --service-account rtx-codelab-identity@$PROJECT_ID.iam.gserviceaccount.com \
    --cpu 20 --memory 80Gi \
    --gpu 1 \
    --gpu-type nvidia-rtx-pro-6000 \
    --set-env-vars MODEL_BUCKET=$MODEL_BUCKET \
    --network default \
    --subnet default \
    --vpc-egress all-traffic \
    --no-gpu-zonal-redundancy

7. Test the Service

Once deployed, you can interact with your Gemma 3 model using the Ollama API.

Get Service URL

Retrieve the URL of your deployed Cloud Run service.

SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --region $REGION --format 'value(status.url)')
echo "Service URL: $SERVICE_URL"

Run Inference

Send a prompt to the model using curl. You can set "stream": false to get the full response in a single JSON object and use jq to extract just the text.

Note: if you are using a larger model, e.g. gemma3:27b, you'll need to change the model name in the JSON below.

curl -s "$SERVICE_URL/api/generate" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "gemma3:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq -r '.response'
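If you want to vary the prompt from a shell variable, building the request body with jq avoids JSON-quoting problems. A small sketch (the model name and prompt are just examples):

```shell
# Build the request body with jq so shell variables are JSON-escaped safely.
MODEL="gemma3:1b"
PROMPT='Why is the sky blue?'
PAYLOAD=$(jq -n --arg model "$MODEL" --arg prompt "$PROMPT" \
  '{model: $model, prompt: $prompt, stream: false}')
echo "$PAYLOAD"
```

You can then pass `-d "$PAYLOAD"` to the curl command above in place of the inline JSON.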

8. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run documentation.

What we've covered

  • How to deploy a Gemma model on a Cloud Run RTX 6000 Pro GPU
  • How to download a model concurrently from Cloud Storage during container startup

9. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can either delete the project or delete the individual resources.

Option 1: Delete Resources

Delete the Cloud Run Service

gcloud run services delete $SERVICE_NAME \
      --region $REGION \
      --quiet

Delete the Service Account

gcloud iam service-accounts delete \
      rtx-codelab-identity@$PROJECT_ID.iam.gserviceaccount.com \
      --quiet

Delete the Cloud Storage Bucket

gcloud storage rm --recursive gs://$MODEL_BUCKET

Delete the Container Image

This build created a container image in Artifact Registry. By default, Cloud Run source deployments store images in the cloud-run-source-deploy repository in your region. You can find the image name and delete it.

List images to find the exact name:

gcloud artifacts docker images list \
    $REGION-docker.pkg.dev/$PROJECT_ID/cloud-run-source-deploy

Delete the image (replace IMAGE_NAME with the result from above)

gcloud artifacts docker images delete <IMAGE_NAME> --delete-tags

Option 2: Delete the Project

To delete the entire project, go to Manage Resources, select the project you used for this codelab, and choose Delete. If you delete the project, you'll need to switch projects in the Cloud SDK. You can view the list of all available projects by running gcloud projects list.