1. Introduction
Overview
What you'll learn
- How to deploy a Gemma model on Cloud Run with an RTX 6000 Pro GPU
- How to download a model concurrently from Cloud Storage during container startup
2. Setup and Requirements
Set environment variables that will be used throughout this codelab:
export PROJECT_ID=<YOUR_PROJECT_ID>
export REGION=europe-west4
export SERVICE_NAME=gemma-rtx-codelab
# set the project
gcloud config set project $PROJECT_ID
Enable the APIs needed for this codelab:
gcloud services enable artifactregistry.googleapis.com \
cloudbuild.googleapis.com \
run.googleapis.com \
compute.googleapis.com
Create a folder for the codelab
mkdir codelab-rtx
cd codelab-rtx
Enable Private Google Access on your subnet to optimize ML model loading from Cloud Storage. You can learn more in the GPU best practices documentation, in the section on loading models from Cloud Storage.
gcloud compute networks subnets update default \
--region=$REGION \
--enable-private-ip-google-access
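To confirm the setting took effect, you can describe the subnet and check its Private Google Access flag (an optional sanity check; it should print True):
gcloud compute networks subnets describe default \
--region=$REGION \
--format='value(privateIpGoogleAccess)'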
3. Setup Cloud Storage
First, create a Cloud Storage bucket to store the model weights.
Create a unique bucket
# Generate a unique bucket name
export MODEL_BUCKET="${PROJECT_ID}-rtx-codelab-$(python3 -c 'import uuid; print(str(uuid.uuid4())[:8])')"
echo "Bucket name: $MODEL_BUCKET"
# Create the regional bucket
gcloud storage buckets create gs://$MODEL_BUCKET \
--location=$REGION \
--uniform-bucket-level-access
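Optionally, verify the bucket was created in the expected region:
gcloud storage buckets describe gs://$MODEL_BUCKET \
--format='value(location)'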
4. Retrieve Model Weights
Next, download the Gemma 3 model to a local directory and then upload it to your Cloud Storage bucket.
Install Ollama
You can run this command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
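You can confirm the installation succeeded by checking the version:
ollama --version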
Download Model
Create a directory for the downloaded model.
mkdir model-weights
You will use two terminal tabs for this process: one to run the Ollama server and another to pull the model.
Terminal 1 (Server):
Start the server, setting OLLAMA_MODELS to the directory you just created so the model weights are stored there. This command will continue to run.
OLLAMA_MODELS=$(pwd)/model-weights ollama serve
Terminal 2 (Client): Open a new terminal tab and download the model. The client automatically talks to the running server.
# note if you wish to use a larger model, you can change this to gemma3:27b
ollama pull gemma3:1b
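Optionally, confirm the download from Terminal 2 by listing the models the server knows about:
ollama list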
Back in Terminal 1: Once the download in Terminal 2 is complete, return to Terminal 1 and press Ctrl+C to stop the server.
Upload to Cloud Storage
Now upload the weights to your bucket. gcloud storage automatically handles parallel uploads for speed.
gcloud storage cp -r ./model-weights/* gs://$MODEL_BUCKET/
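Optionally, verify the upload by listing the bucket contents:
gcloud storage ls --recursive gs://$MODEL_BUCKET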
(Optional) Cleanup Local Weights
Since the model is now in Cloud Storage, remove the local copy.
rm -rf model-weights
5. Create the Service
First, create a folder for the service.
mkdir rtx-service
cd rtx-service
Create a Dockerfile with the following content:
FROM ollama/ollama:latest
# Install Google Cloud CLI
RUN apt-get update && apt-get install -y curl gnupg && \
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
apt-get update && apt-get install -y google-cloud-cli && \
apt-get clean && rm -rf /var/lib/apt/lists/*
# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434
# Store model weight files in /models
ENV OLLAMA_MODELS=/models
# Reduce logging verbosity
ENV OLLAMA_DEBUG=false
# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1
# Copy and set up the startup script
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
# Start using the entrypoint script
ENTRYPOINT ["/entrypoint.sh"]
Create a file called entrypoint.sh with the following contents:
#!/bin/bash
set -e
# Ensure OLLAMA_MODELS directory exists
mkdir -p "$OLLAMA_MODELS"
# Download model weights from GCS if MODEL_BUCKET is set
if [ -n "$MODEL_BUCKET" ]; then
echo "Downloading model weights from gs://$MODEL_BUCKET..."
# gcloud storage handles concurrent downloads automatically
gcloud storage cp -r "gs://$MODEL_BUCKET/*" "$OLLAMA_MODELS/"
else
echo "MODEL_BUCKET not set. Skipping download."
fi
# Start Ollama
exec ollama serve
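Before deploying, you can run a quick local sanity check. This is optional and assumes Docker is installed locally; Cloud Build performs the real build during deployment, and the GPU is only needed at runtime:
# Check the startup script for shell syntax errors
bash -n entrypoint.sh
# Confirm the image builds cleanly
docker build -t rtx-service-test .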
6. Deploy to Cloud Run
In this section, you will deploy the service using gcloud beta run deploy. This command builds your container from source and deploys it to Cloud Run with the necessary GPU and network configuration.
Create Service Account
Create a dedicated service account for this application and grant it only the required permissions.
# Create a dedicated service account
gcloud iam service-accounts create rtx-codelab-identity \
--display-name="RTX Codelab Identity"
# Grant permission to read from the model bucket
gcloud storage buckets add-iam-policy-binding gs://$MODEL_BUCKET \
--member="serviceAccount:rtx-codelab-identity@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
# Grant access to the Compute Engine network for the Cloud Run service identity
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:rtx-codelab-identity@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/compute.networkUser"
# Grant access to the Compute Engine network for the default service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:service-$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')@serverless-robot-prod.iam.gserviceaccount.com" \
--role="roles/compute.networkUser"
Network Configuration
For optimal performance when downloading large models, use Direct VPC Egress. This routes the container's traffic to Cloud Storage over Google's private network, bypassing the public internet and NAT gateways. The following flags are used in the deployment command below:
- --network: Connects to the default VPC (ensure this network exists and has a subnet in your region with Private Google Access enabled).
- --subnet: The specific subnet in your region (usually default if using the default network).
- --vpc-egress: Set to all-traffic to force all egress traffic through the VPC.
Deployment Command
gcloud beta run deploy $SERVICE_NAME \
--source . \
--region $REGION \
--project $PROJECT_ID \
--no-allow-unauthenticated \
--port 11434 \
--service-account rtx-codelab-identity@$PROJECT_ID.iam.gserviceaccount.com \
--cpu 20 --memory 80Gi \
--gpu 1 \
--gpu-type nvidia-rtx-pro-6000 \
--set-env-vars MODEL_BUCKET=$MODEL_BUCKET \
--network default \
--subnet default \
--vpc-egress all-traffic \
--no-gpu-zonal-redundancy
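The first deployment can take several minutes while the container builds and the model downloads at startup. You can follow the service logs to watch progress (this uses the beta log-tailing command, which may prompt you to install the log-streaming component):
gcloud beta run services logs tail $SERVICE_NAME \
--project $PROJECT_ID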
7. Test the Service
Once deployed, you can interact with your Gemma 3 model using the Ollama API.
Get Service URL
Retrieve the URL of your deployed Cloud Run service.
SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --region $REGION --format 'value(status.url)')
echo "Service URL: $SERVICE_URL"
Run Inference
Send a prompt to the model using curl. You can set "stream": false to get the full response in a single JSON object and use jq to extract just the text.
Note: if you are using a larger model, e.g. gemma3:27b, you'll need to change the model name in the JSON payload below.
curl -s "$SERVICE_URL/api/generate" \
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
-H "Content-Type: application/json" \
-d '{
"model": "gemma3:1b",
"prompt": "Why is the sky blue?",
"stream": false
}' | jq -r '.response'
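As an alternative to passing an identity token with every request, recent gcloud versions can proxy the authenticated service to localhost (in older versions the command may live under gcloud beta). The proxy keeps running until you stop it:
gcloud run services proxy $SERVICE_NAME \
--region $REGION \
--port 11434
# In another terminal, call the proxied service without an Authorization header
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "gemma3:1b", "prompt": "Why is the sky blue?", "stream": false}' | jq -r '.response'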
8. Congratulations!
Congratulations on completing the codelab!
We recommend reviewing the Cloud Run documentation.
What we've covered
- How to deploy a Gemma model on Cloud Run with an RTX 6000 Pro GPU
- How to download a model concurrently from Cloud Storage during container startup
9. Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can either delete the project or delete the individual resources.
Option 1: Delete Resources
Delete the Cloud Run Service
gcloud run services delete $SERVICE_NAME \
--region $REGION \
--quiet
Delete the Service Account
gcloud iam service-accounts delete \
rtx-codelab-identity@$PROJECT_ID.iam.gserviceaccount.com \
--quiet
Delete the Cloud Storage Bucket
gcloud storage rm --recursive gs://$MODEL_BUCKET
Delete the Container Image
The source-based deployment created a container image in Artifact Registry; gcloud run deploy --source stores it in the cloud-run-source-deploy repository in your region.
List images to find the exact name:
gcloud artifacts docker images list \
$REGION-docker.pkg.dev/$PROJECT_ID/cloud-run-source-deploy
Delete the image:
gcloud artifacts docker images delete \
$REGION-docker.pkg.dev/$PROJECT_ID/cloud-run-source-deploy/$SERVICE_NAME \
--delete-tags
Option 2: Delete the Project
To delete the entire project, go to Manage Resources, select the project you used for this codelab, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.