Run Gemma 4 inference on Cloud Run with an RTX 6000 Pro GPU and vLLM

1. Introduction

Overview

What you'll learn

How to deploy a Gemma 4 model on a Cloud Run RTX 6000 Pro GPU
How to use vLLM and Run:ai Model Streamer for faster inference and shorter instance startup.

Gemma 4 is a family of Apache 2-licensed open-weight models from Google DeepMind. The models are multimodal and multilingual, support reasoning, and use an efficient architecture. Cloud Run is a serverless environment for containers with support for GPUs.

2. Setup and Requirements

Here are the environment variables used throughout this codelab. You can save them in an environment file and "source" it. Make sure to set the value of your project ID correctly and, optionally, the region.

# Model name on HuggingFace Hub
export MODEL_NAME="google/gemma-4-31B-it"

# Cloud Run Service name
export SERVICE_NAME=gemma-rtx-vllm-codelab

# Cloud Project and Region for Cloud Run
export GOOGLE_CLOUD_PROJECT=<YOUR_PROJECT_ID> # Change to your Project Id
export GOOGLE_CLOUD_REGION=europe-west4

# Optional HuggingFace User Access Token for accessing model weights
# (https://huggingface.co/docs/hub/en/security-tokens),
# if you are loading a private model.
export HF_TOKEN=""

# Service account for Cloud Run service
export SERVICE_ACCOUNT="vllm-service-sa"
export SERVICE_ACCOUNT_EMAIL="${SERVICE_ACCOUNT}@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com"

# GCS Bucket for the model cache.
export MODEL_CACHE_BUCKET="${GOOGLE_CLOUD_PROJECT}-${GOOGLE_CLOUD_REGION}-hf-model-cache"
# Model cache location in GCS bucket
export GCS_MODEL_LOCATION="gs://${MODEL_CACHE_BUCKET}/model-cache/${MODEL_NAME}"

# VPC Network for Direct VPC Egress
export VPC_NETWORK="vllm-${GOOGLE_CLOUD_REGION}-net"
export VPC_SUBNET="vllm-${GOOGLE_CLOUD_REGION}-subnet"
export SUBNET_RANGE="10.8.0.0/26"

# set the project
gcloud config set project $GOOGLE_CLOUD_PROJECT
gcloud config set run/region $GOOGLE_CLOUD_REGION
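
Before continuing, it can help to confirm that the variables above are actually set in the current shell, since a forgotten "source" tends to surface later as a confusing gcloud error. The following is a minimal sketch; the function name check_required_vars is hypothetical and not part of any Google tool.

```shell
# Print the names of unset or empty variables; return non-zero if any are missing.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible form).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run after sourcing your environment file):
# check_required_vars MODEL_NAME SERVICE_NAME GOOGLE_CLOUD_PROJECT GOOGLE_CLOUD_REGION
# check_required_vars SERVICE_ACCOUNT_EMAIL MODEL_CACHE_BUCKET GCS_MODEL_LOCATION VPC_NETWORK VPC_SUBNET
```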

Enable APIs needed for this Codelab

gcloud services enable --project "${GOOGLE_CLOUD_PROJECT}" \
    run.googleapis.com \
    cloudbuild.googleapis.com \
    artifactregistry.googleapis.com \
    iam.googleapis.com \
    compute.googleapis.com \
    vpcaccess.googleapis.com \
    storage.googleapis.com

3. Create Service Account

If you don't specify a service account when a Cloud Run service or job is created, Cloud Run uses the Compute Engine default service account. A separate service account for the Cloud Run service is recommended to avoid running the service with excessive permissions.

Create Service Account for Cloud Run service

gcloud iam service-accounts create ${SERVICE_ACCOUNT} \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --display-name "vLLM Service Account"
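
To confirm that the account was created before later steps depend on it, a small check along these lines may be useful. This is a sketch only; service_account_exists is a hypothetical helper name, and it assumes GOOGLE_CLOUD_PROJECT is set as in the setup step.

```shell
# Return 0 if the service account exists in the project, non-zero otherwise.
service_account_exists() {
  gcloud iam service-accounts describe "$1" \
    --project "${GOOGLE_CLOUD_PROJECT}" \
    --format "value(email)" >/dev/null 2>&1
}

# Usage:
# service_account_exists "${SERVICE_ACCOUNT_EMAIL}" && echo "Service account is ready"
```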

4. Setup Cloud Storage

Create a Cloud Storage bucket to store the model weights. With Direct VPC Egress, each new Cloud Run instance can then download the weights from the bucket faster at startup.

Combined with the Run:ai Model Streamer feature in vLLM, this significantly reduces model loading time.

Create a bucket

Make sure it's a single-region bucket co-located with the Cloud Run service.

gcloud storage buckets create "gs://${MODEL_CACHE_BUCKET}" \
    --uniform-bucket-level-access --public-access-prevention \
    --project "${GOOGLE_CLOUD_PROJECT}" --location "${GOOGLE_CLOUD_REGION}"
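
Because co-location matters for startup speed, you may want to double-check where the bucket actually lives. The sketch below compares the bucket's reported location with the Cloud Run region; bucket_in_region is a hypothetical helper name, and the comparison is case-insensitive because the reported location may be upper case.

```shell
# Return 0 if the bucket's location matches the expected region (case-insensitive).
bucket_in_region() {
  bucket_location=$(gcloud storage buckets describe "gs://$1" --format "value(location)") || return 2
  expected=$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')
  actual=$(printf '%s' "$bucket_location" | tr '[:lower:]' '[:upper:]')
  [ "$actual" = "$expected" ]
}

# Usage:
# bucket_in_region "${MODEL_CACHE_BUCKET}" "${GOOGLE_CLOUD_REGION}" && echo "Bucket is co-located"
```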

5. Retrieve and Cache Model Weights

Next, download the Gemma 4 model to your Cloud Storage bucket.

Model weights are dozens of gigabytes, and downloading them to your local machine or Cloud Shell may be impossible.

Instead, we use Cloud Build with enough storage to hold model weights.

gcloud builds submit --project="${GOOGLE_CLOUD_PROJECT}" --region="${GOOGLE_CLOUD_REGION}" --no-source \
    --substitutions="_MODEL_NAME=${MODEL_NAME},_HF_TOKEN=${HF_TOKEN},_GCS_MODEL_LOCATION=${GCS_MODEL_LOCATION}" \
    --config=/dev/stdin <<'EOF'
steps:
- name: 'gcr.io/google.com/cloudsdktool/google-cloud-cli:slim'
  entrypoint: 'bash'
  args:
  - '-c'
  - |
    set -e
    pip3 install --root-user-action=ignore --break-system-packages huggingface_hub[cli]
    echo "Downloading the model..."
    if [[ "$_HF_TOKEN" != "" ]]; then
      hf download "$_MODEL_NAME" --token "$_HF_TOKEN" --local-dir "./model-cache/$_MODEL_NAME"
    else
      hf download "$_MODEL_NAME" --local-dir "./model-cache/$_MODEL_NAME"
    fi
    echo "Uploading the model..."
    gcloud storage cp -r "./model-cache/$_MODEL_NAME" "$_GCS_MODEL_LOCATION"
options:
  machineType: 'E2_HIGHCPU_32'
  diskSizeGb: 500
EOF
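
Once the build finishes, a quick look at the bucket can confirm that the weights arrived. The sketch below counts weight shards under the model prefix. It assumes the model repository stores its weights as .safetensors files; count_safetensors is a hypothetical helper name.

```shell
# Count objects under the given gs:// prefix whose names end in .safetensors.
count_safetensors() {
  gcloud storage ls "$1/**" | grep -c '\.safetensors$'
}

# Usage:
# echo "Weight shards in cache: $(count_safetensors "${GCS_MODEL_LOCATION}")"
```

A count of zero suggests the upload did not land where the service will look for it.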

6. Configure Networking for Direct VPC Egress

Direct VPC Egress configuration requires creating a network and subnet with Private Google Access enabled.

This allows Cloud Run services to connect to the set of external IP addresses used by Google APIs and services, including Cloud Storage.

Create a Network

gcloud compute networks create "$VPC_NETWORK" \
        --subnet-mode=custom \
        --bgp-routing-mode=regional \
        --project "$GOOGLE_CLOUD_PROJECT"

Create a Subnet

gcloud compute networks subnets create "$VPC_SUBNET" \
        --network="$VPC_NETWORK" \
        --region="$GOOGLE_CLOUD_REGION" \
        --range="$SUBNET_RANGE" \
        --enable-private-ip-google-access \
        --project "$GOOGLE_CLOUD_PROJECT"
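
Since the setup depends on Private Google Access being enabled on the subnet, it may be worth verifying the flag took effect. This is a minimal sketch; private_google_access_enabled is a hypothetical helper name, and it assumes GOOGLE_CLOUD_REGION and GOOGLE_CLOUD_PROJECT are set as in the setup step.

```shell
# Return 0 if the subnet reports Private Google Access as enabled.
private_google_access_enabled() {
  enabled=$(gcloud compute networks subnets describe "$1" \
    --region "${GOOGLE_CLOUD_REGION}" \
    --project "${GOOGLE_CLOUD_PROJECT}" \
    --format "value(privateIpGoogleAccess)") || return 2
  [ "$enabled" = "True" ]
}

# Usage:
# private_google_access_enabled "$VPC_SUBNET" && echo "Private Google Access is on"
```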

7. Configure Service Account Access Policy

The Cloud Run service account needs permission to access the model weights in the Cloud Storage bucket you created.

gcloud storage buckets add-iam-policy-binding "gs://${MODEL_CACHE_BUCKET}" \
    --member "serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
    --role "roles/storage.admin" \
    --project "${GOOGLE_CLOUD_PROJECT}"
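
To check that the binding is in place, one option is to read the bucket's IAM policy back and look for the service account. The sketch below does a plain text match on the JSON output; bucket_grants_member is a hypothetical helper name, and it only confirms that the member appears in the policy, not which role it holds.

```shell
# Return 0 if the given member string appears in the bucket's IAM policy.
bucket_grants_member() {
  gcloud storage buckets get-iam-policy "gs://$1" --format json \
    | grep -qF "\"$2\""
}

# Usage:
# bucket_grants_member "${MODEL_CACHE_BUCKET}" "serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
#   && echo "Service account has access"
```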

8. Initialize Configuration Variables

Define the variables for both the vLLM inference engine and the Cloud Run service.

# vLLM variables
export MAX_MODEL_LEN="32767"    # 32767 to improve concurrency. Keep it empty to use the model's maximum context length (256K)
export QUANTIZATION_TYPE="fp8"  # Model quantization for faster performance and lower memory usage.
export KV_CACHE_DTYPE="fp8"     # KV-cache quantization to save GPU memory.
export GPU_MEM_UTIL="0.95"      # Fraction of GPU memory to be used by the vLLM engine.
export TENSOR_PARALLEL_SIZE="1" # Partitioning model across GPUs (1 here as we have only 1 GPU).
export MAX_NUM_SEQS="8"         # Max concurrent requests vLLM processes in one batch.

# Cloud Run variables
export CLOUD_RUN_CPU_NUM=20
export CLOUD_RUN_MEMORY_GB=80
export CLOUD_RUN_MAX_INSTANCES=3
export CLOUD_RUN_CONCURRENCY=16
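
To see roughly why quantization matters here, a back-of-the-envelope estimate can help. The sketch below subtracts the approximate weight size from the memory vLLM is allowed to use. The inputs are assumptions, not measurements: 96 GB of GPU memory for the RTX 6000 Pro, 31 billion parameters (inferred from the model name), and about 1 byte per parameter for fp8 versus 2 for bfloat16. It ignores activations, the multimodal encoder, and any layers left unquantized, so treat the result as an upper bound on KV-cache space. The helper name estimate_kv_budget_gb is hypothetical.

```shell
# Rough estimate of GPU memory (GB) left for the KV cache after loading weights.
# $1: GPU memory in GB, $2: gpu-memory-utilization fraction,
# $3: parameters in billions, $4: bytes per parameter
estimate_kv_budget_gb() {
  awk -v mem="$1" -v util="$2" -v params="$3" -v bytes="$4" \
    'BEGIN { printf "%.1f\n", mem * util - params * bytes }'
}

# Usage:
# estimate_kv_budget_gb 96 "${GPU_MEM_UTIL}" 31 1   # fp8 weights
# estimate_kv_budget_gb 96 "${GPU_MEM_UTIL}" 31 2   # bfloat16 weights
```

With the values above, the fp8 case prints 60.2 and the bfloat16 case prints 29.2, which illustrates how much more room fp8 leaves for concurrent sequences under these assumptions.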

9. Deploy to Cloud Run

Prepare vLLM Container Command Line

vLLM takes many parameters to run large models fast and efficiently. These parameters are passed as arguments to the container deployed to Cloud Run.

CONTAINER_ARGS=(
    "vllm"
    "serve"
    "${GCS_MODEL_LOCATION}"
    "--served-model-name" "${MODEL_NAME}"
    "--enable-log-requests"
    "--enable-chunked-prefill"
    "--enable-prefix-caching"
    "--generation-config" "auto"
    "--enable-auto-tool-choice"
    "--tool-call-parser" "gemma4"
    "--reasoning-parser" "gemma4"
    "--dtype" "bfloat16"
    "--quantization" "${QUANTIZATION_TYPE}"
    "--kv-cache-dtype" "${KV_CACHE_DTYPE}"
    "--max-num-seqs" "${MAX_NUM_SEQS}"
    "--gpu-memory-utilization" "${GPU_MEM_UTIL}"
    "--tensor-parallel-size" "${TENSOR_PARALLEL_SIZE}"
    "--load-format" "runai_streamer"
    "--port" "8080"
    "--host" "0.0.0.0"
)

if [[ "${MAX_MODEL_LEN}" != "" ]]; then
    CONTAINER_ARGS+=("--max-model-len" "${MAX_MODEL_LEN}")
fi

export CONTAINER_ARGS_STR="${CONTAINER_ARGS[*]}"
echo "Deployment string: ${CONTAINER_ARGS_STR}"

Deploy Cloud Run Service

Run the following command to deploy the Cloud Run service. Note the GPU type (RTX 6000 Pro), the base image (pytorch-vllm-serve:gemma4), and the need to be authenticated to invoke the service (--no-allow-unauthenticated).

gcloud beta run deploy "${SERVICE_NAME}" \
    --image="us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
    --project "${GOOGLE_CLOUD_PROJECT}" \
    --region "${GOOGLE_CLOUD_REGION}" \
    --service-account "${SERVICE_ACCOUNT_EMAIL}" \
    --execution-environment gen2 \
    --no-allow-unauthenticated \
    --cpu="${CLOUD_RUN_CPU_NUM}" \
    --memory="${CLOUD_RUN_MEMORY_GB}Gi" \
    --gpu=1 \
    --gpu-type=nvidia-rtx-pro-6000 \
    --no-gpu-zonal-redundancy \
    --no-cpu-throttling \
    --max-instances ${CLOUD_RUN_MAX_INSTANCES} \
    --concurrency ${CLOUD_RUN_CONCURRENCY} \
    --network ${VPC_NETWORK} \
    --subnet ${VPC_SUBNET} \
    --vpc-egress all-traffic \
    --set-env-vars "MODEL_NAME=${MODEL_NAME}" \
    --set-env-vars "GOOGLE_CLOUD_PROJECT=${GOOGLE_CLOUD_PROJECT}" \
    --set-env-vars "GOOGLE_CLOUD_REGION=${GOOGLE_CLOUD_REGION}" \
    --port=8080 \
    --timeout=3600 \
    --cpu-boost \
    --startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=40,timeoutSeconds=10,periodSeconds=15 \
    --command "bash" \
    --args="^;^-c;${CONTAINER_ARGS_STR}"

The deployment takes a few minutes. Once it is done, you will have a GPU-powered service running Gemma 4 on serverless infrastructure that autoscales, including scaling to zero (no traffic, no cost).
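
The --args value in the deploy command starts with ^;^, which is gcloud's syntax for declaring an alternate list delimiter (see "gcloud topic escaping"). With ";" as the delimiter, the value becomes two arguments for bash: -c and the whole vLLM command line as a single string. The sketch below only mimics that split for illustration; it is not gcloud's parser, and split_gcloud_list is a hypothetical helper name.

```shell
# Illustration only: split a "^;^"-prefixed list value into its elements, one per line.
# The function body runs in a subshell so IFS and globbing changes do not leak out.
split_gcloud_list() (
  value=${1#'^;^'}   # drop the delimiter declaration, if present
  IFS=';'            # split on the declared delimiter
  set -f             # disable globbing while splitting
  for element in $value; do
    printf '%s\n' "$element"
  done
)

# Usage:
# split_gcloud_list "^;^-c;${CONTAINER_ARGS_STR}"
```

Because bash then parses the second element as a command line, any vLLM argument containing spaces would need its own quoting inside that string.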

10. Test the Service

Once deployed, you can interact with your Gemma 4 model using the vLLM OpenAI-compatible API.

Get Service URL

Retrieve the URL of your deployed Cloud Run service.

SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --project "${GOOGLE_CLOUD_PROJECT}" --region "${GOOGLE_CLOUD_REGION}" --format 'value(status.url)')
echo "Service URL: $SERVICE_URL"
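
If the service has scaled to zero, the first request has to wait for an instance to start and load the model. A polling loop such as the sketch below can confirm the service is ready before you send prompts. It assumes the container exposes vLLM's /health endpoint; wait_for_service and its attempt and delay parameters are hypothetical.

```shell
# Poll "$1/health" until it returns HTTP 200, up to $2 attempts, $3 seconds apart.
wait_for_service() {
  url=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -w prints only the HTTP status code; the response body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
      "$url/health") || true
    if [ "$code" = "200" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
# wait_for_service "$SERVICE_URL" && echo "Service is ready"
```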

Run Inference

Send a prompt to the model using curl.

curl -s "$SERVICE_URL/v1/chat/completions" \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "'"${MODEL_NAME}"'",
  "messages": [
    {"role": "user", "content": "Why is the sky blue?"}
  ],
  "chat_template_kwargs": {
    "enable_thinking": true
  },
  "skip_special_tokens": false
}' | jq -r '.choices[0].message.content'
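
The request above splices MODEL_NAME into a hand-written JSON string, which gets awkward once a prompt contains quotes or newlines. One possible alternative is to let jq build the body so that escaping is handled for you. This is a sketch; build_chat_payload is a hypothetical helper name, and it sends only the model and a single user message.

```shell
# Build a chat completion request body with correct JSON escaping.
# $1: model name, $2: user prompt
build_chat_payload() {
  jq -n --arg model "$1" --arg prompt "$2" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}'
}

# Usage:
# curl -s "$SERVICE_URL/v1/chat/completions" \
#   -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
#   -H "Content-Type: application/json" \
#   -d "$(build_chat_payload "${MODEL_NAME}" 'What does "serverless" mean?')" \
#   | jq -r '.choices[0].message.content'
```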

11. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run documentation.

What we've covered

How to deploy a Gemma 4 model on a Cloud Run RTX 6000 Pro GPU
How to configure Direct VPC Egress and vLLM model streaming with Cloud Storage for faster service startup.

12. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can either delete the project or delete the individual resources.

Option 1: Delete Resources

Delete the Cloud Run Service

gcloud run services delete $SERVICE_NAME \
      --project "${GOOGLE_CLOUD_PROJECT}" \
      --region "${GOOGLE_CLOUD_REGION}" \
      --quiet

Delete the Service Account

gcloud iam service-accounts delete \
      ${SERVICE_ACCOUNT_EMAIL} \
      --project "${GOOGLE_CLOUD_PROJECT}" \
      --quiet

Delete the Cloud Storage Bucket

gcloud storage rm --recursive gs://$MODEL_CACHE_BUCKET

Delete the VPC Network and Subnet

gcloud compute networks subnets delete $VPC_SUBNET \
    --region "${GOOGLE_CLOUD_REGION}" \
    --project "${GOOGLE_CLOUD_PROJECT}" \
    --quiet

gcloud compute networks delete $VPC_NETWORK \
    --project "${GOOGLE_CLOUD_PROJECT}" \
    --quiet

Option 2: Delete the Project

To delete the entire project, go to Manage Resources, select the project you created in Step 2, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list. If you'd like to stick to the command line, you can also use this command:

gcloud projects delete ${GOOGLE_CLOUD_PROJECT}