How to run LLM inference on Cloud Run GPUs with vLLM

1. Introduction

Overview

Cloud Run is a container platform on Google Cloud that makes it straightforward to run your code in a container, without requiring you to manage a cluster.

Cloud Run offers either an NVIDIA L4 or an NVIDIA RTX PRO 6000 Blackwell GPU. There is one GPU per Cloud Run instance, and Cloud Run autoscaling still applies, including scaling down to zero instances when there are no requests.

One use case for GPUs is running your own open large language models (LLMs). This tutorial walks you through deploying a service that runs an LLM.

This codelab describes how to deploy Gemma 4 open models on Cloud Run using a prebuilt container with the vLLM inference library.

What you'll learn

  • How to use GPUs on Cloud Run.
  • How to deploy Google's Gemma 4 E2B instruction-tuned model on Cloud Run using vLLM as an inference engine.

2. Setup and Requirements

Prerequisites

  • A Google Cloud project with billing enabled.
  • The gcloud CLI installed and authenticated.

3. Enable APIs and Set Environment Variables

Enable APIs

This codelab requires several APIs. Enable them by running the following command:

gcloud services enable run.googleapis.com \
    cloudbuild.googleapis.com \
    artifactregistry.googleapis.com
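
If you want to confirm the APIs are enabled, you can list the enabled services, filtering here for the Cloud Run API as an example:

gcloud services list --enabled --filter="name:run.googleapis.com"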

Set environment variables

Configure your project ID below, along with the region, service name, and service account variables used throughout this codelab.

export PROJECT_ID=<YOUR_PROJECT_ID>

export REGION=europe-west4
export SERVICE_NAME=gemma4-cr-codelab
export SERVICE_ACCOUNT_NAME=gemma4-cr-sa
export SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com
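
Optionally, set the project as your gcloud default so that commands which omit the --project flag still target it:

gcloud config set project $PROJECT_ID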

4. Create a service account

This service account is used as the Cloud Run service identity.

gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME \
  --display-name="Cloud Run gemma 4 SA"
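
You can verify the service account was created by describing it:

gcloud iam service-accounts describe $SERVICE_ACCOUNT_ADDRESS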

5. Deploy the service

To deploy Gemma models on Cloud Run, use the following gcloud CLI command with the recommended settings:

# Arguments passed to vLLM's "serve" command inside the container.
CONTAINER_ARGS=(
    "serve"
    # The model to serve.
    "google/gemma-4-E2B-it"
    "--enable-chunked-prefill"
    "--enable-prefix-caching"
    "--generation-config=auto"
    "--enable-auto-tool-choice"
    "--tool-call-parser=gemma4"
    "--reasoning-parser=gemma4"
    "--dtype=bfloat16"
    # Maximum concurrent sequences; matches the Cloud Run --concurrency flag below.
    "--max-num-seqs=64"
    # Let vLLM use most of the GPU memory for model weights and KV cache.
    "--gpu-memory-utilization=0.95"
    "--tensor-parallel-size=1"
    # Listen on the port and interface Cloud Run routes requests to.
    "--port=8080"
    "--host=0.0.0.0"
)
gcloud beta run deploy $SERVICE_NAME \
    --image "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
    --project $PROJECT_ID \
    --region $REGION \
    --execution-environment gen2 \
    --no-allow-unauthenticated \
    --cpu 20 \
    --memory 80Gi \
    --gpu 1 \
    --gpu-type nvidia-rtx-pro-6000 \
    --no-gpu-zonal-redundancy \
    --no-cpu-throttling \
    --max-instances 3 \
    --concurrency 64 \
    --timeout 600 \
    --service-account $SERVICE_ACCOUNT_ADDRESS \
    --startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=1,timeoutSeconds=240,periodSeconds=240 \
    --command "vllm" \
    --args="$(IFS=','; echo "${CONTAINER_ARGS[*]}")"
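
The first deployment can take several minutes while Cloud Run provisions a GPU instance, pulls the container image, and loads the model weights; this is why the startup probe above allows a long initial delay. You can check the readiness of the service with:

gcloud run services describe $SERVICE_NAME \
  --project $PROJECT_ID \
  --region $REGION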

6. Test the service

Once deployed, you can either use the Cloud Run dev proxy service, which automatically adds an ID token for you, or curl the service URL directly.

Using the Cloud Run dev proxy service

First, start the proxy:

gcloud run services proxy $SERVICE_NAME \
  --project $PROJECT_ID \
  --region $REGION \
  --port=9090

Leaving the proxy running, run the following command in a separate terminal tab to send a request. The proxy listens on localhost:9090.

curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "chat_template_kwargs": {
         "enable_thinking": true
     },
     "skip_special_tokens": false
  }'

You should see output similar to the following:

{
 "id": "chatcmpl-9cf1ab1450487047",
 "object": "chat.completion",
 "created": 1774904187,
 "model": "google/gemma-4-E2B-it",
 "choices": [
   {
     "index": 0,
     "message": {
       "role": "assistant",
       "content": "The short answer is a phenomenon called **Rayleigh scattering**...",
       "function_call": null,
       "tool_calls": [],
       "reasoning": "*   Question: \"Why is the sky blue?\"\n..."
     },
     "finish_reason": "stop",
     "stop_reason": 106
   }
 ],
 "usage": {
   "prompt_tokens": 21,
   "total_tokens": 877,
   "completion_tokens": 856
 }
}
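
If you have jq installed, you can extract just the model's reply from the response. For example:

curl -s http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }' | jq -r '.choices[0].message.content'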

Using the service URL directly

First, retrieve the URL for the deployed service.

SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --region $REGION --format 'value(status.url)')

Then curl the service:

curl $SERVICE_URL/v1/chat/completions \
  -H "Authorization: bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "chat_template_kwargs": {
         "enable_thinking": true
     },
     "skip_special_tokens": false
  }'
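
vLLM's OpenAI-compatible server also supports streaming responses. To receive tokens as they are generated rather than waiting for the full completion, add "stream": true to the request body:

curl $SERVICE_URL/v1/chat/completions \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": true
  }'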

7. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run documentation at https://cloud.google.com/run/docs.

What we've covered

  • How to use GPUs on Cloud Run.
  • How to deploy Google's Gemma 4 E2B model on Cloud Run using vLLM as an inference engine.

8. Clean up

To avoid inadvertent charges (for example, if the Cloud Run service is invoked more times than your monthly Cloud Run invocation allocation in the free tier), you can either delete the Cloud Run service or delete the project you created in Step 2.

To delete the Cloud Run service, go to the Cloud Run page in the Cloud Console at https://console.cloud.google.com/run and delete the gemma4-cr-codelab service. You may also want to delete the gemma4-cr-sa service account.
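
Alternatively, you can delete both from the command line:

gcloud run services delete $SERVICE_NAME --region $REGION
gcloud iam service-accounts delete $SERVICE_ACCOUNT_ADDRESS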

If you choose to delete the entire project, go to https://console.cloud.google.com/cloud-resource-manager, select the project you created in Step 2, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK configuration. You can view the list of all available projects by running gcloud projects list.
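
You can also delete the project from the command line:

gcloud projects delete $PROJECT_ID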