1. Introduction
Overview
Cloud Run is a container platform on Google Cloud that makes it straightforward to run your code in a container, without requiring you to manage a cluster.
Cloud Run offers either an NVIDIA L4 or an NVIDIA RTX PRO 6000 Blackwell GPU. There is one GPU per Cloud Run instance, and Cloud Run autoscaling still applies, including scaling down to zero instances when there are no requests.
One use case for GPUs is running your own open large language models (LLMs). This tutorial walks you through deploying a service that runs an LLM.
This codelab describes how to deploy Gemma 4 open models on Cloud Run using a prebuilt container with the vLLM inference library.
What you'll learn
- How to use GPUs on Cloud Run.
- How to deploy Google's Gemma 4 E2B instruction-tuned model on Cloud Run using vLLM as an inference engine.
2. Setup and Requirements
Prerequisites
- You are logged into the Cloud Console.
- You have previously deployed a Cloud Run service. For example, you can follow the deploy a web service from source code quickstart to get started.
3. Enable APIs and Set Environment Variables
Enable APIs
Before you start, you need to enable several APIs that this codelab uses. Enable them by running the following command:
gcloud services enable run.googleapis.com \
cloudbuild.googleapis.com \
artifactregistry.googleapis.com
Set environment variables
Configure your project ID below.
export PROJECT_ID=<YOUR_PROJECT_ID>
export REGION=europe-west4
export SERVICE_NAME=gemma4-cr-codelab
export SERVICE_ACCOUNT_NAME=gemma4-cr-sa
export SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com
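To see how the service account address is built from the other variables, here is the expansion with a placeholder project ID (demo-project is illustrative only; your real project ID goes in PROJECT_ID above):

```shell
# Illustrative only: demo-project stands in for your real project ID.
PROJECT_ID=demo-project
SERVICE_ACCOUNT_NAME=gemma4-cr-sa
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com
echo "$SERVICE_ACCOUNT_ADDRESS"   # gemma4-cr-sa@demo-project.iam.gserviceaccount.com
```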
4. Create a service account
This service account is used as the Cloud Run service identity.
gcloud iam service-accounts create $SERVICE_ACCOUNT_NAME \
--display-name="Cloud Run gemma 4 SA"
5. Deploy the service
To deploy Gemma models on Cloud Run, use the following gcloud CLI command with the recommended settings:
CONTAINER_ARGS=(
"serve"
"google/gemma-4-E2B-it"
"--enable-chunked-prefill"
"--enable-prefix-caching"
"--generation-config=auto"
"--enable-auto-tool-choice"
"--tool-call-parser=gemma4"
"--reasoning-parser=gemma4"
"--dtype=bfloat16"
"--max-num-seqs=64"
"--gpu-memory-utilization=0.95"
"--tensor-parallel-size=1"
"--port=8080"
"--host=0.0.0.0"
)
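The deploy command below passes these arguments through a single --args flag, joining the array with commas by temporarily setting IFS inside a command substitution. A minimal sketch of that join, using a shortened stand-in array:

```shell
# Sketch: join a bash array into the comma-separated string --args expects.
# DEMO_ARGS is a shortened, hypothetical stand-in for CONTAINER_ARGS above.
DEMO_ARGS=("serve" "google/gemma-4-E2B-it" "--port=8080")
JOINED=$(IFS=','; echo "${DEMO_ARGS[*]}")
echo "$JOINED"   # serve,google/gemma-4-E2B-it,--port=8080
```

Inside the command substitution, "${DEMO_ARGS[*]}" expands the array elements separated by the first character of IFS, so the IFS=',' assignment produces the comma-separated string without affecting the rest of the shell session.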
gcloud beta run deploy $SERVICE_NAME \
--image "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:gemma4" \
--project $PROJECT_ID \
--region $REGION \
--execution-environment gen2 \
--no-allow-unauthenticated \
--cpu 20 \
--memory 80Gi \
--gpu 1 \
--gpu-type nvidia-rtx-pro-6000 \
--no-gpu-zonal-redundancy \
--no-cpu-throttling \
--max-instances 3 \
--concurrency 64 \
--timeout 600 \
--service-account $SERVICE_ACCOUNT_ADDRESS \
--startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=1,timeoutSeconds=240,periodSeconds=240 \
--command "vllm" \
--args=$(IFS=','; echo "${CONTAINER_ARGS[*]}")
6. Test the service
Once deployed, you can either use the Cloud Run dev proxy service, which automatically adds an ID token for you, or curl the service URL directly.
Using the Cloud Run dev proxy service
First, start the proxy:
gcloud run services proxy $SERVICE_NAME \
--project $PROJECT_ID \
--region $REGION \
--port=9090
Leaving the proxy running, open a separate terminal tab and run the following command to send a request. The proxy listens on localhost:9090.
curl http://localhost:9090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-E2B-it",
"messages": [{"role": "user", "content": "Why is the sky blue?"}],
"chat_template_kwargs": {
"enable_thinking": true
},
"skip_special_tokens": false
}'
You should see output similar to the following:
{
"id": "chatcmpl-9cf1ab1450487047",
"object": "chat.completion",
"created": 1774904187,
"model": "google/gemma-4-E2B-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The short answer is a phenomenon called **Rayleigh scattering**...",
"function_call": null,
"tool_calls": [],
"reasoning": "* Question: \"Why is the sky blue?\"\n..."
},
"finish_reason": "stop",
"stop_reason": 106
}
],
"usage": {
"prompt_tokens": 21,
"total_tokens": 877,
"completion_tokens": 856
}
}
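If you want just the reply text or the token counts out of a response like the one above, you can filter the JSON in the shell. A minimal sketch using a saved response and a small Python filter (python3 is assumed to be available; jq works equally well if you have it installed):

```shell
# Sketch: pull individual fields out of a saved chat completion response.
# RESPONSE is an abbreviated stand-in for the full output shown above.
RESPONSE='{"choices":[{"message":{"content":"Rayleigh scattering..."}}],"usage":{"prompt_tokens":21,"total_tokens":877,"completion_tokens":856}}'
echo "$RESPONSE" | python3 -c 'import sys, json
d = json.load(sys.stdin)
print(d["choices"][0]["message"]["content"])
print(d["usage"]["total_tokens"])'
```

For a live call, pipe the curl output into the same filter instead of echoing the saved string.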
Using the service URL directly
First, retrieve the URL for the deployed service.
SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --region $REGION --format 'value(status.url)')
Curl the service
curl $SERVICE_URL/v1/chat/completions \
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-E2B-it",
"messages": [{"role": "user", "content": "Why is the sky blue?"}],
"chat_template_kwargs": {
"enable_thinking": true
},
"skip_special_tokens": false
}'
7. Congratulations!
Congratulations on completing the codelab!
We recommend reviewing the Cloud Run documentation.
What we've covered
- How to use GPUs on Cloud Run.
- How to deploy Google's Gemma 4 E2B model on Cloud Run using vLLM as an inference engine.
8. Clean up
To avoid inadvertent charges (for example, if your Cloud Run services are invoked more times than your monthly Cloud Run invocation allotment in the free tier), you can either delete the Cloud Run service or delete the project you created in Step 2.
To delete the Cloud Run service, go to the Cloud Run Cloud Console at https://console.cloud.google.com/run and delete the gemma4-cr-codelab service. You may also want to delete the gemma4-cr-sa service account.
If you choose to delete the entire project, you can go to https://console.cloud.google.com/cloud-resource-manager, select the project you created in Step 2, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.