How to run LLM inference on Cloud Run GPUs with vLLM and the OpenAI Python SDK

1. Introduction

Overview

Cloud Run recently added GPU support. It's available as a waitlisted public preview. If you're interested in trying out the feature, fill out this form to join the waitlist. Cloud Run is a container platform on Google Cloud that makes it straightforward to run your code in a container, without requiring you to manage a cluster.

Today, the GPUs we make available are Nvidia L4 GPUs with 24 GB of vRAM. There's one GPU per Cloud Run instance, and Cloud Run auto scaling still applies. That includes scaling out up to 5 instances (with quota increase available), as well as scaling down to zero instances when there are no requests.

One use case for GPUs is running your own open large language models (LLMs). This tutorial walks you through deploying a service that runs an LLM.

The service is a backend service that runs vLLM, an inference engine for production systems. This codelab uses Google's Gemma 2 2B instruction-tuned model.

What you'll learn

  • How to use GPUs on Cloud Run.
  • How to use Hugging Face to retrieve a model.
  • How to deploy Google's Gemma 2 2B instruction-tuned model on Cloud Run using vLLM as an inference engine.
  • How to invoke the backend service to do sentence completion.

2. Setup and Requirements

Prerequisites

  • You are logged into the Cloud Console.
  • You have previously deployed a Cloud Run service. For example, you can follow the deploy a web service from source code quickstart to get started.
  • You have a Hugging Face account and have acknowledged the Gemma 2 2b license at https://huggingface.co/google/gemma-2-2b-it; otherwise, you will not be able to download the model.
  • You have created an access token that has access to the google/gemma-2-2b-it model.

Activate Cloud Shell

  1. From the Cloud Console, click Activate Cloud Shell.

If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you see this screen, click Continue.

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

  1. Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  2. Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set correctly, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Enable APIs and Set Environment Variables

Enable APIs

Before you can start using this codelab, you need to enable several APIs. Run the following command to enable the required APIs:

gcloud services enable run.googleapis.com \
    cloudbuild.googleapis.com \
    secretmanager.googleapis.com \
    artifactregistry.googleapis.com

Set up environment variables

Set the environment variables that will be used throughout this codelab:

HF_TOKEN=<YOUR_HUGGING_FACE_TOKEN>
PROJECT_ID=<YOUR_PROJECT_ID>

REGION=us-central1
SERVICE_NAME=vllm-gemma-2-2b-it
AR_REPO_NAME=vllm-gemma-2-2b-it-repo
SERVICE_ACCOUNT=vllm-gemma-2-2b-it
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com

4. Create a service account

This service account is used to build the Cloud Run service and access a secret from Secret Manager.

First, create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="Cloud Run vllm SA to access secrete manager"

Second, grant the Secret Manager Secret Accessor role to the service account.

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
  --role=roles/secretmanager.secretAccessor

Now, create a secret in Secret Manager called HF_TOKEN for your Hugging Face access token. Cloud Build uses the service account to access this secret at build time to pull down the Gemma 2 2B model from Hugging Face. You can learn more about secrets and Cloud Build here.

printf "%s" "$HF_TOKEN" | gcloud secrets create HF_TOKEN --data-file=-

And grant the service account access to the HF_TOKEN secret in Secret Manager.

gcloud secrets add-iam-policy-binding HF_TOKEN \
    --member serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role='roles/secretmanager.secretAccessor'

5. Create the image in Artifact Registry

First, create a repository in Artifact Registry.

gcloud artifacts repositories create $AR_REPO_NAME \
  --repository-format docker \
  --location us-central1

Next, create a Dockerfile that will incorporate the secret from Secret Manager. You can learn more about the Docker buildx --secret flag here.

FROM vllm/vllm-openai:latest

# Cache model downloads under /model-cache so they are baked into the image.
ENV HF_HOME=/model-cache

# Download the model at build time, reading the Hugging Face token from the
# BuildKit secret mount so it never ends up in an image layer.
RUN --mount=type=secret,id=HF_TOKEN HF_TOKEN=$(cat /run/secrets/HF_TOKEN) \
    huggingface-cli download google/gemma-2-2b-it

# Serve from the baked-in cache only; never call out to the Hub at runtime.
ENV HF_HUB_OFFLINE=1

ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
    --port ${PORT:-8000} \
    --model ${MODEL_NAME:-google/gemma-2-2b-it} \
    ${MAX_MODEL_LEN:+--max-model-len "$MAX_MODEL_LEN"}

Now, create a cloudbuild.yaml file.

steps:
- name: 'gcr.io/cloud-builders/docker'
  id: build
  entrypoint: 'bash'
  secretEnv: ['HF_TOKEN']
  args: 
    - -c
    - |
        SECRET_TOKEN="$$HF_TOKEN" docker buildx build --tag=${_IMAGE} --secret id=HF_TOKEN .

availableSecrets:
  secretManager:
  - versionName: 'projects/${PROJECT_ID}/secrets/HF_TOKEN/versions/latest'
    env: 'HF_TOKEN'

images: ["${_IMAGE}"]

substitutions:  
  _IMAGE: 'us-central1-docker.pkg.dev/${PROJECT_ID}/vllm-gemma-2-2b-it-repo/vllm-gemma-2-2b-it'

options:
  dynamicSubstitutions: true
  machineType: "E2_HIGHCPU_32"

Lastly, submit a build.

gcloud builds submit --config=cloudbuild.yaml

The build can take approximately 8 minutes.

6. Deploy the service

You are now ready to deploy the image to Cloud Run.

gcloud beta run deploy $SERVICE_NAME \
--image=us-central1-docker.pkg.dev/$PROJECT_ID/$AR_REPO_NAME/$SERVICE_NAME \
--service-account $SERVICE_ACCOUNT_ADDRESS \
--cpu=8 \
--memory=32Gi \
--gpu=1 --gpu-type=nvidia-l4 \
--region us-central1 \
--no-allow-unauthenticated \
--max-instances 5 \
--no-cpu-throttling

The deployment can take up to 5 minutes.

7. Test the service

Once deployed, you can either use the Cloud Run dev proxy service, which automatically adds an ID token for you, or curl the service URL directly.

Using the Cloud Run dev proxy service

To use the Cloud Run dev proxy service, follow these steps:

First, run the following command:

gcloud run services proxy $SERVICE_NAME --region us-central1

Next, curl the service:

curl -X POST http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "google/gemma-2-2b-it",
  "prompt": "Cloud Run is a",
  "max_tokens": 128,
  "temperature": 0.90
}'
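
You can also send the same request with the OpenAI Python SDK, since vLLM serves an OpenAI-compatible API. The following is a minimal sketch that assumes the proxy from the previous step is still running on localhost:8080 and that the openai package is installed (pip install openai). vLLM does not validate the API key, so any placeholder value works.

# Minimal sketch using the OpenAI Python SDK against the Cloud Run dev proxy.
# Assumes the proxy from the previous step is running locally on port 8080.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # the local dev proxy endpoint
    api_key="not-used",                   # vLLM does not check this value
)

completion = client.completions.create(
    model="google/gemma-2-2b-it",
    prompt="Cloud Run is a",
    max_tokens=128,
    temperature=0.9,
)

print(completion.choices[0].text)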

Using the service URL directly

First, retrieve the URL for the deployed service.

SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --region $REGION --format 'value(status.url)')

Then, curl the service:

curl -X POST $SERVICE_URL/v1/completions \
-H "Authorization: bearer $(gcloud auth print-identity-token)" \
-H "Content-Type: application/json" \
-d '{
  "model": "google/gemma-2-2b-it",
  "prompt": "Cloud Run is a",
  "max_tokens": 128,
  "temperature": 0.90
}'
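
You can call the service URL directly with the OpenAI Python SDK as well. This is a minimal sketch, assuming the openai package is installed and that you have exported SERVICE_URL (export SERVICE_URL=...) from the variable set above. The SDK sends the API key as an Authorization: Bearer header, so passing a Google identity token as the key satisfies Cloud Run's authentication check.

# Minimal sketch calling the service URL directly with the OpenAI Python SDK.
# Assumes SERVICE_URL has been exported in the shell that launches Python.
import os
import subprocess

from openai import OpenAI

# Fetch a short-lived identity token for the current gcloud account.
id_token = subprocess.run(
    ["gcloud", "auth", "print-identity-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

client = OpenAI(
    base_url=f"{os.environ['SERVICE_URL']}/v1",
    api_key=id_token,  # sent as "Authorization: Bearer <token>"
)

completion = client.completions.create(
    model="google/gemma-2-2b-it",
    prompt="Cloud Run is a",
    max_tokens=128,
    temperature=0.9,
)

print(completion.choices[0].text)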

Results

You should see results similar to the following:

{"id":"cmpl-e0e6924d4bfd4d918383c87cba5e25ac","object":"text_completion","created":1723853023,"model":"google/gemma-2-2b","choices":[{"index":0,"text":" serverless compute platform that lets you write your backend code in standard languages, such as Java, Go, PHP and Python.\n\nYou can deploy your function as a REST API that scales on demand and allows you to add additional security features such as HTTPS.\n\nTo write code for an Android app with Cloud Run, you need to use the GraalVM. This is because while Node.js is a more commonly known node-based platform, GraalVM is a virtual machine (VM) to run native code in the Cloud Run environment.\n\nNow you need graal.vm/java-11-jre.jar, the","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":133,"completion_tokens":128}}

8. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run documentation.

What we've covered

  • How to use GPUs on Cloud Run.
  • How to use Hugging Face to retrieve a model.
  • How to deploy Google's Gemma 2 2B instruction-tuned model on Cloud Run using vLLM as an inference engine.
  • How to invoke the backend service to do sentence completion.

9. Clean up

To avoid inadvertent charges (for example, if the Cloud Run services are invoked more times than your monthly Cloud Run invocation allocation in the free tier), you can either delete the Cloud Run service or delete the project you created in Step 2.

To delete the Cloud Run service, go to the Cloud Run Cloud Console at https://console.cloud.google.com/run and delete the vllm-gemma-2-2b-it service. You may also want to delete the vllm-gemma-2-2b-it service account.

If you choose to delete the entire project, you can go to https://console.cloud.google.com/cloud-resource-manager, select the project you created in Step 2, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.