How to host an LLM in a sidecar for a Cloud Run function

1. Introduction

Overview

In this codelab, you'll learn how to host a gemma3:4b model in a sidecar for a Cloud Run function. When a file is uploaded to a Cloud Storage bucket, it will trigger the Cloud Run function. The function will send the contents of the file to Gemma 3 in the sidecar for summarization.

What you'll learn

  • How to do inference using a Cloud Run function and an LLM hosted in a sidecar using GPUs
  • How to use the Direct VPC egress configuration for a Cloud Run GPU for faster upload and serving of the model
  • How to use Genkit to interface with your hosted Ollama model

2. Before you begin

To use the GPU feature, you must request a quota increase for a supported region. The quota you need is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is under the Cloud Run Admin API. You can request it from the Quotas page in the Cloud console (IAM & Admin > Quotas & system limits); a quick way to check your current allocation from the command line is shown below.
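If you'd like to check your current allocation from the command line, one option is the alpha quota surface in gcloud. This is just a sketch, assuming the alpha components are installed (the Quotas page in the console shows the same information); replace $PROJECT_ID with your project ID if you haven't set that variable yet.

# List the Cloud Run Admin API quotas for your project and look for
# nvidia_l4_gpu_allocation_no_zonal_redundancy in the output
gcloud alpha services quota list \
    --service=run.googleapis.com \
    --consumer=projects/$PROJECT_ID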

3. Setup and Requirements

Set environment variables that will be used throughout this codelab.

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>

AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="SA for codelab crf sidecar with gpu"

We'll use the same service account both as the Cloud Run function's identity and as the service account the Eventarc trigger uses to invoke the Cloud Run function. You can create a different SA for Eventarc if you prefer.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role=roles/run.invoker

Also grant the service account access to receive Eventarc events.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
    --role="roles/eventarc.eventReceiver"

Create a bucket that will host the Gemma 3 model weights. This codelab uses a regional bucket. You can use a multi-regional bucket as well.

gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME

And then give the SA access to the bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Now create a regional bucket that will store the docs you want summarized. You can use a multi-regional bucket as well, provided you update the Eventarc trigger accordingly (shown at the end of this codelab).

gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME

Then give the SA access to the docs bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Create an Artifact Registry repository for the Ollama image that will be used in the sidecar:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for CR function and gpu sidecar" \
    --project=$PROJECT_ID

4. Download the Gemma 3 model

First, you'll want to download the Gemma 3 4B model from Ollama. You can do this by installing Ollama and then running the gemma3:4b model locally.

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

Now, in a separate terminal window, run the following command to pull down and run the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the upper right menu bar.

ollama run gemma3:4b

Once Ollama is running, feel free to ask the model some questions, e.g.

"why is the sky blue?"

Once you're done chatting with Ollama, you can exit the chat by running

/bye

Then, in the first terminal window, stop the local Ollama server:

# on Linux / Cloud Shell, press Ctrl+C (or the equivalent for your shell)

Where Ollama stores downloaded models depends on your operating system; see:

https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored

If you are using Cloud Workstations, the downloaded models are in /home/$USER/.ollama/models.

Confirm that your models are stored there:

ls /home/$USER/.ollama/models

Now copy the gemma3:4b model files to your GCS bucket:

gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME
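If you want to confirm the copy completed, list the bucket (an optional quick check; you should see the blobs and manifests folders that Ollama created):

gsutil ls gs://$BUCKET_GEMMA_NAME/models/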

5. Create the Cloud Run function

Create a root folder for your source code.

mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function

Create a subfolder called src. Inside the folder, create a file called index.ts

mkdir src &&
touch src/index.ts

Update index.ts with the following code:

import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";

// Initialize the Cloud Storage client
const storage = new Storage();

import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

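// Configure Genkit with the Ollama plugin. Cloud Run sidecars share the
// ingress container's network namespace, so the Ollama sidecar is reachable
// on localhost (127.0.0.1:11434).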
const ai = genkit({
    plugins: [
        ollama({
            models: [
                {
                    name: 'gemma3:4b',
                    type: 'generate', // type: 'chat' | 'generate' | undefined
                },
            ],
            serverAddress: 'http://127.0.0.1:11434', // default local address
        }),
    ],
});


// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.

cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
    console.log("---------------\nProcessing for ", cloudevent.subject, "\n---------------");

    if (cloudevent.data) {

        const data = cloudevent.data;

        if (data && data.bucket && data.name) {
            const bucketName = cloudevent.data.bucket;
            const fileName = cloudevent.data.name;
            const filePath = `${cloudevent.data.bucket}/${cloudevent.data.name}`;

            console.log(`Attempting to download: ${filePath}`);

            try {
                // Get a reference to the bucket
                const bucket = storage.bucket(bucketName!);

                // Get a reference to the file
                const file = bucket.file(fileName!);

                // Download the file's contents
                const [content] = await file.download();

                // 'content' is a Buffer. Convert it to a string.
                const fileContent = content.toString('utf8');

                console.log(`Sending file to Gemma 3 for summarization`);
                const { text } = await ai.generate({
                    model: 'ollama/gemma3:4b',
                    prompt: `Summarize the following document in just a few sentences: ${fileContent}`,
                });

                console.log(text);

            } catch (error: any) {

                console.error('An error occurred:', error.message);
            }
        } else {
            console.warn("CloudEvent bucket name is missing!", cloudevent);
        }
    } else {
        console.warn("CloudEvent data is missing!", cloudevent);
    }
});

Now, in the cr-function directory (the root of your function's source), create a file called package.json with the following contents:

{
    "main": "lib/index.js",
    "name": "ingress-crf-genkit",
    "version": "1.0.0",
    "scripts": {
        "build": "tsc"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "description": "",
    "dependencies": {
        "@google-cloud/functions-framework": "^3.4.0",
        "@google-cloud/storage": "^7.0.0",
        "genkit": "^1.1.0",
        "genkitx-ollama": "^1.1.0",
        "@google/events": "^5.4.0"
    },
    "devDependencies": {
        "typescript": "^5.5.2"
    }
}

Create a tsconfig.json also at the root directory level with the following contents:

{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}
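Optionally, before deploying, you can check that the TypeScript compiles cleanly. This is just a local sanity check, assuming Node.js and npm are available in your environment:

npm install

# runs tsc; the output lands in lib/index.js, which package.json references as "main"
npm run build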

6. Deploy the function

In this step, you'll deploy the Cloud Run function by running the following command.

Note: max instances should be set to a number less than or equal to your GPU quota.

gcloud beta run deploy $FUNCTION_NAME \
  --region $REGION \
  --function gcs-cloudevent \
  --base-image nodejs22 \
  --source . \
  --no-allow-unauthenticated \
  --max-instances 2 # this should be less than or equal to your GPU quota
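Once the deploy finishes, you can optionally confirm the service is up. The function is invoked by Eventarc rather than by its URL, so this is only a sanity check:

gcloud run services describe $FUNCTION_NAME \
  --region $REGION \
  --format='value(status.url)'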

7. Create the sidecar

You can learn more about hosting Ollama within a Cloud Run service at https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama

Move into the directory for your sidecar:

cd ../ollama-gemma3

Create a file named Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST 0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS /models

# Reduce logging verbosity
ENV OLLAMA_DEBUG false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1

# Store the model weights in the container image
ENV MODEL gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build the image

gcloud builds submit \
   --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
   --machine-type e2-highcpu-32
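When the build completes, you can optionally confirm the image landed in your Artifact Registry repository:

gcloud artifacts docker images list $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO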

8. Update the function with the sidecar

To add a sidecar to an existing service, job, or function, you can update the YAML file to contain the sidecar.

Retrieve the YAML for the Cloud Run function that you just deployed by running:

gcloud run services describe $FUNCTION_NAME --region $REGION --format=export > add-sidecar-service.yaml

Now add the sidecar to the Cloud Run function by updating the YAML as follows:

  1. Insert the following YAML fragment directly above the runtimeClassName: run.googleapis.com/linux-base-image-update line. The - image: entry should be indented to the same level as the ingress container's - image: entry.
      - image: YOUR_IMAGE_SIDECAR:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: YOUR_BUCKET_GEMMA_NAME
          name: gcs-1
  2. Run the following command to substitute your values for the placeholders in the YAML fragment:
sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml

Your completed YAML file should look something like this:

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:    
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri:<YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URLS>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: <YOUR_BUCKET_NAME>
          name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

Now update the function with the sidecar by running the following command.

gcloud run services replace add-sidecar-service.yaml
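To confirm the new revision includes both containers, you can optionally dump the container images from the live service (the grep is only to keep the output short):

gcloud run services describe $FUNCTION_NAME \
  --region $REGION \
  --format=export | grep "image:"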

Lastly, create the Eventarc trigger for the function. The trigger invokes the function whenever a file is uploaded (finalized) to the docs bucket.

Note: if you created a multi-regional bucket, you'll want to change the --location parameter accordingly.

gcloud eventarc triggers create my-crf-summary-trigger  \
    --location=$REGION \
    --destination-run-service=$FUNCTION_NAME  \
    --destination-run-region=$REGION \
    --event-filters="type=google.cloud.storage.object.v1.finalized" \
    --event-filters="bucket=$BUCKET_DOCS_NAME" \
    --service-account=$SERVICE_ACCOUNT_ADDRESS
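The trigger can take a minute or two to become fully active. You can optionally check it with:

gcloud eventarc triggers describe my-crf-summary-trigger --location=$REGION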

9. Test your function

Upload a plain text file for summarization. Don't know what to summarize? Ask Gemini for a quick 1-2 page description of the history of dogs. Then upload that plain text file to your $BUCKET_DOCS_NAME bucket, and the gemma3:4b model will write a summary of it to the function logs.
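For example, assuming you saved your test document as dogs.txt in the current directory (the filename is just for illustration), you can upload it and then read the function's logs:

gsutil cp dogs.txt gs://$BUCKET_DOCS_NAME

# read recent logs for the function; you can also view them in the
# Cloud console under Cloud Run > your service > Logs
gcloud beta run services logs read $FUNCTION_NAME --region $REGION --limit=50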

In the logs, you'll see something similar to the following:

---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.

10. Troubleshooting

Here are some errors you might encounter and how to fix them:

  1. If you get an error that PORT 8080 is in use, make sure your Dockerfile for your Ollama sidecar is using port 11434. Also make sure you are using the correct sidecar image in case you have multiple Ollama images in your AR repo. The Cloud Run function serves on port 8080, and if you used a different Ollama image as the sidecar that is also serving on 8080, you'll run into this error.
  2. If you get the error failed to build: (error ID: 7485c5b6): function.js does not exist, make sure that your package.json and tsconfig.json files are at the same level as the src directory.
  3. If you get the error ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements., lower the autoscaling.knative.dev/maxScale value in your YAML file (for example, from '100') to 4 or fewer, and no more than your GPU quota.