1. Introduction
Overview
In this codelab, you'll learn how to host a gemma3:4b model in a sidecar for a Cloud Run function. When a file is uploaded to a Cloud Storage bucket, it will trigger the Cloud Run function. The function will send the contents of the file to Gemma 3 in the sidecar for summarization.
What you'll learn
- How to run inference using a Cloud Run function and an LLM hosted in a GPU-backed sidecar
- How to use a Direct VPC egress configuration with a GPU-enabled Cloud Run service for faster upload and serving of the model
- How to use Genkit to interface with your hosted Ollama model
2. Before you begin
To use the GPUs feature, you must request a quota increase for a supported region. The quota needed is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is under the Cloud Run Admin API. You can request the increase from the Quotas page in the Google Cloud console.
3. Setup and Requirements
Set environment variables that will be used throughout this codelab.
PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>
AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3
Create the service account by running this command:
gcloud iam service-accounts create $SERVICE_ACCOUNT \
--display-name="SA for codelab crf sidecar with gpu"
We'll use the same service account that serves as the Cloud Run function's identity as the service account for the Eventarc trigger to invoke the Cloud Run function. You can create a different service account for Eventarc if you prefer. Grant the service account the Cloud Run invoker role:
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/run.invoker
Also grant the service account access to receive Eventarc events.
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
--role="roles/eventarc.eventReceiver"
Create a bucket that will host the Gemma 3 model. This codelab uses a regional bucket. You can use a multi-regional bucket as well.
gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME
Now create a regional bucket that will store the docs you want summarized. You can use a multi-regional bucket as well, provided you update the Eventarc trigger accordingly (shown at the end of this codelab).
gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME
And then give the SA access to the Gemma 3 bucket.
gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin
and the Docs bucket.
gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin
Create an Artifact Registry repository for the Ollama image that will be used in the sidecar:
gcloud artifacts repositories create $AR_REPO \
--repository-format=docker \
--location=$REGION \
--description="codelab for CR function and gpu sidecar" \
--project=$PROJECT_ID
4. Download the Gemma 3 model
First, you'll want to download the Gemma 3 4B model from Ollama. You can do this by installing Ollama and then running the gemma3:4b model locally.
curl -fsSL https://ollama.com/install.sh | sh
ollama serve
Now in a separate terminal window, run the following command to pull down the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the upper right menu bar.
ollama run gemma3:4b
Once Ollama is running, feel free to ask the model some questions, e.g.
"why is the sky blue?"
Once you're done chatting with the model, you can exit the chat by typing
/bye
Then, in the first terminal window, stop the local Ollama server:
# on Linux / Cloud Shell, press Ctrl+C (or the equivalent for your shell)
Where Ollama stores downloaded models depends on your operating system; see:
https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored
If you are using Cloud Workstations, the downloaded Ollama models are in /home/$USER/.ollama/models.
Confirm that your models are hosted here:
ls /home/$USER/.ollama/models
Now copy the gemma3:4b model to your GCS bucket:
gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME
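Optionally, you can confirm the upload by listing the bucket; you should see the blobs and manifests folders under models:
gsutil ls gs://$BUCKET_GEMMA_NAME/models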
5. Create the Cloud Run function
Create a root folder for your source code.
mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function
Create a subfolder called src. Inside that folder, create a file called index.ts:
mkdir src &&
touch src/index.ts
Update index.ts with the following code:
import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";
import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

// Initialize the Cloud Storage client
const storage = new Storage();
const ai = genkit({
  plugins: [
    ollama({
      models: [
        {
          name: 'gemma3:4b',
          type: 'generate', // type: 'chat' | 'generate' | undefined
        },
      ],
      serverAddress: 'http://127.0.0.1:11434', // default local address
    }),
  ],
});
// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.
cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
console.log("---------------\nProcessing for ", cloudevent.subject, "\n---------------");
if (cloudevent.data) {
const data = cloudevent.data;
if (data && data.bucket && data.name) {
const bucketName = cloudevent.data.bucket;
const fileName = cloudevent.data.name;
const filePath = `${cloudevent.data.bucket}/${cloudevent.data.name}`;
console.log(`Attempting to download: ${filePath}`);
try {
// Get a reference to the bucket
const bucket = storage.bucket(bucketName!);
// Get a reference to the file
const file = bucket.file(fileName!);
// Download the file's contents
const [content] = await file.download();
// 'content' is a Buffer. Convert it to a string.
const fileContent = content.toString('utf8');
console.log(`Sending file to Gemma 3 for summarization`);
const { text } = await ai.generate({
model: 'ollama/gemma3:4b',
prompt: `Summarize the following document in just a few sentences ${fileContent}`,
});
console.log(text);
} catch (error: any) {
console.error('An error occurred:', error.message);
}
} else {
console.warn("CloudEvent bucket name is missing!", cloudevent);
}
} else {
console.warn("CloudEvent data is missing!", cloudevent);
}
});
Now, in the cr-function directory (the same level as the src directory), create a file called package.json with the following contents:
{
  "main": "lib/index.js",
  "name": "ingress-crf-genkit",
  "version": "1.0.0",
  "scripts": {
    "build": "tsc"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "description": "",
  "dependencies": {
    "@google-cloud/functions-framework": "^3.4.0",
    "@google-cloud/storage": "^7.0.0",
    "genkit": "^1.1.0",
    "genkitx-ollama": "^1.1.0",
    "@google/events": "^5.4.0"
  },
  "devDependencies": {
    "typescript": "^5.5.2"
  }
}
Create a tsconfig.json file, also in the cr-function directory, with the following contents:
{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}
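If you want to catch TypeScript errors before deploying, you can optionally install the dependencies and run the build locally (this assumes Node.js and npm are available in your environment; the deploy step's buildpack should also run the build script):
npm install
npm run build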
6. Deploy the function
In this step, you'll deploy the Cloud Run function by running the following command from the cr-function directory.
Note: max instances should be set to a number less than or equal to your GPU quota.
gcloud beta run deploy $FUNCTION_NAME \
--region $REGION \
--function gcs-cloudevent \
--base-image nodejs22 \
--source . \
--no-allow-unauthenticated \
--max-instances 2 # this should be less than or equal to your GPU quota
7. Create the sidecar
You can learn more about hosting Ollama within a Cloud Run service at https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama
Move into the directory for your sidecar:
cd ../ollama-gemma3
Create a Dockerfile with the following contents:
FROM ollama/ollama:latest
# Listen on all interfaces, port 11434
ENV OLLAMA_HOST 0.0.0.0:11434
# Store model weight files in /models
ENV OLLAMA_MODELS /models
# Reduce logging verbosity
ENV OLLAMA_DEBUG false
# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1
# Store the model weights in the container image
ENV MODEL gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL
# Start Ollama
ENTRYPOINT ["ollama", "serve"]
Build the image:
gcloud builds submit \
--tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
--machine-type e2-highcpu-32
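To confirm the build succeeded and the image was pushed, you can optionally list the images in your Artifact Registry repository:
gcloud artifacts docker images list $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO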
8. Update the function with the sidecar
To add a sidecar to an existing service, job, or function, you can update the YAML file to contain the sidecar.
Retrieve the YAML for the Cloud Run function that you just deployed by running:
gcloud run services describe $FUNCTION_NAME --format=export > add-sidecar-service.yaml
Now add the sidecar to the Cloud Run function by updating the YAML as follows:
- Insert the following YAML fragment directly above the runtimeClassName: run.googleapis.com/linux-base-image-update line. The "- image" item of the sidecar should align with the "- image" item of the ingress container.
      - image: YOUR_IMAGE_SIDECAR:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
      - csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: YOUR_BUCKET_GEMMA_NAME
        name: gcs-1
- Run the following command to replace the placeholders in the YAML with the values of your environment variables:
sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml
Your completed YAML file should look something like this:
##############################################
# DO NOT COPY - For illustration purposes only
##############################################
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri: <YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URL>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
      - csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: <YOUR_BUCKET_NAME>
        name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100
##############################################
# DO NOT COPY - For illustration purposes only
##############################################
Now update the function with the sidecar by running the following command.
gcloud run services replace add-sidecar-service.yaml
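To verify that the new revision contains both containers, you can optionally describe the service again; the output should include the ingress (function) container and the gemma-sidecar container:
gcloud run services describe $FUNCTION_NAME --region $REGION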
Lastly, create the Eventarc trigger for the function. The trigger routes object-finalized events from the docs bucket to the Cloud Run function.
Note: if you created a multi-regional bucket, you'll want to change the --location parameter accordingly.
gcloud eventarc triggers create my-crf-summary-trigger \
--location=$REGION \
--destination-run-service=$FUNCTION_NAME \
--destination-run-region=$REGION \
--event-filters="type=google.cloud.storage.object.v1.finalized" \
--event-filters="bucket=$BUCKET_DOCS_NAME" \
--service-account=$SERVICE_ACCOUNT_ADDRESS
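Newly created Eventarc triggers can take a couple of minutes to become active. You can optionally check the trigger with:
gcloud eventarc triggers describe my-crf-summary-trigger --location=$REGION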
9. Test your function
Upload a plain text file for summarization. Don't know what to summarize? Ask Gemini for a quick 1-2 page description of the history of dogs, then upload that plain text file to your $BUCKET_DOCS_NAME bucket. The Gemma 3 model will write a summary to the function logs.
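For example, assuming you saved the generated text locally as dogs.txt (the filename is just an example), you can upload it with:
gsutil cp dogs.txt gs://$BUCKET_DOCS_NAME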
In the logs, you'll see something similar to the following:
---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.
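You can view these logs in the Logs Explorer in the Google Cloud console, or read them from the command line; one way (flags may vary slightly across gcloud versions) is:
gcloud beta run services logs read $FUNCTION_NAME --region $REGION --limit 50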
10. Troubleshooting
Here are some errors you might encounter:
- If you get an error that PORT 8080 is in use, make sure the Dockerfile for your Ollama sidecar is using port 11434. Also make sure you are using the correct sidecar image in case you have multiple Ollama images in your AR repo. The Cloud Run function serves on port 8080, and if you used a different Ollama image as the sidecar that is also serving on 8080, you'll run into this error.
- If you get the error failed to build: (error ID: 7485c5b6): function.js does not exist, make sure that your package.json and tsconfig.json files are at the same level as the src directory.
- If you get the error ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements., change autoscaling.knative.dev/maxScale: '100' in your YAML file to 1 or to something less than or equal to your GPU quota.