1. Introduction
Overview
Cloud Run recently added GPU support. It's available as a waitlisted public preview. If you're interested in trying out the feature, fill out this form to join the waitlist. Cloud Run is a container platform on Google Cloud that makes it straightforward to run your code in a container, without requiring you to manage a cluster.
Today, the GPUs we make available are Nvidia L4 GPUs with 24 GB of vRAM. There's one GPU per Cloud Run instance, and Cloud Run auto scaling still applies. That includes scaling out up to 5 instances (with quota increase available), as well as scaling down to zero instances when there are no requests.
One use case for GPUs is running your own open large language models (LLMs). This tutorial walks you through deploying a service that runs an LLM.
In this codelab, you'll deploy a multi-container service that uses Open WebUI as a frontend ingress container and uses Ollama in a sidecar to serve a Gemma 2 2B model stored in a Google Cloud Storage bucket.
What you'll learn
- How to create a multi-container service in Cloud Run
- How to deploy Ollama as a sidecar serving a Gemma 2 2B model
- How to deploy Open WebUI as a frontend ingress container
2. Set Environment Variables and Enable APIs
Upgrade gcloud CLI
First, you will need to have a recent version of the gcloud CLI installed. You can update the CLI by running the following command:
gcloud components update
Set up environment variables
You can set environment variables that will be used throughout this codelab.
PROJECT_ID=<YOUR_PROJECT_ID>
REGION=us-central1

gcloud config set project $PROJECT_ID
Enable APIs
Before you can start using this codelab, you need to enable several APIs. You can enable them by running the following command:
gcloud services enable run.googleapis.com \
  cloudbuild.googleapis.com \
  storage.googleapis.com \
  artifactregistry.googleapis.com
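If you'd like to confirm the APIs were enabled, you can optionally list the enabled services and filter for the ones above (this check isn't part of the required steps):

gcloud services list --enabled | grep -E 'run|cloudbuild|storage|artifactregistry'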
Create a directory for this codelab.
mkdir ollama-sidecar-codelab
cd ollama-sidecar-codelab
3. Install Ollama and pull the Gemma 2 2B model
First, you'll install Ollama and use it to download the model. Ollama stores downloaded models in /home/$USER/.ollama/models
curl -fsSL https://ollama.com/install.sh | sh
Now start Ollama by running
ollama serve
Ollama starts listening on port 11434.
Open a second terminal window and pull down the Gemma 2 2B model.
ollama pull gemma2:2b
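Optionally, still in the second terminal, you can confirm that the server is reachable and that the model is now available locally. These checks aren't part of the required steps; /api/tags is Ollama's endpoint for listing local models.

curl http://localhost:11434/api/tags
ollama list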
(Optional) You can interact with Gemma from the command line by running
ollama run gemma2:2b
When you are done chatting with Gemma, you can exit by typing
/bye
4. Create a Storage Bucket
Now that the model has downloaded, you can move the model to your GCS bucket.
First, create the bucket.
gcloud storage buckets create gs://$PROJECT_ID-gemma2-2b-codelab
Now, copy the models folder to GCS.
gsutil cp -r /home/$USER/.ollama/models gs://$PROJECT_ID-gemma2-2b-codelab
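Optionally, you can confirm the model files landed in the bucket (not part of the required steps):

gcloud storage ls gs://$PROJECT_ID-gemma2-2b-codelab/models/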
5. Create the Ollama image
Create a Dockerfile with the following contents:
FROM ollama/ollama

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST 0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS /models

# Reduce logging verbosity
ENV OLLAMA_DEBUG false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1

# Store the model weights in the container image
ENV MODEL gemma2:2b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]
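If you have Docker available locally and want a quick sanity check before building with Cloud Build, you can optionally build and run the image yourself. This is just an optional sketch, not part of the required flow; note that the build step pulls the Gemma 2 2B weights, so it downloads the full model.

docker build -t ollama-gemma-2b .
docker run --rm -p 11434:11434 ollama-gemma-2b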
Create an Artifact Registry repo to store your service images.
gcloud artifacts repositories create ollama-sidecar-codelab-repo --repository-format=docker \
--location=us-central1 --description="Ollama + OpenWebUI website demo" \
--project=$PROJECT_ID
Build the Ollama sidecar image.
gcloud builds submit \
--tag us-central1-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gemma-2b \
--machine-type e2-highcpu-32
6. Create the Open WebUI frontend image
In this section, you'll create the frontend ingress container using Open WebUI.
Use docker to pull down the Open WebUI image.
docker pull ghcr.io/open-webui/open-webui:main
Then configure Docker to use your Google Cloud credentials to authenticate with Artifact Registry. This will allow you to use docker to push an image to an Artifact Registry repo.
gcloud auth configure-docker us-central1-docker.pkg.dev
Tag your image and then push to Artifact Registry.
docker tag ghcr.io/open-webui/open-webui:main us-central1-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/openwebui

docker push us-central1-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/openwebui
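At this point both images should be in Artifact Registry. If you'd like to confirm (an optional check), list the images in the repo:

gcloud artifacts docker images list us-central1-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo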
7. Deploy the multi-container service to Cloud Run
Use a YAML file to deploy the multi-container service
Create a service.yaml file with the following contents.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-sidecar-codelab
  labels:
    cloud.googleapis.com/location: us-central1
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '5'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/startup-cpu-boost: 'true'
        run.googleapis.com/container-dependencies: '{"openwebui":["ollama-sidecar"]}'
    spec:
      containerConcurrency: 80
      timeoutSeconds: 300
      containers:
      - name: openwebui
        image: us-central1-docker.pkg.dev/YOUR_PROJECT_ID/ollama-sidecar-codelab-repo/openwebui
        ports:
        - name: http1
          containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: http://localhost:11434
        - name: WEBUI_AUTH
          value: 'false'
        resources:
          limits:
            memory: 1Gi
            cpu: 2000m
        volumeMounts:
        - name: in-memory-1
          mountPath: /app/backend/data
        startupProbe:
          timeoutSeconds: 240
          periodSeconds: 240
          failureThreshold: 1
          tcpSocket:
            port: 8080
      - name: ollama-sidecar
        image: us-central1-docker.pkg.dev/YOUR_PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gemma-2b
        resources:
          limits:
            cpu: '6'
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          timeoutSeconds: 1
          periodSeconds: 10
          failureThreshold: 3
          tcpSocket:
            port: 11434
      volumes:
      - name: gcs-1
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: YOUR_PROJECT_ID-gemma2-2b-codelab
      - name: in-memory-1
        emptyDir:
          medium: Memory
          sizeLimit: 10Gi
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
Update service.yaml to replace YOUR_PROJECT_ID with your project ID:
sed -i "s/YOUR_PROJECT_ID/${PROJECT_ID}/g" service.yaml
Deploy to Cloud Run using the following command.
gcloud beta run services replace service.yaml
Test the Cloud Run service
Now open the Service URL in your web browser.
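If you need to look up the URL, you can retrieve it with gcloud, using the service name and region from service.yaml:

gcloud run services describe ollama-sidecar-codelab \
  --region us-central1 \
  --format 'value(status.url)'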
Once the UI has completed loading, under Select a model, choose Gemma 2 2B.
Now ask Gemma a question, e.g. "Why is the sky blue?"
8. Congratulations!
Congratulations on completing the codelab!
We recommend reviewing the Cloud Run documentation to learn more.
What we've covered
- How to create a multi-container service in Cloud Run
- How to deploy Ollama as a sidecar serving a Gemma 2 2B model
- How to deploy Open WebUI as a frontend ingress container
9. Clean up
To avoid inadvertent charges (for example, if the Cloud Run service is invoked more times than your monthly Cloud Run invocation allocation in the free tier), you can either delete the Cloud Run service or delete the project you used for this codelab.
To delete the Cloud Run service, go to the Cloud Run console at https://console.cloud.google.com/run and delete the ollama-sidecar-codelab service.
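Alternatively, you can delete the service from the command line, using the same service name and region as before:

gcloud run services delete ollama-sidecar-codelab --region us-central1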
If you choose to delete the entire project, you can go to https://console.cloud.google.com/cloud-resource-manager, select the project you used for this codelab, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.