How to use Ollama as a sidecar with Cloud Run GPUs and Open WebUI as a frontend ingress container

About this codelab

Last updated Sep 4, 2024
Written by a Googler

1. Introduction

Overview

Cloud Run recently added GPU support. It's available as a waitlisted public preview. If you're interested in trying out the feature, fill out this form to join the waitlist. Cloud Run is a container platform on Google Cloud that makes it straightforward to run your code in a container, without requiring you to manage a cluster.

Today, the GPUs we make available are NVIDIA L4 GPUs with 24 GB of VRAM. There's one GPU per Cloud Run instance, and Cloud Run autoscaling still applies. That includes scaling out up to 5 instances (with a quota increase available), as well as scaling down to zero instances when there are no requests.

One use case for GPUs is running your own open large language models (LLMs). This tutorial walks you through deploying a service that runs an LLM.

In this codelab, you'll deploy a multi-container service that uses Open WebUI as the frontend ingress container and Ollama as a sidecar serving a Gemma 2 2B model stored in a Google Cloud Storage bucket.

What you'll learn

  • How to create a multi-container service in Cloud Run
  • How to deploy Ollama as a sidecar serving a Gemma 2 2B model
  • How to deploy Open WebUI as a frontend ingress container

2. Set Environment Variables and Enable APIs

Upgrade gcloud CLI

First, you will need to have a recent version of the gcloud CLI installed. You can update the CLI by running the following command:

gcloud components update
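
To confirm the update succeeded, you can print the installed version:

gcloud version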

Set up environment variables

You can set environment variables that will be used throughout this codelab.

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=us-central1
gcloud config set project $PROJECT_ID
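
If you want to double-check which project gcloud is pointing at, you can run:

gcloud config get-value project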

Enable APIs

Before you can start using this codelab, you'll need to enable several APIs. You can enable them by running the following command:

gcloud services enable run.googleapis.com \
    cloudbuild.googleapis.com \
    storage.googleapis.com \
    artifactregistry.googleapis.com
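
(Optional) To confirm the APIs were enabled, you can list the enabled services and filter for them; the grep pattern below is just one way to narrow the output:

gcloud services list --enabled | grep -E 'run|cloudbuild|storage|artifactregistry'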

Create a directory for this codelab.

mkdir ollama-sidecar-codelab
cd ollama-sidecar-codelab

3. Install Ollama and download the Gemma 2 2B model

First, you'll install Ollama, which you'll use to download the model. Ollama stores downloaded models in /home/$USER/.ollama/models.

curl -fsSL https://ollama.com/install.sh | sh

Now start the Ollama server by running:

ollama serve

Ollama starts listening on port 11434.
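
(Optional) You can sanity-check that the server is up by sending a request to that port from another terminal; Ollama replies with a short status message:

curl http://localhost:11434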

In a second terminal window, pull down the Gemma 2 2B model:

ollama pull gemma2:2b

(Optional) You can interact with Gemma from the command line by running

ollama run gemma2:2b

When you are done chatting with Gemma, you can exit by typing

/bye

4. Create a Storage Bucket

Now that the model has downloaded, you can copy it to your GCS bucket.

First, create the bucket.

gcloud storage buckets create gs://$PROJECT_ID-gemma2-2b-codelab

Now, copy the models folder to GCS.

gsutil cp -r /home/$USER/.ollama/models gs://$PROJECT_ID-gemma2-2b-codelab
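
You can verify the upload by listing the bucket contents:

gcloud storage ls --recursive gs://$PROJECT_ID-gemma2-2b-codelab/models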

5. Create the Ollama image

Create a Dockerfile with the following contents:

FROM ollama/ollama

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=gemma2:2b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]
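
(Optional) Before pushing anything to the cloud, you can smoke-test the image locally. This is a minimal sketch that assumes Docker is installed; the image tag and container name (ollama-gemma-2b, ollama-test) are arbitrary local names, and without a GPU Ollama falls back to CPU, so generation will just be slow:

docker build -t ollama-gemma-2b .
docker run -d --rm -p 11434:11434 --name ollama-test ollama-gemma-2b

# Give the server a moment to start, then send a test prompt
sleep 5
curl http://localhost:11434/api/generate -d '{"model": "gemma2:2b", "prompt": "Hello", "stream": false}'

docker stop ollama-test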

Create an Artifact Registry repo to store your service images.

gcloud artifacts repositories create ollama-sidecar-codelab-repo \
    --repository-format=docker \
    --location=us-central1 \
    --description="Ollama + OpenWebUI website demo" \
    --project=$PROJECT_ID

Build the Ollama sidecar image:

gcloud builds submit \
   --tag us-central1-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gemma-2b \
   --machine-type e2-highcpu-32
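
Once the build finishes, you can confirm the image is in the repo:

gcloud artifacts docker images list us-central1-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo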

6. Create the Open WebUI frontend image

In this section, you'll create the frontend ingress container using Open WebUI.

Use docker to pull down the Open WebUI image.

docker pull ghcr.io/open-webui/open-webui:main

Then configure Docker to use your Google Cloud credentials to authenticate with Artifact Registry. This will allow you to use docker to push an image to an Artifact Registry repo.

gcloud auth configure-docker us-central1-docker.pkg.dev

Tag your image and then push to Artifact Registry.

docker tag ghcr.io/open-webui/open-webui:main us-central1-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/openwebui

docker push us-central1-docker.pkg.dev/$PROJECT_ID/ollama-sidecar-codelab-repo/openwebui

7. Deploy the multi-container service to Cloud Run

Use a YAML file to deploy the multi-container service

Create a service.yaml with the following contents.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-sidecar-codelab
  labels:
    cloud.googleapis.com/location: us-central1
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '5'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/startup-cpu-boost: 'true'
        run.googleapis.com/container-dependencies: '{"openwebui":["ollama-sidecar"]}'
    spec:
      containerConcurrency: 80
      timeoutSeconds: 300
      containers:
      - name: openwebui
        image: us-central1-docker.pkg.dev/YOUR_PROJECT_ID/ollama-sidecar-codelab-repo/openwebui
        ports:
        - name: http1
          containerPort: 8080
        env:
        - name: OLLAMA_BASE_URL
          value: http://localhost:11434
        - name: WEBUI_AUTH
          value: 'false'
        resources:
          limits:
            memory: 1Gi
            cpu: 2000m
        volumeMounts:
        - name: in-memory-1
          mountPath: /app/backend/data
        startupProbe:
          timeoutSeconds: 240
          periodSeconds: 240
          failureThreshold: 1
          tcpSocket:
            port: 8080
      - name: ollama-sidecar
        image: us-central1-docker.pkg.dev/YOUR_PROJECT_ID/ollama-sidecar-codelab-repo/ollama-gemma-2b
        resources:
          limits:
            cpu: '6'
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          timeoutSeconds: 1
          periodSeconds: 10
          failureThreshold: 3
          tcpSocket:
            port: 11434
      volumes:
      - name: gcs-1
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: YOUR_PROJECT_ID-gemma2-2b-codelab
      - name: in-memory-1
        emptyDir:
          medium: Memory
          sizeLimit: 10Gi
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4

Update service.yaml to replace YOUR_PROJECT_ID with your project ID:

sed -i "s/YOUR_PROJECT_ID/${PROJECT_ID}/g" service.yaml
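
A quick check that no placeholders remain (the command should print nothing):

grep YOUR_PROJECT_ID service.yaml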

Deploy to Cloud Run using the following command.

gcloud beta run services replace service.yaml
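
The first deployment can take several minutes while the images are pulled and the service starts. Once it's ready, you can fetch the service URL with gcloud:

gcloud run services describe ollama-sidecar-codelab \
   --region us-central1 \
   --format 'value(status.url)'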

Test the Cloud Run service

Now open the Service URL in your web browser.

Once the UI has finished loading, under Select a model, choose Gemma 2 2B.

Now ask Gemma a question, e.g. "Why is the sky blue?"

8. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the documentation on Cloud Run GPUs.

What we've covered

  • How to create a multi-container service in Cloud Run
  • How to deploy Ollama as a sidecar serving a Gemma 2 2B model
  • How to deploy Open WebUI as a frontend ingress container

9. Clean up

To avoid inadvertent charges (for example, if the Cloud Run services are invoked more times than your monthly Cloud Run invocation allocation in the free tier), you can either delete the Cloud Run service or delete the project you created in Step 2.

To delete the Cloud Run service, go to the Cloud Run page of the Cloud Console at https://console.cloud.google.com/run and delete the ollama-sidecar-codelab service.

If you choose to delete the entire project, you can go to https://console.cloud.google.com/cloud-resource-manager, select the project you created in Step 2, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.