How to host Ollama as a worker pool for inference

1. Introduction

Overview

In this codelab, you'll learn how to build an event-driven, asynchronous AI processing pipeline. You'll deploy an open-source model using Ollama on a Cloud Run worker pool. The worker pool pulls messages from a Pub/Sub topic and processes them with the gemma3:4b model.

What you'll learn

  • How to use Cloud Run worker pools with a Pub/Sub pull subscription
  • How to run inference with Ollama in a Cloud Run worker pool

2. Before you begin

Enable APIs

Before you start this codelab, enable the following APIs by running:

gcloud services enable run.googleapis.com \
    cloudbuild.googleapis.com \
    artifactregistry.googleapis.com \
    pubsub.googleapis.com \
    storage.googleapis.com

3. Setup and Requirements

To set up the required resources, follow these steps:

  1. Set environment variables for this codelab:
export PROJECT_ID=<YOUR_PROJECT_ID>
export REGION=<YOUR_REGION>

export BUCKET_NAME=$PROJECT_ID-gemma3-4b
export SERVICE_ACCOUNT_NAME=ollama-worker-sa
export SERVICE_ACCOUNT_EMAIL=${SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
export TOPIC_NAME=ollama-prompts
export SUBSCRIPTION_NAME=ollama-prompts-sub
export AR_REPO_NAME=ollama-worker-repo
export PULL_MSG_IMAGE_NAME=pubsub-pull-msg
export OLLAMA_IMAGE_NAME=ollama-coordinator
  2. Create a service account for the worker pool:
gcloud iam service-accounts create ${SERVICE_ACCOUNT_NAME} \
  --display-name="Ollama Worker Service Account"
  3. Grant the service account access to Pub/Sub:
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role="roles/pubsub.subscriber"
  4. Create an Artifact Registry repository for the worker pool images:
gcloud artifacts repositories create ${AR_REPO_NAME} \
  --repository-format=docker \
  --location=${REGION}
  5. Create the Pub/Sub topic and subscription:
gcloud pubsub topics create $TOPIC_NAME
gcloud pubsub subscriptions create $SUBSCRIPTION_NAME --topic $TOPIC_NAME
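The worker you'll write later addresses this subscription by its fully qualified resource path. A minimal sketch of how that path is formed (the project ID below is a placeholder):

```python
# Pub/Sub resources are addressed by fully qualified paths; the subscriber
# client in a later step builds this same path from PROJECT_ID and
# SUBSCRIPTION_NAME.
project_id = "my-project"              # placeholder for your PROJECT_ID
subscription_name = "ollama-prompts-sub"
subscription_path = f"projects/{project_id}/subscriptions/{subscription_name}"
print(subscription_path)  # → projects/my-project/subscriptions/ollama-prompts-sub
```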

4. Download and Host the Model on GCS

Instead of pulling the model inside the container during the build, which is slow and inefficient, you'll pull the model to your local machine using the Ollama CLI and then upload the model files to a GCS bucket. The worker pool will then mount this bucket to access the model.

  1. Install Ollama on your local machine:

Run the following command to install Ollama on Linux. For other operating systems, please refer to the Ollama website.

curl -fsSL https://ollama.com/install.sh | sh
  2. Start the Ollama service and pull the model:

First, start the Ollama service in the background, then pull the model.

ollama serve &
ollama pull gemma3:4b
  3. Create a GCS bucket:

Create the GCS bucket using the BUCKET_NAME environment variable you set earlier.

gsutil mb gs://${BUCKET_NAME}
  4. Upload the model files to your GCS bucket:

Ollama stores model files in the ~/.ollama/models directory. Upload the contents of this directory to your GCS bucket. This will copy all models you have downloaded.

gsutil -m cp -r ~/.ollama/models/* gs://${BUCKET_NAME}/
  5. Grant the service account read access to the Cloud Storage bucket:
gcloud storage buckets add-iam-policy-binding gs://${BUCKET_NAME} \
     --member=serviceAccount:${SERVICE_ACCOUNT_EMAIL} \
     --role=roles/storage.objectViewer

5. Create the worker pool containers

The Cloud Run worker pool uses two containers:

  • ollama-coordinator - hosts Ollama and serves the gemma3:4b model
  • pubsub-pull-msg - pulls messages from the Pub/Sub subscription and passes each one to the ollama-coordinator container

First, you'll create the ollama-coordinator container.

  1. Create a parent directory for the codelab:
mkdir codelab-ollama-wp
cd codelab-ollama-wp
  2. Create a directory for the ollama-coordinator container:
mkdir ollama-coordinator
cd ollama-coordinator
  3. Create a Dockerfile with the following contents:
# Use the official Ollama image as a base image
FROM ollama/ollama

# Expose the port that Ollama listens on
EXPOSE 11434

# Set the entrypoint to start the Ollama server
ENTRYPOINT ["ollama", "serve"]
  4. Build the ollama-coordinator container image:
gcloud builds submit --tag ${REGION}-docker.pkg.dev/${PROJECT_ID}/${AR_REPO_NAME}/${OLLAMA_IMAGE_NAME} --timeout=20m

Next, you'll create the pubsub-pull-msg container.

  5. Create a directory for the pubsub-pull-msg container:
cd ..
mkdir pubsub-pull-msg
cd pubsub-pull-msg
  6. Create a Dockerfile with the following contents:
# Use the official Python image as a base image
FROM python:3.9-slim

# Set the working directory in the container
WORKDIR /app

# Copy the requirements file into the container
COPY requirements.txt .

# Install the required Python packages
RUN pip install --no-cache-dir -r requirements.txt

# Copy the Python script into the container
COPY main.py .

# Set the entrypoint to run the Python script
CMD ["python", "main.py"]
  7. Create a requirements.txt file with the following contents:
google-cloud-pubsub
requests
  8. Create a main.py file with the following contents:
import os
import sys
import requests
import json
from google.cloud import pubsub_v1

# --- Main Application Logic ---
print("--- Sidecar container script started ---")

# --- Environment and Configuration ---
project_id = os.environ.get("PROJECT_ID")
subscription_name = os.environ.get("SUBSCRIPTION_NAME")
ollama_api_url = "http://localhost:11434/api/generate"

if not project_id or not subscription_name:
    print("FATAL: PROJECT_ID and SUBSCRIPTION_NAME must be set.")
    sys.exit(1)

print(f"PROJECT_ID: {project_id}")
print(f"SUBSCRIPTION_NAME: {subscription_name}")

def callback(message):
    """Processes a single Pub/Sub message."""
    print(f"Received message ID: {message.message_id}")
    try:
        prompt = message.data.decode("utf-8")
        print(f"Decoded prompt: '{prompt}'")
        
        data = {"model": "gemma3:4b", "prompt": prompt, "stream": False}
        
        print("Sending request to Ollama...")
        response = requests.post(ollama_api_url, json=data, timeout=300)
        response.raise_for_status()
        
        print("Successfully received response from Ollama.")
        ollama_response = response.json()
        print(f"Ollama response: {json.dumps(ollama_response)[:200]}...")

        message.ack()
        print(f"Message {message.message_id} acknowledged.")

    except requests.exceptions.RequestException as e:
        print(f"Error calling Ollama API: {e}")
        message.nack()
        print(f"Message {message.message_id} not acknowledged.")
    except Exception as e:
        print(f"An unexpected error occurred in callback: {e}")
        message.nack()
        print(f"Message {message.message_id} not acknowledged.")

def main():
    """Starts the Pub/Sub subscriber."""
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(project_id, subscription_name)
    
    streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Subscribed to {subscription_path}. Listening for messages...")

    try:
        # .result() will block indefinitely.
        streaming_pull_future.result()
    except Exception as e:
        print(f"A fatal error occurred in the subscriber: {e}")
        streaming_pull_future.cancel()
        streaming_pull_future.result()

if __name__ == "__main__":
    main()
  9. Build the pubsub-pull-msg container image:
gcloud builds submit --tag ${REGION}-docker.pkg.dev/${PROJECT_ID}/${AR_REPO_NAME}/${PULL_MSG_IMAGE_NAME}

6. Deploy and test the worker pool

In this step, you'll create the Cloud Run worker pool by deploying a YAML file.

Move back to the codelab's root directory to create the YAML file.

cd ..
  1. Create a file worker-pool.template.yaml with the following content:
apiVersion: run.googleapis.com/v1
kind: WorkerPool
metadata:
  name: codelab-ollama-wp
  labels:
    cloud.googleapis.com/location: ${REGION}
  annotations:
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/scalingMode: manual
    run.googleapis.com/manualInstanceCount: '1'
    run.googleapis.com/gcs-fuse-mounter-enabled: "true"
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/gpu: "1"
        run.googleapis.com/gpu-zonal-redundancy-disabled: 'true'        
    spec:
      serviceAccountName: ${SERVICE_ACCOUNT_EMAIL}
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
      - name: gcs-bucket
        csi:
          driver: gcsfuse.run.googleapis.com
          readOnly: true
          volumeAttributes: 
            bucketName: ${BUCKET_NAME}
      containers:
      - image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${AR_REPO_NAME}/${PULL_MSG_IMAGE_NAME}
        name: pubsub-pull-msg
        env:
        - name: PROJECT_ID
          value: ${PROJECT_ID}
        - name: SUBSCRIPTION_NAME
          value: "ollama-prompts-sub"
        - name: PYTHONUNBUFFERED
          value: "1"
        resources:
          limits:
            cpu: '1'
            memory: 1Gi
      - image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${AR_REPO_NAME}/${OLLAMA_IMAGE_NAME}
        name: ollama-coordinator
        env:
        - name: OLLAMA_MODELS
          value: /mnt/models
        volumeMounts:
        - name: gcs-bucket
          mountPath: /mnt/models
        resources:
          limits:
            cpu: '6'
            nvidia.com/gpu: '1'
            memory: 16Gi

Then, use sed to substitute the environment variables into the template file, creating the final worker-pool.yaml.

sed -e "s|\${SERVICE_ACCOUNT_EMAIL}|${SERVICE_ACCOUNT_EMAIL}|g" \
     -e "s|\${BUCKET_NAME}|${BUCKET_NAME}|g" \
     -e "s|\${PULL_MSG_IMAGE_NAME}|${PULL_MSG_IMAGE_NAME}|g" \
     -e "s|\${OLLAMA_IMAGE_NAME}|${OLLAMA_IMAGE_NAME}|g" \
     -e "s|\${PROJECT_ID}|${PROJECT_ID}|g" \
     -e "s|\${REGION}|${REGION}|g" \
     -e "s|\${AR_REPO_NAME}|${AR_REPO_NAME}|g" \
     worker-pool.template.yaml > worker-pool.yaml
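If you prefer Python over sed, the same substitution can be sketched with the standard library's string.Template; the two template lines and the values below are placeholders, not your real project's:

```python
from string import Template

# A hypothetical two-line excerpt of worker-pool.template.yaml.
template = Template(
    "serviceAccountName: ${SERVICE_ACCOUNT_EMAIL}\nbucketName: ${BUCKET_NAME}"
)

# Fill in the placeholders the same way the sed command does.
rendered = template.substitute(
    SERVICE_ACCOUNT_EMAIL="ollama-worker-sa@my-project.iam.gserviceaccount.com",
    BUCKET_NAME="my-project-gemma3-4b",
)
print(rendered)
```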

Now you can deploy the worker pool:

gcloud beta run worker-pools replace worker-pool.yaml

Then test it by publishing a prompt to the topic:

gcloud pubsub topics publish ${TOPIC_NAME} --message="What is 1 + 1?"

Then view the logs. You may need to wait a minute, or you can open the worker pool's page in the Cloud console and watch the logs in real time.

gcloud alpha run worker-pools logs read "codelab-ollama-wp" --limit 10

You should see a log line similar to:

Ollama response: {"model": "gemma3:4b", "created_at": "2025-11-06T23:48:39.572079369Z", "response": "1 + 1 = 2\n", ...
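The non-streaming /api/generate reply is a single JSON object; a minimal sketch of pulling out just the generated text, using a trimmed-down payload shaped like the log line above:

```python
import json

# A trimmed-down payload shaped like Ollama's non-streaming /api/generate response.
raw = '{"model": "gemma3:4b", "response": "1 + 1 = 2\\n", "done": true}'

payload = json.loads(raw)
answer = payload["response"].strip()
print(answer)  # → 1 + 1 = 2
```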

7. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the Cloud Run documentation.

What we've covered

  • How to use Cloud Run worker pools with a Pub/Sub pull subscription
  • How to run inference with Ollama in a Cloud Run worker pool

8. Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Deleting the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.
  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Deleting individual resources

To delete the individual resources, run the following commands:

  1. Delete the Cloud Run worker pool:
gcloud beta run worker-pools delete codelab-ollama-wp --region ${REGION}
  2. Delete the GCS bucket:
gsutil -m rm -r gs://${BUCKET_NAME}
  3. Delete the Pub/Sub subscription and topic:
gcloud pubsub subscriptions delete ${SUBSCRIPTION_NAME}
gcloud pubsub topics delete ${TOPIC_NAME}
  4. Delete the Artifact Registry repository:
gcloud artifacts repositories delete ${AR_REPO_NAME} --location=${REGION} --quiet
  5. Delete the service account:
gcloud iam service-accounts delete ${SERVICE_ACCOUNT_EMAIL} --quiet

Cleaning up local files

To clean up local files, do the following:

  1. Stop the local Ollama service. If you started Ollama with ollama serve &, you can stop it by finding its process ID (PID) and then using the kill command.
    # Find the process ID of the Ollama server
    pgrep ollama
    
    # Replace <PID> with the actual process ID obtained from the previous command
    kill <PID>
    
  2. Delete the downloaded models:
rm -rf ~/.ollama/models
  3. Uninstall Ollama:

Follow the instructions on the Ollama website to uninstall Ollama from your local machine.