Serverless AI: EmbeddingGemma with Cloud Run

1. Introduction

In this codelab, you will learn how to deploy EmbeddingGemma, a powerful multilingual text embedding model, on Cloud Run using GPUs. You will then use this deployed service to generate embeddings for a semantic search application.

Unlike traditional Large Language Models (LLMs) that generate text, embedding models convert text into numerical vectors. These vectors are crucial for building Retrieval-Augmented Generation (RAG) systems, which allow you to find the most relevant documents for a user's query.

What you'll do

  • Containerize the EmbeddingGemma model using Ollama.
  • Deploy the container to Cloud Run with GPU acceleration.
  • Test the deployed model by generating embeddings for sample text.
  • Build a lightweight semantic search system using your deployed service.

What you'll need

  • A Google Cloud project with billing enabled.
  • Basic familiarity with Docker and the command line.

2. Before you begin

Project Setup

  1. If you don't already have a Google Account, you must create a Google Account.
    • Use a personal account instead of a work or school account. Work and school accounts may have restrictions that prevent you from enabling the APIs needed for this lab.
  2. Sign-in to the Google Cloud Console.
  3. Enable billing in the Cloud Console.
    • Completing this lab should cost less than $1 USD in Cloud resources.
    • You can follow the steps at the end of this lab to delete resources to avoid further charges.
    • New users are eligible for the $300 USD Free Trial.
  4. Create a new project or choose to reuse an existing project.
    • If you see an error about project quota, reuse an existing project or delete an existing project to create a new project.

Start Cloud Shell

Cloud Shell is a command-line environment running in Google Cloud that comes preloaded with necessary tools.

  1. Click Activate Cloud Shell at the top of the Google Cloud console.
  2. Once connected to Cloud Shell, verify your authentication:
    gcloud auth list
    
  3. Confirm your project is selected:
    gcloud config get project
    
  4. Set it if needed:
    gcloud config set project <YOUR_PROJECT_ID>
    

Enable APIs

Run this command to enable all the required APIs:

gcloud services enable \
  run.googleapis.com \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com

3. Containerize the Model

To run EmbeddingGemma serverlessly, we need to package it into a container. We will use Ollama, a lightweight framework for running LLMs, and Docker.

Create the Dockerfile

In Cloud Shell, create a new directory for your project and navigate into it:

mkdir embedding-gemma-codelab
cd embedding-gemma-codelab

Create a file named Dockerfile with the following content:

FROM ollama/ollama:latest

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=embeddinggemma:latest
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

This Dockerfile does the following:

  1. Starts from the official Ollama base image.
  2. Configures Ollama to listen on port 8080 (Cloud Run's default).
  3. The RUN command starts ollama server and downloads the embeddinggemma model during the build process so it's baked into the image.
  4. Sets OLLAMA_KEEP_ALIVE=-1 to ensure the model stays loaded in GPU memory for faster subsequent requests.

4. Build and Deploy

We will use Cloud Run source deployment to build and deploy our container in a single step. This command builds the image using Cloud Build, stores it in Artifact Registry, and deploys it to Cloud Run.

Run the following command to deploy:

gcloud run deploy embedding-gemma \
  --source . \
  --region europe-west1 \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600 \
  --labels dev-tutorial=codelab-embedding-gemma

Understanding the Configuration

  • --source . specifies the current directory as the source for the build.
  • --region europe-west1 we use a region that supports GPUs on Cloud Run.
  • --concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
  • --gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
  • --max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project's NVIDIA L4 GPU quota.
  • --no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication.
  • --no-cpu-throttling is required for enabling GPU.
  • --no-gpu-zonal-redundancy set zonal redundancy options depending on your zonal failover requirements and available quota.

Region Considerations

GPUs on Cloud Run are available in specific regions. You can check the supported regions in the documentation.

Deployment Output

After a few minutes, the deployment will complete and you will see a message like:

Service [embedding-gemma] revision [embedding-gemma-12345-abc] has been deployed and is serving 100 percent of traffic.
Service URL: https://embedding-gemma-123456789012.europe-west1.run.app

5. Test the Deployment

Since we deployed the service with --no-allow-unauthenticated, we cannot simply curl the public URL. We first need to grant ourselves permission to access the service and use the auth token in the request.

  1. Give your user account permission to call the service:
    gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
        --member=user:$(gcloud config get-value account) \
        --role='roles/run.invoker'
    
  2. Save your Google Cloud credentials and project number in environment variables for use in the request:
    export PROJECT_NUMBER=$(gcloud projects describe $GOOGLE_CLOUD_PROJECT --format="value(projectNumber)")
    export ID_TOKEN=$(gcloud auth print-identity-token)
    
  3. Run the following command to generate an embedding for "Sample text":
    curl -X POST "https://embedding-gemma-$PROJECT_NUMBER.europe-west1.run.app/api/embed" \
        -H "Authorization: Bearer $ID_TOKEN" \
        -H "Content-Type: application/json" \
        -d '{
            "model": "embeddinggemma",
            "input": "Sample text"
        }'
    

You should see a JSON response containing a vector (a long list of numbers) under the embedding field. This confirms your serverless GPU-backed embedding model is working!

The response will look similar to this: EmbeddingGemma Curl Output

Python Client

You can also use Python to interact with the service. Create a file named test_client.py:

import urllib.request
import urllib.parse
import json
import os

# 1. Setup the URL and Payload
url = f"https://embedding-gemma-{os.environ['PROJECT_NUMBER']}.europe-west1.run.app/api/embed"
payload = {
    "model": "embeddinggemma",
    "input": "Sample text"
}

# 2. Create the Request object
# Note: Providing 'data' automatically makes this a POST request
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {os.environ['ID_TOKEN']}",
        "Content-Type": "application/json"
    }
)

# 3. Execute and print the response
response = urllib.request.urlopen(req)
result = json.loads(response.read().decode("utf-8"))
print(result)

Run it:

python test_client.py

6. Build a Semantic Search Application

Now that we have a working embedding service, let's build a simple Semantic Search application. We will use the generated embeddings to find the most relevant document for a given query.

Dependencies

We will use chromadb as our vector database and the ollama client library.

uv init semantic-search --description "Semantic Search Application"
cd semantic-search
uv add chromadb ollama

Create the Search Application

Create a file named semantic_search.py with the following code:

import ollama
import chromadb
import os

# 1. Define our knowledge base
documents = [
    "Poland is a country located in Central Europe.",
    "The capital and largest city of Poland is Warsaw.",
    "Poland's official language is Polish, which is a West Slavic language.",
    "Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.",
    "Poland is famous for its traditional dish called pierogi, which are filled dumplings.",
    "The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.",
]

print("Initializing Vector Database...")
client = chromadb.Client()
collection = client.create_collection(name="docs")

# Configure the client to point to our Cloud Run proxy
ollama_client = ollama.Client(
    host=f"https://embedding-gemma-{os.environ['PROJECT_NUMBER']}.europe-west1.run.app",
    headers={'Authorization': 'Bearer ' + os.environ['ID_TOKEN']}
)

print("Generating embeddings and indexing documents...")
# 2. Store each document in the vector database
for i, d in enumerate(documents):
    # This calls our Cloud Run service to get the embedding
    response = ollama_client.embed(model="embeddinggemma", input=d)
    embeddings = response["embeddings"]
    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])

print("Indexing complete.\n")

# 3. Perform a Semantic Search
question = "What is Poland's official language?"
print(f"Query: {question}")

# Generate an embedding for the question
response = ollama_client.embed(model="embeddinggemma", input=question)

# Query the database for the most similar document
results = collection.query(
    query_embeddings=[response["embeddings"][0]],
    n_results=1
)

best_match = results["documents"][0][0]
print(f"Best Match Document: {best_match}")

Run the Application

Execute the script:

uv run semantic_search.py

You should see output similar to this:

Initializing Vector Database...
Generating embeddings and indexing documents...
Indexing complete.

Query: What is Poland's official language?
Best Match Document: Poland's official language is Polish, which is a West Slavic language.

This script demonstrates the core of a RAG system: using your serverless EmbeddingGemma service to convert both documents and queries into vectors, enabling you to find the exact information needed to answer a user's question.

7. Clean up

To avoid ongoing charges to your Google Cloud account, delete the resources created during this codelab.

Delete the Cloud Run Service

gcloud run services delete embedding-gemma --region europe-west1 --quiet

Delete the Container Image

gcloud artifacts docker images delete \
    europe-west1-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/cloud-run-source-deploy/embedding-gemma \
    --quiet

8. Congratulations

Congratulations! You have successfully deployed EmbeddingGemma on Cloud Run with GPUs and used it to power a semantic search application.

You now have a scalable, serverless foundation for building AI applications that require understanding text meaning.

What you've learned

  • How to containerize an Ollama model with Docker.
  • How to deploy a GPU-enabled service to Cloud Run.
  • How to use the deployed model for semantic search (RAG).

Next steps

Reference docs