How to host an LLM in a sidecar for a Cloud Run function

About this codelab

Last updated: Mar 27, 2025

Written by a Googler

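1. Introduction

Overview

In this codelab, you'll learn how to host a gemma3:4b model in a sidecar for a Cloud Run function. When a file is uploaded to a Cloud Storage bucket, it triggers the Cloud Run function. The function sends the contents of the file to Gemma 3 in the sidecar for summarization.

What you'll learn

How to run inference using a Cloud Run function and an LLM hosted in a sidecar with GPUs
How to use the Direct VPC egress configuration for a Cloud Run GPU for faster upload and serving of the model
How to use genkit to interface with your hosted ollama model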

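2. Before you begin

To use the GPU feature, you must request a quota increase for a supported region. The quota you need is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is listed under the Cloud Run Admin API. You can request the increase on the Quotas page in the Google Cloud console.

3. Setup and requirements

Set the environment variables that are used throughout this codelab.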

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>

AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3
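
To make it easier to see what the variables above expand to, here is a minimal TypeScript sketch that composes the same names from a project ID and region. The helper deriveNames and the CodelabNames interface are hypothetical illustrations, and "my-project" and "us-central1" are placeholder values, not real resources.

```typescript
// Hypothetical helper that mirrors the shell variables above; it is not part of the codelab.
interface CodelabNames {
  arRepo: string;
  functionName: string;
  bucketGemmaName: string;
  bucketDocsName: string;
  serviceAccountAddress: string;
  imageSidecar: string;
}

function deriveNames(projectId: string, region: string): CodelabNames {
  const arRepo = "codelab-crf-sidecar-gpu";
  const serviceAccount = "crf-sidecar-gpu";
  return {
    arRepo: arRepo,
    functionName: "crf-sidecar-gpu",
    bucketGemmaName: `${projectId}-codelab-crf-sidecar-gpu-gemma3`,
    bucketDocsName: `${projectId}-codelab-crf-sidecar-gpu-docs`,
    serviceAccountAddress: `${serviceAccount}@${projectId}.iam.gserviceaccount.com`,
    imageSidecar: `${region}-docker.pkg.dev/${projectId}/${arRepo}/ollama-gemma3`,
  };
}

// Placeholder inputs (assumed values for illustration only).
const names = deriveNames("my-project", "us-central1");
// names.imageSidecar is
// "us-central1-docker.pkg.dev/my-project/codelab-crf-sidecar-gpu/ollama-gemma3"
```

Both bucket names and the image path embed the project ID, so a typo in PROJECT_ID propagates to every later command.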

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="SA for codelab crf sidecar with gpu"

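This service account serves as the Cloud Run function's identity, and the Eventarc trigger also uses it to invoke the Cloud Run function. You can create a separate SA for Eventarc if you prefer. Grant the service account the Cloud Run Invoker role: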

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role=roles/run.invoker

Also grant the service account access to receive Eventarc events.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
    --role="roles/eventarc.eventReceiver"

Now create a bucket to host the Gemma 3 model. This codelab uses a regional bucket. You can use a multi-region bucket as well.

gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME

Then grant the SA access to the bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

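Next, create a regional bucket to store the documents you want summarized. You can use a multi-region bucket as well, provided you update the Eventarc trigger accordingly (shown at the end of this codelab).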

gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME

Then grant the SA access to the docs bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Create an Artifact Registry repository for the Ollama image that will be used in the sidecar:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for CR function and gpu sidecar" \
    --project=$PROJECT_ID

4. Download the Gemma 3 model

First, download the Gemma 3 4b model from ollama. You can do this by installing ollama and then running the gemma3:4b model locally.

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

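Now, in a separate terminal window, run the following command to pull down the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the upper-right menu bar.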

ollama run gemma3:4b

Once ollama is running, you can ask the model some questions, for example:

"why is the sky blue?"

When you are done chatting with ollama, exit the chat by running:

/bye

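Then, in the first terminal window, stop serving ollama locally:

# On Linux / Cloud Shell, press Ctrl+C (or the equivalent for your shell)

Where Ollama stores downloaded models depends on your operating system. You can look up the location here: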

https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored

If you are using Cloud Workstations, you can find the downloaded ollama models in /home/$USER/.ollama/models

Confirm that your models are stored there:

ls /home/$USER/.ollama/models

Now copy the gemma3:4b model to your GCS bucket:

gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME

5. Create the Cloud Run function

Create a root folder for your source code.

mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function

Create a subfolder called src. Inside that folder, create a file called index.ts

mkdir src &&
touch src/index.ts

Update index.ts with the following code:

import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";
import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

// Initialize the Cloud Storage client
const storage = new Storage();

const ai = genkit({
    plugins: [
        ollama({
            models: [
                {
                    name: 'gemma3:4b',
                    type: 'generate', // type: 'chat' | 'generate' | undefined
                },
            ],
            serverAddress: 'http://127.0.0.1:11434', // default local address
        }),
    ],
});


// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.
cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
    console.log("---------------\nProcessing for ", cloudevent.subject, "\n---------------");

    if (cloudevent.data) {

        const data = cloudevent.data;

        if (data && data.bucket && data.name) {
            // Both fields are known to be non-empty strings after the check above
            const bucketName = data.bucket;
            const fileName = data.name;
            const filePath = `${bucketName}/${fileName}`;

            console.log(`Attempting to download: ${filePath}`);

            try {
                // Get a reference to the bucket
                const bucket = storage.bucket(bucketName);

                // Get a reference to the file
                const file = bucket.file(fileName);

                // Download the file's contents
                const [content] = await file.download();

                // 'content' is a Buffer. Convert it to a string.
                const fileContent = content.toString('utf8');

                console.log(`Sending file to Gemma 3 for summarization`);
                const { text } = await ai.generate({
                    model: 'ollama/gemma3:4b',
                    prompt: `Summarize the following document in just a few sentences ${fileContent}`,
                });

                console.log(text);

            } catch (error: any) {

                console.error('An error occurred:', error.message);
            }
        } else {
            console.warn("CloudEvent bucket or object name is missing!", cloudevent);
        }
    } else {
        console.warn("CloudEvent data is missing!", cloudevent);
    }
});
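
The function above sends the whole file to the model in a single prompt. As a hedged sketch of one way to keep very large uploads from producing an oversized prompt, the helper below trims the document and caps its length before building the prompt string. The name buildSummaryPrompt and the 20000-character cap are assumptions made for illustration; they are not part of the codelab or of genkit.

```typescript
// Assumed cap for illustration; tune it to your model's context window and documents.
const MAX_DOCUMENT_CHARS = 20000;

// Builds the summarization prompt, truncating the document to at most maxChars characters.
function buildSummaryPrompt(fileContent: string, maxChars: number = MAX_DOCUMENT_CHARS): string {
  const trimmed = fileContent.trim();
  const body = trimmed.length > maxChars ? trimmed.slice(0, maxChars) : trimmed;
  // A colon and blank line separate the instruction from the document text.
  return `Summarize the following document in just a few sentences:\n\n${body}`;
}

// Example: a tiny cap makes the truncation visible.
const examplePrompt = buildSummaryPrompt("  The history of dogs is long.  ", 11);
// examplePrompt ends with "The history"
```

In index.ts, the result could be passed as the prompt value in the ai.generate call in place of the inline template string.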

Next, in the cr-function directory (the same level as the src folder), create a file called package.json with the following contents:

{
    "main": "lib/index.js",
    "name": "ingress-crf-genkit",
    "version": "1.0.0",
    "scripts": {
        "build": "tsc"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "description": "",
    "dependencies": {
        "@google-cloud/functions-framework": "^3.4.0",
        "@google-cloud/storage": "^7.0.0",
        "genkit": "^1.1.0",
        "genkitx-ollama": "^1.1.0",
        "@google/events": "^5.4.0"
    },
    "devDependencies": {
        "typescript": "^5.5.2"
    }
}

In the same directory, create a tsconfig.json file with the following contents:

{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}

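6. Deploy the function

In this step, you'll deploy the Cloud Run function by running the following command.

Note: max instances should be set to a number less than or equal to your GPU quota.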

gcloud beta run deploy $FUNCTION_NAME \
  --region $REGION \
  --function gcs-cloudevent \
  --base-image nodejs22 \
  --source . \
  --no-allow-unauthenticated \
  --max-instances 2 # this should be less than or equal to your GPU quota

7. Create the sidecar

You can learn more about hosting Ollama within a Cloud Run service at https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama

Move to the directory for your sidecar:

cd ../ollama-gemma3

Create a Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build the image:

gcloud builds submit \
   --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
   --machine-type e2-highcpu-32

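8. Update the function with the sidecar

To add a sidecar to an existing service, job, or function, you can update its YAML file to include the sidecar.

Retrieve the YAML for the Cloud Run function that you just deployed by running: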

gcloud run services describe $FUNCTION_NAME --region $REGION --format=export > add-sidecar-service.yaml

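Now add the sidecar to the CRf by updating the YAML as follows:

Insert the following YAML fragment directly above the runtimeClassName: run.googleapis.com/linux-base-image-update line. The - image entry should align with the - image entry of the ingress container.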

      - image: YOUR_IMAGE_SIDECAR:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: YOUR_BUCKET_GEMMA_NAME
          name: gcs-1
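
The startupProbe values in the fragment determine roughly how long the sidecar has to start answering on port 11434 before the revision is treated as failed. The sketch below works through that arithmetic as an upper-bound estimate, assuming the probe first fires after initialDelaySeconds and the container is given up on after failureThreshold probe periods without a success. The StartupProbe interface and startupBudgetSeconds function are hypothetical names used only for illustration; exact timing is decided by the platform.

```typescript
// Subset of the probe fields used in the YAML fragment above.
interface StartupProbe {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Upper-bound estimate (an assumption, not an exact platform guarantee):
// initial delay plus one full period for each allowed failed probe.
function startupBudgetSeconds(probe: StartupProbe): number {
  return probe.initialDelaySeconds + probe.failureThreshold * probe.periodSeconds;
}

// Values copied from the sidecar fragment.
const sidecarProbe: StartupProbe = {
  initialDelaySeconds: 60,
  periodSeconds: 60,
  failureThreshold: 2,
};
const budgetSeconds = startupBudgetSeconds(sidecarProbe); // 60 + 2 * 60 = 180
```

If loading the model from the mounted bucket takes longer than this budget, raising failureThreshold or periodSeconds is one option to consider.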

Run the following command to fill in the YAML fragment with your environment variables:

sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml
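
For readers less familiar with sed, the substitution above can be sketched in TypeScript as a plain placeholder replacement. The fillPlaceholders function is a hypothetical illustration, and the sample values are placeholders rather than real resources. One difference to keep in mind: this sketch replaces every occurrence of a placeholder, whereas the sed expression (without the g flag) replaces only the first match on each line. For this file the result is the same because each placeholder appears at most once per line.

```typescript
// Replaces each placeholder key with its value everywhere in the text.
function fillPlaceholders(yamlText: string, values: { [placeholder: string]: string }): string {
  let result = yamlText;
  for (const placeholder of Object.keys(values)) {
    // split/join performs a literal replacement, so no regex escaping is needed.
    result = result.split(placeholder).join(values[placeholder]);
  }
  return result;
}

// Two lines taken from the sidecar fragment, with illustrative values.
const fragment = "- image: YOUR_IMAGE_SIDECAR:latest\n  bucketName: YOUR_BUCKET_GEMMA_NAME";
const filled = fillPlaceholders(fragment, {
  YOUR_IMAGE_SIDECAR: "us-central1-docker.pkg.dev/my-project/codelab-crf-sidecar-gpu/ollama-gemma3",
  YOUR_BUCKET_GEMMA_NAME: "my-project-codelab-crf-sidecar-gpu-gemma3",
});
```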

Your completed YAML file should look something like this:

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:    
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri: <YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URLS>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: <YOUR_BUCKET_NAME>
          name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

Now update the function with the sidecar by running the following command.

gcloud run services replace add-sidecar-service.yaml

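Lastly, create the Eventarc trigger for the function. This command also attaches the trigger to the function.

Note: if you created a multi-region bucket, change the --location parameter accordingly.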

gcloud eventarc triggers create my-crf-summary-trigger  \
    --location=$REGION \
    --destination-run-service=$FUNCTION_NAME  \
    --destination-run-region=$REGION \
    --event-filters="type=google.cloud.storage.object.v1.finalized" \
    --event-filters="bucket=$BUCKET_DOCS_NAME" \
    --service-account=$SERVICE_ACCOUNT_ADDRESS
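
As a simplified mental model of the two --event-filters flags above, the sketch below treats the trigger as firing only when both the event type and the bucket match exactly. The StorageEvent and TriggerFilters interfaces and the matchesTrigger function are hypothetical; real Eventarc evaluates CloudEvent attributes on the service side, so treat this as an illustration of the filter logic under that assumption, not as Eventarc's implementation.

```typescript
// Minimal shape of an incoming event, for illustration only.
interface StorageEvent {
  type: string;
  bucket: string;
}

// The two filters configured on the trigger.
interface TriggerFilters {
  type: string;
  bucket: string;
}

// Assumed semantics: every filter must match exactly for the event to be delivered.
function matchesTrigger(event: StorageEvent, filters: TriggerFilters): boolean {
  return event.type === filters.type && event.bucket === filters.bucket;
}

// Placeholder bucket names (assumed values).
const triggerFilters: TriggerFilters = {
  type: "google.cloud.storage.object.v1.finalized",
  bucket: "my-project-codelab-crf-sidecar-gpu-docs",
};
const uploadToDocs = matchesTrigger(
  { type: "google.cloud.storage.object.v1.finalized", bucket: "my-project-codelab-crf-sidecar-gpu-docs" },
  triggerFilters
); // true
const uploadToGemma = matchesTrigger(
  { type: "google.cloud.storage.object.v1.finalized", bucket: "my-project-codelab-crf-sidecar-gpu-gemma3" },
  triggerFilters
); // false
```

Under this model, uploads to the model bucket do not invoke the function; only objects finalized in the docs bucket do.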

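9. Test your function

Upload a plain text file for summarization. Don't know what to summarize? Ask Gemini for a quick 1-2 page description of the history of dogs! Then upload that plain text file to your $BUCKET_DOCS_NAME bucket so that the gemma3:4b model writes a summary to the function logs.

In the logs, you'll see something similar to the following: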

---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.

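10. Troubleshooting

Here are some errors, often caused by typos, that you might run into:

If you get a PORT 8080 is in use error, make sure the Dockerfile for your Ollama sidecar uses port 11434. If you have multiple Ollama images in your AR repo, also make sure you are using the correct sidecar image. The Cloud Run function serves on port 8080, so if you use a different Ollama image as the sidecar and it also serves on 8080, you'll run into this error.
If you get a failed to build: (error ID: 7485c5b6): function.js does not exist error, make sure your package.json and tsconfig.json files are at the same level as the src directory.
If you get an ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements. error, change autoscaling.knative.dev/maxScale: '100' in your YAML file to 1, or to a value less than or equal to your GPU quota.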
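
For the max instances error, a quick check of the exported YAML before running gcloud run services replace can catch the problem early. The sketch below reads the autoscaling.knative.dev/maxScale annotation with a regular expression and compares it against a limit. The function names readMaxScale and maxScaleWithinLimit are hypothetical, and the regex assumes the annotation appears on a single line in the form shown in this codelab.

```typescript
// Returns the maxScale annotation value, or null if the annotation is not found.
function readMaxScale(yamlText: string): number | null {
  const match = /autoscaling\.knative\.dev\/maxScale:\s*'?(\d+)'?/.exec(yamlText);
  return match ? Number(match[1]) : null;
}

// True only when the annotation exists and does not exceed the given limit.
function maxScaleWithinLimit(yamlText: string, limit: number): boolean {
  const maxScale = readMaxScale(yamlText);
  return maxScale !== null && maxScale <= limit;
}

// Example lines in the same form as the exported service YAML.
const tooHigh = "        autoscaling.knative.dev/maxScale: '100'";
const acceptable = "        autoscaling.knative.dev/maxScale: '4'";
const tooHighOk = maxScaleWithinLimit(tooHigh, 4); // false
const acceptableOk = maxScaleWithinLimit(acceptable, 4); // true
```

The limit passed in would be the smaller of the value named in the error message and your GPU quota.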
