How to host an LLM in a sidecar for a Cloud Run function

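1. Introduction

Overview

In this codelab, you'll learn how to host a gemma3:4b model in a sidecar for a Cloud Run function. When a file is uploaded to a Cloud Storage bucket, it triggers the Cloud Run function. The function sends the contents of the file to Gemma 3 in the sidecar for summarization.

What you'll learn

How to run inference using a Cloud Run function and an LLM hosted in a GPU-backed sidecar
How to use Direct VPC egress configuration for a Cloud Run GPU for faster upload and serving of the model
How to use Genkit to interface with your hosted Ollama model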

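2. Before you begin

To use the GPU feature, you must request a quota increase for a supported region. The quota needed is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is under the Cloud Run Admin API. You can request the increase from the quotas page in the Google Cloud console.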

3. Setup and requirements

Set the environment variables that are used throughout this codelab.

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>

AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="SA for codelab crf sidecar with gpu"

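This codelab uses the service account that serves as the Cloud Run function identity as the service account for the Eventarc trigger as well, so that the trigger can invoke the Cloud Run function. You can create a separate service account for Eventarc if you prefer.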

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role=roles/run.invoker

Also grant the service account permission to receive Eventarc events.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
    --role="roles/eventarc.eventReceiver"

Create a bucket that will host the Gemma 3 model. This codelab uses a regional bucket. You can use a multi-region bucket as well.

gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME

Then grant the service account access to the bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

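Now create a regional bucket that will store the docs you want summarized. You can use a multi-region bucket as well, provided you update the Eventarc trigger accordingly (shown at the end of this codelab).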

gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME

Then grant the service account access to the Docs bucket as well.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Create an Artifact Registry repository for the Ollama image that will be used in the sidecar:

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for CR function and gpu sidecar" \
    --project=$PROJECT_ID

4. Download the Gemma 3 model

First, download the Gemma 3 4b model from Ollama. To do this, install Ollama and then run the gemma3:4b model locally.

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

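Now, in a separate terminal window, run the following command to pull down the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the upper-right menu bar.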

ollama run gemma3:4b

Once ollama is running, feel free to ask the model some questions, for example:

"why is the sky blue?"

Once you are done chatting with ollama, exit the chat by running

/bye

Then, in the first terminal window, stop serving ollama locally:

# on Linux / Cloud Shell press Ctrl+C or equivalent for your shell

You can find where Ollama downloads the models, depending on your operating system, here:

https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored

If you are using Cloud Workstations, you can find the downloaded ollama models here: /home/$USER/.ollama/models

Confirm that your models are stored here:

ls /home/$USER/.ollama/models

Now copy the gemma3:4b model to your GCS bucket:

gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME

5. Create the Cloud Run function

Create a root folder for your source code.

mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function

Create a subfolder called src. Inside the folder, create a file called index.ts:

mkdir src &&
touch src/index.ts

Update index.ts with the following code:

import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";
import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

// Initialize the Cloud Storage client
const storage = new Storage();

const ai = genkit({
    plugins: [
        ollama({
            models: [
                {
                    name: 'gemma3:4b',
                    type: 'generate', // type: 'chat' | 'generate' | undefined
                },
            ],
            serverAddress: 'http://127.0.0.1:11434', // default local address
        }),
    ],
});


// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.
cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
    console.log("---------------\nProcessing for ", cloudevent.subject, "\n---------------");

    if (cloudevent.data) {

        const data = cloudevent.data;

        if (data && data.bucket && data.name) {
            const bucketName = cloudevent.data.bucket;
            const fileName = cloudevent.data.name;
            const filePath = `${cloudevent.data.bucket}/${cloudevent.data.name}`;

            console.log(`Attempting to download: ${filePath}`);

            try {
                // Get a reference to the bucket
                const bucket = storage.bucket(bucketName!);

                // Get a reference to the file
                const file = bucket.file(fileName!);

                // Download the file's contents
                const [content] = await file.download();

                // 'content' is a Buffer. Convert it to a string.
                const fileContent = content.toString('utf8');

                console.log(`Sending file to Gemma 3 for summarization`);
                const { text } = await ai.generate({
                    model: 'ollama/gemma3:4b',
                    prompt: `Summarize the following document in just a few sentences ${fileContent}`,
                });

                console.log(text);

            } catch (error: any) {

                console.error('An error occurred:', error.message);
            }
        } else {
            console.warn("CloudEvent bucket or object name is missing!", cloudevent);
        }
    } else {
        console.warn("CloudEvent data is missing!", cloudevent);
    }
});
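
The handler above interpolates the entire file into the prompt. For large uploads, one option is to cap how much text is sent. The following is a minimal sketch under stated assumptions: `buildSummaryPrompt` is a hypothetical helper and the default `maxChars` of 8000 is an arbitrary illustrative value; neither comes from the codelab, Genkit, or Gemma 3 documentation.

```typescript
// Hypothetical helper, not part of the codelab: builds the summarization
// prompt and truncates very long documents before they reach the model.
function buildSummaryPrompt(fileContent: string, maxChars: number = 8000): string {
  const trimmed = fileContent.trim();
  // Keep only the first maxChars characters. The limit is an arbitrary
  // illustrative value, not a documented Gemma 3 constraint.
  const body = trimmed.length > maxChars
    ? trimmed.slice(0, maxChars) + "\n[truncated]"
    : trimmed;
  // A colon and blank line separate the instruction from the document text.
  return `Summarize the following document in just a few sentences:\n\n${body}`;
}

console.log(buildSummaryPrompt("Dogs were domesticated long ago.", 10));
```

In index.ts, the returned string could be passed as the prompt field of the existing ai.generate call.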

Now, in the cr-function directory (the same level as src), create a file called package.json with the following contents:

{
    "main": "lib/index.js",
    "name": "ingress-crf-genkit",
    "version": "1.0.0",
    "scripts": {
        "build": "tsc"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "description": "",
    "dependencies": {
        "@google-cloud/functions-framework": "^3.4.0",
        "@google-cloud/storage": "^7.0.0",
        "genkit": "^1.1.0",
        "genkitx-ollama": "^1.1.0",
        "@google/events": "^5.4.0"
    },
    "devDependencies": {
        "typescript": "^5.5.2"
    }
}

Create a tsconfig.json at the same level, in the cr-function directory, with the following contents:

{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}

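6. Deploy the function

In this step, you'll deploy the Cloud Run function by running the following command.

Note: max instances should be set to a number less than or equal to your GPU quota.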

gcloud beta run deploy $FUNCTION_NAME \
  --region $REGION \
  --function gcs-cloudevent \
  --base-image nodejs22 \
  --source . \
  --no-allow-unauthenticated \
  --max-instances 2 # this should be less than or equal to your GPU quota

7. Create the sidecar

You can learn more about hosting Ollama within a Cloud Run service at https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama

Move to the directory for your sidecar:

cd ../ollama-gemma3

Create a Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST 0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS /models

# Reduce logging verbosity
ENV OLLAMA_DEBUG false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1

# Store the model weights in the container image
ENV MODEL gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build the image:

gcloud builds submit \
   --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
   --machine-type e2-highcpu-32

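8. Update the function with the sidecar

To add a sidecar to an existing service, job, or function, you can update the YAML file to contain the sidecar.

Retrieve the YAML for the Cloud Run function that you just deployed by running: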

gcloud run services describe $FUNCTION_NAME --format=export > add-sidecar-service.yaml

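Now add the sidecar to the Cloud Run function by updating the YAML as follows:

Insert the following YAML fragment directly above the runtimeClassName: run.googleapis.com/linux-base-image-update line. The - image entry should align with the - image entry of the ingress container.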

      - image: YOUR_IMAGE_SIDECAR:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: YOUR_BUCKET_GEMMA_NAME
          name: gcs-1
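
The manual edit described above can also be scripted. The following is a minimal sketch, assuming the exported YAML and the fragment are available as strings: `insertAbove` is a hypothetical helper, and the two sample strings are tiny stand-ins, not complete manifests. It does plain line-based insertion and does not parse or validate YAML, so the fragment must already carry the right indentation.

```typescript
// Hypothetical helper: inserts a snippet directly above the first line that
// contains the given marker, leaving every other line untouched.
function insertAbove(yaml: string, marker: string, snippet: string): string {
  const lines = yaml.split("\n");
  const index = lines.findIndex((line) => line.includes(marker));
  if (index === -1) {
    throw new Error(`Marker not found: ${marker}`);
  }
  // Drop trailing newlines so no empty line is inserted before the marker.
  const snippetLines = snippet.replace(/\n+$/, "").split("\n");
  lines.splice(index, 0, ...snippetLines);
  return lines.join("\n");
}

// Tiny stand-in for the exported service YAML (not a complete manifest).
const exportedYaml = [
  "      containers:",
  "      - image: function-image",
  "      runtimeClassName: run.googleapis.com/linux-base-image-update",
].join("\n");

// Tiny stand-in for the sidecar fragment.
const sidecarFragment = [
  "      - image: YOUR_IMAGE_SIDECAR:latest",
  "        name: gemma-sidecar",
].join("\n");

console.log(insertAbove(exportedYaml, "runtimeClassName:", sidecarFragment));
```

The placeholders in the fragment are left as they are, so they can still be filled in afterwards.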

Run the following command to update the YAML fragment with your environment variables:

sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml

Your completed YAML file should look something like this:

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:    
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri: <YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URLS>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: <YOUR_BUCKET_NAME>
          name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

Now update the function with the sidecar by running the following command.

gcloud run services replace add-sidecar-service.yaml

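Lastly, create the Eventarc trigger for the function. This command also adds it to the function.

Note: if you created a multi-region bucket, you'll want to change the --location parameter.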

gcloud eventarc triggers create my-crf-summary-trigger  \
    --location=$REGION \
    --destination-run-service=$FUNCTION_NAME  \
    --destination-run-region=$REGION \
    --event-filters="type=google.cloud.storage.object.v1.finalized" \
    --event-filters="bucket=$BUCKET_DOCS_NAME" \
    --service-account=$SERVICE_ACCOUNT_ADDRESS
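
Once the trigger exists, each finalized object in the docs bucket reaches the function as a CloudEvent whose data carries the bucket and object name, which is what index.ts reads before downloading the file. The sketch below isolates that check as a type guard. `StorageEventData` is a simplified stand-in type and `hasObjectRef` is a hypothetical name; neither is taken from @google/events.

```typescript
// Simplified stand-in for the two fields index.ts reads from the event data.
interface StorageEventData {
  bucket?: string;
  name?: string;
}

// Hypothetical type guard: true only when both fields are non-empty strings.
function hasObjectRef(
  data: StorageEventData | undefined
): data is StorageEventData & { bucket: string; name: string } {
  if (!data) {
    return false;
  }
  const { bucket, name } = data;
  return typeof bucket === "string" && bucket.length > 0
    && typeof name === "string" && name.length > 0;
}

// Made-up sample payload for illustration.
const sampleEventData: StorageEventData = { bucket: "my-docs-bucket", name: "dogs.txt" };
if (hasObjectRef(sampleEventData)) {
  console.log(`gs://${sampleEventData.bucket}/${sampleEventData.name}`);
}
```

A guard like this would let the handler drop the non-null assertions on the bucket and file names.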

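9. Test your function

Upload a plain text file for summarization. Don't know what to summarize? Ask Gemini for a quick 1-2 page description of the history of dogs. Then upload that plain text file to your $BUCKET_DOCS_NAME bucket so the gemma3:4b model can write a summary to the function logs.

In the logs, you'll see something similar to the following: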

---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.

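10. Troubleshooting

Here are some typo-related errors you may run into:

If you get an error that PORT 8080 is in use, make sure the Dockerfile for your Ollama sidecar uses port 11434. Also make sure you are using the correct sidecar image in case you have multiple Ollama images in your AR repo. The Cloud Run function serves on port 8080, so if the Ollama image you use as the sidecar also serves on 8080, you'll run into this error.
If you get the error failed to build: (error ID: 7485c5b6): function.js does not exist, make sure that your package.json and tsconfig.json files are at the same level as the src directory.
If you get the error ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements., then in your YAML file change autoscaling.knative.dev/maxScale: '100' to 1 or to a value less than or equal to your GPU quota.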
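
For the last error, it can help to check which value the manifest currently requests before running gcloud run services replace. The sketch below reads the autoscaling.knative.dev/maxScale annotation with a regular expression. The function names and the idea of passing your GPU quota as a number are assumptions for illustration, and the ceiling of 4 is taken from the error message above rather than from any documented constant.

```typescript
// Hypothetical helper: extracts the maxScale annotation from manifest text.
function readMaxScale(yaml: string): number | undefined {
  const match = yaml.match(/autoscaling\.knative\.dev\/maxScale:\s*'?(\d+)'?/);
  return match ? Number(match[1]) : undefined;
}

// Returns true when maxScale is set and stays within both the ceiling quoted
// in the error message (4) and the caller-supplied GPU quota.
function maxScaleIsAllowed(yaml: string, gpuQuota: number): boolean {
  const maxScale = readMaxScale(yaml);
  return maxScale !== undefined && maxScale <= Math.min(4, gpuQuota);
}

// Made-up manifest line matching the value mentioned in the error case.
const manifestLine = "        autoscaling.knative.dev/maxScale: '100'";
console.log(readMaxScale(manifestLine), maxScaleIsAllowed(manifestLine, 1));
```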