How to host an LLM in a sidecar for a Cloud Run function

About this codelab

Last updated: Mar 27, 2025

Written by a Googler

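1. Introduction

Overview

In this codelab, you'll learn how to host a gemma3:4b model in a sidecar for a Cloud Run function. When a file is uploaded to a Cloud Storage bucket, it triggers the Cloud Run function. The function sends the contents of the file to Gemma 3 in the sidecar for summarization.

What you'll learn

How to run inference using a Cloud Run function and an LLM hosted in a sidecar with GPUs
How to use Direct VPC egress configuration with Cloud Run GPU for faster upload and serving of the model
How to use genkit to interact with your hosted ollama model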

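2. Before you begin

To use the GPU feature, you must request a quota increase for a supported region. The quota needed is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is under the Cloud Run Admin API. The original page links directly to the quota request.

3. Setup and requirements

Set the environment variables that are used throughout this codelab.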

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>

AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="SA for codelab crf sidecar with gpu"

The service account that serves as the Cloud Run function's identity is also used as the service account for the Eventarc trigger that invokes the function. You can create a separate SA for Eventarc if you prefer.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role=roles/run.invoker

Also grant the service account access to receive Eventarc events.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
    --role="roles/eventarc.eventReceiver"

Create a bucket to host the Gemma 3 model. This codelab uses a regional bucket; you can use a multi-region bucket as well.

gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME

Then grant the SA access to the Gemma 3 bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Now create a regional bucket to store the docs you want summarized. You can use a multi-region bucket as well, provided you update the Eventarc trigger accordingly (shown at the end of this codelab).

gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME

Then grant the SA access to the Docs bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Create an Artifact Registry repository for the Ollama image that will be used in the sidecar.

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for CR function and gpu sidecar" \
    --project=$PROJECT_ID

4. Download the Gemma 3 model

First, download the Gemma 3 4b model from ollama. To do this, install ollama and run the gemma3:4b model locally.

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

In a separate terminal window, run the following command to pull down the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the upper right menu bar.

ollama run gemma3:4b

Once ollama is running, feel free to ask the model some questions, for example:

"why is the sky blue?"

When you are done chatting with ollama, exit the chat by running:

/bye

Then, in the first terminal window, stop serving ollama locally:

# on Linux / Cloud Shell press Ctrl+C or equivalent for your shell

Where Ollama downloads the models depends on your operating system. See:

https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored

If you are using Cloud Workstations, you can find the downloaded ollama models in /home/$USER/.ollama/models

Confirm that the models are stored there:

ls /home/$USER/.ollama/models

Copy the gemma3:4b model to your GCS bucket:

gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME

5. Create the Cloud Run function

Create a root folder for your source code.

mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function

Create a subfolder called src. Inside that folder, create a file called index.ts.

mkdir src &&
touch src/index.ts

Update index.ts with the following code:

import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";

// Initialize the Cloud Storage client
const storage = new Storage();

import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

const ai = genkit({
    plugins: [
        ollama({
            models: [
                {
                    name: 'gemma3:4b',
                    type: 'generate', // type: 'chat' | 'generate' | undefined
                },
            ],
            serverAddress: 'http://127.0.0.1:11434', // default local address
        }),
    ],
});


// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.

cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
    console.log("---------------\nProcessing for ", cloudevent.subject, "\n---------------");

    if (cloudevent.data) {

        const data = cloudevent.data;

        if (data && data.bucket && data.name) {
            const bucketName = cloudevent.data.bucket;
            const fileName = cloudevent.data.name;
            const filePath = `${cloudevent.data.bucket}/${cloudevent.data.name}`;

            console.log(`Attempting to download: ${filePath}`);

            try {
                // Get a reference to the bucket
                const bucket = storage.bucket(bucketName!);

                // Get a reference to the file
                const file = bucket.file(fileName!);

                // Download the file's contents
                const [content] = await file.download();

                // 'content' is a Buffer. Convert it to a string.
                const fileContent = content.toString('utf8');

                console.log(`Sending file to Gemma 3 for summarization`);
                const { text } = await ai.generate({
                    model: 'ollama/gemma3:4b',
                    prompt: `Summarize the following document in just a few sentences:\n\n${fileContent}`,
                });

                console.log(text);

            } catch (error: any) {

                console.error('An error occurred:', error.message);
            }
        } else {
            console.warn("CloudEvent bucket or object name is missing!", cloudevent);
        }
    } else {
        console.warn("CloudEvent data is missing!", cloudevent);
    }
});
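
The function above passes the whole file to the model in a single prompt. As a minimal sketch of how the prompt could be built with a size cap, the helper below trims the document and truncates it past a limit. Both buildSummaryPrompt and MAX_DOC_CHARS are hypothetical names that do not appear in the codelab, and the limit is an assumed value, not a documented Gemma 3 constraint.

```typescript
// Hypothetical helper, not part of the original index.ts.
// MAX_DOC_CHARS is an assumed budget; tune it for your model and documents.
const MAX_DOC_CHARS = 20000;

function buildSummaryPrompt(fileContent: string, maxChars: number = MAX_DOC_CHARS): string {
    // Drop leading and trailing whitespace so the cap applies to real content.
    const trimmed = fileContent.trim();

    // Keep only the first maxChars characters and mark the cut explicitly.
    const body = trimmed.length > maxChars
        ? `${trimmed.slice(0, maxChars)}\n[document truncated]`
        : trimmed;

    // Same instruction text as index.ts, separated from the document body.
    return `Summarize the following document in just a few sentences:\n\n${body}`;
}
```

In index.ts, the result could be passed as the prompt value of ai.generate in place of the inline template string.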

In the cr-function directory (the same level as src), create a file called package.json with the following contents:

{
    "main": "lib/index.js",
    "name": "ingress-crf-genkit",
    "version": "1.0.0",
    "scripts": {
        "build": "tsc"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "description": "",
    "dependencies": {
        "@google-cloud/functions-framework": "^3.4.0",
        "@google-cloud/storage": "^7.0.0",
        "genkit": "^1.1.0",
        "genkitx-ollama": "^1.1.0",
        "@google/events": "^5.4.0"
    },
    "devDependencies": {
        "typescript": "^5.5.2"
    }
}

In the same directory, create a tsconfig.json with the following contents:

{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}

6. Deploy the function

In this step, you'll deploy the Cloud Run function by running the following command.

Note: max instances should be set to a number less than or equal to your GPU quota.

gcloud beta run deploy $FUNCTION_NAME \
  --region $REGION \
  --function gcs-cloudevent \
  --base-image nodejs22 \
  --source . \
  --no-allow-unauthenticated \
  --max-instances 2 # this should be less than or equal to your GPU quota

7. Create the sidecar

You can learn more about hosting Ollama within a Cloud Run service at https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama

Move to the directory for your sidecar:

cd ../ollama-gemma3

Create a Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build the image:

gcloud builds submit \
   --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
   --machine-type e2-highcpu-32
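
Containers in the same Cloud Run instance share a network namespace, which is why index.ts reaches the sidecar at http://127.0.0.1:11434. To show roughly what travels over that local connection, the sketch below builds a request for Ollama's /api/generate endpoint. The helper name buildGenerateCall and the OllamaGenerateRequest interface are illustrative assumptions; in the codelab, genkitx-ollama builds the actual request, and it may use different fields or another endpoint.

```typescript
// Assumed request shape for illustration only; genkitx-ollama constructs the
// real request on your behalf.
interface OllamaGenerateRequest {
    model: string;
    prompt: string;
    stream: boolean;
}

// Sidecar address, matching serverAddress in index.ts.
const OLLAMA_ADDRESS = "http://127.0.0.1:11434";

// Hypothetical helper: returns the URL and fetch-style options for one
// non-streaming generation request.
function buildGenerateCall(prompt: string, model: string = "gemma3:4b") {
    const body: OllamaGenerateRequest = { model, prompt, stream: false };
    return {
        url: `${OLLAMA_ADDRESS}/api/generate`,
        init: {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify(body),
        },
    };
}

// Usage (needs a running Ollama server, so it is left commented out):
// const { url, init } = buildGenerateCall("why is the sky blue?");
// const res = await fetch(url, init);
// console.log((await res.json()).response);
```

Because the address is the loopback interface, the model endpoint is reachable only from containers in the same instance.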

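8. Update the function with the sidecar

To add a sidecar to an existing service, job, or function, you can update the YAML file to contain the sidecar.

Retrieve the YAML for the Cloud Run function that you just deployed by running: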

gcloud run services describe $FUNCTION_NAME --format=export > add-sidecar-service.yaml

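Now add the sidecar to the CRf by updating the YAML as follows:

Insert the following YAML fragment directly above the line runtimeClassName: run.googleapis.com/linux-base-image-update. The - image entry must be indented to align with the - image entry of the ingress container.

      - image: YOUR_IMAGE_SIDECAR:latest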
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: YOUR_BUCKET_GEMMA_NAME
          name: gcs-1

Run the following command to update the YAML fragment with your environment variables:

sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml

Your completed YAML file should look something like this:

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:    
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri: <YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URLS>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: <YOUR_BUCKET_NAME>
          name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

Now update the function with the sidecar by running the following command.

gcloud run services replace add-sidecar-service.yaml

Lastly, create the Eventarc trigger for the function. This command also adds the trigger to the function.

Note: if you created a multi-region bucket, you'll need to change the --location parameter.

gcloud eventarc triggers create my-crf-summary-trigger  \
    --location=$REGION \
    --destination-run-service=$FUNCTION_NAME  \
    --destination-run-region=$REGION \
    --event-filters="type=google.cloud.storage.object.v1.finalized" \
    --event-filters="bucket=$BUCKET_DOCS_NAME" \
    --service-account=$SERVICE_ACCOUNT_ADDRESS
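
With these filters, the trigger fires for every object finalized in the docs bucket, whatever its type. As a sketch under stated assumptions, the guard below shows one way the function could skip objects that are not plain text before downloading them. The StorageEventSubset interface is declared locally to keep the example self-contained (the function itself uses StorageObjectData from @google/events), and textObjectPath is a hypothetical helper that is not part of the codelab.

```typescript
// Minimal subset of the Cloud Storage event payload, declared locally so the
// sketch is self-contained.
interface StorageEventSubset {
    bucket?: string;
    name?: string;
    contentType?: string;
}

// Hypothetical guard: returns the "bucket/object" path for text objects,
// or null when the event should be skipped.
function textObjectPath(data: StorageEventSubset | undefined): string | null {
    if (!data || !data.bucket || !data.name) {
        return null; // malformed or incomplete event
    }
    const type = data.contentType ?? "";
    if (!type.startsWith("text/")) {
        return null; // e.g. images or PDFs, which cannot be inlined as prompt text
    }
    return `${data.bucket}/${data.name}`;
}
```

In index.ts, a null result could be logged with console.warn and the handler could return early instead of calling the model.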

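9. Test your function

Upload a plain text file for summarization. Don't know what to summarize? Ask Gemini for a quick 1-2 page description of the history of dogs. Then upload that plain text file to your $BUCKET_DOCS_NAME bucket so that the Gemma3:4b model writes a summary to the function logs.

In the logs, you'll see something similar to the following: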

---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.

10. Troubleshooting

Here are some errors you may encounter, several of which stem from typos:

If you get an error that PORT 8080 is in use, make sure the Dockerfile for your Ollama sidecar uses port 11434. Also make sure you are using the correct sidecar image in case you have multiple Ollama images in your AR repo. The Cloud Run function serves on port 8080; if you used a different Ollama image as the sidecar that also serves on 8080, you'll run into this error.
If you get the error failed to build: (error ID: 7485c5b6): function.js does not exist, make sure that your package.json and tsconfig.json files are at the same level as the src directory.
If you get the error ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements., then in your YAML file, change autoscaling.knative.dev/maxScale: '100' to 1 or to a value less than or equal to your GPU quota.
