How to host an LLM in a sidecar for a Cloud Run function

1. Introduction

Overview

In this codelab, you'll learn how to host a gemma3:4b model in a sidecar for a Cloud Run function. When a file is uploaded to a Cloud Storage bucket, it triggers the Cloud Run function. The function sends the contents of the file to Gemma 3 in the sidecar for summarization.

What you'll learn

How to do inference using a Cloud Run function and an LLM hosted in a sidecar using GPUs
How to use Direct VPC egress configuration for a Cloud Run GPU for faster upload and serving of the model
How to use Genkit to interface with your hosted Ollama model

2. Before you begin

To use the GPU feature, you must request a quota increase for a supported region. The quota needed is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is under the Cloud Run Admin API. You can request it from the Quotas page in the Google Cloud console.

3. Setup and Requirements

Set the environment variables that will be used throughout this codelab.

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>

AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3
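
The derived variables above follow a fixed naming scheme. As a minimal sketch (the helper deriveCodelabNames and its return shape are illustrative assumptions, not part of the codelab), the same composition in TypeScript looks like this, which can help when checking what the shell will expand each variable to:

```typescript
// Hypothetical helper: mirrors how the shell variables above are composed.
interface CodelabNames {
  arRepo: string;
  functionName: string;
  bucketGemmaName: string;
  bucketDocsName: string;
  serviceAccountAddress: string;
  imageSidecar: string;
}

function deriveCodelabNames(projectId: string, region: string): CodelabNames {
  const arRepo = "codelab-crf-sidecar-gpu";
  const serviceAccount = "crf-sidecar-gpu";
  return {
    arRepo,
    functionName: "crf-sidecar-gpu",
    bucketGemmaName: `${projectId}-codelab-crf-sidecar-gpu-gemma3`,
    bucketDocsName: `${projectId}-codelab-crf-sidecar-gpu-docs`,
    serviceAccountAddress: `${serviceAccount}@${projectId}.iam.gserviceaccount.com`,
    imageSidecar: `${region}-docker.pkg.dev/${projectId}/${arRepo}/ollama-gemma3`,
  };
}

// Example with placeholder values ("my-project" and "us-central1" are assumptions).
const exampleNames = deriveCodelabNames("my-project", "us-central1");
console.log(exampleNames.imageSidecar);
```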

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="SA for codelab crf sidecar with gpu"

This service account serves as the Cloud Run function's identity. We'll also use it as the service account for the Eventarc trigger that invokes the Cloud Run function. You can create a different SA for Eventarc if you prefer.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role=roles/run.invoker

Also grant the service account permission to receive Eventarc events.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
    --role="roles/eventarc.eventReceiver"

Create a bucket that will host the Gemma 3 model. This codelab uses a regional bucket. You can use a multi-regional bucket as well.

gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME

Then give the SA access to the bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Now create a regional bucket that will store the documents you want summarized. You can use a multi-regional bucket as well, provided you update the Eventarc trigger accordingly (shown at the end of this codelab).

gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME

Then give the SA access to the docs bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Create an Artifact Registry repository for the Ollama image that will be used in the sidecar.

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for CR function and gpu sidecar" \
    --project=$PROJECT_ID

4. Download the Gemma 3 model

First, download the Gemma 3 4b model from Ollama. You can do this by installing Ollama and then running the gemma3:4b model locally.

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

Now, in a separate terminal window, run the following command to pull down the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the upper right menu bar.

ollama run gemma3:4b

Once Ollama is running, you can ask the model a question, for example:

"why is the sky blue?"

When you are done chatting with the model, exit the chat by running:

/bye

Then, in the first terminal window, stop serving Ollama locally:

# on Linux / Cloud Shell press Ctrl+C or the equivalent for your shell

Where Ollama stores the downloaded models depends on your operating system. The locations are listed here:

https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored

If you are using Cloud Workstations, the downloaded Ollama models are in /home/$USER/.ollama/models

Confirm that your models are stored there:

ls /home/$USER/.ollama/models

Now copy the gemma3:4b model to your GCS bucket.

gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME

5. Create the Cloud Run function

Create a root folder for your source code.

mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function

Create a subfolder called src. Inside that folder, create a file called index.ts.

mkdir src &&
touch src/index.ts

Update index.ts with the following code:

import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";

import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

// Initialize the Cloud Storage client
const storage = new Storage();

const ai = genkit({
    plugins: [
        ollama({
            models: [
                {
                    name: 'gemma3:4b',
                    type: 'generate', // type: 'chat' | 'generate' | undefined
                },
            ],
            serverAddress: 'http://127.0.0.1:11434', // default local address
        }),
    ],
});


// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.
cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
    console.log("---------------\nProcessing for ", cloudevent.subject, "\n---------------");

    if (cloudevent.data) {

        const data = cloudevent.data;

        if (data && data.bucket && data.name) {
            const bucketName = cloudevent.data.bucket;
            const fileName = cloudevent.data.name;
            const filePath = `${cloudevent.data.bucket}/${cloudevent.data.name}`;

            console.log(`Attempting to download: ${filePath}`);

            try {
                // Get a reference to the bucket
                const bucket = storage.bucket(bucketName!);

                // Get a reference to the file
                const file = bucket.file(fileName!);

                // Download the file's contents
                const [content] = await file.download();

                // 'content' is a Buffer. Convert it to a string.
                const fileContent = content.toString('utf8');

                console.log(`Sending file to Gemma 3 for summarization`);
                const { text } = await ai.generate({
                    model: 'ollama/gemma3:4b',
                    prompt: `Summarize the following document in just a few sentences:\n\n${fileContent}`,
                });

                console.log(text);

            } catch (error: any) {

                console.error('An error occurred:', error.message);
            }
        } else {
            console.warn("CloudEvent bucket or object name is missing!", cloudevent);
        }
    } else {
        console.warn("CloudEvent data is missing!", cloudevent);
    }
});
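
The validation and prompt construction inside the handler can also be factored into small pure functions, which makes them easy to unit test without Cloud Storage or Ollama. The following is only a sketch: toObjectRef, buildSummaryPrompt, and the StorageObjectLike interface are hypothetical names standing in for the logic and types used above.

```typescript
// Minimal stand-in for the fields of StorageObjectData used by the function.
interface StorageObjectLike {
  bucket?: string;
  name?: string;
}

interface ObjectRef {
  bucket: string;
  name: string;
  path: string; // "<bucket>/<name>", as logged by the function
}

// Hypothetical helper: returns null when the payload lacks a bucket or object name.
function toObjectRef(data: StorageObjectLike | undefined): ObjectRef | null {
  if (!data || !data.bucket || !data.name) {
    return null;
  }
  return { bucket: data.bucket, name: data.name, path: `${data.bucket}/${data.name}` };
}

// Hypothetical helper: keeps the instruction and the document visibly separate.
function buildSummaryPrompt(fileContent: string): string {
  return `Summarize the following document in just a few sentences:\n\n${fileContent}`;
}

// Example payload with placeholder names.
const sampleRef = toObjectRef({ bucket: "my-docs-bucket", name: "dogs.txt" });
console.log(sampleRef ? sampleRef.path : "payload incomplete");
```

With helpers like these, the handler body reduces to: build the reference, download the file, build the prompt, and call ai.generate.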

Now, in the cr-function directory (the same level as the src folder), create a file called package.json with the following contents:

{
    "main": "lib/index.js",
    "name": "ingress-crf-genkit",
    "version": "1.0.0",
    "scripts": {
        "build": "tsc"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "description": "",
    "dependencies": {
        "@google-cloud/functions-framework": "^3.4.0",
        "@google-cloud/storage": "^7.0.0",
        "genkit": "^1.1.0",
        "genkitx-ollama": "^1.1.0",
        "@google/events": "^5.4.0"
    },
    "devDependencies": {
        "typescript": "^5.5.2"
    }
}

In the same cr-function directory, create a tsconfig.json with the following contents:

{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}

6. Deploy the function

In this step, you'll deploy the Cloud Run function by running the following command.

Note: max instances should be set to a number less than or equal to your GPU quota.

gcloud beta run deploy $FUNCTION_NAME \
  --region $REGION \
  --function gcs-cloudevent \
  --base-image nodejs22 \
  --source . \
  --no-allow-unauthenticated \
  --max-instances 2 # this should be less than or equal to your GPU quota

7. Create the sidecar

You can learn more about hosting Ollama within a Cloud Run service at https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama

Move to the directory for your sidecar:

cd ../ollama-gemma3

Create a Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]
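
Containers in the same Cloud Run instance communicate over localhost, which is why index.ts points Genkit at http://127.0.0.1:11434. As a rough sketch of what a direct call without Genkit could look like, the following builds a request for Ollama's /api/generate endpoint. The helper buildGenerateRequest is an illustrative assumption, and this codelab does not confirm that genkitx-ollama sends exactly this payload.

```typescript
interface GenerateRequest {
  url: string;
  method: "POST";
  headers: { "Content-Type": string };
  body: string; // JSON-encoded payload
}

// Hypothetical helper: builds (but does not send) a non-streaming generate request.
function buildGenerateRequest(serverAddress: string, model: string, prompt: string): GenerateRequest {
  // Strip a trailing slash so the joined URL has exactly one separator.
  const base = serverAddress.replace(/\/$/, "");
  return {
    url: `${base}/api/generate`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  };
}

const generateReq = buildGenerateRequest("http://127.0.0.1:11434", "gemma3:4b", "why is the sky blue?");
console.log(generateReq.url);
// To send it from Node.js 18+ (not executed here):
//   const res = await fetch(generateReq.url, {
//     method: generateReq.method, headers: generateReq.headers, body: generateReq.body,
//   });
//   const { response } = await res.json();
```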

Build the image:

gcloud builds submit \
   --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
   --machine-type e2-highcpu-32

8. Update the function with the sidecar

To add a sidecar to an existing service, job, or function, you can update its YAML file to include the sidecar.

Retrieve the YAML for the Cloud Run function you just deployed by running:

gcloud run services describe $FUNCTION_NAME --region $REGION --format=export > add-sidecar-service.yaml

Now add the sidecar to the Cloud Run function by updating the YAML as follows:

Insert the following YAML fragment directly above the line runtimeClassName: run.googleapis.com/linux-base-image-update. The - image entry should align with the - image entry of the ingress container.

      - image: YOUR_IMAGE_SIDECAR:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: YOUR_BUCKET_GEMMA_NAME
          name: gcs-1

Run the following command to fill in the YAML fragment with your environment variables:

sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml
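
The sed command above is a plain text substitution. For readers who prefer to script this step, here is a minimal TypeScript sketch of the same idea. fillPlaceholders is a hypothetical helper, and unlike the sed expression (which replaces only the first match on each line) it replaces every occurrence.

```typescript
// Hypothetical helper: replaces every occurrence of each placeholder key with its value.
function fillPlaceholders(text: string, values: { [placeholder: string]: string }): string {
  let result = text;
  for (const key of Object.keys(values)) {
    // split/join performs a literal (non-regex) global replacement.
    result = result.split(key).join(values[key]);
  }
  return result;
}

const yamlFragment = [
  "      - image: YOUR_IMAGE_SIDECAR:latest",
  "              bucketName: YOUR_BUCKET_GEMMA_NAME",
].join("\n");

// The values below are placeholders standing in for $IMAGE_SIDECAR and $BUCKET_GEMMA_NAME.
const filledFragment = fillPlaceholders(yamlFragment, {
  YOUR_IMAGE_SIDECAR: "us-central1-docker.pkg.dev/my-project/codelab-crf-sidecar-gpu/ollama-gemma3",
  YOUR_BUCKET_GEMMA_NAME: "my-project-codelab-crf-sidecar-gpu-gemma3",
});
console.log(filledFragment);
```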

Your completed YAML file should look something like this:

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:    
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri: <YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URLS>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: <YOUR_BUCKET_NAME>
          name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

Now update the function with the sidecar by running the following command:

gcloud run services replace add-sidecar-service.yaml --region $REGION

Lastly, create the Eventarc trigger for the function. This command also attaches the trigger to the function.

Note: if you created a multi-regional bucket, you'll need to change the --location parameter.

gcloud eventarc triggers create my-crf-summary-trigger  \
    --location=$REGION \
    --destination-run-service=$FUNCTION_NAME  \
    --destination-run-region=$REGION \
    --event-filters="type=google.cloud.storage.object.v1.finalized" \
    --event-filters="bucket=$BUCKET_DOCS_NAME" \
    --service-account=$SERVICE_ACCOUNT_ADDRESS
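
The two --event-filters above mean the trigger forwards only events whose type and bucket both match. A minimal sketch of that matching logic follows. matchesFilters and the sample event are illustrative assumptions rather than Eventarc's actual implementation.

```typescript
// Attributes of a Cloud Storage event relevant to the filters (illustrative subset).
interface StorageEventAttrs {
  type: string;
  bucket: string;
  subject: string; // e.g. "objects/dogs.txt"
}

// Hypothetical helper: every filter must equal the corresponding event attribute.
function matchesFilters(evt: StorageEventAttrs, filters: { type: string; bucket: string }): boolean {
  return evt.type === filters.type && evt.bucket === filters.bucket;
}

const triggerFilters = {
  type: "google.cloud.storage.object.v1.finalized",
  bucket: "my-project-codelab-crf-sidecar-gpu-docs", // stands in for $BUCKET_DOCS_NAME
};

const uploadEvent: StorageEventAttrs = {
  type: "google.cloud.storage.object.v1.finalized",
  bucket: "my-project-codelab-crf-sidecar-gpu-docs",
  subject: "objects/dogs.txt",
};

console.log(matchesFilters(uploadEvent, triggerFilters)); // prints: true
```

An upload to any other bucket, or a different event type on the same bucket, would not reach the function.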

9. Test your function

Upload a plain text file for summarization. Not sure what to summarize? Ask Gemini for a quick 1-2 page description of the history of dogs. Then upload that plain text file to your $BUCKET_DOCS_NAME bucket, and the gemma3:4b model will write a summary to the function logs.

In the logs, you'll see something similar to the following:

---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.

10. Troubleshooting

Here are some errors you might encounter:

If you get an error saying PORT 8080 is in use, make sure the Dockerfile for your Ollama sidecar is using port 11434. Also make sure you are using the correct sidecar image, in case you have multiple Ollama images in your AR repo. The Cloud Run function serves on port 8080, so if you use a different Ollama image as the sidecar that also serves on 8080, you'll run into this error.
If you get the error failed to build: (error ID: 7485c5b6): function.js does not exist, make sure that your package.json and tsconfig.json files are at the same level as the src directory.
If you get the error ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements., change autoscaling.knative.dev/maxScale: '100' in your YAML file to 1, or to a value less than or equal to your GPU quota.
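
To catch the last error before running gcloud run services replace, the maxScale annotation can be checked against the allowed limit. The sketch below is an illustration under stated assumptions: findMaxScale and exceedsGpuLimit are hypothetical helpers, and the regular expression only handles the single-line annotation format shown in this codelab.

```typescript
// Hypothetical helper: extracts the autoscaling.knative.dev/maxScale value, or null if absent.
function findMaxScale(yamlText: string): number | null {
  const match = /autoscaling\.knative\.dev\/maxScale:\s*'?(\d+)'?/.exec(yamlText);
  return match ? parseInt(match[1], 10) : null;
}

// Hypothetical helper: true when maxScale is set and above the allowed limit.
function exceedsGpuLimit(yamlText: string, limit: number): boolean {
  const maxScale = findMaxScale(yamlText);
  return maxScale !== null && maxScale > limit;
}

const sampleAnnotations = "        autoscaling.knative.dev/maxScale: '100'";
console.log(findMaxScale(sampleAnnotations));       // prints: 100
console.log(exceedsGpuLimit(sampleAnnotations, 4)); // prints: true
```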