How to host an LLM in a sidecar for a Cloud Run function

1. Introduction

Overview

In this codelab, you'll learn how to host a gemma3:4b model in a sidecar for a Cloud Run function. Uploading a file to a Cloud Storage bucket triggers the Cloud Run function. The function sends the file's contents to Gemma 3 in the sidecar, which generates a summary.

What you'll learn

How to run inference from a Cloud Run function against an LLM hosted in a GPU-backed sidecar
How to use the Direct VPC egress configuration for a Cloud Run GPU to upload and serve the model faster
How to use genkit to interface with your hosted ollama model

2. Before you begin

To use the GPU feature, you must request a quota increase for a region that supports it. The quota you need is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is listed under the Cloud Run Admin API. Request the increase from the quotas page for that API in the Google Cloud console.

3. Setup and requirements

Set the environment variables that are used throughout this codelab.

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>

AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3
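
Every derived name above follows from PROJECT_ID and REGION. As a minimal sketch of that composition in TypeScript (the helper `deriveNames` and the `CodelabNames` type are hypothetical and not part of the codelab's code), the same values could be computed like this:

```typescript
// Hypothetical helper that mirrors the shell variables above.
interface CodelabNames {
  bucketGemma: string;
  bucketDocs: string;
  serviceAccountAddress: string;
  imageSidecar: string;
}

function deriveNames(projectId: string, region: string): CodelabNames {
  const arRepo = "codelab-crf-sidecar-gpu";
  const serviceAccount = "crf-sidecar-gpu";
  return {
    // Bucket names are prefixed with the project ID so they are unlikely to collide.
    bucketGemma: `${projectId}-codelab-crf-sidecar-gpu-gemma3`,
    bucketDocs: `${projectId}-codelab-crf-sidecar-gpu-docs`,
    serviceAccountAddress: `${serviceAccount}@${projectId}.iam.gserviceaccount.com`,
    // Artifact Registry image paths take the form REGION-docker.pkg.dev/PROJECT/REPO/IMAGE.
    imageSidecar: `${region}-docker.pkg.dev/${projectId}/${arRepo}/ollama-gemma3`,
  };
}
```

Calling `deriveNames("my-project", "us-central1")`, with placeholder arguments, yields the same strings the shell variables would hold for that project and region.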

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="SA for codelab crf sidecar with gpu"

The service account that serves as the Cloud Run function's identity is also used as the service account for the Eventarc trigger, which lets the trigger invoke the function. You can create a separate SA for Eventarc if you prefer.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role=roles/run.invoker

Also grant the service account permission to receive Eventarc events.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
    --role="roles/eventarc.eventReceiver"

Create a bucket to host your model. This codelab uses a regional bucket, but a multi-regional bucket works as well.

gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME

Next, grant the SA access to the bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Now create a regional bucket to store the documents you want summarized. You can use a multi-regional bucket instead, provided you update the Eventarc trigger accordingly, as noted at the end of this codelab.

gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME

Then grant the SA access to the docs bucket as well.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Create an Artifact Registry repository for the Ollama image that will be used in the sidecar.

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for CR function and gpu sidecar" \
    --project=$PROJECT_ID

4. Download the Gemma 3 model

First, download the Gemma 3 4b model from ollama. To do so, install ollama and then run the gemma3:4b model locally.

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

Now, in a separate terminal window, run the following command to pull down the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the upper-right menu bar.

ollama run gemma3:4b

Once ollama is running, ask the model a question, for example:

"why is the sky blue?"

When you are done chatting with ollama, exit the chat by running:

/bye
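
While `ollama serve` is still running, the same question can also be sent over Ollama's HTTP API at the local address that the function's genkit plugin is configured with later (port 11434). The sketch below only builds the request for the `/api/generate` endpoint. The helper name `buildGenerateRequest` is hypothetical, and the commented usage assumes a runtime with a global `fetch`, such as Node.js 18 or later.

```typescript
// Shape of an HTTP request that can be handed to fetch(url, init).
interface OllamaGenerateRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds a non-streaming request for Ollama's /api/generate.
function buildGenerateRequest(
  prompt: string,
  model: string = "gemma3:4b",
  serverAddress: string = "http://127.0.0.1:11434",
): OllamaGenerateRequest {
  return {
    url: `${serverAddress}/api/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false asks for one JSON object instead of a stream of chunks.
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Usage (requires `ollama serve` to be running locally):
//   const req = buildGenerateRequest("why is the sky blue?");
//   const res = await fetch(req.url, req.init);
//   const json = await res.json();
//   console.log(json.response); // the generated text
```

Separating request construction from the network call keeps the sketch easy to inspect without a running server.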

Then, in the first terminal window, stop serving ollama locally:

# on Linux / Cloud Shell, press Ctrl+C or the equivalent for your shell

The Ollama FAQ explains where Ollama downloads models, depending on your operating system:

https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored

If you are using Cloud Workstations, you can find the downloaded ollama models at /home/$USER/.ollama/models

Confirm that your models are stored there:

ls /home/$USER/.ollama/models

Now copy the gemma3:4b model to your GCS bucket:

gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME

5. Create the Cloud Run function

Create a root folder for your source code.

mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function

Create a subfolder named src. Inside it, create a file named index.ts:

mkdir src &&
touch src/index.ts

Update index.ts with the following code:

import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";
import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

// Initialize the Cloud Storage client
const storage = new Storage();

const ai = genkit({
    plugins: [
        ollama({
            models: [
                {
                    name: 'gemma3:4b',
                    type: 'generate', // type: 'chat' | 'generate' | undefined
                },
            ],
            serverAddress: 'http://127.0.0.1:11434', // default local address
        }),
    ],
});


// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.
cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
    console.log("---------------\nProcessing for ", cloudevent.subject, "\n---------------");

    if (cloudevent.data) {

        const data = cloudevent.data;

        if (data.bucket && data.name) {
            const bucketName = data.bucket;
            const fileName = data.name;
            const filePath = `${bucketName}/${fileName}`;

            console.log(`Attempting to download: ${filePath}`);

            try {
                // Get a reference to the bucket
                const bucket = storage.bucket(bucketName!);

                // Get a reference to the file
                const file = bucket.file(fileName!);

                // Download the file's contents
                const [content] = await file.download();

                // 'content' is a Buffer. Convert it to a string.
                const fileContent = content.toString('utf8');

                console.log(`Sending file to Gemma 3 for summarization`);
                const { text } = await ai.generate({
                    model: 'ollama/gemma3:4b',
                    prompt: `Summarize the following document in just a few sentences:\n\n${fileContent}`,
                });

                console.log(text);

            } catch (error: any) {

                console.error('An error occurred:', error.message);
            }
        } else {
            console.warn("CloudEvent bucket or object name is missing!", cloudevent);
        }
    } else {
        console.warn("CloudEvent data is missing!", cloudevent);
    }
});
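
The handler's nested checks boil down to two conditions: the event must carry data, and that data must name both a bucket and an object. A minimal sketch of the same logic as standalone helpers follows. The names `StorageObjectLike`, `resolveObject`, and `buildSummaryPrompt` are hypothetical, and the interface is a stand-in that assumes `bucket` and `name` are optional strings, as they are used in the handler above.

```typescript
// Stand-in for the two StorageObjectData fields the handler reads (assumption).
interface StorageObjectLike {
  bucket?: string;
  name?: string;
}

type Resolved =
  | { ok: true; bucketName: string; fileName: string }
  | { ok: false; reason: string };

// Hypothetical helper: collapses the handler's nested checks into one result.
function resolveObject(data: StorageObjectLike | undefined): Resolved {
  if (!data) {
    return { ok: false, reason: "CloudEvent data is missing" };
  }
  if (!data.bucket || !data.name) {
    return { ok: false, reason: "CloudEvent bucket or object name is missing" };
  }
  return { ok: true, bucketName: data.bucket, fileName: data.name };
}

// Hypothetical helper: keeps the instruction and the document visibly separate
// by putting a colon and a blank line between them.
function buildSummaryPrompt(fileContent: string): string {
  return `Summarize the following document in just a few sentences:\n\n${fileContent}`;
}
```

Returning a tagged result rather than logging inside the helper lets the caller decide how to report a skipped event, and it lets TypeScript narrow `bucketName` and `fileName` to plain strings without non-null assertions.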

Now, in the cr-function directory (the same level as src), create a file named package.json with the following content:

{
    "main": "lib/index.js",
    "name": "ingress-crf-genkit",
    "version": "1.0.0",
    "scripts": {
        "build": "tsc"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "description": "",
    "dependencies": {
        "@google-cloud/functions-framework": "^3.4.0",
        "@google-cloud/storage": "^7.0.0",
        "genkit": "^1.1.0",
        "genkitx-ollama": "^1.1.0",
        "@google/events": "^5.4.0"
    },
    "devDependencies": {
        "typescript": "^5.5.2"
    }
}

Also create a tsconfig.json file in that same directory, next to package.json, with the following content:

{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}

6. Deploy the function

In this step, you deploy the Cloud Run function by running the following command.

Note: The maximum number of instances must be less than or equal to your GPU quota.

gcloud beta run deploy $FUNCTION_NAME \
  --region $REGION \
  --function gcs-cloudevent \
  --base-image nodejs22 \
  --source . \
  --no-allow-unauthenticated \
  --max-instances 2 # this should be less than or equal to your GPU quota

7. Create the sidecar

To learn more about hosting Ollama in a Cloud Run service, see https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama

Move to the directory for your sidecar:

cd ../ollama-gemma3

Create a Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build the image:

gcloud builds submit \
   --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
   --machine-type e2-highcpu-32

8. Update the function with the sidecar

To add a sidecar to an existing service, job, or function, you can update its YAML file to include the sidecar.

Retrieve the YAML for the Cloud Run function you just deployed by running:

gcloud run services describe $FUNCTION_NAME --format=export > add-sidecar-service.yaml

Now add the sidecar to the Cloud Run function by updating the YAML as follows:

Insert the following YAML fragment directly above the runtimeClassName: run.googleapis.com/linux-base-image-update line. The - image entry must align with the - image entry of the ingress container.

      - image: YOUR_IMAGE_SIDECAR:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: YOUR_BUCKET_GEMMA_NAME
          name: gcs-1
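
The startupProbe values in the fragment determine how long the sidecar has to begin answering on port 11434 before it is treated as failed. As a rough estimate, and not an exact statement of Cloud Run's scheduling, that budget is the initial delay plus one probe period per allowed failure. The type and function names below are hypothetical.

```typescript
// The three startupProbe fields that drive the estimate.
interface StartupProbe {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Rough upper bound on startup time: the initial delay plus one probe period
// for each failure that is tolerated before the container is given up on.
function startupBudgetSeconds(probe: StartupProbe): number {
  return probe.initialDelaySeconds + probe.failureThreshold * probe.periodSeconds;
}

// Values taken from the sidecar fragment above.
const sidecarProbe: StartupProbe = {
  initialDelaySeconds: 60,
  periodSeconds: 60,
  failureThreshold: 2,
};

const budget = startupBudgetSeconds(sidecarProbe); // 60 + 2 * 60 = 180
```

If the model takes longer than this to load, raising failureThreshold or periodSeconds is the usual way to give the sidecar more time.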

Run the following command to update the YAML fragment with your environment variables:

sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml
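
A quick way to confirm the substitution worked is to scan the file for any remaining YOUR_ placeholders. The sketch below is a hypothetical helper that operates on the YAML text as a string. How the file is read, for example with Node's `fs.readFileSync`, is left out and is an assumption.

```typescript
// Hypothetical check: list any YOUR_* placeholders still present in the YAML text.
function findPlaceholders(yamlText: string): string[] {
  const matches: string[] = yamlText.match(/YOUR_[A-Z0-9_]+/g) || [];
  // De-duplicate while preserving first-seen order.
  return matches.filter((m, i) => matches.indexOf(m) === i);
}

// Usage idea: an empty array means every placeholder was replaced.
//   const leftovers = findPlaceholders(yamlTextFromFile);
//   if (leftovers.length > 0) { /* re-check the environment variables */ }
```

An empty variable in the sed command would replace a placeholder with nothing. This check does not catch that case, so it is still worth inspecting the image and bucketName lines by eye.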

Your completed YAML file should look something like this:

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:    
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri: <YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URLS>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: <YOUR_BUCKET_NAME>
          name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

Now update the function with the sidecar by running the following command.

gcloud run services replace add-sidecar-service.yaml

Lastly, create the Eventarc trigger for the function. This command also attaches the trigger to the function.

Note: If you created a multi-regional bucket, you need to change the --location parameter.

gcloud eventarc triggers create my-crf-summary-trigger  \
    --location=$REGION \
    --destination-run-service=$FUNCTION_NAME  \
    --destination-run-region=$REGION \
    --event-filters="type=google.cloud.storage.object.v1.finalized" \
    --event-filters="bucket=$BUCKET_DOCS_NAME" \
    --service-account=$SERVICE_ACCOUNT_ADDRESS
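
For reference, Eventarc delivers the event to the function as an HTTP POST in CloudEvents binary content mode, with the event attributes carried in `ce-` headers and the storage object metadata in the JSON body. The sketch below builds such a request for a finalized object. The helper name and the example ID are hypothetical, and real events carry more attributes and a much larger payload than the two fields shown here.

```typescript
// Headers and body of a binary-mode CloudEvent HTTP request.
interface CloudEventHttpRequest {
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: a trimmed-down "object finalized" event for a bucket and object.
function buildFinalizedEvent(
  bucket: string,
  objectName: string,
  id: string,
): CloudEventHttpRequest {
  return {
    headers: {
      "Content-Type": "application/json",
      "ce-specversion": "1.0",
      "ce-id": id,
      // Must match the type used in the trigger's event filter.
      "ce-type": "google.cloud.storage.object.v1.finalized",
      "ce-source": `//storage.googleapis.com/projects/_/buckets/${bucket}`,
      // The handler logs this value as cloudevent.subject.
      "ce-subject": `objects/${objectName}`,
    },
    // Only the two fields the handler reads; real payloads contain many more.
    body: JSON.stringify({ bucket, name: objectName }),
  };
}
```

A request like this can be useful for exercising the handler without uploading a file, for instance against a locally running Functions Framework process. That still requires Cloud Storage credentials and a reachable Ollama server.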

9. Test your function

Upload a plain-text file for summarization. Not sure what to summarize? Ask Gemini for a one- or two-page description of the history of dogs! Then upload that plain-text file to your $BUCKET_DOCS_NAME bucket, and the gemma3:4b model will write a summary to the function logs.

In the logs, you will see something similar to the following:

---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.

10. Troubleshooting

Here are some errors caused by typos that you might run into:

If you get the error PORT 8080 is in use, make sure the Dockerfile for your Ollama sidecar uses port 11434. If you have multiple Ollama images in your AR repo, also make sure you are using the correct sidecar image. The Cloud Run function serves on port 8080, so a sidecar built from a different Ollama image that also serves on 8080 causes this error.
If you get the error failed to build: (error ID: 7485c5b6): function.js does not exist, make sure your package.json and tsconfig.json files are at the same level as the src directory.
If you get the error ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements., change autoscaling.knative.dev/maxScale: '100' in your YAML file to 1, or to any value less than or equal to your GPU quota.
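
The last check can be done before running the replace command. The sketch below reads the maxScale annotation from the YAML text with a regular expression and compares it with a quota value that you supply. Both helper names and the `gpuQuota` parameter are hypothetical, and the regular expression assumes the annotation appears on a single line, as it does in the exported file.

```typescript
// Hypothetical helper: extract the maxScale annotation value, if present.
function readMaxScale(yamlText: string): number | undefined {
  const match = /autoscaling\.knative\.dev\/maxScale:\s*'?(\d+)'?/.exec(yamlText);
  return match ? Number(match[1]) : undefined;
}

// Hypothetical helper: true only when maxScale is set and within the GPU quota.
// A missing annotation is reported as false so that it gets looked at.
function maxScaleFitsQuota(yamlText: string, gpuQuota: number): boolean {
  const maxScale = readMaxScale(yamlText);
  return maxScale !== undefined && maxScale <= gpuQuota;
}
```

For example, a file containing `autoscaling.knative.dev/maxScale: '100'` does not fit a quota of 4, while `'1'` does.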