How to host an LLM in a sidecar for a Cloud Run function

About this codelab

Last updated: March 27, 2025

Written by: a Googler


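1. Introduction

Overview

In this codelab, you'll learn how to host a gemma3:4b model in a sidecar for a Cloud Run function. When a file is uploaded to a Cloud Storage bucket, it triggers the Cloud Run function. The function sends the contents of the file to Gemma 3 in the sidecar for summarization.

What you'll learn

How to run inference using a Cloud Run function and an LLM hosted in a sidecar with a GPU
How to use the Direct VPC egress configuration for a Cloud Run GPU to upload and serve the model faster
How to use genkit to interact with your hosted ollama model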

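2. Before you begin

To use the GPU feature, you must request a quota increase for a supported region. The quota you need is nvidia_l4_gpu_allocation_no_zonal_redundancy, which is listed under the Cloud Run Admin API. You can request the increase from the Quotas page in the Google Cloud console.

3. Setup and requirements

Set the environment variables that are used throughout this codelab.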

PROJECT_ID=<YOUR_PROJECT_ID>
REGION=<YOUR_REGION>

AR_REPO=codelab-crf-sidecar-gpu
FUNCTION_NAME=crf-sidecar-gpu
BUCKET_GEMMA_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-gemma3
BUCKET_DOCS_NAME=$PROJECT_ID-codelab-crf-sidecar-gpu-docs
SERVICE_ACCOUNT="crf-sidecar-gpu"
SERVICE_ACCOUNT_ADDRESS=$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
IMAGE_SIDECAR=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3

Create the service account by running this command:

gcloud iam service-accounts create $SERVICE_ACCOUNT \
  --display-name="SA for codelab crf sidecar with gpu"

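This codelab uses the same service account both as the identity of the Cloud Run function and as the service account of the Eventarc trigger that invokes the function. You can create a separate SA for Eventarc if you prefer.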

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
    --role=roles/run.invoker

Also grant the service account permission to receive Eventarc events.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:$SERVICE_ACCOUNT_ADDRESS" \
    --role="roles/eventarc.eventReceiver"

Create a bucket to host the model. This codelab uses a regional bucket; a multi-region bucket works as well.

gsutil mb -l $REGION gs://$BUCKET_GEMMA_NAME

Then grant the SA access to the bucket.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_GEMMA_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

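Now create a regional bucket to store the documents you want summarized. You can also use a multi-region bucket, provided you update the Eventarc trigger accordingly (shown at the end of this codelab).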

gsutil mb -l $REGION gs://$BUCKET_DOCS_NAME

Then grant the SA access to the docs bucket as well.

gcloud storage buckets add-iam-policy-binding gs://$BUCKET_DOCS_NAME \
--member=serviceAccount:$SERVICE_ACCOUNT_ADDRESS \
--role=roles/storage.objectAdmin

Create an Artifact Registry repository for the Ollama image that the sidecar will use.

gcloud artifacts repositories create $AR_REPO \
    --repository-format=docker \
    --location=$REGION \
    --description="codelab for CR function and gpu sidecar" \
    --project=$PROJECT_ID

4. Download the Gemma 3 model

First, download the Gemma 3 4b model from ollama. To do so, install ollama and then run the gemma3:4b model locally.

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

Now, in a separate terminal window, run the following command to pull the model. If you are using Cloud Shell, you can open an additional terminal window by clicking the plus icon in the upper-right menu bar.

ollama run gemma3:4b

Once ollama is running, feel free to ask the model some questions, for example:

"why is the sky blue?"

When you are done chatting with Ollama, exit the chat by running:

/bye
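
While ollama serve is still running in the first terminal, the same question can also be sent over Ollama's local HTTP API instead of the interactive chat. The following is a minimal sketch, assuming the default address 127.0.0.1:11434 and a Node.js version with a global fetch; the helper names buildGenerateRequest and askOllama are illustrative and not part of the codelab.

```typescript
// Sketch only: assumes `ollama serve` is listening on the default local
// address and that gemma3:4b has already been pulled.
const OLLAMA_URL = "http://127.0.0.1:11434";

interface GenerateRequest {
  model: string;
  prompt: string;
  stream: boolean;
}

// Build the JSON body for Ollama's POST /api/generate endpoint.
// stream: false asks for one JSON reply instead of a stream of chunks.
function buildGenerateRequest(model: string, prompt: string): GenerateRequest {
  return { model, prompt, stream: false };
}

// Send the prompt and return the generated text (the `response` field).
async function askOllama(prompt: string, model = "gemma3:4b"): Promise<string> {
  const res = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildGenerateRequest(model, prompt)),
  });
  if (!res.ok) {
    throw new Error(`Ollama returned HTTP ${res.status}`);
  }
  const body = (await res.json()) as { response: string };
  return body.response;
}

// Usage (requires a running Ollama server):
// askOllama("why is the sky blue?").then(console.log);
```

This is the same server the sidecar runs later in the codelab, so a request that works locally is a reasonable smoke test before moving the model to Cloud Storage.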

Then, in the first terminal window, stop serving ollama locally:

# on Linux / Cloud Shell press Ctrl+C or the equivalent for your shell

You can find where Ollama downloads models for your operating system here:

https://github.com/ollama/ollama/blob/main/docs/faq.md#where-are-models-stored

If you are using Cloud Workstations, the downloaded ollama models are in /home/$USER/.ollama/models.

Confirm that the models are stored there:

ls /home/$USER/.ollama/models

Now copy the gemma3:4b model to your GCS bucket.

gsutil cp -r /home/$USER/.ollama/models gs://$BUCKET_GEMMA_NAME

5. Create the Cloud Run function

Create a root folder for your source code.

mkdir codelab-crf-sidecar-gpu &&
cd codelab-crf-sidecar-gpu &&
mkdir cr-function &&
mkdir ollama-gemma3 &&
cd cr-function

Create a subfolder called src. Inside that folder, create a file called index.ts.

mkdir src &&
touch src/index.ts

Update index.ts with the following code:

import { cloudEvent, CloudEvent } from "@google-cloud/functions-framework";
import { StorageObjectData } from "@google/events/cloud/storage/v1/StorageObjectData";
import { Storage } from "@google-cloud/storage";

// Initialize the Cloud Storage client
const storage = new Storage();

import { genkit } from 'genkit';
import { ollama } from 'genkitx-ollama';

const ai = genkit({
    plugins: [
        ollama({
            models: [
                {
                    name: 'gemma3:4b',
                    type: 'generate', // type: 'chat' | 'generate' | undefined
                },
            ],
            serverAddress: 'http://127.0.0.1:11434', // default local address
        }),
    ],
});


// Register a CloudEvent callback with the Functions Framework that will
// be triggered by Cloud Storage.
cloudEvent("gcs-cloudevent", async (cloudevent: CloudEvent<StorageObjectData>) => {
    console.log("---------------\nProcessing for ", cloudevent.subject, "\n---------------");

    if (cloudevent.data) {

        const data = cloudevent.data;

        if (data && data.bucket && data.name) {
            // data.bucket and data.name were checked above, so reuse them here
            const bucketName = data.bucket;
            const fileName = data.name;
            const filePath = `${bucketName}/${fileName}`;

            console.log(`Attempting to download: ${filePath}`);

            try {
                // Get a reference to the bucket
                const bucket = storage.bucket(bucketName!);

                // Get a reference to the file
                const file = bucket.file(fileName!);

                // Download the file's contents
                const [content] = await file.download();

                // 'content' is a Buffer. Convert it to a string.
                const fileContent = content.toString('utf8');

                console.log(`Sending file to Gemma 3 for summarization`);
                const { text } = await ai.generate({
                    model: 'ollama/gemma3:4b',
                    prompt: `Summarize the following document in just a few sentences ${fileContent}`,
                });

                console.log(text);

            } catch (error: any) {

                console.error('An error occurred:', error.message);
            }
        } else {
            console.warn("CloudEvent bucket name or object name is missing!", cloudevent);
        }
    } else {
        console.warn("CloudEvent data is missing!", cloudevent);
    }
});
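
The function above sends every uploaded object to the model as is. A small guard in front of the ai.generate call could skip non-text objects and cap the prompt size. The following is a minimal sketch under stated assumptions: the helper names, the ObjectInfo type, and the 20000-character limit are hypothetical and not part of the codelab, and the contentType field is assumed to be populated on the Cloud Storage event payload.

```typescript
// Hypothetical helpers, not part of the codelab's index.ts.

// Subset of the Cloud Storage event payload that the checks below need.
interface ObjectInfo {
  bucket?: string;
  name?: string;
  contentType?: string;
}

// Assumed limit; tune it to the context window you configure for the model.
const MAX_PROMPT_CHARS = 20000;

// Only summarize objects that have a bucket, a name, and a text/* content type.
function isSummarizable(obj: ObjectInfo): boolean {
  if (!obj.bucket || !obj.name) return false;
  return (obj.contentType ?? "").startsWith("text/");
}

// Build the summarization prompt, keeping at most maxChars of the document.
function buildSummaryPrompt(content: string, maxChars: number = MAX_PROMPT_CHARS): string {
  const body = content.length > maxChars ? content.slice(0, maxChars) : content;
  return `Summarize the following document in just a few sentences:\n\n${body}`;
}
```

In the handler, isSummarizable(data) could replace the bucket and name check, and buildSummaryPrompt(fileContent) could be passed as the prompt.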

Now, in the cr-function directory (the same level as the src folder), create a file called package.json with the following contents:

{
    "main": "lib/index.js",
    "name": "ingress-crf-genkit",
    "version": "1.0.0",
    "scripts": {
        "build": "tsc"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "description": "",
    "dependencies": {
        "@google-cloud/functions-framework": "^3.4.0",
        "@google-cloud/storage": "^7.0.0",
        "genkit": "^1.1.0",
        "genkitx-ollama": "^1.1.0",
        "@google/events": "^5.4.0"
    },
    "devDependencies": {
        "typescript": "^5.5.2"
    }
}

Create a tsconfig.json file at the same directory level, with the following contents:

{
  "compileOnSave": true,
  "include": [
    "src"
  ],
  "compilerOptions": {
    "module": "commonjs",
    "noImplicitReturns": true,
    "outDir": "lib",
    "sourceMap": true,
    "strict": true,
    "target": "es2017",
    "skipLibCheck": true,
    "esModuleInterop": true
  }
}

6. Deploy the function

In this step, you deploy the Cloud Run function by running the following command.

Note: set max instances to a number less than or equal to your GPU quota.

gcloud beta run deploy $FUNCTION_NAME \
  --region $REGION \
  --function gcs-cloudevent \
  --base-image nodejs22 \
  --source . \
  --no-allow-unauthenticated \
  --max-instances 2 # this should be less than or equal to your GPU quota

7. Create the sidecar

You can learn more about hosting Ollama within a Cloud Run service at https://cloud.google.com/run/docs/tutorials/gpu-gemma-with-ollama

Move to the directory for your sidecar:

cd ../ollama-gemma3

Create a file called Dockerfile with the following contents:

FROM ollama/ollama:latest

# Listen on all interfaces, port 11434
ENV OLLAMA_HOST=0.0.0.0:11434

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=gemma3:4b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build the image:

gcloud builds submit \
   --tag $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO/ollama-gemma3 \
   --machine-type e2-highcpu-32

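8. Update the function with the sidecar

To add a sidecar to an existing service, job, or function, you update its YAML file to include the sidecar.

Retrieve the YAML for the Cloud Run function you just deployed by running: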

gcloud run services describe $FUNCTION_NAME --format=export > add-sidecar-service.yaml

Now add the sidecar to the CRf by updating the YAML as follows:

Insert the following YAML fragment directly above the line runtimeClassName: run.googleapis.com/linux-base-image-update. The - image entry must line up with the - image entry of the ingress container.

      - image: YOUR_IMAGE_SIDECAR:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: YOUR_BUCKET_GEMMA_NAME
          name: gcs-1

Run the following command to replace the placeholders in the YAML fragment with your environment variables:

sed -i "s|YOUR_IMAGE_SIDECAR|$IMAGE_SIDECAR|; s|YOUR_BUCKET_GEMMA_NAME|$BUCKET_GEMMA_NAME|" add-sidecar-service.yaml
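
The startupProbe values in the fragment above control how long the Ollama sidecar has to start answering on port 11434 before the revision is treated as failed. As a rough worked example, assuming the usual Kubernetes-style reading of these fields (wait for the initial delay, then probe once per period until failureThreshold failures), the budget can be estimated as follows. The StartupProbe type and function name are illustrative, and the result is an approximation, not a guaranteed platform timeout.

```typescript
// Fields copied from the sidecar's startupProbe in the YAML fragment.
interface StartupProbe {
  initialDelaySeconds: number;
  periodSeconds: number;
  failureThreshold: number;
}

// Rough upper bound on startup time: the initial delay, plus one probe
// period for each failure that is tolerated.
function startupBudgetSeconds(p: StartupProbe): number {
  return p.initialDelaySeconds + p.failureThreshold * p.periodSeconds;
}

const sidecarProbe: StartupProbe = {
  initialDelaySeconds: 60,
  periodSeconds: 60,
  failureThreshold: 2,
};

console.log(startupBudgetSeconds(sidecarProbe)); // prints 180
```

If loading the model takes longer than this in your environment, raising failureThreshold or periodSeconds is the usual adjustment.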

Your completed YAML file should look similar to this:

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:    
    run.googleapis.com/build-base-image: us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22
    run.googleapis.com/build-enable-automatic-updates: 'true'
    run.googleapis.com/build-function-target: gcs-cloudevent
    run.googleapis.com/build-id: f0122905-a556-4000-ace4-5c004a9f9ec6
    run.googleapis.com/build-image-uri: <YOUR_IMAGE_CRF>
    run.googleapis.com/build-name: <YOUR_BUILD_NAME>
    run.googleapis.com/build-source-location: <YOUR_SOURCE_LOCATION>
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: '["<YOUR_CLOUD_RUN_FUNCTION_URLS>"]'
  labels:
    cloud.googleapis.com/location: <YOUR_REGION>
  name: <YOUR_FUNCTION_NAME>
  namespace: '392295011265'
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: '4'
        run.googleapis.com/base-images: '{"":"us-central1-docker.pkg.dev/serverless-runtimes/google-22/runtimes/nodejs22"}'
        run.googleapis.com/client-name: gcloud
        run.googleapis.com/client-version: 514.0.0
        run.googleapis.com/startup-cpu-boost: 'true'
      labels:
        client.knative.dev/nonce: hzhhrhheyd
        run.googleapis.com/startupProbeType: Default
    spec:
      containerConcurrency: 80
      containers:
      - image: <YOUR_FUNCTION_IMAGE>
        ports:
        - containerPort: 8080
          name: http1
        resources:
          limits:
            cpu: 1000m
            memory: 512Mi
        startupProbe:
          failureThreshold: 1
          periodSeconds: 240
          tcpSocket:
            port: 8080
          timeoutSeconds: 240
      - image: <YOUR_SIDECAR_IMAGE>:latest
        name: gemma-sidecar
        env:
        - name: OLLAMA_FLASH_ATTENTION
          value: '1'
        resources:
          limits:
            cpu: 6000m
            nvidia.com/gpu: '1'
            memory: 16Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 60
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
      volumes:
        - csi:
            driver: gcsfuse.run.googleapis.com
            volumeAttributes:
              bucketName: <YOUR_BUCKET_NAME>
          name: gcs-1
      runtimeClassName: run.googleapis.com/linux-base-image-update
      serviceAccountName: <YOUR_SA_ADDRESS>
      timeoutSeconds: 300
  traffic:
  - latestRevision: true
    percent: 100

##############################################
# DO NOT COPY - For illustration purposes only
##############################################

Now update the function with the sidecar by running the following command.

gcloud run services replace add-sidecar-service.yaml

Lastly, create the Eventarc trigger for the function. This command also attaches the trigger to the function.

Note: if you created a multi-region bucket, you need to change the --location parameter.

gcloud eventarc triggers create my-crf-summary-trigger  \
    --location=$REGION \
    --destination-run-service=$FUNCTION_NAME  \
    --destination-run-region=$REGION \
    --event-filters="type=google.cloud.storage.object.v1.finalized" \
    --event-filters="bucket=$BUCKET_DOCS_NAME" \
    --service-account=$SERVICE_ACCOUNT_ADDRESS
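
With this trigger in place, each finalized object in the docs bucket reaches the function as a CloudEvent. The following is a sketch of that envelope, limited to the fields the function reads. The makeFinalizedEvent helper and the id value are hypothetical, and the contentType value assumes a plain text upload; real events carry more fields in data.

```typescript
// Sketch of the event envelope; values are illustrative.
interface StorageCloudEvent {
  specversion: "1.0";
  id: string;
  type: string;
  source: string;
  subject: string;
  time: string;
  datacontenttype: "application/json";
  data: { bucket: string; name: string; contentType?: string };
}

// Build a fake "object finalized" event for a given bucket and object name.
function makeFinalizedEvent(
  bucket: string,
  name: string,
  now: Date = new Date()
): StorageCloudEvent {
  return {
    specversion: "1.0",
    id: `local-${now.getTime()}`, // made-up id for local experiments
    type: "google.cloud.storage.object.v1.finalized",
    source: `//storage.googleapis.com/projects/_/buckets/${bucket}`,
    subject: `objects/${name}`, // logged by the function as "Processing for ..."
    time: now.toISOString(),
    datacontenttype: "application/json",
    data: { bucket, name, contentType: "text/plain" },
  };
}

console.log(JSON.stringify(makeFinalizedEvent("my-docs-bucket", "dogs.txt"), null, 2));
```

A payload like this can be useful for exercising the handler locally, for example by POSTing it to a locally running Functions Framework, although the download step still needs a real object in a real bucket.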

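9. Test your function

Upload a plain text file for summarization. Don't know what to summarize? Ask Gemini for a quick 1-2 page description of the history of dogs. Then upload that plain text file to your $BUCKET_DOCS_NAME bucket so that the gemma3:4b model writes a summary to the function logs.

In the logs, you'll see something similar to the following: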

---------------
Processing for objects/dogs.txt
---------------
Attempting to download: <YOUR_PROJECT_ID>-codelab-crf-sidecar-gpu-docs/dogs.txt
Sending file to Gemma 3 for summarization
...
Here's a concise summary of the document "Humanity's Best Friend":
The dog's domestication, beginning roughly 20,000-40,000 years ago, represents a unique, deeply intertwined evolutionary partnership with humans, predating the domestication of any other animal
<...>
solidifying their long-standing role as humanity's best friend.

10. Troubleshooting

Here are some errors you may run into, often caused by typos:

If you get the error PORT 8080 is in use, make sure the Dockerfile for your Ollama sidecar uses port 11434. If you have multiple Ollama images in your AR repo, also make sure you are using the correct sidecar image. The Cloud Run function serves on port 8080, so using a different Ollama image that also serves on 8080 as the sidecar causes this error.
If you get the error failed to build: (error ID: 7485c5b6): function.js does not exist, make sure your package.json and tsconfig.json files are at the same level as the src directory.
If you get the error ERROR: (gcloud.run.services.replace) spec.template.spec.node_selector: Max instances must be set to 4 or fewer in order to set GPU requirements., change autoscaling.knative.dev/maxScale: '100' in your YAML file to 1 or to a value less than or equal to your GPU quota.
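
The last error can be caught before running gcloud run services replace by checking the maxScale annotation in the exported YAML. The following is a minimal sketch with hypothetical helper names. The ceiling of 4 is taken from the error text above and is treated here as an assumption, and the regular expression only looks for the annotation line rather than parsing the YAML.

```typescript
// Hypothetical pre-flight check, not part of the codelab.

// Extract the autoscaling.knative.dev/maxScale value from YAML text, if present.
function readMaxScale(yamlText: string): number | undefined {
  const m = yamlText.match(/autoscaling\.knative\.dev\/maxScale:\s*'?(\d+)'?/);
  return m ? Number(m[1]) : undefined;
}

// Return a description of the problem, or undefined when maxScale looks fine.
function maxScaleProblem(yamlText: string, gpuQuota: number): string | undefined {
  const maxScale = readMaxScale(yamlText);
  if (maxScale === undefined) return "maxScale annotation not found";
  const limit = Math.min(4, gpuQuota); // 4 is assumed from the error message
  return maxScale > limit ? `maxScale ${maxScale} exceeds ${limit}` : undefined;
}
```

For example, passing the contents of add-sidecar-service.yaml and your GPU quota to maxScaleProblem returns undefined when the annotation is within both limits.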
