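Deploy disaggregated TPU vLLM inference on GKE with llm-d

1. Introduction

In this codelab, you deploy a high-performance disaggregated inference service on Google Kubernetes Engine (GKE) using Google Cloud TPUs. You use llm-d, an open source framework for distributed LLM serving, to separate the prefill and decode phases across multiple TPU hosts, and you set up shared KV caching and the GKE Inference Gateway.

This setup simulates a production environment for serving large models such as Qwen3-32B with high throughput and low latency.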

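What you'll do

Create a custom VPC network with an MTU optimized for accelerator traffic.
Provision a GKE cluster with the GCS Fuse CSI driver and Ray Operator add-ons.
Create eight dedicated node pools for TPU v6e slices (32 chips in total).
Configure Workload Identity and permissions for GCS access.
Deploy llm-d to manage disaggregated serving of the Qwen3-32B model.
Verify the deployment with a benchmark test.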

Architecture

![llm-d disaggregated serving architecture showing the model split across four 2x2 replicas for prefill and four 2x2 replicas for decode]
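
As a quick sanity check on the sizing in the diagram, the sketch below tallies the TPU chips that the deployment requests against the chips that the node pools provide. The replica counts and tensor-parallel degree come from the values file used later in this codelab; the variable names are illustrative and not part of llm-d.

```shell
# Chip budget for the disaggregated deployment (illustrative variable names).
PREFILL_REPLICAS=4    # prefill.replicas in values_tpu.yaml
DECODE_REPLICAS=4     # decode.replicas in values_tpu.yaml
CHIPS_PER_REPLICA=4   # tensor parallelism of 4 on one 2x2 v6e slice
NODE_POOLS=8          # one single-host ct6e-standard-4t node per pool

chips_requested=$(( (PREFILL_REPLICAS + DECODE_REPLICAS) * CHIPS_PER_REPLICA ))
chips_available=$(( NODE_POOLS * CHIPS_PER_REPLICA ))

echo "requested=${chips_requested} available=${chips_available}"
if [ "$chips_requested" -gt "$chips_available" ]; then
  echo "Not enough TPU chips for the requested replicas" >&2
fi
```

If you change either replica count, the same arithmetic tells you how many node pools the reservation needs to cover.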

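What you'll need

A Google Cloud project with billing enabled.
A Google Cloud reservation for TPU v6e resources (32 chips, ct6e-standard-4t).
A Hugging Face user access token for downloading the model weights.
Cloud Shell or a local terminal with gcloud, kubectl, and helm installed.

Estimated time: 60 minutes
Estimated cost: This lab uses substantial TPU resources and costs at least $60 to complete. Follow the cleanup steps as soon as you finish the lab.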

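2. Before you begin

Create or select a Google Cloud project

In the Google Cloud console, select or create a Google Cloud project.
Make sure that billing is enabled for your Cloud project.

Start Cloud Shell

At the top of the Google Cloud console, click Activate Cloud Shell.
Verify authentication: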

gcloud auth list

Verify the project:

gcloud config get project

Set the project if needed:

export PROJECT_ID=<YOUR_PROJECT_ID>
gcloud config set project $PROJECT_ID

Enable APIs

Enable the required Google Cloud services.

gcloud services enable \
    container.googleapis.com \
    compute.googleapis.com \
    iam.googleapis.com \
    cloudresourcemanager.googleapis.com

Set environment variables

Define the following variables in your shell. Replace <YOUR_ZONE> with your assigned TPU zone, <YOUR_RESERVATION_NAME> with your reservation ID, and <YOUR_HUGGING_FACE_TOKEN> with your token.

export PROJECT_ID=$(gcloud config get-value project)
export ZONE="<YOUR_ZONE>" # e.g., us-east5-a
export REGION=${ZONE%-*}
export NAMESPACE=default
export CLUSTER_NAME="qwen-serving-cluster"
export GVNIC_NETWORK_PREFIX="qwen-serving"
export RESERVATION_NAME="<YOUR_RESERVATION_NAME>"
export HF_TOKEN="<YOUR_HUGGING_FACE_TOKEN>"
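
The REGION variable above is derived from ZONE with the ${ZONE%-*} parameter expansion, which strips the trailing zone letter. The sketch below shows that expansion in isolation, together with a hypothetical has_placeholder guard (not part of the codelab) that you could use to catch a variable that still holds a <YOUR_...> placeholder.

```shell
# Derive a region from a zone name by stripping the final "-<suffix>",
# the same parameter expansion used for REGION above.
region_from_zone() {
  printf '%s\n' "${1%-*}"
}

# Hypothetical guard: succeed if the value still contains a "<YOUR_" placeholder.
has_placeholder() {
  case "$1" in
    *"<YOUR_"*) return 0 ;;
    *) return 1 ;;
  esac
}

region_from_zone "us-east5-a"
region_from_zone "europe-west4-b"
if has_placeholder "<YOUR_ZONE>"; then
  echo "ZONE still contains a placeholder"
fi
```

Running a check like has_placeholder "$ZONE" before the gcloud commands can save a failed provisioning attempt.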

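3. Create custom networking

Disaggregated serving requires a specific network configuration to handle the high-bandwidth traffic between prefill and decode nodes.

Create a VPC network with a large MTU (8896) for efficient accelerator communication.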

gcloud compute --project=${PROJECT_ID} \
    networks create ${GVNIC_NETWORK_PREFIX}-main \
    --subnet-mode=auto \
    --bgp-routing-mode=regional \
    --mtu=8896
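
To get a rough feel for why a larger MTU helps, the following back-of-the-envelope sketch counts the packets needed to move a fixed payload at the default VPC MTU of 1460 and at 8896. The 1 GiB payload and the 40 bytes of IPv4 plus TCP headers per packet (no options) are illustrative assumptions, not measurements from this deployment.

```shell
# Packets needed to carry a payload at a given MTU, assuming 40 bytes of
# IPv4 + TCP headers per packet (no options). Illustrative only.
packets_for() {
  payload_bytes=$1
  per_packet=$(( $2 - 40 ))
  # Ceiling division
  echo $(( (payload_bytes + per_packet - 1) / per_packet ))
}

PAYLOAD=$(( 1024 * 1024 * 1024 ))   # 1 GiB, a hypothetical transfer size
default_packets=$(packets_for "$PAYLOAD" 1460)
jumbo_packets=$(packets_for "$PAYLOAD" 8896)

echo "MTU 1460: ${default_packets} packets"
echo "MTU 8896: ${jumbo_packets} packets"
```

Fewer packets for the same payload means less per-packet processing overhead on the hosts that exchange KV cache data.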

Create a subnet for the cluster.

gcloud compute --project=${PROJECT_ID} \
    networks subnets create ${GVNIC_NETWORK_PREFIX}-tpu \
    --network=${GVNIC_NETWORK_PREFIX}-main \
    --region=${REGION} \
    --range=10.10.0.0/18

Create the proxy-only subnet required by the GKE Gateway API.

gcloud compute networks subnets create ${GVNIC_NETWORK_PREFIX}-proxy \
    --purpose=REGIONAL_MANAGED_PROXY \
    --role=ACTIVE \
    --region=${REGION} \
    --network=${GVNIC_NETWORK_PREFIX}-main \
    --range=172.16.0.0/26

Create a firewall rule to allow internal communication.

gcloud compute --project=${PROJECT_ID} firewall-rules create ${GVNIC_NETWORK_PREFIX}-allow-internal \
    --network=${GVNIC_NETWORK_PREFIX}-main \
    --allow=all \
    --source-ranges=172.16.0.0/12,10.0.0.0/8 \
    --description="Allow all internal traffic within the network."
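
The firewall rule only helps if both subnets fall inside its source ranges. The sketch below checks that containment with plain shell arithmetic; ip_to_int and cidr_within are helper names introduced here for illustration and assume well-formed IPv4 CIDR input.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The function body runs in a subshell so the IFS change does not leak.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Return 0 if CIDR $1 lies entirely inside CIDR $2.
cidr_within() {
  inner_ip=${1%/*}; inner_len=${1#*/}
  outer_ip=${2%/*}; outer_len=${2#*/}
  # The inner range must be at least as specific as the outer range.
  [ "$inner_len" -ge "$outer_len" ] || return 1
  mask=$(( (0xFFFFFFFF << (32 - outer_len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$inner_ip") & mask )) -eq $(( $(ip_to_int "$outer_ip") & mask )) ]
}

cidr_within 10.10.0.0/18 10.0.0.0/8 && echo "TPU subnet is covered by 10.0.0.0/8"
cidr_within 172.16.0.0/26 172.16.0.0/12 && echo "Proxy subnet is covered by 172.16.0.0/12"
```

If you pick different subnet ranges, rerun the check against the --source-ranges values before creating the rule.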

4. Provision the GKE cluster

Create a Standard GKE cluster configured to support GCS Fuse mounts and Ray Operator workloads.

Create the cluster:

gcloud container clusters create ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --location=${REGION} \
    --release-channel=rapid \
    --machine-type=e2-standard-4 \
    --network=${GVNIC_NETWORK_PREFIX}-main \
    --subnetwork=${GVNIC_NETWORK_PREFIX}-tpu \
    --num-nodes=1 \
    --gateway-api=standard \
    --enable-managed-prometheus \
    --enable-dataplane-v2 \
    --enable-dataplane-v2-metrics \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --addons=HttpLoadBalancing,GcsFuseCsiDriver,RayOperator,HorizontalPodAutoscaling,NodeLocalDNS \
    --enable-ip-alias

Retrieve the cluster credentials:

gcloud container clusters get-credentials ${CLUSTER_NAME} --region=${REGION}

Create the Hugging Face secret:

kubectl create secret generic llm-d-hf-token \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

5. Create reserved TPU node pools

Use your reservation to provision eight dedicated node pools for TPU v6e slices.

Run the following loop to create the eight node pools.

for i in {1..8}
do
  gcloud beta container node-pools create "tpu-v6e-single-$i" \
    --project=${PROJECT_ID} \
    --cluster=${CLUSTER_NAME} \
    --region=${REGION} \
    --node-locations=${ZONE} \
    --machine-type=ct6e-standard-4t \
    --tpu-topology=2x2 \
    --num-nodes=1 \
    --reservation-affinity=specific \
    --reservation=${RESERVATION_NAME} \
    --workload-metadata=GKE_METADATA &
done
wait  # Block until all eight background node pool creations have finished

Wait until all nodes are created and have joined the cluster. You can check their status with kubectl get nodes.
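
Instead of rerunning kubectl get nodes by hand, you could poll with a small retry helper. The sketch below is one way to do that; retry_until and tpu_nodes_ready are hypothetical helper names, and the readiness check assumes that the TPU nodes carry the cloud.google.com/gke-tpu-topology=2x2 label that the values file later uses as a node selector.

```shell
# Retry a command until it succeeds or the attempt budget is exhausted.
# Usage: retry_until <max_attempts> <sleep_seconds> <command> [args...]
retry_until() {
  max_attempts=$1; sleep_seconds=$2; shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${sleep_seconds}s..." >&2
    sleep "$sleep_seconds"
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical check: succeed once at least eight TPU nodes report Ready.
tpu_nodes_ready() {
  ready=$(kubectl get nodes -l cloud.google.com/gke-tpu-topology=2x2 --no-headers 2>/dev/null \
    | awk '$2 == "Ready" {n++} END {print n+0}')
  [ "$ready" -ge 8 ]
}

# Example: poll every 30 seconds for up to 20 minutes.
# retry_until 40 30 tpu_nodes_ready
```

The example call is commented out so that the sketch has no side effects until you choose to run it against your cluster.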

6. Deploy the llm-d service

Now deploy the llm-d framework to manage disaggregated serving.

Install Helm so that you can deploy the llm-d charts.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-4
chmod 700 get_helm.sh
./get_helm.sh

Clone llm-d and install the required dependencies.

git clone https://github.com/llm-d/llm-d.git
# When using yq alongside Helm, you almost always want Mike Farah's version (mikefarah/yq).
# Remove the most common yq installation so that the dependency script can reinstall it.
sudo rm -rf /usr/local/bin/yq
cd llm-d
./helpers/client-setup/install-deps.sh
cd ..  # Return to the parent directory; the next steps use paths relative to it

Prepare a custom values_tpu.yaml to configure disaggregated serving for your cluster.

# The heredoc delimiter is quoted so that $POD_IP and ${KV_CONFIG} are written
# literally and expanded inside the container, not by your local shell.
cat <<'EOF' > llm-d/guides/pd-disaggregation/ms-pd/values_tpu.yaml
multinode: false

# Configure accelerator type for Google TPU
accelerator:
  type: google

modelArtifacts:
  uri: "hf://Qwen/Qwen3-32B"
  size: 200Gi
  authSecretName: "llm-d-hf-token"
  name: "Qwen/Qwen3-32B"
  labels:
    llm-d.ai/inference-serving: "true"
    llm-d.ai/guide: "pd-disaggregation"
    llm-d.ai/hardware-variant: "tpu"
    llm-d.ai/hardware-vendor: "google"
    llm-d.ai/model: "Qwen3-32B"

tracing:
  enabled: true
  otlpEndpoint: "localhost:4317"
  serviceNames:
    routingProxy: "routing-proxy"
  sampling:
    sampler: "always_off"
    samplerArg: "0"

routing:
  servicePort: 8000
  proxy:
    image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.5.0
    connector: nixlv2
    secure: false

decode:
  parallelism:
    tensor: 4
  create: true
  replicas: 4
  modelCommand: custom
  extraConfig:
    nodeSelector:
      cloud.google.com/gke-tpu-accelerator: "tpu-v6e-slice"
      cloud.google.com/gke-tpu-topology: "2x2"
  monitoring:
    podmonitor:
      enabled: true
      portName: "vllm"
      path: "/metrics"
      interval: "30s"
  containers:
    - name: "vllm"
      image: "vllm/vllm-tpu:nightly"
      command:
        - "/bin/bash"
        - "-c"
        - |
            # ROLE: kv_consumer (Receives KV cache from prefill)
            KV_CONFIG="{\"kv_connector\":\"TPUConnector\", \"kv_connector_module_path\" : \"tpu_inference.distributed.tpu_connector\", \"kv_role\":\"kv_consumer\", \"kv_ip\" : \"$POD_IP\"}"
            echo "KV_CONFIG=$KV_CONFIG"
            python3 -m vllm.entrypoints.openai.api_server \
              --model "Qwen/Qwen3-32B" \
              --port 8200 \
              --tensor-parallel-size 4 \
              --kv-transfer-config "${KV_CONFIG}" \
              --disable-uvicorn-access-log \
              --max-num-seqs 256 \
              --block-size 128 \
              --gpu-memory-utilization 0.90 \
              --max-model-len 8192
      env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: TPU_SIDE_CHANNEL_PORT
          value: "9600"
        - name: TPU_KV_TRANSFER_PORT
          value: "9100"
      ports:
        - containerPort: 8200
          name: vllm
          protocol: TCP
        - containerPort: 9100
          name: tpu-kv-transfer
          protocol: TCP
        - containerPort: 9600
          name: tpu-coord
          protocol: TCP
      resources:
        limits:
          memory: 64Gi
          cpu: "16"
          google.com/tpu: 4
        requests:
          memory: 64Gi
          cpu: "16"
          google.com/tpu: 4
      mountModelVolume: true
      volumeMounts:
        - name: metrics-volume
          mountPath: /.config
        - name: shm
          mountPath: /dev/shm
        - name: torch-compile-cache
          mountPath: /.cache
      startupProbe:
        httpGet:
          path: /health
          port: vllm
        initialDelaySeconds: 15
        periodSeconds: 30
        timeoutSeconds: 5
        failureThreshold: 120
      livenessProbe:
        httpGet:
          path: /health
          port: vllm
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /v1/models
          port: vllm
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 3
  volumes:
    - name: metrics-volume
      emptyDir: {}
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "16Gi"
    - name: torch-compile-cache
      emptyDir: {}

prefill:
  parallelism:
    tensor: 4
  create: true
  replicas: 4
  modelCommand: custom
  extraConfig:
    nodeSelector:
      cloud.google.com/gke-tpu-accelerator: "tpu-v6e-slice"
      cloud.google.com/gke-tpu-topology: "2x2"
  monitoring:
    podmonitor:
      enabled: true
      portName: "vllm"
      path: "/metrics"
      interval: "30s"
  containers:
    - name: "vllm"
      image: "vllm/vllm-tpu:nightly"
      command:
        - "/bin/bash"
        - "-c"
        - |
            # ROLE: kv_producer (Sends KV cache to decode)
            KV_CONFIG="{\"kv_connector\":\"TPUConnector\", \"kv_connector_module_path\" : \"tpu_inference.distributed.tpu_connector\", \"kv_role\":\"kv_producer\", \"kv_ip\" : \"$POD_IP\"}"
            echo "KV_CONFIG=$KV_CONFIG"
            python3 -m vllm.entrypoints.openai.api_server \
              --model "Qwen/Qwen3-32B" \
              --port 8200 \
              --tensor-parallel-size 4 \
              --kv-transfer-config "${KV_CONFIG}" \
              --disable-uvicorn-access-log \
              --enable-chunked-prefill \
              --block-size 128 \
              --gpu-memory-utilization 0.90 \
              --max-model-len 8192
      env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: TPU_SIDE_CHANNEL_PORT
          value: "9600"
        - name: TPU_KV_TRANSFER_PORT
          value: "9100"
      ports:
        - containerPort: 8200
          name: vllm
          protocol: TCP
        - containerPort: 9100
          name: tpu-kv-transfer
          protocol: TCP
        - containerPort: 9600
          name: tpu-coord
          protocol: TCP
      resources:
        limits:
          memory: 64Gi
          cpu: "16"
          google.com/tpu: 4
        requests:
          memory: 64Gi
          cpu: "16"
          google.com/tpu: 4
      mountModelVolume: true
      volumeMounts:
        - name: metrics-volume
          mountPath: /.config
        - name: shm
          mountPath: /dev/shm
        - name: torch-compile-cache
          mountPath: /.cache
      startupProbe:
        httpGet:
          path: /health
          port: vllm
        initialDelaySeconds: 15
        periodSeconds: 30
        timeoutSeconds: 5
        failureThreshold: 120
      livenessProbe:
        httpGet:
          path: /health
          port: vllm
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /v1/models
          port: vllm
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 3
  volumes:
    - name: metrics-volume
      emptyDir: {}
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "16Gi"
    - name: torch-compile-cache
      emptyDir: {}
EOF
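
The container command in the values file builds the --kv-transfer-config argument as an escaped JSON string. The sketch below reproduces that quoting outside the cluster so you can see the JSON that the shell produces. The POD_IP value is a made-up address standing in for the one that Kubernetes injects from status.podIP, and KV_ROLE is a variable introduced here only to switch between the two roles.

```shell
# Stand-in for the pod IP that Kubernetes injects from status.podIP (hypothetical value).
POD_IP="10.10.0.42"
# kv_consumer for decode pods, kv_producer for prefill pods.
KV_ROLE="kv_consumer"

# Same escaping style as the container command in values_tpu.yaml.
KV_CONFIG="{\"kv_connector\":\"TPUConnector\", \"kv_connector_module_path\" : \"tpu_inference.distributed.tpu_connector\", \"kv_role\":\"${KV_ROLE}\", \"kv_ip\" : \"${POD_IP}\"}"
echo "KV_CONFIG=${KV_CONFIG}"
```

This expansion has to happen inside the container, where POD_IP is set. If the variable were expanded while the values file is being written, kv_ip would end up empty.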

Deploy the service and gateway using the llm-d Helm charts.

cd llm-d/guides/pd-disaggregation/
helmfile apply -e gke_tpu -n $NAMESPACE
kubectl apply -f ./httproute.gke.yaml

Wait for the vLLM service to start. Check the decode and prefill pod logs until 'INFO: Application startup complete.' appears.

DECODE_POD=$(kubectl get pods -l llm-d.ai/modelservice-role=decode -o jsonpath='{.items[0].metadata.name}')

# Get the first Prefill pod name
PREFILL_POD=$(kubectl get pods -l llm-d.ai/modelservice-role=prefill -o jsonpath='{.items[0].metadata.name}')

echo "Run each of these until vLLM starts successfully and then ctrl-C out"
echo "kubectl logs -f $DECODE_POD -c vllm"
echo "kubectl logs -f $PREFILL_POD -c vllm"

7. Test the deployment

The script below tests connectivity to the serving cluster through the GKE Inference Gateway and then runs a benchmark test.

Test connectivity and run the benchmark:

# The outer heredoc delimiter is quoted so that variables and command
# substitutions are evaluated when run_benchmark.sh runs, not when it is written.
cat <<'EOBF' > ./run_benchmark.sh
#!/bin/bash

# Configuration
NAMESPACE="default"
JOB_NAME="qwen3-pd-benchmark"
MODEL_NAME="Qwen/Qwen3-32B"

echo "🔍 Discovering Gateway IP..."
GATEWAY_IP=$(kubectl get gateway -n ${NAMESPACE} -o jsonpath='{.items[0].status.addresses[0].value}')

if [ -z "$GATEWAY_IP" ]; then
    echo "❌ Error: Could not find Gateway IP. Check 'kubectl get gateway'."
    exit 1
fi

TARGET_URL="http://${GATEWAY_IP}"
echo "✅ Found Gateway at: $TARGET_URL"

echo "🗑️  Cleaning up old benchmark jobs..."
kubectl delete job $JOB_NAME --ignore-not-found=true

echo "🚀 Generating and applying Benchmark Job..."
# The inner heredoc is unquoted on purpose: $JOB_NAME, $NAMESPACE, $MODEL_NAME,
# and $TARGET_URL are substituted into the Job manifest before it is applied.
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: $JOB_NAME
  namespace: $NAMESPACE
spec:
  template:
    spec:
      containers:
      - name: llm-benchmark
        image: vllm/vllm-openai:latest
        command: ["/bin/bash", "-c"]
        args:
        - |
            # 1. Download dataset
            if [ ! -f /data/sharegpt.json ]; then
              echo "Downloading ShareGPT dataset..."
              curl -L "https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json" -o /data/sharegpt.json
            fi

            # 2. Wait for Gateway readiness
            echo "Checking connectivity to $MODEL_NAME..."
            until curl -s "$TARGET_URL/v1/models" | grep -q "$MODEL_NAME"; do
              echo "Waiting for Gateway backends to sync..."
              sleep 10
            done

            # 3. Run Benchmark
            vllm bench serve \\
              --base-url "$TARGET_URL" \\
              --model "$MODEL_NAME" \\
              --dataset-name "sharegpt" \\
              --dataset-path "/data/sharegpt.json" \\
              --request-rate 80.0 \\
              --num-prompts 2000 \\
              --tokenizer "$MODEL_NAME"
        volumeMounts:
        - name: dataset-volume
          mountPath: /data
      restartPolicy: Never
      volumes:
      - name: dataset-volume
        emptyDir: {}
EOF

echo "⏳ Job submitted. Follow logs with:"
echo "kubectl logs -f job/$JOB_NAME"
EOBF

chmod a+x ./run_benchmark.sh

./run_benchmark.sh

The output shows the requests being processed along with latency metrics.
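
When reading the results, it can help to know the shortest time in which the load generator can issue every request at the configured rate. The sketch below computes that lower bound from the --num-prompts and --request-rate values used in the script; min_send_seconds is an illustrative helper, and the real run takes longer because each response still has to be generated.

```shell
# Lower bound, in seconds, on how long it takes to issue all requests.
# Uses integer arithmetic, so the request rate must be a whole number here.
min_send_seconds() {
  num_prompts=$1; request_rate=$2
  # Ceiling division
  echo $(( (num_prompts + request_rate - 1) / request_rate ))
}

min_send_seconds 2000 80
```

If the reported benchmark duration is far above this bound, the serving stack rather than the load generator is the limiting factor at that request rate.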

8. Clean up

To avoid ongoing charges to your Google Cloud account, delete the resources that you created during this codelab.

Run the following steps to clean up the resources.

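# 1. Remove the llm-d releases and the HTTPRoute
# Run these two commands from llm-d/guides/pd-disaggregation/, the directory used for the deployment.
kubectl delete -f ./httproute.gke.yaml --ignore-not-found
helmfile destroy -e gke_tpu -n $NAMESPACE

# Delete the LeaderWorkerSet and its Helm release, if you installed them
kubectl delete leaderworkerset qwen-simple-anywhere-cache --ignore-not-found
helm uninstall lws --namespace lws-system 2>/dev/null
kubectl delete namespace lws-system --ignore-not-found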

# 2. Delete GKE Node Pools
# Note: Usually deleting the cluster deletes the node pools, 
# but explicit deletion ensures it's gone before the cluster teardown begins.
for i in {1..8}
do
  gcloud container node-pools delete "tpu-v6e-single-$i" \
    --cluster="${CLUSTER_NAME}" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" --quiet
done

# 3. Delete GKE Cluster
gcloud container clusters delete "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" --quiet

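echo "--- Starting IAM and Service Account Cleanup ---"
# Note: This block applies only if you created the tpu-reader-sa service account
# and its bucket binding. Otherwise the commands fail with errors that you can ignore.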

# 1. Define the full Service Account email for clarity
SA_EMAIL="tpu-reader-sa@${PROJECT_ID}.iam.gserviceaccount.com"

# 2. Remove Storage Bucket IAM Binding
# This removes the 'objectViewer' role from the specific bucket
gcloud storage buckets remove-iam-policy-binding gs://inf-demo-model-storage \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/storage.objectViewer" --quiet

# 3. Remove Workload Identity Binding
# This severs the link between the GKE KSA and the GCP SA
gcloud iam service-accounts remove-iam-policy-binding "${SA_EMAIL}" \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:${PROJECT_ID}.svc.id.goog[default/default]" --quiet

# 4. Delete the Service Account
gcloud iam service-accounts delete "${SA_EMAIL}" --project="${PROJECT_ID}" --quiet

echo "IAM cleanup complete!"

echo "--- Starting Network and Firewall Cleanup ---"

# 4. Delete the firewall rule created in this codelab (must go before the network)
gcloud compute firewall-rules delete \
    "${GVNIC_NETWORK_PREFIX}-allow-internal" \
    --project="${PROJECT_ID}" --quiet

# 5. Delete the subnets created in this codelab (must go before the network)
gcloud compute networks subnets delete "${GVNIC_NETWORK_PREFIX}-tpu" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" --quiet

gcloud compute networks subnets delete "${GVNIC_NETWORK_PREFIX}-proxy" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" --quiet

# 6. Finally, delete the VPC Network
gcloud compute networks delete "${GVNIC_NETWORK_PREFIX}-main" \
    --project="${PROJECT_ID}" --quiet

echo "Cleanup complete!"
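
Because some cleanup commands can fail when a resource was never created or is already gone, you may want to run each one through a wrapper that keeps going and reports what failed at the end. The following is a minimal sketch of that idea; run_step, report_failures, and FAILED_STEPS are names introduced here for illustration and are not part of the codelab.

```shell
# Run a cleanup command, record a failure if it occurs, and keep going.
FAILED_STEPS=""
run_step() {
  description=$1; shift
  echo "--- ${description}"
  if ! "$@"; then
    FAILED_STEPS="${FAILED_STEPS}${description}; "
  fi
}

# Summarize the steps that failed; return non-zero if there were any.
report_failures() {
  if [ -n "$FAILED_STEPS" ]; then
    echo "Steps that need manual review: ${FAILED_STEPS}" >&2
    return 1
  fi
  echo "All cleanup steps succeeded."
}

# Example usage with a command from the steps above (commented out so the sketch has no side effects):
# run_step "Delete VPC network" gcloud compute networks delete "${GVNIC_NETWORK_PREFIX}-main" --project="${PROJECT_ID}" --quiet
# report_failures
```

Anything listed by report_failures is worth checking in the Google Cloud console, since leftover TPU or networking resources can continue to incur charges.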

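9. Congratulations

Congratulations! You successfully deployed Qwen3-32B on disaggregated v6e TPUs using llm-d and GKE.

What you've learned

How to configure custom networking for high-speed TPU traffic
How to provision reserved TPU node pools on GKE
How to deploy llm-d to separate prefill and decode workloads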