Deploy Disaggregated TPU vLLM Inferencing with llm-d on GKE

1. Introduction

In this codelab, you will learn how to deploy high-performance, disaggregated inferencing services on Google Kubernetes Engine (GKE) using Google Cloud TPUs. You will use llm-d, an open-source framework for distributed LLM serving, to separate prefill and decode phases across multiple TPU hosts, set up shared KV caching, and GKE Inference Gateway.

This setup simulates a production environment for serving large models like Qwen3-32B with high throughput and low latency.

What you'll do

  • Create a custom VPC network with optimized MTU for accelerator traffic.
  • Provision a GKE cluster with GCS Fuse CSI driver and Ray Operator addons.
  • Create 8 dedicated node pools for TPU v6e slices (32 chips total).
  • Configure Workload Identity and permissions for GCS access.
  • Deploy llm-d to manage disaggregated serving of the Qwen3-32B model.
  • Verify the deployment with a benchmark test.

Architecture

![llm-d disaggregated serving architecture showing model split into 4 2x2 replicas of prefill and the same for decode]

What you'll need

  • A Google Cloud project with billing enabled.
  • A Google Cloud Reservation for TPU v6e resources (32 chips, ct6e-standard-4t).
  • A Hugging Face User Access Token to download model weights.
  • Cloud Shell or a local terminal with gcloud, kubectl, and helm installed.
  • Estimated Duration: 60 minutes
  • Estimated Cost: This lab involves significant TPU resources and will cost a minimum of $60 to finish the project. Ensure you follow the clean-up steps immediately after completing the exercises.

2. Before you begin

Create or Select a Google Cloud Project

  1. In the Google Cloud Console, select or create a Google Cloud project.
  2. Ensure that billing is enabled for your Cloud project.

Start Cloud Shell

  1. Click Activate Cloud Shell at the top of the Google Cloud console.
  2. Verify authentication:
gcloud auth list
  1. Confirm your project:
gcloud config get project
  1. Set it if needed:
export PROJECT_ID=<YOUR_PROJECT_ID>
gcloud config set project $PROJECT_ID

Enable APIs

Enable the required Google Cloud services:

gcloud services enable \
    container.googleapis.com \
    compute.googleapis.com \
    iam.googleapis.com \
    cloudresourcemanager.googleapis.com

Set Environment Variables

Define the following variables in your shell. Replace <YOUR_ZONE> with your allocated TPU zone, <YOUR_RESERVATION_NAME> with your reservation ID, and <YOUR_HUGGING_FACE_TOKEN> with your token.

export PROJECT_ID=$(gcloud config get-value project)
export ZONE="<YOUR_ZONE>" # e.g., us-east5-a
export REGION=${ZONE%-*}
export NAMESPACE=default
export CLUSTER_NAME="qwen-serving-cluster"
export GVNIC_NETWORK_PREFIX="qwen-serving"
export RESERVATION_NAME="<YOUR_RESERVATION_NAME>"
export HF_TOKEN="<YOUR_HUGGING_FACE_TOKEN>"

3. Create Custom Networking

Disaggregated serving requires specific network configurations to handle high-bandwidth traffic between prefill and decode nodes.

  1. Create the VPC network with a large MTU (8896) for efficient accelerator communication:
    gcloud compute --project=${PROJECT_ID} \
        networks create ${GVNIC_NETWORK_PREFIX}-main \
        --subnet-mode=auto \
        --bgp-routing-mode=regional \
        --mtu=8896
    
  2. Create the subnet for the cluster:
    gcloud compute --project=${PROJECT_ID} \
        networks subnets create ${GVNIC_NETWORK_PREFIX}-tpu \
        --network=${GVNIC_NETWORK_PREFIX}-main \
        --region=${REGION} \
        --range=10.10.0.0/18
    
  3. Create a proxy-only subnet required for GKE Gateway API:
    gcloud compute networks subnets create ${GVNIC_NETWORK_PREFIX}-proxy \
        --purpose=REGIONAL_MANAGED_PROXY \
        --role=ACTIVE \
        --region=${REGION} \
        --network=${GVNIC_NETWORK_PREFIX}-main \
        --range=172.16.0.0/26
    
  4. Create firewall rules to allow internal communication:
    gcloud compute --project=${PROJECT_ID} firewall-rules create ${GVNIC_NETWORK_PREFIX}-allow-internal \
        --network=${GVNIC_NETWORK_PREFIX}-main \
        --allow=all \
        --source-ranges=172.16.0.0/12,10.0.0.0/8 \
        --description="Allow all internal traffic within the network."
    

4. Provision GKE Cluster

Create a Standard GKE cluster configured to support GCS Fuse mounts and Ray Operator workloads.

  1. Create the cluster:
    gcloud container clusters create ${CLUSTER_NAME} \
        --project=${PROJECT_ID} \
        --location=${REGION} \
        --release-channel=rapid \
        --machine-type=e2-standard-4 \
        --network=${GVNIC_NETWORK_PREFIX}-main \
        --subnetwork=${GVNIC_NETWORK_PREFIX}-tpu \
        --num-nodes=1 \
        --gateway-api=standard \
        --enable-managed-prometheus \
        --enable-dataplane-v2 \
        --enable-dataplane-v2-metrics \
        --workload-pool=${PROJECT_ID}.svc.id.goog \
        --addons=HttpLoadBalancing,GcsFuseCsiDriver,RayOperator,HorizontalPodAutoscaling,NodeLocalDNS \
        --enable-ip-alias
    
  2. Retrieve Cluster Credentials:
    gcloud container clusters get-credentials ${CLUSTER_NAME} --region=${REGION}
    
  3. Create Hugging Face Secret:
    kubectl create secret generic llm-d-hf-token \
        --from-literal=hf_api_token=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
    

5. Create Reserved TPU Node Pools

Provision the 8 dedicated node pools for the TPU v6e slices using your reservation.

Run the following loop to create the 8 node pools:

for i in {1..8}
do
  gcloud beta container node-pools create "tpu-v6e-single-$i" \
    --project=${PROJECT_ID} \
    --cluster=${CLUSTER_NAME} \
    --region=${REGION} \
    --node-locations=${ZONE} \
    --machine-type=ct6e-standard-4t \
    --tpu-topology=2x2 \
    --num-nodes=1 \
    --reservation-affinity=specific \
    --reservation=${RESERVATION_NAME} \
    --workload-metadata=GKE_METADATA &
done

Wait until all nodes are created and join the cluster. You can check status with kubectl get nodes.

6. Deploy llm-d Service

Now you will deploy the llm-d framework to manage the disaggregated serving.

  1. Install Helm to deploy llm-d charts:
    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-4
    chmod 700 get_helm.sh
    ./get_helm.sh
    
  2. Clone llm-d and install required dependencies:
    git clone https://github.com/llm-d/llm-d.git
    # When using yq alongside Helm, you almost always want the version by Mike Farah (mikefarah/yq).  We remove the most common yq installation before reinstalling
    sudo rm -rf /usr/local/bin/yq
    cd llm-d
    ./helpers/client-setup/install-deps.sh
    
  3. Prepare custom values_tpu.yaml to configure the disaggregated serving for your cluster:
    cat <<EOF > llm-d/guides/pd-disaggregation/ms-pd/values_tpu.yaml
    multinode: false
    
    # Configure accelerator type for Google TPU
    accelerator:
    type: google
    
    modelArtifacts:
    uri: "hf://Qwen/Qwen3-32B"
    size: 200Gi
    authSecretName: "llm-d-hf-token"
    name: "Qwen/Qwen3-32B"
    labels:
        llm-d.ai/inference-serving: "true"
        llm-d.ai/guide: "pd-disaggregation"
        llm-d.ai/hardware-variant: "tpu"
        llm-d.ai/hardware-vendor: "google"
        llm-d.ai/model: "Qwen3-32B"
    
    tracing:
    enabled: true
    otlpEndpoint: "localhost:4317"
    serviceNames:
        routingProxy: "routing-proxy"
    sampling:
        sampler: "always_off"
        samplerArg: "0"
    
    routing:
    servicePort: 8000
    proxy:
        image: ghcr.io/llm-d/llm-d-routing-sidecar:v0.5.0
        connector: nixlv2
        secure: false
    
    decode:
    parallelism:
        tensor: 4
    create: true
    replicas: 4
    modelCommand: custom
    extraConfig:
        nodeSelector:
        cloud.google.com/gke-tpu-accelerator: "tpu-v6e-slice"
        cloud.google.com/gke-tpu-topology: "2x2"
    monitoring:
        podmonitor:
        enabled: true
        portName: "vllm"
        path: "/metrics"
        interval: "30s"
    containers:
        - name: "vllm"
        image: "vllm/vllm-tpu:nightly"
        command:
            - "/bin/bash"
            - "-c"
            - |
                # ROLE: kv_consumer (Receives KV cache from prefill)
                KV_CONFIG="{\"kv_connector\":\"TPUConnector\", \"kv_connector_module_path\" : \"tpu_inference.distributed.tpu_connector\", \"kv_role\":\"kv_consumer\", \"kv_ip\" : \"$POD_IP\"}"
                echo "KV_CONFIG=$KV_CONFIG"
                python3 -m vllm.entrypoints.openai.api_server \
                --model "Qwen/Qwen3-32B" \
                --port 8200 \
                --tensor-parallel-size 4 \
                --kv-transfer-config "${KV_CONFIG}" \
                --disable-uvicorn-access-log \
                --max-num-seqs 256 \
                --block-size 128 \
                --gpu-memory-utilization 0.90 \
                --max-model-len 8192
        env:
            - name: POD_IP
            valueFrom:
                fieldRef:
                fieldPath: status.podIP
            - name: TPU_SIDE_CHANNEL_PORT
            value: "9600"
            - name: TPU_KV_TRANSFER_PORT
            value: "9100"
        ports:
            - containerPort: 8200
            name: vllm
            protocol: TCP
            - containerPort: 9100
            name: tpu-kv-transfer
            protocol: TCP
            - containerPort: 9600
            name: tpu-coord
            protocol: TCP
        resources:
            limits:
            memory: 64Gi
            cpu: "16"
            google.com/tpu: 4
            requests:
            memory: 64Gi
            cpu: "16"
            google.com/tpu: 4
        mountModelVolume: true
        volumeMounts:
            - name: metrics-volume
            mountPath: /.config
            - name: shm
            mountPath: /dev/shm
            - name: torch-compile-cache
            mountPath: /.cache
        startupProbe:
            httpGet:
            path: /health
            port: vllm
            initialDelaySeconds: 15
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 120
        livenessProbe:
            httpGet:
            path: /health
            port: vllm
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
        readinessProbe:
            httpGet:
            path: /v1/models
            port: vllm
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
    volumes:
        - name: metrics-volume
        emptyDir: {}
        - name: shm
        emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
        - name: torch-compile-cache
        emptyDir: {}
    
    prefill:
    parallelism:
        tensor: 4
    create: true
    replicas: 4
    modelCommand: custom
    extraConfig:
        nodeSelector:
        cloud.google.com/gke-tpu-accelerator: "tpu-v6e-slice"
        cloud.google.com/gke-tpu-topology: "2x2"
    monitoring:
        podmonitor:
        enabled: true
        portName: "vllm"
        path: "/metrics"
        interval: "30s"
    containers:
        - name: "vllm"
        image: "vllm/vllm-tpu:nightly"
        command:
            - "/bin/bash"
            - "-c"
            - |
                # ROLE: kv_producer (Sends KV cache to decode)
                KV_CONFIG="{\"kv_connector\":\"TPUConnector\", \"kv_connector_module_path\" : \"tpu_inference.distributed.tpu_connector\", \"kv_role\":\"kv_producer\", \"kv_ip\" : \"$POD_IP\"}"
                echo "KV_CONFIG=$KV_CONFIG"
                python3 -m vllm.entrypoints.openai.api_server \
                --model "Qwen/Qwen3-32B" \
                --port 8200 \
                --tensor-parallel-size 4 \
                --kv-transfer-config "${KV_CONFIG}" \
                --disable-uvicorn-access-log \
                --enable-chunked-prefill \
                --block-size 128 \
                --gpu-memory-utilization 0.90 \
                --max-model-len 8192
        env:
            - name: POD_IP
            valueFrom:
                fieldRef:
                fieldPath: status.podIP
            - name: TPU_SIDE_CHANNEL_PORT
            value: "9600"
            - name: TPU_KV_TRANSFER_PORT
            value: "9100"
        ports:
            - containerPort: 8200
            name: vllm
            protocol: TCP
            - containerPort: 9100
            name: tpu-kv-transfer
            protocol: TCP
            - containerPort: 9600
            name: tpu-coord
            protocol: TCP
        resources:
            limits:
            memory: 64Gi
            cpu: "16"
            google.com/tpu: 4
            requests:
            memory: 64Gi
            cpu: "16"
            google.com/tpu: 4
        mountModelVolume: true
        volumeMounts:
            - name: metrics-volume
            mountPath: /.config
            - name: shm
            mountPath: /dev/shm
            - name: torch-compile-cache
            mountPath: /.cache
        startupProbe:
            httpGet:
            path: /health
            port: vllm
            initialDelaySeconds: 15
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 120
        livenessProbe:
            httpGet:
            path: /health
            port: vllm
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
        readinessProbe:
            httpGet:
            path: /v1/models
            port: vllm
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
    volumes:
        - name: metrics-volume
        emptyDir: {}
        - name: shm
        emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
        - name: torch-compile-cache
        emptyDir: {}
    EOF
    
  4. Deploy the Service and Gateway using llm-d's Helm chart:
    cd llm-d/guides/pd-disaggregation/
    helmfile apply -e gke_tpu -n $NAMESPACE
    kubectl apply -f ./httproute.gke.yaml
    
  5. Wait for vLLM services to startWatch the decode and prefill POD logs until you see "INFO: Application startup complete."
    DECODE_POD=$(kubectl get pods -l llm-d.ai/modelservice-role=decode -o jsonpath='{.items[0].metadata.name}')
    
    # Get the first Prefill pod name
    PREFILL_POD=$(kubectl get pods -l llm-d.ai/modelservice-role=prefill -o jsonpath='{.items[0].metadata.name}')
    
    echo "Run each of these until vLLM starts successfully and then ctrl-C out"
    echo "kubectl logs -f $DECODE_POD -c vllm"
    echo "kubectl logs -f $PREFILL_POD -c vllm"
    

7. Test Deployment Response

The script below will test connectivity to the serving cluster via the GKE Inference Gateway and then run a benchmarking test.

  1. Test Connectivity and Run Benchmark:
    cat <<EOBF > ./run_benchmark.sh
    #!/bin/bash
    
    # Configuration
    NAMESPACE="default"
    JOB_NAME="qwen3-pd-benchmark"
    MODEL_NAME="Qwen/Qwen3-32B"
    
    echo "🔍 Discovering Gateway IP..."
    GATEWAY_IP=$(kubectl get gateway -n ${NAMESPACE} -o jsonpath='{.items[0].status.addresses[0].value}')
    
    if [ -z "$GATEWAY_IP" ]; then
        echo "❌ Error: Could not find Gateway IP. Check 'kubectl get gateway'."
        exit 1
    fi
    
    TARGET_URL="http://${GATEWAY_IP}"
    echo "✅ Found Gateway at: $TARGET_URL"
    
    echo "🗑️  Cleaning up old benchmark jobs..."
    kubectl delete job $JOB_NAME --ignore-not-found=true
    
    echo "🚀 Generating and applying Benchmark Job..."
    cat <<EOF | kubectl apply -f -
    apiVersion: batch/v1
    kind: Job
    metadata:
    name: $JOB_NAME
    namespace: $NAMESPACE
    spec:
    template:
        spec:
        containers:
        - name: llm-benchmark
            image: vllm/vllm-openai:latest
            command: ["/bin/bash", "-c"]
            args:
            - |
                # 1. Download dataset
                if [ ! -f /data/sharegpt.json ]; then
                echo "Downloading ShareGPT dataset..."
                curl -L "https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json" -o /data/sharegpt.json
                fi
    
                # 2. Wait for Gateway readiness
                echo "Checking connectivity to $MODEL_NAME..."
                until curl -s "$TARGET_URL/v1/models" | grep -q "$MODEL_NAME"; do
                echo "Waiting for Gateway backends to sync..."
                sleep 10
                done
    
                # 3. Run Benchmark
                vllm bench serve \\
                --base-url "$TARGET_URL" \\
                --model "$MODEL_NAME" \\
                --dataset-name "sharegpt" \\
                --dataset-path "/data/sharegpt.json" \\
                --request-rate 80.0 \\
                --num-prompts 2000 \\
                --tokenizer "$MODEL_NAME"
            volumeMounts:
            - name: dataset-volume
            mountPath: /data
        restartPolicy: Never
        volumes:
        - name: dataset-volume
            emptyDir: {}
    EOF
    
    echo "⏳ Job submitted. Follow logs with:"
    echo "kubectl logs -f job/$JOB_NAME"
    EOBF
    
    chmod a+x ./run_benchmark.sh
    
    ./run_benchmark.sh
    
    You should see output showing requests being processed and latency metrics.

8. Clean up

To avoid ongoing charges to your Google Cloud account, delete the resources created during this codelab.

Run the following steps to clean up your assets:

# 1. Delete LeaderWorkerSet and Helm release
kubectl delete leaderworkerset qwen-simple-anywhere-cache --ignore-not-found
helm uninstall lws --namespace lws-system 2>/dev/null
kubectl delete namespace lws-system --ignore-not-found

# 2. Delete GKE Node Pools
# Note: Usually deleting the cluster deletes the node pools, 
# but explicit deletion ensures it's gone before the cluster teardown begins.
for i in {1..8}
do
	gcloud container node-pools delete "tpu-v6e-single-$i" \
	    --cluster="${CLUSTER_NAME}" \
	    --region="${REGION}" \
	    --project="${PROJECT_ID}" --quiet

done

# 3. Delete GKE Cluster
gcloud container clusters delete "${CLUSTER_NAME}" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" --quiet

echo "--- Starting IAM and Service Account Cleanup ---"

# 1. Define the full Service Account email for clarity
SA_EMAIL="tpu-reader-sa@${PROJECT_ID}.iam.gserviceaccount.com"

# 2. Remove Storage Bucket IAM Binding
# This removes the 'objectViewer' role from the specific bucket
gcloud storage buckets remove-iam-policy-binding gs://inf-demo-model-storage \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/storage.objectViewer" --quiet

# 3. Remove Workload Identity Binding
# This severs the link between the GKE KSA and the GCP SA
gcloud iam service-accounts remove-iam-policy-binding "${SA_EMAIL}" \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:${PROJECT_ID}.svc.id.goog[default/default]" --quiet

# 4. Delete the Service Account
gcloud iam service-accounts delete "${SA_EMAIL}" --project="${PROJECT_ID}" --quiet

echo "IAM cleanup complete!"

echo "--- Starting Network and Firewall Cleanup ---"

# 4. Delete Firewall Rules (Must go before the Network)
gcloud compute firewall-rules delete \
    "${GVNIC_NETWORK_PREFIX}-allow-ssh" \
    "${GVNIC_NETWORK_PREFIX}-allow-icmp" \
    "${GVNIC_NETWORK_PREFIX}-allow-internal" \
    "ray-allow-internal" \
    --project="${PROJECT_ID}" --quiet

# 5. Delete Subnets (Must go before the Network)
gcloud compute networks subnets delete "${GVNIC_NETWORK_PREFIX}-tpu" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" --quiet

gcloud compute networks subnets delete "${GVNIC_NETWORK_PREFIX}-proxy-sub" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" --quiet

gcloud compute networks subnets delete "proxy-only-subnet" \
    --region="${REGION}" \
    --project="${PROJECT_ID}" --quiet

# 6. Finally, delete the VPC Network
gcloud compute networks delete "${GVNIC_NETWORK_PREFIX}-main" \
    --project="${PROJECT_ID}" --quiet

echo "Cleanup complete!"

9. Congratulations

Congratulations! You've successfully deployed Qwen3-32B on disaggregated v6e TPUs using llm-d and GKE.

What you've learned

  • How to configure custom networking for high-speed TPU traffic.
  • How to provision reserved TPU node pools on GKE.
  • How to deploy llm-d for separating prefill and decode workloads.

Next Steps