TPU Dynamic Slicing on GKE with Kueue and LeaderWorkerSet

1. Introduction

In this codelab, you will learn how to use GKE Dynamic Slicing to improve the utilization of Cloud TPU resources. Dynamic slicing decouples raw TPU provisioning from workload scheduling: you provision TPU capacity once, and slices are shaped to fit each workload when it is admitted.

Specifically, you will explore two key patterns:

Sub-slicing: Splitting a large provisioned TPU block into smaller, isolated slices for smaller workloads.
Super-slicing: Stitching multiple provisioned TPU blocks together to form a larger virtual slice for large-scale workloads.

You will apply these patterns to deploy a high-performance Disaggregated Serving (Prefill/Decode disaggregation) architecture for a large language model (Qwen 397B) using Kueue, LeaderWorkerSet (LWS), and the Gateway API.

Architecture

Here is the high-level architecture of the TPU Dynamic Slicing and Disaggregated Serving setup:

[Figure: TPU Dynamic Slicing Architecture]

What you'll do

Provision a GKE cluster with the GKE Slice Controller enabled.
Create GKE TPU node pools configured for incremental provisioning.
Deploy Kueue and LeaderWorkerSet to manage TPU workloads.
Run a subslicing workload to verify JAX TPU access on smaller slices.
Run a superslicing workload to verify JAX TPU access across multiple combined node pools.
Deploy a Disaggregated Serving setup where Prefill and Decode stages run on separate, dynamically allocated TPU slices, coordinated by an LLM router.

What you'll need

A web browser such as Chrome.
A Google Cloud project with billing enabled.
IMPORTANT: Access to a Cloud TPU7x (Ironwood) All Capacity mode reservation.

2. Before you begin

Create or select a Google Cloud Project

Create a Google Cloud Project

In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

Start Cloud Shell

Cloud Shell is a command-line environment running in Google Cloud that comes preloaded with necessary tools.

Click Activate Cloud Shell at the top of the Google Cloud console.
Once connected to Cloud Shell, verify your authentication:
```
gcloud auth list
```
Confirm your project is configured:
```
gcloud config get project
```

If your project is not set as expected, set it:

```
export PROJECT_ID=<YOUR_PROJECT_ID>
gcloud config set project "${PROJECT_ID}"
```

Clone the Demo Repository

Clone the repository containing the manifests and helper scripts for this codelab:

```
git clone --depth 1 --sparse https://github.com/GoogleCloudPlatform/devrel-demos.git
cd devrel-demos
git sparse-checkout set ai-ml/dynamic-slicing
cd ai-ml/dynamic-slicing
```

3. Configure Environment

Before provisioning resources, you need to configure your environment variables. A helper script 01_setup_env.sh is provided to generate an env.sh file.

Run the setup script:

```
./01_setup_env.sh
```

You will be prompted for several values. Press [ENTER] to accept a default, but make sure to enter the Reservation Name and Reservation Block supplied by your event instructor:

GCP Project ID: Your current project ID.
GCP Project Number: Your project number.
GKE Cluster Name: tpu-serving-cluster (default).
TPU Node Pool Zone: us-central1-ai1a (default).
Kubernetes Namespace: llm-d-pd-disaggregation (default).
Cloud TPU Reservation Name: [Enter the provided reservation name]
Cloud TPU Reservation Block Name: block-0 (default).
GCS Bucket Name for Weights: model-weights (default).
TPU Machine Type: tpu7x-standard-4t (default).
Hugging Face Token: [Enter your HF token if required, or press ENTER if using pre-loaded weights]

After running the script, apply the variables to your current session:

```
source env.sh
```
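
Later steps reference variables such as ${NAMESPACE} and ${PROJECT_ID}. As an optional sanity check, the following minimal Python sketch reports which expected variables are missing from the environment. The helper name `missing_vars` and the choice of variables to check are illustrative assumptions; the full set of names written to env.sh is defined by the setup script, not by this sketch.

```python
import os
from typing import Iterable, Mapping

# Variables that later commands in this codelab reference.
# Assumption: env.sh exports at least these two names.
REQUIRED_VARS = ("PROJECT_ID", "NAMESPACE")


def missing_vars(env: Mapping[str, str],
                 required: Iterable[str] = REQUIRED_VARS) -> list[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Check the environment of the current process.
missing = missing_vars(os.environ)
if missing:
    print("Missing variables: " + ", ".join(missing))
else:
    print("All required variables are set.")
```

Run it from the same shell in which you sourced env.sh. Only exported variables are visible to a child Python process.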

4. Enable APIs and AI Zone Features

Now that your environment is configured, you need to enable the required Google Cloud APIs and the AI Zone visibility feature. A helper script 02_enable_apis_and_features.sh is provided.

Run the script:

```
./02_enable_apis_and_features.sh
```

This script:

Enables the GKE, Compute, IAM, Resource Manager, Filestore, and Network Services APIs.
Enables the ai-zones-visibility preview feature for GKE Dynamic Slicing.

5. Provision GKE Cluster and TPU Node Pools

In this step, you will provision the underlying network infrastructure, GKE cluster, and TPU node pools.

The TPU node pools will be configured with Incremental Provisioning (using --placement-policy=superslice-policy and --reservation-affinity=specific), which maps each node pool to a 16-node "cube" (sub-block) of raw TPU capacity.

Run the provisioning script:

```
./03_create_cluster_and_nodes.sh
```

What this script does:

Creates VPC Network & Subnets: Sets up a main VPC network with a large MTU (8896) optimized for TPU traffic, a TPU subnet, and a proxy-only subnet required by GKE Gateway.
Creates GKE Cluster: Provisions a Standard GKE cluster with the Slice Controller enabled (--enable-slice-controller).
Creates Workload Policy: Defines a resource policy named superslice-policy of type HIGH_THROUGHPUT with a topology of 4x4x4.
Creates GKE TPU Node Pools: Provisions two node pools (tpu7-pool-1 and tpu7-pool-2), each containing 16 nodes of tpu7x-standard-4t. These represent two separate 16-node cubes.

Verify the Nodes

Once the script completes, verify that all 32 TPU nodes are provisioned and registered:

```
kubectl get nodes -l google.com/tpu=present
```

You should see 32 nodes in the list.
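
If you prefer to check the count programmatically, the sketch below tallies Ready nodes per node pool from a node list in the JSON shape that `kubectl get nodes -o json` returns. The function names are hypothetical, and the inline sample is a hand-written stand-in for real output rather than data from an actual cluster.

```python
from collections import Counter

POOL_LABEL = "cloud.google.com/gke-nodepool"  # GKE node pool label on each node
EXPECTED_NODES = 32  # two node pools of 16 nodes each, per this codelab


def is_ready(node: dict) -> bool:
    """True if the node reports a Ready condition with status "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions)


def ready_nodes_per_pool(node_list: dict) -> Counter:
    """Count Ready nodes per node pool in a Kubernetes NodeList document."""
    counts = Counter()
    for node in node_list.get("items", []):
        if is_ready(node):
            labels = node.get("metadata", {}).get("labels", {})
            counts[labels.get(POOL_LABEL, "unknown")] += 1
    return counts


# Hand-written sample: one Ready node and one NotReady node.
sample = {
    "items": [
        {"metadata": {"labels": {POOL_LABEL: "tpu7-pool-1"}},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"labels": {POOL_LABEL: "tpu7-pool-2"}},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
}
counts = ready_nodes_per_pool(sample)
print(dict(counts), "total Ready:", sum(counts.values()), "expected:", EXPECTED_NODES)
```

To use it against the cluster, parse the output of `kubectl get nodes -l google.com/tpu=present -o json` with `json.loads` and pass the result to `ready_nodes_per_pool`.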

6. Install Orchestration Tools

Dynamic slicing relies on several Kubernetes controllers to coordinate jobs and slice allocation. You will install:

JobSet: For managing a group of jobs as a single unit (needed for superslicing).
Kueue: For queueing, resource management, and Topology Aware Scheduling (TAS).
LeaderWorkerSet (LWS): For managing replicated multi-node TPU deployments (needed for LLM serving).
GKE Slice Controller (User Space): Connects Kueue with the TPU Cluster Director to dynamically manage the physical slices.

Run the installation script:

```
./04_install_kueue_lws_slice_controller.sh
```

Verify that the Slice Controller is running successfully:

```
kubectl rollout status deployment/slice-controller-controller-manager -n slice-controller-system
```

7. Configure Kueue Resources

Now you need to define the Kueue resources that represent your TPU hardware topology and configure the admission checks.

Run the deployment script:

```
./05_deploy_kueue_resources.sh
```

Key Resources Deployed:

Topology (slice-topology): Defines the hierarchical levels of TPU partitions (from block down to hostname) that Kueue should consider when scheduling.
ResourceFlavor (slice-rf): Associates the slice-topology with the tpu7x accelerator.
AdmissionCheck (ac): Configures Kueue to use the GKE Slice Controller (accelerator.gke.io/slice) to dynamically provision slices when a job is admitted.
ClusterQueue (cq) & LocalQueue (lq): Sets up the queues that workloads will be submitted to.
WorkloadPriorityClass (low-priority-1000, medium-priority-2000, high-priority-3000): Defines priority levels to enable preemption and priority-based scheduling.
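
To make the last item concrete, here is a rough sketch of what the three WorkloadPriorityClass objects might look like, generated as JSON that `kubectl apply -f -` can read. Two things are assumptions rather than facts from the deployment script: that each class's numeric value matches the suffix in its name, and that the cluster serves the `kueue.x-k8s.io/v1beta1` API version. The description text is invented for illustration.

```python
import json

# Assumption: the numeric value matches the suffix of each class name.
PRIORITIES = {
    "low-priority-1000": 1000,
    "medium-priority-2000": 2000,
    "high-priority-3000": 3000,
}


def workload_priority_class(name: str, value: int) -> dict:
    """Build a Kueue WorkloadPriorityClass manifest as a plain dict."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",  # assumption: v1beta1 is served
        "kind": "WorkloadPriorityClass",
        "metadata": {"name": name},
        "value": value,  # higher values are admitted first and can preempt lower ones
        "description": f"Priority {value} for TPU slice workloads",  # illustrative text
    }


manifests = [workload_priority_class(n, v) for n, v in PRIORITIES.items()]

# kubectl accepts JSON as well as YAML, so a List wrapper can be applied directly.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

A workload opts into one of these classes through the `kueue.x-k8s.io/priority-class` label on the Job, JobSet, or LeaderWorkerSet it submits.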

Verify the resources:

```
kubectl get topology slice-topology
kubectl get resourceflavor slice-rf
kubectl get admissioncheck ac
kubectl get clusterqueue cq
kubectl get localqueue lq -n ${NAMESPACE}
```

8. Deploy and verify TPU Access with Subslicing

Sub-slicing allows you to run multiple smaller workloads within a single provisioned TPU block. In this step, you will submit a workload that requests a 2x2x2 topology (8 chips / 2 VMs) to a cluster made of 4x4x4 (64 chips / 16 VMs) blocks.

Deploy the subslicing workload:

```
./06_deploy_simple_subslicing.sh
```

This script applies kueue-jobset-simple-subslicing.yaml.

How it works:

The JobSet spec includes the annotation cloud.google.com/gke-tpu-slice-topology: 2x2x2.
It sets replicas: 6 with parallelism: 2 and completions: 2, so the JobSet creates 6 independent Jobs of 2 pods each, and each Job occupies one 2x2x2 slice.
Each pod requests google.com/tpu: "4" (1 TPU VM).
Kueue and the GKE Slice Controller dynamically carve six 2x2x2 slices (12 nodes in total) out of the 32-node cluster.
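
The arithmetic behind these numbers can be sketched in a few lines of Python. The helper names are hypothetical, and the constant of 4 chips per VM comes from the `google.com/tpu: "4"` request described above.

```python
from math import prod

CHIPS_PER_VM = 4  # each pod requests google.com/tpu: "4", i.e. one TPU VM


def dims(topology: str) -> tuple[int, ...]:
    """Parse a topology string such as "2x2x2" into its dimensions."""
    return tuple(int(d) for d in topology.split("x"))


def chips(topology: str) -> int:
    """Number of chips in a topology."""
    return prod(dims(topology))


def vms(topology: str) -> int:
    """Number of TPU VMs (nodes) a topology occupies."""
    return chips(topology) // CHIPS_PER_VM


def slices_per_block(block: str, sub_slice: str) -> int:
    """How many sub-slices tile one block, dimension by dimension."""
    return prod(b // s for b, s in zip(dims(block), dims(sub_slice)))


print(chips("2x2x2"), vms("2x2x2"))          # 8 2
print(chips("4x4x4"), vms("4x4x4"))          # 64 16
print(slices_per_block("4x4x4", "2x2x2"))    # 8
print(6 * vms("2x2x2"))                      # 12 nodes used by 6 replicas
```

By this count a single 4x4x4 block can geometrically hold up to eight 2x2x2 sub-slices, so the six replicas in this step leave capacity unused. How the Slice Controller actually places the slices across the two blocks is up to the scheduler.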

Verify JAX execution

Monitor the pods until they are running:

```
kubectl get pods -n ${NAMESPACE} -l jobset.sigs.k8s.io/jobset-name=kueue-jobset-simple-subslicing
```

Check the logs of one of the pods to verify JAX successfully detected the 8 TPU devices (cores) on its sub-slice:

```
kubectl logs $(kubectl get pods -n ${NAMESPACE} -l jobset.sigs.k8s.io/jobset-name=kueue-jobset-simple-subslicing -o name | head -n 1) -n ${NAMESPACE}
```

You should see output indicating: Total TPU devices (cores): 8

9. Deploy and verify TPU Access with Superslicing

Super-slicing allows a single workload to span multiple physical TPU blocks (the 4x4x4 cubes provisioned earlier). By stitching these blocks together, GKE forms a larger virtual slice for large-scale training or serving workloads.

In this step, you will deploy a JobSet that requests a 4x4x8 topology (128 chips / 32 VMs). A single 4x4x4 block contains only 64 chips, so GKE must dynamically stitch the tpu7-pool-1 and tpu7-pool-2 node pools together to satisfy the request.

Deploy the superslicing workload:

```
./07_deploy_simple_superslicing.sh
```

This script applies kueue-jobset-simple-superslicing.yaml.

How it works:

The JobSet template includes the annotation cloud.google.com/gke-tpu-slice-topology: 4x4x8.
It configures parallelism: 32 and completions: 32.
Each pod requests google.com/tpu: "4".
Since a 4x4x8 topology requires all 32 nodes, the Slice Controller dynamically configures the OCS (Optical Circuit Switching) network to interconnect the two 16-node pools into a single 32-node ICI mesh.
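
As a quick check on why this request needs both node pools, the following sketch counts how many 4x4x4 cubes a requested topology spans and how many pods it implies. The function names are illustrative, and the sketch only does the counting; it says nothing about how the Slice Controller configures the interconnect.

```python
from math import ceil, prod

CUBE = "4x4x4"        # one provisioned block (one node pool in this codelab)
CHIPS_PER_VM = 4      # each pod requests google.com/tpu: "4"
AVAILABLE_CUBES = 2   # tpu7-pool-1 and tpu7-pool-2


def dims(topology: str) -> tuple[int, ...]:
    """Parse a topology string such as "4x4x8" into its dimensions."""
    return tuple(int(d) for d in topology.split("x"))


def cubes_needed(topology: str, cube: str = CUBE) -> int:
    """Number of cubes a topology spans, rounding up along each dimension."""
    return prod(ceil(t / c) for t, c in zip(dims(topology), dims(cube)))


def pods_needed(topology: str) -> int:
    """Number of single-VM pods required to cover every chip."""
    return prod(dims(topology)) // CHIPS_PER_VM


request = "4x4x8"
print(cubes_needed(request))                      # 2
print(pods_needed(request))                       # 32
print(cubes_needed(request) <= AVAILABLE_CUBES)   # True
```

The pod count of 32 matches the parallelism and completions values in the JobSet, and the cube count of 2 matches the two node pools created earlier.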

Verify that the JobSet pods successfully run and that JAX detects all 128 devices:

```
kubectl get pods -n ${NAMESPACE} -l jobset.sigs.k8s.io/jobset-name=kueue-jobset-simple-superslicing
```

Check the logs of one of the pods:

```
kubectl logs $(kubectl get pods -n ${NAMESPACE} -l jobset.sigs.k8s.io/jobset-name=kueue-jobset-simple-superslicing -o name | head -n 1) -n ${NAMESPACE}
```

You should see JAX output showing the global device count: Global device count: 128
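
If you want to automate this check, a small log parser is enough. The sketch below looks for either of the two log lines quoted in this codelab and returns the reported device count. The function name is hypothetical, and the patterns assume the workloads print exactly those phrases.

```python
import re
from typing import Optional

# Log phrases quoted in this codelab for the subslicing and superslicing runs.
LOG_PATTERNS = (
    re.compile(r"Total TPU devices \(cores\):\s*(\d+)"),
    re.compile(r"Global device count:\s*(\d+)"),
)


def reported_devices(log_text: str) -> Optional[int]:
    """Return the first device count found in the log, or None if absent."""
    for line in log_text.splitlines():
        for pattern in LOG_PATTERNS:
            match = pattern.search(line)
            if match:
                return int(match.group(1))
    return None


sample_log = "starting workload\nGlobal device count: 128\n"
print(reported_devices(sample_log) == 128)  # True
```

Feed it the text captured from the `kubectl logs` command and compare the result with 8 for the subslicing run or 128 for the superslicing run.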

10. Deploy Disaggregated Serving (Prefill/Decode)

Now you will deploy the end-to-end LLM serving stack using Prefill/Decode Disaggregation.

In standard serving, prefill (processing the prompt) and decode (generating tokens) run on the same TPUs. Because prefill is compute-bound and decode is memory-bandwidth-bound, the two phases compete for the same hardware when co-located. Disaggregated serving runs them on separate TPU slices and transfers the KV cache between them over the network.
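
The division of labor can be illustrated with a toy model in plain Python. Nothing here resembles a real inference engine: the "KV cache" is just a list of token ids, the "model" is a made-up arithmetic rule, and all names are invented for illustration. The point is only the shape of the data flow: one bulk pass over the prompt, a hand-off, then a token-by-token loop that reads the whole cache each step.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for per-token key/value tensors; here just the token ids."""
    entries: list[int] = field(default_factory=list)


def prefill(prompt_tokens: list[int]) -> KVCache:
    """Process the whole prompt in one pass and emit its KV cache."""
    return KVCache(entries=list(prompt_tokens))


def transfer(cache: KVCache) -> KVCache:
    """Placeholder for the network hop between slices; copies the cache."""
    return KVCache(entries=list(cache.entries))


def decode(cache: KVCache, max_new_tokens: int) -> list[int]:
    """Generate tokens one at a time, reading the full cache at every step."""
    generated = []
    for _ in range(max_new_tokens):
        # Toy rule standing in for a forward pass over the cached context.
        next_token = (sum(cache.entries) + len(cache.entries)) % 100
        cache.entries.append(next_token)
        generated.append(next_token)
    return generated


prefill_cache = prefill([1, 2, 3])            # runs on the prefill slice
decode_cache = transfer(prefill_cache)        # KV cache moves across the network
print(decode(decode_cache, max_new_tokens=2))  # [9, 19]
```

Because the decode side works on its own copy of the cache, the prefill slice is free to start on the next prompt as soon as the transfer completes.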

Set up LLM-D and Gateway

Set up the namespaces, Hugging Face secrets, and GKE Gateway:

```
./08_setup_llm_d.sh
```

Deploy LLM-D Router

Deploy the router that will receive client requests and coordinate the routing between Prefill and Decode slices:

```
./09_deploy_llm_d_router.sh
```

Deploy Prefill and Decode Workloads

Deploy the vLLM model servers on dynamically allocated TPU slices:

```
./10_deploy_subslicing_pd_workload.sh
```

What this does:

Deploys kueue-vllm-prefill-model-streamer (LWS requesting a 2x2x2 TPU slice).
Deploys kueue-vllm-decode-model-streamer (LWS requesting a 2x2x2 TPU slice).
The prefill slice loads the Qwen 397B model weights and acts as the kv_producer.
The decode slice acts as the kv_consumer.
They communicate using TPUConnectorHMA to transfer KV caches.
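
For orientation, vLLM takes its KV transfer settings as a JSON string through the `--kv-transfer-config` flag. The sketch below builds one such string per role using the connector and role names listed above. Whether the manifests in this codelab pass exactly these fields is an assumption, and real deployments may need additional keys that are not shown here.

```python
import json

KV_CONNECTOR = "TPUConnectorHMA"  # connector name used in this codelab
VALID_ROLES = ("kv_producer", "kv_consumer")


def kv_transfer_config(role: str) -> str:
    """Return a JSON string suitable for vLLM's --kv-transfer-config flag.

    Assumption: only the connector and role are set; the actual manifests
    may carry more fields.
    """
    if role not in VALID_ROLES:
        raise ValueError(f"role must be one of {VALID_ROLES}, got {role!r}")
    return json.dumps({"kv_connector": KV_CONNECTOR, "kv_role": role})


for stage, role in (("prefill", "kv_producer"), ("decode", "kv_consumer")):
    print(f"{stage}: --kv-transfer-config '{kv_transfer_config(role)}'")
```

To see what the deployed servers actually use, inspect the container arguments in the two LeaderWorkerSet manifests.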

Wait until both prefill and decode pods are running:

```
kubectl get pods -n ${NAMESPACE} -l llm-d.ai/role=prefill
kubectl get pods -n ${NAMESPACE} -l llm-d.ai/role=decode
```

11. Verify Serving

With the router, prefill, and decode workloads running, you can now verify the serving API.

Run the verification script:

```
./11_verify_serving.sh
```

How it works:

The script retrieves the internal IP of the GKE Gateway.
It spins up a temporary pod (curl-debug-comp) to send a completion request to http://${GATEWAY_IP}/v1/completions.
It spins up another pod (curl-debug-chat) to send a chat request to http://${GATEWAY_IP}/v1/chat/completions.

You should see a successful JSON response from the Qwen model:

```
{
  "choices": [
    {
      "text": "... [Model Response] ..."
    }
  ]
}
```
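
The same completion request can be expressed in Python with only the standard library. This is a sketch of the request and response shapes, not a copy of the verification script: the helper names, the `max_tokens` default, and any model name you pass in are assumptions. Because the Gateway IP is internal, the request would have to be sent from inside the VPC, for example from a pod in the cluster.

```python
import json
import urllib.request


def build_completion_request(gateway_ip: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    body = json.dumps({
        "model": model,          # assumption: the served model's name
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{gateway_ip}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body: str) -> str:
    """Extract choices[0].text from a completions response body."""
    return json.loads(response_body)["choices"][0]["text"]


# Parsing a response shaped like the example in this section.
sample_response = '{"choices": [{"text": "... [Model Response] ..."}]}'
print(first_completion_text(sample_response))
```

To send the request, pass the object to `urllib.request.urlopen` and feed the decoded body to `first_completion_text`.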

12. Clean up

To avoid ongoing charges to your Google Cloud account, delete the resources created during this codelab.

Run the teardown script:

```
./12_teardown_cleanup.sh
```

What this script does:

Deletes GKE node pools (tpu7-pool-1, tpu7-pool-2).
Deletes GKE Cluster (tpu-serving-cluster).
Deletes resource policies (superslice-policy).
Deletes VPC networks (qwen-serving-main).

Alternatively, if you created a dedicated project for this codelab, you can delete the entire project:

```
gcloud projects delete ${PROJECT_ID}
```

13. Congratulations

Congratulations! You have successfully explored GKE Dynamic Slicing and deployed a Disaggregated LLM Serving architecture.

What you've learned

How to enable the GKE Slice Controller and configure node pools for Incremental Provisioning.
How to use Kueue to request specific TPU topologies.
How Sub-slicing splits a large TPU block for smaller, independent JAX workloads.
How Super-slicing stitches multiple node pools into a single larger virtual TPU slice.
How to deploy Prefill/Decode disaggregated serving using LWS, Gateway API, and vLLM.
