GKE Autopilot clusters with TPUs, GKE managed DRANET and Gemma 4

1. Overview

This lab introduces you to AI Infrastructure that can be used for running AI workloads. You will be working with the following:

Google Kubernetes Engine (GKE) - The foundational container orchestration platform.

GKE managed DRANET - Dynamic Resource Allocation networking that directly assigns high-speed interconnect fabrics to your TPU pods.

Tensor Processing Unit (TPU) - Google's custom-built accelerator chips.

You will deploy a custom VPC and a GKE Autopilot cluster. To enable managed DRANET, you will create a ComputeClass and a ResourceClaimTemplate. You will then deploy a workload that combines vLLM, a model from Hugging Face, the ComputeClass, and the ResourceClaimTemplate. Finally, you will test the networking setup and connectivity to the Gemma 4 model.

The configurations will use a combination of Terraform, gcloud, and kubectl.

In this lab you will learn how to perform the following tasks:

  • Set up a VPC network
  • Set up a GKE Autopilot cluster
  • Create a ComputeClass and a ResourceClaimTemplate
  • Create a deployment that runs vLLM on TPUs with monitoring, serving Gemma 4 from Hugging Face
  • Test connectivity to the LLM

In this lab, you're going to be creating the following pattern.

Figure 1.

2. Google Cloud services setup

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
  • The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project.
  • For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell, a command line environment running in the Cloud.

From the Google Cloud Console, click the Cloud Shell icon on the top right toolbar:

Activate Cloud Shell

It should only take a few moments to provision and connect to the environment. When it is finished, you should see something like this:

Screenshot of Google Cloud Shell terminal showing that the environment has connected

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on Google Cloud, greatly enhancing network performance and authentication. All of your work in this codelab can be done within a browser. You do not need to install anything.

3. Set up the environment with Terraform

To do this lab you need access to TPUs. This lab uses TPU v6e.

  • Follow the TPU plan doc and request TPU quota to get access.
  • The lab uses a small deployment of 4 TPU v6e chips (ct6e-standard-4t), which form a 2x2 slice in a single region.
  • Hugging Face token: an access token is needed to download the Gemma model weights.
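Besides these prerequisites, the lab relies on several command-line tools: Terraform, gcloud, and kubectl for the setup, plus curl and jq for the final test. The sketch below is one way to confirm they are on your PATH before you start. The helper name missing_tools is hypothetical, not part of any of these tools.

```shell
# Print the name of every listed tool that is not found on PATH.
# missing_tools is a hypothetical helper written for this lab.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report the result for the tools this lab uses.
missing="$(missing_tools terraform gcloud kubectl jq curl)"
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```

Cloud Shell normally ships with all of these, so the check matters mostly if you run the lab from another machine.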

We will create a custom VPC with firewall rules, a subnet, and then an Autopilot cluster. Open the Cloud Console and select the project you will be using.

  1. Open Cloud Shell from the top right of the console. Make sure Cloud Shell shows the correct project ID, and confirm any prompts to allow access.
  2. Create a folder called gke-auto-tpu and move into it.
mkdir -p gke-auto-tpu && cd gke-auto-tpu
export PROJECT_ID=$(gcloud config get-value project)
echo $PROJECT_ID
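Every later step depends on PROJECT_ID being set, so it can help to fail early when it is not. The following is a minimal sketch, assuming that an unconfigured gcloud project shows up as an empty string or the literal text "(unset)". The function name valid_project_id is hypothetical.

```shell
# Succeed only when the given value looks like a usable project ID.
# Assumption: an unset gcloud project yields "" or "(unset)".
valid_project_id() {
  case "$1" in
    ""|"(unset)") return 1 ;;
    *) return 0 ;;
  esac
}

if valid_project_id "${PROJECT_ID:-}"; then
  echo "Using project: ${PROJECT_ID}"
else
  echo "PROJECT_ID is empty; run: gcloud config set project <your-project-id>" >&2
fi
```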
  3. Now add some configuration files. The following commands create the terraform.tfvars, variables.tf, and net.tf files.
cat << EOF > terraform.tfvars
project_id = "${PROJECT_ID}"
EOF

cat << 'EOF' > variables.tf
variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-east5"
}

variable "network_name" {
  type    = string
  default = "tpu-gke-vpc"
}

variable "subnet_name" {
  type    = string
  default = "tpu-sub1"
}

variable "cluster_name" {
  type    = string
  default = "tpu-auto-dra-cluster"
}
EOF

cat << 'EOF' > net.tf
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 7.32.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_compute_network" "tpu_vpc" {
  project                 = var.project_id
  name                    = var.network_name
  auto_create_subnetworks = false
  mtu                     = 8896
}

resource "google_compute_subnetwork" "tpu_subnet" {
  project       = var.project_id
  name          = var.subnet_name
  ip_cidr_range = "192.168.100.0/24"
  region        = var.region
  network       = google_compute_network.tpu_vpc.id
}

resource "google_compute_firewall" "allow_ssh" {
  project     = var.project_id
  name        = "${var.network_name}-allow-ssh"
  network     = google_compute_network.tpu_vpc.id
  direction   = "INGRESS"
  priority    = 1000

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  source_ranges = ["0.0.0.0/0"]
}

resource "google_compute_firewall" "allow_internal" {
  project     = var.project_id
  name        = "${var.network_name}-allow-internal"
  network     = google_compute_network.tpu_vpc.id
  direction   = "INGRESS"
  priority    = 1000

  allow {
    protocol = "all"
  }

  source_ranges = ["172.16.0.0/12", "192.168.0.0/16"]
}

resource "google_container_cluster" "tpu_autopilot" {
  project  = var.project_id
  name     = var.cluster_name
  location = var.region

  enable_autopilot = true

  network    = google_compute_network.tpu_vpc.id
  subnetwork = google_compute_subnetwork.tpu_subnet.id

  release_channel {
    channel = "RAPID"
  }

  ip_allocation_policy {}

  deletion_protection = false 
}
EOF
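The subnet in net.tf uses the range 192.168.100.0/24. As a quick worked example of what that prefix length provides, the sketch below computes the block size with shell arithmetic. It assumes Google Cloud reserves four addresses in each primary subnet range. The function names are hypothetical.

```shell
# Number of IPv4 addresses in a CIDR block with the given prefix length.
cidr_size() {
  echo $(( 1 << (32 - $1) ))
}

# Usable addresses, assuming 4 reserved addresses per primary subnet range.
usable_hosts() {
  echo $(( $(cidr_size "$1") - 4 ))
}

echo "A /24 holds $(cidr_size 24) addresses, $(usable_hosts 24) usable."
```

This range only has to cover the cluster nodes. The empty ip_allocation_policy block leaves the Pod and Service ranges for GKE to choose.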
  4. Make sure you are in the gke-auto-tpu directory and run the following commands:
    • terraform init initializes the working directory. This is the first step, and it downloads the providers required for the given configuration.
    • terraform plan generates an execution plan that shows what actions Terraform will take to deploy your infrastructure, without making any changes. The -out flag saves the execution plan to a named binary file.
    • terraform apply applies the changes.
terraform init 
terraform plan -out vpc 
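If you script this step, Terraform's -detailed-exitcode flag on terraform plan makes the result machine-readable: exit status 0 means no changes, 2 means changes are pending, and 1 means an error. The sketch below maps those codes to text. The helper name plan_status is hypothetical.

```shell
# Translate the exit status of `terraform plan -detailed-exitcode` into text.
# plan_status is a hypothetical helper, not a Terraform command.
plan_status() {
  case "$1" in
    0) echo "no changes" ;;
    2) echo "changes pending" ;;
    *) echo "error" ;;
  esac
}

# Usage (commented out so the sketch runs without Terraform installed):
#   terraform plan -detailed-exitcode -out vpc; plan_status "$?"
plan_status 2
```

On a first run you would expect "changes pending", because none of the resources exist yet.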
  5. Now run the deployment. Because terraform apply is given the saved execution plan, it runs immediately without prompting for confirmation. (This may take 6 to 10 minutes.)
terraform apply vpc
  6. Verify the setup.
echo -e "\n=== Verifying GKE Autopilot Cluster ==="
gcloud container clusters list --filter="name:tpu-auto-dra-cluster" --project=$PROJECT_ID

echo -e "\n=== Verifying VPC Network ==="
gcloud compute networks list --filter="name:tpu-gke-vpc" --project=$PROJECT_ID

echo -e "\n=== Verifying Subnetwork ==="
gcloud compute networks subnets list --filter="name:tpu-sub1" --project=$PROJECT_ID

echo -e "\n=== Verifying Firewall Rules ==="
gcloud compute firewall-rules list --filter="name~tpu-gke-vpc-allow" --project=$PROJECT_ID

4. Create Compute Class and Resource Claim Template

We need to create a custom ComputeClass resource to define the configuration for the node pool. In our case we will be using the TPU v6e chips (ct6e-standard-4t) and managed DRANET networks.

  1. Connect to the cluster you created. (Note: if you deployed the cluster to a different region, change the region in the command.)
gcloud container clusters get-credentials tpu-auto-dra-cluster --region us-east5 --project=$PROJECT_ID
  2. Make sure you are in the gke-auto-tpu directory and run the following command, which creates the ComputeClass manifest. If you used a different region, change the zone to one within your cluster's region.
cat << 'EOF' > computeclass.yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: dranet-auto
spec:
  nodePoolAutoCreation:
    enabled: true
  nodePoolConfig:
    dra:
      networking:
        enabled: true
  priorities:
  - tpu:
      type: tpu-v6e-slice
      count: 4
      topology: "2x2" 
    acceleratorNetworkProfile: auto
    location:
      zones: 
      - us-east5-b
EOF
  3. Now create the ComputeClass.
kubectl apply -f computeclass.yaml

kubectl describe computeclass dranet-auto
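The zone in the ComputeClass has to belong to the cluster's region. A small sketch of that check follows, assuming the usual naming where a zone is the region name plus a one-letter suffix (for example us-east5-b in us-east5). The function name zone_in_region is hypothetical.

```shell
# Succeed when the zone name is the region name plus "-<letter>".
# Assumption: zones follow the <region>-<letter> naming pattern.
zone_in_region() {
  case "$1" in
    "$2"-[a-z]) return 0 ;;
    *) return 1 ;;
  esac
}

if zone_in_region "us-east5-b" "us-east5"; then
  echo "Zone matches the cluster region."
else
  echo "Zone is outside the cluster region; edit computeclass.yaml." >&2
fi
```

The check only compares names. It does not confirm that TPU v6e capacity is available in that zone.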
  4. In the gke-auto-tpu directory, run the following command. It creates the ResourceClaimTemplate manifest, which supports non-RDMA network devices.
cat << 'EOF' > resourceclaimtpu.yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: all-netdev
spec:
  spec:
    devices:
      requests:
      - name: req-netdev
        exactly:
          deviceClassName: netdev.google.com
          allocationMode: All
EOF
  5. Now create the ResourceClaimTemplate.
kubectl apply -f resourceclaimtpu.yaml

kubectl describe resourceclaimtemplate all-netdev

Create your secret

  1. This lab uses google/gemma-4-31B-it, so you need a Hugging Face (HF) access token. Replace YOUR_ACTUAL_HUGGING_FACE_TOKEN below with your token.
export HF_TOKEN="YOUR_ACTUAL_HUGGING_FACE_TOKEN"
  2. Make sure you are in the gke-auto-tpu directory and run the following commands.
kubectl create secret generic hf-secret --from-literal=hf_token="${HF_TOKEN}"

kubectl get secrets hf-secret
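A common mistake at this step is storing the placeholder text instead of a real token, which only shows up much later as a failed model download. The sketch below is a rough sanity check, assuming Hugging Face access tokens start with the prefix hf_. The function name looks_like_hf_token is hypothetical, and the check does not contact Hugging Face.

```shell
# Succeed when the value is non-empty, is not the lab's placeholder,
# and starts with "hf_" (an assumption about the token format).
looks_like_hf_token() {
  case "$1" in
    ""|YOUR_ACTUAL_HUGGING_FACE_TOKEN) return 1 ;;
    hf_?*) return 0 ;;
    *) return 1 ;;
  esac
}

if looks_like_hf_token "${HF_TOKEN:-}"; then
  echo "HF_TOKEN looks plausible."
else
  echo "HF_TOKEN is missing or still the placeholder." >&2
fi
```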

5. Deploy Workload vLLM and Gemma

This setup uses the ComputeClass to automatically provision the required hardware and networking (TPU v6e and managed DRANET). The ResourceClaimTemplate defines a blueprint for requesting access to that high-speed network, and the Deployment binds them together by generating an individual network claim for each pod as the workload scales.

  1. Make sure you are in the gke-auto-tpu directory and run the following.
cat << 'EOF' > gem4-auto-dra-tpu.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gem4-dra-auto
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma4-tpu
  template:
    metadata:
      labels:
        app: gemma4-tpu
        ai.gke.io/model: gemma-4-31b-it
        ai.gke.io/inference-server: vllm-tpu
    spec:
      dnsPolicy: Default
      resourceClaims:
      - name: netdev-claim        
        resourceClaimTemplateName: all-netdev
      containers:
      - name: vllm-tpu-inference
        image: vllm/vllm-tpu:latest
        resources:
          requests:
            cpu: "30"
            memory: "240Gi"
            ephemeral-storage: "100Gi"
            google.com/tpu: "4"
          limits:
            cpu: "30"
            memory: "240Gi"
            ephemeral-storage: "100Gi"
            google.com/tpu: "4"
          claims:
          - name: netdev-claim
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=4
        - --host=0.0.0.0
        - --port=8000
        - --max-model-len=32768
        - --max-num-batched-tokens=8192
        env:
        - name: MODEL_ID
          value: google/gemma-4-31B-it
        - name: HUGGING_FACE
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 240
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      nodeSelector:
        cloud.google.com/compute-class: dranet-auto
---
apiVersion: v1
kind: Service
metadata:
  name: gem4-dra-service
spec:
  selector:
    app: gemma4-tpu
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: gem4-monitoring
spec:
  selector:
    matchLabels:
      app: gemma4-tpu
  endpoints:
  - port: 8000
    path: /metrics
    interval: 30s
EOF
  2. Create the deployment.
kubectl apply -f gem4-auto-dra-tpu.yaml
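The manifest requests four TPU chips and sets --tensor-parallel-size=4, so the model weights are sharded across all four. A back-of-the-envelope calculation shows why one chip would not be enough. The figures below are assumptions rather than values from the lab: 2 bytes per parameter (bfloat16 weights) and 32 GB of HBM per TPU v6e chip.

```shell
# Approximate weight footprint in GB: parameters (billions) x bytes per parameter.
weights_gb() {
  echo $(( $1 * $2 ))
}

params_b=31          # Gemma 4 31B, from the lab
bytes_per_param=2    # assumption: bfloat16 weights
chips=4              # google.com/tpu: "4" in the manifest
hbm_per_chip_gb=32   # assumption: HBM per TPU v6e chip

total_hbm=$(( chips * hbm_per_chip_gb ))
need=$(weights_gb "$params_b" "$bytes_per_param")
echo "Weights: ~${need} GB of ${total_hbm} GB HBM across ${chips} chips"
echo "Left for KV cache and activations: ~$(( total_hbm - need )) GB"
```

Under these assumptions the weights alone exceed a single chip's memory, and the remainder of the combined memory is what the KV cache for the configured --max-model-len has to fit into.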
  3. To monitor progress, run the following commands. The pod stays pending until the node is provisioned, which may take 13 minutes or more.
kubectl get pods

kubectl get deployments

kubectl describe pods -l app=gemma4-tpu

echo "       __|__"
echo "  --@--(_|_)--@--"
echo ""
echo "Waiting for Autopilot to register the TPU node (this takes a few minutes)..."

until kubectl get nodes -l gke.networks.io/accelerator-network-profile=auto -o name | grep -q "node/"; do
  sleep 60
done

echo "TPU Node detected in cluster! Waiting for hardware to provision and become Ready..."

kubectl wait --for=condition=Ready nodes -l gke.networks.io/accelerator-network-profile=auto --timeout=900s
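The until loop above keeps polling forever if the node never appears, for example when TPU quota is missing. A bounded variant is sketched below. The helpers retry_until and tpu_node_present are hypothetical names, and the attempt count and delay are arbitrary choices.

```shell
# Run a command until it succeeds, at most max_attempts times,
# sleeping delay_seconds between attempts. Returns 1 if it never succeeds.
retry_until() {
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay_seconds"
  done
  return 0
}

# Hypothetical wrapper around the lab's node check.
tpu_node_present() {
  kubectl get nodes -l gke.networks.io/accelerator-network-profile=auto -o name 2>/dev/null | grep -q "node/"
}

# Usage: give up after 30 checks, one minute apart.
#   retry_until 30 60 tpu_node_present || echo "Timed out waiting for the TPU node" >&2
```

If the wait times out, kubectl describe pods -l app=gemma4-tpu usually shows the scheduling events that explain why.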
  4. After the node is created and the pod is scheduled, stream the pod logs with the command below (the -f flag streams the output). Loading the model can take 15 minutes or more. When you see a line such as (APIServer pid=1) INFO: 169.254.4.6:44290 - "GET /health HTTP/1.1" 200 OK, the model is ready to serve, and the sed filter exits at that line.
kubectl logs -l app=gemma4-tpu -f | sed -u '\,"GET /health HTTP/1.1" 200 OK,q'
  5. Once the deployment is available, you can verify that the high-speed networking is properly attached to your TPU pods. Run the following command:
for pod in $(kubectl get pods -l app=gemma4-tpu -o name); do
  echo "=== Checking Networking for $pod ==="
  kubectl exec $pod -- ls /sys/class/net
  echo ""
done

What to look for: You should see the standard eth0 alongside additional interfaces such as eth1 and higher.

These additional interfaces confirm that the high-speed managed DRANET fabric is successfully attached to your pod.
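To make that check less dependent on reading the output by eye, you can filter out the interfaces every pod has and count what remains. The following is a minimal sketch using sample interface names. The function name extra_ifaces is hypothetical.

```shell
# Read interface names (one per line) and print only the additional ones,
# dropping the loopback device and the default eth0.
extra_ifaces() {
  grep -v -x -e lo -e eth0
}

# Sample input; in the lab, pipe in the output of:
#   kubectl exec <pod> -- ls /sys/class/net
printf 'eth0\neth1\neth2\nlo\n' | extra_ifaces
```

An empty result would mean that no additional network devices were allocated to the pod.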

6. Interact with AI model using curl

To verify the gemma-4-31B model that you deployed, set up port forwarding from the service to your local machine.

  1. Run this in your current Cloud Shell:
kubectl port-forward service/gem4-dra-service 8000:8000 &
  2. Now, open an additional Cloud Shell window for the same project to chat with your model by using curl. This command sends a prompt and streams the output directly to your terminal.
time curl -sN http://127.0.0.1:8000/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31B-it",
    "messages": [
      {
        "role": "user",
        "content": "How can GKE help deployment of AI workloads? Provide concise information. Keep the explanation under 300 words."
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "stream": true,
    "stream_options": {"include_usage": true}
  }' | grep '^data:' | sed 's/^data: //' | grep -v '\[DONE\]' | jq --unbuffered -j '
    (.choices[0].delta.content // empty), 
    if .usage then "\n\n--- Usage ---\nPrompt: \(.usage.prompt_tokens)\nCompletion: \(.usage.completion_tokens)\nTotal: \(.usage.total_tokens)\n" else empty end 
  '
  3. Check out the response from your model.
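The pipeline after the curl command exists because a streamed response arrives as server-sent events: each chunk is a line that starts with "data: ", and the stream ends with a [DONE] marker. The sketch below isolates that filtering stage and runs it on a hand-written sample instead of a live response. The function name sse_payloads is hypothetical.

```shell
# Keep only the JSON payloads from a server-sent-events stream:
# select "data:" lines, strip the prefix, and drop the [DONE] sentinel.
sse_payloads() {
  grep '^data:' | sed 's/^data: //' | grep -v '\[DONE\]'
}

# Sample stream with one content chunk followed by the end marker.
printf 'data: {"choices":[{"delta":{"content":"Hi"}}]}\n\ndata: [DONE]\n' | sse_payloads
```

What remains is one JSON object per line, which is the form the jq filter in the lab command reads.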

Observability

Because we applied the PodMonitoring custom resource, Cloud Monitoring scrapes metrics from the vLLM container on port 8000. In the Google Cloud Console, go to Monitoring -> Dashboards to view metrics such as token generation latency, queue length, and KV cache usage.
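Before looking at dashboards, you can see what the /metrics endpoint exposes by listing the metric names from its Prometheus text output. The sketch below runs on a hand-written sample. The vllm: prefix and the two metric names are assumptions about vLLM's naming, and metric_names is a hypothetical helper.

```shell
# List the distinct metric names in Prometheus text format, skipping
# comment and blank lines and stripping labels and values.
metric_names() {
  grep -v -e '^#' -e '^$' | sed 's/[{ ].*//' | sort -u
}

# Sample input; metric names here are assumed for illustration.
printf '# HELP vllm:num_requests_running Running requests\nvllm:num_requests_running{model_name="m"} 1\nvllm:num_requests_waiting{model_name="m"} 0\n' | metric_names
```

With the port-forward from the previous section still running, the same filter can be applied to live data: curl -s http://127.0.0.1:8000/metrics | metric_names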

7. Clean up

  1. Delete the resources by running the following.
cd ~/gke-auto-tpu

kubectl delete -f gem4-auto-dra-tpu.yaml
kubectl delete -f resourceclaimtpu.yaml
kubectl delete -f computeclass.yaml
kubectl delete secret hf-secret
  2. Clean up the infrastructure with the following command. Type yes to confirm.
terraform destroy

8. Congratulations

You have successfully deployed a managed DRANET environment on GKE Autopilot, provisioned TPU v6e hardware dynamically, and served the 31-billion-parameter Gemma 4 model using vLLM.

By using GKE Autopilot, you allow Kubernetes to handle the underlying node provisioning and infrastructure management, letting you focus entirely on deploying your AI workload.

Next steps / Learn more

You can read more about GKE networking
