1. Introduction
This lab details how to build, provision, and execute a high-performance distributed Reinforcement Learning (RL) training loop on GKE Standard with GKE Agent Sandboxes (gVisor), using the Group Relative Policy Optimization (GRPO) algorithm with the trl library.
The goal is to demonstrate how to securely evaluate untrusted, LLM-generated code during an RL training loop. We achieve this by decoupling the orchestration plane (Ray) from the execution plane (GKE Agent Sandboxes).
The Technical Challenge of RL Code Evaluation
When training LLM agents using Reinforcement Learning (e.g., training a model to write code by evaluating its output on unit tests), the training loop must execute thousands of untrusted, LLM-generated Python scripts in parallel. This introduces critical challenges:
- The Pod Churn Bottleneck: Traditional evaluation frameworks spin up a fresh Docker container per task. Doing this dynamically for hundreds of parallel rollouts during an RL training loop puts severe load on the Kubernetes control plane, and the resulting latency makes high-frequency RL training impractical.
- The Security Risk: Standard container runtimes share the host OS kernel with the code they run. A single container-escape vulnerability triggered by arbitrary, LLM-generated code can compromise your nodes.
- IAM Token Theft: LLM-generated code running inside a Kubernetes pod can query the cloud provider's metadata server to steal node IAM service account tokens.
The Solution: Decoupled Orchestration & Execution
This architecture decouples orchestration from execution:
- The Orchestrator (Ray): A distributed Ray cluster manages the RL training loop and distributes rollout generation.
- The Execution Plane (GKE Agent Sandbox): Instead of creating Kubernetes pods dynamically, Ray workers make simple HTTP calls to a dedicated Sandbox Router. The router instantly assigns the worker an isolated, pre-warmed container running under gVisor (GKE Sandbox).
- Sub-Second Latency: Because sandboxes are pre-warmed in a managed SandboxWarmPool and managed via a high-speed HTTP gateway, environment creation drops to under 200ms, completely bypassing the Kubernetes control plane.
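The claim-and-release pattern behind a warm pool can be illustrated with a toy in-memory model. This is only a sketch of the idea, not the Agent Sandbox implementation: ToyWarmPool and its methods are hypothetical names, and the replace-on-release behavior is an assumption made for illustration.

```python
from collections import deque


class ToyWarmPool:
    """Toy model of a warm pool: sandboxes are created ahead of time, so a
    claim is a queue pop rather than a fresh container start."""

    def __init__(self, replicas: int) -> None:
        # Pre-create sandbox identifiers up front (the "warming" step).
        self._ready = deque(f"sandbox-{i}" for i in range(replicas))
        self._next_id = replicas

    def claim(self) -> str:
        """Hand out a ready sandbox, or fail if the pool is exhausted."""
        if not self._ready:
            raise RuntimeError("warm pool exhausted")
        return self._ready.popleft()

    def release(self, name: str) -> None:
        """Assumption: a used sandbox is discarded and a fresh one is warmed
        in its place, so no state leaks between rollouts."""
        self._ready.append(f"sandbox-{self._next_id}")
        self._next_id += 1

    def ready_count(self) -> int:
        return len(self._ready)


pool = ToyWarmPool(replicas=3)
first = pool.claim()   # immediate: no container is started here
pool.release(first)    # the pool returns to its target size
```

In this model the expensive step (creation) happens off the critical path, which is the property the architecture relies on.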
Lab Objectives
In this codelab, you will learn:
- The architectural challenges and solutions for evaluating untrusted code in RL loops.
- How to build custom sandbox images for efficient rollouts.
- How to configure and use GKE Agent Sandboxes and SandboxWarmPools.
- How to securely isolate sandboxes to prevent IAM token theft.
- How to run a basic RL training job with SWE-bench and TRL using Ray to decouple orchestration from execution.
2. Cluster Creation & Prerequisites
Before proceeding, you need a GKE cluster with a high-performance GPU node pool and the Ray Operator installed to manage the training workload.
Prerequisites
This codelab assumes the following tools are installed and configured:
- Google Cloud SDK (gcloud)
- Docker (required for building custom images locally)
- kubectl
Environment Variables
First, set the environment variables that will be used throughout this codelab. The commands below use sensible defaults, but you can change them as needed to match your specific Google Cloud environment:
export PROJECT_ID=$(gcloud config get-value project)
export REGION="us-west3"
export ZONE="us-west3-a"
export REPO_NAME="rl-sandbox-repo"
Create an Artifact Registry repository to hold the custom container images:
gcloud artifacts repositories create $REPO_NAME \
--repository-format=docker \
--location=$REGION \
--description="Repository for RL Sandbox images"
Cluster Configuration
For a complete walkthrough on provisioning a GKE cluster optimized for AI workloads (including GPUDirect RDMA network wiring), follow the official documentation: Create a GKE AI Hypercompute Custom Cluster
Crucial Prerequisite: When creating your cluster or a specific execution node pool, ensure you pass the --enable-agent-sandbox and --sandbox type=gvisor flags to install the required Custom Resource Definitions (CRDs) for the Sandbox warm pools.
Assuming your cluster, GPUs, and Ray Operator are running, everything below details how to configure the execution plane and run the RL loop.
3. Build Custom Images
A crucial aspect of running high-performance RL is baking dependencies into your images. We need three distinct images: one for the GPU workers running the model, a lightweight CPU-only one for the Ray head node, and one for the isolated sandboxes running the untrusted evaluation code.
1. Build the GPU Worker Image
The Ray GPU worker needs libraries to run the language model and orchestrate the training loop. We build this image on top of the official vLLM image so it supports the latest GPUs and has PyTorch/CUDA pre-installed.
Run the following command to create Dockerfile.gpu_worker:
cat << 'EOF' > Dockerfile.gpu_worker
# ==============================================================================
# Base Image: Use the official vLLM production image.
# This image comes pre-baked with PyTorch 2.11, CUDA 13.0, and vLLM.
# It supports sm_100 Blackwell GPUs natively!
# ==============================================================================
FROM vllm/vllm-openai:latest
USER root
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
numactl \
libnuma-dev \
wget \
ca-certificates \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Install Ray, TRL, and Sandbox tools
# TRL does not require compiling flash_attn from source.
RUN pip install --no-cache-dir \
"ray[default]==2.55.1" \
"numpy<2.0" \
"gymnasium>=0.28.1" \
"k8s-agent-sandbox>=0.4.6" \
trl transformers packaging ninja cachetools accelerate datasets peft
EOF
Build and push the image to your Artifact Registry repository:
export WORKER_REPO="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-gpu-worker:v1"
docker build -f Dockerfile.gpu_worker -t $WORKER_REPO .
docker push $WORKER_REPO
Note: This guide uses local docker commands to build images. If you prefer to build the images remotely, you can use Cloud Build instead (e.g. using gcloud builds submit).
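The image tags in this codelab all follow the same Artifact Registry pattern built from REGION, PROJECT_ID, and REPO_NAME. As a small sketch, the helper below (image_uri is a hypothetical name, and my-project is a placeholder project ID) builds the same string and fails loudly when a variable is unset, which the shell would otherwise expand to an empty segment.

```python
import os


def image_uri(name: str, tag: str, env=os.environ) -> str:
    """Build an Artifact Registry Docker image URI from the codelab's
    REGION, PROJECT_ID, and REPO_NAME variables."""
    required = ("REGION", "PROJECT_ID", "REPO_NAME")
    missing = [key for key in required if not env.get(key)]
    if missing:
        # An unset variable would otherwise yield a malformed image tag.
        raise KeyError(f"unset environment variables: {', '.join(missing)}")
    return (
        f"{env['REGION']}-docker.pkg.dev/"
        f"{env['PROJECT_ID']}/{env['REPO_NAME']}/{name}:{tag}"
    )


# Placeholder values for illustration only.
example_env = {
    "REGION": "us-west3",
    "PROJECT_ID": "my-project",
    "REPO_NAME": "rl-sandbox-repo",
}
worker_image = image_uri("ray-gpu-worker", "v1", example_env)
print(worker_image)
```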
2. Build the CPU Head Image
The Ray head node only orchestrates the cluster and does not run the heavy GPU training models. To avoid a massive image pull bottleneck (typically 15GB+) on your standard CPU nodes, we build a lightweight, CPU-only image for the head node. This image contains Ray and the required Python libraries, but excludes heavy GPU libraries like CUDA and vLLM.
Run the following command to create Dockerfile.head:
cat << 'EOF' > Dockerfile.head
# ==============================================================================
# Base Image: Use the official Python slim image for the exact patch version.
# This aligns the Python version (3.12.13) with the GPU worker node.
# ==============================================================================
FROM python:3.12.13-slim
USER root
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
wget \
ca-certificates \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Install Ray, TRL, and Sandbox tools (CPU versions where applicable)
# We install torch CPU first to avoid pulling the 2GB+ CUDA torch package.
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu && \
pip install --no-cache-dir \
"ray[default]==2.55.1" \
"numpy<2.0" \
"gymnasium>=0.28.1" \
"k8s-agent-sandbox>=0.4.6" \
trl transformers packaging ninja cachetools accelerate datasets peft
# Create a 'ray' user to run the container securely and match Ray conventions
RUN useradd -ms /bin/bash ray
USER ray
WORKDIR /home/ray
EOF
Build and push the image:
export HEAD_REPO="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-head:v1"
docker build -f Dockerfile.head -t $HEAD_REPO .
docker push $HEAD_REPO
3. Build the Sandbox Image
The sandbox needs the specific dependencies for the task we are evaluating so that runtime installation is instantaneous. For this codelab, we will use an issue from the django/django repository in SWE-bench. We will pre-clone the repository and pre-build the python environments so our model scripts don't waste time downloading them in the RL loop.
Run the following command to create Dockerfile.sandbox:
cat << 'EOF' > Dockerfile.sandbox
# Use the conda-forge Miniforge image (Ubuntu-based; it ships the 'ubuntu' user used below)
FROM condaforge/miniforge3:latest
# 1. Install essential system libraries (including sqlite3 for Django tests)
RUN apt-get update && apt-get install -y \
git \
build-essential \
libsqlite3-dev \
&& rm -rf /var/lib/apt/lists/*
# 2. Set up the /workspace directory and grant ownership to the pre-existing non-root 'ubuntu' user (UID 1000)
RUN mkdir -p /workspace \
&& chown -R 1000:1000 /workspace
# 3. Switch to the non-root user
USER ubuntu
WORKDIR /workspace
# 4. Pre-configure Git globally so the agent can run git commands
RUN git config --global user.email "agent@gke-sandbox.local" \
&& git config --global user.name "Agent"
# 5. Pre-clone the repository as the non-root user
RUN git clone https://github.com/django/django.git .
# 6. Pre-build Conda environments and pre-cache common dependencies
# We do NOT run "pip install -e ." here to avoid Python version conflicts with the main branch.
# Instead, we pre-install the heavy dependencies so that runtime installation is instantaneous.
RUN conda create -y -n django-py39 python=3.9 \
&& conda run -n django-py39 pip install --no-cache-dir asgiref sqlparse tzdata pytest pytest-django
RUN conda create -y -n django-py310 python=3.10 \
&& conda run -n django-py310 pip install --no-cache-dir asgiref sqlparse tzdata pytest pytest-django
# --- Add Agent Server ---
# We use a multi-stage build to copy the agent server from the official python-runtime-sandbox image
COPY --from=registry.k8s.io/agent-sandbox/python-runtime-sandbox:v0.1.0 /app /opt/sandbox-agent
USER root
RUN chown -R 1000:1000 /opt/sandbox-agent \
&& /opt/conda/bin/pip install --no-cache-dir -r /opt/sandbox-agent/requirements.txt \
&& sed -i 's|"/app"|"/workspace"|g' /opt/sandbox-agent/main.py
USER ubuntu
# ------------------------
# Prepend the django-py39 conda environment bin to PATH for commands executed inside the container
ENV PATH=/home/ubuntu/.conda/envs/django-py39/bin:$PATH
# Keep the container alive and run the agent server using the system Python
CMD ["/opt/conda/bin/python3", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8888", "--log-level", "trace", "--app-dir", "/opt/sandbox-agent"]
EOF
Build and push the image:
export SANDBOX_REPO="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/django-sandbox:v1"
docker build -f Dockerfile.sandbox -t $SANDBOX_REPO .
docker push $SANDBOX_REPO
4. Configure Orchestration and Execution
Now we deploy the Ray cluster for orchestration and the Sandbox resources for execution.
1. Ray Cluster Configuration
Deploy a RayCluster custom resource. Your cluster's available resources (such as memory, CPU, or GPU type) may differ, so adjust the resource requests and limits accordingly.
Run the following command to create raycluster.yaml. This uses cat << EOF to automatically substitute your environment variables into the manifest:
cat << EOF > raycluster.yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: grpo-cluster
  namespace: default
spec:
  rayVersion: "2.55.1"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-head:v1
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "2"
              memory: "8Gi"
            requests:
              cpu: "2"
              memory: "8Gi"
  workerGroupSpecs:
  - groupName: gpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-gpu-worker:v1
          resources:
            limits:
              cpu: "12"
              memory: "120Gi"
              nvidia.com/gpu: "1"
            requests:
              cpu: "12"
              memory: "120Gi"
              nvidia.com/gpu: "1"
EOF
Apply it:
kubectl apply -f raycluster.yaml
Verify that the cluster is created and running (this may take a few minutes):
kubectl get raycluster
Expected output:
NAME           DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
grpo-cluster   1                 1                                          ready    2m
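The CPU limits in raycluster.yaml also bound how many sandbox evaluations Ray can schedule at once. The arithmetic below is a rough sketch under two assumptions: that each container's CPU limit is advertised to Ray as its CPU count, and that each evaluation task uses Ray's default of one CPU. The reservation of 8 CPUs corresponds to the training task defined in the training script later in this codelab; free_cpu_slots is a hypothetical helper name.

```python
def free_cpu_slots(node_cpus, reserved_cpus, cpus_per_task=1):
    """Return how many extra tasks fit once long-running reservations
    (such as the training task) are subtracted from the cluster total."""
    free = sum(node_cpus) - reserved_cpus
    if free < 0:
        raise ValueError("reservations exceed the cluster's CPU capacity")
    return free // cpus_per_task


# Head node: 2 CPUs, GPU worker: 12 CPUs, training task reserves 8 CPUs.
slots = free_cpu_slots(node_cpus=[2, 12], reserved_cpus=8)
print(slots)  # 6
```

If a reward call dispatches more evaluations than there are free slots, the extra tasks queue until a slot frees up, so it may be worth raising the CPU limits when you increase the number of generations.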
2. SandboxRouter Configuration
The SandboxRouter acts as a high-speed HTTP gateway, fielding requests from Ray workers and bridging them instantly to available gVisor pods, bypassing the slower Kubernetes API server pod lifecycle.
Run the following command to create sandbox_router.yaml:
cat << 'EOF' > sandbox_router.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: sandbox-claim-manager
rules:
- apiGroups: ["extensions.agents.x-k8s.io"]
  resources: ["sandboxclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["agents.x-k8s.io"]
  resources: ["sandboxes"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sandbox-claim-manager-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  kind: Role
  name: sandbox-claim-manager
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Service
metadata:
  name: sandbox-router
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: sandbox-router
  ports:
  - name: http
    protocol: TCP
    port: 8080
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sandbox-router-deployment
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sandbox-router
  template:
    metadata:
      labels:
        app: sandbox-router
    spec:
      containers:
      - name: router
        image: us-central1-docker.pkg.dev/k8s-staging-images/agent-sandbox/sandbox-router:latest-main
        ports:
        - containerPort: 8080
        env:
        - name: ALLOW_UNAUTHENTICATED_ROUTER
          value: "true"
EOF
Apply it:
kubectl apply -f sandbox_router.yaml
Verify the deployment is running:
kubectl get deployment sandbox-router-deployment
Expected output:
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
sandbox-router-deployment   2/2     2            2           1m
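Ray workers reach the router through its ClusterIP Service, using the standard in-cluster DNS form service.namespace.svc.cluster-domain. The sketch below derives the URL from the Service defined above; service_url is a hypothetical helper, and cluster.local is assumed as the cluster domain (the Kubernetes default).

```python
def service_url(service: str, namespace: str = "default", port: int = 8080,
                cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster HTTP URL for a Kubernetes Service."""
    return f"http://{service}.{namespace}.svc.{cluster_domain}:{port}"


router_url = service_url("sandbox-router")
print(router_url)
```

Deriving the URL from the Service name, namespace, and port keeps the client configuration in step with the manifest if you deploy the router into a different namespace.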
3. SandboxTemplate and WarmPool Configuration
GKE Agent Sandbox allows instant assignment of isolated, pre-warmed containers using the Sandbox Router. We define a SandboxTemplate and a SandboxWarmPool to keep pods ready.
Run the following command to create sandbox_warmpool.yaml with your environment variables:
cat << EOF > sandbox_warmpool.yaml
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: swe-bench-django
  namespace: default
spec:
  podTemplate:
    spec:
      runtimeClassName: gvisor
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      nodeSelector:
        sandbox.gke.io/runtime: gvisor
      tolerations:
      - key: sandbox.gke.io/runtime
        operator: Equal
        value: gvisor
        effect: NoSchedule
      containers:
      - name: sandbox
        image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/django-sandbox:v1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
---
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxWarmPool
metadata:
  name: swe-bench-django-warmpool
  namespace: default
spec:
  replicas: 10
  sandboxTemplateRef:
    name: swe-bench-django
EOF
Apply it:
kubectl apply -f sandbox_warmpool.yaml
Verify the SandboxWarmPool is initialized:
kubectl get sandboxwarmpool
Expected output:
NAME                        READY   AGE
swe-bench-django-warmpool   10      1m
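The replicas value should cover the peak number of sandboxes claimed at the same time, plus some slack while replacements warm up. The helper below is a rough sizing sketch, not a recommendation from the Agent Sandbox project: warm_pool_replicas is a hypothetical name and the 25% headroom is an arbitrary illustrative default.

```python
import math


def warm_pool_replicas(concurrent_rollouts: int, headroom: float = 0.25) -> int:
    """Suggest a replica count: peak concurrent claims plus fractional
    headroom, rounded up to a whole sandbox."""
    if concurrent_rollouts < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    return math.ceil(concurrent_rollouts * (1 + headroom))


print(warm_pool_replicas(8))  # 10
```

Under these assumptions, a pool of 10 covers about 8 concurrent rollouts. Measure the actual peak in your own run before settling on a value.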
4. Security Isolation
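A NetworkPolicy blocks sandbox egress to the GCP metadata server (169.254.169.254), which prevents IAM token theft. All other egress remains allowed. The policy applies only to pods that carry the label named in its podSelector, and only on clusters where network policy enforcement is enabled, so verify both for your sandbox pods (for example with kubectl get pods --show-labels).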
Run the following command to create network_policy.yaml:
cat << 'EOF' > network_policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-metadata-egress
  namespace: default
spec:
  podSelector:
    matchLabels:
      sandbox.gke.io/runtime: gvisor
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32
EOF
Apply the policy:
kubectl apply -f network_policy.yaml
Verify the NetworkPolicy was created:
kubectl get networkpolicy
Expected output:
NAME                    POD-SELECTOR                    AGE
block-metadata-egress   sandbox.gke.io/runtime=gvisor   1m
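To reason about what the ipBlock rule permits, the sketch below re-implements its matching logic with Python's ipaddress module. It models only the cidr/except semantics of this one rule, not the full NetworkPolicy evaluation; egress_allowed is a hypothetical helper name.

```python
import ipaddress


def egress_allowed(dest_ip: str, cidr: str = "0.0.0.0/0",
                   except_cidrs=("169.254.169.254/32",)) -> bool:
    """Return True if dest_ip falls inside cidr and outside every
    except block, mirroring a single ipBlock egress rule."""
    addr = ipaddress.ip_address(dest_ip)
    if addr not in ipaddress.ip_network(cidr):
        return False
    return not any(addr in ipaddress.ip_network(block) for block in except_cidrs)


print(egress_allowed("169.254.169.254"))  # False: metadata server is excluded
print(egress_allowed("8.8.8.8"))          # True: public internet is allowed
print(egress_allowed("10.0.0.5"))         # True: private ranges are allowed
```

The last two results show that this rule alone blocks the metadata endpoint but leaves other destinations, including private address ranges, reachable. Add further except entries or narrower cidr values if your threat model needs tighter isolation.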
5. Basic RL Job with SWE-bench and TRL
Once the cluster and sandboxes are prepared, we can run a GRPO training loop. We will use the trl library to orchestrate the GRPO algorithm, and Ray remote functions to evaluate the generated code inside the isolated sandboxes.
To make the execution fast for this codelab, we will filter down to a single Django issue. The routing logic below shows how you would select different warmpools for different repositories, which is useful when expanding to the full SWE-bench dataset.
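GRPO scores each completion relative to the other completions sampled for the same prompt, which is why the sandbox reward matters only through its spread within a group. The sketch below shows the group-relative normalization in plain Python. It is an illustration rather than the trl implementation: it uses the population standard deviation and an arbitrary epsilon, and libraries may differ in both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions:
    advantage_i = (reward_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four completions: one passes, two run but fail, one is malformed.
advantages = group_relative_advantages([1.0, 0.1, 0.1, 0.0])
# A group where every completion earns the same reward carries no signal.
flat = group_relative_advantages([0.1, 0.1, 0.1, 0.1])
```

When every completion in a group receives the same reward, all advantages are zero and the step contributes no gradient. Graded rewards make such flat groups less likely than a strict pass/fail signal would.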
The Training Script
Run the following command to create train_trl.py:
cat << 'EOF' > train_trl.py
import ray
from k8s_agent_sandbox import SandboxClient
from k8s_agent_sandbox.models import SandboxDirectConnectionConfig
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import urllib.request
import re
import shlex

ray.init(ignore_reinit_error=True)

# 1. Define the Ray remote evaluation function
@ray.remote
def evaluate_rollout(code, prompt_data):
    client = SandboxClient(connection_config=SandboxDirectConnectionConfig(api_url="http://sandbox-router.default.svc.cluster.local:8080"))
    # Repo name, kept for routing to per-repo warm pools (unused in this single-task example)
    repo = prompt_data.get("repo")
    # In a full system, you'd route to different warmpools based on repo
    # Here we default to django for our single task
    sandbox = client.create_sandbox(
        template="swe-bench-django",
        warmpool="swe-bench-django-warmpool",
        sandbox_ready_timeout=600
    )
    try:
        # Check if the code is correctly formatted
        bash_match = re.search(r"```bash\n(.*?)\n```", code, re.DOTALL)
        if not bash_match:
            return 0.0
        script = bash_match.group(1)
        # In a real environment, we would apply the base commit and install here
        # For simplicity, we just execute the script
        script_cmd = f"bash -c {shlex.quote(script)}"
        result = sandbox.commands.run(script_cmd, timeout=60)
        # Full reward when the script exits successfully
        # (a real setup would run the tests and use the pass ratio)
        if result.exit_code == 0:
            return 1.0
        # Small partial reward for a well-formed script that failed
        return 0.1
    finally:
        # Clean up and release the sandbox back to the pool
        client.delete_sandbox(sandbox.claim_name)

# 2. Define the Reward Function for TRL
def sandbox_reward_func(prompts, completions, **kwargs):
    # Dispatch evaluation to Ray cluster
    futures = [
        evaluate_rollout.remote(completion, {
            "repo": kwargs.get('repo', [])[i] if 'repo' in kwargs else None,
            "base_commit": kwargs.get('base_commit', [])[i] if 'base_commit' in kwargs else None
        }) for i, completion in enumerate(completions)
    ]
    # Block and wait for all sandbox evaluations to complete
    rewards = ray.get(futures)
    return rewards

# 3. Setup GRPO Trainer
@ray.remote(num_gpus=1, num_cpus=8)
def train():
    # Load dataset
    dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
    # Filter to our selected target issue
    dataset = dataset.filter(lambda x: x["instance_id"] == "django__django-15388")

    def format_dataset(example):
        # Take the first file touched by the reference patch as the target file
        files = re.findall(r'^\+\+\+ b/(.+)$', example["patch"], re.MULTILINE)
        target_file = files[0] if files else ""
        file_content = ""
        if target_file:
            try:
                github_repo = example["repo"]
                url = f"https://raw.githubusercontent.com/{github_repo}/{example['base_commit']}/{target_file}"
                with urllib.request.urlopen(url) as response:
                    file_content = response.read().decode('utf-8')
            except Exception:
                # Fall back to an empty file body if the download fails
                pass
        prompt = f"""You are an expert software engineer.
You are given a GitHub issue and the content of the file that contains the bug.
Write an executable bash script that will modify the target file to fix the bug (e.g. using cat << 'EOF' > {target_file} or inline python edits).
Wrap your bash script in ```bash ... ``` tags. Do not output raw python code directly.
Target File: {target_file}
Original File Content:
```python
{file_content}
```
Issue:
{example['problem_statement']}
"""
        return {
            "prompt": prompt,
            "repo": example["repo"],
            "instance_id": example["instance_id"],
            "base_commit": example["base_commit"],
        }

    dataset = dataset.map(format_dataset)
    model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    training_args = GRPOConfig(
        output_dir="outputs",
        learning_rate=5e-6,
        max_steps=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_generations=4,
    )
    trainer = GRPOTrainer(
        model=model_name,
        processing_class=tokenizer,
        reward_funcs=[sandbox_reward_func],
        args=training_args,
        train_dataset=dataset,
    )
    print("Starting GRPO training with GKE Agent Sandboxes...")
    trainer.train()

def main():
    print("Submitting training job to GPU worker...")
    ray.get(train.remote())

if __name__ == "__main__":
    main()
EOF
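The parsing and reward mapping inside evaluate_rollout can be exercised locally without a cluster. The sketch below isolates those two steps as pure functions so they can be unit tested; the function names are hypothetical, and the fence marker is built programmatically only to keep the example readable.

```python
import re
import shlex

FENCE = "`" * 3  # three backticks
BASH_BLOCK = re.compile(FENCE + r"bash\n(.*?)\n" + FENCE, re.DOTALL)


def extract_bash_script(completion: str):
    """Return the body of the first bash block, or None if there is none."""
    match = BASH_BLOCK.search(completion)
    return match.group(1) if match else None


def to_command(script: str) -> str:
    """Quote the script so it reaches bash as a single argument."""
    return f"bash -c {shlex.quote(script)}"


def reward_from_exit_code(script, exit_code) -> float:
    """Mirror the codelab's mapping: 0.0 malformed, 1.0 success, 0.1 failure."""
    if script is None:
        return 0.0
    return 1.0 if exit_code == 0 else 0.1


completion = f"Here is the fix:\n{FENCE}bash\necho 'hi'\n{FENCE}\n"
script = extract_bash_script(completion)
command = to_command(script)
```

Splitting the command back with shlex.split yields exactly three arguments (bash, -c, and the untouched script), which confirms that quotes inside the generated script do not break out of the wrapper.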
Submit the Job to the Cluster
First, port-forward to the Ray Head dashboard and submit the training job from your local machine:
kubectl port-forward service/grpo-cluster-head-svc 8265:8265 &
ray job submit \
--address http://localhost:8265 \
--runtime-env-json '{"working_dir": "."}' \
-- python train_trl.py
Monitor the Run
You can monitor the progress of your run:
- Ray Dashboard: Open http://localhost:8265 in your browser.
- Sandbox Claims: Watch GKE dynamically claim and release sandboxes under gVisor:
watch -n 1 "kubectl get sandboxclaims,sandboxes,pods"
6. Conclusion
Congratulations! You have successfully configured and executed a high-performance distributed RL training loop securely on GKE Standard using GKE Agent Sandboxes.