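High-Performance Distributed RL on GKE Standard: A Complete Guide

1. Introduction

This codelab walks through how to build, provision, and run a high-performance distributed reinforcement learning (RL) training loop on GKE Standard, using GKE Agent Sandbox (gVisor) and the Group Relative Policy Optimization (GRPO) algorithm from the trl library.

The goal is to demonstrate how to safely evaluate untrusted, LLM-generated code inside an RL training loop. We achieve this by decoupling the orchestration layer (Ray) from the execution layer (GKE Agent Sandbox).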

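Technical challenges of RL code evaluation

When you train an LLM agent with reinforcement learning (for example, training a model to write code by evaluating the output of unit tests), the training loop must execute thousands of untrusted, LLM-generated Python scripts in parallel. This poses significant challenges:

Pod churn bottleneck: Traditional evaluation frameworks start a new Docker container for every task. Doing this dynamically for hundreds of parallel rollouts during an RL training loop puts heavy load on the Kubernetes control plane. The resulting latency makes high-frequency RL training impractical.
Security risk: Arbitrary LLM-generated code that runs in a standard container runtime shares the host OS kernel. A single container-escape vulnerability can compromise the node.
IAM token theft: LLM-generated code running in a Kubernetes Pod can query the cloud provider's metadata server and steal the node's IAM service account token.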

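Solution: decoupled orchestration and execution

This architecture separates orchestration from execution:

Orchestrator (Ray): A distributed Ray cluster manages the RL training loop and distributes rollout generation.
Execution plane (GKE Agent Sandbox): Instead of dynamically creating Kubernetes Pods, Ray workers make simple HTTP calls to a dedicated Sandbox Router. The router immediately assigns the worker a pre-warmed container running under gVisor (GKE Sandbox).
Sub-second latency: Because sandboxes are pre-warmed in a managed SandboxWarmPool and served through a high-speed HTTP gateway, environment creation time drops below 200 milliseconds and bypasses the Kubernetes control plane entirely.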
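
The warm-pool idea can be illustrated with a toy in-memory model. This is only a sketch for intuition, not the GKE implementation: the class name ToyWarmPool, the sandbox names, and the 5-second cold-start figure are all assumptions made up for the example, and the wait times are simulated rather than measured.

```python
from collections import deque


class ToyWarmPool:
    """In-memory stand-in for a warm pool; not the GKE implementation."""

    def __init__(self, size: int, cold_start_seconds: float) -> None:
        self.size = size
        self.cold_start_seconds = cold_start_seconds
        # Sandboxes that were started ahead of time and are waiting for a claim.
        self._ready = deque(f"sandbox-{i}" for i in range(size))
        self._created = size

    def claim(self) -> tuple[str, float]:
        """Return (sandbox name, simulated wait in seconds)."""
        if self._ready:
            # Warm path: a pre-started sandbox is handed over immediately.
            return self._ready.popleft(), 0.0
        # Cold path: nothing is pre-warmed, so the caller pays the start-up cost.
        name = f"sandbox-{self._created}"
        self._created += 1
        return name, self.cold_start_seconds

    def refill(self) -> None:
        """Top the pool back up to its target size, off the request path."""
        while len(self._ready) < self.size:
            self._ready.append(f"sandbox-{self._created}")
            self._created += 1


pool = ToyWarmPool(size=2, cold_start_seconds=5.0)  # 5.0 is a made-up figure
waits = [pool.claim()[1] for _ in range(3)]
print(waits)  # [0.0, 0.0, 5.0]
```

The third claim pays the cold-start cost because the pool of two is exhausted, which is why the pool size should roughly track the number of rollouts evaluated in parallel.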

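Codelab objectives

In this codelab, you will learn:

The architectural challenges of evaluating untrusted code in an RL loop, and their solutions.
How to build custom sandbox images for efficient rollouts.
How to configure and use GKE Agent Sandbox and SandboxWarmPool.
How to securely isolate sandboxes to prevent IAM token theft.
How to use Ray to run a basic RL training job with SWE-bench and TRL while keeping orchestration separate from execution.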

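2. Cluster creation and prerequisites

Before you continue, you need a GKE cluster with a high-performance GPU node pool, and the Ray operator installed to manage the training workload.

Prerequisites

This codelab assumes that you have installed and configured the following tools:

Google Cloud SDK (gcloud)
Docker (required to build the custom images locally)
kubectl

Environment variables

First, set the environment variables used throughout this codelab. The following commands use reasonable defaults, but you can change them to match your Google Cloud environment: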

export PROJECT_ID=$(gcloud config get-value project)
export REGION="us-west3"
export ZONE="us-west3-a"
export REPO_NAME="rl-sandbox-repo"

Create an Artifact Registry repository to store the custom container images:

gcloud artifacts repositories create $REPO_NAME \
    --repository-format=docker \
    --location=$REGION \
    --description="Repository for RL Sandbox images"

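Cluster setup

For a complete walkthrough of how to provision a GKE cluster optimized for AI workloads (including GPUDirect RDMA networking), see the official documentation: Create a custom GKE AI Hypercompute cluster

Important prerequisite: When you create the cluster or the dedicated execution node pool, be sure to pass the --enable-agent-sandbox and --sandbox type=gvisor flags. They install the custom resource definitions (CRDs) that the Sandbox warm pool requires.

Assuming the cluster, GPUs, and Ray operator are running, the following sections describe how to set up the execution plane and run the RL loop.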

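3. Build the custom images

A key part of running high-performance RL is baking dependencies into the images. We need three images: one for the GPU workers that run the model, a lightweight one for the Ray head node, and one for the isolated sandboxes that run untrusted evaluation code.

1. Build the GPU worker image

The Ray GPU worker needs libraries to run the language model and coordinate the training loop. We build this image on top of the official vLLM image, so it supports recent GPUs and comes with PyTorch and CUDA preinstalled.

Run the following command to create Dockerfile.gpu_worker: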

cat << 'EOF' > Dockerfile.gpu_worker
# ==============================================================================
# Base Image: Use the official vLLM production image. 
# This image comes pre-baked with PyTorch 2.11, CUDA 13.0, and vLLM.
# It supports sm_100 Blackwell GPUs natively!
# ==============================================================================
FROM vllm/vllm-openai:latest

USER root

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    numactl \
    libnuma-dev \
    wget \
    ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Ray, TRL, and Sandbox tools
# TRL does not require compiling flash_attn from source.
RUN pip install --no-cache-dir \
    "ray[default]==2.55.1" \
    "numpy<2.0" \
    "gymnasium>=0.28.1" \
    "k8s-agent-sandbox>=0.4.6" \
    trl transformers packaging ninja cachetools accelerate datasets peft
EOF

Build the image and push it to the Artifact Registry repository:

export WORKER_REPO="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-gpu-worker:v1"
docker build -f Dockerfile.gpu_worker -t $WORKER_REPO .
docker push $WORKER_REPO

Note: This guide builds images with the local docker command. To build images remotely, you can use Cloud Build instead (for example, with gcloud builds submit).
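
All image tags in this section follow the same Artifact Registry naming pattern. As a small illustration, the helper below reproduces that pattern in Python; the function name image_uri and the project ID "my-project" are placeholders invented for the example, while the region, repository, and image names come from this codelab.

```python
def image_uri(region: str, project_id: str, repo_name: str,
              image: str, tag: str = "v1") -> str:
    """Build an Artifact Registry path in the form used by this codelab."""
    return f"{region}-docker.pkg.dev/{project_id}/{repo_name}/{image}:{tag}"


# "my-project" is a placeholder; substitute your own project ID.
for image in ("ray-gpu-worker", "ray-head", "django-sandbox"):
    print(image_uri("us-west3", "my-project", "rl-sandbox-repo", image))
```

Keeping the pattern in one place makes it harder for the tag pushed by docker and the tag referenced in a Kubernetes manifest to drift apart.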

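2. Build the CPU head image

The Ray head node only orchestrates the cluster; it does not run the heavy GPU training model. To avoid a large image-pull bottleneck on standard CPU nodes (the GPU image is typically 15 GB or more), we build a lightweight, CPU-only image for the head node. This image contains Ray and the required Python libraries, but not GPU libraries such as CUDA and vLLM.

Run the following command to create Dockerfile.head: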

cat << 'EOF' > Dockerfile.head
# ==============================================================================
# Base Image: Use the official Python slim image for the exact patch version.
# This aligns the Python version (3.12.13) with the GPU worker node.
# ==============================================================================
FROM python:3.12.13-slim

USER root

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    wget \
    ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Ray, TRL, and Sandbox tools (CPU versions where applicable)
# We install torch CPU first to avoid pulling the 2GB+ CUDA torch package.
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu && \
    pip install --no-cache-dir \
    "ray[default]==2.55.1" \
    "numpy<2.0" \
    "gymnasium>=0.28.1" \
    "k8s-agent-sandbox>=0.4.6" \
    trl transformers packaging ninja cachetools accelerate datasets peft

# Create a 'ray' user to run the container securely and match Ray conventions
RUN useradd -ms /bin/bash ray
USER ray
WORKDIR /home/ray
EOF

Build and push the image:

export HEAD_REPO="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-head:v1"
docker build -f Dockerfile.head -t $HEAD_REPO .
docker push $HEAD_REPO

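3. Build the sandbox image

The sandbox needs the dependencies specific to the evaluation task preinstalled so that runtime setup is near-instant. In this codelab, we use issues from the django/django repository in SWE-bench. We pre-clone the repository and pre-build the Python environments so that model-generated scripts don't waste time downloading them inside the RL loop.

Run the following command to create Dockerfile.sandbox: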

cat << 'EOF' > Dockerfile.sandbox
# Use a stable Debian-based Miniconda image
FROM condaforge/miniforge3:latest

# 1. Install essential system libraries (including sqlite3 for Django tests)
RUN apt-get update && apt-get install -y \
    git \
    build-essential \
    libsqlite3-dev \
    && rm -rf /var/lib/apt/lists/*

# 2. Set up the /workspace directory and grant ownership to the pre-existing non-root 'ubuntu' user (UID 1000)
RUN mkdir -p /workspace \
    && chown -R 1000:1000 /workspace

# 3. Switch to the non-root user
USER ubuntu
WORKDIR /workspace

# 4. Pre-configure Git globally so the agent can run git commands
RUN git config --global user.email "agent@gke-sandbox.local" \
    && git config --global user.name "Agent"

# 5. Pre-clone the repository as the non-root user
RUN git clone https://github.com/django/django.git .

# 6. Pre-build Conda environments and pre-cache common dependencies
# We do NOT run "pip install -e ." here to avoid Python version conflicts with the main branch.
# Instead, we pre-install the heavy dependencies so that runtime installation is instantaneous.
RUN conda create -y -n django-py39 python=3.9 \
    && conda run -n django-py39 pip install --no-cache-dir asgiref sqlparse tzdata pytest pytest-django

RUN conda create -y -n django-py310 python=3.10 \
    && conda run -n django-py310 pip install --no-cache-dir asgiref sqlparse tzdata pytest pytest-django

# --- Add Agent Server ---
# We use a multi-stage build to copy the agent server from the official python-runtime-sandbox image
COPY --from=registry.k8s.io/agent-sandbox/python-runtime-sandbox:v0.1.0 /app /opt/sandbox-agent
USER root
RUN chown -R 1000:1000 /opt/sandbox-agent \
    && /opt/conda/bin/pip install --no-cache-dir -r /opt/sandbox-agent/requirements.txt \
    && sed -i 's|"/app"|"/workspace"|g' /opt/sandbox-agent/main.py
USER ubuntu
# ------------------------

# Prepend the django-py39 conda environment bin to PATH for commands executed inside the container
ENV PATH=/home/ubuntu/.conda/envs/django-py39/bin:$PATH

# Keep the container alive and run the agent server using the system Python
CMD ["/opt/conda/bin/python3", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8888", "--log-level", "trace", "--app-dir", "/opt/sandbox-agent"]
EOF

Build and push the image:

export SANDBOX_REPO="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/django-sandbox:v1"
docker build -f Dockerfile.sandbox -t $SANDBOX_REPO .
docker push $SANDBOX_REPO

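4. Configure orchestration and execution

Now we deploy the Ray cluster for orchestration and the Sandbox resources for execution.

1. Ray cluster configuration

Deploy the RayCluster custom resource. Note that the resources available in your cluster (such as memory, CPU, or GPU type) may differ. Adjust the resources requests and limits as needed.

Run the following command to create raycluster.yaml. It uses an unquoted cat << EOF heredoc so that the shell substitutes the environment variables into the manifest: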

cat << EOF > raycluster.yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: grpo-cluster
  namespace: default
spec:
  rayVersion: "2.55.1"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-head:v1
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              cpu: "2"
              memory: "8Gi"
            requests:
              cpu: "2"
              memory: "8Gi"
  workerGroupSpecs:
  - groupName: gpu-group
    replicas: 1
    minReplicas: 1
    maxReplicas: 1
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-gpu-worker:v1
          resources:
            limits:
              cpu: "12"
              memory: "120Gi"
              nvidia.com/gpu: "1"
            requests:
              cpu: "12"
              memory: "120Gi"
              nvidia.com/gpu: "1"
EOF

Apply it:

kubectl apply -f raycluster.yaml

Verify that the cluster has been created and is running (this can take a few minutes):

kubectl get raycluster

Expected output:

NAME           DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
grpo-cluster   1                 1                                            ready    2m
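
The raycluster.yaml manifest relies on the shell to expand ${REGION}, ${PROJECT_ID}, and ${REPO_NAME}. The shell silently expands an unset variable to an empty string, which produces an invalid image path. As a hedged illustration of the same substitution with a stricter failure mode, the sketch below uses Python's string.Template on a single manifest line; the value "my-project" is a placeholder.

```python
from string import Template

# One line from the manifest, using the same ${VAR} placeholders as the heredoc.
MANIFEST_LINE = "image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/ray-head:v1"

values = {
    "REGION": "us-west3",
    "PROJECT_ID": "my-project",  # placeholder project ID
    "REPO_NAME": "rl-sandbox-repo",
}

rendered = Template(MANIFEST_LINE).substitute(values)
print(rendered)
# image: us-west3-docker.pkg.dev/my-project/rl-sandbox-repo/ray-head:v1

# Unlike the shell, substitute() refuses to render when a variable is missing.
try:
    Template(MANIFEST_LINE).substitute({"REGION": "us-west3"})
except KeyError as missing:
    print(f"missing variable: {missing}")
```

If you stay with the shell heredoc, checking the generated raycluster.yaml for empty path segments before applying it gives a similar safeguard.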

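2. SandboxRouter configuration

The SandboxRouter acts as a high-speed HTTP gateway. It handles requests from Ray workers and immediately bridges them to an available gVisor Pod, bypassing the slower Pod lifecycle of the Kubernetes API server.

Run the following command to create sandbox_router.yaml: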

cat << 'EOF' > sandbox_router.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: sandbox-claim-manager
rules:
- apiGroups: ["extensions.agents.x-k8s.io"]
  resources: ["sandboxclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["agents.x-k8s.io"]
  resources: ["sandboxes"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sandbox-claim-manager-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  kind: Role
  name: sandbox-claim-manager
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Service
metadata:
  name: sandbox-router
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: sandbox-router
  ports:
  - name: http
    protocol: TCP
    port: 8080
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sandbox-router-deployment
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sandbox-router
  template:
    metadata:
      labels:
        app: sandbox-router
    spec:
      containers:
      - name: router
        image: us-central1-docker.pkg.dev/k8s-staging-images/agent-sandbox/sandbox-router:latest-main
        ports:
        - containerPort: 8080
        env:
        - name: ALLOW_UNAUTHENTICATED_ROUTER
          value: "true"
EOF

Apply it:

kubectl apply -f sandbox_router.yaml

Verify that the deployment is running:

kubectl get deployment sandbox-router-deployment

Expected output:

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
sandbox-router-deployment   2/2     2            2           1m
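
If you want to script this readiness check instead of reading the table by eye, the following sketch parses the tabular output shown above. It assumes the text was captured from kubectl beforehand; the helper names parse_table and deployment_ready are invented for the example. Splitting on whitespace only works when every column has a value, so it would misalign rows with empty cells.

```python
def parse_table(text: str) -> list[dict[str, str]]:
    """Split whitespace-aligned kubectl output into one dict per row."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def deployment_ready(row: dict[str, str]) -> bool:
    """True when the READY column shows every replica up, for example '2/2'."""
    ready, desired = row["READY"].split("/")
    return int(desired) > 0 and ready == desired


# Sample text copied from the expected output above.
SAMPLE = """
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
sandbox-router-deployment   2/2     2            2           1m
"""

rows = parse_table(SAMPLE)
print(rows[0]["NAME"], deployment_ready(rows[0]))
# sandbox-router-deployment True
```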

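3. SandboxTemplate and WarmPool configuration

GKE Agent Sandbox assigns pre-warmed, isolated containers on demand through the Sandbox Router. We define a SandboxTemplate and a SandboxWarmPool so that Pods are always on standby.

Run the following command to create sandbox_warmpool.yaml using the environment variables: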

cat << EOF > sandbox_warmpool.yaml
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: swe-bench-django
  namespace: default
spec:
  podTemplate:
    spec:
      runtimeClassName: gvisor
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      nodeSelector:
        sandbox.gke.io/runtime: gvisor
      tolerations:
      - key: sandbox.gke.io/runtime
        operator: Equal
        value: gvisor
        effect: NoSchedule
      containers:
      - name: sandbox
        image: ${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/django-sandbox:v1
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
---
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxWarmPool
metadata:
  name: swe-bench-django-warmpool
  namespace: default
spec:
  replicas: 10
  sandboxTemplateRef:
    name: swe-bench-django
EOF

Apply it:

kubectl apply -f sandbox_warmpool.yaml

Verify that the SandboxWarmPool has been initialized:

kubectl get sandboxwarmpool

Expected output:

NAME                        READY   AGE
swe-bench-django-warmpool   10      1m

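4. Security isolation

A NetworkPolicy strictly isolates the sandboxes by blocking egress to the GCP metadata server, which prevents IAM token theft.

Run the following command to create network_policy.yaml: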

cat << 'EOF' > network_policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-metadata-egress
  namespace: default
spec:
  podSelector:
    matchLabels:
      sandbox.gke.io/runtime: gvisor
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32
EOF

Apply the policy:

kubectl apply -f network_policy.yaml

Verify that the NetworkPolicy has been created:

kubectl get networkpolicy

Expected output:

NAME                    POD-SELECTOR                    AGE
block-metadata-egress   sandbox.gke.io/runtime=gvisor   1m
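
To make the effect of the ipBlock rule concrete, the sketch below evaluates the same CIDR logic with Python's ipaddress module. It only models the allow-with-exception arithmetic of this one rule; it is not how Kubernetes enforces the policy, and the function name egress_allowed is invented for the example.

```python
import ipaddress

# The same values as the ipBlock in network_policy.yaml.
ALLOWED = ipaddress.ip_network("0.0.0.0/0")
EXCEPTIONS = [ipaddress.ip_network("169.254.169.254/32")]


def egress_allowed(destination: str) -> bool:
    """Apply the 'cidr minus except' rule to a single destination address."""
    address = ipaddress.ip_address(destination)
    if address not in ALLOWED:
        return False
    return not any(address in blocked for blocked in EXCEPTIONS)


print(egress_allowed("169.254.169.254"))  # False: the metadata server is carved out
print(egress_allowed("8.8.8.8"))          # True
print(egress_allowed("10.0.0.1"))         # True: private ranges remain reachable
```

The last line shows that this rule blocks only the metadata address. Every other destination, including private address ranges, stays reachable unless you add further restrictions.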

5. A basic RL job with SWE-bench and TRL

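With the cluster and sandboxes ready, we can run the GRPO training loop. We use the trl library to coordinate the GRPO algorithm, and Ray remote functions to evaluate the generated code in isolated sandboxes.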
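
For intuition about what "group relative" means, GRPO scores each completion against the other completions sampled for the same prompt rather than against a learned value function. The sketch below shows that normalization in plain Python. It is a simplified illustration and not TRL's code: it uses the population standard deviation and an assumed epsilon of 1e-4, and the library's exact normalization may differ. The reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt with four sampled completions (made-up rewards).
rewards = [1.0, 0.1, 0.1, 0.0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

When all completions in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason a reward with more than two levels can help.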

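To keep this codelab fast, we filter the dataset down to a single Django issue. The routing logic below shows where you would select a different warm pool for each repository, which is useful when scaling to the full SWE-bench dataset.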
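
One possible shape for that per-repository routing is a lookup table from repository name to template and warm pool. This is a sketch under assumptions: only the django/django entry exists in this codelab, the pallets/flask entry and its resource names are hypothetical, and the names SandboxTarget, ROUTES, and route are invented for the example.

```python
from typing import NamedTuple, Optional


class SandboxTarget(NamedTuple):
    template: str
    warmpool: str


ROUTES = {
    # Defined in this codelab.
    "django/django": SandboxTarget("swe-bench-django", "swe-bench-django-warmpool"),
    # Hypothetical: you would need to create these resources yourself.
    "pallets/flask": SandboxTarget("swe-bench-flask", "swe-bench-flask-warmpool"),
}
DEFAULT_REPO = "django/django"


def route(repo: Optional[str]) -> SandboxTarget:
    """Pick the warm pool for a repo, defaulting to Django like the training script."""
    return ROUTES.get(repo, ROUTES[DEFAULT_REPO])


print(route("django/django").warmpool)  # swe-bench-django-warmpool
print(route(None).template)             # swe-bench-django
```

Falling back to the Django pool mirrors the single-task script. For a multi-repository run, raising an error on an unknown repository is probably safer than evaluating code in the wrong environment.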

Training script

Run the following command to create train_trl.py:

cat << 'EOF' > train_trl.py
import ray
from k8s_agent_sandbox import SandboxClient
from k8s_agent_sandbox.models import SandboxDirectConnectionConfig
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import urllib.request
import re

ray.init(ignore_reinit_error=True)

# 1. Define the Ray remote evaluation function
@ray.remote
def evaluate_rollout(code, prompt_data):
    client = SandboxClient(connection_config=SandboxDirectConnectionConfig(api_url="http://sandbox-router.default.svc.cluster.local:8080"))
    
    # Claim a pre-warmed sandbox instantly based on the repo
    repo = prompt_data.get("repo")
    
    # In a full system, you'd route to different warmpools based on repo
    # Here we default to django for our single task
    sandbox = client.create_sandbox(
        template="swe-bench-django",
        warmpool="swe-bench-django-warmpool",
        sandbox_ready_timeout=600
    )
    
    try:
        # Check if the code is correctly formatted
        bash_match = re.search(r"```bash\n(.*?)\n```", code, re.DOTALL)
        if not bash_match:
            return 0.0
            
        script = bash_match.group(1)

        # In a real environment, we would apply the base commit and install here
        # For simplicity, we just execute the script
        import shlex
        script_cmd = f"bash -c {shlex.quote(script)}"
        result = sandbox.commands.run(script_cmd, timeout=60)
        
        # Binary success reward: full credit when the script exits cleanly
        if result.exit_code == 0:
            return 1.0

        # Small partial credit for a well-formed script that exits non-zero
        return 0.1
        
    finally:
        # Clean up and release the sandbox back to the pool
        client.delete_sandbox(sandbox.claim_name)

# 2. Define the Reward Function for TRL
def sandbox_reward_func(prompts, completions, **kwargs):
    # Dispatch evaluation to Ray cluster
    futures = [
        evaluate_rollout.remote(completion, {
            "repo": kwargs.get('repo', [])[i] if 'repo' in kwargs else None,
            "base_commit": kwargs.get('base_commit', [])[i] if 'base_commit' in kwargs else None
        }) for i, completion in enumerate(completions)
    ]
    
    # Block and wait for all sandbox evaluations to complete
    rewards = ray.get(futures)
    return rewards

# 3. Setup GRPO Trainer
@ray.remote(num_gpus=1, num_cpus=8)
def train():
    # Load dataset
    dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
    # Filter to our selected target issue
    dataset = dataset.filter(lambda x: x["instance_id"] == "django__django-15388")
    
    def format_dataset(example):
        files = re.findall(r'^\+\+\+ b/(.+)$', example["patch"], re.MULTILINE)
        target_file = files[0] if files else ""
        
        file_content = ""
        if target_file:
            try:
                github_repo = example["repo"]
                url = f"https://raw.githubusercontent.com/{github_repo}/{example['base_commit']}/{target_file}"
                with urllib.request.urlopen(url) as response:
                    file_content = response.read().decode('utf-8')
            except Exception as e:
                pass
                
        prompt = f"""You are an expert software engineer.
You are given a GitHub issue and the content of the file that contains the bug.
Write an executable bash script that will modify the target file to fix the bug (e.g. using cat << 'EOF' > {target_file} or inline python edits).
Wrap your bash script in ```bash ... ``` tags. Do not output raw python code directly.

Target File: {target_file}

Original File Content:
```python
{file_content}
```

Issue:
{example['problem_statement']}
"""
        return {
            "prompt": prompt,
            "repo": example["repo"],
            "instance_id": example["instance_id"],
            "base_commit": example["base_commit"],
        }
        
    dataset = dataset.map(format_dataset)

    model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    training_args = GRPOConfig(
        output_dir="outputs",
        learning_rate=5e-6,
        max_steps=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_generations=4,
    )

    trainer = GRPOTrainer(
        model=model_name,
        processing_class=tokenizer,
        reward_funcs=[sandbox_reward_func],
        args=training_args,
        train_dataset=dataset,
    )

    print("Starting GRPO training with GKE Agent Sandboxes...")
    trainer.train()

def main():
    print("Submitting training job to GPU worker...")
    ray.get(train.remote())

if __name__ == "__main__":
    main()
EOF
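
The parsing and reward logic inside evaluate_rollout can be exercised locally, without a cluster, by separating it from the sandbox call. The sketch below restates that logic as pure functions. The run_command parameter is a stand-in introduced for the example: in the training script its role is played by sandbox.commands.run, and here a fake is passed in. The function names are also invented.

```python
import re
import shlex
from typing import Callable, Optional

# The fence marker is built at runtime so this example can sit inside a fenced block.
FENCE = "`" * 3
BASH_BLOCK = re.compile(FENCE + r"bash\n(.*?)\n" + FENCE, re.DOTALL)


def extract_script(completion: str) -> Optional[str]:
    """Return the body of the first fenced bash block, or None if there is none."""
    match = BASH_BLOCK.search(completion)
    return match.group(1) if match else None


def build_command(script: str) -> str:
    """Quote the whole script so the shell passes it to bash -c as one argument."""
    return f"bash -c {shlex.quote(script)}"


def score(completion: str, run_command: Callable[[str], int]) -> float:
    """Mirror the script's rewards: 0.0 unformatted, 1.0 on exit code 0, else 0.1."""
    script = extract_script(completion)
    if script is None:
        return 0.0
    return 1.0 if run_command(build_command(script)) == 0 else 0.1


completion = "Here is the fix:\n" + FENCE + "bash\necho 'patched'\n" + FENCE
print(extract_script(completion))            # echo 'patched'
print(score(completion, lambda cmd: 0))      # 1.0
print(score("no code here", lambda cmd: 0))  # 0.0
```

Because shlex.quote escapes embedded single quotes, a model-generated script cannot break out of the bash -c argument and inject extra arguments into the command line.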

Submit the job to the cluster

First, port-forward to the Ray head dashboard, then submit the training job from your local machine:

kubectl port-forward service/grpo-cluster-head-svc 8265:8265 &

ray job submit \
  --address http://localhost:8265 \
  --runtime-env-json '{"working_dir": "."}' \
  -- python train_trl.py
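
The same submission can also be made from Python with Ray's job submission SDK, which may be convenient when the job is launched from another program. This is a sketch that assumes the port-forward above is active and that Ray is installed locally; the helper names build_submission and submit are invented for the example.

```python
def build_submission(entrypoint: str = "python train_trl.py",
                     working_dir: str = ".") -> dict:
    """Collect the same arguments that the ray job submit command passes."""
    return {
        "entrypoint": entrypoint,
        "runtime_env": {"working_dir": working_dir},
    }


def submit(address: str = "http://localhost:8265") -> str:
    """Submit the training job and return its submission ID."""
    # Imported lazily so build_submission() works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(**build_submission())


print(build_submission())
# Call submit() once the port-forward to the Ray head is running.
```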

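Monitor the run

You can monitor the progress of the run:

Ray dashboard: Open http://localhost:8265 in your browser.
Sandbox claims: Watch GKE dynamically claim and release sandboxes under gVisor: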
```
watch -n 1 "kubectl get sandboxclaims,sandboxes,pods"
```

6. Conclusion

Congratulations! You have successfully set up and run a high-performance distributed RL training loop on GKE Standard, securely, using GKE Agent Sandbox.