使用 ADK 評估代理

1. 信任落差

靈感湧現

您建立了一個客服專員，這項功能適用於你的電腦。但昨天，它卻告訴顧客缺貨的智慧手錶有現貨，甚至還編造退款政策。知道服務專員正在工作，你晚上睡得著嗎？

要將 AI 代理從概念驗證做到可正式部署，關鍵在於建立完善的自動化評估框架。

客戶服務

我們實際評估的是什麼？

代理程式評估比標準 LLM 評估更複雜。您不只是評估文章 (最終回覆)，而是評估數學 (用來得出最終回覆的邏輯/工具)。

評估圖表

軌跡 (程序)：服務專員是否在適當的時機使用合適的工具？是否在 place_order 之前呼叫 check_inventory？
最終回覆 (輸出內容)：答案是否正確、禮貌，且根據資料生成？

開發生命週期

在本程式碼研究室中，我們將逐步瞭解專業的代理程式測試生命週期：

本機目視檢查 (ADK 網頁 UI)：手動聊天並驗證邏輯 (步驟 1)。
單元/迴歸測試 (ADK CLI)：在本機執行特定測試案例，以快速找出錯誤 (步驟 3 和 4)。
偵錯 (疑難排解)：分析失敗原因並修正提示邏輯 (步驟 5)。
整合 CI/CD (Pytest)：在建構管道中自動執行測試 (步驟 6)。

2. 設定

如要為 AI 代理提供支援，我們需要兩項要素：Google Cloud 專案做為基礎。

第一部分：啟用帳單帳戶

請先申請帳單帳戶，並取得 $5 美元抵免額，以利後續部署。請務必使用 Gmail 帳戶。

第二部分：開放式環境

👉 按一下這個連結，直接前往 Cloud Shell 編輯器
👉 如果系統在今天任何時間提示您授權，請點選「授權」繼續操作。
👉 如果畫面底部未顯示終端機，請開啟終端機：
- 按一下「查看」
- 按一下「終端機」
👉💻 在終端機中，使用下列指令驗證您是否已通過驗證，以及專案是否已設為您的專案 ID：
```
gcloud auth list
```

👉💻 從 GitHub 複製啟動程序專案：

git clone https://github.com/cuppibla/adk_eval_starter

👉💻 從專案目錄執行設定指令碼。
⚠️ 專案 ID 注意事項：指令碼會建議隨機產生的預設專案 ID。您可以按下 Enter 鍵接受這項預設值。

不過，如果您想建立特定新專案，可以在指令碼提示時輸入所需的專案 ID。
```
cd ~/adk_eval_starter
./init.sh
```

指令碼會自動處理其餘設定程序。

👉💻 設定所需的專案 ID：

gcloud config set project $(cat ~/project_id.txt) --quiet

第三部分：設定權限

👉💻 使用下列指令啟用必要的 API。這項作業可能需要幾分鐘才能完成。

gcloud services enable \
    cloudresourcemanager.googleapis.com \
    servicenetworking.googleapis.com \
    run.googleapis.com \
    cloudbuild.googleapis.com \
    artifactregistry.googleapis.com \
    aiplatform.googleapis.com \
    compute.googleapis.com

👉💻 在終端機中執行下列指令，授予必要權限：
```
. ~/adk_eval_starter/set_env.sh
```

請注意，系統會為您建立 .env 檔案。顯示專案資訊。

3. 產生黃金資料集 (ADK 網頁)

Golden

我們需要答案金鑰，才能評估服務專員。在 ADK 中，這稱為「黃金資料集」。這個資料集包含「完美」的互動，可做為評估的實際資料。

什麼是黃金資料集？

黃金資料集是代理程式正確執行的快照。這不只是問答配對清單，這項功能會擷取：

使用者查詢 (「我要退款」)
軌跡 (工具呼叫的確切順序：check_order -> verify_eligibility -> refund_transaction)。
最終回覆 (「完美」的文字答案)。

我們會使用這項資訊偵測回歸。如果您更新提示，但代理程式突然停止檢查退款資格，黃金資料集測試就會失敗，因為軌跡不再相符。

開啟網頁版 UI

ADK 網頁 UI 提供互動式方式，可擷取與代理程式的實際互動，藉此建立這些黃金資料集。

👉💻 在終端機中執行：
```
cd ~/adk_eval_starter
uv run adk web
```
👉💻 開啟網頁版 UI 預覽 (通常位於 http://127.0.0.1:8000)。

👉 在對話 UI 中輸入

Hi, I'm customer CUST001. Can you check my orders? I need a refund for order ORD-102. It arrived damaged.

你會看到類似以下的回應：

I've processed your refund for order ORD-102 due to the items arriving damaged. A full refund of $35.0 has been processed, and the status of order ORD-102 is now updated to "refunded".

Is there anything else I can assist you with today, CUST001? 🛍️

擷取重要互動

前往「Sessions」分頁。點選工作階段即可查看代理的對話記錄。

與服務專員互動，建立理想的對話流程，例如查看購買記錄或申請退款。
查看對話，確認是否符合預期行為。

4. 匯出黃金資料集

透過 Trace View 驗證

匯出前，請務必確認服務專員並非只是運氣好才答對問題。您需要檢查內部邏輯。

在網頁介面中，按一下「追蹤」分頁標籤。
系統會依使用者訊息自動分組追蹤記錄。將游標懸停在追蹤資料列上，即可醒目顯示聊天室中的相應訊息。
檢查藍色資料列：這些資料列表示互動產生的事件。按一下藍色資料列，開啟檢查面板。
請檢查下列分頁，驗證邏輯是否正確：
- 圖表：以視覺化方式呈現工具呼叫和邏輯流程。是否採用正確路徑？
- 要求/回覆：查看傳送給模型的內容和模型回覆的內容。
- 驗證：如果服務專員未呼叫資料庫工具就猜出退款金額，這就是「幸運的幻覺」。

將工作階段新增至 EvalSet

確認對話和追蹤記錄符合需求後，請按照下列步驟操作：

👉 依序點選「Eval」分頁標籤和「Create Evaluation Set」按鈕，然後輸入評估名稱：
```
evalset1
```
👉 在這個評估集，按一下 Add current session to evalset1，在彈出式視窗中輸入工作階段名稱：
```
eval1
```

在 ADK Web 中執行評估

👉 在 ADK 網頁介面中，按一下 Run Evaluation，在彈出式視窗中調整指標，然後按一下 Start：

執行評估

驗證存放區中的資料集

系統會顯示確認訊息，表示資料集檔案 (例如 evalset1.evalset.json) 已儲存至存放區。這個檔案包含系統自動生成的原始對話記錄。

eval set save

5. 評估檔案

評估檔案

雖然 Web UI 會產生複雜的 .evalset.json 檔案，但我們通常會想建立更簡潔、結構更完整的測試檔案，以進行自動化測試。

ADK Eval 使用兩個主要元件：

測試檔案：可以是自動產生的黃金資料集 (例如 customer_service_agent/evalset1.evalset.json) 或手動精選的組合 (例如 customer_service_agent/eval.test.json)。
設定檔 (例如 customer_service_agent/test_config.json)：定義通過的指標和門檻。

設定測試設定檔

👉💻 在 Cloud Shell 編輯器的終端機中，輸入
```
cloudshell edit customer_service_agent/test_config.json
```

👉 在編輯器的 customer_service_agent/test_config.json 中輸入下列程式碼。

{
  "criteria": {
    "tool_trajectory_avg_score": 0.8,
    "response_match_score": 0.5
  }
}

解讀指標

tool_trajectory_avg_score (程序)：評估代理是否正確使用工具。

0.8：我們要求 80% 的相符程度。

response_match_score (輸出) 這會使用 ROUGE-1 (字詞重疊) 比較答案與黃金參考資料。

優點：速度快、具確定性、免費。
缺點：如果服務專員以不同方式表達相同想法 (例如「已退款」和「已退回款項」)。

進階指標 (需要更多功能時)

6. 針對黃金資料集執行評估 (adk eval)

內部迴圈

這個步驟代表開發的「內部迴圈」。您是開發人員，正在進行變更，並希望快速驗證結果。

執行黃金資料集

讓我們執行您在步驟 1 中產生的資料集。確保基準穩固。

👉💻 在終端機中執行：

cd ~/adk_eval_starter
uv run adk eval customer_service_agent customer_service_agent/evalset1.evalset.json --config_file_path=customer_service_agent/test_config.json --print_detailed_results

發生什麼事？

ADK 現在是：

從 customer_service_agent 載入代理程式。
從 evalset1.evalset.json 執行輸入查詢。
將代理的實際軌跡和回應與預期結果相比。
根據 test_config.json 中的條件為結果評分。

分析結果

查看終端機輸出內容。您會看到通過和失敗測試的摘要。

Eval Run Summary
evalset1:
  Tests passed: 1
  Tests failed: 0
********************************************************************
Eval Set Id: evalset1
Eval Id: eval1
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: tool_trajectory_avg_score, Status: PASSED, Score: 1.0, Threshold: 0.8
---------------------------------------------------------------------
Metric: response_match_score, Status: PASSED, Score: 0.5581395348837208, Threshold: 0.5
---------------------------------------------------------------------
Invocation Details:
+----+---------------------------+---------------------------+--------------------------+---------------------------+---------------------------+-----------------------------+------------------------+
|    | prompt                    | expected_response         | actual_response          | expected_tool_calls       | actual_tool_calls         | tool_trajectory_avg_score   | response_match_score   |
+====+===========================+===========================+==========================+===========================+===========================+=============================+========================+
|  0 | Hi, I'm customer CUST001. | Great news! Your refund   | Great news, CUST001! 🎉   | id='adk-051409fe-c230-43f | id='adk-4e9aa570-1cc6-4c3 | Status: PASSED, Score:      | Status: PASSED, Score: |
|    | Can you check my orders?  | for order **ORD-102** has | I've successfully        | 4-a7f1- 5747280fd878'     | c-aa3e- 91dbe113dd4b'     | 1.0                         | 0.5581395348837208     |
|    | I need a refund for order | been successfully         | processed a full refund  | args={'customer_id':      | args={'customer_id':      |                             |                        |
|    | ORD-102. It arrived       | processed due to the item | of $35.0 for your order  | 'CUST001'} name='get_purc | 'CUST001'} name='get_purc |                             |                        |
|    | damaged.                  | arriving damaged. You     | ORD-102 because it       | hase_history'             | hase_history'             |                             |                        |
|    |                           | should see a full refund  | arrived damaged. The     | partial_args=None         | partial_args=None         |                             |                        |
|    |                           | of $35.0 back to your     | status of that order has | will_continue=None id= 'a | will_continue=None        |                             |                        |
|    |                           | original payment method   | been updated to          | dk-8a194cb8-5a82-47ce-a3a | id='adk- dad1b376-9bcc-48 |                             |                        |
|    |                           | shortly. The status of    | "refunded."  Is there    | 7- 3d24551f8c90'          | bb-996f-a30f6ef5b70b'     |                             |                        |
|    |                           | this order has been       | anything else I can      | args={'reason':           | args={'reason':           |                             |                        |
|    |                           | updated to "refunded".    | assist you with today?   | 'damaged', 'order_id':    | 'damaged', 'order_id':    |                             |                        |
|    |                           | Here's your updated       |                          | 'ORD-102'}                | 'ORD-102'}                |                             |                        |
|    |                           | purchase history for      |                          | name='issue_refund'       | name='issue_refund'       |                             |                        |
|    |                           | CUST001: *   **ORD-101**: |                          | partial_args=None         | partial_args=None         |                             |                        |
|    |                           | Wireless Headphones,      |                          | will_continue=None        | will_continue=None        |                             |                        |
|    |                           | delivered on 2023-10-15   |                          |                           |                           |                             |                        |
|    |                           | (Total: $120) *           |                          |                           |                           |                             |                        |
|    |                           | **ORD-102**: USB-C Cable, |                          |                           |                           |                             |                        |
|    |                           | Phone Case, refunded on   |                          |                           |                           |                             |                        |
|    |                           | 2023-11-01 (Total: $35)   |                          |                           |                           |                             |                        |
|    |                           | Is there anything else I  |                          |                           |                           |                             |                        |
|    |                           | can help you with today?  |                          |                           |                           |                             |                        |
|    |                           | 😊                         |                          |                           |                           |                             |                        |
+----+---------------------------+---------------------------+--------------------------+---------------------------+---------------------------+-----------------------------+------------------------+

注意：由於這是您剛從代理程式本身產生的內容，因此應該會 100% 通過。如果失敗，表示代理程式不具決定性 (隨機)。

7. 建立自訂測試

雖然自動產生的資料集很實用，但有時您需要手動製作極端情況 (例如對抗性攻擊或特定錯誤處理)。接下來看看 eval.test.json 如何定義「正確性」。

讓我們建構完整的測試套件。

測試架構

在 ADK 中編寫測試案例時，請遵循下列 3 部分公式：

設定 (session_input)：使用者是誰？(例如：user_id、state)。這樣可隔離測試。
提示 (user_content)：觸發條件是什麼？

使用判斷 (預期)：

軌跡 (tool_uses)：數學運算是否正確？(邏輯)
回覆 (final_response)：回答是否正確？(品質)
中等 (intermediate_responses)：子代理程式是否正確對話？(自動化調度管理)

編寫測試套件

👉💻 在 Cloud Shell 編輯器的終端機中，輸入
```
cloudshell edit customer_service_agent/eval.test.json
```

👉 在 customer_service_agent/eval.test.json 檔案中輸入下列程式碼。

{
  "eval_set_id": "customer_service_eval",
  "name": "Customer Service Agent Evaluation",
  "description": "Evaluation suite for the customer service agent covering product info, purchase history, and refunds.",
  "eval_cases": [
    {
      "eval_id": "product_info_check",
      "session_input": {
        "app_name": "customer_service_agent",
        "user_id": "eval_user_1",
        "state": {}
      },
      "conversation": [
        {
          "invocation_id": "turn_1_product_info",
          "user_content": {
            "role": "user",
            "parts": [
              {
                "text": "Do you have wireless headphones in stock?"
              }
            ]
          },
          "final_response": {
            "role": "model",
            "parts": [
              {
                "text": "Yes, we have wireless headphones in stock! They are priced at $120.00 and feature noise-canceling with a 20-hour battery life. 🎧"
              }
            ]
          },
          "intermediate_data": {
            "tool_uses": [
              {
                "name": "lookup_product_info",
                "args": {
                  "product_name": "wireless headphones"
                }
              }
            ],
            "intermediate_responses": []
          }
        }
      ]
    },
    {
      "eval_id": "purchase_history_check",
      "session_input": {
        "app_name": "customer_service_agent",
        "user_id": "eval_user_2",
        "state": {}
      },
      "conversation": [
        {
          "invocation_id": "turn_1_purchase_history",
          "user_content": {
            "role": "user",
            "parts": [
              {
                "text": "What did I buy recently? My customer ID is CUST001."
              }
            ]
          },
          "final_response": {
            "role": "model",
            "parts": [
              {
                "text": "Here's your recent purchase history for Customer ID CUST001:\n\n*   **Order ORD-101** (October 15, 2023): Wireless Headphones, Status: delivered, Total: $120\n*   **Order ORD-102** (November 1, 2023): USB-C Cable, Phone Case, Status: refunded, Total: $35\n\nLet me know if you have any other questions or need further assistance! 🛍️"
              }
            ]
          },
          "intermediate_data": {
            "tool_uses": [
              {
                "name": "get_purchase_history",
                "args": {
                  "customer_id": "CUST001"
                }
              }
            ],
            "intermediate_responses": []
          }
        }
      ]
    },
    {
      "eval_id": "refund_request",
      "session_input": {
        "app_name": "customer_service_agent",
        "user_id": "eval_user_3",
        "state": {}
      },
      "conversation": [
        {
          "invocation_id": "turn_1_refund",
          "user_content": {
            "role": "user",
            "parts": [
              {
                "text": "I want a refund for order ORD-102 because it was damaged."
              }
            ]
          },
          "final_response": {
            "role": "model",
            "parts": [
              {
                "text": "Your refund for order **ORD-102** due to \"damaged\" has been successfully processed!  Refund amount: **$35.0**. Your order status has been updated to **refunded**. \nIs there anything else I can help you with today? 🛍️"
              }
            ]
          },
          "intermediate_data": {
            "tool_uses": [
              {
                "name": "issue_refund",
                "args": {
                  "order_id": "ORD-102",
                  "reason": "damaged"
                }
              }
            ],
            "intermediate_responses": []
          }
        }
      ]
    }
  ]
}

測試類型剖析

我們在此建立了三種不同的測試。接著就來逐一說明各項評估內容和原因。

單一工具測試 (product_info_check)

目標：驗證基本資訊的擷取作業。
主要主張：我們會檢查 intermediate_data.tool_uses。我們斷言會呼叫 lookup_product_info。我們斷言引數 product_name 正是「無線耳機」。
原因：如果模型未呼叫工具就產生價格，這項測試就會失敗。確保接地。

內容擷取測試 (purchase_history_check)

目標：確認服務專員可以從使用者提示中擷取實體 (CUST001)，並將其傳遞至工具。
金鑰斷言：我們會檢查 get_purchase_history 是否使用 customer_id: "CUST001" 呼叫。
原因：常見的失敗模式是代理程式呼叫正確的工具，但 ID 為空值。確保參數準確無誤。

動作/軌跡測試 (refund_request)

目標：驗證重要寫入作業。
主要主張：軌跡。在更複雜的情況下，這份清單會包含多個步驟：[verify_order, calculate_refund, issue_refund]。ADK 會依序檢查這份清單。
原因：對於涉及資金轉移或資料變更的動作，順序與結果同樣重要。請勿在驗證前退款。

8. 執行自訂測試的評估作業 ( adk eval)

內部迴圈

👉💻 在終端機中執行：

cd ~/adk_eval_starter
uv run adk eval customer_service_agent customer_service_agent/eval.test.json --config_file_path=customer_service_agent/test_config.json --print_detailed_results

瞭解輸出內容

您應該會看到類似以下的 PASS 結果：

Eval Run Summary
customer_service_eval:
  Tests passed: 3
  Tests failed: 0
********************************************************************
Eval Set Id: customer_service_eval
Eval Id: purchase_history_check
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: tool_trajectory_avg_score, Status: PASSED, Score: 1.0, Threshold: 0.8
---------------------------------------------------------------------
Metric: response_match_score, Status: PASSED, Score: 0.5473684210526315, Threshold: 0.5
---------------------------------------------------------------------
Invocation Details:
+----+--------------------------+---------------------------+---------------------------+---------------------------+---------------------------+-----------------------------+------------------------+
|    | prompt                   | expected_response         | actual_response           | expected_tool_calls       | actual_tool_calls         | tool_trajectory_avg_score   | response_match_score   |
+====+==========================+===========================+===========================+===========================+===========================+=============================+========================+
|  0 | What did I buy recently? | Here's your recent        | Looks like your recent    | id=None                   | id='adk-8960eb53-2933-459 | Status: PASSED, Score:      | Status: PASSED, Score: |
|    | My customer ID is        | purchase history for      | orders include: *         | args={'customer_id':      | f-b306- 71e3c069e77e'     | 1.0                         | 0.5473684210526315     |
|    | CUST001.                 | Customer ID CUST001:  *   | **ORD-101 (2023-10-15):** | 'CUST001'} name='get_purc | args={'customer_id':      |                             |                        |
|    |                          | **Order ORD-101**         | Wireless Headphones for   | hase_history'             | 'CUST001'} name='get_purc |                             |                        |
|    |                          | (October 15, 2023):       | $120.00 - Status:         | partial_args=None         | hase_history'             |                             |                        |
|    |                          | Wireless Headphones,      | Delivered 🎧 *   **ORD-102 | will_continue=None        | partial_args=None         |                             |                        |
|    |                          | Status: delivered, Total: | (2023-11-01):** USB-C     |                           | will_continue=None        |                             |                        |
|    |                          | $120 *   **Order          | Cable, Phone Case for     |                           |                           |                             |                        |
|    |                          | ORD-102** (November 1,    | $35.00 - Status: Refunded |                           |                           |                             |                        |
|    |                          | 2023): USB-C Cable, Phone | 📱  Is there anything else |                           |                           |                             |                        |
|    |                          | Case, Status: refunded,   | I can help you with       |                           |                           |                             |                        |
|    |                          | Total: $35  Let me know   | regarding these orders?   |                           |                           |                             |                        |
|    |                          | if you have any other     |                           |                           |                           |                             |                        |
|    |                          | questions or need further |                           |                           |                           |                             |                        |
|    |                          | assistance! 🛍️            |                           |                           |                           |                             |                        |
+----+--------------------------+---------------------------+---------------------------+---------------------------+---------------------------+-----------------------------+------------------------+



********************************************************************
Eval Set Id: customer_service_eval
Eval Id: product_info_check
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: tool_trajectory_avg_score, Status: PASSED, Score: 1.0, Threshold: 0.8
---------------------------------------------------------------------
Metric: response_match_score, Status: PASSED, Score: 0.6829268292682927, Threshold: 0.5
---------------------------------------------------------------------
Invocation Details:
+----+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+-----------------------------+------------------------+
|    | prompt               | expected_response         | actual_response           | expected_tool_calls       | actual_tool_calls         | tool_trajectory_avg_score   | response_match_score   |
+====+======================+===========================+===========================+===========================+===========================+=============================+========================+
|  0 | Do you have wireless | Yes, we have wireless     | Yes, we do! 🎧 We have     | id=None                   | id='adk-4571d660-a92b-412 | Status: PASSED, Score:      | Status: PASSED, Score: |
|    | headphones in stock? | headphones in stock! They | noise-canceling wireless  | args={'product_name':     | a-a79e- 5c54f8b8af2d'     | 1.0                         | 0.6829268292682927     |
|    |                      | are priced at $120.00 and | headphones with a 20-hour | 'wireless headphones'} na | args={'product_name':     |                             |                        |
|    |                      | feature noise-canceling   | battery life available    | me='lookup_product_info'  | 'wireless headphones'} na |                             |                        |
|    |                      | with a 20-hour battery    | for $120.                 | partial_args=None         | me='lookup_product_info'  |                             |                        |
|    |                      | life. 🎧                   |                           | will_continue=None        | partial_args=None         |                             |                        |
|    |                      |                           |                           |                           | will_continue=None        |                             |                        |
+----+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+-----------------------------+------------------------+



********************************************************************
Eval Set Id: customer_service_eval
Eval Id: refund_request
Overall Eval Status: PASSED
---------------------------------------------------------------------
Metric: tool_trajectory_avg_score, Status: PASSED, Score: 1.0, Threshold: 0.8
---------------------------------------------------------------------
Metric: response_match_score, Status: PASSED, Score: 0.6216216216216216, Threshold: 0.5
---------------------------------------------------------------------
Invocation Details:
+----+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+-----------------------------+------------------------+
|    | prompt                    | expected_response         | actual_response           | expected_tool_calls       | actual_tool_calls         | tool_trajectory_avg_score   | response_match_score   |
+====+===========================+===========================+===========================+===========================+===========================+=============================+========================+
|  0 | I want a refund for order | Your refund for order     | Your refund for order     | id=None args={'order_id': | id='adk-fb8ff1cc- cf87-41 | Status: PASSED, Score:      | Status: PASSED, Score: |
|    | ORD-102 because it was    | **ORD-102** due to        | **ORD-102** has been      | 'ORD-102', 'reason':      | f2-9b11-d4571b14287f'     | 1.0                         | 0.6216216216216216     |
|    | damaged.                  | "damaged" has been        | successfully processed!   | 'damaged'}                | args={'order_id':         |                             |                        |
|    |                           | successfully processed!   | You should see a full     | name='issue_refund'       | 'ORD-102', 'reason':      |                             |                        |
|    |                           | Refund amount: **$35.0**. | refund of $35.0 appear in | partial_args=None         | 'damaged'}                |                             |                        |
|    |                           | Your order status has     | your account shortly. We  | will_continue=None        | name='issue_refund'       |                             |                        |
|    |                           | been updated to           | apologize for the         |                           | partial_args=None         |                             |                        |
|    |                           | **refunded**.  Is there   | inconvenience! Is there   |                           | will_continue=None        |                             |                        |
|    |                           | anything else I can help  | anything else I can       |                           |                           |                             |                        |
|    |                           | you with today? 🛍️        | assist you with today? 😊  |                           |                           |                             |                        |
+----+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+-----------------------------+------------------------+

也就是說，代理程式使用了正確的工具，並提供與您預期結果足夠相似的回覆。

9. (選用：僅供讀取) - 疑難排解與偵錯

測試會失敗。這是他們的工作。但該如何修正呢？讓我們分析常見的失敗情境，以及如何偵錯。

情境 A：「軌跡」失敗

錯誤：

Result: FAILED
Reason: Criteria 'tool_trajectory_avg_score' failed. Score 0.0 < Threshold 1.0
Details:
EXPECTED: tool: lookup_order, then tool: issue_refund
ACTUAL:   tool: issue_refund

診斷：服務專員略過驗證步驟 (lookup_order)。這是邏輯錯誤。

如何排解問題：

請勿猜測：返回 ADK 網頁版 UI (adk web)。
重現：在對話中輸入失敗測試的確切提示。
追蹤：開啟「追蹤」檢視畫面。查看「圖表」分頁。
修正提示：通常需要更新系統提示。變更：「你是實用的代理人。」目標：「你是熱心的服務專員。重要事項：您必須先呼叫 lookup_order 來驗證詳細資料，才能呼叫 issue_refund。
調整測試：如果業務邏輯有所變更 (例如不再需要驗證)，則測試有誤。更新 eval.test.json，以符合新實況。

情境 B：「ROUGE」失敗

錯誤：

Result: FAILED
Reason: Criteria 'response_match_score' failed. Score 0.45 < Threshold 0.8
Expected: "The refund has been processed successfully."
Actual:   "I've gone ahead and returned the money to your card."

診斷：服務專員的做法正確，但使用的字詞不同。ROUGE (字詞重疊) 懲處了這項行為。

如何修正：

有誤嗎？如果意義正確，請勿變更提示。
調整門檻：在 test_config.json 中降低門檻 (例如從 0.8 降至 0.5)。
升級指標：切換至設定中的 final_response_match_v2。這項功能會使用 LLM 讀取兩個句子，並判斷兩者是否具有相同含意。

10. 使用 Pytest (pytest) 進行 CI/CD

pytest

CLI 指令適用於人類。pytest 適用於機器。為確保正式環境的可靠性，我們將評估作業包裝在 Python 測試套件中。這樣一來，如果代理程式效能降低，您的 CI/CD 管道 (GitHub Actions、Jenkins) 就能封鎖部署作業。

這個檔案包含哪些內容？

這個 Python 檔案可做為 CI/CD 執行器與 ADK 評估工具之間的橋樑。必須符合下列條件：

載入代理程式：動態匯入代理程式碼。
重設狀態：確保代理程式記憶體乾淨，測試不會互相影響。
執行評估：以程式呼叫 AgentEvaluator.evaluate()。
確認成功：如果評估分數偏低，請讓建構作業失敗。

整合測試代碼

👉 開啟 customer_service_agent/test_agent_eval.py。這個指令碼會使用 AgentEvaluator.evaluate 執行 eval.test.json 中定義的測試。

👉 在編輯器的 customer_service_agent/test_agent_eval.py 中輸入下列程式碼。

from google.adk.evaluation.agent_evaluator import AgentEvaluator
import pytest
import importlib
import sys
import os

@pytest.mark.asyncio
async def test_with_single_test_file():
    """Test the agent's basic ability via a session file."""
    # Load the agent module robustly
    module_name = "customer_service_agent.agent"
    try:
        agent_module = importlib.import_module(module_name)
        # Reset the mock data to ensure a fresh state for the test
        if hasattr(agent_module, 'reset_mock_data'):
            agent_module.reset_mock_data()
    except ImportError:
        # Fallback if running from a different context
        sys.path.append(os.getcwd())
        agent_module = importlib.import_module(module_name)
        if hasattr(agent_module, 'reset_mock_data'):
            agent_module.reset_mock_data()

    # Use absolute path to the eval file to be robust to where pytest is run
    script_dir = os.path.dirname(os.path.abspath(__file__))
    eval_file = os.path.join(script_dir, "eval.test.json")

    await AgentEvaluator.evaluate(
        agent_module=module_name,
        eval_dataset_file_path_or_dir=eval_file,
        num_runs=1,
    )

執行 Pytest

👉💻 在終端機中執行：

cd ~/adk_eval_starter
uv pip install pytest
uv run pytest customer_service_agent/test_agent_eval.py

您會看到類似下列的結果：

 ```
 -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
 =============== 1 passed, 15 warnings in 12.84s ===============
 ```

11. 結語

恭喜！您已使用 ADK Eval 成功評估客戶服務專員。

您學到的內容

在本程式碼研究室中，您已瞭解如何：

✅ 產生黃金資料集，為代理程式建立真實資料。
✅ 瞭解評估設定，定義成功標準。
✅ 執行自動評估，及早發現回歸情形。

將 ADK Eval 納入開發工作流程，您就能放心地建構代理程式，因為自動化測試會偵測到任何行為變更。

這個實驗室是「Google Cloud 學習路徑：打造可用於正式環境的 AI」的一部分。

探索完整課程，從設計原型開始，一步步把專案投入正式環境。
使用主題標記 ProductionReadyAI 分享你的進度。

其他讀物：