Evaluate agent skills using open source frameworks

1. Introduction

Overview

In this codelab, you will learn how to use the open source framework Inspect to evaluate a set of Agent Skills. You will run this evaluation on your own machine using Docker containers. Gemini CLI, run through Inspect SWE, will act as the software engineering agent that performs the evaluation.

What you'll do

Run an evaluation against a set of Agent Skills, using custom prompt evaluations.

What you'll learn

  • How to run an eval against Skills using open source frameworks.
  • How to write prompts to use as evaluation questions in question and answer graders.

2. Before you begin

Set up the Gemini API

To use the Gemini API, create an API key in Google AI Studio.

Optional: Test your key

If you have access to a command line with curl, add your key to the first line of the following block, then run it in your terminal to test the API key.

export GEMINI_API_KEY=Paste_your_API_key_here
curl "https://generativelanguage.googleapis.com/v1beta/models?key=${GEMINI_API_KEY}"

You should see a list of models in JSON format, such as models/gemini-3.1-pro-preview. This means that it worked.
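If curl is not available, the same check can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names are hypothetical, and it assumes the response is a JSON object with a "models" list whose entries each have a "name" field, as the output described above suggests.

```python
import json
import urllib.request

MODELS_ENDPOINT = "https://generativelanguage.googleapis.com/v1beta/models"


def extract_model_names(payload: dict) -> list[str]:
    """Return the "name" field of every entry under the "models" key."""
    return [model["name"] for model in payload.get("models", [])]


def list_model_names(api_key: str) -> list[str]:
    """Call the models endpoint and return the model names it reports."""
    url = f"{MODELS_ENDPOINT}?key={api_key}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return extract_model_names(json.load(response))
```

To use it, call list_model_names(os.environ["GEMINI_API_KEY"]) after importing os. A non-empty list of names suggests the key works; an HTTP error usually points to a missing or invalid key.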

Install system dependencies

You will need to install the following software on your machine to complete this tutorial:

  • Docker
    • This will be used to run the evaluation in a sandbox environment
  • Python
    • This is the programming language that Inspect is written in
  • Node.js and NPM
    • Node.js is the runtime that Gemini CLI runs on, and NPM is used to install it.
  • git
    • This will be used to get a copy of the skills repository being evaluated
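Before continuing, you may want to confirm that these tools are on your PATH. The following is a small sketch using the Python standard library. The executable names are assumptions and can differ by platform (for example, "python3" instead of "python").

```python
import shutil

# Executable names are assumptions; adjust them for your platform.
REQUIRED_TOOLS = ["docker", "python", "node", "npm", "git"]


def find_missing_tools(tools: list[str]) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

This only checks that each executable can be found. It does not confirm versions or that the Docker daemon is running.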

3. Identify the skills to evaluate

Agent Skills are a standardized way to give AI agents new capabilities and expertise.

This codelab will use the Google Agent Skills repository (https://github.com/google/skills) as an example, but you can change this to any GitHub repository that contains agent skills.

Based on the contents of the repository, we will write a series of questions whose answers we know are contained in the skills. The software engineering agent is given the skills and asked each question, and its response is then checked against the expected answer.

The Google Agent Skills repository contains a skill specific to Cloud Run, so we can ask the following question:

"How do you deploy a service to Cloud Run, given code on my local machine?"

The answer to this question is "gcloud run deploy". We will provide this question and answer, as well as the GitHub repository of skills, to the evaluator, which will then confirm if the question can be answered by the agent skills provided.

4. Run the evaluation

In this step, you will run an example evaluation.

Install Python dependencies

On your local machine, run the following command to install the Python dependencies.

pip install inspect-ai inspect-swe google-genai

Create a copy of the skills repository

Clone the Google Agent Skills repository into a local folder called google-skills.

git clone https://github.com/google/skills.git --depth 1 google-skills

Review the Python application

The evaluation you will run is the following:

from pathlib import Path
import os

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import model_graded_qa
from inspect_swe import gemini_cli

if "GEMINI_API_KEY" not in os.environ:
  raise ValueError("Missing GEMINI_API_KEY. Please set GEMINI_API_KEY environment variable.")

@task
def skills_eval(agent_skills_folder, model="google/gemini-3.1-pro-preview"):

    # For the provided folder, find all folders containing skills
    skill_files = (Path.cwd() / agent_skills_folder).rglob("SKILL.md")
    all_skills = [str(s.parent) for s in skill_files]

    # Example question and answers
    questions = [
        Sample(
            input="How do I deploy a Cloud Run service?",
            target="gcloud run deploy"
        ),
        Sample(
            input="How can I connect to a Cloud SQL instance?",
            target="cloud sql proxy"
        ),
        Sample(
            input="How can I list the roles available in IAM?",
            target="fortune | cowsay",
        ),
    ]

    return Task(
        dataset=questions,
        solver=gemini_cli(skills=all_skills),
        scorer=model_graded_qa(),
        sandbox="docker",
        model=model,
    )

Save this file as skills-eval.py.

This code contains a decorated function skills_eval, which uses the following logic:

  • Take the provided directory, and create a list of all skill folders (folders that contain a SKILL.md file) within it.
  • Use a set of static questions and answers as the dataset
    • Note: one of the questions contains an intentionally wrong answer.
  • Run the evaluation using:
    • Gemini CLI as the solver
    • The model-graded QA scorer (model_graded_qa) as the scorer
    • Docker as the sandbox
    • Gemini 3.1 Pro as the model.
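The skill discovery step can be tried on its own, without Inspect. The sketch below builds a throwaway directory tree and applies the same rglob("SKILL.md") search. The folder names and the find_skill_folders helper are hypothetical and chosen only for illustration.

```python
from pathlib import Path
import tempfile


def find_skill_folders(root: Path) -> list[str]:
    """Return every folder under root that contains a SKILL.md file."""
    return sorted(str(path.parent) for path in root.rglob("SKILL.md"))


# Build a throwaway layout; the folder names here are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["cloud-run", "cloud-sql"]:
        (root / "skills" / name).mkdir(parents=True)
        (root / "skills" / name / "SKILL.md").write_text("# " + name)
    (root / "docs").mkdir()  # No SKILL.md here, so it is not a skill folder.

    found = [Path(folder).name for folder in find_skill_folders(root)]
    print(found)
```

Only folders that directly contain a SKILL.md file are returned, which is why the docs folder is skipped.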

In the next step, you will use Inspect to run this evaluation.

Run the evaluation

To run the evaluation, use the following command:

inspect eval skills-eval.py -T agent_skills_folder=google-skills

The first time this evaluation runs, it will download Docker images and install Node.js and Python dependencies. This will take some time to complete, depending on your network connection. If you run the evaluation again, this setup will be cached.

After the setup completes, Inspect will perform the evaluation. An interactive interface will appear in your terminal, letting you monitor the evaluation as it progresses.

Running tasks

During evaluation, you can click "Running Samples" to see the current progress, or to cancel the process.

Running samples

In the next step, you will review the results.

5. View and interpret the results

Once the evaluation is complete, you can view its results.

View results

The evaluation wrote a .eval file to the logs/ folder. This is a binary file, and not directly viewable.

To view the results of the evaluation, use the Inspect Viewer:

inspect view

This will start a web server at http://127.0.0.1:7575. Open this URL in your browser to view the results.

Inspect View

Interpret the results

This evaluation used a model-graded scorer, which assigns one of the following grades:

  • "C": Correct
    • The answer was completely correct.
  • "P": Partial
    • The answer was partially correct.
  • "I": Incorrect
    • The answer was not correct.

In this codelab, there is one intentionally incorrect answer, which appears as "I" (Incorrect) and brings the overall accuracy down to 0.667 (two out of three correct).
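The 0.667 figure can be reproduced by hand. The sketch below assumes the grade letters map to numeric values of 1.0 for "C", 0.5 for "P", and 0.0 for "I", and that accuracy is the mean of those values. The 0.5 value for "P" is an assumption here and does not affect this example, which has no partial grades.

```python
# Assumed mapping from grade letters to numeric values.
GRADE_VALUES = {"C": 1.0, "P": 0.5, "I": 0.0}


def accuracy(grades: list[str]) -> float:
    """Return the mean numeric value of a list of grade letters."""
    return sum(GRADE_VALUES[grade] for grade in grades) / len(grades)


# Two correct answers and one intentionally incorrect answer.
print(round(accuracy(["C", "C", "I"]), 3))  # prints 0.667
```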

You can view additional information about the method taken, the tokens used, and other information about the evaluation, by clicking on any of the tabs.

6. Extend the evaluation

There are a number of changes you can make to this evaluation to expand the scope.

Provide more questions

For repos with multiple skills, try adding more questions and answers based on the contents of the skills repository. Inspect supports loading datasets from files, with built-in dataset readers for the CSV, JSON, and JSON Lines formats.
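As a starting point, the questions can be moved out of the code and into a JSON Lines file, with one JSON object per line. The sketch below writes and reads such a file using only the standard library. The file name questions.jsonl and the helper functions are hypothetical, and the "input" and "target" field names are chosen to mirror the Sample fields used earlier. It assumes, without verifying, that Inspect's JSON dataset reader picks up those field names by default.

```python
import json
from pathlib import Path
import tempfile

questions = [
    {"input": "How do I deploy a Cloud Run service?", "target": "gcloud run deploy"},
    {"input": "How can I connect to a Cloud SQL instance?", "target": "cloud sql proxy"},
]


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "questions.jsonl"  # Hypothetical file name.
    write_jsonl(dataset_path, questions)
    loaded = read_jsonl(dataset_path)
    print(len(loaded), "questions loaded")
```

Keeping questions in a file makes it easier to review and extend them without touching the task code.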

Update the Agent Skills being tested

As Agent Skills repos are updated, you can update your local copy of the code and re-run the evaluation against the new information. This can help you track how the skills perform over time. If an agent skill is updated, run git pull in your local copy to update the code, then re-run the evaluation to see the changes.

Use different scorers

In this codelab, we used the model-graded QA scorer (model_graded_qa). Inspect offers multiple built-in scorers, and also the option to create your own custom scorer.
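To see how a simpler scorer might differ from a model-graded one, consider a plain substring check: the answer counts as correct if it contains the expected text. The sketch below is ordinary Python that illustrates the idea only. It is not Inspect's scorer API, and the includes_target function is hypothetical.

```python
def includes_target(answer: str, target: str, ignore_case: bool = True) -> bool:
    """Return True if the target text appears anywhere in the answer."""
    if ignore_case:
        answer, target = answer.lower(), target.lower()
    return target in answer


answer = "From the source folder, run `gcloud run deploy` and follow the prompts."
print(includes_target(answer, "gcloud run deploy"))  # prints True
print(includes_target(answer, "fortune | cowsay"))   # prints False
```

A substring check is cheap and deterministic, but it misses answers that are correct yet phrased differently, which is where a model-graded scorer can help.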

Use different solver models

In this codelab, we used Gemini 3.1 Pro as the solver model. You can change this by providing the model name as a command line parameter, without having to change the code. You can re-run the evaluation with a different Gemini model with the following command:

inspect eval skills-eval.py -T agent_skills_folder=google-skills \
  -T model=google/gemini-3.1-flash-live-preview

This "task arg" will appear in the Inspect Viewer, allowing you to keep track of the arguments used to run the evaluation.
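If you want to compare several models, the command lines can be generated rather than typed by hand. The sketch below only builds and prints the commands, using the -T syntax shown in this codelab. It does not run them. The build_eval_command helper is hypothetical, and the model names in the list are taken from this codelab, so replace them with models available to your API key.

```python
import shlex


def build_eval_command(task_file: str, skills_folder: str, model: str) -> list[str]:
    """Build the argument list for one `inspect eval` run."""
    return [
        "inspect", "eval", task_file,
        "-T", f"agent_skills_folder={skills_folder}",
        "-T", f"model={model}",
    ]


# Model names are examples only; replace them with models you have access to.
MODELS = [
    "google/gemini-3.1-pro-preview",
    "google/gemini-3.1-flash-live-preview",
]

for model in MODELS:
    command = build_eval_command("skills-eval.py", "google-skills", model)
    print(shlex.join(command))
```

Each argument list can be passed to subprocess.run if you prefer to launch the runs from Python.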

Evaluate different skills

In this codelab, we used the Google Agent Skills repository as the skills being evaluated.

You can evaluate different skill repos, but the questions and answers will also have to be updated to match. For example, Flutter Agent Skills won't give the answers to Cloud Run specific questions.

7. Congratulations

You learned how to run an eval against Skills using open source frameworks, and how to write prompts to use as evaluation questions in question and answer graders.