.agents/skills/mcp-builder/reference/evaluation.md
This document provides guidance on creating comprehensive evaluations for MCP servers. Evaluations test whether LLMs can effectively use your MCP server to answer realistic, complex questions using only the tools provided.
<evaluation>
<qa_pair>
<question>Your question here</question>
<answer>Single verifiable answer</answer>
</qa_pair>
</evaluation>
The measure of quality of an MCP server is NOT how well or comprehensively the server implements tools, but how well these implementations (input/output schemas, docstrings/descriptions, functionality) enable LLMs with no other context and access ONLY to the MCP servers to answer realistic and difficult questions.
Create 10 human-readable questions requiring ONLY READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE, and IDEMPOTENT operations to answer. Each question should be:
Questions MUST be independent
Questions MUST require ONLY NON-DESTRUCTIVE AND IDEMPOTENT tool use
Questions must be REALISTIC, CLEAR, CONCISE, and COMPLEX
Questions must require deep exploration
Questions may require extensive paging
Questions must require deep understanding
Questions must not be solvable with straightforward keyword search
Questions should stress-test tool return values
Questions should MOSTLY reflect real human use cases
Questions may require dozens of tool calls
Include ambiguous questions
Questions must be designed so the answer DOES NOT CHANGE
DO NOT let the MCP server RESTRICT the kinds of questions you create
Answers must be STABLE/STATIONARY
Answers must be CLEAR and UNAMBIGUOUS
Answers must be DIVERSE
Answers must NOT be complex structures
Read the documentation of the target API to understand:
List the tools available in the MCP server:
Repeat steps 1 & 2 until you have a good understanding:
After understanding the API and tools, USE the MCP server tools:
limit parameter to limit results (<10)After inspecting the content, create 10 human-readable questions:
Each QA pair consists of a question and an answer. The output should be an XML file with this structure:
<evaluation>
<qa_pair>
<question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
<answer>Website Redesign</answer>
</qa_pair>
<qa_pair>
<question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
<answer>sarah_dev</answer>
</qa_pair>
<qa_pair>
<question>Look for pull requests that modified files in the /api directory and were merged between January 1 and January 31, 2024. How many different contributors worked on these PRs?</question>
<answer>7</answer>
</qa_pair>
<qa_pair>
<question>Find the repository with the most stars that was created before 2023. What is the repository name?</question>
<answer>data-pipeline</answer>
</qa_pair>
</evaluation>
Example 1: Multi-hop question requiring deep exploration (GitHub MCP)
<qa_pair>
<question>Find the repository that was archived in Q3 2023 and had previously been the most forked project in the organization. What was the primary programming language used in that repository?</question>
<answer>Python</answer>
</qa_pair>
This question is good because:
Example 2: Requires understanding context without keyword matching (Project Management MCP)
<qa_pair>
<question>Locate the initiative focused on improving customer onboarding that was completed in late 2023. The project lead created a retrospective document after completion. What was the lead's role title at that time?</question>
<answer>Product Manager</answer>
</qa_pair>
This question is good because:
Example 3: Complex aggregation requiring multiple steps (Issue Tracker MCP)
<qa_pair>
<question>Among all bugs reported in January 2024 that were marked as critical priority, which assignee resolved the highest percentage of their assigned bugs within 48 hours? Provide the assignee's username.</question>
<answer>alex_eng</answer>
</qa_pair>
This question is good because:
Example 4: Requires synthesis across multiple data types (CRM MCP)
<qa_pair>
<question>Find the account that upgraded from the Starter to Enterprise plan in Q4 2023 and had the highest annual contract value. What industry does this account operate in?</question>
<answer>Healthcare</answer>
</qa_pair>
This question is good because:
Example 1: Answer changes over time
<qa_pair>
<question>How many open issues are currently assigned to the engineering team?</question>
<answer>47</answer>
</qa_pair>
This question is poor because:
Example 2: Too easy with keyword search
<qa_pair>
<question>Find the pull request with title "Add authentication feature" and tell me who created it.</question>
<answer>developer123</answer>
</qa_pair>
This question is poor because:
Example 3: Ambiguous answer format
<qa_pair>
<question>List all the repositories that have Python as their primary language.</question>
<answer>repo1, repo2, repo3, data-pipeline, ml-tools</answer>
</qa_pair>
This question is poor because:
After creating evaluations:
<qa_pair> that require WRITE or DESTRUCTIVE operationsRemember to parallelize solving tasks to avoid running out of context, then accumulate all answers and make changes to the file at the end.
After creating your evaluation file, you can use the provided evaluation harness to test your MCP server.
Install Dependencies
pip install -r scripts/requirements.txt
Or install manually:
pip install anthropic mcp
Set API Key
export ANTHROPIC_API_KEY=your_api_key_here
Evaluation files use XML format with <qa_pair> elements:
<evaluation>
<qa_pair>
<question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
<answer>Website Redesign</answer>
</qa_pair>
<qa_pair>
<question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
<answer>sarah_dev</answer>
</qa_pair>
</evaluation>
The evaluation script (scripts/evaluation.py) supports three transport types:
Important:
For locally-run MCP servers (script launches the server automatically):
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_mcp_server.py \
evaluation.xml
With environment variables:
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_mcp_server.py \
-e API_KEY=abc123 \
-e DEBUG=true \
evaluation.xml
For SSE-based MCP servers (you must start the server first):
python scripts/evaluation.py \
-t sse \
-u https://example.com/mcp \
-H "Authorization: Bearer token123" \
-H "X-Custom-Header: value" \
evaluation.xml
For HTTP-based MCP servers (you must start the server first):
python scripts/evaluation.py \
-t http \
-u https://example.com/mcp \
-H "Authorization: Bearer token123" \
evaluation.xml
usage: evaluation.py [-h] [-t {stdio,sse,http}] [-m MODEL] [-c COMMAND]
[-a ARGS [ARGS ...]] [-e ENV [ENV ...]] [-u URL]
[-H HEADERS [HEADERS ...]] [-o OUTPUT]
eval_file
positional arguments:
eval_file Path to evaluation XML file
optional arguments:
-h, --help Show help message
-t, --transport Transport type: stdio, sse, or http (default: stdio)
-m, --model Claude model to use (default: claude-3-7-sonnet-20250219)
-o, --output Output file for report (default: print to stdout)
stdio options:
-c, --command Command to run MCP server (e.g., python, node)
-a, --args Arguments for the command (e.g., server.py)
-e, --env Environment variables in KEY=VALUE format
sse/http options:
-u, --url MCP server URL
-H, --header HTTP headers in 'Key: Value' format
The evaluation script generates a detailed report including:
Summary Statistics:
Per-Task Results:
python scripts/evaluation.py \
-t stdio \
-c python \
-a my_server.py \
-o evaluation_report.md \
evaluation.xml
Here's a complete example of creating and running an evaluation:
my_evaluation.xml):<evaluation>
<qa_pair>
<question>Find the user who created the most issues in January 2024. What is their username?</question>
<answer>alice_developer</answer>
</qa_pair>
<qa_pair>
<question>Among all pull requests merged in Q1 2024, which repository had the highest number? Provide the repository name.</question>
<answer>backend-api</answer>
</qa_pair>
<qa_pair>
<question>Find the project that was completed in December 2023 and had the longest duration from start to finish. How many days did it take?</question>
<answer>127</answer>
</qa_pair>
</evaluation>
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=your_api_key
python scripts/evaluation.py \
-t stdio \
-c python \
-a github_mcp_server.py \
-e GITHUB_TOKEN=ghp_xxx \
-o github_eval_report.md \
my_evaluation.xml
github_eval_report.md to:
If you get connection errors:
If many evaluations fail:
If tasks are timing out:
claude-3-7-sonnet-20250219)