site/docs/providers/claude-agent-sdk.md
This provider makes Claude Agent SDK available for evals through its TypeScript SDK.
:::info
The Claude Agent SDK was formerly known as the Claude Code SDK. It's still built on top of Claude Code and exposes all its functionality.
:::
You can reference this provider using either:
- anthropic:claude-agent-sdk (full name)
- anthropic:claude-code (alias)

The Claude Agent SDK provider requires the @anthropic-ai/claude-agent-sdk package to be installed separately:
npm install @anthropic-ai/claude-agent-sdk
:::note
This is an optional dependency and only needs to be installed if you want to use the Claude Agent SDK provider. Note that Anthropic has released the claude-agent-sdk library with a proprietary license.
:::
The easiest way to get started is with an Anthropic API key. You can set it with the ANTHROPIC_API_KEY environment variable or specify the apiKey in the provider configuration.
Create Anthropic API keys here.
Example of setting the environment variable:
export ANTHROPIC_API_KEY=your_api_key_here
If Claude Agent SDK will authenticate through an existing local Claude Code session instead of ANTHROPIC_API_KEY, disable Promptfoo's upfront API key check:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      apiKeyRequired: false
This is useful when you're using a local Claude Code binary with an active session, such as Claude Code monthly plans. Promptfoo will skip its preflight API key validation, but the SDK still needs to be able to authenticate on its own.
Apart from using the Anthropic API, you can also use AWS Bedrock and Google Vertex AI.
For AWS Bedrock:
Set the CLAUDE_CODE_USE_BEDROCK environment variable to true:
export CLAUDE_CODE_USE_BEDROCK=true
For Google Vertex:
Set the CLAUDE_CODE_USE_VERTEX environment variable to true:
export CLAUDE_CODE_USE_VERTEX=true
By default, Claude Agent SDK runs in a temporary directory with no tools enabled, using the default permission mode. This makes it behave similarly to the standard Anthropic provider. It has no access to the file system (read or write) and can't run system commands.
providers:
  - anthropic:claude-agent-sdk
prompts:
  - 'Output a python function that prints the first 10 numbers in the Fibonacci sequence'
When your test cases finish, the temporary directory is deleted.
You can specify a specific working directory for Claude Agent SDK to run in:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./src
prompts:
  - 'Review the TypeScript files and identify potential bugs'
This allows you to prepare a directory with files or sub-directories before running your tests.
By default, when you specify a working directory, Claude Agent SDK is given read-only access to the directory.
You can also allow Claude Agent SDK to write to files, run system commands, call MCP servers, and more.
Here's an example that will allow Claude Agent SDK to both read from and write to files in the working directory. It uses append_allowed_tools to add tools for writing and editing files to the default set of read-only tools. It also sets permission_mode to acceptEdits so Claude Agent SDK can modify files without asking for confirmation.
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./my-project
      append_allowed_tools: ['Write', 'Edit', 'MultiEdit']
      permission_mode: 'acceptEdits'
prompts:
  - 'Refactor the authentication module to use async/await'
Note: when using acceptEdits and tools that allow side effects like writing to files, you'll need to consider how you will reset the files after each test run. See the Managing Side Effects section for more information.
| Parameter | Type | Description | Default |
|---|---|---|---|
| apiKey | string | Anthropic API key | Environment variable |
| apiKeyRequired | boolean | Require Promptfoo to find an Anthropic API key before calling the SDK. Set to false for local SDK auth. | true |
| working_dir | string | Directory for file operations | Temporary directory |
| model | string | Primary model to use (passed to Claude Agent SDK) | Claude Agent SDK default |
| fallback_model | string | Fallback model if primary fails | Claude Agent SDK default |
| max_turns | number | Maximum conversation turns | Claude Agent SDK default |
| max_thinking_tokens | number | Maximum tokens for thinking | Claude Agent SDK default |
| max_budget_usd | number | Maximum cost budget in USD for the agent execution | None |
| task_budget | object | Token budget for pacing tool use: {total: N} | None |
| permission_mode | string | Permission mode: default, plan, acceptEdits, bypassPermissions, dontAsk, auto | default |
| allow_dangerously_skip_permissions | boolean | Required safety flag when using bypassPermissions mode | false |
| thinking | object | Thinking config: {type: 'adaptive'}, {type: 'enabled', budgetTokens: N}, or {type: 'disabled'} | Model default |
| effort | string | Response effort level: low, medium, high, max | high |
| agent | string | Named agent for the main thread (must be defined in agents or settings) | None |
| session_id | string | Custom session UUID (cannot be used with continue/resume unless fork_session is set) | Auto-generated |
| title | string | Custom title for a new session (skips auto-generation from the first message) | Auto-generated |
| debug | boolean | Enable verbose debug logging | false |
| debug_file | string | Write debug logs to this file path (implicitly enables debug) | None |
| betas | string[] | Enable beta features (e.g., ['context-1m-2025-08-07'] for 1M context) | None |
| custom_system_prompt | string | Replace default system prompt | None |
| append_system_prompt | string | Append to default system prompt | None |
| exclude_dynamic_sections | boolean | Strip per-user dynamic sections from the preset prompt so it stays cacheable across runs | false |
| tools | array/object | Base set of built-in tools (array of names or {type: 'preset', preset: 'claude_code'}) | None |
| custom_allowed_tools | string[] | Replace default allowed tools | None |
| append_allowed_tools | string[] | Add to default allowed tools | None |
| allow_all_tools | boolean | Allow all available tools | false |
| disallowed_tools | string[] | Tools to explicitly block (overrides allowed) | None |
| additional_directories | string[] | Additional directories the agent can access (beyond working_dir) | None |
| ask_user_question | object | Automated handling for AskUserQuestion tool (see Handling AskUserQuestion) | None |
| mcp | object | MCP server configuration | None |
| strict_mcp_config | boolean | Only allow configured MCP servers | true |
| cache_mcp | boolean | Enable caching when MCP is configured (for deterministic MCP tools) | false |
| setting_sources | string[] | Where SDK looks for settings, CLAUDE.md, and slash commands | None (disabled) |
| plugins | array | Local plugins to load for the session | None |
| output_format | object | Structured output configuration with JSON schema | None |
| agents | object | Programmatic agent definitions for custom subagents | None |
| hooks | object | Event hooks for intercepting tool calls and other events | None |
| include_partial_messages | boolean | Include partial/streaming messages in response | false |
| include_hook_events | boolean | Include hook lifecycle events in output stream | false |
| tool_config | object | Per-tool configuration (e.g., askUserQuestion.previewFormat) | None |
| prompt_suggestions | boolean | Enable AI-predicted next prompts after each turn | false |
| agent_progress_summaries | boolean | Enable periodic AI progress summaries for subagents | false |
| settings | string/object | Additional settings (file path or inline object) | None |
| on_elicitation | function | Callback for MCP elicitation requests (programmatic only) | Auto-decline |
| resume | string | Resume from a specific session ID | None |
| fork_session | boolean | Fork from an existing session instead of continuing | false |
| continue | boolean | Continue an existing session | false |
| enable_file_checkpointing | boolean | Track file changes for rewinding to previous states | false |
| persist_session | boolean | Save session to disk for later resumption | true |
| sandbox | object | Sandbox settings for command execution isolation | None |
| permission_prompt_tool_name | string | MCP tool name to use for permission prompts | None |
| executable | string | JavaScript runtime: node, bun, or deno | Auto-detected |
| executable_args | string[] | Arguments to pass to the JavaScript runtime | None |
| extra_args | object | Additional CLI arguments (keys without --, values as strings or null for flags) | None |
| env | object | Extra environment variables to forward to the SDK subprocess (e.g. OTEL_*, CLAUDE_CODE_ENABLE_TELEMETRY) | None |
| path_to_claude_code_executable | string | Path to a custom Claude Code executable | Built-in |
| spawn_claude_code_process | function | Custom spawn function for VMs/containers (programmatic only) | Default spawn |
Model selection is optional, since Claude Agent SDK uses sensible defaults. When specified, models are passed directly to the Claude Agent SDK.
providers:
  - id: anthropic:claude-agent-sdk
    config:
      model: claude-opus-4-6
      fallback_model: claude-sonnet-4-5-20250929
Claude Agent SDK also supports a number of model aliases, which can also be used in the configuration.
providers:
  - id: anthropic:claude-agent-sdk
    config:
      model: sonnet
      fallback_model: haiku
Claude Agent SDK also supports configuring models through environment variables. When using this provider, any environment variables you set will be passed through to the Claude Agent SDK.
Unless you specify a custom_system_prompt, the default Claude Code system prompt will be used. You can append additional instructions to it with append_system_prompt.
Set exclude_dynamic_sections: true to strip per-user context (working directory, auto-memory, git status) from the preset prompt. This keeps the prompt-caching prefix static across runs, which matters for high-volume evals. The stripped context is re-injected as the first user message. This option has no effect when custom_system_prompt is set.
:::info
Note that this differs slightly from the Claude Agent SDK's behavior when used independently of Promptfoo. The Agent SDK will not use the Claude Code system prompt by default unless it's specified—it will instead use an empty system prompt if none is provided. If you want to use an empty system prompt with this provider, set custom_system_prompt to an empty string.
:::
If no working_dir is specified, Claude Agent SDK runs in a temporary directory with no access to tools by default.
By default, when a working_dir is specified, Claude Agent SDK has access to the following read-only tools:
- Read - Read file contents
- Grep - Search file contents
- Glob - Find files by pattern
- LS - List directory contents

Control Claude Agent SDK's permissions for modifying files and running system commands:
| Mode | Description |
|---|---|
| default | Standard permissions |
| plan | Planning mode |
| acceptEdits | Allow file modifications |
| bypassPermissions | No restrictions (requires allow_dangerously_skip_permissions: true) |
| dontAsk | Deny permissions that aren't pre-approved (no prompts) |
:::warning
Using bypassPermissions requires setting allow_dangerously_skip_permissions: true as a safety measure:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      permission_mode: bypassPermissions
      allow_dangerously_skip_permissions: true
:::
Customize available tools for your use case:
# Use all default Claude Code tools via preset
providers:
  - id: anthropic:claude-agent-sdk
    config:
      tools:
        type: preset
        preset: claude_code

# Specify exact base tools
providers:
  - id: anthropic:claude-agent-sdk
    config:
      tools:
        - Bash
        - Read
        - Edit
        - Write

# Disable all built-in tools
providers:
  - id: anthropic:claude-agent-sdk
    config:
      tools: []

# Add tools to defaults
providers:
  - id: anthropic:claude-agent-sdk
    config:
      append_allowed_tools: ['Write', 'Edit']

# Replace default tools entirely
providers:
  - id: anthropic:claude-agent-sdk
    config:
      custom_allowed_tools: ['Read', 'Grep', 'Glob', 'Write', 'Edit', 'MultiEdit', 'Bash', 'WebFetch', 'WebSearch']

# Block specific tools
providers:
  - id: anthropic:claude-agent-sdk
    config:
      disallowed_tools: ['Delete', 'Run']

# Allow all tools (use with caution)
providers:
  - id: anthropic:claude-agent-sdk
    config:
      allow_all_tools: true
The tools option specifies the base set of available built-in tools, while allowedTools and disallowedTools filter from that base.
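One way to picture how the allow and block options combine is the rough sketch below. It is not Promptfoo's or the SDK's actual resolution logic: the helper name resolveAllowedTools and the DEFAULT_READ_ONLY_TOOLS constant are hypothetical, and the sketch assumes that append_allowed_tools is applied on top of whichever list is in effect and that disallowed_tools is applied last.

```javascript
// Hypothetical helper: illustrates the documented precedence only.
const DEFAULT_READ_ONLY_TOOLS = ['Read', 'Grep', 'Glob', 'LS'];

function resolveAllowedTools(config = {}) {
  // custom_allowed_tools replaces the defaults; append_allowed_tools adds to them.
  const base = config.custom_allowed_tools ?? DEFAULT_READ_ONLY_TOOLS;
  const allowed = new Set([...base, ...(config.append_allowed_tools ?? [])]);
  // disallowed_tools overrides anything that was allowed.
  for (const tool of config.disallowed_tools ?? []) {
    allowed.delete(tool);
  }
  return [...allowed];
}

console.log(
  resolveAllowedTools({ append_allowed_tools: ['Write', 'Edit'], disallowed_tools: ['LS'] }),
);
// [ 'Read', 'Grep', 'Glob', 'Write', 'Edit' ]
```

A helper like this can be handy in a test harness for checking that a config does not expose more tools than intended before an eval runs.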
⚠️ Security Note: Some tools allow Claude Agent SDK to modify files, run system commands, search the web, and more. Think carefully about security implications before using these tools.
Here's a full list of available tools.
Unlike the standard Anthropic provider, Claude Agent SDK handles MCP (Model Context Protocol) connections directly. Configuration is forwarded to the Claude Agent SDK:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      mcp:
        servers:
          # HTTP-based server
          - url: https://api.example.com/mcp
            name: api-server
            headers:
              Authorization: 'Bearer token'
          # Process-based server
          - command: node
            args: ['mcp-server.js']
            name: local-server
      strict_mcp_config: true # Only use configured servers (true by default)
For detailed MCP configuration, see Claude Code MCP documentation.
By default, the Claude Agent SDK provider does not look for settings files, CLAUDE.md, or slash commands. You can enable this by specifying setting_sources:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      setting_sources: ['project', 'local']
Available values:
- user - User-level settings
- project - Project-level settings
- local - Local directory settings

Plugins extend the agent with additional skills, agents, hooks, and MCP servers. While setting_sources discovers skills from the standard settings hierarchy (project/local/user), plugins are self-contained directories that bundle capabilities together and namespace their skills, mirroring how marketplace-installed plugins work.
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./my-project
      plugins:
        - type: local
          path: ./my-plugin
      append_allowed_tools: ['Skill', 'Read']
:::note
Only the local type is currently supported. Relative paths in path resolve against the config file's directory.
:::
A plugin is a directory containing a .claude-plugin/plugin.json manifest:
my-plugin/
├── .claude-plugin/
│   └── plugin.json
└── skills/
    └── code-review/
        └── SKILL.md
The manifest defines the plugin's name and description:
{
  "name": "my-plugin",
  "description": "A plugin that provides code review skills"
}
Skills from plugins are namespaced with the plugin name. For example, a standards-check skill in a plugin named project-standards becomes project-standards:standards-check. Use this namespaced name when asserting on skill invocations:
assert:
  - type: skill-used
    value: project-standards:standards-check
Both plugins and setting_sources can provide skills, but they serve different purposes:
- setting_sources: Discovers skills from the standard settings hierarchy: project, local, and user-level .claude/skills/ directories. Skills are not namespaced.
- plugins: Loads self-contained plugin directories, mirroring how marketplace-installed plugins work. Skills are namespaced with the plugin name (plugin:skill).

You can use both together; skills from both sources are available in the same session.
Agent Skills are reusable capabilities that extend Claude's functionality. They are defined as SKILL.md files and can be tested using the Claude Agent SDK provider. Skills can be loaded via setting_sources (from the standard settings hierarchy) or from plugins.
To test skills, load them via setting_sources or plugins, and include Skill in the allowed tools. Using setting_sources:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./my-project
      setting_sources: ['project'] # Load skills from .claude/skills/
      append_allowed_tools: ['Skill']
Skills are automatically discovered at startup from the configured setting_sources directories. The SDK scans for SKILL.md files in subdirectories of .claude/skills/:
my-project/
└── .claude/
    └── skills/
        ├── code-review/
        │   └── SKILL.md
        └── test-generator/
            └── SKILL.md
Claude automatically invokes the relevant skill when a task matches the skill's description in its frontmatter.
Promptfoo normalizes Claude Skill tool invocations into response.metadata.skillCalls, so skill evals can use the same skill-used assertion style as Codex. The underlying Skill tool calls are still available in response.metadata.toolCalls when you need the raw tool payload.
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./my-project
      setting_sources: ['project']
      append_allowed_tools: ['Skill', 'Read', 'Write']
prompts:
  - 'Review the authentication module for security issues'
tests:
  - assert:
      # Check that a specific skill was invoked
      - type: skill-used
        value: code-review
You can verify skills are loaded by asking Claude to list them. Note that this relies on Claude's free-text response, so use a flexible assertion:
prompts:
  - 'List all available skills by name'
tests:
  - assert:
      - type: icontains
        value: 'code-review' # Expected skill name
:::note
Because the response is free-text, contains assertions may be fragile. For more reliable testing, check tool calls instead (see Testing Skill Invocation).
:::
For consistent testing in CI/CD environments, restrict to project-level skills only:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./my-project
      setting_sources: ['project'] # Only team-shared skills, exclude personal
      append_allowed_tools: ['Skill', 'Read', 'Bash']
      permission_mode: 'acceptEdits'
This ensures tests don't depend on user-specific skills that may not be present in CI.
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./my-project
      setting_sources: ['project']
      append_allowed_tools: ['Skill', 'Read', 'Write', 'Bash']
      permission_mode: 'acceptEdits'
prompts:
  - 'Generate unit tests for the UserService class'
tests:
  - assert:
      # Verify the test-generator skill was invoked
      - type: skill-used
        value: test-generator
      # Verify tests were generated
      - type: icontains
        value: 'describe('
For more information about creating skills, see the Claude Code skills documentation.
Limit the maximum cost of an agent execution with max_budget_usd:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      max_budget_usd: 0.50
The agent will stop execution if the cost exceeds the specified budget.
Control how the model paces its tool use within a token budget using task_budget:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      task_budget:
        total: 50000
The total field sets the token budget for the task. The model uses this to pace its tool use—for example, being more selective about which tools to invoke as the budget is consumed.
Grant the agent access to directories beyond the working directory:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      working_dir: ./project
      additional_directories:
        - /shared/libs
        - /data/models
Get validated JSON responses by specifying an output schema:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      output_format:
        type: json_schema
        schema:
          type: object
          properties:
            analysis:
              type: string
            confidence:
              type: number
          required: [analysis, confidence]
When output_format is configured, the response will include structured output that conforms to the schema. The structured output is available in:
- output - The parsed structured output (when available)
- metadata.structuredOutput - The raw structured output value

Continue or fork existing sessions for multi-turn interactions:
# Continue an existing session
providers:
  - id: anthropic:claude-agent-sdk
    config:
      resume: 'session-id-from-previous-run'
      continue: true

# Or fork from an existing session
providers:
  - id: anthropic:claude-agent-sdk
    config:
      resume: 'session-id-to-fork'
      fork_session: true
Session IDs are returned in the response and can be used to continue conversations across eval runs.
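The parameter table notes that session_id cannot be combined with continue or resume unless fork_session is set. A small pre-flight check along these lines could catch that combination before an eval starts. This is only a sketch: checkSessionConfig is a hypothetical helper, not part of Promptfoo, and it encodes just that one documented rule.

```javascript
// Hypothetical pre-flight check; not part of Promptfoo's API.
function checkSessionConfig(config = {}) {
  const problems = [];
  const reusesSession = Boolean(config.continue) || Boolean(config.resume);
  // Documented rule: a custom session_id needs fork_session when reusing a session.
  if (config.session_id && reusesSession && !config.fork_session) {
    problems.push(
      'session_id cannot be combined with continue/resume unless fork_session is true',
    );
  }
  return problems;
}

// Reports one problem: resume plus session_id without fork_session.
console.log(checkSessionConfig({ session_id: 'my-session-uuid', resume: 'session-id-to-fork' }));
// Reports no problems once fork_session is set.
console.log(
  checkSessionConfig({ session_id: 'my-session-uuid', resume: 'session-id-to-fork', fork_session: true }),
);
```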
By default, sessions are saved to disk (~/.claude/projects/) and can be resumed later. For ephemeral or automated workflows where session history is not needed, disable persistence:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      persist_session: false
Track file changes during the session to enable rewinding to previous states:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      enable_file_checkpointing: true
      working_dir: ./my-project
      append_allowed_tools: ['Write', 'Edit']
When file checkpointing is enabled, the SDK creates backups of files before they are modified. This allows programmatic restoration to any previous state in the conversation.
Enable experimental features using the betas parameter:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      betas:
        - context-1m-2025-08-07
Currently available betas:
| Beta | Description |
|---|---|
| context-1m-2025-08-07 | Enable 1M token context window (Sonnet 4/4.5 only) |
See the Anthropic beta headers documentation for more information.
Run commands in an isolated sandbox environment for additional security:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      sandbox:
        enabled: true
        autoAllowBashIfSandboxed: true
        network:
          allowLocalBinding: true
          allowedDomains:
            - api.example.com
Available sandbox options:
| Option | Type | Description |
|---|---|---|
| enabled | boolean | Enable sandboxed execution |
| autoAllowBashIfSandboxed | boolean | Auto-allow bash commands when sandboxed |
| allowUnsandboxedCommands | boolean | Allow commands that can't be sandboxed |
| enableWeakerNestedSandbox | boolean | Enable weaker sandbox for nested environments |
| excludedCommands | string[] | Commands to exclude from sandboxing |
| failIfUnavailable | boolean | Fail closed when sandbox dependencies are missing |
| ignoreViolations | object | Map of command patterns to violation types to ignore |
| network.allowedDomains | string[] | Domains allowed for network access |
| network.allowLocalBinding | boolean | Allow binding to localhost |
| network.allowUnixSockets | string[] | Specific Unix sockets to allow |
| network.allowAllUnixSockets | boolean | Allow all Unix socket connections |
| network.httpProxyPort | number | HTTP proxy port for network access |
| network.socksProxyPort | number | SOCKS proxy port for network access |
| ripgrep.command | string | Path to custom ripgrep executable |
| ripgrep.args | string[] | Additional arguments for ripgrep |
When sandbox.enabled is true, Claude Agent SDK defaults failIfUnavailable to true; set it to false only if you want the SDK to degrade gracefully when sandbox dependencies or platform support are missing.
See the Claude Code sandbox documentation for more details.
Apply additional settings via a file path or inline object. These load into the "flag settings" layer, which has the highest priority among user-controlled settings:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      settings:
        permissions:
          allow:
            - 'Bash(*)'
            - 'Read(*)'
Or reference a settings file:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      settings: /path/to/settings.json
Customize built-in tool behavior with tool_config:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      tool_config:
        askUserQuestion:
          previewFormat: html # 'markdown' (default) or 'html'
Enable AI-generated progress summaries for running subagents and predicted next prompts:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      agent_progress_summaries: true # periodic summaries for subagents
      prompt_suggestions: true # AI-predicted next prompts after each turn
Specify which JavaScript runtime to use:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      executable: bun # or 'node' or 'deno'
      executable_args:
        - '--smol'
Pass additional arguments to Claude Code:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      extra_args:
        verbose: null # boolean flag (adds --verbose)
        timeout: '30' # adds --timeout 30
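The key-to-flag mapping described in the comments above can be pictured with a short sketch. It is illustrative only and is not taken from the SDK's source: extraArgsToFlags is a hypothetical name, and the sketch assumes each key becomes a `--key` flag followed by its value unless the value is null.

```javascript
// Illustrative only: mirrors the mapping described above, not the SDK's source.
function extraArgsToFlags(extraArgs = {}) {
  const flags = [];
  for (const [key, value] of Object.entries(extraArgs)) {
    flags.push(`--${key}`); // keys are written without the leading dashes
    if (value !== null) {
      flags.push(String(value)); // null marks a boolean flag with no value
    }
  }
  return flags;
}

console.log(extraArgsToFlags({ verbose: null, timeout: '30' }));
// [ '--verbose', '--timeout', '30' ]
```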
Use a specific Claude Code installation:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      path_to_claude_code_executable: /custom/path/to/claude-code
For running Claude Code in VMs, containers, or remote environments, you can provide a custom spawn function when using the provider programmatically:
import { ClaudeCodeSDKProvider } from 'promptfoo';

const provider = new ClaudeCodeSDKProvider({
  config: {
    spawn_claude_code_process: (options) => {
      // Custom spawn logic for VM/container execution
      // options contains: command, args, cwd, env, signal
      return myVMProcess; // Must satisfy SpawnedProcess interface
    },
  },
});
This option is only available when using the provider programmatically, not via YAML configuration.
Define custom subagents with specific tools and permissions:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      agents:
        code-reviewer:
          name: Code Reviewer
          description: Reviews code for bugs and style issues
          tools: [Read, Grep, Glob]
        test-runner:
          name: Test Runner
          description: Runs tests and reports results
          tools: [Bash, Read]
The AskUserQuestion tool allows Claude to ask the user multiple-choice questions during execution. In automated evaluations, there's no human to answer these questions, so you need to configure how they should be handled.
The simplest approach is to use the ask_user_question configuration:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      append_allowed_tools: ['AskUserQuestion']
      ask_user_question:
        behavior: first_option
Available behaviors:
| Behavior | Description |
|---|---|
| first_option | Always select the first option |
| random | Randomly select from available options |
| deny | Deny the tool use |
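To make the three behaviors concrete, here is a minimal sketch of how each one could turn an AskUserQuestion input into a result. It is a hypothetical re-creation, not Promptfoo's handler: the answerQuestions name, the injectable random parameter, and the deny message text are all assumptions. The input shape (questions with a question string and labeled options) follows the canUseTool example in this section.

```javascript
// Hypothetical re-creation of the three behaviors; Promptfoo's own handler may differ.
function answerQuestions(input, behavior, random = Math.random) {
  if (behavior === 'deny') {
    // Assumed deny shape: a behavior plus a human-readable message.
    return { behavior: 'deny', message: 'AskUserQuestion is disabled for this eval' };
  }
  const answers = {};
  for (const q of input.questions) {
    // first_option always picks index 0; random picks any valid index.
    const index = behavior === 'random' ? Math.floor(random() * q.options.length) : 0;
    answers[q.question] = q.options[index].label;
  }
  return { behavior: 'allow', updatedInput: { questions: input.questions, answers } };
}

const sampleInput = {
  questions: [
    {
      question: 'Which database?',
      options: [
        { label: 'Postgres', description: 'Recommended for production' },
        { label: 'SQLite', description: 'Simple local file' },
      ],
    },
  ],
};

console.log(answerQuestions(sampleInput, 'first_option').updatedInput.answers);
// { 'Which database?': 'Postgres' }
```

Passing the random source as a parameter keeps the random behavior reproducible in tests, which matters when an eval result depends on the chosen answer.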
For custom answer selection logic when using the provider programmatically, you can provide your own canUseTool callback:
import { ClaudeCodeSDKProvider } from 'promptfoo';

const provider = new ClaudeCodeSDKProvider({
  config: {
    append_allowed_tools: ['AskUserQuestion'],
  },
  // Custom canUseTool passed via SDK options
});
The canUseTool callback receives the tool name and input, and returns an answer:
async function canUseTool(toolName, input, options) {
  // Let every other tool through unchanged
  if (toolName !== 'AskUserQuestion') {
    return { behavior: 'allow', updatedInput: input };
  }

  const answers = {};
  for (const q of input.questions) {
    // Custom selection logic - prefer options marked as recommended
    // (optional chaining guards against options without a description)
    const preferred = q.options.find((o) => o.description?.toLowerCase().includes('recommended'));
    answers[q.question] = preferred?.label ?? q.options[0].label;
  }

  return {
    behavior: 'allow',
    updatedInput: {
      questions: input.questions,
      answers,
    },
  };
}
See the Claude Agent SDK permissions documentation for more details on canUseTool.
:::tip
If you're testing scenarios where the agent asks questions, consider what answer would lead to the most interesting test case. Using random behavior can help discover edge cases.
:::
The Claude Agent SDK provider captures all tool calls made during the agentic session and exposes them in response.metadata.toolCalls. This allows you to assert on tool usage in your evaluations.
Each tool call entry contains:
| Field | Type | Description |
|---|---|---|
| id | string | Unique tool call ID |
| name | string | Tool name (e.g., Read, Bash, Grep) |
| input | unknown | Arguments passed to the tool |
| output | unknown | Tool result content (undefined if not available) |
| is_error | boolean | Whether the tool call resulted in an error |
| parentToolUseId | string \| null | Parent tool use ID for sub-agent calls, null for top-level |
Use JavaScript assertions to check which tools were called:
assert:
  - type: javascript
    value: |
      const toolCalls = context.providerResponse?.metadata?.toolCalls || [];
      const readCalls = toolCalls.filter(t => t.name === 'Read');
      return readCalls.length > 0;
Check that a specific command was run:
assert:
  - type: javascript
    value: |
      const toolCalls = context.providerResponse?.metadata?.toolCalls || [];
      const bashCalls = toolCalls.filter(t => t.name === 'Bash');
      return bashCalls.some(t => t.input?.command?.includes('npm test'));
Verify tool output content:
assert:
  - type: javascript
    value: |
      const toolCalls = context.providerResponse?.metadata?.toolCalls || [];
      const grepCall = toolCalls.find(t => t.name === 'Grep');
      // output is typed as unknown, so check it is a string and always return a boolean
      return typeof grepCall?.output === 'string' && grepCall.output.includes('expected match');
For skill evals specifically, prefer the deterministic skill-used assertion over raw JavaScript when possible. Promptfoo derives metadata.skillCalls from these Skill tool calls automatically.
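When several assertions need the same information, the tool call fields listed in the table can be rolled up once into a summary. The sketch below is a hypothetical helper (summarizeToolCalls is not provided by Promptfoo) that relies only on the documented name, is_error, and parentToolUseId fields; the sample data is made up for illustration.

```javascript
// Hypothetical helper for use inside a javascript assertion.
function summarizeToolCalls(toolCalls = []) {
  const summary = { total: toolCalls.length, errors: 0, topLevel: 0, subAgent: 0, byName: {} };
  for (const call of toolCalls) {
    if (call.is_error) summary.errors += 1;
    // parentToolUseId is null for top-level calls and set for sub-agent calls.
    if (call.parentToolUseId == null) summary.topLevel += 1;
    else summary.subAgent += 1;
    summary.byName[call.name] = (summary.byName[call.name] ?? 0) + 1;
  }
  return summary;
}

// Made-up sample shaped like response.metadata.toolCalls
const sampleCalls = [
  { id: 't1', name: 'Read', input: {}, output: 'ok', is_error: false, parentToolUseId: null },
  { id: 't2', name: 'Bash', input: {}, output: 'fail', is_error: true, parentToolUseId: null },
  { id: 't3', name: 'Read', input: {}, output: 'ok', is_error: false, parentToolUseId: 't2' },
];

console.log(summarizeToolCalls(sampleCalls));
```

An assertion could then fail the test when summary.errors is greater than zero, or when a tool that should never run shows up in summary.byName.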
When tracing is enabled, every provider call emits an OpenTelemetry span using the GenAI semantic conventions (gen_ai.system, gen_ai.request.model, gen_ai.usage.*, gen_ai.response.model, gen_ai.response.finish_reasons, etc.) plus a child span per completed tool call (tool {name} with tool.input, tool.output, tool.is_error). Spans are parented to the evaluation trace so they appear grouped in the Traces tab.
The W3C TRACEPARENT environment variable is propagated to the SDK subprocess so telemetry it exports attaches to the same trace:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      env:
        CLAUDE_CODE_ENABLE_TELEMETRY: '1'
        OTEL_EXPORTER_OTLP_ENDPOINT: 'http://127.0.0.1:4318'
        OTEL_EXPORTER_OTLP_PROTOCOL: 'http/protobuf'
tracing:
  enabled: true
  otlp:
    http:
      enabled: true
      port: 4318
To also capture Claude Code's internal events — API requests, tool decisions, tool results — set OTEL_LOGS_EXPORTER=otlp and use the JSON logs protocol. Each log record becomes a child span on the provider span.
config:
  env:
    CLAUDE_CODE_ENABLE_TELEMETRY: '1'
    OTEL_LOGS_EXPORTER: otlp
    OTEL_EXPORTER_OTLP_ENDPOINT: 'http://127.0.0.1:4318'
    OTEL_EXPORTER_OTLP_PROTOCOL: 'http/json'
The receiver's /v1/logs endpoint accepts JSON only. The provider automatically injects OTEL_RESOURCE_ATTRIBUTES=promptfoo.trace_id=...,promptfoo.parent_span_id=... so logs link to the correct evaluation trace even though the SDK's logs signal doesn't natively inherit TRACEPARENT.
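For readers who want to see how the two linking values relate to a trace, the sketch below parses a W3C traceparent value (version, 32-hex trace ID, 16-hex parent span ID, flags) and formats the attribute string named above. Whether the provider derives the values this way is an assumption; only the attribute names come from this page, and resourceAttributesFromTraceparent is a hypothetical helper.

```javascript
// Sketch: derive the resource attributes string from a W3C traceparent value.
function resourceAttributesFromTraceparent(traceparent) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(traceparent);
  if (!match) return undefined; // not a well-formed traceparent
  // Skip the full match and the version field; keep trace ID and parent span ID.
  const [, , traceId, parentSpanId] = match;
  return `promptfoo.trace_id=${traceId},promptfoo.parent_span_id=${parentSpanId}`;
}

console.log(
  resourceAttributesFromTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'),
);
// promptfoo.trace_id=4bf92f3577b34da6a3ce929d0e0e4736,promptfoo.parent_span_id=00f067aa0ba902b7
```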
This provider automatically caches responses, and will read from the cache if the prompt, configuration, and files in the working directory (if working_dir is set) are the same as a previous run.
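The idea that a cache hit depends on the prompt, the configuration, and the working directory contents can be illustrated with a key that changes whenever any of the three changes. This is a sketch under stated assumptions, not Promptfoo's real key derivation: cacheKey and stableStringify are hypothetical helpers, files is assumed to be a map of relative paths to file contents, and a real implementation would likely hash the result.

```javascript
// Hypothetical cache key: Promptfoo's real key derivation is not shown in this document.
function stableStringify(value) {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value && typeof value === 'object') {
    // Sort keys so that property order does not change the key.
    const entries = Object.keys(value)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(value[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

function cacheKey(prompt, config, files = {}) {
  // files maps relative paths in working_dir to their contents.
  return stableStringify({ prompt, config, files });
}

const before = cacheKey('Review the code', { working_dir: './src' }, { 'index.ts': 'export {}' });
const after = cacheKey('Review the code', { working_dir: './src' }, { 'index.ts': 'export const a = 1' });
console.log(before === after); // false: editing a file yields a different key
```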
When MCP servers are configured, caching is disabled by default because MCP tools typically interact with external state (APIs, file systems, databases), making cached responses unreliable. To opt back into caching for deterministic MCP tools (e.g., code search, static knowledge bases), set cache_mcp: true:
providers:
  - id: anthropic:claude-agent-sdk
    config:
      cache_mcp: true
      mcp:
        servers:
          - command: npx
            args: ['-y', '@my/deterministic-mcp-server']
            name: my-server
To disable caching globally:
export PROMPTFOO_CACHE_ENABLED=false
You can also include bustCache: true in the configuration to prevent reading from the cache.
When using Claude Agent SDK with configurations that allow side effects, like writing to files, running system commands, or calling MCP servers, you'll need to consider:
This increases complexity, so first consider if you can achieve your goal with a read-only configuration. If you do need to test with side effects, here are some strategies that can help:
- Set evaluateOptions.maxConcurrency: 1 in your config, or use the --max-concurrency 1 CLI flag.

Here are a few complete example implementations: