Back to Promptfoo

claude-agent-sdk (Claude Agent SDK Examples)

examples/claude-agent-sdk/README.md

0.121.96.7 KB
Original Source

claude-agent-sdk (Claude Agent SDK Examples)

The Claude Agent SDK provider (aka Claude Code provider) enables you to run agentic evals with configurable tools, permissions, and environments.

bash
npx promptfoo@latest init --example claude-agent-sdk
cd claude-agent-sdk

Setup

Install the Claude Agent SDK:

bash
npm install @anthropic-ai/claude-agent-sdk

Export your Anthropic API key as ANTHROPIC_API_KEY:

bash
export ANTHROPIC_API_KEY=your_api_key_here

Examples

Basic Usage

This example shows Claude Agent SDK in its simplest form - running in a temporary directory with no file system access or tools enabled, behaving similarly to the standard Anthropic provider.

Location: ./basic/

Usage:

bash
(cd basic && promptfoo eval)

Working Directory

This example provides Claude Agent SDK with read-only access to a sample project containing Python, TypeScript, and JavaScript files with intentional bugs for analysis. Because the working_dir is set, Claude Agent SDK has access to the following read-only tools:

  • Read - Read file contents
  • Grep - Search file contents
  • Glob - Find files by pattern
  • LS - List directory contents

Location: ./working-dir/

Usage:

bash
(cd working-dir && promptfoo eval)

Advanced Editing

This example shows Claude Agent SDK's ability to modify files with:

  • File editing tools: Write, Edit, and MultiEdit tools are added to the default set of read-only tools by setting append_allowed_tools
  • Permission mode: permission_mode is set to acceptEdits for automatic approval of file edits
  • Automatic git workspace management: The working directory (./workspace) uses beforeAll, afterEach, and afterAll extension hooks defined in hooks.js to:
    • Initialize a git repository before all tests
    • Capture timestamped diffs after each test in a markdown report
    • Reset changes after each test
    • Clean up the .git directory after all tests
  • Serial execution: maxConcurrency: 1 to prevent race conditions during concurrent tests

Location: ./advanced/

Usage:

bash
(cd advanced && promptfoo eval)

MCP Integration

This example shows Claude Agent SDK integration with:

  • MCP weather server: Uses @h1deya/mcp-server-weather for weather data
  • Tool permissions: Specific MCP tools (mcp__weather__get-forecast, mcp__weather__get-alerts)
  • External API access: Fetches live weather data for San Francisco

Location: ./mcp/

Usage:

bash
(cd mcp && promptfoo eval)

Structured Output

This example demonstrates Claude Agent SDK's structured output feature, which returns validated JSON that conforms to a schema. It includes:

  • JSON schema validation: Define expected output structure with types, enums, and required fields
  • Code analysis task: Agent analyzes a Python function for bugs
  • Assertion testing: Validates that output matches expected schema and contains correct analysis

Location: ./structured-output/

Usage:

bash
(cd structured-output && promptfoo eval)

Advanced Options

This example demonstrates advanced Claude Agent SDK configuration options including sandbox settings, runtime configuration, permission bypass, and CLI arguments.

Location: ./advanced-options/

Usage:

bash
(cd advanced-options && promptfoo eval)

Features demonstrated:

  • Sandbox configuration: Run commands in isolated environments with network restrictions
  • Runtime configuration: Specify JavaScript runtime (node, bun, deno)
  • Extra CLI arguments: Pass additional flags to Claude Code
  • Setting sources: Control where SDK loads settings from
  • Permission bypass: Safely bypass permissions for automated testing

AskUserQuestion Handling

This example demonstrates handling the AskUserQuestion tool in automated evaluations. When Claude needs to ask the user a question, this shows how to provide automated answers.

Location: ./ask-user-question/

Usage:

bash
(cd ask-user-question && promptfoo eval)

Features demonstrated:

  • Convenience option: Use ask_user_question.behavior for simple automated responses
  • First option selection: Automatically select the first available option
  • Tool enablement: Enable AskUserQuestion via append_allowed_tools

Skills Testing

This example demonstrates testing Agent Skills with the Claude Agent SDK. Skills are reusable capabilities defined as SKILL.md files that Claude automatically invokes when relevant.

  • Skill discovery: Uses setting_sources: ['project'] to load skills from .claude/skills/
  • Skill tool: Enables the Skill tool via append_allowed_tools
  • Skill assertions: Verifies normalized metadata.skillCalls with the skill-used assertion
  • Sample skill: A code review skill that identifies bugs and security issues

Location: ./skills/

Usage:

bash
(cd skills && promptfoo eval)

Plugins

This example demonstrates loading skills from a plugin instead of from setting_sources. Plugins are self-contained directories that bundle skills, agents, hooks, and MCP servers together.

  • Plugin loading: Uses plugins: [{type: local, path: ./sample-plugin}] to load a local plugin
  • Skill tool: Enables the Skill tool via append_allowed_tools
  • Skill assertions: Verifies normalized metadata.skillCalls with the skill-used assertion
  • Sample skill: A standards-check skill verifies the project has a README.md

Location: ./plugins/

Usage:

bash
(cd plugins && promptfoo eval)

Cyber Espionage Red Team

This example demonstrates testing AI agents against cyber espionage attack patterns based on Anthropic's "Disrupting AI Espionage" blog post. It includes:

  • Simulated target system: Workspace with configuration files, credentials, logs, and sensitive data
  • Comprehensive red team plugins: harmful:cybercrime, harmful:cybercrime:malicious-code, ssrf, pii, excessive-agency, and more
  • Advanced jailbreak strategies: jailbreak:meta, jailbreak:hydra, crescendo, goat for sophisticated attacks
  • Reconnaissance testing: File system access tools (Read, Grep, Glob, Bash) to test security boundaries
  • Authorized testing context: Demonstrates responsible security testing practices

Location: ./cyber-espionage/

Usage:

bash
(cd cyber-espionage && promptfoo eval)

⚠️ This example is for authorized security testing only. It demonstrates how to identify vulnerabilities in AI agents before malicious actors can exploit them.