anthropic/opus-4-6-coding (Claude Opus 4.6 Advanced Coding)

This example evaluates Claude Opus 4.6 on complex software engineering tasks that involve ambiguous requirements and tradeoff analysis, exercising both its coding and its reasoning capabilities.

You can run this example with:

bash

npx promptfoo@latest init --example anthropic/opus-4-6-coding
cd anthropic/opus-4-6-coding

What This Tests

Claude Opus 4.6 is positioned as a leading model for coding, agents, and computer use. This example evaluates:

Complex code analysis: Understanding multi-file codebases and architectural decisions
Bug diagnosis: Identifying root causes in complex, multi-system scenarios
Ambiguity handling: Making informed decisions when requirements are unclear
Tradeoff reasoning: Evaluating different approaches and explaining pros/cons
Code generation: Writing high-quality, production-ready code

Features Demonstrated

State-of-the-art coding: Opus 4.6 achieves the highest score on SWE-bench Verified among frontier models
Reasoning about tradeoffs: The model excels at analyzing different approaches and making informed decisions
Handling ambiguity: Opus 4.6 works from underspecified requirements without step-by-step guidance
Extended thinking: Support for thinking budgets up to 128K tokens for complex reasoning
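
The extended thinking and tradeoff-reasoning features above might be exercised with a configuration along these lines. This is only a sketch: the file name `thinking-sketch.yaml`, the model ID `claude-opus-4-6`, the token values, and the test case are all assumptions for illustration, and the `promptfooconfig.yaml` shipped with the example may look quite different. The `thinking` block follows the shape promptfoo documents for its Anthropic provider; check the current provider docs before relying on it for this model.

```shell
# Write a hypothetical side config next to the shipped one.
# File name, model ID, budgets, and the test case are illustrative assumptions.
cat > thinking-sketch.yaml <<'EOF'
description: Sketch of an extended-thinking eval (illustrative only)
prompts:
  - '{{task}}'
providers:
  # Assumed model ID; confirm against the shipped promptfooconfig.yaml.
  - id: anthropic:messages:claude-opus-4-6
    config:
      max_tokens: 20000   # must be larger than the thinking budget
      thinking:
        type: enabled
        budget_tokens: 16000
tests:
  - vars:
      task: 'Choose between a hash map and a sorted array for a read-heavy lookup table and justify the choice.'
    assert:
      # Model-graded check that the answer reasons about tradeoffs.
      - type: llm-rubric
        value: 'Names a data structure and explains at least one tradeoff.'
EOF
```

With an API key set, the sketch could then be run on its own with `npx promptfoo@latest eval -c thinking-sketch.yaml`.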

Running the Example

bash

# Set your API key
export ANTHROPIC_API_KEY=your_api_key_here

# Run the evaluation
npx promptfoo@latest eval

# View results
npx promptfoo@latest view
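
If the evaluation is scripted, a small guard can catch a missing or placeholder API key before any requests are made. The helper names `check_api_key` and `run_eval` and the output file `results.json` are hypothetical and not part of the example; `--no-cache` and `-o` are standard `promptfoo eval` options.

```shell
# Hypothetical helper: fail fast when the key is unset or still the placeholder.
check_api_key() {
  if [ -z "${ANTHROPIC_API_KEY:-}" ] || [ "$ANTHROPIC_API_KEY" = "your_api_key_here" ]; then
    echo "ANTHROPIC_API_KEY is not set to a real key" >&2
    return 1
  fi
}

# Hypothetical wrapper: run the eval only if the key check passes.
# --no-cache forces fresh model calls; -o writes the results to a file.
run_eval() {
  check_api_key && npx promptfoo@latest eval --no-cache -o results.json
}
```

Calling `run_eval` from the example directory then either stops with an error message or runs the evaluation and leaves the results in `results.json`.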

Expected Results

The evaluation tests Opus 4.6's ability to:

Diagnose bugs across multiple system boundaries
Choose appropriate data structures with clear reasoning
Write production-quality code with proper error handling
Analyze architectural decisions and propose improvements
