claude-thinking (Claude Thinking)

This example demonstrates Claude's "thinking" capability, which allows you to see the model's step-by-step reasoning process before it provides a final answer. The example compares thinking outputs from Claude Sonnet 4 (Anthropic API) and Claude Haiku 4.5 (AWS Bedrock).

You can run this example with:

```bash
npx promptfoo@latest init --example claude-thinking
cd claude-thinking
```

What This Example Demonstrates

Using Claude's thinking feature to reveal step-by-step reasoning
Comparing thinking output quality between different Claude models
Comparing Anthropic API vs AWS Bedrock providers
Configuring the thinking token budget
Using LLM-based evaluation rubrics to assess reasoning quality

Environment Variables

This example requires:

For Anthropic API

ANTHROPIC_API_KEY - Your Anthropic API key from console.anthropic.com

For AWS Bedrock

AWS_ACCESS_KEY_ID - Your AWS access key
AWS_SECRET_ACCESS_KEY - Your AWS secret key
Or configure credentials via the AWS CLI: aws configure
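Before running the evaluation, it can help to confirm that the variables listed above are set. The following is a minimal sketch, not part of the example itself; `missing_vars` is a hypothetical helper name. It only inspects environment variables, so credentials configured through `aws configure` will be reported as missing even though they may still work.

```python
import os

# Variable names are taken from the lists above.
ANTHROPIC_VARS = ("ANTHROPIC_API_KEY",)
BEDROCK_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_vars(env=None):
    """Return the names of required variables that are unset or empty.

    `env` defaults to the process environment; pass a dict to check
    a different mapping (hypothetical helper, for illustration only).
    """
    env = os.environ if env is None else env
    return [name for name in ANTHROPIC_VARS + BEDROCK_VARS if not env.get(name)]


missing = missing_vars()
if missing:
    print("Missing from environment: " + ", ".join(missing))
else:
    print("All credentials found in the environment.")
```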

Running the Example

After setting up environment variables:

```bash
# From the example directory
promptfoo eval
promptfoo view
```

Test Cases

This example includes two test cases:

8 Balls Problem - A classic logic puzzle requiring careful reasoning
Train Meeting Problem - A traditional algebra word problem

These test cases are specifically designed to showcase Claude's ability to break down complex problems and show detailed thinking steps.
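To illustrate the kind of reasoning the 8 Balls Problem calls for, here is a rough sketch of one two-weighing strategy. It assumes the classic formulation (eight balls, exactly one heavier, a balance scale); the exact wording of the example's prompt may differ, and `weigh` and `find_heavy_ball` are hypothetical names used only for illustration.

```python
def weigh(weights, left, right):
    """Simulate a balance scale; report which pan is heavier."""
    left_total = sum(weights[i] for i in left)
    right_total = sum(weights[i] for i in right)
    if left_total > right_total:
        return "left"
    if right_total > left_total:
        return "right"
    return "equal"


def find_heavy_ball(weights):
    """Return (index of the heavier ball, number of weighings used)."""
    assert len(weights) == 8
    # Weighing 1: balls 0-2 against balls 3-5; balls 6 and 7 stay off the scale.
    first = weigh(weights, [0, 1, 2], [3, 4, 5])
    if first == "equal":
        # The heavier ball is one of the two set aside.
        second = weigh(weights, [6], [7])
        return (6 if second == "left" else 7), 2
    group = [0, 1, 2] if first == "left" else [3, 4, 5]
    # Weighing 2: compare two balls from the heavier group of three.
    second = weigh(weights, [group[0]], [group[1]])
    if second == "left":
        return group[0], 2
    if second == "right":
        return group[1], 2
    # Balanced pans mean the third, unweighed ball is the heavy one.
    return group[2], 2


# Check every possible position of the heavier ball.
for heavy in range(8):
    balls = [1] * 8
    balls[heavy] = 2
    assert find_heavy_ball(balls) == (heavy, 2)
```

Splitting into groups of three, three, and two is what keeps the count at two weighings: each weighing has three outcomes, so each one can narrow the candidates to roughly a third.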

How Claude Thinking Works

The thinking feature is enabled by setting the thinking block and max_tokens in the provider configuration:

```yaml
thinking:
  type: 'enabled'
  budget_tokens: 4096 # Controls how many tokens are allocated for thinking
max_tokens: 8192 # Must be greater than budget_tokens
```
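The constraint noted in the comment above (max_tokens must exceed budget_tokens) is easy to get wrong when tuning budgets. The sketch below checks a config dictionary shaped like the YAML fragment. `check_thinking_config` is a hypothetical helper, and the 1024-token minimum is an assumption based on Anthropic's published guidance rather than something stated in this example.

```python
# Assumed minimum thinking budget; not stated in this example, verify against
# the provider documentation before relying on it.
MIN_BUDGET_TOKENS = 1024


def check_thinking_config(config):
    """Return a list of problems; an empty list means the config looks consistent."""
    problems = []
    thinking = config.get("thinking") or {}
    budget = thinking.get("budget_tokens")
    max_tokens = config.get("max_tokens")

    if thinking.get("type") != "enabled":
        problems.append("thinking.type should be 'enabled'")
    if not isinstance(budget, int):
        problems.append("thinking.budget_tokens should be an integer")
    elif budget < MIN_BUDGET_TOKENS:
        problems.append(f"thinking.budget_tokens should be at least {MIN_BUDGET_TOKENS}")
    if not isinstance(max_tokens, int):
        problems.append("max_tokens should be an integer")
    elif isinstance(budget, int) and max_tokens <= budget:
        problems.append("max_tokens should be greater than thinking.budget_tokens")
    return problems


# The values from the fragment above pass the check.
example = {"thinking": {"type": "enabled", "budget_tokens": 4096}, "max_tokens": 8192}
print(check_thinking_config(example))  # prints []
```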

When enabled, Claude's response will include a "Thinking:" section that shows its reasoning process before the final answer:

```text
Thinking: Let me solve this step by step...
1. First, I'll divide the 8 balls into three groups...
2. In the first weighing, I'll compare groups A and B...
3. Based on the result, I can determine...

Final answer: We need exactly 2 weighings to find the heavier ball.
```
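If you want to post-process such output, for instance to check the final answer separately from the reasoning, a simple split may be enough. The sketch below assumes the layout shown above: the text starts with "Thinking:" and the first blank line separates the reasoning from the answer. Real outputs may contain blank lines inside the reasoning, in which case this heuristic would cut too early. `split_thinking` is a hypothetical helper name.

```python
def split_thinking(output):
    """Split model output into (thinking, answer).

    Heuristic: everything after the "Thinking:" prefix up to the first
    blank line is treated as reasoning; the rest is the answer. Output
    without the prefix is returned as the answer with empty reasoning.
    """
    text = output.strip()
    prefix = "Thinking:"
    if not text.startswith(prefix):
        return "", text
    thinking, _, answer = text[len(prefix):].partition("\n\n")
    return thinking.strip(), answer.strip()


sample = (
    "Thinking: Let me solve this step by step...\n"
    "1. First, I'll divide the 8 balls into three groups...\n"
    "\n"
    "Final answer: We need exactly 2 weighings to find the heavier ball."
)
thinking, answer = split_thinking(sample)
print(answer)  # prints "Final answer: We need exactly 2 weighings to find the heavier ball."
```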
