examples/integration-strands-agents/README.md
This example demonstrates how to evaluate Strands Agents SDK with promptfoo.
Strands Agents is an open-source AI agent framework developed by AWS that provides a model-driven approach to building AI agents.
You can run this example with:
npx promptfoo@latest init --example integration-strands-agents
cd integration-strands-agents
This example showcases:
- `@tool` decorator to define agent capabilities

Install the dependencies:

pip install -r requirements.txt
This installs:
- `strands-agents[openai]` - The Strands Agents SDK with OpenAI support
- `pydantic` - Data validation library required by Strands

Set your OpenAI API key:

export OPENAI_API_KEY=your-api-key-here
Strands supports multiple model providers. To use Anthropic:
pip install 'strands-agents[anthropic]'
export ANTHROPIC_API_KEY=your-key
Then modify agent.py to use AnthropicModel instead of OpenAIModel.
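The provider swap described above can be sketched roughly as follows. This is a minimal sketch, not the contents of the example's `agent.py`: the `build_model` helper is hypothetical, the model IDs and `max_tokens` value are illustrative, and the constructor arguments assume the `client_args` / `model_id` style used by the Strands model classes. Imports are done lazily so only the extra you installed needs to be present.

```python
import os


def build_model(provider: str = "openai"):
    """Return a Strands model object for the requested provider.

    Hypothetical helper for illustration; the example's agent.py may
    construct its model differently.
    """
    if provider == "openai":
        # Requires: pip install 'strands-agents[openai]'
        from strands.models.openai import OpenAIModel

        return OpenAIModel(
            client_args={"api_key": os.environ["OPENAI_API_KEY"]},
            model_id="gpt-4o",  # illustrative model ID (assumption)
        )
    if provider == "anthropic":
        # Requires: pip install 'strands-agents[anthropic]'
        from strands.models.anthropic import AnthropicModel

        return AnthropicModel(
            client_args={"api_key": os.environ["ANTHROPIC_API_KEY"]},
            model_id="claude-sonnet-4-20250514",  # illustrative model ID (assumption)
            max_tokens=1024,  # illustrative limit (assumption)
        )
    raise ValueError(f"Unsupported provider: {provider!r}")
```

The returned object would then be passed to the agent, for example `Agent(model=build_model("anthropic"), tools=[...])`.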
To use Amazon Bedrock:
pip install 'strands-agents[bedrock]'
# Run evaluation
npx promptfoo eval
# View results in the web UI
npx promptfoo view
The agent is defined in agent.py using the Strands Agent class with two tools:
- `get_weather`: Returns mock weather data for cities (New York, London, Tokyo, Paris, Seattle, San Francisco)
- `convert_temperature`: Converts temperatures between Fahrenheit and Celsius

Tools are defined using the `@tool` decorator, which automatically exposes them to the LLM based on their docstrings.
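As a rough illustration of the two tools, the sketch below defines them with `@tool`. The function signatures, the mock values in `MOCK_WEATHER`, and the fallback message are assumptions made for this sketch and may differ from the example's `agent.py`. The `try`/`except` around the import is only there so the snippet also runs where Strands is not installed.

```python
try:
    from strands import tool
except ImportError:
    # Fallback so the sketch runs without Strands installed (assumption:
    # for illustration only; the real example always uses strands.tool).
    def tool(func):
        return func


# Invented mock data; the example's actual values are not shown in this README.
MOCK_WEATHER = {
    "new york": {"temp_f": 72, "conditions": "sunny"},
    "london": {"temp_f": 59, "conditions": "cloudy"},
    "tokyo": {"temp_f": 68, "conditions": "clear"},
}


@tool
def get_weather(city: str) -> str:
    """Get the current (mock) weather for a city.

    Args:
        city: Name of the city to look up.
    """
    data = MOCK_WEATHER.get(city.strip().lower())
    if data is None:
        # Plain fallback text for unknown cities instead of raising.
        return f"No weather data available for {city}."
    return f"{city}: {data['temp_f']}°F, {data['conditions']}"


@tool
def convert_temperature(value: float, from_unit: str) -> str:
    """Convert a temperature between Fahrenheit and Celsius.

    Args:
        value: Temperature to convert.
        from_unit: Unit of the input value, either "F" or "C".
    """
    unit = from_unit.strip().upper()
    if unit == "F":
        return f"{(value - 32) * 5 / 9:.1f}°C"
    if unit == "C":
        return f"{value * 9 / 5 + 32:.1f}°F"
    return f"Unknown unit '{from_unit}'; expected 'F' or 'C'."
```

The docstrings matter here: they are what the model sees when deciding which tool to call and how to fill in its arguments.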
agent_provider.py exposes a call_api function that promptfoo's Python provider calls to interact with the Strands agent.
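The shape of such a provider can be sketched as below. promptfoo's Python provider calls `call_api(prompt, options, context)` and expects a dictionary with an `output` key (or an `error` key on failure). The `make_call_api` factory and the `echo_agent` stub are hypothetical stand-ins so the sketch runs without Strands; in the real file the callable would be the Strands agent itself.

```python
from typing import Any, Callable, Dict


def make_call_api(run_agent: Callable[[str], Any]):
    """Build a promptfoo-style call_api around any prompt -> result callable.

    Hypothetical factory for illustration; agent_provider.py may simply
    define call_api directly.
    """

    def call_api(prompt: str, options: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
        try:
            result = run_agent(prompt)
            # promptfoo reads the provider's answer from the "output" key.
            return {"output": str(result)}
        except Exception as exc:
            # Report failures through "error" instead of crashing the eval run.
            return {"error": f"Agent call failed: {exc}"}

    return call_api


def echo_agent(prompt: str) -> str:
    """Stub standing in for the Strands agent (assumption for this sketch)."""
    return f"(stub) {prompt}"


# promptfoo looks up the module-level name `call_api`.
call_api = make_call_api(echo_agent)
```

Wrapping the agent call in `try`/`except` is a design choice in this sketch: a single failing test case then shows up as an error in the results rather than aborting the whole evaluation.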
The promptfoo config includes 5 test cases that demonstrate different assertion types:
| Test | Description | Assertion types used |
|---|---|---|
| Weather query for New York | Basic tool usage | contains-any, llm-rubric, latency |
| Weather query for London | Verify temperature format | contains-any, javascript, latency |
| Weather query for Tokyo | Case-insensitive matching | icontains, javascript, latency |
| Weather with temperature conversion | Multi-tool chaining | llm-rubric, javascript, latency |
| Weather for unknown city | Graceful fallback handling | icontains, not-contains, latency |
- `latency` - Ensures responses complete within 30 seconds (applied to all tests via `defaultTest`)
- `contains-any` - Verifies the agent returns expected city names and weather data from the mock tool
- `icontains` - Case-insensitive matching to verify city names appear regardless of formatting
- `not-contains` - Ensures the agent handles unknown cities gracefully without error messages
- `javascript` - Validates temperature format (°F/°C symbols) and response length requirements
- `llm-rubric` - Semantically evaluates whether the agent correctly chains weather lookup with temperature conversion