openai-deep-research (OpenAI Deep Research Models)

You can run this example with:

```bash
npx promptfoo@latest init --example openai-deep-research
cd openai-deep-research
```

This example demonstrates OpenAI's deep research models with web search capabilities via the Responses API.

Important Notes

⚠️ Response Times: Deep research models can take 2-10 minutes to complete research tasks as they perform extensive web searches and reasoning.

⚠️ Token Usage: These models use significant tokens for internal reasoning. Always set high max_output_tokens (50,000+) to avoid incomplete responses.

⚠️ Access: Deep research models may require special access from OpenAI. Check your API access if you encounter persistent 429 errors.

Setup

  1. Set your OpenAI API key:

     ```bash
     export OPENAI_API_KEY=your-key-here
     ```

  2. Run the evaluation with an appropriate timeout:

     ```bash
     # Set a 10-minute timeout for deep research tasks
     export PROMPTFOO_EVAL_TIMEOUT_MS=600000
     promptfoo eval
     ```

For local development:

```bash
PROMPTFOO_EVAL_TIMEOUT_MS=600000 npm run local -- eval -c examples/openai-deep-research/promptfooconfig.yaml
```

What's happening?

This example:

  • Tests OpenAI's o4-mini-deep-research model with web search tools
  • Evaluates research capabilities on machine learning and space exploration topics
  • Uses the model's ability to automatically search the web for current information
  • Checks that responses contain relevant technical terminology
  • Demonstrates handling of web search results and citations

The model automatically decides when to use web search to provide comprehensive, up-to-date answers.
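
The checks above can be expressed as a promptfoo config along these lines. This is a minimal sketch, not the exact contents of this example's promptfooconfig.yaml: the prompt text, test topic, and assertion value are illustrative.

```yaml
# Illustrative sketch; see promptfooconfig.yaml in this example
# for the actual prompts and assertions.
prompts:
  - 'Research the current state of {{topic}} and cite your sources.'

providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      max_output_tokens: 50000
      tools:
        - type: web_search_preview

tests:
  - vars:
      topic: reusable rocket boosters
    assert:
      - type: contains
        value: rocket
```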

Configuration Details

```yaml
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      max_output_tokens: 50000 # Required for complete research responses
      tools:
        - type: web_search_preview # Required for deep research models
      # Optional parameters:
      # max_tool_calls: 50 # Control number of searches (default: unlimited)
      # background: true # Use background mode for long-running tasks
      # store: true # Store the conversation for 30 days
```

Available Models

  • o3-deep-research - Most powerful deep research model ($10/1M input, $40/1M output)
  • o3-deep-research-2025-06-26 - Snapshot version
  • o4-mini-deep-research - Faster, more affordable ($2/1M input, $8/1M output)
  • o4-mini-deep-research-2025-06-26 - Snapshot version
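
At these rates, a rough per-run cost estimate is simple arithmetic. The sketch below assumes a hypothetical run on o4-mini-deep-research that consumes 5,000 input tokens and the full 50,000-token output budget used in this example; real runs also bill internal reasoning tokens as output, so treat this as a lower bound.

```python
# Rough cost estimate for one o4-mini-deep-research run, using the
# prices listed above: $2 per 1M input tokens, $8 per 1M output tokens.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00

input_tokens = 5_000    # hypothetical prompt size
output_tokens = 50_000  # the max_output_tokens budget from this example

cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
print(f"${cost:.2f}")  # → $0.41
```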

Advanced Features

For production use, run deep research tasks in background mode to avoid timeouts:

```yaml
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      background: true
      webhook_url: https://your-api.com/webhook # Optional: Get notified when complete
```

Using Code Interpreter

Deep research models can analyze data using code:

```yaml
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      tools:
        - type: web_search_preview
        - type: code_interpreter
          container:
            type: auto
```

MCP Server Integration

Connect to private data sources using MCP servers:

```yaml
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      tools:
        - type: web_search_preview
        - type: mcp
          server_label: mycompany_mcp
          server_url: https://mycompany.com/mcp
          require_approval: never # Required for deep research
```

Prompt Enhancement

For better results, consider preprocessing user queries:

  1. Clarification: Use a faster model to gather context
  2. Prompt rewriting: Expand the query with specific requirements
  3. Deep research: Pass the enhanced prompt to the research model
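
Step 2 can be as simple as a template that turns a terse query into an explicit research brief. A minimal sketch follows; the enhance_query helper and its requirement list are hypothetical, not part of promptfoo or the OpenAI API.

```python
def enhance_query(query: str) -> str:
    """Expand a terse user query into an explicit research brief
    before passing it to the deep research model (hypothetical helper)."""
    return (
        f"Research task: {query}\n"
        "Requirements:\n"
        "- Cite a source URL for every factual claim.\n"
        "- Prefer primary sources from the last two years.\n"
        "- Return a structured report with headings and a short summary."
    )

brief = enhance_query("advances in reusable rocket boosters")
print(brief.splitlines()[0])  # → Research task: advances in reusable rocket boosters
```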

See the OpenAI Deep Research Guide for detailed examples.

Response Format

Deep research responses include:

  • output_text: The final research report with inline citations
  • annotations: Citation details with URLs and titles
  • web_search_call: Details of searches performed
  • code_interpreter_call: Any code analysis performed
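
To post-process a result, you can walk the annotations to collect citations. The sketch below uses a simplified, flattened response shape based on the fields listed above; the actual Responses API payload nests these items differently, so treat the structure as illustrative.

```python
# Simplified, illustrative response shape based on the fields listed above.
response = {
    "output_text": "Reusable boosters have cut launch costs sharply [1].",
    "annotations": [
        {"title": "Launch cost analysis", "url": "https://example.com/analysis"},
    ],
}

def extract_citations(resp: dict) -> list[tuple[str, str]]:
    """Collect (title, url) pairs from the response annotations."""
    return [(a["title"], a["url"]) for a in resp.get("annotations", [])]

for title, url in extract_citations(response):
    print(f"{title}: {url}")  # → Launch cost analysis: https://example.com/analysis
```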

Troubleshooting

  • Timeouts: Increase PROMPTFOO_EVAL_TIMEOUT_MS if evaluations time out
  • Incomplete responses: Increase max_output_tokens to 50,000 or higher
  • 429 errors: May indicate rate limits or access restrictions
  • Tool validation errors: Ensure web_search_preview is configured

Best Practices

  1. Always use high token limits: Set max_output_tokens: 50000 or higher
  2. Handle long response times: Use background mode or set high timeouts
  3. Monitor costs: These models use significant tokens for reasoning
  4. Validate citations: Check that returned URLs are accessible
  5. Consider prompt enhancement: Preprocess queries for better results

Learn More