openai-deep-research (OpenAI Deep Research Models)

You can run this example with:

```bash
npx promptfoo@latest init --example openai-deep-research
cd openai-deep-research
```

This example demonstrates OpenAI's deep research models with web search capabilities via the Responses API.

Important Notes

⚠️ Response Times: Deep research models can take 2-10 minutes to complete research tasks as they perform extensive web searches and reasoning.

⚠️ Token Usage: These models use significant tokens for internal reasoning. Always set high max_output_tokens (50,000+) to avoid incomplete responses.

⚠️ Access: Deep research models may require special access from OpenAI. Check your API access if you encounter persistent 429 errors.

Setup

  1. Set your OpenAI API key:

     ```bash
     export OPENAI_API_KEY=your-key-here
     ```

  2. Run the evaluation with an appropriate timeout:

     ```bash
     # Set a 10-minute timeout for deep research tasks
     export PROMPTFOO_EVAL_TIMEOUT_MS=600000
     promptfoo eval
     ```

For local development:

```bash
PROMPTFOO_EVAL_TIMEOUT_MS=600000 npm run local -- eval -c examples/openai-deep-research/promptfooconfig.yaml
```

What's happening?

This example:

  • Tests OpenAI's o4-mini-deep-research model with web search tools
  • Evaluates research capabilities on machine learning and space exploration topics
  • Uses the model's ability to automatically search the web for current information
  • Checks that responses contain relevant technical terminology
  • Demonstrates handling of web search results and citations

The model automatically decides when to use web search to provide comprehensive, up-to-date answers.
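
The checks above can be expressed as a promptfoo config along these lines. This is a minimal sketch, not the exact contents of this example's promptfooconfig.yaml: the prompt text, test topic, and assertion value are illustrative.

```yaml
# Illustrative sketch; see promptfooconfig.yaml in this example
# for the actual prompts and assertions.
prompts:
  - 'Research the current state of {{topic}} and cite your sources.'

providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      max_output_tokens: 50000
      tools:
        - type: web_search_preview

tests:
  - vars:
      topic: reusable rocket boosters
    assert:
      - type: contains
        value: rocket
```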

Configuration Details

```yaml
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      max_output_tokens: 50000 # Required for complete research responses
      tools:
        - type: web_search_preview # Required for deep research models
      # Optional parameters:
      # max_tool_calls: 50 # Control number of searches (default: unlimited)
      # background: true # Use background mode for long-running tasks
      # store: true # Store the conversation for 30 days
```

Available Models

  • o3-deep-research - Most powerful deep research model ($10/1M input, $40/1M output)
  • o3-deep-research-2025-06-26 - Snapshot version
  • o4-mini-deep-research - Faster, more affordable ($2/1M input, $8/1M output)
  • o4-mini-deep-research-2025-06-26 - Snapshot version
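
At these rates, a rough per-run cost estimate is simple arithmetic. The sketch below assumes a hypothetical run on o4-mini-deep-research that consumes 5,000 input tokens and the full 50,000-token output budget used in this example; real runs also bill internal reasoning tokens as output, so treat this as a lower bound.

```python
# Rough cost estimate for one o4-mini-deep-research run, using the
# prices listed above: $2 per 1M input tokens, $8 per 1M output tokens.
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 8.00

input_tokens = 5_000    # hypothetical prompt size
output_tokens = 50_000  # the max_output_tokens budget from this example

cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
     + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
print(f"${cost:.2f}")  # → $0.41
```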

Advanced Features

For production use, run deep research tasks in background mode to avoid timeouts:

```yaml
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      background: true
      webhook_url: https://your-api.com/webhook # Optional: Get notified when complete
```

Using Code Interpreter

Deep research models can analyze data using code:

```yaml
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      tools:
        - type: web_search_preview
        - type: code_interpreter
          container:
            type: auto
```

MCP Server Integration

Connect to private data sources using MCP servers:

```yaml
providers:
  - id: openai:responses:o4-mini-deep-research
    config:
      tools:
        - type: web_search_preview
        - type: mcp
          server_label: mycompany_mcp
          server_url: https://mycompany.com/mcp
          require_approval: never # Required for deep research
```

Prompt Enhancement

For better results, consider preprocessing user queries:

  1. Clarification: Use a faster model to gather context
  2. Prompt rewriting: Expand the query with specific requirements
  3. Deep research: Pass the enhanced prompt to the research model
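
Step 2 can be as simple as a template that turns a terse query into an explicit research brief. A minimal sketch follows; the enhance_query helper and its requirement list are hypothetical, not part of promptfoo or the OpenAI API.

```python
def enhance_query(query: str) -> str:
    """Expand a terse user query into an explicit research brief
    before passing it to the deep research model (hypothetical helper)."""
    return (
        f"Research task: {query}\n"
        "Requirements:\n"
        "- Cite a source URL for every factual claim.\n"
        "- Prefer primary sources from the last two years.\n"
        "- Return a structured report with headings and a short summary."
    )

brief = enhance_query("advances in reusable rocket boosters")
print(brief.splitlines()[0])  # → Research task: advances in reusable rocket boosters
```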

See the OpenAI Deep Research Guide for detailed examples.

Response Format

Deep research responses include:

  • output_text: The final research report with inline citations
  • annotations: Citation details with URLs and titles
  • web_search_call: Details of searches performed
  • code_interpreter_call: Any code analysis performed
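
To post-process a result, you can walk the annotations to collect citations. The sketch below uses a simplified, flattened response shape based on the fields listed above; the actual Responses API payload nests these items differently, so treat the structure as illustrative.

```python
# Simplified, illustrative response shape based on the fields listed above.
response = {
    "output_text": "Reusable boosters have cut launch costs sharply [1].",
    "annotations": [
        {"title": "Launch cost analysis", "url": "https://example.com/analysis"},
    ],
}

def extract_citations(resp: dict) -> list[tuple[str, str]]:
    """Collect (title, url) pairs from the response annotations."""
    return [(a["title"], a["url"]) for a in resp.get("annotations", [])]

for title, url in extract_citations(response):
    print(f"{title}: {url}")  # → Launch cost analysis: https://example.com/analysis
```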

Troubleshooting

  • Timeouts: Increase PROMPTFOO_EVAL_TIMEOUT_MS if evaluations time out
  • Incomplete responses: Increase max_output_tokens to 50,000 or higher
  • 429 errors: May indicate rate limits or access restrictions
  • Tool validation errors: Ensure web_search_preview is configured

Best Practices

  1. Always use high token limits: Set max_output_tokens: 50000 or higher
  2. Handle long response times: Use background mode or set high timeouts
  3. Monitor costs: These models use significant tokens for reasoning
  4. Validate citations: Check that returned URLs are accessible
  5. Consider prompt enhancement: Preprocess queries for better results

Learn More