# compare-deepseek-r1-vs-openai-o1 (DeepSeek-R1 vs OpenAI o1)
You can run this example with:

```bash
npx promptfoo@latest init --example compare-deepseek-r1-vs-openai-o1
cd compare-deepseek-r1-vs-openai-o1
```
This example demonstrates how to benchmark DeepSeek's R1 model against OpenAI's o1 model using the Massive Multitask Language Understanding (MMLU) benchmark, focusing on reasoning-heavy subjects.
## Prerequisites

- promptfoo CLI installed (`npm install -g promptfoo` or `brew install promptfoo`)
- `OPENAI_API_KEY` - your OpenAI API key
- `DEEPSEEK_API_KEY` - your DeepSeek API key

To access the MMLU dataset, you'll need to authenticate with Hugging Face:
1. Create a Hugging Face account at [huggingface.co](https://huggingface.co) if you don't have one
2. Generate an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. Set your token as an environment variable:
   ```bash
   export HF_TOKEN=your_token_here
   ```
   Or add it to your `.env` file:

   ```
   HF_TOKEN=your_token_here
   ```
## Running the Example

1. Get a local copy of the `promptfooconfig.yaml`. You can clone this repository and, from the root directory, run:

   ```bash
   cd examples/compare-deepseek-r1-vs-openai-o1
   ```

   or you can get the example with:

   ```bash
   promptfoo init --example compare-deepseek-r1-vs-openai-o1
   ```

2. Run the evaluation:

   ```bash
   promptfoo eval
   ```

3. View the results in a web interface:

   ```bash
   promptfoo view
   ```
## What's Being Tested

This comparison evaluates both models on reasoning-heavy subjects from the MMLU benchmark (see the `tests` section of `promptfooconfig.yaml` for the full subject list).

Each subject uses 10 questions to keep the test manageable. You can edit this in `promptfooconfig.yaml`.
## Customizing the Evaluation

You can modify the test by editing `promptfooconfig.yaml`:
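At its core, a comparison like this needs little more than two providers and a shared prompt. As a rough sketch (the provider IDs below follow promptfoo's `openai:` and `deepseek:` naming conventions; the shipped config may differ in detail):

```yaml
# Minimal sketch of the comparison setup, not the full shipped config
providers:
  - openai:o1
  - deepseek:deepseek-reasoner # DeepSeek's R1 reasoning model
prompts:
  - file://prompt.txt # both models get the same prompt template
```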
Add more MMLU subjects:

```yaml
tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=physics
```
Try different prompting strategies:

```yaml
prompts:
  # Zero-shot with step-by-step reasoning (default)
  - |
    You are an expert test taker. Please solve the following multiple choice question step by step.

    Question: {{question}}

    Options:
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}

    Think through this step by step, then provide your final answer in the format "Therefore, the answer is A/B/C/D."

  # Zero-shot with direct answer
  - |
    Question: {{question}}
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}

    Answer with just the letter (A/B/C/D) of the correct option.
```
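Because the first prompt asks for a fixed closing phrase, you could also enforce that format automatically. This is a hypothetical addition (not part of this example's shipped config) using promptfoo's `regex` assertion:

```yaml
# Sketch only: checks that an answer line in the expected format is present,
# not that the chosen letter is actually correct.
defaultTest:
  assert:
    - type: regex
      value: 'Therefore, the answer is [ABCD]'
```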
Change the number of questions:

```yaml
tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=physics&limit=20 # Test 20 questions per subject
```
Adjust quality requirements:

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 30000 # Stricter 30-second timeout
```
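Beyond latency, you could layer a model-graded check on the reasoning itself. As a sketch using promptfoo's `llm-rubric` assertion (the rubric wording here is illustrative, not from this example):

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 30000
    - type: llm-rubric
      value: The response shows clear step-by-step reasoning before stating a final answer.
```

Note that `llm-rubric` calls a grading model for every output, which adds cost and time to the evaluation.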