# compare-deepseek-r1-vs-openai-o1 (DeepSeek-R1 vs OpenAI o1)
You can run this example with:

```bash
npx promptfoo@latest init --example compare-deepseek-r1-vs-openai-o1
cd compare-deepseek-r1-vs-openai-o1
```
This example demonstrates how to benchmark DeepSeek's R1 model against OpenAI's o1 model using the Massive Multitask Language Understanding (MMLU) benchmark, focusing on reasoning-heavy subjects.
## Prerequisites

- promptfoo CLI installed (`npm install -g promptfoo` or `brew install promptfoo`)
- `OPENAI_API_KEY` - your OpenAI API key
- `DEEPSEEK_API_KEY` - your DeepSeek API key

To access the MMLU dataset, you'll need to authenticate with Hugging Face:
1. Create a Hugging Face account at [huggingface.co](https://huggingface.co) if you don't have one
2. Generate an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. Set your token as an environment variable:
   ```bash
   export HF_TOKEN=your_token_here
   ```
   Or add it to your `.env` file:

   ```
   HF_TOKEN=your_token_here
   ```
## Running the Example

1. Get a local copy of the `promptfooconfig.yaml`. You can clone this repository and, from the root directory, run:

   ```bash
   cd examples/compare-deepseek-r1-vs-openai-o1
   ```

   or you can get the example with:

   ```bash
   promptfoo init --example compare-deepseek-r1-vs-openai-o1
   ```

2. Run the evaluation:

   ```bash
   promptfoo eval
   ```

3. View the results in a web interface:

   ```bash
   promptfoo view
   ```
## What's Being Tested

This comparison evaluates both models on reasoning-heavy subjects from the MMLU benchmark (see the `tests` section of `promptfooconfig.yaml` for the full subject list).

Each subject uses 10 questions to keep the test manageable. You can edit this in `promptfooconfig.yaml`.
## Customizing the Evaluation

You can modify the test by editing `promptfooconfig.yaml`:
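At its core, a comparison like this needs little more than two providers and a shared prompt. As a rough sketch (the provider IDs below follow promptfoo's `openai:` and `deepseek:` naming conventions; the shipped config may differ in detail):

```yaml
# Minimal sketch of the comparison setup, not the full shipped config
providers:
  - openai:o1
  - deepseek:deepseek-reasoner # DeepSeek's R1 reasoning model
prompts:
  - file://prompt.txt # both models get the same prompt template
```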
Add more MMLU subjects:

```yaml
tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=physics
```
Try different prompting strategies:

```yaml
prompts:
  # Zero-shot with step-by-step reasoning (default)
  - |
    You are an expert test taker. Please solve the following multiple choice question step by step.

    Question: {{question}}

    Options:
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}

    Think through this step by step, then provide your final answer in the format "Therefore, the answer is A/B/C/D."

  # Zero-shot with direct answer
  - |
    Question: {{question}}
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}

    Answer with just the letter (A/B/C/D) of the correct option.
```
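Because the first prompt asks for a fixed closing phrase, you could also enforce that format automatically. This is a hypothetical addition (not part of this example's shipped config) using promptfoo's `regex` assertion:

```yaml
# Sketch only: checks that an answer line in the expected format is present,
# not that the chosen letter is actually correct.
defaultTest:
  assert:
    - type: regex
      value: 'Therefore, the answer is [ABCD]'
```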
Change the number of questions:

```yaml
tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=physics&limit=20 # Test 20 questions per subject
```
Adjust quality requirements:

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 30000 # Stricter 30-second timeout
```
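Beyond latency, you could layer a model-graded check on the reasoning itself. As a sketch using promptfoo's `llm-rubric` assertion (the rubric wording here is illustrative, not from this example):

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 30000
    - type: llm-rubric
      value: The response shows clear step-by-step reasoning before stating a final answer.
```

Note that `llm-rubric` calls a grading model for every output, which adds cost and time to the evaluation.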