This guide compares the full, mini, and nano tiers of OpenAI's GPT models on MMLU-Pro reasoning tasks using promptfoo.

MMLU-Pro is a more challenging successor to MMLU, with harder reasoning questions and up to 10 answer options per item. It covers a broad set of academic and professional subjects, and it is more useful than classic MMLU now that current models are near saturation on easier multiple-choice benchmarks. Running your own MMLU-Pro eval lets you compare reasoning quality, latency, and cost on a benchmark where full-size, mini, and nano models are less likely to tie at a perfect score.
:::tip Quick Start

```bash
npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
```

:::
Set `OPENAI_API_KEY` in your environment, along with an `HF_TOKEN` for loading the Hugging Face dataset. Then initialize and configure:

```bash
npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
cd compare-gpt-model-tiers-mmlu-pro
export HF_TOKEN=your_token_here
```
Create a minimal configuration:

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: GPT model tiers MMLU-Pro comparison

prompts:
  - |
    Question: {{question}}

    {% for option in options -%}
    {{ "ABCDEFGHIJ"[loop.index0] }}) {{ option }}
    {% endfor %}
    End with: Therefore, the answer is <LETTER>.

providers:
  - openai:chat:gpt-5.4
  - openai:chat:gpt-5.4-mini
  - openai:chat:gpt-5.4-nano

defaultTest:
  assert:
    - type: regex
      value: 'Therefore, the answer is [A-J]'
    - type: javascript
      value: |
        const match = String(output).match(/Therefore,\s*the\s*answer\s*is\s*([A-J])/i);
        return match?.[1]?.toUpperCase() === String(context.vars.answer).trim().toUpperCase();

tests:
  - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=20
```
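The `javascript` assertion grades each response by extracting the final answer letter and comparing it to the dataset's ground truth. You can sanity-check that extraction logic on its own; this standalone sketch reuses the same regex outside of promptfoo:

```javascript
// Extract the final answer letter with the same regex the assertion uses.
function extractAnswer(output) {
  const match = String(output).match(/Therefore,\s*the\s*answer\s*is\s*([A-J])/i);
  return match ? match[1].toUpperCase() : null;
}

// The case-insensitive flag tolerates lowercase letters in the model output.
console.log(extractAnswer('Let me reason it out... Therefore, the answer is c.')); // C
console.log(extractAnswer('I am not sure.')); // null
```

Normalizing both sides to uppercase means a model that writes "the answer is c" still passes against a ground-truth answer of `C`.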
Run the eval and open the results viewer:

```bash
npx promptfoo@latest eval
npx promptfoo@latest view
```
You should see the full-size GPT tier outperforming the smaller tiers on at least some MMLU-Pro categories, though the exact gaps depend on the sample. The viewer shows pass rates side by side, letting you compare reasoning capability across tiers directly.
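If you want to inspect results programmatically rather than in the viewer, promptfoo can also write eval results to a file. One way is an `outputPath` entry in the config (the filename here is illustrative):

```yaml
# Write eval results to a JSON file in addition to the local results store.
outputPath: results.json
```

The same effect is available per run via `promptfoo eval --output results.json`.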
Add a short reasoning instruction and a fixed final-answer format:
```yaml
prompts:
  - |
    You are an expert test taker. Solve this step by step.

    Question: {{question}}

    Options:
    {% for option in options -%}
    {{ "ABCDEFGHIJ"[loop.index0] }}) {{ option }}
    {% endfor %}
    Think through this step by step, then provide your final answer as "Therefore, the answer is A."

providers:
  - id: openai:chat:gpt-5.4
    config:
      max_completion_tokens: 1200
  - id: openai:chat:gpt-5.4-mini
    config:
      max_completion_tokens: 1200
  - id: openai:chat:gpt-5.4-nano
    config:
      max_completion_tokens: 1200

defaultTest:
  assert:
    - type: latency
      threshold: 60000
    - type: regex
      value: 'Therefore, the answer is [A-J]'
    - type: javascript
      value: |
        const match = String(output).match(/Therefore,\s*the\s*answer\s*is\s*([A-J])/i);
        return match?.[1]?.toUpperCase() === String(context.vars.answer).trim().toUpperCase();

tests:
  - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=100
```
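The Nunjucks loop in the prompt indexes into the string `"ABCDEFGHIJ"` to label up to 10 options. The same labeling can be mirrored in plain JavaScript, which is handy when debugging how a prompt renders; this is an illustrative sketch, not part of the config:

```javascript
// Label each option A), B), ... exactly as the Nunjucks template does.
const options = ['a gas', 'a liquid', 'a solid'];
const labeled = options.map((opt, i) => `${'ABCDEFGHIJ'[i]}) ${opt}`);
console.log(labeled.join('\n'));
// A) a gas
// B) a liquid
// C) a solid
```

Because MMLU-Pro items have up to 10 options, the 10-character label string covers every case.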
Increase the sample size for a broader benchmark pass:

```yaml
tests:
  - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=200
```
When evaluating full, mini, and nano GPT model tiers on MMLU-Pro, look for differences in accuracy across subject categories, latency, and cost per question.
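Latency and cost can be enforced directly in the eval rather than inspected after the fact. As a sketch, promptfoo's built-in `latency` and `cost` assertion types fail any response that is too slow or too expensive (the thresholds below are illustrative, not recommendations):

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 60000 # max milliseconds per response
    - type: cost
      threshold: 0.01 # max USD per response
```

This turns the tier comparison into a pass/fail budget check: a cheaper tier that meets both thresholds while keeping accuracy close may be the better deployment choice.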
Ready to go deeper? Try these advanced techniques: