# Compare GPT Model Tiers on MMLU-Pro (`compare-gpt-model-tiers-mmlu-pro`)

You can run this example with:

```sh
npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
cd compare-gpt-model-tiers-mmlu-pro
```
This example demonstrates how to benchmark full, mini, and nano OpenAI GPT model tiers using MMLU-Pro, a more challenging successor to MMLU with up to 10 answer options per question.
## Prerequisites

- The promptfoo CLI (`npm install -g promptfoo` or `brew install promptfoo`)
- An OpenAI API key, set as the `OPENAI_API_KEY` environment variable

For higher rate limits or private datasets, authenticate with Hugging Face:
1. Create a Hugging Face account at [huggingface.co](https://huggingface.co) if you don't have one
2. Generate an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. Set your token as an environment variable:

   ```sh
   export HF_TOKEN=your_token_here
   ```

   Or add it to your `.env` file:

   ```
   HF_TOKEN=your_token_here
   ```
## Running the Example

1. Get a local copy of the example:

   ```sh
   npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
   cd compare-gpt-model-tiers-mmlu-pro
   ```

2. Run the evaluation:

   ```sh
   npx promptfoo@latest eval
   ```

3. View the results:

   ```sh
   npx promptfoo@latest view
   ```
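If you want to analyze the per-tier scores outside the web viewer, the promptfoo CLI can also write results to a file with its standard `--output` flag (general CLI behavior, not specific to this example):

```sh
npx promptfoo@latest eval -o results.csv
```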
## How It Works

This comparison evaluates all three models on 100 MMLU-Pro questions spanning many subject categories.
The configuration in `promptfooconfig.yaml` includes (see the sketch below):

- All three GPT model tiers as providers
- A prompt that asks each model to reason step by step and conclude with `Therefore, the answer is (X)`
- An assertion that compares the model's chosen letter against the dataset's `answer` field
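For orientation, here is a minimal sketch of what such a configuration can look like. The model IDs (`gpt-5`, `gpt-5-mini`, `gpt-5-nano`), the exact prompt wording, and the regex-based grading are illustrative assumptions, not the example's literal contents:

```yaml
# Minimal sketch, not the example's exact config.
# Model IDs, prompt wording, and the grading regex are assumptions.
description: Compare GPT model tiers on MMLU-Pro

prompts:
  - |
    Answer the following multiple-choice question. Think step by step,
    then finish with: "Therefore, the answer is (X)".

    Question: {{question}}
    Options: {{options}}

providers:
  - openai:chat:gpt-5
  - openai:chat:gpt-5-mini
  - openai:chat:gpt-5-nano

defaultTest:
  assert:
    # Compare the model's final letter to the dataset's `answer` field (A-J).
    - type: javascript
      value: |
        const m = output.match(/answer is \(?([A-J])\)?/i);
        return Boolean(m) && m[1].toUpperCase() === String(context.vars.answer).trim().toUpperCase();

tests:
  - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=100
```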
## Customization

You can modify the test by editing `promptfooconfig.yaml`.

Evaluate more MMLU-Pro questions by raising the `limit` parameter:

```yaml
tests:
  - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=250
```
Adjust model parameters:

```yaml
providers:
  - id: openai:chat:gpt-5
    config:
      max_completion_tokens: 1500
```