# compare-gpt-model-tiers-mmlu-pro (GPT Model Tiers MMLU-Pro Comparison)

You can run this example with:

```bash
npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
cd compare-gpt-model-tiers-mmlu-pro
```

This example demonstrates how to benchmark full, mini, and nano OpenAI GPT model tiers using MMLU-Pro, a more challenging successor to MMLU with up to 10 answer options per question.
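
Concretely, the comparison pins one model per tier in the `providers` list of `promptfooconfig.yaml`. Below is a minimal sketch of that lineup, assuming the GPT-5 family tier names; check the shipped config for the exact model IDs:

```yaml
# Assumed tier lineup; substitute the model IDs from the example's actual config
providers:
  - id: openai:chat:gpt-5      # full tier
  - id: openai:chat:gpt-5-mini # mini tier
  - id: openai:chat:gpt-5-nano # nano tier
```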

## Prerequisites

- promptfoo CLI installed (`npm install -g promptfoo` or `brew install promptfoo`)
- OpenAI API key set as `OPENAI_API_KEY`
- Hugging Face account and access token (optional; the MMLU-Pro dataset is public, but a token gives higher rate limits)

## Hugging Face Authentication

For higher rate limits or private datasets, authenticate with Hugging Face:

1. Create a Hugging Face account at huggingface.co if you don't have one

2. Generate an access token at huggingface.co/settings/tokens

3. Set your token as an environment variable:

   ```bash
   export HF_TOKEN=your_token_here
   ```

   Or add it to your `.env` file:

   ```env
   HF_TOKEN=your_token_here
   ```
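
If you want to confirm the token is valid before running the eval, the Hugging Face CLI can check it (this assumes you have the `huggingface_hub` package installed; it is not required by the example itself):

```bash
# Optional sanity check: prints your username when HF_TOKEN is valid
huggingface-cli whoami
```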

## Running the Eval

1. Get a local copy of the example (including `promptfooconfig.yaml`):

   ```bash
   npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
   cd compare-gpt-model-tiers-mmlu-pro
   ```

2. Run the evaluation:

   ```bash
   npx promptfoo@latest eval
   ```

3. View the results:

   ```bash
   npx promptfoo@latest view
   ```
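
To keep a copy of the results outside the web viewer, `eval` can also write them to a file; the format is inferred from the extension:

```bash
# Save results to JSON (CSV and HTML outputs work the same way)
npx promptfoo@latest eval --output results.json
```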

## What's Being Tested

This comparison evaluates all three model tiers on the same set of 100 MMLU-Pro questions, which spans many subject categories.
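
In the config, that question set comes from a single dataset-backed `tests` entry; with the question count above it looks like this:

```yaml
tests:
  - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=100
```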

### Test Structure

The configuration in `promptfooconfig.yaml` includes:

1. Prompt Template: Renders all available MMLU-Pro options dynamically and asks for a final answer in a fixed format
2. Quality Checks (sketched after this list):
   - 60-second timeout per question
   - Required final answer format (`Therefore, the answer is X`)
   - Deterministic JavaScript scoring that compares the parsed final letter against the dataset's `answer` field
3. Model Configuration:
   - 1200 max completion tokens, enough for concise reasoning plus the final answer
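
One plausible way to express those quality checks as promptfoo assertions is sketched below. This is not the exact config shipped with the example: the regex, the latency threshold (which approximates the 60-second timeout as a post-hoc latency check), and the `answer` variable name are assumptions:

```yaml
defaultTest:
  assert:
    # Flag any answer that took longer than 60 seconds
    - type: latency
      threshold: 60000
    # Require the fixed final-answer phrasing
    - type: contains
      value: 'Therefore, the answer is'
    # Deterministic scoring: parse the final letter, compare to the answer field
    - type: javascript
      value: |
        const match = output.match(/Therefore, the answer is \(?([A-J])\)?/);
        return Boolean(match) && match[1] === String(context.vars.answer);
```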

## Customizing

You can modify the evaluation by editing `promptfooconfig.yaml`:

1. Change the number of questions (for example, 250 instead of the default 100; if you scale up, see the concurrency note after this list):

   ```yaml
   tests:
     - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=250
   ```

2. Adjust model parameters:

   ```yaml
   providers:
     - id: openai:chat:gpt-5
       config:
         max_completion_tokens: 1500
   ```
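
If you raise the question count substantially, you may run into OpenAI rate limits, since every question goes to all three tiers. You can throttle parallel requests with promptfoo's `evaluateOptions` (the value of 4 here is an arbitrary example):

```yaml
evaluateOptions:
  # Cap the number of concurrent API calls
  maxConcurrency: 4
```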

## Additional Resources