# compare-gpt-model-tiers-mmlu-pro (GPT Model Tiers MMLU-Pro Comparison)

You can run this example with:

```bash
npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
cd compare-gpt-model-tiers-mmlu-pro
```

This example demonstrates how to benchmark full, mini, and nano OpenAI GPT model tiers using MMLU-Pro, a more challenging successor to MMLU with up to 10 answer options per question.
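
Concretely, the comparison pins one model per tier in the `providers` list of `promptfooconfig.yaml`. Below is a minimal sketch of that lineup, assuming the GPT-5 family tier names; check the shipped config for the exact model IDs:

```yaml
# Assumed tier lineup; substitute the model IDs from the example's actual config
providers:
  - id: openai:chat:gpt-5      # full tier
  - id: openai:chat:gpt-5-mini # mini tier
  - id: openai:chat:gpt-5-nano # nano tier
```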

## Prerequisites

- promptfoo CLI installed (`npm install -g promptfoo` or `brew install promptfoo`)
- OpenAI API key set as `OPENAI_API_KEY`
- Hugging Face account and access token (optional; the MMLU-Pro dataset is public, but a token gives higher rate limits)

## Hugging Face Authentication

For higher rate limits or private datasets, authenticate with Hugging Face:

1. Create a Hugging Face account at huggingface.co if you don't have one

2. Generate an access token at huggingface.co/settings/tokens

3. Set your token as an environment variable:

   ```bash
   export HF_TOKEN=your_token_here
   ```

   Or add it to your `.env` file:

   ```env
   HF_TOKEN=your_token_here
   ```
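
If you want to confirm the token is valid before running the eval, the Hugging Face CLI can check it (this assumes you have the `huggingface_hub` package installed; it is not required by the example itself):

```bash
# Optional sanity check: prints your username when HF_TOKEN is valid
huggingface-cli whoami
```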

## Running the Eval

1. Get a local copy of the example (including `promptfooconfig.yaml`):

   ```bash
   npx promptfoo@latest init --example compare-gpt-model-tiers-mmlu-pro
   cd compare-gpt-model-tiers-mmlu-pro
   ```

2. Run the evaluation:

   ```bash
   npx promptfoo@latest eval
   ```

3. View the results:

   ```bash
   npx promptfoo@latest view
   ```
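
To keep a copy of the results outside the web viewer, `eval` can also write them to a file; the format is inferred from the extension:

```bash
# Save results to JSON (CSV and HTML outputs work the same way)
npx promptfoo@latest eval --output results.json
```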

## What's Being Tested

This comparison evaluates all three model tiers on the same set of 100 MMLU-Pro questions, which spans many subject categories.
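
In the config, that question set comes from a single dataset-backed `tests` entry; with the question count above it looks like this:

```yaml
tests:
  - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=100
```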

### Test Structure

The configuration in `promptfooconfig.yaml` includes:

1. Prompt Template: Renders all available MMLU-Pro options dynamically and asks for a final answer in a fixed format
2. Quality Checks (sketched after this list):
   - 60-second timeout per question
   - Required final answer format (`Therefore, the answer is X`)
   - Deterministic JavaScript scoring that compares the parsed final letter against the dataset's `answer` field
3. Model Configuration:
   - 1200 max completion tokens, enough for concise reasoning plus the final answer
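
One plausible way to express those quality checks as promptfoo assertions is sketched below. This is not the exact config shipped with the example: the regex, the latency threshold (which approximates the 60-second timeout as a post-hoc latency check), and the `answer` variable name are assumptions:

```yaml
defaultTest:
  assert:
    # Flag any answer that took longer than 60 seconds
    - type: latency
      threshold: 60000
    # Require the fixed final-answer phrasing
    - type: contains
      value: 'Therefore, the answer is'
    # Deterministic scoring: parse the final letter, compare to the answer field
    - type: javascript
      value: |
        const match = output.match(/Therefore, the answer is \(?([A-J])\)?/);
        return Boolean(match) && match[1] === String(context.vars.answer);
```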

## Customizing

You can modify the evaluation by editing `promptfooconfig.yaml`:

1. Change the number of questions (for example, 250 instead of the default 100; if you scale up, see the concurrency note after this list):

   ```yaml
   tests:
     - huggingface://datasets/TIGER-Lab/MMLU-Pro?split=test&config=default&limit=250
   ```

2. Adjust model parameters:

   ```yaml
   providers:
     - id: openai:chat:gpt-5
       config:
         max_completion_tokens: 1500
   ```
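
If you raise the question count substantially, you may run into OpenAI rate limits, since every question goes to all three tiers. You can throttle parallel requests with promptfoo's `evaluateOptions` (the value of 4 here is an arbitrary example):

```yaml
evaluateOptions:
  # Cap the number of concurrent API calls
  maxConcurrency: 4
```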

## Additional Resources