Benchmark Guide

Complete guide to all 60+ evaluation tasks in lm-evaluation-harness, what they measure, and how to interpret results.

Overview

The lm-evaluation-harness includes 60+ benchmarks spanning:

Language understanding (MMLU, GLUE)
Mathematical reasoning (GSM8K, MATH)
Code generation (HumanEval, MBPP)
Instruction following (IFEval, AlpacaEval)
Long-context understanding (LongBench)
Multilingual capabilities (AfroBench, NorEval)
Reasoning (BBH, ARC)
Truthfulness (TruthfulQA)

List all tasks:

bash

lm_eval --tasks list

Major Benchmarks

MMLU (Massive Multitask Language Understanding)

What it measures: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law).

Task variants:

mmlu: Original 57-subject benchmark
mmlu_pro: More challenging version with reasoning-focused questions
mmlu_prox: Multilingual extension

Format: Multiple choice (4 options)

Example:

Question: What is the capital of France?
A. Berlin
B. Paris
C. London
D. Madrid
Answer: B

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --num_fewshot 5

Interpretation:

Random: 25% (chance)
GPT-3 (175B): 43.9%
GPT-4: 86.4%
Human expert: ~90%

Good for: Assessing general knowledge and domain expertise.

GSM8K (Grade School Math 8K)

What it measures: Mathematical reasoning on grade-school level word problems.

Task variants:

gsm8k: Base task
gsm8k_cot: With chain-of-thought prompting
gsm_plus: Adversarial variant with perturbations

Format: Free-form generation, extract numerical answer

Example:

Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left?
Answer: 60

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks gsm8k \
  --num_fewshot 5

Interpretation:

Random: ~0%
GPT-3 (175B): 17.0%
GPT-4: 92.0%
Llama 2 70B: 56.8%

Good for: Testing multi-step reasoning and arithmetic.

HumanEval

What it measures: Python code generation from docstrings (functional correctness).

Task variants:

humaneval: Standard benchmark
humaneval_instruct: For instruction-tuned models

Format: Code generation, execution-based evaluation

Example:

python

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

Command:

bash

lm_eval --model hf \
  --model_args pretrained=codellama/CodeLlama-7b-hf \
  --tasks humaneval \
  --batch_size 1

Interpretation:

Random: 0%
GPT-3 (175B): 0%
Codex: 28.8%
GPT-4: 67.0%
Code Llama 34B: 53.7%

Good for: Evaluating code generation capabilities.

BBH (BIG-Bench Hard)

What it measures: 23 challenging reasoning tasks where models previously failed to beat humans.

Categories:

Logical reasoning
Math word problems
Social understanding
Algorithmic reasoning

Format: Multiple choice and free-form

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks bbh \
  --num_fewshot 3

Interpretation:

Random: ~25%
GPT-3 (175B): 33.9%
PaLM 540B: 58.3%
GPT-4: 86.7%

Good for: Testing advanced reasoning capabilities.

IFEval (Instruction-Following Evaluation)

What it measures: Ability to follow specific, verifiable instructions.

Instruction types:

Format constraints (e.g., "answer in 3 sentences")
Length constraints (e.g., "use at least 100 words")
Content constraints (e.g., "include the word 'banana'")
Structural constraints (e.g., "use bullet points")

Format: Free-form generation with rule-based verification

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
  --tasks ifeval \
  --batch_size auto

Interpretation:

Measures: Instruction adherence (not quality)
GPT-4: 86% instruction following
Claude 2: 84%

Good for: Evaluating chat/instruct models.

GLUE (General Language Understanding Evaluation)

What it measures: Natural language understanding across 9 tasks.

Tasks:

cola: Grammatical acceptability
sst2: Sentiment analysis
mrpc: Paraphrase detection
qqp: Question pairs
stsb: Semantic similarity
mnli: Natural language inference
qnli: Question answering NLI
rte: Recognizing textual entailment
wnli: Winograd schemas

Command:

bash

lm_eval --model hf \
  --model_args pretrained=bert-base-uncased \
  --tasks glue \
  --num_fewshot 0

Interpretation:

BERT Base: 78.3 (GLUE score)
RoBERTa Large: 88.5
Human baseline: 87.1

Good for: Encoder-only models, fine-tuning baselines.

LongBench

What it measures: Long-context understanding (4K-32K tokens).

21 tasks covering:

Single-document QA
Multi-document QA
Summarization
Few-shot learning
Code completion
Synthetic tasks

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks longbench \
  --batch_size 1

Interpretation:

Tests context utilization
Many models struggle beyond 4K tokens
GPT-4 Turbo: 54.3%

Good for: Evaluating long-context models.

Additional Benchmarks

TruthfulQA

What it measures: Model's propensity to be truthful vs. generate plausible-sounding falsehoods.

Format: Multiple choice with 4-5 options

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks truthfulqa_mc2 \
  --batch_size auto

Interpretation:

Larger models often score worse (more convincing lies)
GPT-3: 58.8%
GPT-4: 59.0%
Human: ~94%

ARC (AI2 Reasoning Challenge)

What it measures: Grade-school science questions.

Variants:

arc_easy: Easier questions
arc_challenge: Harder questions requiring reasoning

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks arc_challenge \
  --num_fewshot 25

Interpretation:

ARC-Easy: Most models >80%
ARC-Challenge random: 25%
GPT-4: 96.3%

HellaSwag

What it measures: Commonsense reasoning about everyday situations.

Format: Choose most plausible continuation

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks hellaswag \
  --num_fewshot 10

Interpretation:

Random: 25%
GPT-3: 78.9%
Llama 2 70B: 85.3%

WinoGrande

What it measures: Commonsense reasoning via pronoun resolution.

Example:

The trophy doesn't fit in the brown suitcase because _ is too large.
A. the trophy
B. the suitcase

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks winogrande \
  --num_fewshot 5

PIQA

What it measures: Physical commonsense reasoning.

Example: "To clean a keyboard, use compressed air or..."

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks piqa

Multilingual Benchmarks

AfroBench

What it measures: Performance across 64 African languages.

15 tasks: NLU, text generation, knowledge, QA, math reasoning

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks afrobench

NorEval

What it measures: Norwegian language understanding (9 task categories).

Command:

bash

lm_eval --model hf \
  --model_args pretrained=NbAiLab/nb-gpt-j-6B \
  --tasks noreval

Domain-Specific Benchmarks

MATH

What it measures: High-school competition math problems.

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks math \
  --num_fewshot 4

Interpretation:

Very challenging
GPT-4: 42.5%
Minerva 540B: 33.6%

MBPP (Mostly Basic Python Problems)

What it measures: Python programming from natural language descriptions.

Command:

bash

lm_eval --model hf \
  --model_args pretrained=codellama/CodeLlama-7b-hf \
  --tasks mbpp \
  --batch_size 1

DROP

What it measures: Reading comprehension requiring discrete reasoning.

Command:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks drop

Benchmark Selection Guide

For General Purpose Models

Run this suite:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \
  --num_fewshot 5

For Code Models

bash

lm_eval --model hf \
  --model_args pretrained=codellama/CodeLlama-7b-hf \
  --tasks humaneval,mbpp \
  --batch_size 1

For Chat/Instruct Models

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
  --tasks ifeval,mmlu,gsm8k_cot \
  --batch_size auto

For Long Context Models

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.1-8B \
  --tasks longbench \
  --batch_size 1

Interpreting Results

Understanding Metrics

Accuracy: Percentage of correct answers (most common)

Exact Match (EM): Requires exact string match (strict)

F1 Score: Balances precision and recall

BLEU/ROUGE: Text generation similarity

Pass@k: Percentage passing when generating k samples

Typical Score Ranges

Model Size	MMLU	GSM8K	HumanEval	HellaSwag
7B	40-50%	10-20%	5-15%	70-80%
13B	45-55%	20-35%	15-25%	75-82%
70B	60-70%	50-65%	35-50%	82-87%
GPT-4	86%	92%	67%	95%

Red Flags

All tasks at random chance: Model not trained properly
Exact 0% on generation tasks: Likely format/parsing issue
Huge variance across runs: Check seed/sampling settings
Better than GPT-4 on everything: Likely contamination

Best Practices

Always report few-shot setting: 0-shot, 5-shot, etc.
Run multiple seeds: Report mean ± std
Check for data contamination: Search training data for benchmark examples
Compare to published baselines: Validate your setup
Report all hyperparameters: Model, batch size, max tokens, temperature

References

Task list: lm_eval --tasks list
Task README: lm_eval/tasks/README.md
Papers: See individual benchmark papers