skills/mlops/evaluation/lm-evaluation-harness/references/benchmark-guide.md
Complete guide to all 60+ evaluation tasks in lm-evaluation-harness, what they measure, and how to interpret results.
The lm-evaluation-harness includes 60+ benchmarks spanning:
List all tasks:
lm_eval --tasks list
What it measures: Broad knowledge across 57 subjects (STEM, humanities, social sciences, law).
Task variants:
mmlu: Original 57-subject benchmark
mmlu_pro: More challenging version with reasoning-focused questions
mmlu_prox: Multilingual extension
Format: Multiple choice (4 options)
Example:
Question: What is the capital of France?
A. Berlin
B. Paris
C. London
D. Madrid
Answer: B
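The layout above can be reproduced with a few lines of Python. This is a minimal sketch for illustration only: `format_mc_prompt` and `accuracy` are hypothetical helper names, not part of lm-evaluation-harness, and the harness may score multiple-choice tasks differently (for instance by comparing the likelihood of each answer choice rather than parsing a generated letter).

```python
from typing import Sequence

LETTERS = "ABCD"


def format_mc_prompt(question: str, options: Sequence[str]) -> str:
    """Render a 4-option question in the layout shown above, ending with 'Answer:'."""
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly 4 options")
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted letters that match the gold letters (case-insensitive)."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold))
    return correct / len(gold)


prompt = format_mc_prompt(
    "What is the capital of France?", ["Berlin", "Paris", "London", "Madrid"]
)
print(prompt)
```

Accuracy here is simply the share of questions where the chosen letter equals the reference letter, which is why a 4-option benchmark has a 25% random-guess floor.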
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--num_fewshot 5
Interpretation:
Good for: Assessing general knowledge and domain expertise.
What it measures: Mathematical reasoning on grade-school level word problems.
Task variants:
gsm8k: Base task
gsm8k_cot: With chain-of-thought prompting
gsm_plus: Adversarial variant with perturbations
Format: Free-form generation, extract numerical answer
Example:
Question: A baker made 200 cookies. He sold 3/5 of them in the morning and 1/4 of the remaining in the afternoon. How many cookies does he have left?
Answer: 60
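The answer of 60 follows from two steps: 3/5 of 200 is 120, leaving 80, and 1/4 of 80 is 20, leaving 60. The sketch below checks that arithmetic and shows one simple way to pull a numerical answer out of free-form text. `cookies_left` and `extract_last_number` are hypothetical helpers, and the "take the last number" rule is an assumption made for illustration; the harness's own answer-extraction filters may differ.

```python
import re
from fractions import Fraction
from typing import Optional


def cookies_left(total: int) -> Fraction:
    """Work through the example: sell 3/5, then 1/4 of what remains."""
    sold_morning = Fraction(3, 5) * total        # 120 of 200
    remaining = total - sold_morning             # 80
    sold_afternoon = Fraction(1, 4) * remaining  # 20
    return remaining - sold_afternoon


def extract_last_number(text: str) -> Optional[str]:
    """Return the last number found in a generation, without thousands separators."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


generation = "He sold 120 in the morning and 20 in the afternoon, so 60 cookies are left."
print(cookies_left(200), extract_last_number(generation))
```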
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks gsm8k \
--num_fewshot 5
Interpretation:
Good for: Testing multi-step reasoning and arithmetic.
What it measures: Python code generation from docstrings (functional correctness).
Task variants:
humaneval: Standard benchmark
humaneval_instruct: For instruction-tuned models
Format: Code generation, execution-based evaluation
Example:
def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each other than
given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""
Command:
lm_eval --model hf \
--model_args pretrained=codellama/CodeLlama-7b-hf \
--tasks humaneval \
--batch_size 1
Interpretation:
Good for: Evaluating code generation capabilities.
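To make "execution-based evaluation" concrete, here is a sketch of one possible completion for the `has_close_elements` prompt above, together with a tiny checker built from the two doctest cases in that prompt. The `check` function is a hypothetical stand-in for the benchmark's hidden unit tests, which are more extensive.

```python
from typing import Callable, List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Return True if any two elements are closer to each other than threshold."""
    ordered = sorted(numbers)
    # After sorting, the closest pair is always adjacent.
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))


def check(candidate: Callable[[List[float], float], bool]) -> bool:
    """Run the candidate against the two examples given in the prompt's docstring."""
    cases = [
        (([1.0, 2.0, 3.0], 0.5), False),
        (([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3), True),
    ]
    return all(candidate(*args) == expected for args, expected in cases)


print(check(has_close_elements))
```

A completion counts as correct only if it passes every test when executed, so a solution that looks plausible but fails a single case scores zero for that problem.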
What it measures: 23 challenging BIG-Bench tasks on which earlier language models did not outperform the average human rater.
Categories:
Format: Multiple choice and free-form
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks bbh \
--num_fewshot 3
Interpretation:
Good for: Testing advanced reasoning capabilities.
What it measures: Ability to follow specific, verifiable instructions.
Instruction types:
Format: Free-form generation with rule-based verification
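Rule-based verification means each instruction maps to a deterministic check on the response, with no judge model involved. The sketch below illustrates the idea with three checks chosen purely as examples; the function names and the specific rules are assumptions, not the benchmark's actual instruction set or the harness's implementation.

```python
from typing import Callable, Dict


def min_words(n: int) -> Callable[[str], bool]:
    """Build a check that passes when the response has at least n words."""
    return lambda response: len(response.split()) >= n


def no_commas(response: str) -> bool:
    """Pass when the response contains no comma."""
    return "," not in response


def all_lowercase(response: str) -> bool:
    """Pass when the response contains no uppercase letters."""
    return response == response.lower()


def verify(response: str, checks: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Apply every named check and report pass/fail per instruction."""
    return {name: check(response) for name, check in checks.items()}


checks = {"min_words_5": min_words(5), "no_commas": no_commas, "lowercase": all_lowercase}
result = verify("this reply follows every rule given", checks)
print(result, all(result.values()))
```

Reporting per-instruction results alongside the all-pass flag makes it possible to distinguish a response that misses one constraint from one that ignores them all.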
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
--tasks ifeval \
--batch_size auto
Interpretation:
Good for: Evaluating chat/instruct models.
What it measures: Natural language understanding across 9 tasks.
Tasks:
cola: Grammatical acceptability
sst2: Sentiment analysis
mrpc: Paraphrase detection
qqp: Question pairs
stsb: Semantic similarity
mnli: Natural language inference
qnli: Question answering NLI
rte: Recognizing textual entailment
wnli: Winograd schemas
Command:
lm_eval --model hf \
--model_args pretrained=bert-base-uncased \
--tasks glue \
--num_fewshot 0
Interpretation:
Good for: Encoder-only models, fine-tuning baselines.
What it measures: Long-context understanding (4K-32K tokens).
21 tasks covering:
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks longbench \
--batch_size 1
Interpretation:
Good for: Evaluating long-context models.
What it measures: Model's propensity to be truthful vs. generate plausible-sounding falsehoods.
Format: Multiple choice with 4-5 options
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks truthfulqa_mc2 \
--batch_size auto
Interpretation:
What it measures: Grade-school science questions.
Variants:
arc_easy: Easier questions
arc_challenge: Harder questions requiring reasoning
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks arc_challenge \
--num_fewshot 25
Interpretation:
What it measures: Commonsense reasoning about everyday situations.
Format: Choose most plausible continuation
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks hellaswag \
--num_fewshot 10
Interpretation:
What it measures: Commonsense reasoning via pronoun resolution.
Example:
The trophy doesn't fit in the brown suitcase because _ is too large.
A. the trophy
B. the suitcase
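One common way to evaluate a fill-in-the-blank item like this is to substitute each option for the blank and ask the model which completed sentence is more plausible. The sketch below shows only that substitution and selection step. `fill_blank` and `pick_most_plausible` are hypothetical names, and the `score` callable stands in for a model's likelihood, which is assumed here rather than taken from the harness.

```python
from typing import Callable, List, Sequence


def fill_blank(sentence: str, options: Sequence[str]) -> List[str]:
    """Substitute each option for the first '_' placeholder in the sentence."""
    if "_" not in sentence:
        raise ValueError("sentence has no '_' placeholder")
    return [sentence.replace("_", option, 1) for option in options]


def pick_most_plausible(candidates: Sequence[str], score: Callable[[str], float]) -> int:
    """Return the index of the candidate that the scoring function rates highest."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


candidates = fill_blank(
    "The trophy doesn't fit in the brown suitcase because _ is too large.",
    ["the trophy", "the suitcase"],
)
for candidate in candidates:
    print(candidate)
```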
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks winogrande \
--num_fewshot 5
What it measures: Physical commonsense reasoning.
Example: "To clean a keyboard, use compressed air or..."
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks piqa
What it measures: Performance across 64 African languages.
15 tasks: NLU, text generation, knowledge, QA, math reasoning
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks afrobench
What it measures: Norwegian language understanding (9 task categories).
Command:
lm_eval --model hf \
--model_args pretrained=NbAiLab/nb-gpt-j-6B \
--tasks noreval
What it measures: High-school competition math problems.
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks math \
--num_fewshot 4
Interpretation:
What it measures: Python programming from natural language descriptions.
Command:
lm_eval --model hf \
--model_args pretrained=codellama/CodeLlama-7b-hf \
--tasks mbpp \
--batch_size 1
What it measures: Reading comprehension requiring discrete reasoning.
Command:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks drop
Run this suite:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu,gsm8k,hellaswag,arc_challenge,truthfulqa_mc2 \
--num_fewshot 5
lm_eval --model hf \
--model_args pretrained=codellama/CodeLlama-7b-hf \
--tasks humaneval,mbpp \
--batch_size 1
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
--tasks ifeval,mmlu,gsm8k_cot \
--batch_size auto
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B \
--tasks longbench \
--batch_size 1
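When the same suites are run repeatedly, it can help to assemble the command lines programmatically. The sketch below builds the argument list using only the flags that appear in the commands above; `build_lm_eval_command` is a hypothetical helper, and it assumes the `hf` model type throughout.

```python
import shlex
from typing import List, Optional, Sequence, Union


def build_lm_eval_command(
    pretrained: str,
    tasks: Sequence[str],
    num_fewshot: Optional[int] = None,
    batch_size: Optional[Union[int, str]] = None,
) -> List[str]:
    """Assemble an lm_eval invocation as an argument list (nothing is executed)."""
    if not tasks:
        raise ValueError("at least one task is required")
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    if batch_size is not None:
        cmd += ["--batch_size", str(batch_size)]
    return cmd


code_suite = build_lm_eval_command(
    "codellama/CodeLlama-7b-hf", ["humaneval", "mbpp"], batch_size=1
)
print(shlex.join(code_suite))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting concerns.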
Accuracy: Percentage of correct answers (most common)
Exact Match (EM): Requires exact string match (strict)
F1 Score: Balances precision and recall
BLEU/ROUGE: N-gram overlap between generated text and reference text
Pass@k: Probability that at least one of k generated samples passes the unit tests
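The definitions above can be sketched in a few lines of Python. These are simplified illustrations, not the harness's implementations: real scorers usually normalize text (case, punctuation, articles) before comparing, and the function names here are hypothetical. The `pass_at_k` function uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples were generated and c of them passed.

```python
from collections import Counter
from math import comb


def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return prediction.strip() == reference.strip()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (whitespace tokens)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes, given c of n passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(exact_match(" 60 ", "60"))
print(token_f1("the cat sat", "the cat slept"))
print(pass_at_k(n=10, c=3, k=1))
```

Generating more samples than k (n > k) and applying the estimator gives a lower-variance figure than generating exactly k samples and checking whether any passed.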
| Model Size | MMLU | GSM8K | HumanEval | HellaSwag |
|---|---|---|---|---|
| 7B | 40-50% | 10-20% | 5-15% | 70-80% |
| 13B | 45-55% | 20-35% | 15-25% | 75-82% |
| 70B | 60-70% | 50-65% | 35-50% | 82-87% |
| GPT-4 | 86% | 92% | 67% | 95% |
lm_eval --tasks list
lm_eval/tasks/README.md