# Humanity's Last Exam (HLE)
Evaluate LLMs against Humanity's Last Exam (HLE), a challenging benchmark created by 1,000+ experts across 500+ institutions. HLE features 3,000+ questions spanning 100+ subjects, designed to push AI capabilities to their limits.
📖 Read the complete HLE benchmark guide →
You can run this example with:

```sh
npx promptfoo@latest init --example huggingface/hle
cd huggingface/hle
```
Set the API keys for the providers you want to evaluate (for example `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`), then set your Hugging Face token:

```sh
export HF_TOKEN=your_token_here
```
Or add it to your `.env` file:

```
HF_TOKEN=your_token_here
```
Get your token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
Run the evaluation:

```sh
npx promptfoo@latest eval
```
View results:

```sh
npx promptfoo@latest view
```
This evaluation tests models on expert-written questions spanning HLE's 100+ subjects.
Each question is evaluated for accuracy using an LLM judge that compares the model's response against the verified correct answer.
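
This kind of check maps naturally onto a model-graded assertion in promptfoo. The sketch below is illustrative only, not this example's exact configuration: the rubric wording and the `{{answer}}` variable (assumed to hold the dataset's verified answer) are assumptions.

```yaml
# Illustrative sketch - the real grading setup lives in this
# example's promptfooconfig.yaml and may differ.
defaultTest:
  assert:
    - type: llm-rubric
      # Assumes each test case exposes the verified answer as {{answer}}.
      value: |
        The response's final answer must match the correct answer: {{answer}}
```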
HLE is designed to be extremely challenging, and low scores are expected: even recent frontier models answer only a small fraction of questions correctly. This benchmark represents the cutting edge of AI evaluation.
Increase the sample size:

```yaml
tests:
  - huggingface://datasets/cais/hle?split=test&limit=100
```
Compare multiple providers:

```yaml
providers:
  - anthropic:claude-sonnet-4-6
  - openai:o4-mini
  - deepseek:deepseek-reasoner
```
Try alternative prompting strategies by modifying `prompt.py` or using static prompts:

```yaml
prompts:
  - 'Answer this question step by step: {{question}}'
  - file://prompt.py:create_hle_prompt
```
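
If you modify `prompt.py`, note that a promptfoo Python prompt function receives a `context` dict containing the test case's variables and returns the prompt text. Here is a minimal sketch, assuming the question is exposed as a `question` variable; it does not reproduce this example's actual `create_hle_prompt`:

```python
# prompt.py - minimal sketch of a promptfoo prompt function.

def create_hle_prompt(context: dict) -> str:
    """Build an HLE prompt from the test case's variables."""
    # Assumption: the dataset row's question is mapped to the `question` var.
    question = context["vars"]["question"]
    return (
        "Reason step by step, then give your final answer on its own line "
        "prefixed with 'Answer:'.\n\n"
        f"Question: {question}"
    )
```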