# Humanity's Last Exam (HLE)
Evaluate LLMs against Humanity's Last Exam (HLE), a challenging benchmark created by 1,000+ experts across 500+ institutions. HLE features 3,000+ questions spanning 100+ subjects, designed to push AI capabilities to their limits.
📖 Read the complete HLE benchmark guide →
You can run this example with:

```sh
npx promptfoo@latest init --example huggingface/hle
cd huggingface/hle
```
Set the API keys for the providers you want to evaluate (for example `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`), then set your Hugging Face token:

```sh
export HF_TOKEN=your_token_here
```
Or add it to your `.env` file:

```
HF_TOKEN=your_token_here
```
Get your token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
Run the evaluation:

```sh
npx promptfoo@latest eval
```
View results:

```sh
npx promptfoo@latest view
```
This evaluation tests models on expert-written questions spanning HLE's 100+ subjects.
Each question is evaluated for accuracy using an LLM judge that compares the model's response against the verified correct answer.
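
This kind of check maps naturally onto a model-graded assertion in promptfoo. The sketch below is illustrative only, not this example's exact configuration: the rubric wording and the `{{answer}}` variable (assumed to hold the dataset's verified answer) are assumptions.

```yaml
# Illustrative sketch - the real grading setup lives in this
# example's promptfooconfig.yaml and may differ.
defaultTest:
  assert:
    - type: llm-rubric
      # Assumes each test case exposes the verified answer as {{answer}}.
      value: |
        The response's final answer must match the correct answer: {{answer}}
```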
HLE is designed to be extremely challenging, and low scores are expected: even recent frontier models answer only a small fraction of questions correctly. This benchmark represents the cutting edge of AI evaluation.
Increase the sample size:

```yaml
tests:
  - huggingface://datasets/cais/hle?split=test&limit=100
```
Compare multiple providers:

```yaml
providers:
  - anthropic:claude-sonnet-4-6
  - openai:o4-mini
  - deepseek:deepseek-reasoner
```
Try alternative prompting strategies by modifying `prompt.py` or using static prompts:

```yaml
prompts:
  - 'Answer this question step by step: {{question}}'
  - file://prompt.py:create_hle_prompt
```
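
If you modify `prompt.py`, note that a promptfoo Python prompt function receives a `context` dict containing the test case's variables and returns the prompt text. Here is a minimal sketch, assuming the question is exposed as a `question` variable; it does not reproduce this example's actual `create_hle_prompt`:

```python
# prompt.py - minimal sketch of a promptfoo prompt function.

def create_hle_prompt(context: dict) -> str:
    """Build an HLE prompt from the test case's variables."""
    # Assumption: the dataset row's question is mapped to the `question` var.
    question = context["vars"]["question"]
    return (
        "Reason step by step, then give your final answer on its own line "
        "prefixed with 'Answer:'.\n\n"
        f"Question: {question}"
    )
```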