# Evaluating factuality
Factuality is the measure of how accurately an LLM's response aligns with established facts or reference information. Simply put, it answers the question: "Is what the AI saying actually true?"
A concrete example:

- **Question:** "What is the capital of France?"
- **AI response:** "The capital of France is Paris, which has been the country's capital since 987 CE."
- **Reference fact:** "Paris is the capital of France."

In this case, the AI response is factually accurate (it includes the correct capital) but adds additional information about when Paris became the capital.
As LLMs become increasingly integrated into critical applications, ensuring they provide factually accurate information is essential for:
- **Building trust:** Users need confidence that AI responses are reliable and truthful. For example, a financial advisor chatbot that gives incorrect information about tax laws could cause users to make costly mistakes and lose trust in your service.
- **Reducing misinformation:** Factually incorrect AI outputs can spread misinformation at scale. For instance, a healthcare bot incorrectly stating that a common vaccine is dangerous could influence thousands of patients to avoid important preventative care.
- **Supporting critical use cases:** Applications in healthcare, finance, education, and legal domains require high factual accuracy. A legal assistant that misrepresents case law precedents could lead to flawed legal strategies with serious consequences.
- **Improving model selection:** Comparing factuality across models helps you choose the right model for your application. A company might discover that while one model is more creative, another has 30% better factual accuracy for technical documentation.
- **Identifying hallucinations:** Factuality evaluation helps detect when models "make up" information. For example, discovering that your product support chatbot fabricates non-existent troubleshooting steps 15% of the time would be a critical finding.
promptfoo's factuality evaluation enables you to systematically measure how well your model outputs align with reference facts, helping you identify and address issues before they reach users.
## Quick start

The fastest way to get started with factuality evaluation is to use our pre-built TruthfulQA example:
```bash
# Initialize the example - this creates a new directory with all necessary files
npx promptfoo@latest init --example huggingface/dataset-factuality

# Change into the newly created directory
cd huggingface/dataset-factuality

# Run the evaluation - this executes the factuality tests using the models specified in the config
npx promptfoo eval

# View the results in an interactive web interface
npx promptfoo view
```
This example creates a complete factuality evaluation built on the TruthfulQA dataset, a benchmark designed to test whether models reproduce common misconceptions. You can easily customize it by editing `promptfooconfig.yaml`, for example to test more models.

## How it works

promptfoo implements a structured factuality evaluation methodology based on OpenAI's evals, using the `factuality` assertion type.
The model-graded factuality check takes the following three inputs:

- **Prompt**: the question or instruction sent to the model
- **Output**: the model's response being evaluated
- **Reference**: the ideal answer, supplied as the assertion's `value`
The evaluation classifies the relationship between the LLM output and the reference into one of five categories:
- **A:** Output is a subset of the reference and is fully consistent with it
- **B:** Output is a superset of the reference and is fully consistent with it
- **C:** Output contains all the same details as the reference
- **D:** Output and reference disagree
- **E:** Output and reference differ, but the differences don't affect factuality
By default, categories A, B, C, and E are considered passing (with customizable scores), while category D (disagreement) is considered failing. In the Paris example above, the response adds true details beyond the reference, so it falls into category B and passes.
## Basic usage

To set up a simple factuality evaluation for your LLM outputs:
```yaml
providers:
  - openai:gpt-5-mini

prompts:
  - |
    Please answer the following question accurately:

    Question: What is the capital of {{location}}?

tests:
  - vars:
      location: California
    assert:
      - type: factuality
        value: The capital of California is Sacramento
```
Then run the evaluation and view the results:

```bash
npx promptfoo eval
npx promptfoo view
```
This will produce a report showing how factually accurate your model's responses are compared to the reference answers.
## Comparing models

Factuality evaluation is especially useful for comparing how different models perform on the same facts:
```yaml
providers:
  - openai:gpt-5-mini
  - openai:gpt-5
  - anthropic:claude-sonnet-4-6
  - google:gemini-2.0-flash

prompts:
  - |
    Question: What is the capital of {{location}}?

    Please answer accurately.

tests:
  - vars:
      location: California
    assert:
      - type: factuality
        value: The capital of California is Sacramento
  - vars:
      location: New York
    assert:
      - type: factuality
        value: Albany is the capital of New York
```
## Using external datasets

For comprehensive evaluation, you can run factuality tests against external datasets like TruthfulQA, which we covered in the Quick start section above.
You can integrate any dataset by writing a loader that generates test cases, then pointing your config at it:

```yaml
tests: file://your_dataset_loader.ts:generate_tests
```
## Writing good reference answers

The quality of your reference answers is crucial for accurate factuality evaluation. Here are specific guidelines:
- **Clarity:** State the fact directly and unambiguously
- **Precision:** Include necessary details without extraneous information
- **Verifiability:** Ensure your reference is backed by authoritative sources
- **Completeness:** Include all essential parts of the answer
## Choosing a grading model

By default, promptfoo uses gpt-5 for grading. To specify a different grading model:
```yaml
defaultTest:
  options:
    # Set the provider for grading factuality
    provider: openai:gpt-5
```
You can also override it per assertion:
```yaml
assert:
  - type: factuality
    value: The capital of California is Sacramento
    provider: anthropic:claude-sonnet-4-6
```
Or via the command line:
```bash
promptfoo eval --grader openai:gpt-5
```
## Customizing score weights

Tailor the factuality scoring to your specific requirements:
```yaml
defaultTest:
  options:
    factuality:
      subset: 1.0 # Category A: Output is a subset of reference
      superset: 0.8 # Category B: Output is a superset of reference
      agree: 1.0 # Category C: Output contains all the same details
      disagree: 0.0 # Category D: Output and reference disagree
      differButFactual: 0.7 # Category E: Differences don't affect factuality
```
By default, promptfoo uses a simple binary scoring system: every passing category (A, B, C, E) scores 1.0 and the failing category (D) scores 0.

When to use custom weights:

- Lower `superset` if you're concerned about models adding potentially incorrect information
- Lower `differButFactual` if precision in wording is important for your application
- Adjust `subset` downward if comprehensive answers are required

A score of 0 means fail, while any positive score is considered passing; with the weights above, a category B answer scores 0.8 and still passes. The score values can also be used for ranking and comparing model outputs.
## Customizing the grading prompt

For complete control over how factuality is evaluated, customize the prompt:
```yaml
defaultTest:
  options:
    rubricPrompt: |
      You are an expert factuality evaluator. Compare these two answers:

      Question: {{input}}
      Reference answer: {{ideal}}
      Submitted answer: {{completion}}

      Determine if the submitted answer is factually consistent with the reference answer.

      Choose one option:
      A: Submitted answer is a subset of reference (fully consistent)
      B: Submitted answer is a superset of reference (fully consistent)
      C: Submitted answer contains same details as reference
      D: Submitted answer disagrees with reference
      E: Answers differ but differences don't affect factuality

      Respond with JSON: {"category": "LETTER", "reason": "explanation"}
```
You must include the following template variables:

- `{{input}}`: The original prompt/question
- `{{ideal}}`: The reference answer (from the `value` field)
- `{{completion}}`: The LLM's actual response (provided automatically by promptfoo)

The factuality checker supports two response formats:
JSON format (primary and recommended):
```json
{
  "category": "A",
  "reason": "The submitted answer is a subset of the expert answer and is fully consistent with it."
}
```
Single Letter (legacy format):
```text
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
```
When setting up factuality evaluations, start with a small set of well-verified reference answers, compare several models on the same tests, and adjust the category weights or grading prompt as your application requires.