`pi` is an alternative approach to model grading that uses a dedicated scoring model instead of the "LLM as a judge" technique. It evaluates input and output pairs against criteria that you specify.
:::note
Important: Unlike `llm-rubric`, which works with your existing providers, Pi requires a separate external API key from Pi Labs.
:::
Pi offers a different approach to evaluation with some distinct characteristics:
Each approach has different strengths, and you may want to experiment with both to determine which best suits your specific evaluation needs.
To use Pi, you must first set the `WITHPI_API_KEY` environment variable to your Pi Labs API key:
```bash
export WITHPI_API_KEY=your_api_key_here
```
Alternatively, set it in the `env` block of your promptfoo config:
```yaml
env:
  WITHPI_API_KEY: your_api_key_here
```
To use the `pi` assertion type, add it to your test configuration:
```yaml
assert:
  - type: pi
    # Specify the criteria for grading the LLM output
    value: Is the response not apologetic and provides a clear, concise answer?
```
This assertion will use the Pi scorer to grade the output based on the specified criteria.
Under the hood, the `pi` assertion uses the `withpi` SDK to evaluate the output based on the criteria you provide.
Compared to LLM as a judge:
`llm_input` and `llm_output`
The `pi` assertion type supports an optional `threshold` property that sets a minimum score requirement. When specified, the output must achieve a score greater than or equal to the threshold to pass.
```yaml
assert:
  - type: pi
    value: Is not apologetic and provides a clear, concise answer
    threshold: 0.8 # Requires a score of 0.8 or higher to pass
```
:::info
The default threshold is 0.5 if not specified.
:::
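To make the default concrete, here is a minimal sketch that places two `pi` assertions side by side: the first omits `threshold` and so falls back to 0.5, while the second sets a stricter bar. The criteria strings and the 0.9 value are illustrative assumptions, not recommended settings.

```yaml
assert:
  # No threshold given, so the default of 0.5 applies
  - type: pi
    value: Is the response polite?
  # Explicit threshold: passes only with a score of 0.9 or higher
  - type: pi
    value: Does the response directly answer the question?
    threshold: 0.9
```

Leaving the threshold off may be a reasonable starting point while you learn how scores are distributed for your criteria. You can tighten individual assertions afterwards.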
You can use the Pi Labs Copilot to interactively brainstorm representative metrics for your application. It helps you:
```yaml
prompts:
  - 'Explain {{concept}} in simple terms.'
providers:
  - openai:gpt-5
tests:
  - vars:
      concept: quantum computing
    assert:
      - type: pi
        value: Is the explanation easy to understand without technical jargon?
        threshold: 0.7
      - type: pi
        value: Does the response correctly explain the fundamental principles?
        threshold: 0.8
```