site/docs/configuration/expected-outputs/model-graded/max-score.md
The `max-score` assertion selects the output with the highest aggregate score from other assertions. Unlike `select-best`, which uses LLM judgment, `max-score` makes an objective, deterministic selection based on the quantitative scores those assertions produce.
Use `max-score` when you want to:

- Select outputs based on objective, quantitative criteria rather than LLM judgment
- Get deterministic, reproducible selection across runs
- Avoid the cost and latency of extra LLM API calls
- See exactly which scores drove the selection
`max-score` collects the scores from the other assertions in the same test, aggregates them, and selects the output with the highest aggregate. For example:

```yaml
prompts:
  - 'Write a function to {{task}}'
  - 'Write an efficient function to {{task}}'
  - 'Write a well-documented function to {{task}}'

providers:
  - openai:gpt-5

tests:
  - vars:
      task: 'calculate fibonacci numbers'
    assert:
      # Regular assertions that score each output
      - type: python
        value: 'assert fibonacci(10) == 55'
      - type: llm-rubric
        value: 'Code is efficient'
      - type: contains
        value: 'def fibonacci'
      # Max-score selects the output with the highest average score
      - type: max-score
```
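To build intuition, here is a minimal Python sketch of the idea (illustrative only, not promptfoo's implementation; the `pick_best` helper and the candidate scores are made up): each assertion gives every candidate output a 0-1 score, and `max-score` keeps the candidate whose aggregate is highest.

```python
# Illustrative sketch, not promptfoo internals: each assertion scores every
# candidate output between 0 and 1, and max-score keeps the candidate whose
# aggregate (here, a plain average) is highest.

def pick_best(per_output_scores: dict[str, list[float]]) -> str:
    """Return the output whose mean assertion score is highest."""
    def mean(scores: list[float]) -> float:
        return sum(scores) / len(scores)
    return max(per_output_scores, key=lambda name: mean(per_output_scores[name]))

candidates = {
    "prompt 1 output": [1.0, 0.5, 1.0],  # python, llm-rubric, contains
    "prompt 2 output": [1.0, 1.0, 0.0],
    "prompt 3 output": [0.0, 1.0, 1.0],
}
print(pick_best(candidates))  # prompt 1 output (mean 0.83 vs 0.67 and 0.67)
```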
Choose how scores are combined:
```yaml
assert:
  - type: max-score
    value:
      method: average # 'average' (default) or 'sum'
```
Give different importance to different assertions by specifying weights per assertion type:
```yaml
assert:
  - type: python # Test correctness
  - type: llm-rubric # Test quality
    value: 'Well documented'
  - type: max-score
    value:
      weights:
        python: 3 # Correctness is 3x more important
        llm-rubric: 1 # Documentation has 1x weight
```
With `method: average`, the final score is `sum(score × weight) / sum(weights)`. With `method: sum`, it is `sum(score × weight)`.

Example calculation with `method: average`:
```text
Output A: python=1.0, llm-rubric=0.5, contains=1.0
Weights:  python=3,   llm-rubric=1,   contains=1 (default)

Score = (1.0×3 + 0.5×1 + 1.0×1) / (3 + 1 + 1)
      = (3.0 + 0.5 + 1.0) / 5
      = 0.9
```
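As a sanity check on the arithmetic, this small Python sketch reproduces the weighted average above (the `weighted_score` helper is hypothetical, written only to mirror the formula; it is not promptfoo's code):

```python
def weighted_score(scores: dict[str, float],
                   weights: dict[str, float],
                   method: str = "average") -> float:
    """sum(score * weight), divided by sum(weights) when method is 'average';
    assertions without an explicit weight default to weight 1."""
    total = sum(s * weights.get(name, 1) for name, s in scores.items())
    if method == "average":
        total /= sum(weights.get(name, 1) for name in scores)
    return total

scores = {"python": 1.0, "llm-rubric": 0.5, "contains": 1.0}
weights = {"python": 3}  # llm-rubric and contains default to weight 1
print(weighted_score(scores, weights))  # 0.9
```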
Require a minimum score for selection:
```yaml
assert:
  - type: max-score
    value:
      threshold: 0.7 # Only select if average score >= 0.7
```
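Conceptually, the threshold is a gate on the winning score: per the comment above, an output is selected only if its aggregate reaches 0.7. A sketch of that logic (hypothetical helper, assuming selection simply fails when no candidate clears the bar):

```python
def pick_best_with_threshold(aggregates: dict[str, float],
                             threshold: float = 0.7) -> str | None:
    """Return the top-scoring output, or None if even the best falls short."""
    best = max(aggregates, key=aggregates.get)
    return best if aggregates[best] >= threshold else None

print(pick_best_with_threshold({"A": 0.9, "B": 0.6}))   # A
print(pick_best_with_threshold({"A": 0.65, "B": 0.6}))  # None, nothing clears 0.7
```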
A complete code-generation example that weights correctness above the quality checks:

```yaml
prompts:
  - 'Write a Python function to {{task}}'
  - 'Write an optimized Python function to {{task}}'
  - 'Write a documented Python function to {{task}}'

providers:
  - openai:gpt-5-mini

tests:
  - vars:
      task: 'merge two sorted lists'
    assert:
      - type: python
        value: |
          list1 = [1, 3, 5]
          list2 = [2, 4, 6]
          result = merge_lists(list1, list2)
          assert result == [1, 2, 3, 4, 5, 6]
      - type: llm-rubric
        value: 'Code has O(n+m) time complexity'
      - type: llm-rubric
        value: 'Code is well documented with a docstring'
      - type: max-score
        value:
          weights:
            python: 3 # Correctness most important
            llm-rubric: 1 # Each quality metric has weight 1
```
An explanation-quality example where all criteria matter equally:

```yaml
prompts:
  - 'Explain {{concept}} simply'
  - 'Explain {{concept}} in detail'
  - 'Explain {{concept}} with examples'

providers:
  - anthropic:claude-3-haiku-20240307

tests:
  - vars:
      concept: 'machine learning'
    assert:
      - type: llm-rubric
        value: 'Explanation is accurate'
      - type: llm-rubric
        value: 'Explanation is clear and easy to understand'
      - type: contains
        value: 'example'
      - type: max-score
        value:
          method: average # All criteria equally important
```
A structured-output example that combines format, content, and latency checks:

```yaml
tests:
  - vars:
      query: 'weather in Paris'
    assert:
      - type: is-json
      - type: contains-json
        value:
          required: ['temperature', 'humidity', 'conditions']
      - type: llm-rubric
        value: 'Response includes all requested weather data'
      - type: latency
        threshold: 1000 # Under 1 second
      - type: max-score
        value:
          weights:
            is-json: 2 # Must be valid JSON
            contains-json: 2 # Must have required fields
            llm-rubric: 1 # Quality check
            latency: 1 # Performance matters
```
How `max-score` compares to `select-best`:

| Feature | max-score | select-best |
|---|---|---|
| Selection method | Aggregate scores from assertions | LLM judgment |
| API calls | None (uses existing scores) | One per eval |
| Reproducibility | Deterministic | May vary |
| Best for | Objective criteria | Subjective criteria |
| Transparency | Shows exact scores | Shows LLM reasoning |
| Cost | Free (no API calls) | Incurs LLM API costs |