Max Score

The max-score assertion selects the output with the highest aggregate score from your other assertions. Unlike select-best, which relies on LLM judgment, max-score makes an objective, deterministic selection based on those quantitative scores.

When to use max-score

Use max-score when you want to:

  • Select the best output based on objective, measurable criteria
  • Combine multiple metrics with different importance (weights)
  • Have transparent, reproducible selection without LLM API calls
  • Select outputs based on a combination of correctness, quality, and other metrics

How it works

  1. All regular assertions run first on each output
  2. max-score collects the scores from these assertions
  3. It calculates an aggregate score for each output (average by default)
  4. It selects the output with the highest aggregate score
  5. It returns pass=true for the highest-scoring output and pass=false for the others (see the sketch below)
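
For illustration, here is a minimal Python sketch of this selection loop. It assumes each output is represented as a list of (assertion_type, score) pairs; the aggregate and pick_max_score helpers are invented for the sketch and are not promptfoo's actual implementation, but the method, weights, and threshold parameters mirror the configuration options documented below.

python
from typing import Optional

def aggregate(scores, weights, method="average"):
    """Combine one output's per-assertion scores into a single number."""
    weighted_sum = sum(score * weights.get(name, 1.0) for name, score in scores)
    if method == "sum":
        return weighted_sum
    total_weight = sum(weights.get(name, 1.0) for name, _ in scores)
    return weighted_sum / total_weight  # "average" (the default)

def pick_max_score(outputs, weights=None, method="average", threshold: Optional[float] = None):
    """Return the index of the winning output, or None if no output meets the threshold."""
    weights = weights or {}
    aggregates = [aggregate(scores, weights, method) for scores in outputs]
    # max() keeps the first maximal element, so ties break toward the earliest output
    best = max(range(len(aggregates)), key=lambda i: aggregates[i])
    if threshold is not None and aggregates[best] < threshold:
        return None
    return best

# Two candidate outputs, already scored by the regular assertions
outputs = [
    [("python", 1.0), ("llm-rubric", 0.8)],
    [("python", 0.5), ("llm-rubric", 1.0)],
]
print(pick_max_score(outputs))  # 0 (average 0.9 beats 0.75)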

Basic usage

yaml
prompts:
  - 'Write a function to {{task}}'
  - 'Write an efficient function to {{task}}'
  - 'Write a well-documented function to {{task}}'

providers:
  - openai:gpt-5

tests:
  - vars:
      task: 'calculate fibonacci numbers'
    assert:
      # Regular assertions that score each output
      - type: python
        value: 'assert fibonacci(10) == 55'
      - type: llm-rubric
        value: 'Code is efficient'
      - type: contains
        value: 'def fibonacci'
      # Max-score selects the output with highest average score
      - type: max-score

Configuration options

Aggregation method

Choose how scores are combined:

yaml
assert:
  - type: max-score
    value:
      method: average # 'average' (default) or 'sum'
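
The difference between the two methods is only whether the weighted total is divided by the sum of the weights; using the illustrative aggregate helper from the sketch above:

python
scores = [("python", 1.0), ("llm-rubric", 0.5)]
print(aggregate(scores, {}, method="average"))  # 0.75
print(aggregate(scores, {}, method="sum"))      # 1.5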

Weighted scoring

Give different importance to different assertions by specifying weights per assertion type:

yaml
assert:
  - type: python # Test correctness
  - type: llm-rubric # Test quality
    value: 'Well documented'
  - type: max-score
    value:
      weights:
        python: 3 # Correctness is 3x more important
        llm-rubric: 1 # Documentation is 1x weight

How weights work

  • Each assertion type can have a custom weight (default: 1.0)
  • For method: average, the final score is: sum(score × weight) / sum(weights)
  • For method: sum, the final score is: sum(score × weight)
  • Weights apply to all assertions of that type

Example calculation with method: average:

Output A: python=1.0, llm-rubric=0.5, contains=1.0
Weights:  python=3,   llm-rubric=1,   contains=1 (default)

Score = (1.0×3 + 0.5×1 + 1.0×1) / (3 + 1 + 1)
      = (3.0 + 0.5 + 1.0) / 5
      = 0.9
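
Checking the same numbers with the illustrative aggregate helper from the earlier sketch (not a promptfoo API):

python
scores = [("python", 1.0), ("llm-rubric", 0.5), ("contains", 1.0)]
print(aggregate(scores, weights={"python": 3}))  # 0.9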

Minimum threshold

Require a minimum score for selection:

yaml
assert:
  - type: max-score
    value:
      threshold: 0.7 # Only select if average score >= 0.7
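
In terms of the earlier illustrative sketch, the threshold only vetoes selection when even the best aggregate falls short; pick_max_score is an assumed helper, not part of promptfoo:

python
outputs = [
    [("python", 0.5), ("llm-rubric", 0.6)],  # average 0.55
    [("python", 0.4), ("llm-rubric", 0.8)],  # average 0.60
]
print(pick_max_score(outputs, threshold=0.7))  # None: best average 0.6 is below 0.7
print(pick_max_score(outputs))                 # 1: without a threshold, the highest average wins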

Scoring details

  • Binary assertions (pass/fail): Scored as 1.0 or 0.0
  • Scored assertions: Use the numeric score (typically 0-1 range)
  • Default weights: 1.0 for all assertions
  • Tie breaking: First output wins (deterministic)
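
As a small illustration of the bullets above, assuming pass/fail results arrive as booleans and scored results as floats (the normalize helper is hypothetical, and pick_max_score comes from the earlier sketch):

python
def normalize(result):
    """Map a raw assertion result to the numeric score used for aggregation."""
    if isinstance(result, bool):  # binary assertion: pass/fail
        return 1.0 if result else 0.0
    return float(result)          # scored assertion: keep the numeric value

print([normalize(r) for r in [True, False, 0.75]])  # [1.0, 0.0, 0.75]

# Ties break toward the first output
tie = [[("contains", 1.0)], [("contains", 1.0)]]
print(pick_max_score(tie))  # 0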

Examples

Example 1: Multi-criteria code selection

yaml
prompts:
  - 'Write a Python function to {{task}}'
  - 'Write an optimized Python function to {{task}}'
  - 'Write a documented Python function to {{task}}'

providers:
  - openai:gpt-5-mini

tests:
  - vars:
      task: 'merge two sorted lists'
    assert:
      - type: python
        value: |
          list1 = [1, 3, 5]
          list2 = [2, 4, 6]
          result = merge_lists(list1, list2)
          assert result == [1, 2, 3, 4, 5, 6]

      - type: llm-rubric
        value: 'Code has O(n+m) time complexity'

      - type: llm-rubric
        value: 'Code is well documented with docstring'

      - type: max-score
        value:
          weights:
            python: 3 # Correctness most important
            llm-rubric: 1 # Each quality metric has weight 1

Example 2: Content generation selection

yaml
prompts:
  - 'Explain {{concept}} simply'
  - 'Explain {{concept}} in detail'
  - 'Explain {{concept}} with examples'

providers:
  - anthropic:claude-3-haiku-20240307

tests:
  - vars:
      concept: 'machine learning'
    assert:
      - type: llm-rubric
        value: 'Explanation is accurate'

      - type: llm-rubric
        value: 'Explanation is clear and easy to understand'

      - type: contains
        value: 'example'

      - type: max-score
        value:
          method: average # All criteria equally important

Example 3: API response selection

yaml
tests:
  - vars:
      query: 'weather in Paris'
    assert:
      - type: is-json

      - type: contains-json
        value:
          required: ['temperature', 'humidity', 'conditions']

      - type: llm-rubric
        value: 'Response includes all requested weather data'

      - type: latency
        threshold: 1000 # Under 1 second

      - type: max-score
        value:
          weights:
            is-json: 2 # Must be valid JSON
            contains-json: 2 # Must have required fields
            llm-rubric: 1 # Quality check
            latency: 1 # Performance matters

Comparison with select-best

| Feature | max-score | select-best |
| --- | --- | --- |
| Selection method | Aggregate scores from assertions | LLM judgment |
| API calls | None (uses existing scores) | One per eval |
| Reproducibility | Deterministic | May vary |
| Best for | Objective criteria | Subjective criteria |
| Transparency | Shows exact scores | Shows LLM reasoning |
| Cost | Free (no API calls) | Costs per API call |

Edge cases

  • No other assertions: Error - max-score requires at least one assertion to aggregate
  • Tie scores: First output wins (by index)
  • All outputs fail: Still selects the highest scorer ("least bad")
  • Below threshold: No output selected if threshold is specified and not met

Tips

  1. Use specific assertions: More assertions provide better signal for selection
  2. Weight important criteria: Use weights to emphasize what matters most
  3. Combine with select-best: You can use both in the same test for comparison
  4. Debug with scores: The output shows aggregate scores for transparency

Further reading