Max Score

The max-score assertion selects the output with the highest aggregate score from your other assertions. Unlike select-best, which relies on LLM judgment, max-score makes an objective, deterministic selection based on those quantitative scores.

When to use max-score

Use max-score when you want to:

  • Select the best output based on objective, measurable criteria
  • Combine multiple metrics with different importance (weights)
  • Have transparent, reproducible selection without LLM API calls
  • Select outputs based on a combination of correctness, quality, and other metrics

How it works

  1. All regular assertions run first on each output
  2. max-score collects the scores from these assertions
  3. It calculates an aggregate score for each output (average by default)
  4. It selects the output with the highest aggregate score
  5. It returns pass=true for the highest-scoring output and pass=false for the others (see the sketch below)
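
For illustration, here is a minimal Python sketch of this selection loop. It assumes each output is represented as a list of (assertion_type, score) pairs; the aggregate and pick_max_score helpers are invented for the sketch and are not promptfoo's actual implementation, but the method, weights, and threshold parameters mirror the configuration options documented below.

python
from typing import Optional

def aggregate(scores, weights, method="average"):
    """Combine one output's per-assertion scores into a single number."""
    weighted_sum = sum(score * weights.get(name, 1.0) for name, score in scores)
    if method == "sum":
        return weighted_sum
    total_weight = sum(weights.get(name, 1.0) for name, _ in scores)
    return weighted_sum / total_weight  # "average" (the default)

def pick_max_score(outputs, weights=None, method="average", threshold: Optional[float] = None):
    """Return the index of the winning output, or None if no output meets the threshold."""
    weights = weights or {}
    aggregates = [aggregate(scores, weights, method) for scores in outputs]
    # max() keeps the first maximal element, so ties break toward the earliest output
    best = max(range(len(aggregates)), key=lambda i: aggregates[i])
    if threshold is not None and aggregates[best] < threshold:
        return None
    return best

# Two candidate outputs, already scored by the regular assertions
outputs = [
    [("python", 1.0), ("llm-rubric", 0.8)],
    [("python", 0.5), ("llm-rubric", 1.0)],
]
print(pick_max_score(outputs))  # 0 (average 0.9 beats 0.75)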

Basic usage

yaml
prompts:
  - 'Write a function to {{task}}'
  - 'Write an efficient function to {{task}}'
  - 'Write a well-documented function to {{task}}'

providers:
  - openai:gpt-5

tests:
  - vars:
      task: 'calculate fibonacci numbers'
    assert:
      # Regular assertions that score each output
      - type: python
        value: 'assert fibonacci(10) == 55'
      - type: llm-rubric
        value: 'Code is efficient'
      - type: contains
        value: 'def fibonacci'
      # Max-score selects the output with highest average score
      - type: max-score

Configuration options

Aggregation method

Choose how scores are combined:

yaml
assert:
  - type: max-score
    value:
      method: average # 'average' (default) or 'sum'
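
The difference between the two methods is only whether the weighted total is divided by the sum of the weights; using the illustrative aggregate helper from the sketch above:

python
scores = [("python", 1.0), ("llm-rubric", 0.5)]
print(aggregate(scores, {}, method="average"))  # 0.75
print(aggregate(scores, {}, method="sum"))      # 1.5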

Weighted scoring

Give different importance to different assertions by specifying weights per assertion type:

yaml
assert:
  - type: python # Test correctness
  - type: llm-rubric # Test quality
    value: 'Well documented'
  - type: max-score
    value:
      weights:
        python: 3 # Correctness is 3x more important
        llm-rubric: 1 # Documentation is 1x weight

How weights work

  • Each assertion type can have a custom weight (default: 1.0)
  • For method: average, the final score is: sum(score × weight) / sum(weights)
  • For method: sum, the final score is: sum(score × weight)
  • Weights apply to all assertions of that type

Example calculation with method: average:

Output A: python=1.0, llm-rubric=0.5, contains=1.0
Weights:  python=3,   llm-rubric=1,   contains=1 (default)

Score = (1.0×3 + 0.5×1 + 1.0×1) / (3 + 1 + 1)
      = (3.0 + 0.5 + 1.0) / 5
      = 0.9
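
Checking the same numbers with the illustrative aggregate helper from the earlier sketch (not a promptfoo API):

python
scores = [("python", 1.0), ("llm-rubric", 0.5), ("contains", 1.0)]
print(aggregate(scores, weights={"python": 3}))  # 0.9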

Minimum threshold

Require a minimum score for selection:

yaml
assert:
  - type: max-score
    value:
      threshold: 0.7 # Only select if average score >= 0.7
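
In terms of the earlier illustrative sketch, the threshold only vetoes selection when even the best aggregate falls short; pick_max_score is an assumed helper, not part of promptfoo:

python
outputs = [
    [("python", 0.5), ("llm-rubric", 0.6)],  # average 0.55
    [("python", 0.4), ("llm-rubric", 0.8)],  # average 0.60
]
print(pick_max_score(outputs, threshold=0.7))  # None: best average 0.6 is below 0.7
print(pick_max_score(outputs))                 # 1: without a threshold, the highest average wins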

Scoring details

  • Binary assertions (pass/fail): Scored as 1.0 or 0.0
  • Scored assertions: Use the numeric score (typically 0-1 range)
  • Default weights: 1.0 for all assertions
  • Tie breaking: First output wins (deterministic)
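
As a small illustration of the bullets above, assuming pass/fail results arrive as booleans and scored results as floats (the normalize helper is hypothetical, and pick_max_score comes from the earlier sketch):

python
def normalize(result):
    """Map a raw assertion result to the numeric score used for aggregation."""
    if isinstance(result, bool):  # binary assertion: pass/fail
        return 1.0 if result else 0.0
    return float(result)          # scored assertion: keep the numeric value

print([normalize(r) for r in [True, False, 0.75]])  # [1.0, 0.0, 0.75]

# Ties break toward the first output
tie = [[("contains", 1.0)], [("contains", 1.0)]]
print(pick_max_score(tie))  # 0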

Examples

Example 1: Multi-criteria code selection

yaml
prompts:
  - 'Write a Python function to {{task}}'
  - 'Write an optimized Python function to {{task}}'
  - 'Write a documented Python function to {{task}}'

providers:
  - openai:gpt-5-mini

tests:
  - vars:
      task: 'merge two sorted lists'
    assert:
      - type: python
        value: |
          list1 = [1, 3, 5]
          list2 = [2, 4, 6]
          result = merge_lists(list1, list2)
          assert result == [1, 2, 3, 4, 5, 6]

      - type: llm-rubric
        value: 'Code has O(n+m) time complexity'

      - type: llm-rubric
        value: 'Code is well documented with docstring'

      - type: max-score
        value:
          weights:
            python: 3 # Correctness most important
            llm-rubric: 1 # Each quality metric has weight 1

Example 2: Content generation selection

yaml
prompts:
  - 'Explain {{concept}} simply'
  - 'Explain {{concept}} in detail'
  - 'Explain {{concept}} with examples'

providers:
  - anthropic:claude-3-haiku-20240307

tests:
  - vars:
      concept: 'machine learning'
    assert:
      - type: llm-rubric
        value: 'Explanation is accurate'

      - type: llm-rubric
        value: 'Explanation is clear and easy to understand'

      - type: contains
        value: 'example'

      - type: max-score
        value:
          method: average # All criteria equally important

Example 3: API response selection

yaml
tests:
  - vars:
      query: 'weather in Paris'
    assert:
      - type: is-json

      - type: contains-json
        value:
          required: ['temperature', 'humidity', 'conditions']

      - type: llm-rubric
        value: 'Response includes all requested weather data'

      - type: latency
        threshold: 1000 # Under 1 second

      - type: max-score
        value:
          weights:
            is-json: 2 # Must be valid JSON
            contains-json: 2 # Must have required fields
            llm-rubric: 1 # Quality check
            latency: 1 # Performance matters

Comparison with select-best

| Feature | max-score | select-best |
| --- | --- | --- |
| Selection method | Aggregate scores from assertions | LLM judgment |
| API calls | None (uses existing scores) | One per eval |
| Reproducibility | Deterministic | May vary |
| Best for | Objective criteria | Subjective criteria |
| Transparency | Shows exact scores | Shows LLM reasoning |
| Cost | Free (no API calls) | Costs per API call |

Edge cases

  • No other assertions: Error - max-score requires at least one assertion to aggregate
  • Tie scores: First output wins (by index)
  • All outputs fail: Still selects the highest scorer ("least bad")
  • Below threshold: No output selected if threshold is specified and not met

Tips

  1. Use specific assertions: More assertions provide better signal for selection
  2. Weight important criteria: Use weights to emphasize what matters most
  3. Combine with select-best: You can use both in the same test for comparison
  4. Debug with scores: The output shows aggregate scores for transparency

Further reading