# BERTScore (eval-bert-score)
Use BERTScore to measure semantic similarity between LLM outputs and reference text.
```sh
npx promptfoo@latest init --example eval-bert-score
cd eval-bert-score
pip install -r requirements.txt
```
**Note:** The first run will download the BERT model (~1.4GB).
```yaml
# promptfooconfig.yaml
tests:
  - vars:
      text: 'Hello world'
      reference: 'Hi there'
    assert:
      - type: python
        value: file://bertscore_check.py
        threshold: 0.7 # Pass if similarity > 70%
```
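The config points at `bertscore_check.py`, which ships with the example. A minimal sketch of such a script, assuming promptfoo's `get_assert(output, context)` hook for file-based Python assertions and the `reference` test var from the config above (the bundled script may differ):

```python
# bertscore_check.py -- a minimal sketch; the script in the example may differ.
from bert_score import score


def get_assert(output, context):
    """Return BERTScore F1 between the LLM output and the reference var."""
    reference = context['vars']['reference']
    # score() returns (precision, recall, F1) tensors with one entry per pair
    _, _, f1 = score([output], [reference], lang='en', verbose=False)
    return f1.item()  # promptfoo compares this float against the threshold
```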
Run the eval:

```sh
promptfoo eval
```
Compare against multiple valid references:
```yaml
# promptfooconfig-advanced.yaml
assert:
  - type: python
    value: |
      from bert_score import score

      references = [
          "First valid answer",
          "Second valid answer",
          "Third valid answer",
      ]
      scores = []
      for ref in references:
          _, _, F1 = score([output], [ref], lang='en', verbose=False)
          scores.append(F1.item())
      return max(scores)  # Use best match
```
Run it with:

```sh
promptfoo eval -c promptfooconfig-advanced.yaml
```
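As a side note, `bert_score` also accepts multiple references per candidate (a list of reference lists) and reports the best match, which avoids looping over separate `score()` calls. A sketch, assuming that multi-reference behavior:

```python
from bert_score import score

# Multi-reference scoring in one call: refs is a list of reference lists,
# one inner list per candidate; bert_score reports the best-matching score.
cands = ["The model's answer"]  # hypothetical candidate output
refs = [["First valid answer", "Second valid answer", "Third valid answer"]]
_, _, F1 = score(cands, refs, lang='en', verbose=False)
print(F1.item())
```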
BERTScore returns a similarity score from 0 to 1; higher scores indicate closer semantic similarity.
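For intuition, you can compute a score directly outside promptfoo. A standalone sketch: raw BERTScore values for English tend to cluster near the top of the range, and the library's `rescale_with_baseline=True` option spreads them more evenly across 0 to 1:

```python
from bert_score import score

# Standalone check of the example pair from the config above
P, R, F1 = score(['Hello world'], ['Hi there'], lang='en',
                 rescale_with_baseline=True, verbose=False)
print(f'F1: {F1.item():.3f}')  # closer to 1.0 means more semantically similar
```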