# BERTScore (eval-bert-score)
Use BERTScore to measure semantic similarity between LLM outputs and reference text.
```sh
npx promptfoo@latest init --example eval-bert-score
cd eval-bert-score
pip install -r requirements.txt
```
**Note:** The first run will download the BERT model (~1.4GB).
```yaml
# promptfooconfig.yaml
tests:
  - vars:
      text: 'Hello world'
      reference: 'Hi there'
    assert:
      - type: python
        value: file://bertscore_check.py
        threshold: 0.7 # Pass if similarity > 70%
```
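The config points at `bertscore_check.py`, which ships with the example. A minimal sketch of such a script, assuming promptfoo's `get_assert(output, context)` hook for file-based Python assertions and the `reference` test var from the config above (the bundled script may differ):

```python
# bertscore_check.py -- a minimal sketch; the script in the example may differ.
from bert_score import score


def get_assert(output, context):
    """Return BERTScore F1 between the LLM output and the reference var."""
    reference = context['vars']['reference']
    # score() returns (precision, recall, F1) tensors with one entry per pair
    _, _, f1 = score([output], [reference], lang='en', verbose=False)
    return f1.item()  # promptfoo compares this float against the threshold
```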
Run the eval:

```sh
promptfoo eval
```
Compare against multiple valid references:
```yaml
# promptfooconfig-advanced.yaml
assert:
  - type: python
    value: |
      from bert_score import score

      references = [
          "First valid answer",
          "Second valid answer",
          "Third valid answer",
      ]
      scores = []
      for ref in references:
          _, _, F1 = score([output], [ref], lang='en', verbose=False)
          scores.append(F1.item())
      return max(scores)  # Use best match
```
Run it with:

```sh
promptfoo eval -c promptfooconfig-advanced.yaml
```
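As a side note, `bert_score` also accepts multiple references per candidate (a list of reference lists) and reports the best match, which avoids looping over separate `score()` calls. A sketch, assuming that multi-reference behavior:

```python
from bert_score import score

# Multi-reference scoring in one call: refs is a list of reference lists,
# one inner list per candidate; bert_score reports the best-matching score.
cands = ["The model's answer"]  # hypothetical candidate output
refs = [["First valid answer", "Second valid answer", "Third valid answer"]]
_, _, F1 = score(cands, refs, lang='en', verbose=False)
print(F1.item())
```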
BERTScore returns a similarity score from 0 to 1; higher scores indicate closer semantic similarity.
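For intuition, you can compute a score directly outside promptfoo. A standalone sketch: raw BERTScore values for English tend to cluster near the top of the range, and the library's `rescale_with_baseline=True` option spreads them more evenly across 0 to 1:

```python
from bert_score import score

# Standalone check of the example pair from the config above
P, R, F1 = score(['Hello world'], ['Hi there'], lang='en',
                 rescale_with_baseline=True, verbose=False)
print(f'F1: {F1.item():.3f}')  # closer to 1.0 means more semantically similar
```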