plugins/promptfoo-evals/skills/promptfoo-evals/references/cheatsheet.md
Field order: description, env, prompts, providers, defaultTest, scenarios, tests.
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: 'Summarization quality' # 3-10 words
prompts:
  - file://prompts/main.txt # plain text with {{variables}}
  - file://prompts/chat.json # chat format [{role, content}]
providers:
  - openai:chat:gpt-4.1-mini # model ID shorthand
  - id: anthropic:messages:claude-sonnet-4-6
    label: claude # display name in results
    config:
      temperature: 0
defaultTest: # shared across all tests
  assert:
    - type: cost
      threshold: 0.01
    - type: latency
      threshold: 5000
tests:
  - file://tests/*.yaml # glob loads all test files
Reference environment variables with quoted Nunjucks syntax; shell syntax ($VAR) does not work:
# ✅ Correct
apiKey: '{{env.OPENAI_API_KEY}}'
baseUrl: '{{env.API_BASE_URL}}'
# ❌ Wrong
apiKey: $OPENAI_API_KEY
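The quotes matter because an unquoted `{{` starts a YAML flow mapping. The difference between the two forms can be illustrated with a minimal sketch, assuming a simplified regex stand-in for Nunjucks rendering (this is not promptfoo's actual code, and `render_env_templates` is a hypothetical helper):

```python
import re

# Matches {{env.NAME}} with optional whitespace inside the braces.
ENV_PLACEHOLDER = re.compile(r"\{\{\s*env\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_env_templates(text: str, env: dict) -> str:
    """Replace {{env.NAME}} placeholders from env; leave all other text untouched."""
    return ENV_PLACEHOLDER.sub(lambda m: env.get(m.group(1), ""), text)


fake_env = {"OPENAI_API_KEY": "sk-test"}
print(render_env_templates("{{env.OPENAI_API_KEY}}", fake_env))  # sk-test
print(render_env_templates("$OPENAI_API_KEY", fake_env))  # $OPENAI_API_KEY
```

The shell-style value passes through unchanged, so the provider would receive the literal string `$OPENAI_API_KEY` rather than the key.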
| Type | What it checks |
|---|---|
| `equals` | Exact match |
| `contains` / `icontains` | Substring (case-sensitive / insensitive) |
| `contains-all` / `contains-any` | All or any substrings |
| `icontains-all` / `icontains-any` | Case-insensitive variants |
| `starts-with` | Prefix match |
| `regex` | Regex pattern |
| `is-json` | Valid JSON (optional JSON Schema in `value`) |
| `contains-json` | Output contains valid JSON |
| `javascript` | `value: "output.length < 100"` (return bool or `{pass, score, reason}`) |
| `python` | Same as `javascript` but Python |
| `cost` | `threshold: 0.01` (max cost in dollars) |
| `latency` | `threshold: 5000` (max ms) |
| `word-count` | Validate word count |
All deterministic types support not- prefix: not-contains, not-regex, etc.
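
The `python` row can be made concrete with a minimal sketch of a file-based assertion. It assumes the `get_assert(output, context)` entry point that promptfoo documents for Python assertion files; the word limit and the scoring rule are hypothetical choices for illustration.

```python
def get_assert(output, context):
    """Pass when the output stays within a word limit; return the graded dict form."""
    max_words = 100  # hypothetical limit, adjust per eval
    count = len(output.split())
    passed = count <= max_words
    return {
        "pass": passed,
        # Partial credit shrinks as the output overshoots the limit.
        "score": 1.0 if passed else max_words / count,
        "reason": f"{count} words (limit {max_words})",
    }
```

Returning a plain `True` or `False` would also fit the contract described in the table; the dict form is useful when the reason should show up in results.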

| Type | What it checks |
|---|---|
| `similar` | Cosine similarity (set `threshold: 0.8`) |
| `levenshtein` | Edit distance |
| `rouge-n` | ROUGE score (summarization) |
| `bleu` | BLEU score (translation) |

| Type | When to use |
|---|---|
| `llm-rubric` | Custom criteria: `value: "Is helpful, accurate, and concise"` |
| `factuality` | Check factual accuracy against a reference |
| `answer-relevance` | Is the answer relevant to the query |
| `context-faithfulness` | Is the response grounded in provided context |
| `context-recall` | Does the context contain needed info |
| `model-graded-closedqa` | Closed-domain QA scoring |

| Type | What it checks |
|---|---|
| `is-valid-openai-tools-call` | Valid OpenAI tools call format |
| `is-valid-openai-function-call` | Valid function call format |

| Type | What it checks |
|---|---|
| `moderation` | OpenAI moderation API |
| `is-refusal` | Model refused to answer |
| `classifier` | Custom text classification |

| Type | What it does |
|---|---|
| `select-best` | Compare all outputs, pick best |
| `human` | Manual grading via web UI |
| `webhook` | External validation endpoint |

Assertions support these optional fields:
assert:
  - type: icontains
    value: 'expected text'
    weight: 2 # relative importance for scoring (default: 1)
    threshold: 0.8 # assertion-specific (e.g. min score for graded; max for cost/latency)
    metric: 'relevance' # custom metric name for reporting
Pin the grader model/provider explicitly for stable scoring:
defaultTest:
  options:
    provider: openai:gpt-5-mini
tests:
  - description: 'Quality check'
    assert:
      - type: llm-rubric
        value: 'Accurate and concise'
        # Optional per-assertion override:
        # provider: anthropic:messages:claude-sonnet-4-6
# OpenAI
- openai:chat:gpt-4.1-2025-04-14
- openai:chat:gpt-4.1-mini
- openai:responses:gpt-4.1
# Anthropic
- anthropic:messages:claude-sonnet-4-6
- anthropic:messages:claude-haiku-4-5-20251001
# Google
- google:gemini-2.5-pro
- google:gemini-2.5-flash
- google:gemini-2.0-flash
# AWS Bedrock
- bedrock:anthropic.claude-sonnet-4-6
- bedrock:anthropic.claude-haiku-4-5-20251001-v1:0
# Other
- azure:chat:my-deployment
- groq:llama-3.3-70b-versatile
- ollama:chat:llama3.3
- mistral:mistral-large-latest
- togetherai:meta-llama/Llama-4-Scout-Instruct
providers:
  - id: https
    label: my-api
    config:
      url: 'https://api.example.com/generate'
      method: POST
      headers:
        Content-Type: application/json
      body:
        prompt: '{{prompt}}'
      transformResponse: 'json.output'
providers:
  - file://provider.py

# provider.py
def call_api(prompt, options, context):
    # Call your system under test
    result = my_system(prompt)
    return {"output": result}
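
A slightly fuller version of the Python provider can report failures instead of crashing the eval. This is a sketch under stated assumptions: `my_system` is a hypothetical stand-in for the system under test, and the `error` key is assumed to be how a provider response signals failure to promptfoo.

```python
def my_system(prompt: str) -> str:
    """Hypothetical stand-in for the system under test."""
    if not prompt.strip():
        raise ValueError("empty prompt")
    return f"echo: {prompt}"


def call_api(prompt, options, context):
    """Return {"output": ...} on success or {"error": ...} on failure."""
    try:
        return {"output": my_system(prompt)}
    except Exception as exc:
        # Surface the failure as data so the remaining test cases still run.
        return {"error": f"{type(exc).__name__}: {exc}"}
```

Catching the exception inside `call_api` keeps one bad input from aborting the whole run, at the cost of hiding the traceback, so log it separately if you need it.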
providers:
  - file://provider.js

// provider.js
module.exports = {
  id: () => 'my-provider',
  callApi: async (prompt) => {
    const result = await mySystem(prompt);
    return { output: result };
  },
};
providers:
  - echo # Returns the prompt as output

providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      temperature: 0
      response_format:
        type: json_object
- description: 'Returns correct answer'
  vars:
    question: 'What is 2+2?'
  assert:
    - type: contains
      value: '4'
- description: 'Returns valid structured output'
  vars:
    input: 'Banana'
  assert:
    - type: is-json
      value:
        type: object
        required: [color]
        properties:
          color: { type: string }
    - type: javascript
      value: "JSON.parse(output).color.toLowerCase() === 'yellow'"
- description: 'Summary is grounded in source'
  vars:
    article: 'MIT researchers developed an aluminum-sulfur battery...'
  assert:
    - type: llm-rubric
      value: |
        The summary only states facts from this source:
        "{{article}}"
        It does not add, infer, or fabricate any claims.
When models wrap JSON in markdown fences or add extra text, use
options.transform to clean the output before assertions run:
- description: 'Parses JSON from markdown output'
  vars:
    input: 'Give me a JSON object'
  options:
    transform: "output.replace(/```json\\n?|```/g, '').trim()"
  assert:
    - type: is-json
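
To check the cleanup logic against sample outputs before running an eval, the same regex can be mirrored in Python. This is only a sketch for local experimentation: `strip_json_fence` is a hypothetical helper, not part of promptfoo, and the sample output is made up.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown


def strip_json_fence(output: str) -> str:
    """Remove every json fence opener and bare fence marker, then trim whitespace."""
    pattern = re.escape(FENCE) + r"json\n?|" + re.escape(FENCE)
    return re.sub(pattern, "", output).strip()


# A made-up model output that wraps JSON in a markdown fence.
raw = FENCE + 'json\n{"color": "yellow"}\n' + FENCE
print(json.loads(strip_json_fence(raw)))  # {'color': 'yellow'}
```

Like the JavaScript transform, this removes fence markers anywhere in the output, so surrounding prose would still need separate handling.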
# CSV/JSONL/XLSX datasets
tests: file://tests.csv
# Script-generated tests
tests: file://generate_tests.py:create_tests
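
A script referenced as `file://generate_tests.py:create_tests` could look roughly like the sketch below. It assumes the named function returns a list of test-case dicts shaped like the YAML test cases in this cheatsheet; the optional `config` parameter and the question data are hypothetical.

```python
def create_tests(config=None):
    """Build one test case per (question, expected substring) pair."""
    # Hypothetical inline data; a real script might read a database or file.
    cases = [
        ("What is 2+2?", "4"),
        ("What is the capital of France?", "Paris"),
    ]
    return [
        {
            "description": f"Answers: {question}",
            "vars": {"question": question},
            "assert": [{"type": "contains", "value": expected}],
        }
        for question, expected in cases
    ]
```

Generating tests this way keeps the dataset and the assertions in one place, which helps when expected values are derived from the inputs.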
assertionTemplates:
  noHallucination: &noHallucination
    type: llm-rubric
    value: 'Response only contains information supported by the context'
tests:
  - description: 'Grounded response'
    vars: { query: 'What is our refund policy?' }
    assert:
      - *noHallucination
defaultTest:
  assert:
    - type: not-icontains
      value: 'As an AI'
    - type: cost
      threshold: 0.005
    - type: latency
      threshold: 3000
- description: 'Balanced quality check'
  vars:
    question: 'Explain quantum computing'
  assert:
    - type: llm-rubric
      value: 'Technically accurate'
      weight: 3
      metric: accuracy
    - type: llm-rubric
      value: 'Easy to understand for a beginner'
      weight: 1
      metric: clarity
    - type: javascript
      value: "output.split(' ').length <= 200"
      weight: 1
      metric: conciseness
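
To see how the weights above shift the result, here is a minimal sketch that assumes the combined test score is a weight-weighted mean of the per-assertion scores. That aggregation rule is an assumption for illustration, and the per-assertion scores are made up.

```python
def weighted_score(results):
    """results: list of (weight, score) pairs, each score in [0, 1]."""
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        return 0.0
    return sum(weight * score for weight, score in results) / total_weight


# accuracy (weight 3) passes, clarity (weight 1) is half right, conciseness (weight 1) passes
print(weighted_score([(3, 1.0), (1, 0.5), (1, 1.0)]))  # 0.9
```

With equal weights the same scores would average to about 0.83, so weighting accuracy at 3 pulls the total toward the accuracy result.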
Always use --no-cache during development to avoid stale results.
npx promptfoo@latest validate config -c path/to/promptfooconfig.yaml
npx promptfoo@latest eval -c path/to/promptfooconfig.yaml -o output.json --no-cache --no-share
For CI/non-UI workflows, use -o output.json and check success, score, and
error fields.
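
A CI gate over `output.json` might be sketched as follows. The layout is an assumption: the sketch expects per-test rows under `results.results`, each carrying `success`, `score`, and `error`, so verify the path against a real output file from your promptfoo version. `failing_rows` and `main` are hypothetical names.

```python
import json


def failing_rows(report: dict) -> list:
    """Collect rows that failed or errored; assumes rows live under results.results."""
    rows = report.get("results", {}).get("results", [])
    return [row for row in rows if not row.get("success") or row.get("error")]


def main(path: str = "output.json") -> int:
    """Print each failing row and return a process exit code (0 = all passed)."""
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)
    bad = failing_rows(report)
    for row in bad:
        print(f"FAIL score={row.get('score')} error={row.get('error')}")
    return 1 if bad else 0
```

In a pipeline, call `sys.exit(main())` from a small wrapper script so a failing eval fails the job.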
Inside the promptfoo repo, use the local build:
source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c path/to/promptfooconfig.yaml
npm run local -- eval -c path/to/promptfooconfig.yaml -o output.json --no-cache --no-share
Add --env-file .env only when the eval needs local credentials and that file
exists.
Do not run npm run local -- view unless explicitly asked.