Back to Promptfoo

Promptfoo Eval Cheatsheet

plugins/promptfoo-evals/skills/promptfoo-evals/references/cheatsheet.md

0.121.1210.2 KB
Original Source

Promptfoo Eval Cheatsheet

Config structure

Field order: description, env, prompts, providers, defaultTest, scenarios, tests.

yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: 'Summarization quality' # 3-10 words

prompts:
  - file://prompts/main.txt # plain text with {{variables}}
  - file://prompts/chat.json # chat format [{role, content}]

providers:
  - openai:chat:gpt-4.1-mini # model ID shorthand
  - id: anthropic:messages:claude-sonnet-4-6
    label: claude # display name in results
    config:
      temperature: 0

defaultTest: # shared across all tests
  assert:
    - type: cost
      threshold: 0.01
    - type: latency
      threshold: 5000

tests:
  - file://tests/*.yaml # glob loads all test files

Environment variables in configs

Use Nunjucks syntax with quotes — shell syntax ($VAR) does not work:

yaml
# ✅ Correct
apiKey: '{{env.OPENAI_API_KEY}}'
baseUrl: '{{env.API_BASE_URL}}'

# ❌ Wrong
apiKey: $OPENAI_API_KEY

Assertion types

Deterministic (use first)

TypeWhat it checks
equalsExact match
contains / icontainsSubstring (case-sensitive / insensitive)
contains-all / contains-anyAll or any substrings
icontains-all / icontains-anyCase-insensitive variants
starts-withPrefix match
regexRegex pattern
is-jsonValid JSON (optional JSON Schema in value)
contains-jsonOutput contains valid JSON
javascriptvalue: "output.length < 100" (return bool or {pass, score, reason})
pythonSame as javascript but Python
costthreshold: 0.01 (max cost in dollars)
latencythreshold: 5000 (max ms)
word-countValidate word count

All deterministic types support not- prefix: not-contains, not-regex, etc.

Similarity

TypeWhat it checks
similarCosine similarity (set threshold: 0.8)
levenshteinEdit distance
rouge-nROUGE score (summarization)
bleuBLEU score (translation)

Model-graded (use sparingly — costs money, non-deterministic)

TypeWhen to use
llm-rubricCustom criteria: value: "Is helpful, accurate, and concise"
factualityCheck factual accuracy against a reference
answer-relevanceIs the answer relevant to the query
context-faithfulnessIs the response grounded in provided context
context-recallDoes the context contain needed info
model-graded-closedqaClosed-domain QA scoring

Tool/function call

TypeWhat it checks
is-valid-openai-tools-callValid OpenAI tools call format
is-valid-openai-function-callValid function call format

Classification

TypeWhat it checks
moderationOpenAI moderation API
is-refusalModel refused to answer
classifierCustom text classification

Special

TypeWhat it does
select-bestCompare all outputs, pick best
humanManual grading via web UI
webhookExternal validation endpoint

Assertion options

Assertions support these optional fields:

yaml
assert:
  - type: icontains
    value: 'expected text'
    weight: 2 # relative importance for scoring (default: 1)
    threshold: 0.8 # assertion-specific (e.g. min score for graded; max for cost/latency)
    metric: 'relevance' # custom metric name for reporting

Model-graded provider selection

Pin the grader model/provider explicitly for stable scoring:

yaml
defaultTest:
  options:
    provider: openai:gpt-5-mini

tests:
  - description: 'Quality check'
    assert:
      - type: llm-rubric
        value: 'Accurate and concise'
        # Optional per-assertion override:
        # provider: anthropic:messages:claude-sonnet-4-6

Provider patterns

LLM providers

yaml
# OpenAI
- openai:chat:gpt-4.1-2025-04-14
- openai:chat:gpt-4.1-mini
- openai:responses:gpt-4.1

# Anthropic
- anthropic:messages:claude-sonnet-4-6
- anthropic:messages:claude-haiku-4-5-20251001

# Google
- google:gemini-2.5-pro
- google:gemini-2.5-flash
- google:gemini-2.0-flash

# AWS Bedrock
- bedrock:anthropic.claude-sonnet-4-6
- bedrock:anthropic.claude-haiku-4-5-20251001-v1:0

# Other
- azure:chat:my-deployment
- groq:llama-3.3-70b-versatile
- ollama:chat:llama3.3
- mistral:mistral-large-latest
- togetherai:meta-llama/Llama-4-Scout-Instruct

HTTP endpoint

yaml
providers:
  - id: https
    label: my-api
    config:
      url: 'https://api.example.com/generate'
      method: POST
      headers:
        Content-Type: application/json
      body:
        prompt: '{{prompt}}'
      transformResponse: 'json.output'

Python provider

yaml
providers:
  - file://provider.py
python
# provider.py
def call_api(prompt, options, context):
    # Call your system under test
    result = my_system(prompt)
    return {"output": result}

JavaScript provider

yaml
providers:
  - file://provider.js
javascript
// provider.js
module.exports = {
  id: () => 'my-provider',
  callApi: async (prompt) => {
    const result = await mySystem(prompt);
    return { output: result };
  },
};

Echo (testing assertions without an LLM)

yaml
providers:
  - echo # Returns the prompt as output

JSON mode (force structured output)

yaml
providers:
  - id: openai:chat:gpt-4.1-mini
    config:
      temperature: 0
      response_format:
        type: json_object

Test patterns

Basic test

yaml
- description: 'Returns correct answer'
  vars:
    question: 'What is 2+2?'
  assert:
    - type: contains
      value: '4'

JSON output validation

yaml
- description: 'Returns valid structured output'
  vars:
    input: 'Banana'
  assert:
    - type: is-json
      value:
        type: object
        required: [color]
        properties:
          color: { type: string }
    - type: javascript
      value: "JSON.parse(output).color.toLowerCase() === 'yellow'"

Faithfulness check (include source in rubric)

yaml
- description: 'Summary is grounded in source'
  vars:
    article: 'MIT researchers developed an aluminum-sulfur battery...'
  assert:
    - type: llm-rubric
      value: |
        The summary only states facts from this source:
        "{{article}}"
        It does not add, infer, or fabricate any claims.

Transform output before assertions

When models wrap JSON in markdown fences or add extra text, use options.transform to clean the output before assertions run:

yaml
- description: 'Parses JSON from markdown output'
  vars:
    input: 'Give me a JSON object'
  options:
    transform: "output.replace(/```json\\n?|```/g, '').trim()"
  assert:
    - type: is-json

Dataset-driven tests (scale)

yaml
# CSV/JSONL/XLSX datasets
tests: file://tests.csv

# Script-generated tests
tests: file://generate_tests.py:create_tests

Reusable assertion templates

yaml
assertionTemplates:
  noHallucination: &noHallucination
    type: llm-rubric
    value: 'Response only contains information supported by the context'

tests:
  - description: 'Grounded response'
    vars: { query: 'What is our refund policy?' }
    assert:
      - *noHallucination

Default test with shared constraints

yaml
defaultTest:
  assert:
    - type: not-icontains
      value: 'As an AI'
    - type: cost
      threshold: 0.005
    - type: latency
      threshold: 3000

Weighted scoring

yaml
- description: 'Balanced quality check'
  vars:
    question: 'Explain quantum computing'
  assert:
    - type: llm-rubric
      value: 'Technically accurate'
      weight: 3
      metric: accuracy
    - type: llm-rubric
      value: 'Easy to understand for a beginner'
      weight: 1
      metric: clarity
    - type: javascript
      value: "output.split(' ').length <= 200"
      weight: 1
      metric: conciseness

CLI commands

Always use --no-cache during development to avoid stale results.

bash
npx promptfoo@latest validate config -c path/to/promptfooconfig.yaml
npx promptfoo@latest eval -c path/to/promptfooconfig.yaml -o output.json --no-cache --no-share

For CI/non-UI workflows, use -o output.json and check success, score, and error fields.

Inside the promptfoo repo, use the local build:

bash
source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c path/to/promptfooconfig.yaml
npm run local -- eval -c path/to/promptfooconfig.yaml -o output.json --no-cache --no-share

Add --env-file .env only when the eval needs local credentials and that file exists.

Do not run npm run local -- view unless explicitly asked.