
Output Formats

Save and analyze your evaluation results in various formats.

Quick Start

bash
# Interactive web viewer (default)
promptfoo eval

# Save as HTML report
promptfoo eval --output results.html

# Export as JSON for further processing
promptfoo eval --output results.json

# Create CSV for spreadsheet analysis
promptfoo eval --output results.csv

# Generate XML for integration with other tools
promptfoo eval --output results.xml

Available Formats

HTML Report

Generate a visual, shareable report:

bash
promptfoo eval --output report.html

Features:

  • Interactive table with sorting and filtering
  • Side-by-side output comparison
  • Pass/fail statistics
  • Shareable standalone file

Use when: Presenting results to stakeholders or reviewing outputs visually.

JSON Output

Export complete evaluation data:

bash
promptfoo eval --output results.json

Structure:

json
{
  "version": 3,
  "timestamp": "2024-01-15T10:30:00Z",
  "results": {
    "prompts": [...],
    "providers": [...],
    "outputs": [...],
    "stats": {...}
  }
}

Use when: Integrating with other tools or performing custom analysis.
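
For quick post-processing, here's a minimal Python sketch that loads the export and prints the summary stats. The field names are taken from the structure above; verify them against an export from your promptfoo version.

python
import json

# Load the JSON export. The "version", "timestamp", and "results.stats"
# fields follow the structure shown above; treat them as assumptions and
# check a real export from your promptfoo version.
with open('results.json', encoding='utf-8') as f:
    data = json.load(f)

print('Schema version:', data.get('version'))
print('Timestamp:', data.get('timestamp'))

stats = data.get('results', {}).get('stats', {})
for name, value in stats.items():
    print(f'{name}: {value}')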

CSV Export

Create spreadsheet-compatible data:

bash
promptfoo eval --output results.csv

Columns include:

  • Test variables
  • Prompt used
  • Model outputs
  • Pass/fail status
  • Latency
  • Token usage

Use when: Analyzing results in Excel, Google Sheets, or data science tools.
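
Before deeper analysis, a quick standard-library sketch can confirm what the export actually contains; it assumes nothing about column names and simply reports the header and row count.

python
import csv

# Read the CSV export and report its shape. Exact column names can vary by
# promptfoo version, so this sketch only lists whatever header the file has.
with open('results.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(f'Loaded {len(rows)} result rows')
if rows:
    print('Columns:', ', '.join(rows[0].keys()))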

YAML Format

Human-readable structured data:

bash
promptfoo eval --output results.yaml

Use when: Reviewing results in a text editor or version control.
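
If you need to post-process the YAML export, a small sketch using PyYAML works; it assumes the top-level keys mirror the JSON structure shown earlier.

python
import yaml  # PyYAML (pip install pyyaml)

# Load the YAML export and list its top-level keys. Assumes the layout
# mirrors the JSON export shown earlier; verify against your own file.
with open('results.yaml', encoding='utf-8') as f:
    data = yaml.safe_load(f)

print('Top-level keys:', sorted(data.keys()))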

JSONL Format

Each line contains one JSON result:

bash
promptfoo eval --output results.jsonl

Use when: Working with very large evaluations or when JSON export fails with memory errors.

jsonl
{"testIdx":0,"promptIdx":0,"success":true,"score":1.0,"response":{"output":"Response 1"},"gradingResult":{"pass":true,"score":1.0,"reason":"All assertions passed","componentResults":[{"pass":true,"score":1.0,"reason":"Expected output to contain \"hello\"","assertion":{"type":"contains","value":"hello"}}]}}
{"testIdx":1,"promptIdx":0,"success":false,"score":0.0,"response":{"output":"Response 2"},"gradingResult":null}

For assertion-level details, inspect each row's gradingResult?.componentResults array when present. The top-level success, score, and gradingResult fields describe the aggregate result for the row, while each componentResults[] entry contains the pass/fail, score, reason, and assertion metadata for one evaluated assertion. Both gradingResult and componentResults may be absent on error rows or rows without assertions.

To stream a JSONL file and read each row's component results:

ts
import fs from 'node:fs';
import readline from 'node:readline';

const rl = readline.createInterface({
  input: fs.createReadStream('results.jsonl', { encoding: 'utf8' }),
  crlfDelay: Infinity,
});

for await (const line of rl) {
  if (!line.trim()) {
    continue;
  }
  const row = JSON.parse(line);
  for (const component of row.gradingResult?.componentResults ?? []) {
    console.log({
      type: component.assertion?.type,
      pass: component.pass,
      score: component.score,
      reason: component.reason,
    });
  }
}

?. and ?? [] together cover the gradingResult: null case shown above and rows where a single top-level assertion produced no nested componentResults.

XML Format

Structured data for enterprise integrations:

bash
promptfoo eval --output results.xml

Structure:

xml
<promptfoo>
  <evalId>abc-123-def</evalId>
  <results>
    <version>3</version>
    <timestamp>2024-01-15T10:30:00Z</timestamp>
    <prompts>...</prompts>
    <providers>...</providers>
    <outputs>...</outputs>
    <stats>...</stats>
  </results>
  <config>...</config>
  <shareableUrl>...</shareableUrl>
</promptfoo>

Use when: Integrating with enterprise systems, XML-based workflows, or when XML is a requirement.
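
A short parsing sketch with Python's standard library, using the element names shown above (confirm them against a real export before wiring this into a pipeline):

python
import xml.etree.ElementTree as ET

# Element paths (evalId, results/version, results/timestamp) follow the
# structure shown above; confirm them against an actual results.xml.
root = ET.parse('results.xml').getroot()  # <promptfoo>

print('Eval ID:', root.findtext('evalId'))
print('Version:', root.findtext('results/version'))
print('Timestamp:', root.findtext('results/timestamp'))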

Configuration Options

Setting Output Path in Config

yaml
# Specify default output file
outputPath: evaluations/latest_results.html

prompts:
  - '...'
tests:
  - '...'

Multiple Output Formats

Generate multiple formats simultaneously:

bash
# Command line
promptfoo eval --output results.html --output results.json

# Or chain separate runs (each command re-runs the evaluation)
promptfoo eval --output results.json && \
promptfoo eval --output results.csv

Output Contents

Standard Fields

All formats include:

  • timestamp: When the evaluation ran
  • prompts: Prompts used in the evaluation
  • providers: LLM providers tested
  • tests: Test cases with variables
  • outputs: Raw LLM responses
  • results: Pass/fail for each assertion
  • stats: Summary statistics

:::warning

json, yaml, yml, txt, html, and xml outputs include the eval config. Sensitive fields are redacted using Promptfoo's sanitizer rules on a best-effort basis (not comprehensive). Non-sensitive config.env values may still appear in exports.

:::

Detailed Metrics

When available, outputs include the following (a small aggregation sketch follows this list):

  • Latency: Response time in milliseconds
  • Token Usage: Input/output token counts
  • Cost: Estimated API costs
  • Error Details: Failure reasons and stack traces
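
As a rough sketch, these metrics can be aggregated from the JSON export. The latencyMs and cost field names below are assumptions for illustration; inspect one row of your own results.json and rename as needed.

python
import json

# Aggregate per-result metrics from the JSON export. The "latencyMs" and
# "cost" field names are assumptions for illustration; inspect one row of
# your own export and adjust as needed.
with open('results.json', encoding='utf-8') as f:
    outputs = json.load(f).get('results', {}).get('outputs', [])

latencies = [r['latencyMs'] for r in outputs if isinstance(r.get('latencyMs'), (int, float))]
costs = [r['cost'] for r in outputs if isinstance(r.get('cost'), (int, float))]

print(f'Rows: {len(outputs)}')
if latencies:
    print(f'Average latency: {sum(latencies) / len(latencies):.1f} ms')
if costs:
    print(f'Total estimated cost: ${sum(costs):.4f}')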

Analyzing Results

JSON Processing Example

javascript
const fs = require('fs');

// Load results
const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));

// Analyze pass rates by provider
const providerStats = {};
results.results.outputs.forEach((output) => {
  const provider = output.provider;
  if (!providerStats[provider]) {
    providerStats[provider] = { pass: 0, fail: 0 };
  }

  if (output.pass) {
    providerStats[provider].pass++;
  } else {
    providerStats[provider].fail++;
  }
});

console.log('Pass rates by provider:', providerStats);

CSV Analysis with Pandas

python
import pandas as pd

# Load results
df = pd.read_csv('results.csv')

# Group by provider and calculate metrics
summary = df.groupby('provider').agg({
    'pass': 'mean',
    'latency': 'mean',
    'cost': 'sum'
})

print(summary)

Best Practices

1. Organize Output Files

text
project/
├── promptfooconfig.yaml
├── evaluations/
│   ├── 2024-01-15-baseline.html
│   ├── 2024-01-16-improved.html
│   └── comparison.json

2. Use Descriptive Filenames

bash
# Include date and experiment name
promptfoo eval --output "results/$(date +%Y%m%d)-gpt4-temperature-test.html"

3. Version Control Considerations

gitignore
# .gitignore
# Exclude large output files
evaluations/*.html
evaluations/*.json

# But keep summary reports
!evaluations/summary-*.csv

4. Automate Report Generation

bash
#!/bin/bash
# run_evaluation.sh

TIMESTAMP=$(date +%Y%m%d-%H%M%S)
promptfoo eval \
  --output "reports/${TIMESTAMP}-full.json" \
  --output "reports/${TIMESTAMP}-summary.html"

Sharing Results

Web Viewer

The default web viewer (promptfoo view) provides:

  • Real-time updates during evaluation
  • Interactive exploration
  • Local-only (no data sent externally)

Sharing HTML Reports

HTML outputs are self-contained:

bash
# Generate report
promptfoo eval --output team-review.html

# Share via email, Slack, etc.
# No external dependencies required

Promptfoo Share

For collaborative review:

bash
# Share results with your team
promptfoo share

Creates a shareable link with:

  • Read-only access
  • Commenting capabilities
  • No setup required for viewers

Troubleshooting

Large Output Files

For extensive evaluations:

yaml
# Limit output size
outputPath: results.json
sharing:
  # Exclude raw outputs from file
  includeRawOutputs: false

Encoding Issues

Ensure proper encoding for international content:

bash
# Explicitly set encoding
LANG=en_US.UTF-8 promptfoo eval --output results.csv

Performance Tips

  1. Use JSONL for large datasets - avoids memory issues
  2. Use JSON for standard datasets - complete data structure
  3. Generate HTML for presentations - best visual format
  4. Use CSV for data analysis - Excel/Sheets compatible