Promptfoo supports OpenTelemetry (OTLP) tracing to help you understand the internal operations of your LLM providers during evaluations.
This feature allows you to collect detailed performance metrics and debug complex provider implementations.
Promptfoo acts as an OpenTelemetry receiver, collecting traces from your providers and displaying them in the web UI. This eliminates the need for external observability infrastructure during development and testing.
Tracing provides visibility into the internal operations of your providers: which steps ran, how long each took, and where errors occurred.
Promptfoo automatically instruments its built-in providers with OpenTelemetry spans following GenAI Semantic Conventions. When tracing is enabled, every provider call creates spans with standardized attributes.
The following providers have built-in instrumentation:
| Provider | Automatic Tracing |
|---|---|
| OpenAI | ✓ |
| Anthropic | ✓ |
| Azure OpenAI | ✓ |
| AWS Bedrock | ✓ |
| Google Vertex AI | ✓ |
| Ollama | ✓ |
| Mistral | ✓ |
| Cohere | ✓ |
| Huggingface | ✓ |
| IBM Watsonx | ✓ |
| HTTP | ✓ |
| OpenRouter | ✓ |
| Replicate | ✓ |
| OpenAI-compatible (DeepSeek, Perplexity, etc.) | ✓ (inherited) |
| Cloudflare AI | ✓ (inherited) |
Each provider call creates a span with these attributes:
Request Attributes:
- `gen_ai.system` - Provider system (e.g., "openai", "anthropic", "azure", "bedrock")
- `gen_ai.operation.name` - Operation type ("chat", "completion", "embedding")
- `gen_ai.request.model` - Model name
- `gen_ai.request.max_tokens` - Max tokens setting
- `gen_ai.request.temperature` - Temperature setting
- `gen_ai.request.top_p` - Top-p setting
- `gen_ai.request.stop_sequences` - Stop sequences

Response Attributes:

- `gen_ai.usage.input_tokens` - Input/prompt token count
- `gen_ai.usage.output_tokens` - Output/completion token count
- `gen_ai.usage.total_tokens` - Total token count
- `gen_ai.usage.cached_tokens` - Cached token count (if applicable)
- `gen_ai.usage.reasoning_tokens` - Reasoning token count (for o1, DeepSeek-R1)
- `gen_ai.response.finish_reasons` - Finish/stop reasons

Promptfoo-specific Attributes:

- `promptfoo.provider.id` - Provider identifier
- `promptfoo.test.index` - Test case index
- `promptfoo.prompt.label` - Prompt label
- `promptfoo.cache_hit` - Whether the response was served from cache
- `promptfoo.request.body` - The request body sent to the provider (truncated to 4KB)
- `promptfoo.response.body` - The response body from the provider (truncated to 4KB)

When calling OpenAI's GPT-4:
```
Span: chat gpt-4
├─ gen_ai.system: openai
├─ gen_ai.operation.name: chat
├─ gen_ai.request.model: gpt-4
├─ gen_ai.request.max_tokens: 1000
├─ gen_ai.request.temperature: 0.7
├─ gen_ai.usage.input_tokens: 150
├─ gen_ai.usage.output_tokens: 85
├─ gen_ai.usage.total_tokens: 235
├─ gen_ai.response.finish_reasons: ["stop"]
├─ promptfoo.provider.id: openai:chat:gpt-4
└─ promptfoo.test.index: 0
```
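Custom providers can emit the same attribute names so their spans line up with the built-in instrumentation. A minimal sketch, assuming you already have an active OpenTelemetry `span` and a `usage` object from your LLM client (both hypothetical here):

```js
// Mirror the GenAI semantic-convention attributes on a custom provider span.
// `span` and `usage` are assumed to exist; attribute names follow the tables above.
span.setAttributes({
  'gen_ai.system': 'my-llm',
  'gen_ai.operation.name': 'chat',
  'gen_ai.request.model': 'my-model',
  'gen_ai.usage.input_tokens': usage.promptTokens,
  'gen_ai.usage.output_tokens': usage.completionTokens,
});
```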
Add tracing configuration to your `promptfooconfig.yaml`:

```yaml
tracing:
  enabled: true # Required to send OTLP telemetry
  otlp:
    http:
      enabled: true # Required to start the built-in OTLP receiver
```
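To sanity-check that the receiver is up, you can post a minimal OTLP JSON payload to it. This is only a hand-rolled smoke test against the standard OTLP endpoint, not a documented Promptfoo API; the trace and span IDs below are arbitrary hex values:

```js
// Minimal smoke test: POST a single span to the local OTLP receiver.
// Run `promptfoo eval` with tracing enabled first.
const start = Date.now();
const payload = {
  resourceSpans: [
    {
      resource: {
        attributes: [{ key: 'service.name', value: { stringValue: 'smoke-test' } }],
      },
      scopeSpans: [
        {
          scope: { name: 'smoke-test' },
          spans: [
            {
              traceId: '0af7651916cd43dd8448eb211c80319c',
              spanId: 'b7ad6b7169203331',
              name: 'provider.call',
              kind: 1, // SPAN_KIND_INTERNAL
              startTimeUnixNano: `${start}000000`, // ms -> ns, as a string
              endTimeUnixNano: `${start + 500}000000`,
            },
          ],
        },
      ],
    },
  ],
};

fetch('http://localhost:4318/v1/traces', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(payload),
}).then((res) => console.log('receiver responded:', res.status));
```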
Promptfoo passes a W3C trace context to providers via the `traceparent` field. Use this to create child spans:

```js
const { trace, context, propagation, SpanStatusCode } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { resourceFromAttributes } = require('@opentelemetry/resources');

// Initialize tracer (SDK 2.x API - pass spanProcessors to the constructor)
const provider = new NodeTracerProvider({
  resource: resourceFromAttributes({ 'service.name': 'my-provider' }),
  spanProcessors: [
    new SimpleSpanProcessor(
      new OTLPTraceExporter({
        url: 'http://localhost:4318/v1/traces',
      }),
    ),
  ],
});
provider.register();

const tracer = trace.getTracer('my-provider');

module.exports = {
  async callApi(prompt, promptfooContext) {
    // Parse trace context from Promptfoo
    if (promptfooContext.traceparent) {
      // extract() lives on the propagation API, not on trace
      const activeContext = propagation.extract(context.active(), {
        traceparent: promptfooContext.traceparent,
      });
      return context.with(activeContext, async () => {
        const span = tracer.startSpan('provider.call');
        try {
          // Your provider logic here
          span.setAttribute('prompt.length', prompt.length);
          const result = await yourLLMCall(prompt);
          span.setStatus({ code: SpanStatusCode.OK });
          return { output: result };
        } catch (error) {
          span.recordException(error);
          span.setStatus({
            code: SpanStatusCode.ERROR,
            message: error.message,
          });
          throw error;
        } finally {
          span.end();
        }
      });
    }
    // Fallback for when tracing is disabled
    return { output: await yourLLMCall(prompt) };
  },
};
```
After running an evaluation, view traces in the web UI:

1. Run your evaluation:

   ```sh
   promptfoo eval
   ```

2. Open the web UI:

   ```sh
   promptfoo view
   ```

3. Click the magnifying glass (🔎) icon on any test result
4. Scroll to the "Trace Timeline" section
Once traces are flowing into Promptfoo, you can evaluate what the agent actually did, not just the final answer:
```yaml
tests:
  - vars:
      order_id: '123'
    assert:
      - type: trajectory:tool-used
        value: search_orders
      - type: trajectory:tool-args-match
        value:
          name: search_orders
          args:
            order_id: '{{ order_id }}'
      - type: trajectory:tool-sequence
        value:
          steps:
            - search_orders
            - compose_reply
      - type: trajectory:goal-success
        value: 'Determine the shipping status for order {{ order_id }} and tell the user whether it has shipped'
        provider: openai:gpt-5-mini
```
Use trajectory assertions when your spans identify tools, commands, searches, reasoning steps, or messages. Promptfoo also normalizes common command-like tool spans, including OpenAI Agents SDK `exec_command` calls with `cmd` arguments, into command trajectory steps. For traced tool calls, Promptfoo recognizes both generic attributes such as `tool.name` and `tool.arguments` and framework-specific ones such as Vercel AI SDK's `ai.toolCall.name`, `ai.toolCall.args`, `ai.toolCall.arguments`, and `ai.toolCall.input`. If you only need raw span counts, durations, or error detection, use `trace-span-count`, `trace-span-duration`, or `trace-error-spans`.
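For instance, a custom provider might wrap each tool call in a span like this (a hypothetical sketch; `tracer` is an OpenTelemetry tracer set up as in the earlier example, and the tool implementation is a placeholder):

```js
// Emit a tool-call span with the generic attributes Promptfoo recognizes,
// so trajectory assertions like trajectory:tool-used can match it.
async function tracedToolCall(name, args, impl) {
  const span = tracer.startSpan(`tool.${name}`);
  span.setAttribute('tool.name', name);
  span.setAttribute('tool.arguments', JSON.stringify(args));
  try {
    return await impl(args);
  } finally {
    span.end();
  }
}

// Usage: await tracedToolCall('search_orders', { order_id: '123' }, searchOrders);
```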
The full receiver configuration:

```yaml
tracing:
  enabled: true # Enable/disable tracing
  otlp:
    http:
      enabled: true # Required to start the OTLP receiver
      # port: 4318 # Optional - defaults to 4318 (standard OTLP HTTP port)
      # host: '0.0.0.0' # Optional - defaults to '0.0.0.0'
      # acceptFormats: ['json', 'protobuf'] # Optional - defaults to both
```
Promptfoo's OTLP receiver accepts traces in both JSON and protobuf formats:
| Format | Content-Type | Use Case |
|---|---|---|
| JSON | `application/json` | JavaScript/TypeScript (default) |
| Protobuf | `application/x-protobuf` | Python (default), Go, Java, and other languages |
Protobuf is more efficient for serialization and produces smaller payloads. Python's OpenTelemetry SDK uses protobuf by default.
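If you want protobuf payloads from Node.js as well, the OTLP proto exporter is a drop-in swap for the HTTP/JSON exporter used earlier. A sketch, assuming `@opentelemetry/exporter-trace-otlp-proto` is installed:

```js
// Swap the JSON exporter for the protobuf one; the endpoint stays the same,
// only the Content-Type changes to application/x-protobuf.
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');

const exporter = new OTLPTraceExporter({
  url: 'http://localhost:4318/v1/traces',
});
```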
You can also configure tracing via environment variables:
```sh
# Enable tracing
export PROMPTFOO_TRACING_ENABLED=true

# Configure OTLP endpoint (for providers)
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"

# Set service name
export OTEL_SERVICE_NAME="my-rag-application"

# Authentication headers (if needed)
export OTEL_EXPORTER_OTLP_HEADERS="api-key=your-key"
```
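The standard OpenTelemetry exporters read the `OTEL_EXPORTER_OTLP_*` variables themselves, so a provider that constructs an exporter without an explicit endpoint should pick them up automatically. A sketch:

```js
// With OTEL_EXPORTER_OTLP_ENDPOINT set, no url needs to be passed here;
// the exporter appends /v1/traces to the configured base endpoint.
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const exporter = new OTLPTraceExporter();
```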
Forward traces to external observability platforms:
```yaml
tracing:
  enabled: true
  otlp:
    http:
      enabled: true
  forwarding:
    enabled: true
    endpoint: 'http://jaeger:4318' # or Tempo, Honeycomb, etc.
    headers:
      'api-key': '{{ env.OBSERVABILITY_API_KEY }}'
```
For complete provider implementation details, see the JavaScript Provider documentation. For tracing-specific examples, see the OpenTelemetry tracing example.
Key points:
- Use `SimpleSpanProcessor` for immediate trace export
- Extract the incoming trace context from `traceparent`
- Set `tool.name` or `function.name` when you want to use trajectory assertions
- Vercel AI SDK spans are normalized from `ai.toolCall.name` plus the matching `ai.toolCall.args` / `ai.toolCall.arguments` / `ai.toolCall.input` attributes into trajectory tool steps

For complete provider implementation details, see the Python Provider documentation. For a working example with protobuf tracing, see the Python OpenTelemetry tracing example. For a framework-specific example that exports OpenAI Agents SDK traces into Promptfoo, see the OpenAI Agents Python SDK guide.
:::note
Python's `opentelemetry-exporter-otlp-proto-http` package uses protobuf format by default (`application/x-protobuf`), which is more efficient than JSON.
:::
```python
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Setup - uses protobuf format by default
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def call_api(prompt, options, context):
    # Extract trace context
    if 'traceparent' in context:
        ctx = extract({"traceparent": context["traceparent"]})
        with tracer.start_as_current_span("provider.call", context=ctx) as span:
            span.set_attribute("prompt.length", len(prompt))
            # Your provider logic here
            result = your_llm_call(prompt)
            return {"output": result}

    # Fallback without tracing
    return {"output": your_llm_call(prompt)}
```
If you only need provider-level timing for a Python provider, enable the wrapper OTEL path by installing the Python OpenTelemetry packages and setting `PROMPTFOO_ENABLE_OTEL=true`. Add custom child spans only when you want internal workflow visibility such as tools, searches, or multi-step agent trajectories.
Promptfoo includes a built-in trace viewer that displays all collected telemetry data. Since Promptfoo functions as an OTLP receiver, you can view traces directly without configuring external tools like Jaeger or Grafana Tempo.
The web UI displays traces as a hierarchical timeline showing:
```
[Root Span: provider.call (500ms)]
├─ [Retrieve Documents (100ms)]
├─ [Prepare Context (50ms)]
└─ [LLM Generation (300ms)]
```
Each bar's width represents its duration relative to the total trace time. Hover over any span to see its timing details, and click the expand icon on any span to reveal a detailed attributes panel. This is useful for inspecting the full request and response bodies (`promptfoo.request.body` and `promptfoo.response.body`) and debugging provider behavior.
Trace reads redact credential-like attribute keys such as authorization headers, cookies, API keys, tokens, secrets, and passwords before displaying or exporting spans. GenAI token counters such as `gen_ai.usage.input_tokens` remain visible. Avoid placing secrets in custom span attributes because raw attributes may still be retained in the local trace store for internal evaluation workflows.
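For example (a hypothetical sketch; `token` stands in for any credential your provider handles):

```js
// Avoid: even though the UI redacts credential-like keys, the raw attribute
// may still be retained in the local trace store.
span.setAttribute('http.request.header.authorization', token);

// Prefer: record only non-sensitive metadata about the credential.
span.setAttribute('auth.header.present', Boolean(token));
```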
Click the Export Traces button to download all traces for the current evaluation or test case as a JSON file. The export includes every collected span with its timings, attributes, and status.
The exported JSON can be imported into external tools like Jaeger, Grafana Tempo, or custom analysis scripts.
Use descriptive, hierarchical span names:
```js
// Good
'rag.retrieve_documents';
'rag.rank_results';
'llm.generate_response';

// Less informative
'step1';
'process';
'call_api';
```
Include context that helps debugging:
```js
span.setAttributes({
  'prompt.tokens': tokenCount,
  'documents.count': documents.length,
  'model.name': 'gpt-4',
  'cache.hit': false,
});
```
Always record exceptions and set error status:
```js
try {
  // Operation
} catch (error) {
  span.recordException(error);
  span.setStatus({
    code: SpanStatusCode.ERROR,
    message: error.message,
  });
  throw error;
}
```
Add metadata that appears in the UI:
```js
span.setAttributes({
  'user.id': userId,
  'feature.flags': JSON.stringify(featureFlags),
  version: packageVersion,
});
```
Reduce overhead in high-volume scenarios:
```js
const { TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider({
  sampler: new TraceIdRatioBasedSampler(0.1), // Sample 10% of traces
});
```
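A common refinement is to let child spans follow the sampling decision made at the trace root, so each trace is either kept whole or dropped whole. A sketch using the same SDK package:

```js
// Sample 10% of traces at the root; every child span inherits its
// parent's decision instead of re-rolling independently.
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});
```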
Trace across multiple services:
```js
const { context, propagation } = require('@opentelemetry/api');

// Service A: Forward trace context (inject() is on the propagation API)
const headers = {};
propagation.inject(context.active(), headers);
await fetch(serviceB, { headers });

// Service B: Extract and continue trace
const extractedContext = propagation.extract(context.active(), request.headers);
```
If traces are not appearing:

- Confirm `tracing.enabled: true` in your config
- Verify the receiver is reachable at `http://localhost:4318/v1/traces`
- Log the `traceparent` value to ensure it's being passed to your provider

If you see `context.active is not a function`, rename the OpenTelemetry import:
```js
// Avoid conflict with promptfoo context parameter
const { context: otelContext } = require('@opentelemetry/api');

module.exports = {
  async callApi(prompt, promptfooContext) {
    // Use otelContext for OpenTelemetry
    // Use promptfooContext for Promptfoo's context
  },
};
```
Use `BatchSpanProcessor` instead of `SimpleSpanProcessor` for production use (see the sketch after the commands below).

Enable debug logs to troubleshoot:

```sh
# Promptfoo debug logs
DEBUG=promptfoo:* promptfoo eval

# OpenTelemetry debug logs
OTEL_LOG_LEVEL=debug promptfoo eval
```
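As for the production processor mentioned above, a minimal sketch of the swap, using the same packages as the provider example earlier:

```js
// Batch spans in memory and export them periodically instead of one at a time.
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');

const provider = new NodeTracerProvider({
  spanProcessors: [
    new BatchSpanProcessor(new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' })),
  ],
});
```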
Tracing each phase of a RAG pipeline as child spans:

```js
const { trace, context: otelContext, SpanStatusCode } = require('@opentelemetry/api');

async function ragPipeline(query, context) {
  const span = tracer.startSpan('rag.pipeline');
  // SDK 2.x has no { parent } span option; pass a context carrying the parent instead
  const parentCtx = trace.setSpan(otelContext.active(), span);
  try {
    // Retrieval phase
    const retrieveSpan = tracer.startSpan('rag.retrieve', undefined, parentCtx);
    const documents = await vectorSearch(query);
    retrieveSpan.setAttribute('documents.count', documents.length);
    retrieveSpan.end();

    // Reranking phase
    const rerankSpan = tracer.startSpan('rag.rerank', undefined, parentCtx);
    const ranked = await rerank(query, documents);
    rerankSpan.setAttribute('documents.reranked', ranked.length);
    rerankSpan.end();

    // Generation phase
    const generateSpan = tracer.startSpan('llm.generate', undefined, parentCtx);
    const response = await llm.generate(query, ranked);
    generateSpan.setAttribute('response.tokens', response.tokenCount);
    generateSpan.end();

    span.setStatus({ code: SpanStatusCode.OK });
    return response;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
```
Running several models in parallel under one parent span:

```js
async function compareModels(prompt, context) {
  const span = tracer.startSpan('compare.models');
  const parentCtx = trace.setSpan(otelContext.active(), span);
  const models = ['gpt-4', 'claude-3', 'llama-3'];

  const promises = models.map(async (model) => {
    const modelSpan = tracer.startSpan(`model.${model}`, undefined, parentCtx);
    try {
      const result = await callModel(model, prompt);
      modelSpan.setAttribute('model.name', model);
      modelSpan.setAttribute('response.latency', result.latency);
      return result;
    } finally {
      modelSpan.end();
    }
  });

  const results = await Promise.all(promises);
  span.end();
  return results;
}
```
When running red team tests, tracing provides a powerful capability: traces from your application's internal operations can be fed back to adversarial attack strategies, allowing them to craft more sophisticated attacks based on what they observe.
This creates a feedback loop: the target application emits traces while an attack turn runs, Promptfoo summarizes those traces, and the strategy uses the summary to shape its next attempt.
When red team tracing is enabled, adversarial strategies receive visibility into the target's internal execution flow: which tools ran, what the guardrails decided, and where errors occurred.
Example trace summary provided to an attacker:
```
Trace 0af76519 • 5 spans

Execution Flow:
1. [1.2s] llm.generate (client) | model=gpt-4
2. [300ms] guardrail.check (internal) | tool=content-filter
3. [150ms] tool.database_query (server) | tool=search
4. [50ms] guardrail.check (internal) | ERROR: Rate limit exceeded

Key Observations:
• Guardrail content-filter decision: blocked
• Tool call search via "tool.database_query"
• Error span "guardrail.check": Rate limit exceeded
```
The attacker can now craft a follow-up attack that targets what it observed, for example probing the `content-filter` guardrail or leaning on the rate-limited error path.

Enable red team tracing in your `promptfooconfig.yaml`:
```yaml
tracing:
  enabled: true
  otlp:
    http:
      enabled: true

redteam:
  tracing:
    enabled: true
    # Feed traces to attack generation (default: true)
    includeInAttack: true
    # Feed traces to grading (default: true)
    includeInGrading: true
    # Filter which spans to include
    spanFilter:
      - 'llm.*'
      - 'guardrail.*'
      - 'tool.*'
  plugins:
    - harmful
  strategies:
    - jailbreak # Iterative strategy that benefits from trace feedback
```
Different attack strategies can use different tracing settings:
```yaml
redteam:
  tracing:
    enabled: true
  strategies:
    # Jailbreak benefits from seeing all internal operations
    jailbreak:
      includeInAttack: true
      maxSpans: 100
    # Crescendo focuses on guardrail decisions
    crescendo:
      includeInAttack: true
      spanFilter:
        - 'guardrail.*'
```
See the red team tracing example for a complete working implementation.
For more details on red team testing with tracing, see How to Red Team LLM Agents.