docs/adr/002-mcp-server.md
Large Language Models (LLMs) are increasingly being used as assistants for debugging and analyzing distributed systems. Jaeger, as a distributed tracing platform, contains rich observability data that could help LLMs diagnose issues in microservice architectures. However, distributed traces can be massive—a single trace might contain hundreds or thousands of spans—and loading full trace data directly into an LLM's context window is impractical and often counterproductive.
The Model Context Protocol (MCP) is an open standard that facilitates integration between LLM applications and external data sources. MCP defines a structured way for AI agents to discover and invoke tools, access resources, and receive responses in a format optimized for LLM consumption.
The key insight driving this design is progressive disclosure: rather than dumping entire traces into an LLM context, we provide tools that allow the LLM to follow a guided "drill-down" workflow:
This approach prevents context-window exhaustion and forces structured reasoning.
The official MCP Go SDK is available at github.com/modelcontextprotocol/go-sdk, maintained in collaboration with Google. This SDK supports:
Jaeger already provides most of the backend functionality needed:
| MCP Tool | Existing Jaeger Component | Notes |
|---|---|---|
| `get_services` | `QueryService.GetServices()` | Direct mapping |
| `search_traces` | `QueryService.FindTraces()` | Returns metadata; needs filtering |
| `get_trace_topology` | `QueryService.GetTrace()` | Needs post-processing to strip attributes |
| `get_span_details` | `QueryService.GetTrace()` | Needs span-level filtering |
| `get_trace_errors` | `QueryService.GetTrace()` | Needs error status filtering |
| `get_critical_path` | Not available in backend | Only exists in UI (TypeScript) |
> [!IMPORTANT]
> The critical path algorithm currently exists only in the Jaeger UI codebase (`jaeger-ui/packages/jaeger-ui/src/components/TracePage/CriticalPath/index.tsx`). This algorithm must be re-implemented in Go for the MCP server.
Following the pattern established by jaegerquery, the MCP server will be implemented as an OpenTelemetry Collector extension. This provides:
`jaegerquery` extension

> [!NOTE]
> Phase 2 Requirement: The MCP extension will need to retrieve the `QueryService` instance from the `jaegerquery` extension. This will require `jaegerquery` to expose `QueryService` through an Extension interface, similar to how `jaegerstorage` exposes storage factories via the `jaegerstorage.Extension` interface and `GetTraceStoreFactory()` helper function. See `cmd/jaeger/internal/exporters/storageexporter/exporter.go:35` for the reference implementation pattern.
Implement an MCP server as a new extension under `cmd/jaeger/internal/extension/jaegermcp/` that:
`jaegerstorage` for trace data access, similar to `jaegerquery`

```yaml
tools:
  - name: get_services
    description: List available service names. Use this first to discover valid service names for search_traces.
    input_schema:
      pattern: string (optional) - Filter services by pattern (substring match). Future: may support regex or semantic search.
      limit: integer (optional, default: 100) - Maximum number of services to return
    output: List of service names (strings)

  - name: get_span_names
    description: List available span names for a service. Useful for discovering valid span names before using search_traces.
    input_schema:
      service_name: string (required) - Filter by service name. Use get_services to discover valid names.
      pattern: string (optional) - Optional regex pattern to filter span names
      span_kind: string (optional) - Optional span kind filter (e.g., SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL)
      limit: integer (optional, default: 100) - Maximum number of span names to return
    output: List of span names with span kind information

  - name: search_traces
    description: Find traces matching service, time, attributes, and duration criteria. Returns metadata only.
    input_schema:
      start_time_min: string (optional, default: "-1h") - Start of time interval. Supports RFC3339 or relative (e.g., "-1h", "-30m")
      start_time_max: string (optional) - End of time interval. Supports RFC3339 or relative (e.g., "now", "-1m"). Default: now
      service_name: string (required) - Filter by service name. Use get_services to discover valid names.
      span_name: string (optional) - Filter by span name. Use get_span_names to discover valid names.
      attributes: object (optional) - Key-value pairs to match against span/resource attributes (e.g., {"http.status_code": "500"})
      with_errors: boolean (optional) - If true, only return traces containing error spans
      duration_min: duration string (optional, e.g., "2s", "100ms")
      duration_max: duration string (optional)
      limit: integer (default: 10, max: 100)
    output: List of trace summaries (trace_id, service_count, span_count, duration, has_errors)

  - name: get_trace_topology
    description: Get the structural tree of a trace showing parent-child relationships, timing, and error locations. Does NOT return attributes or logs.
    input_schema:
      trace_id: string (required)
      depth: integer (optional, default: 3) - Maximum depth of the tree. 0 for full tree.
    output: Tree structure with span metadata (id, service, span_name, duration, error flag, children[])

  - name: get_critical_path
    description: Identify the sequence of spans forming the critical latency path (the blocking execution path).
    input_schema:
      trace_id: string (required)
    output: Ordered list of spans on the critical path with timing information

  - name: get_span_details
    description: Fetch full details (attributes, events, links, status) for specific spans.
    input_schema:
      trace_id: string (required)
      span_ids: string[] (required, max 20)
    output: Full OTLP span data for requested spans

  - name: get_trace_errors
    description: Get full details for all spans with error status.
    input_schema:
      trace_id: string (required)
    output: Full OTLP span data for error spans only
```
`get_services`: Returns available service names for use in `search_traces`.

Input:

```json
{
  "pattern": "payment",  // optional: substring filter
  "limit": 100           // optional: max results (default: 100)
}
```

Output:

```json
{
  "services": ["payment-service", "payment-gateway", "payment-processor"]
}
```
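The substring filter and result limit described for `get_services` can be sketched as below. This is a minimal illustration, not the extension's handler: `filterServices` is a hypothetical helper, and case-sensitive matching and sorted output are assumptions the ADR does not specify.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// filterServices is a hypothetical helper: it keeps the names that contain
// pattern (case-sensitive substring match, an assumption) and returns at
// most limit of them in sorted order. A non-positive limit falls back to
// the documented default of 100.
func filterServices(all []string, pattern string, limit int) []string {
	if limit <= 0 {
		limit = 100
	}
	matched := make([]string, 0, len(all))
	for _, name := range all {
		if pattern == "" || strings.Contains(name, pattern) {
			matched = append(matched, name)
		}
	}
	sort.Strings(matched)
	if len(matched) > limit {
		matched = matched[:limit]
	}
	return matched
}

func main() {
	services := []string{"frontend", "payment-service", "cart-service", "payment-gateway"}
	fmt.Println(filterServices(services, "payment", 100))
}
```

Applying the limit after filtering keeps the response bounded even when the backend reports thousands of services.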
`search_traces`: Find traces matching criteria. Returns lightweight metadata only (no attributes/events).

Input:

```json
{
  "start_time_min": "-1h",        // required: RFC3339 or relative
  "start_time_max": "now",        // optional: default "now"
  "service_name": "frontend",     // required
  "span_name": "/api/checkout",   // optional
  "attributes": {                 // optional: match span/resource attributes
    "http.status_code": "500",
    "user.id": "12345"
  },
  "with_errors": true,            // optional: filter to error traces
  "duration_min": "2s",           // optional
  "duration_max": "10s",          // optional
  "limit": 10                     // optional: default 10, max 100
}
```

Output:

```json
{
  "traces": [
    {
      "trace_id": "1a2b3c4d5e6f7890",
      "root_service": "frontend",
      "root_span_name": "/api/checkout",
      "start_time": "2024-01-15T10:30:00Z",
      "duration_ms": 2450,
      "span_count": 47,
      "service_count": 8,
      "has_errors": true
    }
  ]
}
```
`get_trace_topology`: Returns the structural skeleton of a trace (parent-child relationships, timing, and error locations) without loading attributes or events. This keeps the response small for LLM context.

Input:

```json
{
  "trace_id": "1a2b3c4d5e6f7890"
}
```

Output:

```json
{
  "trace_id": "1a2b3c4d5e6f7890",
  "root": {
    "span_id": "span_A",
    "service": "frontend",
    "span_name": "/api/checkout",
    "start_time": "2024-01-15T10:30:00Z",
    "duration_ms": 2450,
    "status": "OK",
    "children": [
      {
        "span_id": "span_B",
        "service": "cart-service",
        "span_name": "getCart",
        "start_time": "2024-01-15T10:30:00.050Z",
        "duration_ms": 120,
        "status": "OK",
        "children": []
      },
      {
        "span_id": "span_C",
        "service": "payment-service",
        "span_name": "processPayment",
        "start_time": "2024-01-15T10:30:00.200Z",
        "duration_ms": 2200,
        "status": "ERROR",
        "children": [
          {
            "span_id": "span_D",
            "service": "payment-gateway",
            "span_name": "chargeCard",
            "start_time": "2024-01-15T10:30:00.250Z",
            "duration_ms": 2100,
            "status": "ERROR",
            "children": []
          }
        ]
      }
    ]
  }
}
```
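The post-processing that turns a flat span list into this tree can be sketched as follows. The `flatSpan` and `node` types and the `buildTopology` function are hypothetical stand-ins, not Jaeger's data model. The sketch also assumes the root counts as depth 1 and that a `maxDepth` of 0 means "no limit", matching the `depth` parameter description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// flatSpan is a hypothetical, attribute-free view of a span.
type flatSpan struct {
	SpanID, ParentID, Service, Name, Status string
	StartMs, DurationMs                     int64
}

// node mirrors the shape of the topology output shown above.
type node struct {
	SpanID     string  `json:"span_id"`
	Service    string  `json:"service"`
	SpanName   string  `json:"span_name"`
	DurationMs int64   `json:"duration_ms"`
	Status     string  `json:"status"`
	Children   []*node `json:"children"`
}

// buildTopology links spans by parent ID and returns the tree under the
// first root it finds (a span with no parent, or whose parent is missing).
func buildTopology(spans []flatSpan, maxDepth int) *node {
	known := make(map[string]bool, len(spans))
	for _, s := range spans {
		known[s.SpanID] = true
	}
	children := make(map[string][]flatSpan)
	var root *flatSpan
	for i := range spans {
		if spans[i].ParentID == "" || !known[spans[i].ParentID] {
			if root == nil {
				root = &spans[i]
			}
			continue
		}
		children[spans[i].ParentID] = append(children[spans[i].ParentID], spans[i])
	}
	if root == nil {
		return nil
	}
	return attach(*root, children, 1, maxDepth)
}

// attach builds the subtree for s, stopping once maxDepth levels are emitted.
func attach(s flatSpan, children map[string][]flatSpan, depth, maxDepth int) *node {
	n := &node{SpanID: s.SpanID, Service: s.Service, SpanName: s.Name,
		DurationMs: s.DurationMs, Status: s.Status, Children: []*node{}}
	if maxDepth > 0 && depth >= maxDepth {
		return n
	}
	kids := children[s.SpanID]
	sort.Slice(kids, func(i, j int) bool { return kids[i].StartMs < kids[j].StartMs })
	for _, k := range kids {
		n.Children = append(n.Children, attach(k, children, depth+1, maxDepth))
	}
	return n
}

func main() {
	spans := []flatSpan{
		{"span_D", "span_C", "payment-gateway", "chargeCard", "ERROR", 250, 2100},
		{"span_A", "", "frontend", "/api/checkout", "OK", 0, 2450},
		{"span_C", "span_A", "payment-service", "processPayment", "ERROR", 200, 2200},
		{"span_B", "span_A", "cart-service", "getCart", "OK", 50, 120},
	}
	out, _ := json.MarshalIndent(buildTopology(spans, 0), "", "  ")
	fmt.Println(string(out))
}
```

Because attributes and events are never copied into `node`, the response size grows only with the number of spans, which is the property the topology tool relies on.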
`get_critical_path`: Returns the sequence of spans that form the critical latency path: the "blocking" execution path that directly contributed to total trace duration.

Input:

```json
{
  "trace_id": "1a2b3c4d5e6f7890"
}
```

Output:

```json
{
  "trace_id": "1a2b3c4d5e6f7890",
  "total_duration_ms": 2450,
  "critical_path_duration_ms": 2400,
  "path": [
    {
      "span_id": "span_A",
      "service": "frontend",
      "span_name": "/api/checkout",
      "self_time_ms": 50,
      "section_start_ms": 0,
      "section_end_ms": 50
    },
    {
      "span_id": "span_C",
      "service": "payment-service",
      "span_name": "processPayment",
      "self_time_ms": 100,
      "section_start_ms": 50,
      "section_end_ms": 150
    },
    {
      "span_id": "span_D",
      "service": "payment-gateway",
      "span_name": "chargeCard",
      "self_time_ms": 2100,
      "section_start_ms": 150,
      "section_end_ms": 2250
    },
    {
      "span_id": "span_A",
      "service": "frontend",
      "span_name": "/api/checkout",
      "self_time_ms": 200,
      "section_start_ms": 2250,
      "section_end_ms": 2450
    }
  ]
}
```
> [!NOTE]
> A span may appear multiple times on the critical path (e.g., `span_A` above) if it has work both before and after its children execute.
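The backward walk behind a critical path computation can be illustrated with a simplified sketch. This is not a port of the Jaeger UI code: the `span` and `section` types and both functions are hypothetical, times are millisecond offsets from the trace start, and children that overflow their parent are skipped rather than sanitized. The idea is to start at the end of a span, repeatedly pick the child that finished last before the current cursor, recurse into it, and attribute the gaps to the parent itself. The sample timings are illustrative and are not meant to reproduce the example output above.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical minimal span; Start and End are ms offsets.
type span struct {
	ID, ParentID string
	Start, End   int64
}

// section is a time range on the critical path attributed to one span.
type section struct {
	SpanID     string
	Start, End int64
}

// criticalPath returns critical path sections in chronological order.
func criticalPath(spans []span) []section {
	children := map[string][]span{}
	var root *span
	for i := range spans {
		if spans[i].ParentID == "" {
			if root == nil {
				root = &spans[i]
			}
			continue
		}
		children[spans[i].ParentID] = append(children[spans[i].ParentID], spans[i])
	}
	if root == nil {
		return nil
	}
	var out []section
	walk(*root, children, &out)
	// walk emits sections latest-first; reverse into chronological order.
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}

// walk moves a cursor backward from s.End toward s.Start.
func walk(s span, children map[string][]span, out *[]section) {
	cursor := s.End
	kids := children[s.ID]
	sort.Slice(kids, func(i, j int) bool { return kids[i].End > kids[j].End })
	for _, c := range kids {
		// Skip children that overlap time already attributed, overflow the
		// parent, or have no duration (simplification, see lead-in).
		if c.End > cursor || c.Start < s.Start || c.End <= c.Start {
			continue
		}
		if c.End < cursor {
			*out = append(*out, section{s.ID, c.End, cursor}) // parent self time
		}
		walk(c, children, out)
		cursor = c.Start
	}
	if s.Start < cursor {
		*out = append(*out, section{s.ID, s.Start, cursor})
	}
}

func main() {
	spans := []span{
		{"span_A", "", 0, 2450},
		{"span_B", "span_A", 50, 170},
		{"span_C", "span_A", 200, 2400},
		{"span_D", "span_C", 250, 2350},
	}
	for _, sec := range criticalPath(spans) {
		fmt.Printf("%s [%d, %d]\n", sec.SpanID, sec.Start, sec.End)
	}
}
```

In this sample the sections tile the whole root span, so their durations add up to the root's 2450 ms, and `span_A` and `span_C` each appear more than once, as the note above describes.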
`get_span_details`: Fetch full OTLP span data for specific spans. Use this only after identifying suspicious spans via topology or critical path.

Input:

```json
{
  "trace_id": "1a2b3c4d5e6f7890",
  "span_ids": ["span_C", "span_D"]  // max 20 spans
}
```

Output:

```json
{
  "trace_id": "1a2b3c4d5e6f7890",
  "spans": [
    {
      "span_id": "span_C",
      "trace_id": "1a2b3c4d5e6f7890",
      "parent_span_id": "span_A",
      "service": "payment-service",
      "span_name": "processPayment",
      "start_time": "2024-01-15T10:30:00.200Z",
      "duration_ms": 2200,
      "status": {
        "code": "ERROR",
        "message": "Upstream service timeout"
      },
      "attributes": {
        "http.method": "POST",
        "http.url": "http://payment-gateway/charge",
        "http.status_code": "504",
        "retry.count": "3"
      },
      "events": [
        {
          "name": "retry_attempt",
          "timestamp": "2024-01-15T10:30:00.700Z",
          "attributes": {"attempt": "1"}
        },
        {
          "name": "retry_attempt",
          "timestamp": "2024-01-15T10:30:01.200Z",
          "attributes": {"attempt": "2"}
        }
      ],
      "links": []
    },
    {
      "span_id": "span_D",
      "trace_id": "1a2b3c4d5e6f7890",
      "parent_span_id": "span_C",
      "service": "payment-gateway",
      "span_name": "chargeCard",
      "start_time": "2024-01-15T10:30:00.250Z",
      "duration_ms": 2100,
      "status": {
        "code": "ERROR",
        "message": "Connection timeout to payment processor"
      },
      "attributes": {
        "db.system": "postgresql",
        "db.statement": "SELECT * FROM transactions WHERE...",
        "net.peer.name": "payment-db.internal",
        "net.peer.port": "5432"
      },
      "events": [],
      "links": []
    }
  ]
}
```
`get_trace_errors`: Shortcut to get full details for all error spans in a trace.

Input:

```json
{
  "trace_id": "1a2b3c4d5e6f7890"
}
```

Output:

```json
{
  "trace_id": "1a2b3c4d5e6f7890",
  "error_count": 2,
  "spans": [
    // Same format as get_span_details output
    // Contains only spans where status.code == "ERROR"
  ]
}
```
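The error filter amounts to keeping spans whose status code is `"ERROR"` and counting them. A minimal sketch follows, using hypothetical `span` and `errorsResult` types rather than the OTLP data model.

```go
package main

import "fmt"

// span is a hypothetical stand-in carrying only what the filter needs.
type span struct {
	SpanID     string
	StatusCode string
}

// errorsResult mirrors the shape of the get_trace_errors output above.
type errorsResult struct {
	TraceID    string
	ErrorCount int
	Spans      []span
}

// traceErrors keeps only spans whose status code is "ERROR".
func traceErrors(traceID string, spans []span) errorsResult {
	res := errorsResult{TraceID: traceID, Spans: []span{}}
	for _, s := range spans {
		if s.StatusCode == "ERROR" {
			res.Spans = append(res.Spans, s)
		}
	}
	res.ErrorCount = len(res.Spans)
	return res
}

func main() {
	spans := []span{
		{"span_A", "OK"}, {"span_B", "OK"}, {"span_C", "ERROR"}, {"span_D", "ERROR"},
	}
	res := traceErrors("1a2b3c4d5e6f7890", spans)
	fmt.Println(res.ErrorCount, res.Spans)
}
```

Initializing `Spans` to an empty slice rather than `nil` means a trace without errors would serialize as `"spans": []` instead of `null`.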
```yaml
extensions:
  jaeger_mcp:
    # HTTP endpoint for MCP protocol (Streamable HTTP transport)
    http:
      endpoint: "0.0.0.0:4320"

    # Storage configuration (references jaegerstorage extension)
    storage:
      traces: "some_storage"

    # Server identification for MCP protocol
    server_name: "jaeger"
    server_version: "${version}"

    # Limits
    max_span_details_per_request: 20
    max_search_results: 100
```
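Validation for these settings might look like the following sketch. The `Config` struct here is a flattened, hypothetical stand-in whose field names are assumptions, and the specific checks are illustrative rather than the extension's actual rules.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Config is a hypothetical, flattened view of the YAML keys above.
type Config struct {
	HTTPEndpoint             string // http.endpoint
	TraceStorage             string // storage.traces
	ServerName               string // server_name
	MaxSpanDetailsPerRequest int    // max_span_details_per_request
	MaxSearchResults         int    // max_search_results
}

// Validate collects every problem instead of stopping at the first one.
func (c Config) Validate() error {
	var errs []error
	if _, _, err := net.SplitHostPort(c.HTTPEndpoint); err != nil {
		errs = append(errs, fmt.Errorf("http.endpoint: %w", err))
	}
	if c.TraceStorage == "" {
		errs = append(errs, errors.New("storage.traces must not be empty"))
	}
	if c.MaxSpanDetailsPerRequest <= 0 {
		errs = append(errs, errors.New("max_span_details_per_request must be positive"))
	}
	if c.MaxSearchResults <= 0 {
		errs = append(errs, errors.New("max_search_results must be positive"))
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	good := Config{
		HTTPEndpoint:             "0.0.0.0:4320",
		TraceStorage:             "some_storage",
		ServerName:               "jaeger",
		MaxSpanDetailsPerRequest: 20,
		MaxSearchResults:         100,
	}
	fmt.Println("valid config:", good.Validate())
	fmt.Println("empty config:", Config{}.Validate())
}
```

Joining the errors lets an operator fix a broken configuration in one pass instead of rerunning after each message.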
```
cmd/jaeger/internal/extension/jaegermcp/
├── README.md
├── config.go              # Configuration struct and validation
├── config_test.go
├── factory.go             # Extension factory (NewFactory, createDefaultConfig)
├── factory_test.go
├── server.go              # Extension lifecycle (Start, Shutdown, Dependencies)
├── server_test.go
└── internal/
    ├── criticalpath/      # Critical path algorithm (ported from UI)
    │   ├── criticalpath.go
    │   └── criticalpath_test.go
    ├── handlers/          # MCP tool handlers
    │   ├── search_traces.go
    │   ├── search_traces_test.go
    │   ├── get_trace_topology.go
    │   ├── get_critical_path.go
    │   ├── get_span_details.go
    │   ├── get_span_details_test.go
    │   ├── get_trace_errors.go
    │   └── get_trace_errors_test.go
    └── types/             # Response types for MCP tools (one file per handler)
        ├── search_traces.go
        ├── get_span_details.go
        └── get_trace_errors.go
```
- `github.com/modelcontextprotocol/go-sdk` to dependencies

Extension Scaffold ✅

- `jaegermcp` extension directory structure
- `config.go` with configuration validation
- `factory.go` following `jaegerquery` pattern
- `server.go` with lifecycle management

MCP Server Setup ✅

- `github.com/modelcontextprotocol/go-sdk` dependency

Storage Integration ✅

- `jaegerstorage` extension for trace reader access

Implement get_services Tool ✅

- `QueryService.GetServices()`

4b. Implement get_span_names Tool ✅

- `QueryService.GetOperations()`

Implement search_traces Tool ✅

- `QueryService.FindTraces()`

Implement get_span_details Tool ✅

- `QueryService.GetTrace()`

Implement get_trace_errors Tool ✅

- `QueryService.GetTrace()`

Implement get_trace_topology Tool ✅

- `QueryService.GetTrace()`

Port Critical Path Algorithm ✅

- `jaeger-ui/packages/jaeger-ui/src/components/TracePage/CriticalPath/`
- `internal/criticalpath/`
- `findLastFinishingChildSpan()` - Find LFC for a span
- `sanitizeOverFlowingChildren()` - Handle child spans that exceed parent duration
- `computeCriticalPath()` - Main recursive algorithm

Implement get_critical_path Tool ✅

Configuration and Observability

Documentation

- `README.md` for the extension

Integration Testing
| Component | Testing Approach |
|---|---|
| `config.go` | Test validation with valid/invalid configs |
| `factory.go` | Test factory creation and default config |
| `server.go` | Test lifecycle with mock storage extension |
| Critical path algorithm | Port test cases from TypeScript tests; use same expected results |
| Tool handlers | Mock `QueryService`; test input validation, response format, error handling |
Extension Lifecycle

- `jaegerstorage`

MCP Protocol Compliance

- `tools/list`)

End-to-End Scenarios

Reuse existing test fixtures from:

- `cmd/jaeger/internal/extension/jaegerquery/internal/fixture/` - Sample traces
- `jaeger-ui/packages/jaeger-ui/src/components/TracePage/CriticalPath/testCases/` - Critical path test cases

- `make test` target

- `cmd/jaeger/internal/extension/jaegerquery/`
- `jaeger-ui/packages/jaeger-ui/src/components/TracePage/CriticalPath/index.tsx`
- `design.md`