OpenTelemetry Architecture

docs/design/otel/otel-architecture.md

Overview

Beads uses OpenTelemetry (OTel) for structured observability of all database operations, CLI commands, and Dolt version control. Metrics are emitted via standard OTLP over HTTP to any compatible backend; traces are currently written to stdout only.

Backend-agnostic design: The system emits standard OpenTelemetry Protocol (OTLP) — any OTLP v1.x+ compatible backend can consume it. You are not obligated to use VictoriaMetrics/VictoriaLogs; these are simply development defaults.

Best-effort design: Telemetry initialization errors are returned but do not affect normal bd operation. The system remains functional even when telemetry is unavailable.


Implementation Status

Core Telemetry (Implemented ✅)

| Feature | Status | Notes |
| --- | --- | --- |
| Core OTel initialization | ✅ Implemented | telemetry.Init(), providers setup |
| Metrics export (counters) | ✅ Implemented | Storage operations, Dolt operations |
| Metrics export (histograms) | ✅ Implemented | Operation durations, query latency |
| Traces (stdout only) | ✅ Implemented | OTLP traces via stdout (dev mode) |
| Storage layer instrumentation | ✅ Implemented | InstrumentedStorage wrapper for all storage ops |
| Command lifecycle tracing | ✅ Implemented | Per-command spans with arguments |
| Dolt version control tracing | ✅ Implemented | Commit, push, pull, merge operations |

Dolt Backend Telemetry (Implemented ✅)

| Feature | Status | Notes |
| --- | --- | --- |
| SQL query tracing | ✅ Implemented | All Dolt queries wrapped with spans |
| Dolt lock wait timing | ✅ Defined | bd.db.lock_wait_ms histogram registered; .Record() not yet called |
| Dolt retry counting | ✅ Implemented | bd.db.retry_count counter |
| Dolt circuit breaker | ✅ Implemented | bd.db.circuit_trips, bd.db.circuit_rejected counters |
| Auto-commit tracking | ✅ Implemented | Per-command auto-commit events |
| Working set flush tracking | ✅ Implemented | Flush on shutdown/signal |

Server Lifecycle Telemetry (Not yet instrumented ❌)

internal/doltserver/ has no OTel imports. Server lifecycle spans and metrics are roadmap items (see Tier 1 below).


Roadmap

Current coverage: ~40% of the codebase. Below is a prioritized plan based on operational value vs. implementation effort.

Tier 1 — High value, moderate effort

Tracker integrations (internal/linear/, internal/jira/, internal/gitlab/)

External API calls are currently a black box. No visibility into latency, rate-limiting, or sync volume.

New metrics:

  • bd_tracker_api_calls_total (Counter) — by tracker, method, status
  • bd_tracker_api_latency_ms (Histogram) — by tracker, method
  • bd_tracker_errors_total (Counter) — by tracker, error_type
  • bd_tracker_issues_synced_total (Counter) — by tracker, direction

New spans: tracker.<name>.pull_issues, tracker.<name>.push_issue, tracker.<name>.resolve_state

Git operations (internal/git/)

Git push/pull can dominate wall-clock time but is currently invisible.

New metrics:

  • bd_git_operation_duration_ms (Histogram) — by operation, status
  • bd_git_errors_total (Counter) — by operation, error_type

New spans: git.clone, git.pull, git.push, git.commit, git.merge

Dolt server lifecycle (internal/doltserver/)

Server crashes and restarts are silent. No alerting possible.

New metrics:

  • bd_doltserver_status (Gauge, 1=running/0=stopped)
  • bd_doltserver_startup_ms (Histogram)
  • bd_doltserver_restarts_total (Counter)
  • bd_doltserver_errors_total (Counter) — by error_type

New spans: doltserver.start, doltserver.stop


Tier 2 — Medium value, low effort

Query engine (internal/query/)

Distinguishes whether slowness is client-side (parsing/compilation) or DB-side.

New spans: query.parse, query.compile

New metrics: bd_query_duration_ms (Histogram), bd_query_parse_errors_total (Counter)

Validation engine (internal/validation/)

Data integrity errors are currently silent until they surface as user-visible failures.

New spans: validation.check_dependencies, validation.check_schema

New metrics: bd_validation_errors_total (Counter), by error_type

Dolt version control (internal/storage/dolt/versioned.go)

versioned.go has no OTel imports yet. Future spans:

  • History: Query complete version history for an issue
  • AsOf: Query state at specific commit or branch
  • Diff: Cell-level diff between two commits
  • ListBranches: Enumerate all branches
  • GetCurrentCommit: Get HEAD commit hash
  • GetConflicts: Check for merge conflicts

Dolt system table polling

Periodic SQL queries against Dolt system tables to surface metrics unavailable via OTLP (Dolt has no native OTel export):

| Metric | Source | Frequency |
| --- | --- | --- |
| bd_dolt_commits_per_hour | dolt_log GROUP BY hour | 5 min |
| bd_dolt_working_set_size | dolt_status COUNT(*) | 1 min |
| bd_dolt_branch_count | dolt_branches COUNT(*) | 5 min |
| bd_dolt_conflict_count | dolt_conflicts COUNT(*) | 5 min |
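
The polling described above could be structured roughly as follows. This is a minimal sketch, not code from the repository: `Poller`, `CountFunc`, and `fakeStatusCount` are hypothetical names, and the SQL call is replaced by a plain function so the example runs without a Dolt server. In a real implementation the stored value would presumably be read by an OTel observable gauge callback.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// CountFunc returns one sample, for example the result of
// SELECT COUNT(*) FROM dolt_status. Hypothetical stand-in for a real SQL call.
type CountFunc func() (int64, error)

// Poller keeps the most recent successful sample so that a gauge
// callback could read it without touching the database.
type Poller struct {
	mu     sync.Mutex
	value  int64
	filled bool
}

// Latest returns the last stored sample and whether one exists yet.
func (p *Poller) Latest() (int64, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.value, p.filled
}

// sample runs fn once; on error the previous value is kept.
func (p *Poller) sample(fn CountFunc) error {
	v, err := fn()
	if err != nil {
		return err
	}
	p.mu.Lock()
	p.value, p.filled = v, true
	p.mu.Unlock()
	return nil
}

// Run samples immediately, then once per interval until ctx is cancelled.
func (p *Poller) Run(ctx context.Context, interval time.Duration, fn CountFunc) {
	_ = p.sample(fn) // best-effort: polling errors never stop the loop
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = p.sample(fn)
		}
	}
}

func main() {
	p := &Poller{}
	fakeStatusCount := func() (int64, error) { return 3, nil }
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	p.Run(ctx, 10*time.Millisecond, fakeStatusCount)
	v, ok := p.Latest()
	fmt.Println(v, ok) // prints "3 true"
}
```

Keeping the last good value on error matches the best-effort stance of the rest of the telemetry stack: a failed poll leaves a stale gauge rather than a broken command.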

Tier 3 — Low priority / future

  • Command-level sub-spans: Instrument validation vs. DB vs. render breakdown per command (bd create, bd list, bd compact, etc.)
  • Molecules & recipes: molecule.create, recipe.execute spans
  • Hook duration metrics: Currently only spans (hook.exec), no histogram for aggregation
  • OTel test suite: Integration tests that verify telemetry output (currently none)
  • Lock wait recording: bd.db.lock_wait_ms histogram is registered but .Record() is not yet called

Components

1. Initialization (internal/telemetry/telemetry.go)

The telemetry.Init() function sets up OTel providers on process startup and returns only an error:

go
if err := telemetry.Init(ctx, "bd", version); err != nil {
    // Log and continue — telemetry is best-effort
}
defer telemetry.Shutdown(ctx)

Providers:

  • Metrics: Any OTLP-compatible metrics backend via otlpmetrichttp exporter
  • Traces: Stdout only (local debug). No remote trace backend in default stack.

Default endpoints (when BD_OTEL_METRICS_URL is not set):

  • Metrics: http://localhost:8428/opentelemetry/api/v1/push
  • Traces: stdout (via BD_OTEL_STDOUT=true)

Note: These defaults target VictoriaMetrics for local development convenience. Beads uses standard OTLP — you can override endpoints to use any OTLP v1.x+ compatible backend (Prometheus, Grafana Mimir, Datadog, New Relic, Grafana Cloud, Loki, OpenTelemetry Collector, etc.).
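
The activation and endpoint rules can be sketched as below. This is an illustration under assumptions: the function names `telemetryEnabled` and `metricsEndpoint` are hypothetical, the activation rule is taken from the appendix entry for Enabled(), and the exact interplay between the default URL and activation in telemetry.go may differ.

```go
package main

import (
	"fmt"
	"os"
)

// Default metrics endpoint as documented above (VictoriaMetrics, local dev).
const defaultMetricsURL = "http://localhost:8428/opentelemetry/api/v1/push"

// telemetryEnabled mirrors the documented rule: telemetry is active when
// BD_OTEL_METRICS_URL is set or BD_OTEL_STDOUT is "true".
// getenv is injected so the rule can be exercised without touching the process env.
func telemetryEnabled(getenv func(string) string) bool {
	return getenv("BD_OTEL_METRICS_URL") != "" || getenv("BD_OTEL_STDOUT") == "true"
}

// metricsEndpoint returns the operator override when present,
// otherwise the development default.
func metricsEndpoint(getenv func(string) string) string {
	if u := getenv("BD_OTEL_METRICS_URL"); u != "" {
		return u
	}
	return defaultMetricsURL
}

func main() {
	fmt.Println("enabled:", telemetryEnabled(os.Getenv))
	fmt.Println("metrics endpoint:", metricsEndpoint(os.Getenv))
}
```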

OTLP Compatibility:

  • Uses standard OpenTelemetry Protocol (OTLP) over HTTP
  • Protobuf encoding (VictoriaMetrics, Prometheus, and others accept this)
  • Compatible with any backend that supports OTLP v1.x+

Resource attributes (set at init time):

  • service.name: "bd"
  • service.version: bd binary version
  • host.*: host attributes such as the system hostname (from resource.WithHost())
  • process.*: process attributes (from resource.WithProcess())

Custom resource attributes (via OTEL_RESOURCE_ATTRIBUTES env var or BEADS_ACTOR):

  • bd.actor: Actor identity (from git config or env) — set after actor resolution
  • bd.command: Current command name
  • bd.args: Full arguments passed to command
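
For reference, OTEL_RESOURCE_ATTRIBUTES carries comma-separated key=value pairs. The OTel SDK parses this variable itself, so Beads does not need code like the following; the sketch only illustrates the format, and `parseResourceAttrs` and the sample keys are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseResourceAttrs parses "k1=v1,k2=v2" as used by OTEL_RESOURCE_ATTRIBUTES.
// Malformed pairs are skipped rather than treated as fatal.
func parseResourceAttrs(raw string) map[string]string {
	attrs := make(map[string]string)
	for _, pair := range strings.Split(raw, ",") {
		key, val, found := strings.Cut(pair, "=")
		key = strings.TrimSpace(key)
		if !found || key == "" {
			continue
		}
		val = strings.TrimSpace(val)
		// Values may be percent-encoded; keep the raw value if decoding fails.
		if decoded, err := url.PathUnescape(val); err == nil {
			val = decoded
		}
		attrs[key] = val
	}
	return attrs
}

func main() {
	attrs := parseResourceAttrs("deployment.environment=dev,team=core%20infra")
	fmt.Println(attrs["deployment.environment"]) // prints "dev"
	fmt.Println(attrs["team"])                   // prints "core infra"
}
```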

2. Storage Instrumentation (internal/telemetry/storage.go)

The InstrumentedStorage wraps storage.Storage with OTel tracing and metrics:

  • Every storage method gets a span
  • Counters track operation counts
  • Histograms track operation duration
  • Error counters track failures
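
go
func WrapStorage(s storage.Storage) storage.Storage {
    if !Enabled() {
        return s // Zero overhead when telemetry is disabled
    }
    // tracer, ops, dur, errs, and issueGauge are created from the global
    // OTel providers (construction elided here).
    return &InstrumentedStorage{
        inner:      s,
        tracer:     tracer,
        ops:        ops,
        dur:        dur,
        errs:       errs,
        issueGauge: issueGauge,
    }
}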
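
To show what each wrapped method does, here is a self-contained sketch of the decorator pattern using only the standard library. It is not the actual InstrumentedStorage: `Store`, `opStats`, `memStore`, and `wrap` are hypothetical stand-ins, and plain maps replace the OTel counter and histogram instruments.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Store is a hypothetical one-method stand-in for storage.Storage.
type Store interface {
	GetIssue(id string) (string, error)
}

// opStats stands in for the OTel instruments (operations, errors, duration).
type opStats struct {
	calls    map[string]int
	errors   map[string]int
	duration map[string]time.Duration
}

func newStats() *opStats {
	return &opStats{
		calls:    map[string]int{},
		errors:   map[string]int{},
		duration: map[string]time.Duration{},
	}
}

// record updates all three stats for one finished operation.
func (st *opStats) record(op string, d time.Duration, err error) {
	st.calls[op]++
	st.duration[op] += d
	if err != nil {
		st.errors[op]++
	}
}

// instrumentedStore delegates to inner and records stats around each call.
type instrumentedStore struct {
	inner Store
	stats *opStats
}

func (s *instrumentedStore) GetIssue(id string) (string, error) {
	start := time.Now()
	issue, err := s.inner.GetIssue(id)
	s.stats.record("GetIssue", time.Since(start), err)
	return issue, err
}

// wrap returns inner unchanged when telemetry is disabled.
func wrap(inner Store, enabled bool, stats *opStats) Store {
	if !enabled {
		return inner
	}
	return &instrumentedStore{inner: inner, stats: stats}
}

// memStore is a tiny in-memory Store used only for the demo.
type memStore map[string]string

func (m memStore) GetIssue(id string) (string, error) {
	if v, ok := m[id]; ok {
		return v, nil
	}
	return "", errors.New("issue not found")
}

func main() {
	st := newStats()
	s := wrap(memStore{"bd-1": "fix bug"}, true, st)
	_, _ = s.GetIssue("bd-1")
	_, _ = s.GetIssue("missing")
	fmt.Println(st.calls["GetIssue"], st.errors["GetIssue"]) // prints "2 1"
}
```

Returning the inner store untouched when disabled is what makes the "zero overhead" claim hold: callers never pay for an extra indirection unless telemetry is on.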

Metric names in code (OTel SDK notation with dots):

  • bd.storage.operations → exported as bd_storage_operations_total by Prometheus/VM
  • bd.storage.operation.duration → bd_storage_operation_duration_ms
  • bd.storage.errors → bd_storage_errors_total
  • bd.issue.count → bd_issue_count

Instrumented Storage Operations:

  • Issue CRUD: CreateIssue, GetIssue, UpdateIssue, CloseIssue, DeleteIssue
  • Dependencies: AddDependency, RemoveDependency, GetDependencies
  • Labels: AddLabel, RemoveLabel, GetLabels
  • Queries: SearchIssues, GetReadyWork, GetBlockedIssues
  • Statistics: GetStatistics (also emits gauge of issue counts by status)
  • Transactions: RunInTransaction

3. Dolt Backend Telemetry (internal/storage/dolt/store.go)

Dolt storage layer emits metrics for:

  • bd.db.retry_count: SQL retries in server mode (recorded in withRetry when attempts > 1)
  • bd.db.lock_wait_ms: Histogram registered but .Record() not yet called (stub)
  • bd.db.circuit_trips: Circuit breaker trips to open state (recorded in withRetry)
  • bd.db.circuit_rejected: Requests rejected by open circuit breaker (fail-fast path)
  • SQL query spans via queryContext(), execContext(), queryRowContext() wrappers using doltTracer
  • Dolt version control spans: dolt.commit, dolt.push, dolt.pull, dolt.merge, dolt.branch, dolt.checkout
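
The way the three counters relate to retries and the circuit breaker can be sketched as follows. This is a simplified model, not the withRetry in store.go: the `breaker` and `counters` types, the consecutive-failure threshold, the absence of backoff, and counting each extra attempt as one retry are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errCircuitOpen = errors.New("circuit open")
	errTransient   = errors.New("transient failure") // demo error
)

// counters stands in for bd.db.retry_count, bd.db.circuit_trips,
// and bd.db.circuit_rejected.
type counters struct {
	retries, trips, rejected int
}

// breaker is a deliberately simple circuit breaker: it opens after
// maxFailures consecutive failed calls and stays open.
type breaker struct {
	maxFailures int
	failures    int
	open        bool
}

// withRetry runs fn up to maxAttempts times and updates the counters.
func withRetry(b *breaker, c *counters, maxAttempts int, fn func() error) error {
	if b.open {
		c.rejected++ // fail-fast path: fn is never called
		return errCircuitOpen
	}
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	attempts := 0
	for attempts < maxAttempts {
		attempts++
		if err = fn(); err == nil {
			break
		}
	}
	if attempts > 1 {
		c.retries += attempts - 1 // only the extra attempts count as retries
	}
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.open = true
			c.trips++ // transition to the open state
		}
		return err
	}
	b.failures = 0
	return nil
}

func main() {
	b := &breaker{maxFailures: 2}
	c := &counters{}
	calls := 0
	err := withRetry(b, c, 3, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println(err, c.retries) // prints "<nil> 2"
}
```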

SQL Span pattern (queryContext):

go
func (s *DoltStore) queryContext(ctx context.Context, query string, args ...any) (*sql.Rows, error) {
    ctx, span := doltTracer.Start(ctx, "dolt.query",
        trace.WithSpanKind(trace.SpanKindClient),
        trace.WithAttributes(append(s.doltSpanAttrs(),
            attribute.String("db.operation", "query"),
            attribute.String("db.statement", spanSQL(query)),
        )...),
    )
    var rows *sql.Rows
    err := s.withRetry(ctx, func() error {
        var queryErr error
        rows, queryErr = s.db.QueryContext(ctx, query, args...)
        return queryErr
    })
    endSpan(span, wrapLockError(err))
    return rows, err
}

4. Dolt Version Control Telemetry (internal/storage/dolt/store.go)

Version control operations emit spans directly in store.go via doltTracer.Start(). These are not in versioned.go (which has no OTel imports).

Implemented spans (see Appendix for exact source locations):

  • dolt.commit → CALL DOLT_COMMIT
  • dolt.push → CALL DOLT_PUSH
  • dolt.pull → CALL DOLT_PULL
  • dolt.merge → CALL DOLT_MERGE
  • dolt.branch → CALL DOLT_BRANCH
  • dolt.checkout → CALL DOLT_CHECKOUT

5. Hook Telemetry (internal/hooks/)

Hooks emit a single root span per execution (hook.exec). There are no metric counters or histograms for hooks — only span-level observability. Duration metrics are a roadmap item (Tier 3).


Metric Naming Convention

OTel SDK uses dot-notation internally. Prometheus-compatible backends (VictoriaMetrics, Prometheus) export these as underscore-separated names with type suffixes:

| Code name | Exported name |
| --- | --- |
| bd.storage.operations | bd_storage_operations_total |
| bd.storage.operation.duration | bd_storage_operation_duration_ms |
| bd.storage.errors | bd_storage_errors_total |
| bd.issue.count | bd_issue_count |
| bd.db.retry_count | bd_db_retry_count_total |
| bd.db.lock_wait_ms | bd_db_lock_wait_ms |
| bd.db.circuit_trips | bd_db_circuit_trips_total |
| bd.db.circuit_rejected | bd_db_circuit_rejected_total |
| bd.ai.input_tokens | bd_ai_input_tokens_total |
| bd.ai.output_tokens | bd_ai_output_tokens_total |
| bd.ai.request.duration | bd_ai_request_duration_ms |
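
The mapping in the table can be expressed as a small helper. This is a sketch that reproduces the table's pattern only; the name `exportedName` and the `kind` type are hypothetical, and real backends apply their own (sometimes configurable) translation rules, so actual exported names may differ.

```go
package main

import (
	"fmt"
	"strings"
)

type kind int

const (
	counter kind = iota
	histogram
	gauge
)

// exportedName reproduces the pattern in the table above: dots become
// underscores, a unit suffix is appended unless already present, and
// counters get a "_total" suffix.
func exportedName(name string, k kind, unit string) string {
	out := strings.ReplaceAll(name, ".", "_")
	if unit != "" && !strings.HasSuffix(out, "_"+unit) {
		out += "_" + unit
	}
	if k == counter {
		out += "_total"
	}
	return out
}

func main() {
	fmt.Println(exportedName("bd.storage.operations", counter, ""))           // prints "bd_storage_operations_total"
	fmt.Println(exportedName("bd.storage.operation.duration", histogram, "ms")) // prints "bd_storage_operation_duration_ms"
	fmt.Println(exportedName("bd.db.lock_wait_ms", histogram, "ms"))          // prints "bd_db_lock_wait_ms"
}
```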

Environment Variables

Beads-Level Variables

| Variable | Set by | Description |
| --- | --- | --- |
| BD_OTEL_METRICS_URL | Operator | OTLP metrics endpoint (default: localhost:8428) |
| BD_OTEL_LOGS_URL | Operator | OTLP logs endpoint (reserved for future log export) |
| BD_OTEL_STDOUT | Operator | Opt-in: Write spans and metrics to stderr (dev/debug). Also activates telemetry. |

Context Variables

| Variable | Source | Used By |
| --- | --- | --- |
| BEADS_ACTOR | Git config / env var | Actor identity for audit trails (BD_ACTOR still works as deprecated alias) |
| BD_NAME | Environment | Binary name override (for multi-instance setups) |
| OTEL_RESOURCE_ATTRIBUTES | Operator | Custom resource attributes for all spans |

Dolt-Specific Variables (See DOLT.md)

| Variable | Purpose |
| --- | --- |
| BEADS_DOLT_PASSWORD | Server mode password |
| BEADS_DOLT_SERVER_MODE | Enable server mode |
| BEADS_DOLT_SERVER_HOST | Server host (default: 127.0.0.1) |
| BEADS_DOLT_SERVER_PORT | Server port (default: 3307, 3308 in shared mode, or derived) |
| BEADS_DOLT_SERVER_TLS | Enable TLS for server connections |
| BEADS_DOLT_SERVER_USER | MySQL connection user |
| DOLT_REMOTE_USER | Push/pull auth user |
| DOLT_REMOTE_PASSWORD | Push/pull auth password |

Note: Dolt-specific configuration variables are documented in DOLT.md and are out of scope for OTEL design documentation.


Event Types

CLI Command Events

| Event | Trigger | Key Attributes |
| --- | --- | --- |
| bd.command.<name> | Each bd subcommand execution | bd.command, bd.version, bd.args, bd.actor |

Storage Events

| Event | Trigger | Key Attributes |
| --- | --- | --- |
| storage.CreateIssue | Issue creation | bd.issue.id, bd.issue.type, bd.actor |
| storage.UpdateIssue | Issue update | bd.issue.id, bd.update.count, bd.actor |
| storage.GetIssue | Issue lookup | bd.issue.id |
| storage.SearchIssues | Issue search | bd.query, bd.result.count |
| storage.GetReadyWork | Ready work query | bd.result.count |
| storage.GetBlockedIssues | Blocked issues query | bd.result.count |
| storage.RunInTransaction | Transaction execution | db.commit_msg |

Dolt Events

| Event | Trigger | Key Attributes |
| --- | --- | --- |
| dolt.query | Each SQL query (queryContext) | db.operation, db.statement |
| dolt.exec | Each SQL write (execContext) | db.operation, db.statement |
| dolt.query_row | Single-row queries (queryRowContext) | db.operation, db.statement |
| dolt.commit | DOLT_COMMIT operation | commit_msg |
| dolt.push | DOLT_PUSH operation | dolt.branch |
| dolt.pull | DOLT_PULL operation | dolt.branch |
| dolt.merge | DOLT_MERGE operation | dolt.merge_branch |
| dolt.branch | DOLT_BRANCH operation | dolt.branch |
| dolt.checkout | DOLT_CHECKOUT operation | dolt.branch |

Hooks Events

| Event | Trigger | Key Attributes |
| --- | --- | --- |
| hook.exec | Hook execution (span only, no metric counters) | hook.event, hook.path, bd.issue_id |

Monitoring Gaps

Currently Monitored ✅

| Area | Coverage |
| --- | --- |
| Storage operations | Full (all CRUD, queries, transactions) |
| CLI command lifecycle | Full (all commands with arguments) |
| Dolt SQL queries | Full (all queries via queryContext/execContext wrappers) |
| Dolt retry counting | Full (retry counter incremented in withRetry) |
| Dolt version control | Full (commit, push, pull, merge, branch, checkout) |
| AI compaction | Full (bd.ai.* metrics in compact/haiku.go) |

Not Currently Monitored ❌

| Area | Notes | Operational Impact |
| --- | --- | --- |
| Dolt lock wait time | bd.db.lock_wait_ms registered but .Record() not called | Lock contention invisible |
| Dolt server lifecycle | internal/doltserver/ has no OTel imports | Server crashes are silent |
| Hook execution time | hook.exec span exists but no duration histogram | Cannot detect hook regressions |
| versioned.go operations | versioned.go has no OTel imports | History/AsOf/Diff invisible |
| Dolt server metrics | Dolt has internal metrics but not exposed to OTel | Cannot monitor server health, connection count, query load |
| Working set size | Uncommitted changes count unknown | Cannot detect batch mode accumulation |
| Database size growth | Dolt database size not tracked | Cannot plan capacity or detect bloat |
| Branch proliferation | Branch count not exposed | Cannot detect cleanup needed |
| Remote sync bandwidth | Bytes transferred not tracked | Cannot monitor network usage or cost |
| Query execution plans | EXPLAIN ANALYZE not captured | Cannot identify slow queries |
| Connection pool utilization | Active/idle counts not tracked | Cannot tune connection pool sizing |

Queries

Metrics (Any OTLP-compatible backend)

Operation and retry rates:

promql
sum by (db_operation) (rate(bd_storage_operations_total[5m]))
sum(rate(bd_db_retry_count_total[5m]))

Latency distributions:

promql
histogram_quantile(0.50, sum by (le, db_operation) (rate(bd_storage_operation_duration_ms_bucket[5m])))
histogram_quantile(0.95, sum by (le, db_operation) (rate(bd_storage_operation_duration_ms_bucket[5m])))
histogram_quantile(0.99, sum by (le, db_operation) (rate(bd_storage_operation_duration_ms_bucket[5m])))

Issue counts by status:

promql
bd_issue_count{status="open"}
bd_issue_count{status="in_progress"}
bd_issue_count{status="closed"}
bd_issue_count{status="deferred"}

Dolt Telemetry Capabilities

Dolt Internal Metrics

Important: Dolt does not provide native OpenTelemetry export. A search of the Dolt documentation found no configuration variable or feature that enables OTLP export.

Dolt exposes internal metrics only via:

  • performance_schema tables (MySQL standard, accessible via SQL queries)
  • System tables (dolt_log, dolt_status, dolt_diff, dolt_branches, dolt_conflicts)

Beads implementation: Beads currently queries Dolt metrics via direct SQL (see cmd/bd/doctor/perf_dolt.go) rather than via OTLP. This is intentional — Dolt lacks native OTel support.

Dolt System Tables for Telemetry

| Table | Purpose |
| --- | --- |
| dolt_log | Commit history (queryable for audit) |
| dolt_status | Working set state (uncommitted changes) |
| dolt_diff | Cell-level diff between commits |
| dolt_branches | Branch metadata |
| dolt_conflicts | Merge conflicts (when present) |

Sample Queries for Dolt Telemetry

Commit frequency analysis:

sql
SELECT
    DATE_FORMAT(date, '%Y-%m') AS month,
    COUNT(*) AS commits
FROM dolt_log
GROUP BY month
ORDER BY month DESC;

Working set size tracking:

sql
SELECT
    COUNT(*) AS total_changes,
    SUM(CASE WHEN staged = 1 THEN 1 ELSE 0 END) AS staged_changes,
    SUM(CASE WHEN staged = 0 THEN 1 ELSE 0 END) AS unstaged_changes
FROM dolt_status;

Branch proliferation detection:

sql
SELECT
    COUNT(*) AS branch_count,
    MIN(latest_commit_date) AS oldest,
    MAX(latest_commit_date) AS newest
FROM dolt_branches;

Conflict analysis:

sql
SELECT
    COALESCE(SUM(num_conflicts), 0) AS conflict_count,
    COUNT(*) AS tables_affected
FROM dolt_conflicts;

Backends Compatible with OTLP

| Backend | Notes |
| --- | --- |
| VictoriaMetrics | Default for metrics (localhost:8428); open source. Override with BD_OTEL_METRICS_URL |
| VictoriaLogs | Reserved for future log export. Override with BD_OTEL_LOGS_URL |
| Prometheus | Supports OTLP via its OTLP receiver endpoint (/api/v1/otlp/v1/metrics), which must be enabled; open source |
| Grafana Mimir | Supports OTLP via write endpoint; open source |
| Loki | Requires OTLP bridge (Loki uses different format); open source |
| OpenTelemetry Collector | Universal forwarder to any backend (recommended for production); open source |

Production Recommendation: For production deployments, consider using OpenTelemetry Collector as a sidecar. The Collector provides:

  • Single agent for all telemetry
  • Advanced processing and batching
  • Support for multiple backends simultaneously
  • Better resource efficiency than per-process exporters

Appendix: Source Reference Audit

Audited against main @ 371df32b. All line numbers below refer to that commit.

Every factual claim in this document is backed by a specific source location. This table exists to prevent documentation drift and to make it easy to re-verify after code changes.

Initialization (internal/telemetry/telemetry.go, cmd/bd/main.go)

| Claim | Source |
| --- | --- |
| Init signature: returns only error | telemetry.go:64 |
| Enabled(): true when BD_OTEL_METRICS_URL set or BD_OTEL_STDOUT=true | telemetry.go:53-55 |
| Traces: stdout only when BD_OTEL_STDOUT=true | telemetry.go:84-93 |
| Metrics: HTTP OTLP when BD_OTEL_METRICS_URL set | telemetry.go:131-139 |
| Resource: service.name, service.version | telemetry.go:73-75 |
| Resource: WithHost(), WithProcess() | telemetry.go:76-77 |
| Shutdown(ctx) signature | telemetry.go:162 |
| Init called in PersistentPreRun | main.go:256 |
| Command span started with bd.command, bd.version, bd.args | main.go:262-266 |
| bd.actor set on span after actor resolution | main.go:474 |
| Shutdown called in PersistentPostRun | main.go:681 |

Storage Instrumentation (internal/telemetry/storage.go)

| Claim | Source |
| --- | --- |
| WrapStorage returns original store when telemetry disabled | storage.go:33-36 |
| Metric bd.storage.operations (Counter) | storage.go:38-40 |
| Metric bd.storage.operation.duration (Histogram, ms) | storage.go:41-44 |
| Metric bd.storage.errors (Counter) | storage.go:45-47 |
| Gauge bd.issue.count | storage.go:48-50 |
| CreateIssue attrs: bd.actor, bd.issue.type | storage.go:86 |
| UpdateIssue attrs: bd.issue.id, bd.update.count, bd.actor | storage.go:131 |
| GetIssue attr: bd.issue.id | storage.go:108 |
| SearchIssues attrs: bd.query, bd.result.count | storage.go:162 |
| GetReadyWork attr: bd.result.count | storage.go:283 |
| GetBlockedIssues attr: bd.result.count | storage.go:293 |
| RunInTransaction attr: db.commit_msg | storage.go:393 |
| GetStatistics emits gauge broken down by status | storage.go:349 |
| AddDependency, RemoveDependency, GetDependencies instrumented | storage.go:175, 187, 198 |
| AddLabel, RemoveLabel, GetLabels instrumented | storage.go:243, 254, 265 |

Dolt Backend (internal/storage/dolt/store.go)

| Claim | Source |
| --- | --- |
| doltTracer package-level var | store.go:288 |
| Metric bd.db.retry_count (Counter) registered | store.go:302 |
| retryCount.Add() called when attempts > 1 | store.go:281 |
| Metric bd.db.lock_wait_ms (Histogram) registered | store.go:306 |
| lockWaitMs.Record() never called anywhere | grep store.go for lockWaitMs\.Record → zero matches |
| Metric bd.db.circuit_trips (Counter) registered | store.go:310 |
| circuitTrips.Add() called on circuit open | store.go:265 |
| Metric bd.db.circuit_rejected (Counter) registered | store.go:314 |
| circuitRejected.Add() called on fail-fast | store.go:250, 554 |
| withRetry() function | store.go:247 |
| execContext() uses doltTracer.Start() + withRetry() | store.go:359 |
| queryContext() uses doltTracer.Start() + withRetry() | store.go:396 |
| queryRowContext() uses doltTracer.Start() + withRetry() | store.go:425 |
| Span dolt.commit | store.go:1086 |
| Span dolt.push | store.go:1231, 1266 |
| Span dolt.pull | store.go:1295 |
| Span dolt.merge | store.go:1389 |
| Span dolt.branch | store.go:1357 |
| Span dolt.checkout | store.go:1372 |

versioned.go — no OTel

| Claim | Source |
| --- | --- |
| versioned.go has no OTel imports | versioned.go:1-9 (imports: context, fmt, storage, types only) |

doltserver — no OTel

| Claim | Source |
| --- | --- |
| internal/doltserver/ has no OTel imports | grep internal/doltserver/*.go for otel\|telemetry\|otlp → zero matches |

Hooks (internal/hooks/)

| Claim | Source |
| --- | --- |
| Span hook.exec created in runHook | hooks_unix.go:31 |
| Span attrs: hook.event, hook.path, bd.issue_id | hooks_unix.go:33-35 |
| Stdout/stderr added as span events via addHookOutputEvents | hooks_otel.go:14, 20 |
| hook.stdout / hook.stderr events carry output, bytes attrs | hooks_otel.go:15-16, 21-22 |
| No metric counters or histograms for hooks | grep internal/hooks/ for Counter\|Histogram → zero matches |

AI (internal/compact/haiku.go, cmd/bd/find_duplicates.go)

| Claim | Source |
| --- | --- |
| Metric bd.ai.input_tokens (Counter) | haiku.go:110 |
| Metric bd.ai.output_tokens (Counter) | haiku.go:114 |
| Metric bd.ai.request.duration (Histogram, ms) | haiku.go:118 |
| Metrics initialized lazily via aiMetricsOnce | haiku.go:62, 106 |
| Span anthropic.messages.new | haiku.go:126 |
| Span attrs: bd.ai.model, bd.ai.operation | haiku.go:129-130 |
| Span attrs: bd.ai.input_tokens, bd.ai.output_tokens, bd.ai.attempts | haiku.go:165-167 |
| Retry on HTTP 429 and 5xx | haiku.go:217 |
| find_duplicates.go: span attrs only, no aiMetrics.* calls | find_duplicates.go:429-454 |
| find_duplicates.go: bd.ai.batch_size attr | find_duplicates.go:433 |