Plan 14 — Telemetry Reliability Signals

plans/14-telemetry-reliability-signals.md


Adds the five highest-value missing telemetry signals identified by the 2026-06-10 capture-surface audit. Theme: we instrument success well; failure is invisible. Every signal below feeds the Reliability sentence of plans/2026-06-09-telemetry-metrics-spec.md ("Core pipeline succeeds X% of the time at scale") — plus retrieval quality, which today has no KPI at all.

Phases are self-contained: each can be executed in a fresh chat context. Execute in order; Phases 1–4 are independent of each other, but all depend on Phase 0's facts and share the pipeline ritual below.


Phase 0 — Verified facts, allowed APIs, and the every-property ritual

Consolidated from 5 documentation-discovery agents (all high confidence, all findings cite read code). Do not invent APIs beyond this list.

The pipeline ritual — EVERY new property or event must touch all five surfaces

| # | Surface | Location | What to do |
|---|---------|----------|------------|
| 1 | Scrub whitelist | src/services/telemetry/scrub.ts:8-82 (ALLOWED_PROPERTY_KEYS: Set<string>) | Add the key, grouped with a category comment like the existing ones |
| 2 | Scrub tests | tests/telemetry/scrub.test.ts | Copy the pattern at :5-31 (single-group) or :81-106 (multi-key group); also confirm :139-169 drop-tests still pass |
| 3 | Public docs | docs/public/telemetry.mdx fields table :26-75, events table :78-89 | Add a row per field; new events get an events-table row |
| 4 | CLI disclosure | src/npx-cli/commands/telemetry.ts COLLECTED_FIELDS:23-66, EVENT_NAMES:68-77 | Add a line per field; new event names go in EVENT_NAMES |
| 5 | Capture site | per phase below | Emit via captureEvent / captureCliEvent only |

Allowed APIs (verified signatures)

  • captureEvent(event: string, props?: Record<string, unknown>, opts?: { person?: boolean }): void, at src/services/telemetry/telemetry.ts:72 (worker transport; consent-gated, scrubbed, fire-and-forget)
  • captureCliEvent(event, props?, opts?): Promise<void>, at src/services/telemetry/cli-telemetry.ts:22 (short-lived-process transport; direct POST, hard 2s timeout CAPTURE_TIMEOUT_MS at :15, never throws)
  • scrubProperties(props): Record<string, string | number | boolean>, at src/services/telemetry/scrub.ts:91-114 (drops non-whitelisted keys and non-primitives silently; strings clamped to 200 chars; numbers must be finite)
  • collectInstallStats(db): Record<string, number>, at src/services/telemetry/install-stats.ts:29
  • getUptimeSeconds(startedAtMs: number, now?): number, at src/shared/uptime.ts:5-7
  • writePidFile(info: PidInfo) / readPidFile(): PidInfo | null / removePidFile(), at src/services/infrastructure/ProcessManager.ts:134/141/156; PidInfo = { pid, port, startedAt: string /* ISO8601 */, startToken? } (src/supervisor/process-registry.ts:49-54)
  • recordWorkerUnreachable(): number, at src/shared/worker-utils.ts:451-470 (returns the consecutive-failure count; persists atomically in ~/.claude-mem/state/hook-failures.json; threshold default 3, env CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD)
  • classifyObserverOutput(raw): 'xml'|'idle'|'prose'|'poisoned', at src/sdk/output-classifier.ts:60-80
  • verifyCommitHashesInText(...): CommitVerificationResult with fabricated: string[], at src/sdk/commit-verification.ts:69-108
  • DATA_DIR / paths.workerPid() etc. — src/shared/paths.ts:40,129-151

Global anti-patterns (from discovery; apply to every phase)

  • Properties not added to ALLOWED_PROPERTY_KEYS are silently dropped — no error. Always whitelist first, then emit.
  • Only number | boolean | closed-enum string. Never free text, paths, queries, error messages, IDs derived from the user. (An earlier audit draft proposed error_summary: string — explicitly rejected.)
  • person: true only on lifecycle events (spec constraint, plans/2026-06-09-telemetry-metrics-spec.md:65-71). Nothing in this plan adds person properties; do not touch PERSON_PROPERTY_KEYS.
  • Never bypass captureEvent/captureCliEvent with direct PostHog calls.
  • Debug-mode verification harness: CLAUDE_MEM_TELEMETRY_DEBUG=1 prints would-be payloads to stderr and sends nothing (telemetry.ts:97-103).
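
To make the silent-drop behaviour concrete, here is a simplified stand-in for the scrub step. It is not the code in scrub.ts: the names scrubSketch and ALLOWED_KEYS_SKETCH are hypothetical, the whitelist is a two-key subset, and "clamped" is assumed to mean truncation to 200 characters.

```typescript
// Simplified stand-in for the whitelist scrub described above. NOT the real scrub.ts.
type Primitive = string | number | boolean;

// Hypothetical two-key subset of ALLOWED_PROPERTY_KEYS, for illustration only.
const ALLOWED_KEYS_SKETCH: Set<string> = new Set(["search_strategy", "result_count"]);
const MAX_STRING_LENGTH = 200;

function scrubSketch(props: Record<string, unknown>): Record<string, Primitive> {
  const out: Record<string, Primitive> = {};
  for (const key of Object.keys(props)) {
    if (!ALLOWED_KEYS_SKETCH.has(key)) continue; // not whitelisted: dropped, no error
    const value = props[key];
    if (typeof value === "string") {
      out[key] = value.slice(0, MAX_STRING_LENGTH); // assumption: clamp means truncate
    } else if (typeof value === "number") {
      if (Number.isFinite(value)) out[key] = value; // NaN and Infinity are dropped
    } else if (typeof value === "boolean") {
      out[key] = value;
    }
    // Objects, arrays, null and undefined fall through and are dropped.
  }
  return out;
}
```

Under these assumptions, scrubSketch({ result_count: 0, fallback_reason: "none" }) keeps result_count and loses fallback_reason without any warning, which is why each phase whitelists its keys before emitting them.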

Discovery discrepancy to resolve during Phase 2

One agent reported INVALID_OUTPUT_RESPAWN_THRESHOLD = 25, another = 3. Read src/services/worker/agents/ResponseProcessor.ts:25 before relying on the value.


Phase 1 — Retrieval quality: result_count + strategy/fallback on search_performed

Narrative served: Reliability + retrieval quality. Zero-result rate becomes computable; Chroma's silent degradation to FTS becomes visible (the recurring SQLiteSearchStrategy Database error incident class).

Verified obstacles (do not skip)

  • The existing capture is a middleware: SearchRoutes.ts:117-123 inside res.once('finish') — it fires after the response, outside handler scope. It can see only endpoint, res.statusCode, and elapsed time. Result arrays, totalResults (computed at SearchManager.ts:307), chromaFailed (SearchManager.ts:158, 206, 274) and chromaFailureReason (SearchManager.ts:267-275) are method-local and unreachable from there.
  • SearchManager.search() has three paths: filter-only SQLite (:165-176), Chroma (:179-286, sets chromaFailed on error), Chroma-not-initialized FTS (:288-305). Text-format responses (:420-425) do not carry counts; only format='json' (:309-316) includes totalResults.
  • search_strategy is already whitelisted (scrub.ts:55); only the new keys need whitelist entries.

What to implement

  1. In SearchManager.search(), build a small telemetry envelope alongside the existing return value — do not change response shapes. Collect: result_count (the totalResults already computed at :307), search_strategy: 'chroma' | 'fts' | 'filter_only' (one per path above), chroma_available: boolean (false when chromaFailed or not initialized), fallback_reason: 'none' | 'chroma_connection' | 'chroma_error' | 'chroma_not_initialized' (map from chromaFailureReason.isConnectionError at :271; never the message). Expose it to callers — recommended: return { ...existing, telemetry } for an internal caller, or set it on a mutable param. Simplest verified-safe plumbing: handlers stash it on res.locals.searchTelemetry, and the middleware at SearchRoutes.ts:117-123 spreads res.locals.searchTelemetry ?? {} into the existing captureEvent('search_performed', …) props.
  2. Whitelist result_count, chroma_available, fallback_reason (ritual #1–4).
  3. Note: src/services/worker/search/types.ts:53-64 has a StrategySearchResult with a strategy field but SearchManager.search() does not use it — derive strategy from the three paths; do not refactor onto SearchOrchestrator here.
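
The envelope from step 1 could look roughly like the sketch below. The SearchOutcome shape and both function names are hypothetical; only the four property names and their enum values come from this plan. The sketch follows the "one strategy per path" wording literally, so a failed Chroma attempt still reports search_strategy: 'chroma' and is distinguished by chroma_available and fallback_reason. Whether that case should report 'fts' instead is not settled here.

```typescript
type SearchStrategy = "chroma" | "fts" | "filter_only";
type FallbackReason =
  | "none"
  | "chroma_connection"
  | "chroma_error"
  | "chroma_not_initialized";

interface SearchTelemetry {
  result_count: number;
  search_strategy: SearchStrategy;
  chroma_available: boolean;
  fallback_reason: FallbackReason;
}

// Hypothetical summary of what search() knows at return time.
interface SearchOutcome {
  path: "filter_only" | "chroma" | "fts_not_initialized";
  totalResults: number;
  chromaFailed: boolean;
  isConnectionError: boolean; // only meaningful when chromaFailed is true
}

function buildSearchTelemetry(o: SearchOutcome): SearchTelemetry {
  const notInitialized = o.path === "fts_not_initialized";

  // Enum only: the failure message itself never leaves this function.
  let fallback_reason: FallbackReason = "none";
  if (notInitialized) {
    fallback_reason = "chroma_not_initialized";
  } else if (o.chromaFailed) {
    fallback_reason = o.isConnectionError ? "chroma_connection" : "chroma_error";
  }

  return {
    result_count: o.totalResults, // 0 is reported as 0, never omitted
    search_strategy: notInitialized ? "fts" : o.path === "chroma" ? "chroma" : "filter_only",
    chroma_available: !(o.chromaFailed || notInitialized),
    fallback_reason,
  };
}

// Middleware side: merge whatever the handler stashed, or nothing at all.
function mergeSearchProps(
  base: Record<string, unknown>,
  locals: { searchTelemetry?: SearchTelemetry },
): Record<string, unknown> {
  return { ...base, ...(locals.searchTelemetry ?? {}) };
}
```

In an Express handler the result would be assigned to res.locals.searchTelemetry, and the existing finish-time middleware would pass res.locals to mergeSearchProps. Handlers that never set it leave the event unchanged.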

Verification

  • bun test tests/telemetry/ green (new scrub cases included)
  • npm run typecheck:root clean
  • CLAUDE_MEM_TELEMETRY_DEBUG=1 + a worker search request prints search_performed with result_count, search_strategy, chroma_available, fallback_reason
  • Grep guard: grep -n "fallback_reason" src/services/telemetry/scrub.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts hits all three
  • Zero-result search shows result_count: 0 (not missing)

Anti-pattern guards

  • Do NOT try to introspect the response body from the middleware (no res._getBuffer()-style Express internals — unverified, fragile).
  • Do NOT put chromaFailureReason.message in any property — enum only.
  • Do NOT change the text-format response shape consumed by clients.

Phase 2 — Compression quality: fabrication, invalid-output, and abort reasons on session_compressed

Narrative served: Reliability + model quality (extends yesterday's tokens/cost/ratio work with per-model trust signals).

Verified mechanics (this is the key to doing it right)

  • compressionProps is built at ResponseProcessor.ts:194-214. Non-SDK providers emit immediately (:228); the SDK/Claude path stashes the object into session.pendingCompressionEvent (worker-types.ts:60) at :216-226, and ClaudeProvider.ts:416-435 later merges real token fields and emits; :442-445 is the no-result fallback emit. Therefore: any property added to compressionProps automatically flows through all three emit paths.
  • Fabrication scope: ResponseProcessor.ts:115-135 already computes fabricated: string[] via verifyCommitHashesInText.
  • Invalid output: ResponseProcessor.ts:48-88 returns early — no event fires at all on that path today. session.consecutiveInvalidOutputs (worker-types.ts:34) increments at :54, resets at :92; respawn decision at :67-79 (outputClass === 'poisoned' OR threshold reached — read the threshold at :25, see Phase 0 discrepancy).
  • abortReason enum (worker-types.ts:42): 'idle'|'shutdown'|'overflow'|'restart-guard'|'quota'|string|null; set at ClaudeProvider.ts:270 (note: 'quota:…' prefix format), :315, SessionManager.ts:272,294,407; consumed at SessionRoutes.ts:166-167. The error-path emit is SessionRoutes.ts:154-163.

What to implement

  1. Fabrication: in ResponseProcessor.ts where fabricated.length is known (:128-135), add to compressionProps: fabrication_detected: boolean, fabricated_count: number. (Flows through deferred path for free.)
  2. Invalid output: at the respawn decision (:67-79) — and ONLY when a respawn triggers, to bound volume — emit one captureEvent('session_compressed', { outcome: 'invalid_output', invalid_output_class, consecutive_invalid_outputs, respawn_triggered: true, provider, model, ide, hook }) where invalid_output_class is the classifier value ('idle'|'prose'|'poisoned').
  3. Abort reason: in the error-path emit (SessionRoutes.ts:154-163), add abort_reason normalized to a closed enum: 'idle'|'shutdown'|'overflow'|'restart_guard'|'quota'|'none'. Split the 'quota:…' format on ':' and map 'restart-guard' to 'restart_guard'.
  4. Whitelist fabrication_detected, fabricated_count, invalid_output_class, consecutive_invalid_outputs, respawn_triggered, abort_reason (ritual #1–4).
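
Steps 1 and 3 reduce to two small pure functions, sketched here. The function names are hypothetical. One behaviour is an assumption rather than something this plan specifies: any raw abortReason that is not in the known set (the source type allows arbitrary strings) is mapped to 'none', because the closed enum has no 'other' member.

```typescript
type AbortReasonEnum = "idle" | "shutdown" | "overflow" | "restart_guard" | "quota" | "none";

// Raw prefix (text before the first ':') mapped to its closed-enum value.
const ABORT_REASON_MAP = new Map<string, AbortReasonEnum>([
  ["idle", "idle"],
  ["shutdown", "shutdown"],
  ["overflow", "overflow"],
  ["restart-guard", "restart_guard"],
  ["quota", "quota"],
]);

function normalizeAbortReason(raw: string | null | undefined): AbortReasonEnum {
  if (!raw) return "none";
  const prefix = raw.split(":")[0] ?? ""; // 'quota:…' becomes 'quota'
  // Assumption: unrecognized strings collapse to 'none' instead of leaking free text.
  return ABORT_REASON_MAP.get(prefix) ?? "none";
}

// Fabrication fields derived from the fabricated: string[] that is already computed.
function fabricationProps(fabricated: string[]): {
  fabrication_detected: boolean;
  fabricated_count: number;
} {
  return {
    fabrication_detected: fabricated.length > 0,
    fabricated_count: fabricated.length,
  };
}
```

Spreading fabricationProps(fabricated) into compressionProps is enough for the fields to reach all three emit paths. Only the count and the flag are sent, never the hashes.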

Verification

  • bun test tests/telemetry/ green; npm run typecheck:root clean
  • Debug-mode session_compressed payload shows fabrication_detected: false, fabricated_count: 0 on a normal compression
  • Grep guard: grep -rn "abort_reason" src/services/telemetry/scrub.ts src/services/worker/http/routes/SessionRoutes.ts both hit
  • Confirm the deferred path carries new props: grep the built plugin/scripts/worker-service.cjs for fabrication_detected after npm run build

Anti-pattern guards

  • Do NOT emit an event per invalid output (volume) — respawn-gated only.
  • Do NOT send raw abortReason strings ('quota:daily', 'restart-guard') — normalize to the closed enum first; the scrubber will happily pass any ≤200-char string, so enum discipline is on the emitter.
  • Do NOT add the new props anywhere except compressionProps for the fabrication fields — adding them only at the ClaudeProvider merge would miss non-SDK providers.
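
The respawn gate on the invalid-output event can be expressed as a function that returns either the props to emit or null. This is a sketch: the function name is hypothetical, "threshold reached" is assumed to mean consecutive >= threshold, and the threshold is a parameter because its real value is still to be read from ResponseProcessor.ts:25. The provider, model, ide and hook fields from step 2 are left out for brevity.

```typescript
type InvalidOutputClass = "idle" | "prose" | "poisoned";

interface InvalidOutputProps {
  outcome: "invalid_output";
  invalid_output_class: InvalidOutputClass;
  consecutive_invalid_outputs: number;
  respawn_triggered: true;
}

function invalidOutputProps(
  outputClass: InvalidOutputClass,
  consecutive: number,
  threshold: number, // pass in the verified value; do not hard-code 3 or 25
): InvalidOutputProps | null {
  // Assumption: the respawn condition is "poisoned, or count has reached the threshold".
  const respawn = outputClass === "poisoned" || consecutive >= threshold;
  if (!respawn) return null; // no respawn, no event: this bounds the volume
  return {
    outcome: "invalid_output",
    invalid_output_class: outputClass,
    consecutive_invalid_outputs: consecutive,
    respawn_triggered: true,
  };
}
```

A caller would emit session_compressed only when the result is non-null, so a run of invalid outputs below the threshold produces no events at all.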

Phase 3 — Worker lifecycle: crash detection, worker_stopped, heartbeat health

Narrative served: Reliability ("crash-free installs") + makes the DAU/uptime data trustworthy.

Verified mechanics

  • PID file already stores startedAt ISO8601 (worker-service.ts:289, PidInfo at process-registry.ts:49-54) → previous uptime is computable on next start via Date.parse(startedAt).
  • There is NO shutdown sentinel today; marker-file pattern to copy: ProcessManager.ts:232-254 (.chroma-cleaned-v10.3) — write to DATA_DIR.
  • Graceful shutdown: worker-service.ts:565-585; shutdownTelemetry() is called at :576 and races a 3s flush (telemetry.ts:124-144) — an event captured before :576 will flush. Stop-case removePidFile() is at :836.
  • worker_started captures: :427 (trigger start, person: true), :436 (heartbeat, 24h setInterval with .unref() at :435-438); props builder buildLifecycleProps() at :401-426.
  • uncaughtException handler at :1075-1078 logs and does NOT exit (known smell — out of scope here, do not change process semantics in this plan).

What to implement

  1. Clean-shutdown sentinel: in the shutdown path (before :576), write DATA_DIR/.worker-clean-shutdown containing the ISO timestamp (copy the marker pattern from ProcessManager.ts:232-254). Delete the sentinel at startup after reading it.
  2. Crash detection on start: in the startup daemon path, before writePidFile, derive:
    • stale PID file present + no sentinel → previous_shutdown: 'crash'
    • sentinel present → 'clean'
    • neither (first run) → 'unknown'
    • previous_uptime_seconds: for a clean shutdown, the interval from the stale PID file's startedAt to the sentinel timestamp; for a crash, omit the property rather than guess, since the time of death is unknown (omitted properties are fine). Add previous_shutdown and previous_uptime_seconds to the existing captureEvent('worker_started', …) at :427.
  3. worker_stopped event: immediately before shutdownTelemetry() at :576, captureEvent('worker_stopped', { uptime_seconds, shutdown_reason }) with uptime_seconds from getUptimeSeconds(this.startTime) (worker-service.ts:122, uptime.ts:5-7) and shutdown_reason: 'stop' | 'restart' | 'signal' from the caller. No person: true.
  4. Heartbeat health: in the heartbeat payload (:436 / buildLifecycleProps), add process_rss_mb and heap_used_mb as integers from process.memoryUsage() (Math.round(rss / 1024 / 1024)).
  5. Whitelist previous_shutdown, previous_uptime_seconds, uptime_seconds, shutdown_reason, process_rss_mb, heap_used_mb; add worker_stopped to EVENT_NAMES and the docs events table (ritual #1–4).
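
The decision table in step 2 can be sketched as a pure function over two optional ISO timestamps: startedAt from a stale PID file and the contents of the clean-shutdown sentinel. The function name is hypothetical, and flooring to whole seconds is an assumption (the rounding used by getUptimeSeconds is not stated here).

```typescript
type PreviousShutdown = "clean" | "crash" | "unknown";

interface PreviousRunProps {
  previous_shutdown: PreviousShutdown;
  previous_uptime_seconds?: number; // omitted whenever it cannot be computed honestly
}

function derivePreviousRun(
  staleStartedAt: string | null, // startedAt from a leftover PID file, if any
  sentinelTimestamp: string | null, // contents of the clean-shutdown sentinel, if any
): PreviousRunProps {
  if (sentinelTimestamp !== null) {
    const props: PreviousRunProps = { previous_shutdown: "clean" };
    if (staleStartedAt !== null) {
      const uptimeMs = Date.parse(sentinelTimestamp) - Date.parse(staleStartedAt);
      // Unparseable or negative spans are left out rather than reported.
      if (Number.isFinite(uptimeMs) && uptimeMs >= 0) {
        props.previous_uptime_seconds = Math.floor(uptimeMs / 1000);
      }
    }
    return props;
  }
  if (staleStartedAt !== null) {
    // PID file survived but no sentinel: the worker died without shutting down.
    return { previous_shutdown: "crash" };
  }
  return { previous_shutdown: "unknown" }; // first run, or nothing left to inspect
}
```

The returned object can be spread into the worker_started props. Reading the sentinel and deleting it should happen before this call, so a later crash does not inherit a stale 'clean'.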

Verification

  • bun test tests/telemetry/ green; npm run typecheck:root clean
  • Debug mode: worker-service restart prints worker_stopped (reason restart) then worker_started with previous_shutdown: 'clean'
  • Kill -9 the worker, start it: worker_started shows previous_shutdown: 'crash'
  • Heartbeat payload contains integer process_rss_mb
  • Sentinel file is removed after startup reads it (no stale 'clean' after a later crash)

Anti-pattern guards

  • Do NOT compute uptime from in-memory startTime for the previous run — it's never persisted; use the PID file's startedAt.
  • Do NOT emit worker_stopped after shutdownTelemetry(): isShutdown (telemetry.ts:81) drops late events by design.
  • Do NOT add the new keys to PERSON_PROPERTY_KEYS (spec ingestion-cost constraint).
  • process.memoryUsage().rss is bytes — convert; the scrubber drops non-finite numbers silently.
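
A minimal sketch of the byte-to-megabyte conversion for the heartbeat fields. The function name is hypothetical. It takes a plain object with the rss and heapUsed fields that Node's process.memoryUsage() returns, which keeps the conversion testable without a running worker.

```typescript
// Subset of the object returned by process.memoryUsage(); both values are in bytes.
interface MemorySnapshot {
  rss: number;
  heapUsed: number;
}

const BYTES_PER_MB = 1024 * 1024;

function heartbeatMemoryProps(mem: MemorySnapshot): {
  process_rss_mb: number;
  heap_used_mb: number;
} {
  return {
    process_rss_mb: Math.round(mem.rss / BYTES_PER_MB),
    heap_used_mb: Math.round(mem.heapUsed / BYTES_PER_MB),
  };
}
```

In the worker this would be called as heartbeatMemoryProps(process.memoryUsage()) and spread into the heartbeat props. Both outputs are integers, so they pass the finite-number check in the scrubber.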

Phase 4 — hook_failed event (threshold-gated, CLI transport)

Narrative served: Reliability — a failing hook is silent memory loss; today the fail-loud counter only writes to the user's stderr.

Verified constraints (these dictate the design — read before coding)

  • Hooks are short-lived processes (<1s typical). The worker transport (posthog-node batching) can never flush there; and emitting via the worker API is self-defeating (the defining failure IS "worker unreachable"). Transport must be captureCliEvent (cli-telemetry.ts:22, direct POST, 2s cap, never throws).
  • The trap: exitGraceful (hook-io.ts:166-173) and emitBlockingError (hook-io.ts:150-159) call process.exit() immediately and do not await pending promises — a fire-and-forget POST is killed mid-flight. The emit must be awaited before the exit call, inside the failure branch.
  • Catch taxonomy lives at hook-command.ts:99-128: AdapterRejectedInput (:100-105), non-blocking input error (:106-111), worker-unavailable (:112-119, the only branch calling recordWorkerUnreachable()), generic blocking error (:121-128, exit 2).
  • recordWorkerUnreachable(): number returns the consecutive count and knows the threshold — gate on it.
  • Hooks currently import zero telemetry code; captureCliEvent has only fs/fetch deps and bundles fine via scripts/build-hooks.js esbuild (telemetry modules are not externalized — verified at build-hooks.js:284-330).

What to implement

  1. In hook-command.ts, in exactly two branches:
    • worker-unavailable branch (:112-119): after recordWorkerUnreachable() returns count, if count has just reached the fail-loud threshold (the same condition that triggers the blocking stderr message), await captureCliEvent('hook_failed', { hook_type, error_mode: 'worker_unavailable', consecutive_failures: count, threshold_tripped: true }).
    • generic blocking-error branch (:121-128): await captureCliEvent('hook_failed', { hook_type, error_mode: 'blocking_error', threshold_tripped: false }) before emitBlockingError. Both branches are rare and already failed — the ≤2s bounded wait is acceptable there. Never emit on the success path or the two skip branches.
  2. hook_type: closed enum from the hook event already passed to hookCommand(platform, event, …) (:79) — use the event/handler name set (context | session-init | observation | summarize | file-context), not free text.
  3. Whitelist hook_type, error_mode, consecutive_failures, threshold_tripped; add hook_failed to EVENT_NAMES + docs events table (ritual #1–4).
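
The two rules that matter most in this phase are the threshold gate and awaiting the POST before exiting. The sketch below shows both with hypothetical names: CaptureFn stands in for captureCliEvent, and the exit callback stands in for exitGraceful or emitBlockingError. "Just reached" is assumed to mean count === threshold; if the real fail-loud check uses a different comparison, mirror that one instead.

```typescript
type HookType = "context" | "session-init" | "observation" | "summarize" | "file-context";

interface WorkerUnavailableProps {
  hook_type: HookType;
  error_mode: "worker_unavailable";
  consecutive_failures: number;
  threshold_tripped: true;
}

interface BlockingErrorProps {
  hook_type: HookType;
  error_mode: "blocking_error";
  threshold_tripped: false;
}

type HookFailedProps = WorkerUnavailableProps | BlockingErrorProps;

// Returns props only on the invocation where the count reaches the threshold.
function workerUnavailableProps(
  hookType: HookType,
  count: number,
  threshold: number,
): WorkerUnavailableProps | null {
  if (count !== threshold) return null; // assumption: equality means "just reached"
  return {
    hook_type: hookType,
    error_mode: "worker_unavailable",
    consecutive_failures: count,
    threshold_tripped: true,
  };
}

// Stand-in for captureCliEvent: resolves when the bounded POST finishes, never throws.
type CaptureFn = (event: string, props: HookFailedProps) => Promise<void>;

async function emitThenExit(
  capture: CaptureFn,
  props: HookFailedProps | null,
  exit: () => void,
): Promise<void> {
  if (props !== null) {
    await capture("hook_failed", props); // must complete before the process exits
  }
  exit(); // only now hand control to the exit helper
}
```

Because exit runs strictly after the awaited capture, the event cannot be killed mid-flight. On the success path neither function is called, so normal hook latency is unaffected.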

Verification

  • bun test tests/telemetry/ green; npm run typecheck:root clean
  • npm run build then grep the built hook artifact for hook_failed (confirms bundling)
  • With the worker stopped and CLAUDE_MEM_TELEMETRY_DEBUG=1, run a hook 3× (threshold): third run prints hook_failed with consecutive_failures: 3
  • Success-path hook run emits nothing and latency is unchanged
  • Confirm exit codes unchanged (HOOK_EXIT_CODES, hook-constants.ts:15-20)

Anti-pattern guards

  • Do NOT fire-and-forget then process.exit() — the event dies with the process.
  • Do NOT emit per-invocation hook latency events (volume + inline-latency cost). Worker-side duration_ms on context_injected/search_performed already covers worker latency; defer hook-side latency to a future aggregate.
  • Do NOT route the emit through executeWithWorkerFallback or any worker API.
  • Do NOT emit in the AdapterRejectedInput / non-blocking-input branches (expected, noisy, not failures of ours).

Phase 5 — Final verification

  1. Full ritual audit — for each new key (result_count, chroma_available, fallback_reason, fabrication_detected, fabricated_count, invalid_output_class, consecutive_invalid_outputs, respawn_triggered, abort_reason, previous_shutdown, previous_uptime_seconds, uptime_seconds, shutdown_reason, process_rss_mb, heap_used_mb, hook_type, error_mode, consecutive_failures, threshold_tripped): grep -n "<key>" src/services/telemetry/scrub.ts tests/telemetry/scrub.test.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts — all four must hit.
  2. New events disclosed: worker_stopped, hook_failed present in EVENT_NAMES (src/npx-cli/commands/telemetry.ts:68-77) and the telemetry.mdx events table.
  3. Anti-pattern greps:
    • grep -rn "captureEvent\|captureCliEvent" src/ | grep -v services/telemetry — every site passes enums/counts only (manual scan of new sites)
    • grep -rn "posthog" src/ --include="*.ts" | grep -v services/telemetry — no direct SDK use outside the pipeline
    • no PERSON_PROPERTY_KEYS additions in the diff
  4. Tests & build: bun test tests/telemetry/ (note: bun only — the suite fails under vitest), npm run typecheck:root, npm run build-and-sync, worker /health returns ok.
  5. Live smoke: CLAUDE_MEM_TELEMETRY_DEBUG=1 walk: search (Phase 1 fields), compression (Phase 2 fields), restart (Phase 3 events), worker-down hook ×3 (Phase 4 event).
  6. Docs deploy: telemetry.mdx changes auto-deploy on push to main — confirm the public page renders the new rows after release.
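
The ritual audit in step 1 can also be run as a small check over file contents. This is a sketch with a hypothetical function name; it takes the four surfaces as already-read strings. It matches whole identifiers rather than substrings, because a plain substring search for uptime_seconds would also be satisfied by previous_uptime_seconds.

```typescript
// Returns one message per (key, surface) pair where the key is not present.
function findMissingDisclosures(
  keys: string[],
  surfaces: Record<string, string>, // surface name mapped to file contents
): string[] {
  const missing: string[] = [];
  for (const key of keys) {
    // Keys here are lowercase snake_case, so no regex escaping is needed.
    const wholeIdentifier = new RegExp("(^|[^A-Za-z0-9_])" + key + "($|[^A-Za-z0-9_])");
    for (const name of Object.keys(surfaces)) {
      const content = surfaces[name] ?? "";
      if (!wholeIdentifier.test(content)) {
        missing.push(key + " missing from " + name);
      }
    }
  }
  return missing;
}
```

An empty result means every key appears on every surface. The same whole-word caveat applies to the grep form of the audit.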

Out of scope (deliberately)

  • The uncaughtException no-exit smell (worker-service.ts:1075-1078) — process-semantics change, separate plan.
  • Per-hook latency events, event-loop-lag sampling, telemetry_disabled final ping (product/privacy decision pending), installer funnel (install_started), doctor/repair distress signals — candidates for Plan 15 after this data lands.