plans/14-telemetry-reliability-signals.md
Adds the five highest-value missing telemetry signals identified by the 2026-06-10
capture-surface audit. Theme: we instrument success well; failure is invisible.
Every signal below feeds the Reliability sentence of
plans/2026-06-09-telemetry-metrics-spec.md ("Core pipeline succeeds X% of the
time at scale") — plus retrieval quality, which today has no KPI at all.
Phases are self-contained: each can be executed in a fresh chat context. Execute in order; Phase 1–4 are independent of each other but all depend on Phase 0's facts and share the Phase-ritual below.
Consolidated from 5 documentation-discovery agents (all high confidence, all findings cite read code). Do not invent APIs beyond this list.
| # | Surface | Location | What to do |
|---|---|---|---|
| 1 | Scrub whitelist | src/services/telemetry/scrub.ts:8-82 (ALLOWED_PROPERTY_KEYS: Set<string>) | Add the key, grouped with a category comment like the existing ones |
| 2 | Scrub tests | tests/telemetry/scrub.test.ts | Copy the pattern at :5-31 (single-group) or :81-106 (multi-key group); also confirm :139-169 drop-tests still pass |
| 3 | Public docs | docs/public/telemetry.mdx fields table :26-75, events table :78-89 | Add a row per field; new events get an events-table row |
| 4 | CLI disclosure | src/npx-cli/commands/telemetry.ts COLLECTED_FIELDS:23-66, EVENT_NAMES:68-77 | Add a line per field; new event names go in EVENT_NAMES |
| 5 | Capture site | per phase below | Emit via captureEvent / captureCliEvent only |
captureEvent(event: string, props?: Record<string, unknown>, opts?: { person?: boolean }): void — src/services/telemetry/telemetry.ts:72 (worker transport; consent-gated, scrubbed, fire-and-forget)captureCliEvent(event, props?, opts?): Promise<void> — src/services/telemetry/cli-telemetry.ts:22 (short-lived-process transport; direct POST, hard 2s timeout CAPTURE_TIMEOUT_MS at :15, never throws)scrubProperties(props): Record<string, string | number | boolean> — src/services/telemetry/scrub.ts:91-114 (drops non-whitelisted keys and non-primitives silently; strings clamped to 200 chars; numbers must be finite)collectInstallStats(db): Record<string, number> — src/services/telemetry/install-stats.ts:29getUptimeSeconds(startedAtMs: number, now?): number — src/shared/uptime.ts:5-7writePidFile(info: PidInfo) / readPidFile(): PidInfo | null / removePidFile() — src/services/infrastructure/ProcessManager.ts:134/141/156; PidInfo = { pid, port, startedAt: string /* ISO8601 */, startToken? } (src/supervisor/process-registry.ts:49-54)recordWorkerUnreachable(): number — src/shared/worker-utils.ts:451-470 (returns the consecutive-failure count; persists atomically in ~/.claude-mem/state/hook-failures.json; threshold default 3, env CLAUDE_MEM_HOOK_FAIL_LOUD_THRESHOLD)classifyObserverOutput(raw): 'xml'|'idle'|'prose'|'poisoned' — src/sdk/output-classifier.ts:60-80verifyCommitHashesInText(...): CommitVerificationResult with fabricated: string[] — src/sdk/commit-verification.ts:69-108DATA_DIR / paths.workerPid() etc. — src/shared/paths.ts:40,129-151ALLOWED_PROPERTY_KEYS are silently dropped — no error. Always whitelist first, then emit.number | boolean | closed-enum string. Never free text, paths, queries, error messages, IDs derived from the user. (An earlier audit draft proposed error_summary: string — explicitly rejected.)person: true only on lifecycle events (spec constraint, plans/2026-06-09-telemetry-metrics-spec.md:65-71). Nothing in this plan adds person properties; do not touch PERSON_PROPERTY_KEYS.captureEvent/captureCliEvent with direct PostHog calls.CLAUDE_MEM_TELEMETRY_DEBUG=1 prints would-be payloads to stderr and sends nothing (telemetry.ts:97-103).One agent reported INVALID_OUTPUT_RESPAWN_THRESHOLD = 25, another = 3. Read
src/services/worker/agents/ResponseProcessor.ts:25 before relying on the value.
result_count + strategy/fallback on search_performedNarrative served: Reliability + retrieval quality. Zero-result rate becomes
computable; Chroma's silent degradation to FTS becomes visible (the recurring
SQLiteSearchStrategy Database error incident class).
SearchRoutes.ts:117-123 inside
res.once('finish') — it fires after the response, outside handler scope.
It can see only endpoint, res.statusCode, and elapsed time. Result arrays,
totalResults (computed at SearchManager.ts:307), chromaFailed
(SearchManager.ts:158, 206, 274) and chromaFailureReason
(SearchManager.ts:267-275) are method-local and unreachable from there.SearchManager.search() has three paths: filter-only SQLite (:165-176),
Chroma (:179-286, sets chromaFailed on error), Chroma-not-initialized FTS
(:288-305). Text-format responses (:420-425) do not carry counts; only
format='json' (:309-316) includes totalResults.search_strategy is already whitelisted (scrub.ts:55); only the new keys
need whitelist entries.SearchManager.search(), build a small telemetry envelope alongside the
existing return value — do not change response shapes. Collect:
result_count (the totalResults already computed at :307),
search_strategy: 'chroma' | 'fts' | 'filter_only' (one per path above),
chroma_available: boolean (false when chromaFailed or not initialized),
fallback_reason: 'none' | 'chroma_connection' | 'chroma_error' | 'chroma_not_initialized'
(map from chromaFailureReason.isConnectionError at :271; never the message).
Expose it to callers — recommended: return { ...existing, telemetry } for an
internal caller, or set it on a mutable param. Simplest verified-safe plumbing:
handlers stash it on res.locals.searchTelemetry, and the middleware at
SearchRoutes.ts:117-123 spreads res.locals.searchTelemetry ?? {} into the
existing captureEvent('search_performed', …) props.result_count, chroma_available, fallback_reason (ritual #1–4).src/services/worker/search/types.ts:53-64 has a StrategySearchResult
with a strategy field but SearchManager.search() does not use it — derive
strategy from the three paths; do not refactor onto SearchOrchestrator here.bun test tests/telemetry/ green (new scrub cases included)npm run typecheck:root cleanCLAUDE_MEM_TELEMETRY_DEBUG=1 + a worker search request prints search_performed with result_count, search_strategy, chroma_available, fallback_reasongrep -n "fallback_reason" src/services/telemetry/scrub.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts hits all threeresult_count: 0 (not missing)res._getBuffer()-style Express internals — unverified, fragile).chromaFailureReason.message in any property — enum only.session_compressedNarrative served: Reliability + model quality (extends yesterday's tokens/cost/ratio work with per-model trust signals).
compressionProps is built at ResponseProcessor.ts:194-214. Non-SDK
providers emit immediately (:228); the SDK/Claude path stashes the object
into session.pendingCompressionEvent (worker-types.ts:60) at :216-226,
and ClaudeProvider.ts:416-435 later merges real token fields and emits;
:442-445 is the no-result fallback emit. Therefore: any property added to
compressionProps automatically flows through all three emit paths.ResponseProcessor.ts:115-135 already computes
fabricated: string[] via verifyCommitHashesInText.ResponseProcessor.ts:48-88 returns early — no event fires
at all on that path today. session.consecutiveInvalidOutputs
(worker-types.ts:34) increments at :54, resets at :92; respawn decision
at :67-79 (outputClass === 'poisoned' OR threshold reached — read the
threshold at :25, see Phase 0 discrepancy).abortReason enum: worker-types.ts:42 — 'idle'|'shutdown'|'overflow'|'restart-guard'|'quota'|string|null;
set at ClaudeProvider.ts:270 (note: 'quota:…' prefix format), :315,
SessionManager.ts:272,294,407; consumed at SessionRoutes.ts:166-167. The
error-path emit is SessionRoutes.ts:154-163.ResponseProcessor.ts where fabricated.length is known
(:128-135), add to compressionProps: fabrication_detected: boolean,
fabricated_count: number. (Flows through deferred path for free.):67-79) — and ONLY when a
respawn triggers, to bound volume — emit one
captureEvent('session_compressed', { outcome: 'invalid_output', invalid_output_class, consecutive_invalid_outputs, respawn_triggered: true, provider, model, ide, hook })
where invalid_output_class is the classifier value ('idle'|'prose'|'poisoned').SessionRoutes.ts:154-163), add
abort_reason normalized to a closed enum:
'idle'|'shutdown'|'overflow'|'restart_guard'|'quota'|'none' — split the
'quota:…' format on ':' and map 'restart-guard' → 'restart_guard'.fabrication_detected, fabricated_count, invalid_output_class,
consecutive_invalid_outputs, respawn_triggered, abort_reason (ritual #1–4).bun test tests/telemetry/ green; npm run typecheck:root cleansession_compressed payload shows fabrication_detected: false, fabricated_count: 0 on a normal compressiongrep -rn "abort_reason" src/services/telemetry/scrub.ts src/services/worker/http/routes/SessionRoutes.ts both hitplugin/scripts/worker-service.cjs for fabrication_detected after npm run buildabortReason strings ('quota:daily', 'restart-guard') — normalize to the closed enum first; the scrubber will happily pass any ≤200-char string, so enum discipline is on the emitter.compressionProps for the fabrication fields — adding them only at the ClaudeProvider merge would miss non-SDK providers.worker_stopped, heartbeat healthNarrative served: Reliability ("crash-free installs") + makes the DAU/uptime data trustworthy.
startedAt ISO8601 (worker-service.ts:289,
PidInfo at process-registry.ts:49-54) → previous uptime is computable on
next start via Date.parse(startedAt).ProcessManager.ts:232-254 (.chroma-cleaned-v10.3) — write to DATA_DIR.worker-service.ts:565-585; shutdownTelemetry() is called
at :576 and races a 3s flush (telemetry.ts:124-144) — an event captured
before :576 will flush. Stop-case removePidFile() is at :836.worker_started captures: :427 (trigger start, person: true), :436
(heartbeat, 24h setInterval with .unref() at :435-438); props builder
buildLifecycleProps() at :401-426.uncaughtException handler at :1075-1078 logs and does NOT exit (known smell — out of scope here, do not change process semantics in this plan).:576), write
DATA_DIR/.worker-clean-shutdown containing the ISO timestamp (copy the
marker pattern from ProcessManager.ts:232-254). Delete the sentinel at
startup after reading it.writePidFile, derive:
previous_shutdown: 'crash''clean''unknown'previous_uptime_seconds from the stale PID file's startedAt to sentinel
time (clean) or to now minus unknown gap (crash → omit rather than guess;
omitted properties are fine).
Add both to the existing captureEvent('worker_started', …) at :427.worker_stopped event: immediately before shutdownTelemetry() at
:576, captureEvent('worker_stopped', { uptime_seconds, shutdown_reason })
with uptime_seconds from getUptimeSeconds(this.startTime)
(worker-service.ts:122, uptime.ts:5-7) and
shutdown_reason: 'stop' | 'restart' | 'signal' from the caller. No
person: true.:436 / buildLifecycleProps),
add process_rss_mb and heap_used_mb as integers from
process.memoryUsage() (Math.round(rss / 1024 / 1024)).previous_shutdown, previous_uptime_seconds, uptime_seconds,
shutdown_reason, process_rss_mb, heap_used_mb; add worker_stopped to
EVENT_NAMES and the docs events table (ritual #1–4).bun test tests/telemetry/ green; npm run typecheck:root cleanworker-service restart prints worker_stopped (reason restart) then worker_started with previous_shutdown: 'clean'worker_started shows previous_shutdown: 'crash'process_rss_mb'clean' after a later crash)startTime for the previous run — it's never persisted; use the PID file's startedAt.worker_stopped after shutdownTelemetry() — isShutdown (telemetry.ts:81) drops late events by design.PERSON_PROPERTY_KEYS (spec ingestion-cost constraint).process.memoryUsage().rss is bytes — convert; the scrubber drops non-finite numbers silently.hook_failed event (threshold-gated, CLI transport)Narrative served: Reliability — a failing hook is silent memory loss; today the fail-loud counter only writes to the user's stderr.
captureCliEvent (cli-telemetry.ts:22, direct POST, 2s cap, never throws).exitGraceful (hook-io.ts:166-173) and emitBlockingError
(hook-io.ts:150-159) call process.exit() immediately and do not await
pending promises — a fire-and-forget POST is killed mid-flight. The emit must
be awaited before the exit call, inside the failure branch.hook-command.ts:99-128: AdapterRejectedInput
(:100-105), non-blocking input error (:106-111), worker-unavailable
(:112-119, the only branch calling recordWorkerUnreachable()), generic
blocking error (:121-128, exit 2).recordWorkerUnreachable(): number returns the consecutive count and knows the
threshold — gate on it.captureCliEvent has only
fs/fetch deps and bundles fine via scripts/build-hooks.js esbuild (telemetry
modules are not externalized — verified at build-hooks.js:284-330).hook-command.ts, in exactly two branches:
:112-119): after
recordWorkerUnreachable() returns count, if count has just reached the
fail-loud threshold (the same condition that triggers the blocking stderr
message), await captureCliEvent('hook_failed', { hook_type, error_mode: 'worker_unavailable', consecutive_failures: count, threshold_tripped: true }).:121-128):
await captureCliEvent('hook_failed', { hook_type, error_mode: 'blocking_error', threshold_tripped: false }) before emitBlockingError.
Both branches are rare and already failed — the ≤2s bounded wait is
acceptable there. Never emit on the success path or the two skip branches.hook_type: closed enum from the hook event already passed to
hookCommand(platform, event, …) (:79) — use the event/handler name set
(context | session-init | observation | summarize | file-context), not free text.hook_type, error_mode, consecutive_failures,
threshold_tripped; add hook_failed to EVENT_NAMES + docs events table
(ritual #1–4).bun test tests/telemetry/ green; npm run typecheck:root cleannpm run build then grep the built hook artifact for hook_failed (confirms bundling)CLAUDE_MEM_TELEMETRY_DEBUG=1, run a hook 3× (threshold): third run prints hook_failed with consecutive_failures: 3HOOK_EXIT_CODES, hook-constants.ts:15-20)process.exit() — the event dies with the process.duration_ms on context_injected/search_performed already covers worker latency; defer hook-side latency to a future aggregate.executeWithWorkerFallback or any worker API.result_count, chroma_available, fallback_reason, fabrication_detected, fabricated_count, invalid_output_class, consecutive_invalid_outputs, respawn_triggered, abort_reason, previous_shutdown, previous_uptime_seconds, uptime_seconds, shutdown_reason, process_rss_mb, heap_used_mb, hook_type, error_mode, consecutive_failures, threshold_tripped):
grep -n "<key>" src/services/telemetry/scrub.ts tests/telemetry/scrub.test.ts docs/public/telemetry.mdx src/npx-cli/commands/telemetry.ts — all four must hit.worker_stopped, hook_failed present in
EVENT_NAMES (src/npx-cli/commands/telemetry.ts:68-77) and the
telemetry.mdx events table.grep -rn "captureEvent\|captureCliEvent" src/ | grep -v services/telemetry — every site passes enums/counts only (manual scan of new sites)grep -rn "posthog" src/ --include="*.ts" | grep -v services/telemetry — no direct SDK use outside the pipelinePERSON_PROPERTY_KEYS additions in the diffbun test tests/telemetry/ (note: bun only — the suite
fails under vitest), npm run typecheck:root, npm run build-and-sync,
worker /health returns ok.CLAUDE_MEM_TELEMETRY_DEBUG=1 walk: search (Phase 1 fields),
compression (Phase 2 fields), restart (Phase 3 events), worker-down hook ×3
(Phase 4 event).uncaughtException no-exit smell (worker-service.ts:1075-1078) — process-semantics change, separate plan.telemetry_disabled final ping (product/privacy decision pending), installer funnel (install_started), doctor/repair distress signals — candidates for Plan 15 after this data lands.