plans/2026-06-19-posthog-telemetry-overhaul.md
Date: 2026-06-19 Status: Ready to execute Author: orchestrated via /make-plan + sequential-thinking, grounded in live PostHog data (project CMEM, 463218)
The first PostHog bill forecast ~$7,660/mo. A PostHog rep diagnosed two causes: (1) session_compressed events created a person profile on nearly every event (identified-event double-billing, ~$3,440), and (2) raw event volume (~7.8M session_compressed/day, ~$4,020). The user wants the telemetry rebuilt properly: per-session rollups emitted at session end, a verified historical backfill, telemetry unified into the logging system, and real error-message data — "no shortcuts, no fallbacks, do the right thing."
- telemetry.ts, cli-telemetry.ts, backfill.ts set $process_person_profile: false on every non-lifecycle event. Only low-volume lifecycle events (worker_started, install_*, uninstall_completed) build the anonymous install-UUID person profile via buildPersonSet().
- src/services/telemetry/buffer.ts aggregates session_compressed→observer_turn_rollup and context_injected→context_injected_rollup.
- session_compressed/context_injected come ONLY from versions ≤13.6.1; the rollups come ONLY from 13.6.2 / 13.7.0. Raw volume is legacy fleet decaying as installs update, so this is not a fire. We have room to do it right.
- Telemetry and the logger (src/utils/logger.ts) are two separate subsystems with duplicated call sites. User wants them "all together."
- Errors are enum-only (error_category/error_mode). No real error text reaches PostHog. User wants "actual error message data."
- The historical backfill (backfill.ts, BACKFILL_VERSION=2) is well-built but needs verification + field alignment with the new per-session grain.
- The code emits observer_turn_rollup but scrub.ts comments/docs reference session_compressed_rollup.
- test_event / test_event_2 noise events exist in the project.
- Consent precedence: DO_NOT_TRACK > CLAUDE_MEM_TELEMETRY env > telemetry.json > default ON. Consent off ⇒ nothing sent, no client constructed, no marker written.
- Structured properties stay on the property whitelist (scrub.ts). The error scrubber (Phase 3) is a SEPARATE allow-then-redact path used ONLY for $exception.
- historical_activity.
- No console.* in background services (enforced by tests/logger-usage-standards.test.ts). Use logger.*.

These facts are verified with sources. Treat as the "Allowed APIs" list.
posthog-node SDK (pinned ^5.36.15; verified against 5.38.2 .d.ts; API stable across 5.x)

| Need | Verified API | Notes |
|---|---|---|
| Capture event (already used) | capture(props: EventMessage): void | EventMessage = { distinctId?, event, properties?, timestamp?, uuid?, ... }. $set/$process_person_profile go inside properties. |
| Capture exception | captureException(error: unknown, distinctId?: string, additionalProperties?: Record<string\|number, any>, uuid?, flags?): void | distinctId is the 2nd positional arg. Put $process_person_profile: false in additionalProperties (3rd arg) to keep exceptions profile-less. |
| Capture exception (await) | captureExceptionImmediate(error, distinctId?, additionalProperties?, flags?): Promise<void> | Use in short-lived/CLI or shutdown flush contexts. |
| Flush / shutdown | flush(): Promise<void>, shutdown(shutdownTimeoutMs?): Promise<void> | Current telemetry.ts:149 usage is correct. |
| Constructor opts | PostHogOptions (host, flushAt, flushInterval, maxBatchSize, maxQueueSize, disableGeoip, historicalMigration, before_send, enableExceptionAutocapture) | before_send?: BeforeSendFn \| BeforeSendFn[]; returning null drops the event before ingest, so it is not billed. enableExceptionAutocapture: true auto-captures uncaught exceptions/unhandled rejections (relevant: our worker is long-lived). |
$exception event: SDK builds event: '$exception' with properties.$exception_list ([{type, value, stacktrace, mechanism}]); PostHog derives $exception_fingerprint + $exception_level at ingest for issue grouping. Billing: $exception bills as a standard event (100k/mo free, then ~$0.00037/event). There is NO built-in per-event rate limit — we MUST rate-limit/dedupe client-side or drop via before_send. (Sources: posthog.com/docs/error-tracking/{installation/node,capture,pricing}.)
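To make the table row for captureException concrete, here is a minimal sketch of a profile-less, swallow-all wrapper. The `ExceptionClient` interface and the `captureProfilelessException` helper are hypothetical names introduced for illustration; the interface only mirrors the three leading arguments of the signature listed above, and the real posthog-node client is assumed (not verified here) to satisfy it.

```typescript
// Minimal structural type for the one SDK method this sketch relies on.
// It mirrors the captureException signature from the table above.
interface ExceptionClient {
  captureException(
    error: unknown,
    distinctId?: string,
    additionalProperties?: Record<string | number, unknown>,
  ): void;
}

// Hypothetical helper: sends an exception without creating a person profile.
function captureProfilelessException(
  client: ExceptionClient,
  error: unknown,
  installId: string,
  context: Record<string, unknown> = {},
): void {
  try {
    client.captureException(error, installId, {
      ...context,
      // Placed last so caller-supplied context cannot re-enable profiles.
      $process_person_profile: false,
    });
  } catch {
    // Telemetry must never break the caller: swallow everything.
  }
}
```

Spreading the context first and setting the flag last is a deliberate choice in this sketch: a stray `$process_person_profile: true` in the context cannot turn identified-event billing back on.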
Worker sessions (src/services/worker/)

- Session key: sessionDbId (number). Sessions tracked in SessionManager as private sessions: Map<number, ActiveSession> (SessionManager.ts:10).
- ActiveSession (src/services/worker-types.ts:9-63) has: sessionDbId, startTime, platformSource, pendingCompressionEvent?, cumulativeInputTokens/OutputTokens, etc.
- deleteSession(sessionDbId): SessionManager.ts:281 (full cleanup; aborts generator, disposes buffer, sessions.delete).
- removeSessionImmediate(sessionDbId): SessionManager.ts:346 (fast removal; called from GeneratorExitHandler after generator done; this is the normal session-end path).
- shutdownAll(): SessionManager.ts:367 (Promise.all over deleteSession for every active session; the worker-shutdown path).
- respawnPoisonedSession(sessionDbId): SessionManager.ts:251 (does NOT remove from map; do not flush here, the session continues).
- Worker shutdown: worker-service.ts:680 shutdown() → beforeGracefulShutdown (emits worker_stopped, calls shutdownTelemetry() at ~:705) → performGracefulShutdown (GracefulShutdown.ts:38 calls sessionManager.shutdownAll()). Note ordering risk: shutdownTelemetry() currently runs BEFORE shutdownAll(). Per-session flush on shutdown must emit before the PostHog client is shut down; see Phase 2 ordering task.
- telemetryBuffer.start() called at worker-service.ts:542.

telemetryBuffer.record() call sites (fields to preserve):
| File:line | Event | Session-scoped? |
|---|---|---|
| ClaudeProvider.ts:425 | session_compressed | yes (session.sessionDbId) |
| ClaudeProvider.ts:443 | session_compressed (session.pendingCompressionEvent) | yes |
| ResponseProcessor.ts:87 | session_compressed (outcome: invalid_output) | yes |
| ResponseProcessor.ts:246 | session_compressed (deferred pendingCompressionEvent) | yes |
| ResponseProcessor.ts:250 | session_compressed (outcome: ok, full compressionProps) | yes |
| SessionRoutes.ts:177 | session_compressed (outcome: error) | yes |
| SessionRoutes.ts:196 | session_compressed (outcome: aborted) | yes |
| SearchRoutes.ts:434 | context_injected (outcome: error) | NO (hook-level) |
| SearchRoutes.ts:446 | context_injected (outcome: ok, ...stats) | NO (hook-level) |

context_injected fires from the context-injection HTTP route (UserPromptSubmit hook), not within a session generator. It has no sessionDbId. It must keep a bounded path (time-window rollup OR per-hook-process rollup), NOT the per-session accumulator.

- Tests use bun:test. Global PostHog mock in tests/preload.ts exposes postHogConstructorCalls / postHogCaptureCalls.
- Reset helpers: __resetTelemetryForTests() (telemetry.ts:126), telemetryBuffer.__resetForTests() (buffer.ts:259).
- tests/telemetry/buffer.test.ts:61-118.
- tests/telemetry/backfill.test.ts:434-440.
- tests/telemetry/scrub.test.ts:207-237.
- tests/telemetry/buffer.test.ts:20-54.
- Consent module (src/services/telemetry/consent.ts): resolveTelemetryConsent, explainTelemetryConsent, loadTelemetryConfig, saveTelemetryConfig, getOrCreateInstallId. Precedence fixed; do not change.
- docs/public/telemetry.mdx (191 lines: header, "What is collected" whitelist table, events table, historical backfill, "What is NEVER collected", opt-out, debug, config). Nav entry in docs/public/docs.json.

Goal: make a future misconfig hit a cap, not the invoice. This is the only true urgency.
Verification: billing limit + alert visible on the billing page. No code change. Reference: posthog.com/docs/billing/limits-alerts.
Note on session replay: the rep mentioned session replay as a cheaper home for "session data captured by hand." N/A — claude-mem is a Node backend with no web app; there is no browser session to replay. Document this in telemetry.mdx so it doesn't resurface.
Goal: one instrumentation path; every significant event fans out to (a) the local logger (full fidelity, file) and (b) the telemetry pipeline (scrubbed/rolled-up, PostHog). Everything in later phases plugs into this.
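As a rough illustration of this fan-out, the sketch below writes the local log line unconditionally and only then attempts the consent-gated telemetry branch. Everything here is hypothetical and simplified: `makeInstrument`, `InstrumentDeps`, and the reduced argument list are illustration-only names, the dependencies are injected rather than imported, and scrubbing and rollup routing are left out.

```typescript
type Level = "debug" | "info" | "warn" | "error";

interface TelemetrySpec {
  event: string;
  props?: Record<string, unknown>;
}

// Hypothetical injected dependencies; the real logger, consent check, and
// capture function live elsewhere in the codebase.
interface InstrumentDeps {
  log: (level: Level, component: string, message: string) => void;
  hasConsent: () => boolean;
  capture: (event: string, props: Record<string, unknown>) => void;
}

function makeInstrument(deps: InstrumentDeps) {
  return function instrument(
    component: string,
    level: Level,
    message: string,
    telemetry?: TelemetrySpec,
  ): void {
    // (a) Local log line: always written, independent of telemetry.
    deps.log(level, component, message);

    // (b) Telemetry: optional, consent-gated, and never allowed to throw.
    if (!telemetry) return;
    try {
      if (!deps.hasConsent()) return;
      deps.capture(telemetry.event, telemetry.props ?? {});
    } catch {
      // Swallow: a telemetry failure must not affect the caller.
    }
  };
}
```

Because the logger is passed in and knows nothing about the telemetry side, this shape keeps the dependency pointing one way only, which is the property the goal depends on.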
- Create src/services/telemetry/instrument.ts exporting a single entry point, e.g.
  instrument(component: Component, level: LogLevel, message: string, ctx?: LogContext, telemetry?: { event: string; props?: Record<string, unknown>; rollup?: 'session'|'hook'|'none'; person?: boolean }).
- It calls logger[level](component, message, ctx, ctx?.data) for the local line (full detail), THEN, only if telemetry is provided and consent passes, routes to the telemetry sink (captureEvent / per-session accumulator / error capture).
- Dependency direction: instrument → logger (always) and instrument → telemetry (optional, consent-gated, swallow-all). The logger must never import telemetry (keeps logging working with telemetry disabled and avoids a cycle).
- Keep logger.ts telemetry-free. Do the fan-out in instrument.ts, not inside Logger. (Phase 3 wires logger.error/logger.failure → exception capture via a thin optional hook set on the logger by instrument/worker init, still consent-gated and swallow-all; see Phase 3.)
- Migrate duplicated call sites (e.g. SessionRoutes.ts:153 logs an error and :177 records telemetry) to a single instrument(...) call. Do this incrementally: Phase 1 establishes the API and migrates 2-3 exemplar sites; later phases migrate the rest as they touch those files.

References:

- src/utils/logger.ts:284-343 (debug/info/warn/error/failure, Component enum at :15-52).
- telemetry.ts:73 captureEvent.
- telemetry.ts:22 hasConsent() (30s TTL cache); instrument must respect it.

Verification:

- instrument() with consent OFF produces a local log line but ZERO postHogCaptureCalls (copy assertion from backfill.test.ts:434-440).
- instrument() with consent ON produces both a log line and exactly one capture (or one accumulator record).
- tests/logger-usage-standards.test.ts still passes (no console.*, logger imported where required).
- bun run build-and-sync succeeds; worker starts.

Anti-patterns:

- Logger must not import the telemetry client (cycle + breaks telemetry-disabled logging).
- instrument must not throw; wrap the telemetry branch in try/catch that swallows.
- Structured props must go through scrubProperties.

Goal: emit ONE session_compressed rollup per session, at session end, not per 5-minute wall-clock window.
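To illustrate the per-session grain this goal describes, here is a small sketch of an accumulator keyed by sessionDbId that emits one rollup when a session is flushed. The class name `SessionRollups`, the `SessionBucket` fields, and the `emit` callback are hypothetical stand-ins; the real aggregate shape comes from the existing rollup code and is not reproduced here.

```typescript
type RollupReason = "session_end" | "worker_shutdown" | "safety_flush";

// Illustrative aggregate only; the real bucket carries many more fields.
interface SessionBucket {
  count: number;
  tokensInput: number;
  tokensOutput: number;
}

interface Rollup extends SessionBucket {
  rollup_reason: RollupReason;
}

class SessionRollups {
  private readonly buckets = new Map<number, SessionBucket>();
  private readonly emit: (rollup: Rollup) => void;

  constructor(emit: (rollup: Rollup) => void) {
    this.emit = emit;
  }

  // Adds one event to the bucket for this session, creating it if needed.
  record(sessionDbId: number, tokensInput: number, tokensOutput: number): void {
    const bucket =
      this.buckets.get(sessionDbId) ?? { count: 0, tokensInput: 0, tokensOutput: 0 };
    bucket.count += 1;
    bucket.tokensInput += tokensInput;
    bucket.tokensOutput += tokensOutput;
    this.buckets.set(sessionDbId, bucket);
  }

  // Emits one rollup and removes the bucket, so a second flush is a no-op.
  // The sessionDbId itself is deliberately not part of the emitted props.
  flushSession(sessionDbId: number, reason: RollupReason): void {
    const bucket = this.buckets.get(sessionDbId);
    if (!bucket) return;
    this.buckets.delete(sessionDbId);
    this.emit({ ...bucket, rollup_reason: reason });
  }

  // Drains every remaining bucket, e.g. before the client is shut down.
  flushAll(reason: RollupReason): void {
    for (const id of Array.from(this.buckets.keys())) {
      this.flushSession(id, reason);
    }
  }
}
```

Deleting the bucket before emitting is what makes a repeated flush from two session-end paths harmless in this sketch.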
- Add a per-session accumulator in buffer.ts (or a sibling session-rollup.ts): Map<number /*sessionDbId*/, SessionCompressedBucket>. Replace the single module-level sessionCompressedBucket for the session-scoped path. Reuse computeSessionCompressedRollup() unchanged (it already produces the right aggregate shape).
- Change the call signature to record('session_compressed', sessionDbId, props), adding the sessionDbId key. Append to that session's bucket. Preserve ALL existing fields from the 7 call sites (see Phase 0.B table; especially the full compressionProps from ResponseProcessor.ts:212-236 and the deferred pendingCompressionEvent merge).
- Call flushSession(sessionDbId, 'session_end') from removeSessionImmediate() (SessionManager.ts:346) AND deleteSession() (SessionManager.ts:281), at function entry while the session still exists. (Guard against double-flush: flushing removes the bucket, so the second call is a no-op.)
- On worker shutdown, flush every remaining session bucket with rollup_reason worker_shutdown. Fix ordering: ensure these flush BEFORE the PostHog client is shut down. Either (a) move the per-session flush into beforeGracefulShutdown before shutdownTelemetry(), or (b) have shutdownTelemetry() drain session buckets before current.shutdown(). Prefer (b) for a single drain point.
- Safety flush: a periodic timer (unref'd interval) emits a partial rollup for any session whose bucket exceeds a max age OR max record count, tagging rollup_reason: 'safety_flush' and incrementing a window_seq so long-lived sessions still report and memory stays bounded.
- Add the rollup_reason enum (session_end | worker_shutdown | safety_flush) and window_seq (int) to the rollup props + ALLOWED_PROPERTY_KEYS in scrub.ts.
- context_injected stays bounded but separate. It is hook-level (no sessionDbId). Keep its time-window rollup (context_injected_rollup) OR convert to a per-hook-process single flush at process exit. Decision: keep the existing time-window rollup for context_injected (it is already low-volume relative to session_compressed and has no session boundary). Document this asymmetry.

References:

- buffer.ts:63-143 computeSessionCompressedRollup.
- SessionManager.ts:281, 346, 367 (Phase 0.B).
- telemetry.ts:137-159 shutdownTelemetry.
- ResponseProcessor.ts:212-236.

Verification:

- Multiple record('session_compressed', id, ...) calls for one session + flushSession(id,'session_end') ⇒ exactly ONE session_compressed-rollup capture with correct sums/counts and rollup_reason:'session_end' (copy buffer.test.ts:61-118).
- Worker shutdown with active sessions ⇒ rollups with rollup_reason:'worker_shutdown', emitted before client shutdown.
- Safety flush ⇒ rollup_reason:'safety_flush' + incremented window_seq; memory map shrinks after flush.
- sessionDbId.

Anti-patterns:

- Do not flush in respawnPoisonedSession (session continues; would split one session into many rollups).
- Do not include sessionDbId itself in the emitted props (it is not whitelisted and is install-correlatable).

Goal: capture actual error text/stack to PostHog Error Tracking ($exception), safely and at low volume.
One-way-door note (surface to user before shipping): sending free-form error messages is a shift from claude-mem's strictly-anonymous, whitelist-only telemetry. PostHog data cannot be selectively deleted after ingest. The user has effectively opted in ("actual error message data would be great"), but the redaction below is mandatory and the behavior must honor the same consent gate + a dedicated env kill-switch.
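To give the "mandatory redaction" and "low volume" requirements a concrete shape, the sketch below pairs a message redactor with a per-fingerprint rate limiter. Both `redactMessage` and `ExceptionRateLimiter` are hypothetical names, the regular expressions are illustrative rather than exhaustive (no JWT or absolute-path handling), and the home directory is passed in as an argument purely to keep the example dependency-free.

```typescript
// Redacts a free-form error message. Covers only a subset of the patterns a
// production scrubber would need.
function redactMessage(message: string, homeDir: string, maxLen = 500): string {
  let out = message;
  if (homeDir) out = out.split(homeDir).join("~");
  out = out
    .replace(/(https?:\/\/[^\s?#]+)\?[^\s#]*/g, "$1") // strip URL query strings
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "<email>")
    .replace(/\b(?:sk-|phc_)[A-Za-z0-9_-]{8,}/g, "<secret>")
    .replace(/\b[0-9a-f]{32,}\b/gi, "<hex>")
    .replace(/\s+/g, " ")
    .trim();
  return out.length > maxLen ? out.slice(0, maxLen) : out;
}

interface FingerprintState {
  count: number; // occurrences seen since the last send
  firstTs: number;
  lastSentTs: number;
}

class ExceptionRateLimiter {
  private readonly seen = new Map<string, FingerprintState>();
  private readonly windowMs: number;

  constructor(windowMs = 60000) {
    this.windowMs = windowMs;
  }

  // Returns the occurrence count to attach if this error should be sent now,
  // or null if it falls inside the window and is only counted.
  shouldSend(fingerprint: string, now: number): number | null {
    const state = this.seen.get(fingerprint);
    if (!state) {
      this.seen.set(fingerprint, { count: 0, firstTs: now, lastSentTs: now });
      return 1;
    }
    state.count += 1;
    if (now - state.lastSentTs < this.windowMs) return null;
    const occurrences = state.count;
    state.count = 0;
    state.lastSentTs = now;
    return occurrences;
  }
}
```

In this sketch suppressed occurrences are only reported when the same fingerprint recurs after the window; a fuller version would presumably also drain pending counts at shutdown and evict stale fingerprints to keep the map bounded.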
- Create src/services/telemetry/error-scrub.ts, an allow-then-redact scrubber (opposite of the property whitelist, because messages are free-form):
  - Allow: error.name/type, error.message, a trimmed stack (top N frames).
  - Redact: home directory → ~ (use os.homedir()), absolute paths → basename or ~-relative, URL query strings stripped, mask anything matching email / sk-/phc_/token / long-hex / JWT patterns, collapse whitespace, cap message ≤ 500 chars and stack ≤ ~2KB.
- Add captureException(err, ctx?) in telemetry.ts (and a CLI variant if needed): consent-gated, builds redacted payload, calls SDK captureException(error, getOrCreateInstallId(), { $process_person_profile: false, ...whitelistedContext }). Profile-less. Swallow-all.
- Rate limit and dedupe: Map<fingerprint, {count, firstTs, lastSentTs}>. Fingerprint = hash(name + redacted message template + top frame). Send at most once per fingerprint per window (e.g. 1/min), attach an occurrence count. This is the "never an unbounded stream" invariant applied to errors.
- Logger hook: logger.error() and logger.failure() route their Error data through captureException (consent-gated, rate-limited). Replace the enum-only error_occurred capture at BaseRouteHandler.ts:61 with a real exception capture (keep an aggregate count too if useful).
- Consider enableExceptionAutocapture: true on the worker client to catch uncaught exceptions/unhandled rejections, but ONLY with the rate-limiter in front (autocapture can storm). Gate behind the same consent + kill-switch. If risk is unclear, ship manual captureException first and add autocapture in a follow-up.
- Kill-switch: CLAUDE_MEM_TELEMETRY_ERRORS=0 disables exception capture independently of analytics (defaults ON when telemetry is on). Document it.

References:

- captureException(error, distinctId?, additionalProperties?, ...) (Phase 0.A). $process_person_profile:false goes in additionalProperties.
- scrub.ts (structured path); error-scrub is the free-form sibling.
- before_send drop option (Phase 0.A) as an extra ingest-side guard.

Verification:

- error-scrub redacts: home dir, abs paths, emails, phc_/sk-/token-like strings, URL query params; caps length; never throws on hostile/circular input (copy hostile-input pattern from scrub.test.ts:314-326).
- captureException with consent OFF ⇒ zero captures.
- Repeated identical errors within the window ⇒ rate-limited $exception sends, with count reflecting occurrences.
- $exception payload carries $process_person_profile:false (no person profile created).
- logger.error(component, msg, ctx, new Error(...)) triggers one redacted exception capture.
- CLAUDE_MEM_TELEMETRY_ERRORS=0 ⇒ zero exception captures, analytics unaffected.

Anti-patterns:

- Error text must not reach PostHog without passing through error-scrub.

Goal: confirm the historical rollup is correct/complete and comparable to the new live per-session grain.
- Verify historical_activity + install_inferred are landing (confirmed present). Spot-check that day coverage and first_active_date look sane for known installs.
- Field alignment: the live rollup carries session_compressed economics (tokens_input/output, cost_usd, compression_ms, outcomes, fabrication). The backfill ships per-day activity counts + read_tokens/tokens_saved_vs_naive and intentionally OMITS generation-side cost (never persisted to SQLite; backfill.ts:336-340). Keep that omission (don't fabricate cost), but ensure shared keys (observation_count, session_count, obs_type_*) use identical names/semantics so historical and live series stack in one chart. Document which fields are live-only vs historical-only.
- Bump BACKFILL_VERSION (backfill.ts:77) when backfill fields change, so already-backfilled installs re-run idempotently (deterministic UUIDs make this dedup-safe).
- Do not add buildBaseProperties() to historical_activity (would poison version-over-time charts; backfill.ts:446-448).

References:

- backfill.ts:463-510 buildBackfillEvents, :528-644 runHistoricalBackfill, :140-149 isBackfillComplete.
- tests/telemetry/backfill.test.ts (epoch normalization, day windows, deterministic UUID, consent-off).

Verification:

- CLAUDE_MEM_TELEMETRY_DEBUG=1 ⇒ dry-run prints expected day range + event count, sends nothing, writes no marker.
- Consent off ⇒ nothing sent (backfill.test.ts:434-440).

Cleanup and docs:

- Naming drift: the code emits observer_turn_rollup; scrub.ts comments and telemetry.mdx reference session_compressed_rollup. Pick ONE (recommend keeping observer_turn_rollup since it's what's live; just fix the stale comments/docs). Update consistently.
- Ensure nothing calls captureEvent('session_compressed'|'context_injected', ...) directly (only the rollup path should exist). grep-guard it.
- Remove test_event / test_event_2 sources (search repo + any test harness that emits them to the real project).
- Update docs/public/telemetry.mdx: new events (per-session rollup rollup_reason/window_seq, $exception), the unified logging model, the error-tracking opt-in + CLAUDE_MEM_TELEMETRY_ERRORS switch + one-way-door note, and a line explaining session replay is N/A (backend). Update docs.json if a new page is added.
- Query PostHog (query-trends) after rollout: confirm raw session_compressed/context_injected continue decaying, observer_turn_rollup volume is sane per active install, $exception volume is bounded, and no person profiles are created for non-lifecycle events.

Final verification:

- grep -rn "captureEvent('session_compressed'\|captureEvent('context_injected'" src ⇒ no matches.
- grep -rn "session_compressed_rollup" src docs ⇒ no stale references (or all intentional).
- grep -rn "test_event" src tests ⇒ no production emitters.
- docs/public/telemetry.mdx covers rollups, errors, unified logging, opt-out.
- bun test tests/telemetry/ + the new test files all pass.
- tests/logger-usage-standards.test.ts passes.
- bun run build-and-sync succeeds; worker starts and /api/health is green.
- Manual check with CLAUDE_MEM_TELEMETRY_DEBUG=1: drive one session end-to-end → observe ONE session_compressed rollup with rollup_reason:'session_end'; trigger an error → observe ONE redacted $exception; consent off → observe nothing.

Run /do against this file to execute phase-by-phase.