plans/2026-06-11-posthog-historical-backfill.md
Goal: One-time, per-install backfill of anonymized daily activity rollups from the local SQLite DB into PostHog, using PostHog's historical-migration ingestion mode, so growth (installs over time, reconstructed WAU/MAU, cohort retention) is visible for activity that predates telemetry shipping.
What gets sent (anonymous counts/sums only — never titles, text, prompts, project names, or any raw string column):
historical_activity event per active day per install, profile-less
($process_person_profile: false), carrying ONLY rollup counters + backfilled: true
(no buildBaseProperties() — stamping the current version/os onto 2025-dated events
would permanently poison version-over-time charts):
{ observation_count, session_count, summary_count, prompt_count, project_count, discovery_tokens, obs_type_bugfix, obs_type_discovery, obs_type_decision, obs_type_refactor, obs_type_other, session_completed_count, session_failed_count, sessions_claude_count, sessions_codex_count, sessions_gemini_count, sessions_other_platform_count, subagent_obs_count, backfilled: true }.install_inferred event at noon UTC of the install day, person: true, with
first_active_date in properties and $set — this single event draws the adoption curve.Known caveats (accept, do not solve):
disableGeoip: false kept for consistency with live telemetry).version must filter backfilled != true; combined install metrics
must dedupe install_inferred ∪ install_completed(is_update=false) per distinct_id
(telemetry-era installs emit both).All implementation phases MUST use only the APIs listed here. Citations were verified against
the working tree, the published posthog-node v5.36.15 source, and the real
~/.claude-mem/claude-mem.db (read-only) on 2026-06-11, then adversarially re-verified on
2026-06-12 (see Review log at the bottom).
Precondition: this worktree has no node_modules — run bun install (or npm install)
before Phase 1 so imports and bun test resolve.
posthog-node ^5.36.15 (root package.json:143, imported in
src/services/telemetry/telemetry.ts:1; caret range — if the installed major ever drifts,
re-verify historicalMigration forwarding in dist/, search for historical_migration):
new PostHog(apiKey, options) — PostHogOptions = Omit<PostHogCoreOptions, 'before_send'> & {...}.
PostHogCoreOptions includes historicalMigration?: boolean ("Special flag to indicate
ingested data is for a historical migration", default false). It is forwarded as
historical_migration: true on every /batch request body (verified in
@posthog/[email protected] shipped source). Also valid: host, flushAt, maxBatchSize,
maxQueueSize, disableGeoip.client.capture(msg: EventMessage) — EventMessage includes distinctId, event,
properties, timestamp?: Date (a Date object, NOT an ISO string), uuid?: string.
Both are forwarded end-to-end to the payload.await client.flush() does NOT join those pending captures and
can resolve while events are still un-enqueued. Only client.shutdown() joins pending
capture promises and then loops flush until the queue is empty — shutdown() is the only
delivery barrier.shutdown() SWALLOWS fetch errors internally (logFlushError); its resolution proves
nothing about delivery. Delivery failure is observable only via the public error emitter:
client.on('error', handler) (@posthog/core posthog-core-stateless.ts:301).maxQueueSize (default 1,000).
Multi-year installs can exceed 1,000 active days — maxQueueSize must be raised.flushAt; their errors are
swallowed AND a non-network (4xx) failure removes the batch from the queue before throwing.
Therefore: set flushAt/maxBatchSize/maxQueueSize all to 5000 so NO background flush
ever fires and the entire backfill goes as one request at shutdown (the SDK auto-halves the
batch on HTTP 413). One request also makes a cross-restart retry byte-identical — the best
possible dedupe-key match.PostHog historical-migration rules (https://posthog.com/docs/migrate):
historicalMigration: true so events bypass standard ingestion/billing handling.install_inferred.uuid + completion marker).(toDate(timestamp), event, distinct_id, uuid) — retried events must carry byte-identical
timestamp+uuid+event+distinctId. The marker (not the uuid) is the PRIMARY idempotency gate;
the uuid minimizes damage in the crash-retry window. Residual accepted risk: a crash between
server-side ingest and marker write can leave transient duplicates until ClickHouse merges.Existing telemetry modules to copy from (do not reinvent):
resolveTelemetryConsent(process.env, loadTelemetryConfig()) —
src/services/telemetry/consent.ts:68-73. Install UUID: getOrCreateInstallId() —
consent.ts:114-124 (random v4 UUID persisted to telemetry.json via getTelemetryConfigPath()).scrubProperties(props) whitelist scrubber — src/services/telemetry/scrub.ts:126-149;
whitelist set ALLOWED_PROPERTY_KEYS at scrub.ts:8-117. Properties not in the whitelist
are silently dropped — new keys MUST be added there. Already present (verified): the
obs_type_* family (live context_injected ships them), session_count,
observation_count. Confirmed ABSENT (must add): discovery_tokens.getTelemetryApiKey(), getTelemetryHost() — src/services/telemetry/common.ts:22-28.
Note: getTelemetryApiKey() falls back to the embedded production key and is never
falsy — every worker boot anywhere can send real data (see Phase 2/3 sequencing).
Base props: buildBaseProperties() — common.ts:100-113 (returns CURRENT version,
os_version, runtime_version, locale, is_ci — historical events must NOT carry it).
Person-prop subset: PERSON_PROPERTY_KEYS + buildPersonSet() — common.ts:36-76.
buildPersonSet only copies keys PRESENT on the event's properties — a person trait that is
never assigned to the event silently never ships.captureEvent() — src/services/telemetry/telemetry.ts:72-117.collectInstallStats(db: Database) —
src/services/telemetry/install-stats.ts:29-99. Per-block try/catch; typed .get() casts.
Keep the per-block try/catch in the rollup too: older installs may lack a table or column —
skip that block's keys, never throw.asMs(col) → `CASE WHEN ${col} < 1000000000000 THEN ${col} * 1000 ELSE ${col} END` —
install-stats.ts:23-25. DAY_MS = 86_400_000 — install-stats.ts:27. Note MIN(asMs(x))
applies normalization INSIDE the MIN (correct); asMs(MIN(x)) would not be.Bun.randomUUIDv5(name, namespace) — exists and is deterministic on
the worker's embedded Bun 1.3.9 (verified by execution). Use it with one fixed namespace
UUID constant in the module. The npm uuid package is not installed and must not be added.src/services/sqlite/SessionStore.ts:1901-2015 — ONE discoveryTokens value per
compression turn is written identically to EVERY observation row of that turn (line 1962)
AND to the turn's session_summaries row (line 2004). Summing across observations
multi-counts by the obs-per-turn factor. Sum session_summaries.discovery_tokens only.Database access (bun:sqlite, synchronous):
import { Database } from 'bun:sqlite' — pattern at src/services/sqlite/SessionStore.ts:1-2.db.query(sql).get(...) / .all(...) with as type casts —
e.g. src/services/worker-service.ts:508-513, install-stats.ts:36-50.src/services/sqlite/migrations/runner.ts:43-112 —
observations(project, created_at_epoch, memory_session_id, type, agent_type),
sdk_sessions(project, started_at_epoch, memory_session_id, status, platform_source),
session_summaries(project, created_at_epoch), user_prompts(created_at_epoch)
(counts only — NEVER select prompt_text). discovery_tokens is NOT in 43-112: it is
added by ensureDiscoveryTokensColumn at runner.ts:381-402
(ALTER TABLE ... ADD COLUMN discovery_tokens INTEGER DEFAULT 0 on both observations and
session_summaries). Query only columns created in runner.ts — the dev DB has extra columns
other installs lack.type buckets: use the closed STAT_TYPE_BUCKETS set from
src/services/context/ContextBuilder.ts (bugfix/discovery/decision/refactor/other) so the
backfill vocabulary is identical to live context_injected.platform_source is a user-influenceable string — bucket in JS to the closed enum
{claude, codex, gemini, other}; never ship the raw value.Worker integration points (src/services/worker-service.ts):
worker_started capture: lines 532-540 (inside initializeBackground()).ChromaSync.backfillAllProjects(...) .then/.catch block at lines 551-556.setInterval(..., 24*60*60*1000) + .unref?.() at lines 544-547.worker_stopped capture BEFORE shutdownTelemetry() (lines 699-703).State files in ~/.claude-mem:
DATA_DIR / paths.dataDir() — src/shared/paths.ts:40, 129-151.readJsonSafe<T>(path, fallback) from src/utils/json-utils.js (as
consent.ts:5 imports it). There is no JSON-write helper — write with
mkdirSync + writeFileSync exactly as saveTelemetryConfig does at consent.ts:103-107.backfill.json — do NOT merge it into telemetry.json
(the consent save path would clobber it).Logger (console.* is forbidden in services — enforced by
scripts/check-hook-io-discipline.cjs / tests/logger-usage-standards.test.ts):
import { logger } from '../utils/logger.js'; logger.info('SYSTEM', msg, ctx, err?).
'TELEMETRY' is NOT in the Component union (src/utils/logger.ts:15-52) — use 'SYSTEM'.Tests (bun test; global posthog-node mock):
posthog-node in tests/preload.ts:45-55 records
postHogConstructorCalls / postHogCaptureCalls — extend, don't replace. The mock class
has NO flush(), shutdown() race-compatible signature, or on() — add no-op
flush()/shutdown()/on() to it.__resetTelemetryForTests() — src/services/telemetry/telemetry.ts:125-129.tests/telemetry/telemetry-client.test.ts:30-53
(CLAUDE_MEM_DATA_DIR = mkdtempSync(...)).tests/telemetry/install-stats.test.ts:8-28 — but its makeDb
schema LACKS discovery_tokens, memory_session_id, type, agent_type, status,
platform_source, and the user_prompts table. Extend the test schema with every column
the rollup queries touch before writing tests, or the queries throw no such column (and
the per-block try/catch would mask it as silently-empty rollups).historical_migration (snake_case) as a posthog-node constructor option — the SDK option
is historicalMigration (camelCase). Snake_case is only for the raw /batch HTTP API.timestamp as an ISO string to client.capture() — EventMessage.timestamp is
typed Date. Construct new Date(...).telemetry.ts for backfill — it lacks
historicalMigration and its isShutdown latch must stay untouched. Build a dedicated,
short-lived client.ALLOWED_PROPERTY_KEYS — the scrubber drops
them silently and the backfill would ship empty events.created_at_epoch / started_at_epoch math without asMs() — legacy second-unit
rows land in 1970 and poison the earliest cohort. Additionally apply the project-epoch
floor (below): corrupt epochs can also land on plausible-looking wrong days, not just 1970.discovery_tokens across observations AND session_summaries — the same per-turn
value is stored on every row of the turn (see Phase 0); sum summaries only.buildBaseProperties() on historical_activity events — current version/os on
2025-dated events is permanently wrong version-over-time data.uuid package — use Bun.randomUUIDv5 (verified present and deterministic).console.* anywhere in src/services/ — use logger (debug-mode payload printing is the
one exception and must use process.stderr.write, copying telemetry.ts:97-103).await client.shutdown() resolved AND zero client.on('error') events, NOT a bare
flush() (see Phase 0 SDK facts). A crash mid-send must retry on next startup.CLAUDE_MEM_TELEMETRY=0 (or CLAUDE_MEM_TELEMETRY_DEBUG=1) exported — the embedded
production key + default-on consent means a casual npm run build-and-sync ships the dev
machine's entire history to production PostHog before the dry-run gate.Create src/services/telemetry/backfill.ts (one module, one test file).
lastFullDay = utcDayString(nowMs - 60 * 3_600_000) — 60h = 48h (PostHog contract) + 12h
(noon-UTC event timestamps). Noon of any included day is then guaranteed ≥48h old.PROJECT_EPOCH_FLOOR = Date.parse('2024-01-01T00:00:00Z') — predates claude-mem's first
release; rows with normalized epoch below it are corrupt and ignored everywhere (rollups
AND first-activity MIN).installDay = utcDayString(firstActivityEpochMs) (1.3). Include only whole UTC days where
installDay <= day <= lastFullDay, comparing day strings (YYYY-MM-DD compares
lexicographically) — never raw epochs, so no partial days can ship. The lower bound discards
backdated artifact rows (verified on the reference DB: obs ids 66888/66889 carry a
2025-08-12 epoch but belong to a session started 2026-04-10).collectDailyRollups(db: Database, lastFullDay: string, installDay: string): DailyRollup[]Copy the query style of collectInstallStats (per-block try/catch — a missing table/column
skips that block's keys), bucketing every table by
date(<asMs(col)>/1000,'unixepoch') AS day, merged in a Map<day, rollup>. Pinned
semantics — these exact aggregations, so any two implementers ship identical numbers:
| key | source (per day) |
|---|---|
observation_count | COUNT(*) FROM observations |
obs_type_bugfix/discovery/decision/refactor/other | observations GROUP BY day, type, bucketed via STAT_TYPE_BUCKETS |
subagent_obs_count | COUNT(*) FROM observations WHERE agent_type IS NOT NULL |
session_count | COUNT(*) FROM sdk_sessions only (do NOT add observations' distinct memory_session_id — same sessions, double count) |
session_completed_count / session_failed_count | sdk_sessions GROUP BY day, status (closed enum) |
sessions_claude_count / sessions_codex_count / sessions_gemini_count / sessions_other_platform_count | sdk_sessions platform_source, bucketed in JS to the closed enum |
summary_count | COUNT(*) FROM session_summaries |
discovery_tokens | SUM(discovery_tokens) FROM session_summaries only (per-turn cost — see Phase 0 storage semantics) |
prompt_count | COUNT(*) FROM user_prompts (count only — never prompt_text) |
project_count | COUNT(DISTINCT project) over (SELECT day, project FROM observations UNION SELECT day, project FROM sdk_sessions) — cross-table distinct in ONE query; never sum per-table distincts (triple-counts the same project) |
Omitted on purpose: session durations (verified dirty — 22% of completed sessions show >24h stale spans), dev-only columns not in runner.ts, anything string-valued.
findFirstActivityEpochMs(db: Database): number | nullSELECT MIN(asMs(started_at_epoch)) FROM sdk_sessions WHERE asMs(started_at_epoch) >= FLOOR
(copy install-stats.ts:56-61 — note asMs INSIDE the MIN), falling back to the observations
MIN only if sdk_sessions is empty. Keep sessions-first: session timestamps are write-time
and trustworthy, while observation epochs can be backdated artifacts (verified — see 1.1);
a cross-table MIN would bake an artifact date into undeletable data.
deterministicEventUuid(installId, event, day): stringBun.randomUUIDv5(${installId}|${event}|${day}, BACKFILL_NAMESPACE) where
BACKFILL_NAMESPACE is one fixed UUID constant in the module. One line; deterministic;
no hand-rolled hashing.
buildBackfillEvents(db, installId, nowMs): EventMessage-like[]Pure assembly:
historical_activity with timestamp: new Date(day + 'T12:00:00Z') and the
deterministic uuid. Noon UTC is load-bearing TWICE: it keeps the event inside its day for
dashboard timezones in UTC-12..+11 (keep the PostHog project timezone on UTC), and it makes
the timestamp retry-stable, which the dedupe key requires — do not "simplify" to a
non-deterministic timestamp. Properties: the rollup counters + backfilled: true, passed
through scrubProperties(...), then $process_person_profile: false.
No buildBaseProperties() (see anti-patterns).install_inferred with timestamp: new Date(installDay + 'T12:00:00Z') (noon UTC —
retry-stable even if the sessions/observations MIN source flips between runs) and uuid keyed
on day 'install'. Properties: { ...buildBaseProperties(), first_active_date: installDay, backfilled: true } through scrubProperties, then $set = buildPersonSet(props)
(copy telemetry.ts:91-95). first_active_date MUST be assigned here — buildPersonSet
only copies keys present on the event, and a whitelisted-but-never-assigned trait silently
never ships. Base props are fine on this one person-event ($set = current person state).installDay > lastFullDay (install younger than ~60h): return []. Such installs have
live telemetry coverage for their entire life — there is no pre-telemetry history to
reconstruct, and shipping a <48h timestamp violates the migration contract.Add to ALLOWED_PROPERTY_KEYS (scrub.ts:8-117) — verify each against the file first:
discovery_tokens (confirmed absent), summary_count, prompt_count, project_count,
backfilled, first_active_date, session_completed_count, session_failed_count,
sessions_claude_count, sessions_codex_count, sessions_gemini_count,
sessions_other_platform_count, subagent_obs_count.
Already whitelisted — reuse, do not re-add: observation_count, session_count,
the obs_type_* family. (session_count/observation_count carry different semantics on
live context_injected — acceptable: PostHog properties are filtered per-event in practice;
note the collision in the PR description.)
Add first_active_date to PERSON_PROPERTY_KEYS (common.ts:36-60).
runHistoricalBackfill(db: Database): Promise<void>Marker file ~/.claude-mem/backfill.json at join(dataDir, 'backfill.json') (mirror
getTelemetryConfigPath(), consent.ts:76-78). Shape:
{ completedAt: string, throughDay: string, eventCount: number, installId: string }.
Read with readJsonSafe; write with mkdirSync + writeFileSync (as
saveTelemetryConfig, consent.ts:103-107).
Gate sequence (ORDER MATTERS — debug must precede every marker write):
resolveTelemetryConsent(process.env, loadTelemetryConfig()) false → return
without writing the marker (a later opt-in still backfills).CLAUDE_MEM_TELEMETRY_DEBUG === '1' → process.stderr.write one summary line
(event count + day range) then one line per event (copy telemetry.ts:97-103), do NOT
send, do NOT write marker — even when the event list is empty. This dry-run intentionally
re-runs on every debug-mode worker start; the marker must never latch from debug mode.
(Only the exact value '1' activates it, matching captureEvent.)install_completed/daily events cover them from day 0 — install_inferred
is intentionally NOT emitted for them.)!getTelemetryApiKey() → return without marker. (Vestigial — the embedded key makes this
unreachable; keep the one-liner for symmetry with captureEvent, write no test for it.)new PostHog(getTelemetryApiKey(), { host: getTelemetryHost(), historicalMigration: true, flushAt: 5000, maxBatchSize: 5000, maxQueueSize: 5000, disableGeoip: false }) — the 5000s
guarantee a single batch, no swallowed background flushes, and no silent queue-cap drops
(see Phase 0 SDK facts).const errors: unknown[] = []; client.on('error', e => errors.push(e)), then
client.capture({ distinctId: getOrCreateInstallId(), event, properties, timestamp, uuid })
per event, then await client.shutdown() — NO separate flush() call (it is not a
delivery barrier) and NO 3s race (this runs fire-and-forget in the background; the SDK's
default shutdown timeout is fine).errors.length === 0. Wrap the whole body
in try/catch that logs via logger ('SYSTEM') and never throws (telemetry must never
break the worker — telemetry.ts:114-116).Verification checklist:
tests/telemetry/backfill.test.ts using the :memory: pattern from
tests/telemetry/install-stats.test.ts:8-28 with the schema EXTENDED per Phase 0
(discovery_tokens, memory_session_id, type, agent_type, status, platform_source,
user_prompts table). Cases:
(a) mixed second/ms epochs land on the correct day (insert created_at_epoch = 1755000000 seconds; assert no 1970 day);
(b) day window: a row 47h old is excluded AND no partial day ever ships (whole-day
buckets only); a row before PROJECT_EPOCH_FLOOR or before installDay produces no day;
(c) deterministicEventUuid stable across calls, general UUID shape (it is a v5 — do
not assert a v4 version nibble);
(d) every property in built events survives scrubProperties (no silent drops);
(e) empty DB → zero events, no throw;
(f) discovery_tokens dedupe: one turn = 3 observations + 1 summary all carrying 100 →
day total 100, not 400;
(g) one session + one observation in it, same day → session_count 1 and
project_count 1 (no double counting);
(h) install_inferred uses the sessions MIN, is stamped noon UTC of installDay, and
its $set contains first_active_date;
(i) first activity 1h ago → zero events;
(j) consent-off → zero postHogCaptureCalls, no marker; marker present → zero calls;
debug mode → zero calls, no marker including on an empty DB; second invocation after
success → zero calls;
(k) happy path → constructor received historicalMigration: true, every capture has
uuid + Date timestamp, marker written with correct throughDay; an error emitted
via the mock's on handler → NO marker.bun test tests/telemetry/ passes.Anti-pattern guards: no marker write reachable before the debug gate; sum
summaries-only for discovery_tokens; day-string windowing only; no buildBaseProperties on
historical_activity; no console.*; no raw epoch math without asMs.
This phase MUST complete before Phase 3 wires anything into the worker, because PostHog data
cannot be selectively deleted and getTelemetryApiKey() always returns the embedded
production key.
runHistoricalBackfill against the real ~/.claude-mem/claude-mem.db with
CLAUDE_MEM_TELEMETRY_DEBUG=1 (small runner script or one-off test), and eyeball the
stderr payload dump:
lastFullDay (≈ T-2.5 days), no 1970 days, no day before
install_inferred's first_active_date.install_inferred, first_active_date: 2025-10-19, noon-UTC timestamp.first_active_date date string.rm ~/.claude-mem/backfill.json.Export CLAUDE_MEM_TELEMETRY=0 (or CLAUDE_MEM_TELEMETRY_DEBUG=1) in the dev shell for
every worker boot in this phase — npm run build-and-sync restarts the worker, and an
unguarded boot performs the real production send from the dev machine.
Wire into startup: in src/services/worker-service.ts, immediately after the
worker_started capture (lines 532-540) and alongside the ChromaSync fire-and-forget
precedent (lines 551-556), add:
runHistoricalBackfill(this.dbManager.getConnection()).catch(error => {
logger.error('SYSTEM', 'Telemetry historical backfill failed (non-blocking)', {}, error as Error);
});
Non-blocking, after core init. Do not add it to the heartbeat — it is one-shot by marker (a failed run retries on the NEXT worker start because no marker was written).
Docs: update docs/public/telemetry.mdx — add a short "Historical backfill" section:
what is sent (daily anonymous counts + inferred install date), that it runs once, that it
honors the same consent gates (DO_NOT_TRACK, CLAUDE_MEM_TELEMETRY=0, telemetry.json),
that opting out before first worker start after upgrade prevents it entirely, and that geo
properties on backfilled events reflect upload-time location. Follow the existing page's
tone/structure (read it first).
Verification checklist:
bun test (full suite) passes — especially tests/logger-usage-standards.test.ts.grep -rn "runHistoricalBackfill" src/ shows exactly two hits: definition + the one
worker-service call site.CLAUDE_MEM_TELEMETRY=0 npm run build-and-sync, then confirm via worker log (~/.claude-mem/logs/) that
startup completes and no backfill error is logged.Anti-pattern guards: do not block initializeBackground() on the backfill promise; do not
capture worker_stopped-style events from inside backfill; no unguarded dev-machine boots.
grep -rn "historical_migration" src/ (wrong spelling for SDK path)grep -rn "console\." src/services/telemetry/backfill.tsgrep -rn "from 'uuid'" src/grep -n "buildBaseProperties" src/services/telemetry/backfill.ts → must appear ONLY in
the install_inferred builder, never for historical_activity.scrubProperties(buildBackfillEvents(...)[i].properties)
retains every expected key on both event types.bun test green; npm run build-and-sync (telemetry-guarded shell)
succeeds; worker starts.rm ~/.claude-mem/backfill.json first).worker_started) and one profile-less event (session_compressed) for a known
installId and confirm it counts 1 user — this validates that profile-less
historical_activity and person install_inferred merge as one unique user.install_inferred ∪ install_completed(is_update=false), deduped per
distinct_id (telemetry-era installs emit both — document on the dashboard).historical_activity users by week = reconstructed WAU.version slicing must filter
backfilled != true; geo on backfilled events is upload-time.Phase 1 (module + tests) → Phase 2 (live dry-run gate, BEFORE any wiring) → Phase 3
(wiring + docs, telemetry-guarded) → Phase 4 (verification + post-ship). An executor starting
any phase cold should read this file's Phase 0 plus the cited source files for that phase
before writing code. Run bun install first — the worktree has no node_modules.
Adversarially reviewed by a 55-agent workflow (5 dimensions × independent skeptic verification; 26 findings confirmed, 6 downgraded, 1 disputed, 0 refuted). Material changes vs. the first draft:
utcDay(now-60h); was an ambiguous row-level
48h filter that could ship truncated days and <48h timestamps).first_active_date assignment (was whitelisted but
never attached to any event).flush() is not a delivery barrier
and shutdown() swallows errors — single-batch config + on('error') latch + marker only
on clean shutdown. Dropped the 3s race and the dead "flush every 5,000" advice.