plans/03-worker-lifecycle.md

Plan 03 — Worker / Daemon Lifecycle Hardening

Scope: Fix accumulated worker / daemon lifecycle bugs in claude-mem. Address DB bloat, chroma-mcp leaks, retry storms, port/PID races, queue zombies, missing supervision, and observability gaps.

Non-implementation: This document is a plan. Each phase is self-contained; an executing agent should be able to run a single phase without re-discovering context.

Audience: Subsequent agents executing one phase per session.


Phase 0 — Documentation Discovery & Allowed APIs

Goal: Anchor every implementation phase in real APIs that exist in the current codebase or in vetted libraries. Prevent phantom-method invention.

0.1 Read these files end-to-end before touching code

| File | Why |
| --- | --- |
| CLAUDE.md (project root) | Architecture, exit-code strategy, Pro/OSS boundary, settings conventions |
| src/services/worker-service.ts | WorkerService class, --daemon main(), signal registration, all CLI subcommands |
| src/services/worker-spawner.ts | ensureWorkerStarted 3-state machine (ready/warming/dead) |
| src/services/infrastructure/ProcessManager.ts | spawnDaemon, PID file ops, captureProcessStartToken, isProcessAlive |
| src/services/infrastructure/HealthMonitor.ts | isPortInUse, waitForHealth, waitForReadiness, httpShutdown |
| src/services/infrastructure/GracefulShutdown.ts | performGracefulShutdown ordering |
| src/services/infrastructure/CleanupV12_4_3.ts | runOneTimeV12_4_3Cleanup, STUCK_PENDING_THRESHOLD = 10, observer-purge SQL |
| src/services/sync/ChromaMcpManager.ts | ensureConnected, connectInternal, stop, killProcessTree, collectDescendantPids, RECONNECT_BACKOFF_MS = 10_000, MCP_CONNECTION_TIMEOUT_MS = 30_000 |
| src/supervisor/index.ts | Supervisor class, validateWorkerPidFile, signal-handler config |
| src/supervisor/process-registry.ts | ProcessRegistry, getSdkProcessForSession, ensureSdkProcessExit, waitForSlot, TOTAL_PROCESS_HARD_CAP = 10 |
| src/supervisor/health-checker.ts | 30s pruneDeadEntries loop (already present — extend, don't replace) |
| src/supervisor/shutdown.ts | runShutdownCascade, signalProcess, loadTreeKill |
| src/services/worker/SessionManager.ts | In-memory session map, deleteSession, queue/pending integration |
| src/services/worker/RestartGuard.ts | Per-session restart cap (10/60s window, 5 consecutive) |
| src/services/worker/retry.ts | Provider-level retry (withRetry, classified errors) — DO NOT mutate; circuit breaker layers ABOVE this |
| src/shared/worker-utils.ts | recordWorkerUnreachable (line 401), executeWithWorkerFallback (line 443), fail-loud counter file at ~/.claude-mem/state/hook-failures.json |
| src/services/sqlite/Database.ts | PRAGMA setup (lines 27-32, 69-74) — single source of truth for DB pragmas |
| src/services/server/Server.ts | /api/health (line 161), /api/readiness (line 178), /api/version (line 192) |
| src/shared/SettingsDefaultsManager.ts | Where every new setting key MUST be declared with a default |
| src/shared/hook-constants.ts | HOOK_TIMEOUTS, HOOK_EXIT_CODES — extend here, don't inline |
| plugin/bun-runner.js, plugin/scripts/worker-service.cjs | Built worker entrypoint — note the build pipeline (scripts/build-hooks.js) |

0.2 Allowed APIs (use these, do NOT invent siblings)

SQLite (bun:sqlite) — pragma calls are db.run('PRAGMA …') or db.prepare('PRAGMA …').get(). Existing pragmas: journal_mode=WAL, synchronous=NORMAL, foreign_keys=ON, temp_store=memory, mmap_size, cache_size. VACUUM runs only outside a transaction. VACUUM INTO 'path' is the backup form already used in CleanupV12_4_3.ts:135. wal_checkpoint(TRUNCATE) is the truncating-checkpoint form.
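
A minimal sketch of these calls with bun:sqlite (paths are placeholders; this illustrates the allowed API shapes, not the actual Database.ts setup):

```ts
import { Database } from "bun:sqlite";

const db = new Database("/tmp/claude-mem-example.db");

// Pragma writes via db.run(...)
db.run("PRAGMA journal_mode = WAL");
db.run("PRAGMA synchronous = NORMAL");
db.run("PRAGMA foreign_keys = ON");

// Pragma reads via db.prepare(...).get()
const { freelist_count } = db.prepare("PRAGMA freelist_count").get() as { freelist_count: number };
console.log("free pages:", freelist_count);

// Backup form used in CleanupV12_4_3: writes a compacted copy without a long writer lock.
db.run("VACUUM INTO '/tmp/claude-mem-backup.db'");

// Truncating checkpoint shrinks the WAL file; plain VACUUM must run outside any transaction.
db.run("PRAGMA wal_checkpoint(TRUNCATE)");
db.run("VACUUM");
```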

Process supervision — getSupervisor(), getProcessRegistry(), registerProcess(id, info, processRef?), unregisterProcess(id), pruneDeadEntries(), assertCanSpawn(type), runShutdownCascade(...). Tree-kill on POSIX uses pgrep -P recursion + process.kill(-pgid, signal); on Windows uses taskkill /T /F /PID or tree-kill npm.

HTTP/Express — Server.app.get('/api/...', handler) via registerRoutes (handlers implement setupRoutes(app) on a RouteHandler interface). Every new endpoint must follow the existing RouteHandler pattern under src/services/worker/http/routes/.

Settings — SettingsDefaultsManager.get('CLAUDE_MEM_…'), SettingsDefaultsManager.loadFromFile(path). New keys require: (a) type added to the interface in SettingsDefaultsManager.ts, (b) default value declared in the same file, (c) documented in CLAUDE.md if user-tunable.

Logging — logger.info(category, msg, fields), logger.warn, logger.error(category, msg, fields, error). Categories used here: SYSTEM, WORKER, SESSION, CHROMA_MCP, SDK, DB, QUEUE, PROCESS. Add new category MAINTENANCE for VACUUM / reaper events.

0.3 Anti-patterns — explicitly forbidden

  • Do not add a new singleton supervisor — extend getSupervisor().
  • Do not spawn child processes without going through getSupervisor().assertCanSpawn(...) and registerProcess(...).
  • Do not call process.exit(1) on hook-side error paths — it accumulates Windows Terminal tabs (CLAUDE.md exit-code strategy). Use 0 for graceful, 2 only for blocking-error paths that need to surface stderr to Claude.
  • Do not delete sdk_sessions rows if observations or session_summaries still reference their memory_session_id without an explicit user-opt-in flag.
  • Do not hold a SQLite write lock during VACUUM while ingestion is hot. Pause queue processing first.
  • Do not introduce setInterval timers that keep the event loop alive — every new timer must call .unref().
  • Do not invent settings keys — declare them in SettingsDefaultsManager.ts first.

0.4 Confidence note

Confidence: HIGH on file/API inventory (read-pass complete on all referenced files). MEDIUM on Windows behavior of new advisory locks (Windows mandatory locking via lockf is bun-runtime-dependent — verify via spike before committing).


Phase 1 — Inventory & Instrumentation (read-only, safe)

Goal: Produce a written state-machine diagram and an exit-site catalog that subsequent phases reference. No code changes; create a scratch document at docs/internal/worker-lifecycle-state-machine.md if the executor wants an artifact, otherwise capture findings in commit messages.

1.1 Tasks

  1. Trace the worker daemon spawn → terminate path end-to-end. Source order:

    • Hook entry → src/shared/worker-utils.ts:ensureWorkerRunning (lazy spawn) OR src/services/worker-spawner.ts:ensureWorkerStarted (explicit)
    • spawnDaemon (src/services/infrastructure/ProcessManager.ts:408) — POSIX uses setsid if available, Windows uses Start-Process -WindowStyle Hidden
    • --daemon branch in src/services/worker-service.ts:937 — duplicate-PID/duplicate-port guard
    • WorkerService.start() (line 258) → startSupervisor() → server.listen() → writePidFile() → getSupervisor().registerProcess('worker', ...) → initializeBackground()
    • Signal handlers via configureSupervisorSignalHandlers (src/supervisor/index.ts:49) — SIGTERM/SIGINT; SIGHUP ignored in --daemon mode on POSIX
    • Shutdown: WorkerService.shutdown() → performGracefulShutdown → server close → sessionManager.shutdownAll() → mcp client close → chroma stop → db close → getSupervisor().stop() → runShutdownCascade → PID file unlink
  2. Catalog every process.exit(...) site in worker-service.ts (already mapped — 21 sites; lines 764, 772, 794, 804, 810, 813, 828, 835, 842, 853, 870, 878, 888, 895, 916, 933, 945, 950, 971, 975, 991). Annotate each with: code, intent, whether it leaks the worker on the same path, whether shutdown ran first.

  3. Catalog every retry / unreachable site:

    • src/shared/worker-utils.ts:401 recordWorkerUnreachable (the #1874 counter)
    • src/cli/handlers/{context,file-context,file-edit,summarize,observation,user-message,session-init}.ts — every executeWithWorkerFallback caller
    • src/servers/mcp-server.ts:72,100,145 — direct workerHttpRequest
    • src/services/transcripts/processor.ts:331,371,373 — direct workerHttpRequest
    • src/services/integrations/CursorHooksInstaller.ts:64,349,352 — direct workerHttpRequest
    • src/utils/claude-md-utils.ts:305 — direct workerHttpRequest
  4. Catalog every spawn site:

    • spawnDaemon (worker self-spawn)
    • ChromaMcpManager.connectInternal (chroma-mcp via uvx → uv → python → chroma-mcp)
    • spawnSdkProcess (src/supervisor/process-registry.ts:532) — Claude SDK subprocesses
    • runMcpSelfCheck (src/services/worker-service.ts:405) — MCP loopback probe via process.execPath
    • Any execSync / execFile / spawnSync in ChromaMcpManager (cert resolution) or ProcessManager (binary lookup, cwd-remap)

1.2 Acceptance criteria

  • Markdown table written (commit message or scratch doc) listing every spawn and exit site with file:line.
  • A 1-paragraph English description of the worker state machine (states + transitions) suitable to paste into PR descriptions.
  • Confirmed list of which executeWithWorkerFallback callers run inside hooks (Claude Code's strict timeout window) vs. inside the worker (no timeout pressure) — this drives Phase 4 circuit-breaker scoping.

1.3 Verification

  • grep -rn "process.exit" src/ --include="*.ts" | wc -l matches the catalog.
  • grep -rn "executeWithWorkerFallback\|workerHttpRequest" src/ --include="*.ts" | grep -v worker-utils.ts | wc -l matches the catalog.

1.4 Deliverable

Hand-off note for Phase 2-8 executors with file/line anchors; no code committed.


Phase 5 — PID/Port Reclamation & Race-Free Startup

Shipping order: Phase 5 first (per Phase 8 ordering). Idempotent and safe.

Goal: Eliminate the silent-exit-0 case where a fresh --daemon spawn loses the port race; harden cross-platform PID-reuse detection; serialize concurrent spawns with an OS-level advisory lock.

5.1 Files to modify

| File | Change |
| --- | --- |
| src/supervisor/process-registry.ts | Extend captureProcessStartToken for macOS (already partial via ps -o lstart) and Windows (wmic process where ProcessId=X get CreationDate /value). Add unit test for each platform branch. |
| src/supervisor/index.ts:validateWorkerPidFile | Add port-on-pid match check — if pidInfo.port !== currentExpectedPort, treat as 'stale'. |
| src/services/infrastructure/ProcessManager.ts | Add new exports: acquireDaemonLock() / releaseDaemonLock() using POSIX flock (via fcntl/flock syscall through bun:ffi or shelling to flock(1) on Linux only) and Windows mandatory file lock via LockFile (or fall back to atomic-rename sentinel on Windows). |
| src/services/worker-service.ts:937 (--daemon branch) | Wrap startup in acquireDaemonLock(). If port is in use, perform a /api/version probe; if the listener returns OUR BUILT_IN_VERSION → exit 0 (legit duplicate); if it returns a different version → log a warning and exit 0 (stale worker, will be restarted by the version-mismatch path); if the listener doesn't respond → wait HOOK_TIMEOUTS.PORT_IN_USE_WAIT, then write a clear diagnostic line to stderr before exiting. |
| src/services/worker-spawner.ts | Same lock acquisition before spawnDaemon. Release on success or error. |

5.2 Detailed tasks

  1. macOS start-time token: extend captureProcessStartToken (registry line 56). On Darwin, prefer ps -p <pid> -o lstart= (already in fallback path). Verify with LC_ALL=C LANG=C env so locale doesn't change the timestamp format. Add a comment explaining that ps lstart resolution is 1-second — collisions still possible but vastly less likely than no-token.

  2. Windows start-time token: add a Win32 branch using wmic process where ProcessId=<pid> get CreationDate /value. Parse the CreationDate=YYYYMMDDHHMMSS.ffffff+TZ line. Cache the wmic resolution per-pid for 5s (avoid re-shelling on repeat checks).

  3. Port-on-pid match: in validateWorkerPidFile, after confirming isPidAlive(pidInfo.pid), verify the recorded pidInfo.port is reachable via isPortInUse(pidInfo.port) AND the listener's /api/version returns a version string. If port is dead but PID alive → return 'stale' (worker crashed mid-listen, PID about to be reused).

  4. Advisory lock:

    • POSIX: open <DATA_DIR>/.worker-spawn.lock with O_RDWR | O_CREAT, flock(fd, LOCK_EX | LOCK_NB). On EAGAIN, log Another spawn in progress, waiting up to 5s and retry with LOCK_EX (blocking) under a setTimeout race. Implement via bun:ffi for POSIX flock(2) if available, otherwise shell flock -n -x <path> <command>. Spike first: confirm bun's bun:ffi exposes flock. If not, use a watch-and-rename sentinel (less ideal but works).
    • Windows: Use LockFile via Win32 API or fall back to atomic mkdirSync of <DATA_DIR>/.worker-spawn.lock.dir (fails if exists) with stale-timeout cleanup at 30s. A sentinel-fallback sketch follows after this list.
  5. Diagnostic stderr: when port-in-use without our worker responding, write to stderr (and log INFO) with: claude-mem worker port <N> in use by an unidentified process; not spawning duplicate. This must NOT block the hook — exit 0 still per CLAUDE.md.
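
A cross-platform sketch of the sentinel-directory fallback from task 4 (the acquireDaemonLock/releaseDaemonLock names come from the table above; the mkdir-based locking, the data-dir resolution, and the 30s stale-timeout handling are assumptions pending the bun:ffi flock spike):

```ts
import { mkdirSync, rmSync, statSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

// Assumed data dir; the real DATA_DIR resolution lives in ProcessManager.
const LOCK_DIR = join(homedir(), ".claude-mem", ".worker-spawn.lock.dir");
const STALE_LOCK_MS = 30_000;

// Atomic mkdir doubles as a mutex: it throws EEXIST if another spawner holds the lock.
export function acquireDaemonLock(): boolean {
  try {
    mkdirSync(LOCK_DIR);
    return true;
  } catch (err: any) {
    if (err?.code !== "EEXIST") throw err;
    // Stale-lock cleanup: a previous spawner may have died without releasing.
    const ageMs = Date.now() - statSync(LOCK_DIR).mtimeMs;
    if (ageMs > STALE_LOCK_MS) {
      rmSync(LOCK_DIR, { recursive: true, force: true });
      return acquireDaemonLock();
    }
    return false; // another spawn is in progress; caller waits or bails
  }
}

export function releaseDaemonLock(): void {
  rmSync(LOCK_DIR, { recursive: true, force: true });
}
```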

5.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS | 5000 | 0–60000 | Max wait for the spawn lock |
| CLAUDE_MEM_PID_PORT_RECHECK_MS | 2000 | 500–30000 | Wait window before treating port-in-use without /api/version response as "unknown listener" |

5.4 Acceptance criteria

  • Run two claude-mem start commands in parallel → exactly one daemon ends up alive; the other exits cleanly with a log line referencing the lock.
  • Kill the worker -9 (skip cleanup), reuse the PID with python -c 'import time; time.sleep(60)' → validateWorkerPidFile returns 'stale' and removes the file.
  • On macOS, run worker, capture token, kill, spawn unrelated process with same PID, spawn worker again → token mismatch detected; old PID file ignored.
  • /api/version probe path: spawn a fake server on the worker port → daemon exits 0 with the new diagnostic stderr, NOT silently.

5.5 Observability hooks

  • Log SYSTEM INFO Daemon spawn lock acquired on success.
  • Log SYSTEM WARN Daemon spawn lock contention, fields {waitedMs}.
  • Log SYSTEM WARN Worker port occupied by foreign listener, fields {port, probeStatus}.
  • New /api/healthz fields (added in Phase 7): pid_file_path, pid_start_token, daemon_lock_held: bool.

5.6 Verification checklist

  • grep "process.exit(0)" src/services/worker-service.ts — count unchanged (no new silent exits introduced).
  • Manual two-process race test (Linux + macOS + Windows VM).
  • Existing health-check tests still pass.
  • No new always-on setInterval introduced.

Phase 6 — DB Maintenance (VACUUM / WAL)

Ships alongside Phase 5 (idempotent).

Goal: Recover the 504 MB of free pages, prevent recurrence, surface DB-size metrics.

6.1 Files to modify

| File | Change |
| --- | --- |
| src/services/sqlite/Database.ts:27-32 and :69-74 | Add PRAGMA auto_vacuum = INCREMENTAL BEFORE the first table is created (only takes effect on a fresh DB; harmless on existing DBs but logs a no-op). For existing DBs, the migration path is the one-shot Phase-6 startup VACUUM. |
| src/services/maintenance/DbMaintenance.ts (new) | Periodic maintenance task: on a 24h timer (configurable), call PRAGMA incremental_vacuum, PRAGMA wal_checkpoint(TRUNCATE), then collect metrics (page_count, freelist_count, file size). Emit MAINTENANCE INFO log. Acquire dbMaintenanceMutex so other writers wait. |
| src/services/maintenance/DbMaintenance.ts | Startup check: if freelist_count / page_count > FREE_RATIO_VACUUM_THRESHOLD (default 0.40), perform full VACUUM after VACUUM INTO backup to <DATA_DIR>/backups/claude-mem-pre-vacuum-<ts>.db. Pause queue processor first. |
| src/services/worker-service.ts:initializeBackground | Wire the maintenance task — start after dbManager.initialize(). Timer must .unref(). |
| src/services/worker/SessionManager.ts | Expose pauseQueueProcessing(): Promise<void> and resumeQueueProcessing(): void. Use the existing AbortController + emitter to drain in-flight work; don't introduce new state. Maintenance acquires; readers continue (WAL allows them). |
| src/services/infrastructure/CleanupV12_4_3.ts:135 | Reuse the existing VACUUM INTO backup pattern verbatim — copy the disk-space pre-flight check (statfsSync, line 115). |

6.2 Detailed tasks

  1. Auto-vacuum on new DBs: Add PRAGMA auto_vacuum = INCREMENTAL in Database.ts BEFORE migrationRunner.runAllMigrations(). Verify with a comment that this is no-op on existing DBs (sqlite docs say a full VACUUM is required to flip auto_vacuum mode after tables exist). Document the migration path: existing users get the freed-page reclamation via the startup full VACUUM in step 3.

  2. Periodic incremental vacuum + WAL checkpoint:

    • Schedule via setInterval with .unref(). Default cadence: 24h. Setting: CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS (default 24, min 1, max 168).
    • Each tick: acquire mutex → db.run('PRAGMA incremental_vacuum') → db.run('PRAGMA wal_checkpoint(TRUNCATE)') → snapshot metrics → release.
    • Skip the tick if a VACUUM is in progress.
  3. Startup full VACUUM (one-shot per session) when free-ratio is high:

    • Read page_count (PRAGMA page_count) and freelist_count (PRAGMA freelist_count).
    • If freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO (default 0.40), schedule a deferred VACUUM (5 minutes after worker becomes ready) to avoid slowing startup.
    • VACUUM steps: pause queue → VACUUM INTO '<backup>' → verify backup → VACUUM (full) → resume queue → log freed pages and ms taken.
    • Disk-space pre-flight: statfsSync (mirror CleanupV12_4_3.ts:115). Skip if free space < 1.2 * dbSize + 100MB. Log MAINTENANCE ERROR in that case so the user sees actionable info.
  4. Pause/resume hook in SessionManager: The existing for await ... of getMessageIterator() loop in queue processor needs a "pause" semaphore. Implementation: add a Promise<void> gate that the iterator awaits before yielding. Maintenance flips it to a pending promise during VACUUM; resolve to release. Do not abort in-flight messages — they can complete; new messages wait. A sketch of the gate follows after this list.

  5. Cleanup-V12.4.3 regression detection: Re-scan sdk_sessions WHERE project = OBSERVER_SESSIONS_PROJECT and pending_messages matching the stuck-pending pattern at maintenance ticks. If any match AND the marker exists, log MAINTENANCE WARN and re-run the purge (idempotent). Setting: CLAUDE_MEM_CLEANUP_REGRESSION_CHECK = true.
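
A minimal sketch of the Promise-gate pause from task 4 (the pauseQueueProcessing/resumeQueueProcessing names come from the 6.1 table; the gate internals and the drain step are assumptions):

```ts
// One shared gate; the queue loop awaits it before handling each message.
let gate: Promise<void> = Promise.resolve();
let openGate: (() => void) | null = null;

export async function pauseQueueProcessing(): Promise<void> {
  if (openGate) return; // already paused
  gate = new Promise<void>((resolve) => { openGate = resolve; });
  // The real implementation would also await drain of in-flight messages here.
}

export function resumeQueueProcessing(): void {
  openGate?.();
  openGate = null;
}

// Queue processor: in-flight work completes; new messages park at the gate while paused.
export async function processQueue(messages: AsyncIterable<string>) {
  for await (const msg of messages) {
    await gate; // blocks here while maintenance holds the pause
    await handleMessage(msg);
  }
}

async function handleMessage(_msg: string): Promise<void> { /* stand-in for the real per-message work */ }
```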

6.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| CLAUDE_MEM_DB_MAINTENANCE_ENABLED | true | bool | Master kill-switch |
| CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS | 24 | 1–168 | Periodic cadence |
| CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO | 0.40 | 0.05–0.95 | Free-ratio above which we auto-VACUUM at startup |
| CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS | 300000 (5 min) | 0–3600000 | Defer startup VACUUM so it doesn't block readiness |
| CLAUDE_MEM_CLEANUP_REGRESSION_CHECK | true | bool | Re-scan for v12.4.3-shaped pollution |

6.4 Acceptance criteria

  • Reproduce the bloat scenario: stuff pending_messages with 100k stuck processing rows, run worker → startup VACUUM fires within 5 min after readiness, freed-pages log line appears, file size drops.
  • Existing 532 MB DBs reclaim ≥ 95% of free pages on first run (matches the 28 MB target observed manually).
  • Hot-ingestion test: enqueue 1000 observations during a maintenance tick → no SQLITE_BUSY or database is locked errors; queue resumes after VACUUM.
  • PRAGMA auto_vacuum returns 2 (incremental) on freshly-created DBs.
  • Maintenance loop ticks honor .unref() → process.exit(0) from a clean shutdown returns immediately, not after the 24h interval.

6.5 Observability hooks

  • New log category: MAINTENANCE.
  • Events: MaintenanceStart, MaintenanceTick, VacuumStart, VacuumComplete ({freedPages, ms, dbSizeBeforeMb, dbSizeAfterMb}), VacuumSkippedLowDisk, RegressionDetected, MaintenanceComplete.
  • /api/healthz fields (Phase 7): db_page_count, db_freelist_count, db_free_ratio_pct, db_size_bytes, db_last_vacuum_at, db_last_vacuum_freed_pages, db_last_maintenance_at.

6.6 Anti-pattern guards

  • Do not call VACUUM inside a transaction (sqlite errors).
  • Do not hold the queue pause across the VACUUM INTO backup phase — only the final full VACUUM needs the writer-lock window. (VACUUM INTO works on a read-only snapshot.)
  • Do not call PRAGMA wal_checkpoint(FULL) — TRUNCATE is required to actually shrink the WAL file.

6.7 Verification checklist

  • Backup created at <DATA_DIR>/backups/ before every full VACUUM.
  • Maintenance timer registered with .unref() (grep for setInterval in the new file → unref() follows each).
  • No new direct setInterval outside the maintenance file.
  • PRAGMA list in Database.ts extended with auto_vacuum and includes a comment about migration.

Phase 2 — Stuck-Session Reaper (fix v12.4.3 bloat)

Goal: Stop pending_messages and sdk_sessions from accumulating zombies.

2.1 Files to modify

| File | Change |
| --- | --- |
| src/services/maintenance/SessionReaper.ts (new) | Periodic reaper. Plugs into the supervisor's existing health-checker.ts 30s tick (extend, do not replace). |
| src/supervisor/health-checker.ts:9 runHealthCheck | Call SessionReaper.tick() after pruneDeadEntries(). |
| src/services/worker/SessionManager.ts:deleteSession | After in-memory delete, call pendingStore.clearPendingForSession(sessionDbId) synchronously (it already does this via clearPendingForSession on a separate path — verify and unify). |
| src/services/sqlite/PendingMessageStore.ts | Add reapStuckProcessing(olderThanMs: number): number returning the count of rows reset to pending. |
| src/services/sqlite/SessionStore.ts | Add findInactiveSdkSessions(olderThanDays: number): Array<{id, project, contentSessionId, memorySessionId, lastActivityAt}>. |
| src/services/sqlite/SessionStore.ts | Add markSdkSessionInactive(id: number) — adds an inactive_at column or sets a sentinel. |
| src/services/sqlite/migrations/runner.ts | New migration: add inactive_at TEXT NULL to sdk_sessions if absent. |

2.2 Reaper logic

Per tick (default 30s, gated by CLAUDE_MEM_REAPER_ENABLED):

  1. Stuck-processing sweep: UPDATE pending_messages SET status='pending' WHERE status='processing' AND updated_at < <now - PROCESSING_STUCK_MS> (default 5 minutes). Log count if > 0 (see the sketch after this list).

  2. Orphan-pending sweep: DELETE FROM pending_messages WHERE session_db_id NOT IN (SELECT id FROM sdk_sessions) (defensive — should already be FK-protected but log if any deleted).

  3. Inactive-session detection (does NOT delete):

    • SELECT sdk_sessions where id NOT IN <in-memory session ids> AND last_activity > N days ago (computed from MAX of related observations / pending_messages / session_summaries timestamps).
    • For each: UPDATE sdk_sessions SET inactive_at = <now> WHERE id = ? AND inactive_at IS NULL.
  4. Observer-pollution regression check (matches Phase 6 task 5):

    • If OBSERVER_SESSIONS_PROJECT rows reappear after the v12.4.3 marker is present, re-run the purge SQL from CleanupV12_4_3.runObserverSessionsPurge (lines 196-218).
    • Log MAINTENANCE WARN with counts.
  5. Hard delete is opt-in via CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS (default 0 = disabled; nonzero = days threshold). When enabled and a session has inactive_at older than the threshold AND no FK-referencing rows, hard-delete the session row. Default-off because user data safety > disk space.
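
A sketch of the stuck-processing sweep from step 1, shaped as the reapStuckProcessing method planned for PendingMessageStore (column and status names follow the SQL above; the ISO-8601 updated_at format is an assumption):

```ts
import { Database } from "bun:sqlite";

// Resets rows stuck in 'processing' back to 'pending' and returns how many were touched.
export function reapStuckProcessing(db: Database, olderThanMs: number): number {
  const cutoff = new Date(Date.now() - olderThanMs).toISOString();
  db.prepare(
    "UPDATE pending_messages SET status = 'pending' WHERE status = 'processing' AND updated_at < ?"
  ).run(cutoff);
  // SQLite's changes() reports rows affected by the last statement on this connection.
  const { n } = db.prepare("SELECT changes() AS n").get() as { n: number };
  return n;
}
```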

2.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| CLAUDE_MEM_REAPER_ENABLED | true | bool | Master switch |
| CLAUDE_MEM_REAPER_TICK_MS | 30000 | 5000–600000 | Tick cadence (piggy-backs supervisor; this value gates whether the reaper runs each tick) |
| CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS | 300000 (5 min) | 30000–86400000 | Threshold for a processing row to be considered stuck |
| CLAUDE_MEM_REAPER_INACTIVE_DAYS | 30 | 1–365 | When to mark a session inactive_at |
| CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS | 0 | 0–365 | 0 = never; otherwise, hard-delete inactive rows older than N days |

2.4 Acceptance criteria

  • Inject 50 stuck processing rows older than 5 minutes → next reaper tick resets them → /api/healthz shows oldest_processing_pending_age_sec drop to 0.
  • Inject OBSERVER_SESSIONS_PROJECT rows post-marker → next tick logs regression and purges them.
  • Reaper survives a worker restart without losing state (everything is DB-backed).
  • Active sessions (in-memory) are NEVER marked inactive even if their last DB write is old (in-memory presence wins).

2.5 Observability

  • Log: MAINTENANCE INFO ReaperTick, fields {stuckProcessing, orphanPending, markedInactive, hardDeleted, observerRegression}.
  • New /api/healthz fields (Phase 7): oldest_processing_pending_age_sec, processing_pending_count, pending_count_total, sdk_sessions_total, sdk_sessions_inactive, sdk_sessions_by_project: { [project]: count }.

2.6 Verification checklist

  • Migration adds inactive_at column without breaking existing data (test on a copy of a real DB).
  • In-memory active sessions never appear in findInactiveSdkSessions.
  • Reaper does NOT cascade-delete observations / session_summaries unless explicit hard-delete + zero-FK-reference precondition.
  • /api/healthz shows reaper metrics.

Phase 3 — chroma-mcp Child-Process Supervisor

Goal: Stop the 23-concurrent-chroma-mcp leak. Bound concurrency, reap idle, scan for orphans at startup.

3.1 Files to modify

| File | Change |
| --- | --- |
| src/services/sync/ChromaMcpManager.ts | Add idle reaper; enforce single-instance via supervisor registry; add startup orphan scan; add lastCallAt timestamp updated by callTool. |
| src/services/sync/ChromaMcpManager.ts:ensureConnected (line 43) | Before connect, check getProcessRegistry().getAll().filter(r => r.type === 'chroma') — if non-empty AND PID alive AND PID not the current _process.pid, refuse to spawn (alert + reuse existing if possible; otherwise wait for backoff). |
| src/services/sync/ChromaMcpManager.ts:registerManagedProcess (line 613) | Already calls getSupervisor().registerProcess(CHROMA_SUPERVISOR_ID, ...) — verify the supervisor enforces single-instance for this id. (Currently register is keyed by id so same id replaces; document this.) |
| src/supervisor/process-registry.ts | Add getActiveCountByType(type: string): number. Add findChromaOrphans(): Promise<number[]> — POSIX pgrep -af 'chroma-mcp' filtered by PPID == 1. |
| src/services/worker-service.ts:initializeBackground | After ChromaMcpManager.getInstance(), kick off await ChromaMcpManager.scanAndReapOrphans() (best-effort; never throws). |

3.2 Detailed tasks

  1. Startup orphan scan: New static method ChromaMcpManager.scanAndReapOrphans():

    • POSIX: pgrep -af 'chroma-mcp' → for each PID, check PPID. If PPID == 1 (re-parented to init), call killProcessTree(pid) (existing function at line 388). Log CHROMA_MCP INFO ReapedOrphan, fields {pid, ageSec}.
    • Windows: Get-CimInstance Win32_Process -Filter "Name='chroma-mcp.exe'" filter by parent process state, kill with taskkill.
    • Bound the scan to processes whose command-line includes chroma-mcp==<CHROMA_MCP_PINNED_VERSION> to avoid killing unrelated chroma installations.
  2. Idle reaper: Add lastCallAt: number = 0 field to ChromaMcpManager. Update on every callTool. Run a setInterval(checkIdle, 60_000) (.unref()) — if connected && Date.now() - lastCallAt > CHROMA_MCP_IDLE_SHUTDOWN_MS (default 15 min), call await this.stop(). Lazy-reconnect resumes on next callTool. A sketch follows after this list.

  3. Single-instance guard on reconnect: In ensureConnected, before connectInternal, call getProcessRegistry().getActiveCountByType('chroma'). If > 0 AND the registered PID is alive but this.connected === false, this is a stale process (we lost track). Tear it down via killProcessTree(registeredPid) first, then proceed with fresh spawn. Otherwise the count grows by one each reconnect — exactly the leak observed.

  4. Hard cap: extend getSupervisor().assertCanSpawn('chroma mcp') (already called at line 87) to actually count and reject. Cap = 1 chroma-mcp per worker. Cap = TOTAL_PROCESS_HARD_CAP (10) overall — already enforced for SDK processes; extend to chroma-mcp.

  5. Tighten close path: in connectInternal (line 74), after transport.close() / client.close(), if the underlying _process.pid is still in the registry, call killProcessTree and unregisterProcess explicitly. Don't rely on transport.onclose alone — it has the stale-callback guard but doesn't always fire on connect-time failures.
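
A minimal sketch of the idle reaper from task 2 (lastCallAt and the 15-minute default come from the plan; the class shape around them is simplified):

```ts
const CHROMA_MCP_IDLE_SHUTDOWN_MS = 15 * 60 * 1000; // CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS default

export class ChromaIdleReaperSketch {
  private lastCallAt = 0;
  private connected = false;
  private idleTimer?: ReturnType<typeof setInterval>;

  startIdleReaper(): void {
    // .unref() keeps this timer from holding the event loop open at shutdown.
    this.idleTimer = setInterval(() => void this.checkIdle(), 60_000);
    this.idleTimer.unref();
  }

  recordCall(): void {
    this.lastCallAt = Date.now(); // the real manager updates this in callTool
  }

  private async checkIdle(): Promise<void> {
    if (this.connected && Date.now() - this.lastCallAt > CHROMA_MCP_IDLE_SHUTDOWN_MS) {
      await this.stop(); // lazy reconnect happens on the next callTool
    }
  }

  private async stop(): Promise<void> {
    this.connected = false; // stand-in for the real stop()/killProcessTree path
  }
}
```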

3.3 New settings

| Key | Default | Range | Purpose |
| --- | --- | --- | --- |
| CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS | 900000 (15 min) | 60000–86400000 | Idle reaper threshold |
| CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START | true | bool | Master switch for startup scan |
| CLAUDE_MEM_CHROMA_MAX_CONCURRENT | 1 | 1–4 | Cap chroma-mcp instances per worker |

3.4 Acceptance criteria

  • Spawn 5 chroma-mcp processes manually parented to init; restart worker → all 5 are reaped at startup.
  • Force connect-time failure (kill transport mid-connect) 10 times → registry count never exceeds 1.
  • Run worker for 30 min with no chroma calls → process is reaped after 15 min and getProcessRegistry().getActiveCountByType('chroma') returns 0.
  • callTool after idle-shutdown lazy-reconnects successfully.

3.5 Observability

  • Log: CHROMA_MCP INFO OrphanScan {found, killed}.
  • Log: CHROMA_MCP INFO IdleShutdown {idleMs}.
  • Log: CHROMA_MCP WARN RegistryStale when single-instance guard tears down a phantom.
  • /api/healthz fields (Phase 7): chroma_mcp_pid_count, chroma_mcp_last_call_at, chroma_mcp_state ('connected'|'disconnected'|'backoff'), chroma_mcp_backoff_remaining_ms.

3.6 Anti-pattern guards

  • Do not kill chroma processes whose command-line doesn't match chroma-mcp==<PINNED_VERSION> — could match unrelated user installs.
  • Do not spin up the idle-reaper timer if chromaMcpManager is null (chroma disabled via CLAUDE_MEM_CHROMA_ENABLED=false).
  • Do not call getProcessRegistry() from outside the worker process — it's worker-internal.

3.7 Verification checklist

  • After 2.5 hours of normal use, ps aux | grep chroma-mcp | wc -l ≤ 1.
  • Idle-reaper timer is .unref()d.
  • Orphan scan tolerates pgrep returning empty (no false-error logs).
  • Build still passes on Windows (Win32 branch compiles even if not unit-tested).

Phase 4 — Circuit Breaker for Retry Storms

Goal: Replace the unbounded counter at worker-utils.ts:401 with a real circuit breaker. Stop hooks from hammering the worker when it's down.

4.1 Files to modify

| File | Change |
| --- | --- |
| src/shared/worker-circuit-breaker.ts (new) | CircuitBreaker class: states CLOSED, OPEN, HALF_OPEN. Persist to ~/.claude-mem/state/circuit-breaker.json. |
| src/shared/worker-utils.ts:executeWithWorkerFallback (line 443) | Wrap the call in breaker.run(...). On OPEN, return WorkerFallback immediately (no HTTP). |
| src/shared/worker-utils.ts:recordWorkerUnreachable (line 401) | Becomes a thin shim that calls breaker.recordFailure(). Hard cap (MAX_LIFETIME_FAILURES = 50) trips the breaker permanently until manual reset. |
| src/shared/worker-utils.ts:resetWorkerFailureCounter (line 419) | Becomes breaker.recordSuccess(). |
| src/cli/hook-command.ts | Verify the swallowed-stderr fix from observation 2026-05-07 is applied (it's marked as a "no-op replacement bug"). The breaker's stderr-fail-loud path must actually write to process.stderr.write(), not a stub. |
| src/services/server/Server.ts | Add /api/admin/breaker/reset POST endpoint (gated by localhost only) for manual unsticking. |

4.2 Breaker semantics

States and transitions:

```
CLOSED ──[N consecutive failures]──> OPEN
OPEN   ──[reset_timeout_ms elapsed]──> HALF_OPEN
HALF_OPEN ──[1 success]──> CLOSED
HALF_OPEN ──[1 failure]──> OPEN  (resets timer)
ANY    ──[lifetime failures > MAX_LIFETIME_FAILURES]──> OPEN_PERMANENT (until manual reset via API or settings reload)
```

Defaults:

| Setting | Default | Range |
| --- | --- | --- |
| CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD | 5 | 1–50 |
| CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS | 30000 | 1000–600000 |
| CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES | 1 | 1–10 |
| CLAUDE_MEM_BREAKER_LIFETIME_CAP | 50 | 0–10000 (0 = no cap) |

Persistent state file shape:

```json
{
  "state": "CLOSED|OPEN|HALF_OPEN|OPEN_PERMANENT",
  "consecutiveFailures": 0,
  "lifetimeFailures": 0,
  "openedAt": null,
  "lastFailureAt": null,
  "lastSuccessAt": null,
  "lastTrippedAt": null
}
```

4.3 Detailed tasks

  1. CircuitBreaker class: pure logic class, no I/O. Methods: getState(), canAttempt(), recordFailure(reason), recordSuccess(), forceReset(). Atomic file writes (write tmp + rename) for the JSON snapshot, mirroring writeHookFailureStateAtomic (worker-utils.ts:372). A state-machine sketch follows after this list.

  2. Wire into executeWithWorkerFallback:

    if (!breaker.canAttempt()) {
      // Optional: print one-line stderr if state changed during this call
      return { continue: true, reason: 'circuit_breaker_open', [WORKER_FALLBACK_BRAND]: true };
    }
    const alive = await ensureWorkerAliveOnce();
    if (!alive) { breaker.recordFailure('unreachable'); ... }
    ...
    if (response.ok) breaker.recordSuccess();
    
  3. Fail-loud stderr fix: The 2026-05-07 observation mentions a "stderr no-op replacement bug" in hookCommand. Investigate src/cli/hook-command.ts for any process.stderr.write shim that suppresses output. The breaker's diagnostic ("Worker unreachable; circuit breaker OPEN; will retry in Xs") MUST appear on the user's terminal so they know what's happening. Test by intentionally killing the worker and running a hook — message should appear on stderr.

  4. Manual reset endpoint: POST /api/admin/breaker/reset (no body required). Restricted to 127.0.0.1 only. Logs SYSTEM WARN BreakerForceReset with caller info.

  5. Lifetime cap: when lifetimeFailures > CLAUDE_MEM_BREAKER_LIFETIME_CAP, transition to OPEN_PERMANENT. The only way out is the manual-reset API or restarting the worker with a fresh state file. Print prominent stderr: claude-mem: 50 lifetime worker failures detected. Disabling memory hooks until reset. Run: claude-mem worker doctor.
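
A pure-logic sketch of the state machine from task 1 (method names match the plan; defaults shown are the 4.2 values; persistence and logging omitted):

```ts
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";

export class CircuitBreakerSketch {
  private state: BreakerState = "CLOSED";
  private consecutiveFailures = 0;
  private lifetimeFailures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,    // CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD
    private resetTimeoutMs = 30_000, // CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS
    private lifetimeCap = 50         // CLAUDE_MEM_BREAKER_LIFETIME_CAP (0 = no cap)
  ) {}

  getState(): BreakerState { return this.state; }

  canAttempt(): boolean {
    if (this.state === "OPEN_PERMANENT") return false;
    if (this.state === "OPEN" && Date.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN"; // allow a single probe
    }
    return this.state !== "OPEN";
  }

  recordFailure(_reason: string): void {
    this.consecutiveFailures++;
    this.lifetimeFailures++;
    if (this.lifetimeCap > 0 && this.lifetimeFailures > this.lifetimeCap) {
      this.state = "OPEN_PERMANENT"; // only manual reset recovers
    } else if (this.state === "HALF_OPEN" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "OPEN"; // re-opening from HALF_OPEN also resets the timer
      this.openedAt = Date.now();
    }
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    if (this.state === "HALF_OPEN") this.state = "CLOSED";
  }

  forceReset(): void {
    this.state = "CLOSED";
    this.consecutiveFailures = 0;
    this.lifetimeFailures = 0;
  }
}
```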

4.4 Acceptance criteria

  • Kill the worker, run 100 hooks → exactly CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD HTTP attempts made; rest short-circuit.
  • After 30s idle, next hook makes ONE probe (HALF_OPEN); if probe succeeds, breaker closes.
  • Lifetime cap (set to 5 for testing): 6th lifetime failure → permanent open until POST /api/admin/breaker/reset clears it.
  • Stderr message visible to user when breaker opens (manual repro: kill worker, run 5+ hooks).
  • Existing hook-failures.json file is migrated to the new breaker JSON format on first run (one-shot migration in worker-utils.ts).

4.5 Observability

  • Log: SYSTEM WARN BreakerOpened, fields {lifetime, consecutiveBefore}.
  • Log: SYSTEM INFO BreakerHalfOpen.
  • Log: SYSTEM INFO BreakerClosed, fields {recoveredAfterMs}.
  • Log: SYSTEM ERROR BreakerOpenedPermanent.
  • /api/healthz fields (Phase 7): breaker_state, breaker_consecutive_failures, breaker_lifetime_failures, breaker_opened_at, breaker_total_trips.

4.6 Anti-pattern guards

  • Do not call the breaker from inside the worker process — it's a hook-side concern. The worker has RestartGuard for its own session-level limits.
  • Do not auto-reset the lifetime counter on restart; persist it. Otherwise restart-loops mask the underlying failure.
  • Do not block the breaker reset endpoint on initialization (/api/admin/breaker/reset should work even if initializationCompleteFlag === false).

4.7 Verification checklist

  • No call site bypasses the breaker (grep for workerHttpRequest outside executeWithWorkerFallback and audit each — some integrations may need breaker.canAttempt() guards added).
  • State file readable/writable across process restarts.
  • Stderr fail-loud path verified end-to-end on Linux + macOS + Windows Terminal.
  • No process.exit(1) introduced — breaker tripping returns WorkerFallback, not exit codes.

Phase 7 — /api/healthz Endpoint with Concrete Metrics

Goal: Centralized observability so future regressions are detectable at a glance.

7.1 Files to modify

| File | Change |
| --- | --- |
| src/services/worker/http/routes/HealthzRoutes.ts (new) | Implements RouteHandler. GET /api/healthz and /api/healthz?format=prom. |
| src/services/worker-service.ts:registerRoutes | Register the new HealthzRoutes(...). |
| src/services/worker/MetricsCollector.ts (new) | Aggregates metrics; refreshed on the supervisor's existing 30s health-check tick to avoid amplifying load. |
| src/supervisor/health-checker.ts:runHealthCheck | Call MetricsCollector.refresh() after pruneDeadEntries. |

7.2 Endpoint contract

GET /api/healthz → 200 JSON:

```json
{
  "status": "ok|degraded|unhealthy",
  "ts": "2026-05-07T21:30:00.000Z",
  "uptime_sec": 12345,
  "versions": {
    "plugin": "12.7.5",
    "worker": "12.7.5",
    "matches": true
  },
  "process": {
    "pid": 12345,
    "rss_mb": 145.2,
    "event_loop_lag_ms": 3.1,
    "managed": true,
    "platform": "darwin"
  },
  "pid_file": {
    "path": "/Users/.../worker.pid",
    "start_token": "Wed May  7 14:23:15 2026",
    "daemon_lock_held": true
  },
  "db": {
    "path": "/Users/.../claude-mem.db",
    "size_bytes": 31457280,
    "page_count": 7680,
    "freelist_count": 12,
    "free_ratio_pct": 0.16,
    "last_vacuum_at": "2026-05-07T20:00:00.000Z",
    "last_vacuum_freed_pages": 130000,
    "last_maintenance_at": "2026-05-07T20:00:00.000Z",
    "oldest_processing_pending_age_sec": 4,
    "processing_pending_count": 1,
    "pending_count_total": 12,
    "sdk_sessions_total": 145,
    "sdk_sessions_inactive": 13,
    "sdk_sessions_by_project": { "claude-mem": 25, "...": 120 }
  },
  "child_processes": {
    "chroma_mcp_pid_count": 1,
    "chroma_mcp_last_call_at": "2026-05-07T21:25:11.000Z",
    "chroma_mcp_state": "connected",
    "chroma_mcp_backoff_remaining_ms": 0,
    "sdk_process_count": 0,
    "supervisor_registry_size": 2
  },
  "network": {
    "hook_consecutive_failures": 0,
    "breaker_state": "CLOSED",
    "breaker_consecutive_failures": 0,
    "breaker_lifetime_failures": 3,
    "breaker_opened_at": null,
    "breaker_total_trips": 1,
    "last_request_at": "2026-05-07T21:29:55.000Z",
    "request_rate_per_min": 12.3
  },
  "ai": {
    "provider": "claude",
    "auth_method": "...",
    "last_interaction": { ... }
  }
}
```

GET /api/healthz?format=prom → 200 text/plain with Prometheus text format. One metric per JSON leaf (e.g. claude_mem_db_free_ratio_pct 0.16).
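
One way to produce that output is to flatten the JSON snapshot into prefixed gauge lines; a sketch (the claude_mem_ prefix follows the naming convention above; recursion and type handling are assumptions):

```ts
// Emits one "claude_mem_<group>_<name> <value>" gauge per numeric or boolean JSON leaf.
export function toPrometheus(obj: Record<string, unknown>, prefix = "claude_mem"): string {
  const lines: string[] = [];
  for (const [key, value] of Object.entries(obj)) {
    const name = `${prefix}_${key}`;
    if (typeof value === "number") {
      lines.push(`# TYPE ${name} gauge`, `${name} ${value}`);
    } else if (typeof value === "boolean") {
      lines.push(`# TYPE ${name} gauge`, `${name} ${value ? 1 : 0}`);
    } else if (value && typeof value === "object" && !Array.isArray(value)) {
      lines.push(toPrometheus(value as Record<string, unknown>, name));
    }
    // Strings (status, timestamps) would need label handling in a fuller version.
  }
  return lines.join("\n");
}
```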

status derivation:

  • unhealthy if breaker is OPEN_PERMANENT, OR DB initialization failed, OR chroma-mcp pid count > CLAUDE_MEM_CHROMA_MAX_CONCURRENT.
  • degraded if breaker is OPEN, OR free_ratio > 0.4, OR oldest_processing_pending > 1 hour, OR worker version mismatches plugin version.
  • ok otherwise.
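
Expressed as code, the derivation might look like this (field names follow the endpoint contract; thresholds are the ones listed above):

```ts
type HealthStatus = "ok" | "degraded" | "unhealthy";

interface StatusInputs {
  breakerState: "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";
  dbInitFailed: boolean;
  chromaPidCount: number;
  chromaMaxConcurrent: number;           // CLAUDE_MEM_CHROMA_MAX_CONCURRENT
  dbFreeRatio: number;                   // freelist_count / page_count
  oldestProcessingPendingAgeSec: number;
  versionsMatch: boolean;
}

export function deriveStatus(m: StatusInputs): HealthStatus {
  if (m.breakerState === "OPEN_PERMANENT" || m.dbInitFailed || m.chromaPidCount > m.chromaMaxConcurrent) {
    return "unhealthy";
  }
  if (m.breakerState === "OPEN" || m.dbFreeRatio > 0.4 || m.oldestProcessingPendingAgeSec > 3600 || !m.versionsMatch) {
    return "degraded";
  }
  return "ok";
}
```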

7.3 Detailed tasks

  1. MetricsCollector class: a Map<string, unknown> snapshot. Public refresh() collects fresh data; public getSnapshot() returns the cached object. Refresh is called by the 30s health-check tick AND on-demand if last refresh > 5s ago (debounced).

  2. DB metrics queries (use db.prepare + .get()):

    • PRAGMA page_count → { page_count: number }
    • PRAGMA freelist_count → { freelist_count: number }
    • PRAGMA page_size → for size_bytes computation
    • SELECT MIN(updated_at) FROM pending_messages WHERE status='processing' (with julianday math for age in seconds)
    • SELECT COUNT(*) FROM sdk_sessions GROUP BY project
  3. Process metrics: process.memoryUsage().rss / 1024 / 1024. Event-loop lag via perf_hooks.monitorEventLoopDelay (Node API, available in bun) — sample over 30s window. A sketch covering tasks 2-3 follows after this list.

  4. Network metrics: maintain a rolling 1-min request counter in middleware (existing createMiddleware in Server.ts:156). Increment on each /api/* request.

  5. Prometheus format: emit # HELP and # TYPE lines per metric. Use the same naming convention (claude_mem_<group>_<name>).

  6. Compatibility: leave /api/health UNCHANGED (existing integrations break otherwise). /api/healthz is the new richer endpoint.
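
A sketch of the DB and process metrics from tasks 2-3 (bun:sqlite pragmas plus perf_hooks.monitorEventLoopDelay, which per Appendix C still needs a bun-compat check; the snapshot shape is illustrative):

```ts
import { Database } from "bun:sqlite";
import { monitorEventLoopDelay } from "node:perf_hooks";

// Sampling starts once; the histogram is read on every refresh.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

export function collectDbAndProcessMetrics(db: Database) {
  const pageCount = (db.prepare("PRAGMA page_count").get() as { page_count: number }).page_count;
  const freelist = (db.prepare("PRAGMA freelist_count").get() as { freelist_count: number }).freelist_count;
  const pageSize = (db.prepare("PRAGMA page_size").get() as { page_size: number }).page_size;

  return {
    db_page_count: pageCount,
    db_freelist_count: freelist,
    db_free_ratio_pct: pageCount ? (freelist / pageCount) * 100 : 0,
    db_size_bytes: pageCount * pageSize,
    process_rss_mb: process.memoryUsage().rss / 1024 / 1024,
    event_loop_lag_ms: loopDelay.mean / 1e6, // histogram reports nanoseconds
  };
}
```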

7.4 Acceptance criteria

  • curl 127.0.0.1:<port>/api/healthz | jq .status returns ok on a healthy worker.
  • After Phase 6 ships, db.free_ratio_pct updates at 30s cadence (verify by manually inflating freelist).
  • Phase 4 breaker state changes are visible within 30s.
  • ?format=prom parses with promtool check metrics.
  • No new endpoint blocks for > 50ms (snapshot is cached; refresh is async).

7.5 Observability hooks (yes, for the observability endpoint itself)

  • Log WORKER DEBUG MetricsRefresh, fields {durationMs}.
  • Log WORKER WARN MetricsRefreshSlow if refresh > 250ms (DB query stall signal).

7.6 Verification checklist

  • /api/health response body unchanged byte-for-byte (regression test).
  • All Phase 2-6 metrics exposed (cross-check the field list in those phases).
  • ?format=prom output validates with promtool if available; otherwise visual inspection.
  • Endpoint mounted via RouteHandler pattern (no direct app.get in worker-service.ts).

Phase 8 — Observability, CLI, & Rollout

Goal: User-facing surface so operators can see what the new machinery did. Ordered last to allow phases 2-7 to stabilize.

8.1 Files to modify

| File | Change |
| --- | --- |
| src/cli/handlers/worker-doctor.ts (new) | New CLI subcommand claude-mem worker doctor — fetches /api/healthz, formats it for terminals, includes recent reaper actions. |
| src/services/worker-service.ts:main() | Register the worker doctor CLI route (alongside existing cursor, gemini-cli cases). |
| plugin/scripts/worker-cli.js | Wire to the new doctor command. |
| CLAUDE.md (project root) | Document new settings under a "Worker Maintenance" section. |
| docs/public/ (optional) | User-facing explanation of the breaker, reaper, and health endpoint. |

8.2 worker doctor output (example)

```
claude-mem worker doctor

Status:           OK
Version:          plugin=12.7.5 worker=12.7.5 (match)
Uptime:           3h 25m
PID:              12345  (lock held: yes)

Database:
  Size:             32 MB    (free: 0.16%)
  Last vacuum:      4h ago, freed 130k pages
  Pending:          12 total / 1 processing (oldest 4s)
  SDK sessions:     145 total / 13 inactive

Child processes:
  chroma-mcp:       1  (last call: 5s ago, state: connected)
  SDK processes:    0
  Supervisor:       2 entries

Circuit breaker:
  State:            CLOSED
  Consecutive:      0
  Lifetime:         3
  Total trips:      1

Recent maintenance (last 24h):
  2026-05-07 20:00  Vacuum: freed 130k pages in 1.4s
  2026-05-07 19:30  Reaper: 5 stuck-processing reset, 2 inactive marked
  2026-05-07 18:00  Chroma orphan scan: 0 found
```

If status != ok, append a "Recommended actions" block:

  • breaker open → claude-mem worker reset-breaker
  • DB free ratio high → mention next vacuum window
  • chroma orphans → claude-mem worker reap-chroma

8.3 Detailed tasks

  1. Doctor command: GET /api/healthz via workerHttpRequest. Format as the table above. Color-code (red/yellow/green) using existing chalk integration if present, otherwise plain text. JSON pass-through via --json flag.

  2. Recent-actions feed: store the last 50 maintenance events in a circular buffer in MetricsCollector (in-memory only — survives one worker lifetime; not persistent). Expose at /api/healthz/events (separate to avoid bloating the main response). A buffer sketch follows after this list.

  3. Update CLAUDE.md: add a "Worker Maintenance" section with: settings reference table, the doctor command, a brief description of the reaper/breaker/vacuum behavior. Per CLAUDE.md "Important: No need to edit the changelog ever" — only edit CLAUDE.md, never CHANGELOG.

  4. Rollout ordering (per problem statement constraint):

    • Wave 1 (idempotent, low-risk): Phase 5 (PID/port reclamation), Phase 6 (DB maintenance).
    • Wave 2 (reapers — needs careful testing on busy DBs): Phase 2 (session reaper), Phase 3 (chroma supervisor).
    • Wave 3 (user-visible behavior change): Phase 4 (circuit breaker), Phase 7 (/api/healthz).
    • Wave 4 (CLI surface): Phase 8 (doctor command, docs).

    Each wave can ship as a separate release. Inter-wave dependencies: Phase 7 depends on data sources from Phases 2/3/4/6 — but the endpoint can ship with partial data (fields gated by phase availability).
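
A sketch of the in-memory circular buffer from task 2 (capacity 50 as stated; the event shape is an assumption):

```ts
interface MaintenanceEvent {
  at: string;                        // ISO timestamp
  kind: string;                      // e.g. "VacuumComplete", "ReaperTick", "OrphanScan"
  details: Record<string, unknown>;
}

// Fixed-capacity ring buffer: once 50 events are stored, the oldest is overwritten.
export class RecentEventsBuffer {
  private buf: MaintenanceEvent[] = [];
  private next = 0;

  constructor(private capacity = 50) {}

  push(event: MaintenanceEvent): void {
    if (this.buf.length < this.capacity) {
      this.buf.push(event);
    } else {
      this.buf[this.next] = event; // overwrite the oldest slot
    }
    this.next = (this.next + 1) % this.capacity;
  }

  // Oldest-first view for the /api/healthz/events response.
  list(): MaintenanceEvent[] {
    if (this.buf.length < this.capacity) return [...this.buf];
    return [...this.buf.slice(this.next), ...this.buf.slice(0, this.next)];
  }
}
```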

8.4 Acceptance criteria

  • claude-mem worker doctor prints a green-OK summary on a healthy worker.
  • claude-mem worker doctor --json returns valid JSON pipeable to jq.
  • Killing the worker → claude-mem worker doctor cleanly reports Worker unreachable instead of hanging.
  • CLAUDE.md updates are limited to a new section; no churn elsewhere.

8.5 Verification checklist

  • claude-mem worker doctor exits 0 on healthy state, 1 on unhealthy, 2 if worker unreachable (mirrors hook-exit-codes convention).
  • No new public marketplace API surface beyond what's documented.
  • Doctor command works without the worker running (unreachable path covered).

Final Phase — Cross-Phase Verification

Goal: Prove the system works end-to-end before declaring victory.

F.1 Soak test (24h)

Run the worker for 24 hours under realistic Claude Code usage. After 24h:

| Metric | Pass criterion |
| --- | --- |
| ps aux \| grep chroma-mcp \| wc -l | ≤ 1 |
| ps aux \| grep claude-mem \| wc -l | ≤ a small constant (1-2) |
| DB size growth rate | < 5 MB/hr; free_ratio < 20% |
| /api/healthz breaker.lifetime_failures | < 10 (vs. the #1874 starting baseline) |
| Stuck processing rows older than 10 min | 0 |
| Worker memory RSS | < 300 MB (no leak) |

F.2 Failure-injection tests

| Inject | Expected behavior |
| --- | --- |
| Kill worker via kill -9 | Lazy-respawn on next hook; PID file cleaned |
| Two parallel claude-mem start | Exactly one daemon survives; lock log line visible |
| 100 stuck processing rows | Reaper resets all within REAPER_PROCESSING_STUCK_MS + REAPER_TICK_MS |
| Spawn fake listener on worker port | New --daemon exits 0 with diagnostic stderr (no silent exit) |
| Fork 5 chroma-mcp orphans | Worker startup reaps all 5 |
| Pull network during 10 hooks | Breaker opens after threshold; subsequent hooks short-circuit |

F.3 Anti-pattern grep

```bash
# No new always-on intervals
grep -rn "setInterval" src/ --include="*.ts" | grep -v "unref()" | grep -v "^src/.*test"

# No new process.exit(1) on hook paths
git diff main -- src/shared/worker-utils.ts src/cli/ | grep "process.exit(1)"

# No invented settings
git diff main -- src/shared/SettingsDefaultsManager.ts | grep "CLAUDE_MEM_"
# Cross-reference with all phases' settings tables.

# No hardcoded magic numbers in business logic
git diff main | grep -E "[0-9]{4,}" | grep -v SettingsDefaultsManager | grep -v test
```

F.4 Documentation diff

  • CLAUDE.md adds: Worker Maintenance section (Phase 8.3).
  • docs/public/ (optional): user-facing explanation.
  • No CHANGELOG edits (auto-generated per CLAUDE.md).

F.5 Sign-off checklist

  • All 8 phases shipped.
  • /api/healthz reports status: "ok" 24h after deployment.
  • No new ERROR-level logs in production for 24h (excluding pre-existing).
  • Manual worker doctor on 3 production-like environments confirms expected output.
  • Phase 0 doc-discovery anti-patterns not violated (grep git log -p).

Appendix A — Settings Reference (consolidated)

All settings declared in src/shared/SettingsDefaultsManager.ts:

| Setting | Phase | Default | Range |
| --- | --- | --- | --- |
| CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS | 5 | 5000 | 0–60000 |
| CLAUDE_MEM_PID_PORT_RECHECK_MS | 5 | 2000 | 500–30000 |
| CLAUDE_MEM_DB_MAINTENANCE_ENABLED | 6 | true | bool |
| CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS | 6 | 24 | 1–168 |
| CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO | 6 | 0.40 | 0.05–0.95 |
| CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS | 6 | 300000 | 0–3600000 |
| CLAUDE_MEM_CLEANUP_REGRESSION_CHECK | 6 | true | bool |
| CLAUDE_MEM_REAPER_ENABLED | 2 | true | bool |
| CLAUDE_MEM_REAPER_TICK_MS | 2 | 30000 | 5000–600000 |
| CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS | 2 | 300000 | 30000–86400000 |
| CLAUDE_MEM_REAPER_INACTIVE_DAYS | 2 | 30 | 1–365 |
| CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS | 2 | 0 | 0–365 |
| CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS | 3 | 900000 | 60000–86400000 |
| CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START | 3 | true | bool |
| CLAUDE_MEM_CHROMA_MAX_CONCURRENT | 3 | 1 | 1–4 |
| CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD | 4 | 5 | 1–50 |
| CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS | 4 | 30000 | 1000–600000 |
| CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES | 4 | 1 | 1–10 |
| CLAUDE_MEM_BREAKER_LIFETIME_CAP | 4 | 50 | 0–10000 |

Appendix B — File Change Summary

| File | Phases that touch it |
| --- | --- |
| src/services/worker-service.ts | 3 (initializeBackground), 5 (--daemon), 6 (maintenance wiring), 7 (route registration), 8 (CLI) |
| src/services/worker-spawner.ts | 5 |
| src/services/infrastructure/ProcessManager.ts | 5 (lock + start-token) |
| src/services/infrastructure/HealthMonitor.ts | 5 (port-on-pid match) |
| src/services/infrastructure/CleanupV12_4_3.ts | 6 (regression detection — read only) |
| src/services/sync/ChromaMcpManager.ts | 3 |
| src/supervisor/index.ts | 5 (validateWorkerPidFile) |
| src/supervisor/process-registry.ts | 3 (orphan scan), 5 (start-token) |
| src/supervisor/health-checker.ts | 2 (reaper), 7 (metrics refresh) |
| src/services/worker/SessionManager.ts | 2 (delete hook), 6 (pause/resume) |
| src/shared/worker-utils.ts | 4 (breaker integration) |
| src/services/sqlite/Database.ts | 6 (auto_vacuum) |
| src/services/sqlite/PendingMessageStore.ts | 2 (reapStuckProcessing) |
| src/services/sqlite/SessionStore.ts | 2 (findInactiveSdkSessions) |
| src/services/sqlite/migrations/runner.ts | 2 (inactive_at column) |
| src/services/server/Server.ts | 4 (breaker reset), 7 (healthz route) |
| src/shared/SettingsDefaultsManager.ts | 2-6 (settings keys) |
| src/services/maintenance/DbMaintenance.ts | 6 (NEW) |
| src/services/maintenance/SessionReaper.ts | 2 (NEW) |
| src/shared/worker-circuit-breaker.ts | 4 (NEW) |
| src/services/worker/MetricsCollector.ts | 7 (NEW) |
| src/services/worker/http/routes/HealthzRoutes.ts | 7 (NEW) |
| src/cli/handlers/worker-doctor.ts | 8 (NEW) |
| CLAUDE.md | 8 (Worker Maintenance section) |

Appendix C — Open Questions for Executor

  1. bun:ffi flock support: confirm via spike before committing Phase 5.4. If unavailable, fall back to flock(1) shell on Linux + atomic mkdirSync sentinel on macOS/Windows.
  2. Event-loop lag sampling on bun: verify perf_hooks.monitorEventLoopDelay works in bun's Node-compat layer. If not, fall back to a setImmediate-based heuristic.
  3. Existing-DB auto_vacuum migration: verify that the startup full VACUUM in Phase 6.3 is sufficient to reclaim the 504 MB without requiring users to run PRAGMA auto_vacuum = INCREMENTAL; VACUUM; manually. (It should — full VACUUM with auto_vacuum already set takes effect.)
  4. Pro-features compatibility: confirm with maintainers that /api/healthz does not duplicate any planned Pro endpoint. Per CLAUDE.md "Pro Features Architecture", the worker's local HTTP API stays open — /api/healthz is fine to add OSS-side.