plans/03-worker-lifecycle.md
Scope: Fix accumulated worker / daemon lifecycle bugs in claude-mem. Address DB bloat, chroma-mcp leaks, retry storms, port/PID races, queue zombies, missing supervision, and observability gaps.
Non-implementation: This document is a plan. Each phase is self-contained; an executing agent should be able to run a single phase without re-discovering context.
Audience: Subsequent agents executing one phase per session.
Goal: Anchor every implementation phase in real APIs that exist in the current codebase or in vetted libraries. Prevent phantom-method invention.
| File | Why |
|---|---|
CLAUDE.md (project root) | Architecture, exit-code strategy, Pro/OSS boundary, settings conventions |
src/services/worker-service.ts | WorkerService class, --daemon main(), signal registration, all CLI subcommands |
src/services/worker-spawner.ts | ensureWorkerStarted 3-state machine (ready/warming/dead) |
src/services/infrastructure/ProcessManager.ts | spawnDaemon, PID file ops, captureProcessStartToken, isProcessAlive |
src/services/infrastructure/HealthMonitor.ts | isPortInUse, waitForHealth, waitForReadiness, httpShutdown |
src/services/infrastructure/GracefulShutdown.ts | performGracefulShutdown ordering |
src/services/infrastructure/CleanupV12_4_3.ts | runOneTimeV12_4_3Cleanup, STUCK_PENDING_THRESHOLD = 10, observer-purge SQL |
src/services/sync/ChromaMcpManager.ts | ensureConnected, connectInternal, stop, killProcessTree, collectDescendantPids, RECONNECT_BACKOFF_MS = 10_000, MCP_CONNECTION_TIMEOUT_MS = 30_000 |
src/supervisor/index.ts | Supervisor class, validateWorkerPidFile, signal-handler config |
src/supervisor/process-registry.ts | ProcessRegistry, getSdkProcessForSession, ensureSdkProcessExit, waitForSlot, TOTAL_PROCESS_HARD_CAP = 10 |
src/supervisor/health-checker.ts | 30s pruneDeadEntries loop (already present — extend, don't replace) |
src/supervisor/shutdown.ts | runShutdownCascade, signalProcess, loadTreeKill |
src/services/worker/SessionManager.ts | In-memory session map, deleteSession, queue/pending integration |
src/services/worker/RestartGuard.ts | Per-session restart cap (10/60s window, 5 consecutive) |
src/services/worker/retry.ts | Provider-level retry (withRetry, classified errors) — DO NOT mutate; circuit breaker layers ABOVE this |
src/shared/worker-utils.ts | recordWorkerUnreachable (line 401), executeWithWorkerFallback (line 443), fail-loud counter file at ~/.claude-mem/state/hook-failures.json |
src/services/sqlite/Database.ts | PRAGMA setup (lines 27-32, 69-74) — single source of truth for DB pragmas |
src/services/server/Server.ts | /api/health (line 161), /api/readiness (line 178), /api/version (line 192) |
src/shared/SettingsDefaultsManager.ts | Where every new setting key MUST be declared with a default |
src/shared/hook-constants.ts | HOOK_TIMEOUTS, HOOK_EXIT_CODES — extend here, don't inline |
plugin/bun-runner.js, plugin/scripts/worker-service.cjs | Built worker entrypoint — note the build pipeline (scripts/build-hooks.js) |
SQLite (bun:sqlite) — pragma calls are db.run('PRAGMA …') or db.prepare('PRAGMA …').get(). Existing pragmas: journal_mode=WAL, synchronous=NORMAL, foreign_keys=ON, temp_store=memory, mmap_size, cache_size. VACUUM runs only outside a transaction. VACUUM INTO 'path' is the backup form already used in CleanupV12_4_3.ts:135. wal_checkpoint(TRUNCATE) is the truncating-checkpoint form.
Process supervision — getSupervisor(), getProcessRegistry(), registerProcess(id, info, processRef?), unregisterProcess(id), pruneDeadEntries(), assertCanSpawn(type), runShutdownCascade(...). Tree-kill on POSIX uses pgrep -P recursion + process.kill(-pgid, signal); on Windows uses taskkill /T /F /PID or tree-kill npm.
HTTP/Express — Server.app.get('/api/...', handler) via registerRoutes (handlers implement setupRoutes(app) on a RouteHandler interface). Every new endpoint must follow the existing RouteHandler pattern under src/services/worker/http/routes/.
Settings — SettingsDefaultsManager.get('CLAUDE_MEM_…'), SettingsDefaultsManager.loadFromFile(path). New keys require: (a) type added to the interface in SettingsDefaultsManager.ts, (b) default value declared in the same file, (c) documented in CLAUDE.md if user-tunable.
Logging — logger.info(category, msg, fields), logger.warn, logger.error(category, msg, fields, error). Categories used here: SYSTEM, WORKER, SESSION, CHROMA_MCP, SDK, DB, QUEUE, PROCESS. Add new category MAINTENANCE for VACUUM / reaper events.
Constraints that apply to every phase:
- Route every child-process spawn through the supervisor: getSupervisor().assertCanSpawn(...) and registerProcess(...).
- No process.exit(1) on hook-side error paths — it accumulates Windows Terminal tabs (CLAUDE.md exit-code strategy). Use 0 for graceful, 2 only for blocking-error paths that need to surface stderr to Claude.
- Do not delete sdk_sessions rows if observations or session_summaries still reference their memory_session_id, without an explicit user-opt-in flag.
- Do not run VACUUM while ingestion is hot. Pause queue processing first.
- Every new timer must be .unref().
- Declare every new setting in SettingsDefaultsManager.ts first.

Confidence: HIGH on file/API inventory (read-pass complete on all referenced files). MEDIUM on Windows behavior of new advisory locks (Windows mandatory locking via lockf is bun-runtime-dependent — verify via spike before committing).
Phase 1
Goal: Produce a written state-machine diagram and an exit-site catalog that subsequent phases reference. No code changes; create a scratch document at docs/internal/worker-lifecycle-state-machine.md if the executor wants an artifact, otherwise capture findings in commit messages.
Trace the worker daemon spawn → terminate path end-to-end. Source order:
1. src/shared/worker-utils.ts:ensureWorkerRunning (lazy spawn) OR src/services/worker-spawner.ts:ensureWorkerStarted (explicit)
2. spawnDaemon (src/services/infrastructure/ProcessManager.ts:408) — POSIX uses setsid if available, Windows uses Start-Process -WindowStyle Hidden
3. --daemon branch in src/services/worker-service.ts:937 — duplicate-PID/duplicate-port guard
4. WorkerService.start() (line 258) → startSupervisor() → server.listen() → writePidFile() → getSupervisor().registerProcess('worker', ...) → initializeBackground()
5. configureSupervisorSignalHandlers (src/supervisor/index.ts:49) — SIGTERM/SIGINT; SIGHUP ignored in --daemon mode on POSIX
6. WorkerService.shutdown() → performGracefulShutdown → server close → sessionManager.shutdownAll() → mcp client close → chroma stop → db close → getSupervisor().stop() → runShutdownCascade → PID file unlink

Catalog every process.exit(...) site in worker-service.ts (already mapped — 21 sites; lines 764, 772, 794, 804, 810, 813, 828, 835, 842, 853, 870, 878, 888, 895, 916, 933, 945, 950, 971, 975, 991). Annotate each with: code, intent, whether it leaks the worker on the same path, whether shutdown ran first.
Catalog every retry / unreachable site:
- src/shared/worker-utils.ts:401 recordWorkerUnreachable (the #1874 counter)
- src/cli/handlers/{context,file-context,file-edit,summarize,observation,user-message,session-init}.ts — every executeWithWorkerFallback caller
- src/servers/mcp-server.ts:72,100,145 — direct workerHttpRequest
- src/services/transcripts/processor.ts:331,371,373 — direct workerHttpRequest
- src/services/integrations/CursorHooksInstaller.ts:64,349,352 — direct workerHttpRequest
- src/utils/claude-md-utils.ts:305 — direct workerHttpRequest

Catalog every spawn site:
- spawnDaemon (worker self-spawn)
- ChromaMcpManager.connectInternal (chroma-mcp via uvx → uv → python → chroma-mcp)
- spawnSdkProcess (src/supervisor/process-registry.ts:532) — Claude SDK subprocesses
- runMcpSelfCheck (src/services/worker-service.ts:405) — MCP loopback probe via process.execPath
- execSync / execFile / spawnSync in ChromaMcpManager (cert resolution) or ProcessManager (binary lookup, cwd-remap)

Note which executeWithWorkerFallback callers run inside hooks (Claude Code's strict timeout window) vs. inside the worker (no timeout pressure) — this drives Phase 4 circuit-breaker scoping.

Checks:
- `grep -rn "process.exit" src/ --include="*.ts" | wc -l` matches the catalog.
- `grep -rn "executeWithWorkerFallback\|workerHttpRequest" src/ --include="*.ts" | grep -v worker-utils.ts | wc -l` matches the catalog.

Deliverable: hand-off note for Phase 2-8 executors with file/line anchors; no code committed.
Phase 5
Shipping order: Phase 5 ships first (per Phase 8 ordering). Idempotent and safe.
Goal: Eliminate the silent-exit-0 case where a fresh --daemon spawn loses the port race; harden cross-platform PID-reuse detection; serialize concurrent spawns with an OS-level advisory lock.
| File | Change |
|---|---|
src/supervisor/process-registry.ts | Extend captureProcessStartToken for macOS (already partial via ps -o lstart) and Windows (wmic process where ProcessId=X get CreationDate /value). Add unit test for each platform branch. |
src/supervisor/index.ts:validateWorkerPidFile | Add port-on-pid match check — if pidInfo.port !== currentExpectedPort, treat as 'stale'. |
src/services/infrastructure/ProcessManager.ts | Add new exports: acquireDaemonLock() / releaseDaemonLock() using POSIX flock (via fcntl/flock syscall through bun:ffi or shelling to flock(1) on Linux only) and Windows mandatory file lock via LockFile (or fall back to atomic-rename sentinel on Windows). |
src/services/worker-service.ts:937 (--daemon branch) | Wrap startup in acquireDaemonLock(). If port is in use, perform a /api/version probe; if the listener returns OUR BUILT_IN_VERSION → exit 0 (legit duplicate); if it returns a different version → log a warning and exit 0 (stale worker, will be restarted by version-mismatch path); if the listener doesn't respond → wait HOOK_TIMEOUTS.PORT_IN_USE_WAIT then write a clear stderr line with diagnostic before exiting. |
src/services/worker-spawner.ts | Same lock acquisition before spawnDaemon. Release on success or error. |
macOS start-time token: extend captureProcessStartToken (registry line 56). On Darwin, prefer ps -p <pid> -o lstart= (already in fallback path). Verify with LC_ALL=C LANG=C env so locale doesn't change the timestamp format. Add a comment explaining that ps lstart resolution is 1-second — collisions still possible but vastly less likely than no-token.
Windows start-time token: add a Win32 branch using wmic process where ProcessId=<pid> get CreationDate /value. Parse the CreationDate=YYYYMMDDHHMMSS.ffffff+TZ line. Cache the wmic resolution per-pid for 5s (avoid re-shelling on repeat checks).
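A minimal sketch of the Windows branch, assuming the wmic invocation and CreationDate format described above; `windowsStartToken` is a hypothetical helper name, not the existing captureProcessStartToken signature.

```ts
import { execFileSync } from "node:child_process";

// Hypothetical helper for the Win32 start-token branch (sketch only).
const tokenCache = new Map<number, { token: string; at: number }>();

export function windowsStartToken(pid: number): string | null {
  const cached = tokenCache.get(pid);
  if (cached && Date.now() - cached.at < 5_000) return cached.token; // 5s per-PID cache

  try {
    const out = execFileSync(
      "wmic",
      ["process", "where", `ProcessId=${pid}`, "get", "CreationDate", "/value"],
      { encoding: "utf8" },
    );
    // Expected line: CreationDate=20260507142315.123456+120
    const match = out.match(/CreationDate=(\S+)/);
    if (!match) return null;
    tokenCache.set(pid, { token: match[1], at: Date.now() });
    return match[1];
  } catch {
    return null; // process gone or wmic unavailable
  }
}
```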
Port-on-pid match: in validateWorkerPidFile, after confirming isPidAlive(pidInfo.pid), verify the recorded pidInfo.port is reachable via isPortInUse(pidInfo.port) AND the listener's /api/version returns a version string. If port is dead but PID alive → return 'stale' (worker crashed mid-listen, PID about to be reused).
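A sketch of the added staleness check, assuming the pidInfo shape and the isPidAlive/isPortInUse helpers referenced above; exact signatures are assumptions, and the 2s probe timeout mirrors the CLAUDE_MEM_PID_PORT_RECHECK_MS default.

```ts
// Sketch only — helper signatures are assumptions, not the real ProcessManager/HealthMonitor APIs.
type PidInfo = { pid: number; port: number };

async function classifyPidFile(
  pidInfo: PidInfo,
  isPidAlive: (pid: number) => boolean,
  isPortInUse: (port: number) => Promise<boolean>,
): Promise<"valid" | "stale"> {
  if (!isPidAlive(pidInfo.pid)) return "stale";

  // PID alive but the recorded port is not listening → worker crashed mid-listen,
  // or the PID has been reused by an unrelated process. Treat as stale.
  if (!(await isPortInUse(pidInfo.port))) return "stale";

  // Confirm the listener is actually ours by probing /api/version.
  try {
    const res = await fetch(`http://127.0.0.1:${pidInfo.port}/api/version`, {
      signal: AbortSignal.timeout(2_000),
    });
    return res.ok ? "valid" : "stale";
  } catch {
    return "stale";
  }
}
```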
Advisory lock:
- POSIX: <DATA_DIR>/.worker-spawn.lock opened with O_RDWR | O_CREAT, then flock(fd, LOCK_EX | LOCK_NB). On EAGAIN, log "Another spawn in progress, waiting up to 5s" and retry with LOCK_EX (blocking) under a setTimeout race. Implement via bun:ffi for POSIX flock(2) if available, otherwise shell to flock -n -x <path> <command>. Spike first: confirm bun's bun:ffi exposes flock. If not, use a watch-and-rename sentinel (less ideal but works).
- Windows: LockFile via the Win32 API, or fall back to an atomic mkdirSync of <DATA_DIR>/.worker-spawn.lock.dir (fails if it exists) with stale-timeout cleanup at 30s.

Diagnostic stderr: when the port is in use without our worker responding, write to stderr (and log INFO): claude-mem worker port <N> in use by an unidentified process; not spawning duplicate. This must NOT block the hook — still exit 0 per CLAUDE.md.
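A minimal sketch of the mkdirSync sentinel fallback described above (the path-building and 30s stale timeout follow the plan; function names are illustrative, not committed APIs).

```ts
import { mkdirSync, rmdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Fallback sentinel lock, used only if bun:ffi flock turns out to be unavailable.
const LOCK_STALE_MS = 30_000;

export function acquireSentinelLock(dataDir: string): boolean {
  const lockDir = join(dataDir, ".worker-spawn.lock.dir");
  try {
    mkdirSync(lockDir); // atomic: throws EEXIST if another spawn holds the lock
    return true;
  } catch {
    try {
      // Reap the sentinel if the holder died without cleanup.
      if (Date.now() - statSync(lockDir).mtimeMs > LOCK_STALE_MS) {
        rmdirSync(lockDir);
        mkdirSync(lockDir);
        return true;
      }
    } catch {
      // raced with another process; fall through and report contention
    }
    return false;
  }
}

export function releaseSentinelLock(dataDir: string): void {
  try {
    rmdirSync(join(dataDir, ".worker-spawn.lock.dir"));
  } catch {
    // already removed — fine
  }
}
```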
| Key | Default | Range | Purpose |
|---|---|---|---|
CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS | 5000 | 0–60000 | Max wait for the spawn lock |
CLAUDE_MEM_PID_PORT_RECHECK_MS | 2000 | 500–30000 | Wait window before treating port-in-use without /api/version response as "unknown listener" |
Verification:
- Two claude-mem start commands in parallel → exactly one daemon ends up alive; the other exits cleanly with a log line referencing the lock.
- Kill the worker with -9 (skip cleanup), reuse the PID with python -c 'import time; time.sleep(60)' → validateWorkerPidFile returns 'stale' and removes the file.
- /api/version probe path: spawn a fake server on the worker port → the daemon exits 0 with the new diagnostic stderr, NOT silently.

Logging:
- SYSTEM INFO Daemon spawn lock acquired on success.
- SYSTEM WARN Daemon spawn lock contention, fields {waitedMs}.
- SYSTEM WARN Worker port occupied by foreign listener, fields {port, probeStatus}.
- /api/healthz fields (added in Phase 7): pid_file_path, pid_start_token, daemon_lock_held: bool.

Checks:
- grep "process.exit(0)" src/services/worker-service.ts — count unchanged (no new silent exits introduced).
- No new setInterval introduced.

Phase 6
Ships alongside Phase 5 (idempotent).
Goal: Recover the 504 MB of free pages, prevent recurrence, surface DB-size metrics.
| File | Change |
|---|---|
src/services/sqlite/Database.ts:27-32 and :69-74 | Add PRAGMA auto_vacuum = INCREMENTAL BEFORE the first table is created (only takes effect on a fresh DB; harmless on existing DBs but logs a no-op). For existing DBs, the migration path is the one-shot Phase-6 startup VACUUM. |
src/services/maintenance/DbMaintenance.ts (new) | Periodic maintenance task: on a 24h timer (configurable), call PRAGMA incremental_vacuum, PRAGMA wal_checkpoint(TRUNCATE), then collect metrics (page_count, freelist_count, file size). Emit MAINTENANCE INFO log. Acquire dbMaintenanceMutex so other writers wait. |
src/services/maintenance/DbMaintenance.ts | Startup check: if freelist_count / page_count > FREE_RATIO_VACUUM_THRESHOLD (default 0.40), perform full VACUUM after VACUUM INTO backup to <DATA_DIR>/backups/claude-mem-pre-vacuum-<ts>.db. Pause queue processor first. |
src/services/worker-service.ts:initializeBackground | Wire the maintenance task — start after dbManager.initialize(). Timer must .unref(). |
src/services/worker/SessionManager.ts | Expose pauseQueueProcessing(): Promise<void> and resumeQueueProcessing(): void. Use the existing AbortController + emitter to drain in-flight work; don't introduce new state. Maintenance acquires; readers continue (WAL allows them). |
src/services/infrastructure/CleanupV12_4_3.ts:135 | Reuse the existing VACUUM INTO backup pattern verbatim — copy the disk-space pre-flight check (statfsSync, line 115). |
Auto-vacuum on new DBs: Add PRAGMA auto_vacuum = INCREMENTAL in Database.ts BEFORE migrationRunner.runAllMigrations(). Verify with a comment that this is no-op on existing DBs (sqlite docs say a full VACUUM is required to flip auto_vacuum mode after tables exist). Document the migration path: existing users get the freed-page reclamation via the startup full VACUUM in step 3.
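A short sketch of the ordering in Database.ts, assuming bun:sqlite; the path is a placeholder and the migration call is the one referenced above.

```ts
import { Database } from "bun:sqlite";

// Sketch: the pragma must run before any table exists for it to take effect on a fresh DB.
const db = new Database("/path/to/claude-mem.db"); // placeholder path

// No-op on existing DBs: flipping auto_vacuum after tables exist requires a full VACUUM
// (handled by the Phase-6 startup VACUUM, not here).
db.run("PRAGMA auto_vacuum = INCREMENTAL");

// ...then the existing pragma block and migrationRunner.runAllMigrations() in the real file.
```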
Periodic incremental vacuum + WAL checkpoint:
- setInterval with .unref(). Default cadence: 24h. Setting: CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS (default 24, min 1, max 168).
- Each tick: acquire dbMaintenanceMutex → db.run('PRAGMA incremental_vacuum') → db.run('PRAGMA wal_checkpoint(TRUNCATE)') → snapshot metrics → release.
- Skip the tick if a VACUUM is already in progress.

Startup full VACUUM (one-shot per session) when free-ratio is high:
- Read page_count (PRAGMA page_count) and freelist_count (PRAGMA freelist_count).
- If freelist_count / page_count >= CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO (default 0.40), schedule a deferred VACUUM (5 minutes after the worker becomes ready) to avoid slowing startup.
- VACUUM INTO '<backup>' → verify backup → VACUUM (full) → resume queue → log freed pages and ms taken.
- Disk-space pre-flight via statfsSync (mirror CleanupV12_4_3.ts:115). Skip if free space < 1.2 * dbSize + 100MB. Log MAINTENANCE ERROR in that case so the user sees actionable info.

Pause/resume hook in SessionManager: the existing for await ... of getMessageIterator() loop in the queue processor needs a "pause" semaphore. Implementation: add a Promise<void> gate that the iterator awaits before yielding. Maintenance flips it to a pending promise during VACUUM and resolves it to release. Do not abort in-flight messages — they can complete; new messages wait.
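A minimal sketch of that pause gate; the class and field names are illustrative, not the real SessionManager members.

```ts
// Sketch of the Promise<void> gate the queue iterator awaits before yielding.
class QueuePauseGate {
  private gate: Promise<void> = Promise.resolve();
  private release: (() => void) | null = null;

  pause(): void {
    if (this.release) return; // already paused
    this.gate = new Promise<void>((resolve) => {
      this.release = resolve;
    });
  }

  resume(): void {
    this.release?.();
    this.release = null;
    this.gate = Promise.resolve();
  }

  // The queue processor awaits this before starting each new message.
  wait(): Promise<void> {
    return this.gate;
  }
}

// Usage inside the processing loop (in-flight work finishes; new work waits):
//   for await (const msg of getMessageIterator()) {
//     await pauseGate.wait();
//     await process(msg);
//   }
```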
Cleanup-V12.4.3 regression detection: Re-scan sdk_sessions WHERE project = OBSERVER_SESSIONS_PROJECT and pending_messages matching the stuck-pending pattern at maintenance ticks. If any match AND the marker exists, log MAINTENANCE WARN and re-run the purge (idempotent). Setting: CLAUDE_MEM_CLEANUP_REGRESSION_CHECK = true.
| Key | Default | Range | Purpose |
|---|---|---|---|
CLAUDE_MEM_DB_MAINTENANCE_ENABLED | true | bool | Master kill-switch |
CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS | 24 | 1–168 | Periodic cadence |
CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO | 0.40 | 0.05–0.95 | Free-ratio above which we auto-VACUUM at startup |
CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS | 300000 (5 min) | 0–3600000 | Defer startup VACUUM so it doesn't block readiness |
CLAUDE_MEM_CLEANUP_REGRESSION_CHECK | true | bool | Re-scan v12.4.3-shaped pollution |
Verification:
- Seed pending_messages with 100k stuck processing rows, run the worker → the startup VACUUM fires within 5 min after readiness, the freed-pages log line appears, and the file size drops.
- No SQLITE_BUSY or "database is locked" errors; the queue resumes after VACUUM.
- PRAGMA auto_vacuum returns 2 (incremental) on freshly-created DBs.
- The maintenance timer is .unref()'d — process.exit(0) from a clean shutdown returns immediately, not after the 24h interval.

Logging (category MAINTENANCE):
- MaintenanceStart, MaintenanceTick, VacuumStart, VacuumComplete ({freedPages, ms, dbSizeBeforeMb, dbSizeAfterMb}), VacuumSkippedLowDisk, RegressionDetected, MaintenanceComplete.
- /api/healthz fields (Phase 7): db_page_count, db_freelist_count, db_free_ratio_pct, db_size_bytes, db_last_vacuum_at, db_last_vacuum_freed_pages, db_last_maintenance_at.

Constraints:
- Never run VACUUM inside a transaction (sqlite errors).
- The queue does not need to pause for the VACUUM INTO backup phase — only the final full VACUUM needs the writer-lock window. (VACUUM INTO works on a read-only snapshot.)
- Don't use PRAGMA wal_checkpoint(FULL) — TRUNCATE is required to actually shrink the WAL file.
- Always back up to <DATA_DIR>/backups/ before every full VACUUM.
- Every timer is .unref()'d (grep for setInterval in the new file → unref() follows each).
- No setInterval outside the maintenance file.
- Database.ts is extended with auto_vacuum and includes a comment about the migration path.

Phase 2
Goal: Stop pending_messages and sdk_sessions from accumulating zombies.
| File | Change |
|---|---|
src/services/maintenance/SessionReaper.ts (new) | Periodic reaper. Plugs into the supervisor's existing health-checker.ts 30s tick (extend, do not replace). |
src/supervisor/health-checker.ts:9 runHealthCheck | Call SessionReaper.tick() after pruneDeadEntries(). |
src/services/worker/SessionManager.ts:deleteSession | After in-memory delete, call pendingStore.clearPendingForSession(sessionDbId) synchronously (it already does this via clearPendingForSession on a separate path — verify and unify). |
src/services/sqlite/PendingMessageStore.ts | Add reapStuckProcessing(olderThanMs: number): number returning the count of rows reset to pending. |
src/services/sqlite/SessionStore.ts | Add findInactiveSdkSessions(olderThanDays: number): Array<{id, project, contentSessionId, memorySessionId, lastActivityAt}>. |
src/services/sqlite/SessionStore.ts | Add markSdkSessionInactive(id: number) — adds an inactive_at column or sets a sentinel. |
src/services/sqlite/migrations/runner.ts | New migration: add inactive_at TEXT NULL to sdk_sessions if absent. |
Per tick (default 30s, gated by CLAUDE_MEM_REAPER_ENABLED):
Stuck-processing sweep: UPDATE pending_messages SET status='pending' WHERE status='processing' AND updated_at < <now - PROCESSING_STUCK_MS> (default 5 minutes). Log count if > 0.
Orphan-pending sweep: DELETE FROM pending_messages WHERE session_db_id NOT IN (SELECT id FROM sdk_sessions) (defensive — should already be FK-protected but log if any deleted).
Inactive-session detection (does NOT delete):
- id NOT IN <in-memory session ids> AND last_activity > N days ago (computed from MAX of related observations / pending_messages / session_summaries timestamps).
- UPDATE sdk_sessions SET inactive_at = <now> WHERE id = ? AND inactive_at IS NULL.

Observer-pollution regression check (matches Phase 6 task 5):
- If OBSERVER_SESSIONS_PROJECT rows reappear after the v12.4.3 marker is present, re-run the purge SQL from CleanupV12_4_3.runObserverSessionsPurge (lines 196-218).
- Log MAINTENANCE WARN with counts.

Hard delete is opt-in via CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS (default 0 = disabled; nonzero = days threshold). When enabled and a session has inactive_at older than the threshold AND no FK-referencing rows, hard-delete the session row. Default-off because user data safety > disk space.
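A sketch of the per-tick sweeps above, assuming bun:sqlite and the column names used in the plan; the ISO-8601 updated_at comparison is an assumption about the real schema.

```ts
import { Database } from "bun:sqlite";

// Sketch of the reaper tick; thresholds come from the reaper settings.
export function reaperTick(db: Database, processingStuckMs: number): { stuck: number; orphans: number } {
  const cutoff = new Date(Date.now() - processingStuckMs).toISOString();

  // Stuck-processing sweep: reset rows back to pending.
  db.prepare(
    "UPDATE pending_messages SET status='pending' WHERE status='processing' AND updated_at < ?",
  ).run(cutoff);
  const stuck = (db.prepare("SELECT changes() AS n").get() as { n: number }).n;

  // Orphan-pending sweep (defensive; FKs should already prevent this).
  db.prepare(
    "DELETE FROM pending_messages WHERE session_db_id NOT IN (SELECT id FROM sdk_sessions)",
  ).run();
  const orphans = (db.prepare("SELECT changes() AS n").get() as { n: number }).n;

  return { stuck, orphans }; // emit a MAINTENANCE ReaperTick log line when either is > 0
}
```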
| Key | Default | Range | Purpose |
|---|---|---|---|
CLAUDE_MEM_REAPER_ENABLED | true | bool | Master switch |
CLAUDE_MEM_REAPER_TICK_MS | 30000 | 5000–600000 | Tick cadence (piggy-backs supervisor; this value gates whether the reaper runs each tick) |
CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS | 300000 (5 min) | 30000–86400000 | Threshold for a processing row to be considered stuck |
CLAUDE_MEM_REAPER_INACTIVE_DAYS | 30 | 1–365 | When to mark a session inactive_at |
CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS | 0 | 0–365 | 0 = never; otherwise, hard-delete inactive rows older than N days |
Verification:
- Seed processing rows older than 5 minutes → the next reaper tick resets them → /api/healthz shows oldest_processing_pending_age_sec drop to 0.
- Seed OBSERVER_SESSIONS_PROJECT rows post-marker → the next tick logs the regression and purges them.

Logging:
- MAINTENANCE INFO ReaperTick, fields {stuckProcessing, orphanPending, markedInactive, hardDeleted, observerRegression}.
- /api/healthz fields (Phase 7): oldest_processing_pending_age_sec, processing_pending_count, pending_count_total, sdk_sessions_total, sdk_sessions_inactive, sdk_sessions_by_project: { [project]: count }.

Acceptance:
- The migration adds the inactive_at column without breaking existing data (test on a copy of a real DB).
- Unit tests cover findInactiveSdkSessions.
- Never delete rows from observations / session_summaries unless the explicit hard-delete + zero-FK-reference precondition holds.
- /api/healthz shows the reaper metrics.

Phase 3
Goal: Stop the 23-concurrent-chroma-mcp leak. Bound concurrency, reap idle, scan for orphans at startup.
| File | Change |
|---|---|
src/services/sync/ChromaMcpManager.ts | Add idle reaper; enforce single-instance via supervisor registry; add startup orphan scan; add lastCallAt timestamp updated by callTool. |
src/services/sync/ChromaMcpManager.ts:ensureConnected (line 43) | Before connect, check getProcessRegistry().getAll().filter(r => r.type === 'chroma') — if non-empty AND PID alive AND PID not the current _process.pid, refuse to spawn (alert + reuse existing if possible; otherwise wait for backoff). |
src/services/sync/ChromaMcpManager.ts:registerManagedProcess (line 613) | Already calls getSupervisor().registerProcess(CHROMA_SUPERVISOR_ID, ...) — verify the supervisor enforces single-instance for this id. (Currently register is keyed by id so same id replaces; document this.) |
src/supervisor/process-registry.ts | Add getActiveCountByType(type: string): number. Add findChromaOrphans(): Promise<number[]> — POSIX pgrep -af 'chroma-mcp' filtered by PPID == 1. |
src/services/worker-service.ts:initializeBackground | After ChromaMcpManager.getInstance(), kick off await ChromaMcpManager.scanAndReapOrphans() (best-effort; never throws). |
Startup orphan scan: New static method ChromaMcpManager.scanAndReapOrphans():
- POSIX: pgrep -af 'chroma-mcp' → for each PID, check its PPID. If PPID == 1 (re-parented to init), call killProcessTree(pid) (existing function at line 388). Log CHROMA_MCP INFO ReapedOrphan, fields {pid, ageSec}.
- Windows: Get-CimInstance Win32_Process -Filter "Name='chroma-mcp.exe'", filter by parent-process state, kill with taskkill.
- Match only chroma-mcp==<CHROMA_MCP_PINNED_VERSION> to avoid killing unrelated chroma installations.

Idle reaper: Add a lastCallAt: number = 0 field to ChromaMcpManager. Update it on every callTool. Run a setInterval(checkIdle, 60_000) (.unref()) — if connected && Date.now() - lastCallAt > CHROMA_MCP_IDLE_SHUTDOWN_MS (default 15 min), call await this.stop(). Lazy-reconnect resumes on the next callTool.
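A compact sketch of that idle reaper; the class and method names are illustrative stand-ins for the real ChromaMcpManager members.

```ts
// Sketch of the idle reaper described above.
const IDLE_SHUTDOWN_MS = 15 * 60 * 1000; // CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS default

class ChromaIdleReaperSketch {
  private lastCallAt = 0;
  private connected = false;

  startIdleReaper(stop: () => Promise<void>): void {
    const timer = setInterval(async () => {
      if (this.connected && Date.now() - this.lastCallAt > IDLE_SHUTDOWN_MS) {
        await stop(); // lazy-reconnect happens on the next callTool
        this.connected = false;
      }
    }, 60_000);
    timer.unref(); // never keep the worker alive just for this check
  }

  noteCall(): void {
    this.lastCallAt = Date.now(); // update on every callTool
  }
}
```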
Single-instance guard on reconnect: In ensureConnected, before connectInternal, call getProcessRegistry().getActiveCountByType('chroma'). If > 0 AND the registered PID is alive but this.connected === false, this is a stale process (we lost track). Tear it down via killProcessTree(registeredPid) first, then proceed with fresh spawn. Otherwise the count grows by one each reconnect — exactly the leak observed.
Hard cap: extend getSupervisor().assertCanSpawn('chroma mcp') (already called at line 87) to actually count and reject. Cap = 1 chroma-mcp per worker. Cap = TOTAL_PROCESS_HARD_CAP (10) overall — already enforced for SDK processes; extend to chroma-mcp.
Tighten close path: in connectInternal (line 74), after transport.close() / client.close(), if the underlying _process.pid is still in the registry, call killProcessTree and unregisterProcess explicitly. Don't rely on transport.onclose alone — it has the stale-callback guard but doesn't always fire on connect-time failures.
| Key | Default | Range | Purpose |
|---|---|---|---|
CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS | 900000 (15 min) | 60000–86400000 | Idle reaper threshold |
CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START | true | bool | Master switch for startup scan |
CLAUDE_MEM_CHROMA_MAX_CONCURRENT | 1 | 1–4 | Cap chroma-mcp instances per worker |
Verification:
- After the idle shutdown, getProcessRegistry().getActiveCountByType('chroma') returns 0.
- A callTool after idle-shutdown lazy-reconnects successfully.

Logging:
- CHROMA_MCP INFO OrphanScan {found, killed}.
- CHROMA_MCP INFO IdleShutdown {idleMs}.
- CHROMA_MCP WARN RegistryStale when the single-instance guard tears down a phantom.
- /api/healthz fields (Phase 7): chroma_mcp_pid_count, chroma_mcp_last_call_at, chroma_mcp_state ('connected'|'disconnected'|'backoff'), chroma_mcp_backoff_remaining_ms.

Constraints:
- Never kill on a bare 'chroma-mcp' name match without the chroma-mcp==<PINNED_VERSION> check — it could match unrelated user installs.
- Handle the case where chromaMcpManager is null (chroma disabled via CLAUDE_MEM_CHROMA_ENABLED=false).
- Don't call getProcessRegistry() from outside the worker process — it's worker-internal.

Acceptance:
- ps aux | grep chroma-mcp | wc -l ≤ 1.
- All new timers are .unref()'d.
- The orphan scan tolerates pgrep returning empty (no false-error logs).

Phase 4
Goal: Replace the unbounded counter at worker-utils.ts:401 with a real circuit breaker. Stop hooks from hammering the worker when it's down.
| File | Change |
|---|---|
src/shared/worker-circuit-breaker.ts (new) | CircuitBreaker class: states CLOSED, OPEN, HALF_OPEN. Persist to ~/.claude-mem/state/circuit-breaker.json. |
src/shared/worker-utils.ts:executeWithWorkerFallback (line 443) | Wrap the call in breaker.run(...). On OPEN, return WorkerFallback immediately (no HTTP). |
src/shared/worker-utils.ts:recordWorkerUnreachable (line 401) | Becomes a thin shim that calls breaker.recordFailure(). Hard cap (MAX_LIFETIME_FAILURES = 50) trips the breaker permanently until manual reset. |
src/shared/worker-utils.ts:resetWorkerFailureCounter (line 419) | Becomes breaker.recordSuccess(). |
src/cli/hook-command.ts | Verify the swallowed-stderr fix from observation 2026-05-07 is applied (it's marked as a "no-op replacement bug"). The breaker's stderr-fail-loud path must actually write to process.stderr.write(), not a stub. |
src/services/server/Server.ts | Add /api/admin/breaker/reset POST endpoint (gated by localhost only) for manual unsticking. |
States and transitions:
CLOSED ──[N consecutive failures]──> OPEN
OPEN ──[reset_timeout_ms elapsed]──> HALF_OPEN
HALF_OPEN ──[1 success]──> CLOSED
HALF_OPEN ──[1 failure]──> OPEN (resets timer)
ANY ──[lifetime failures > MAX_LIFETIME_FAILURES]──> OPEN_PERMANENT (until manual reset via API or settings reload)
Defaults:
| Setting | Default | Range |
|---|---|---|
CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD | 5 | 1–50 |
CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS | 30000 | 1000–600000 |
CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES | 1 | 1–10 |
CLAUDE_MEM_BREAKER_LIFETIME_CAP | 50 | 0–10000 (0 = no cap) |
Persistent state file shape:
{
"state": "CLOSED|OPEN|HALF_OPEN|OPEN_PERMANENT",
"consecutiveFailures": 0,
"lifetimeFailures": 0,
"openedAt": null,
"lastFailureAt": null,
"lastSuccessAt": null,
"lastTrippedAt": null
}
CircuitBreaker class: pure logic class, no I/O. Methods: getState(), canAttempt(), recordFailure(reason), recordSuccess(), forceReset(). Atomic file writes (write tmp + rename) for the JSON snapshot, mirroring writeHookFailureStateAtomic (worker-utils.ts:372).
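A pure-logic sketch of that class implementing the transitions listed above; persistence, settings wiring, and the half-open probe budget are omitted, and the class name is illustrative.

```ts
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN" | "OPEN_PERMANENT";

export class CircuitBreakerSketch {
  private state: BreakerState = "CLOSED";
  private consecutiveFailures = 0;
  private lifetimeFailures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,    // CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD
    private resetTimeoutMs = 30_000, // CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS
    private lifetimeCap = 50,        // CLAUDE_MEM_BREAKER_LIFETIME_CAP (0 = no cap)
  ) {}

  getState(): BreakerState {
    // OPEN → HALF_OPEN once the reset timeout elapses.
    if (this.state === "OPEN" && Date.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  canAttempt(): boolean {
    const s = this.getState();
    return s === "CLOSED" || s === "HALF_OPEN";
  }

  recordFailure(_reason: string): void {
    this.consecutiveFailures += 1;
    this.lifetimeFailures += 1;
    if (this.lifetimeCap > 0 && this.lifetimeFailures > this.lifetimeCap) {
      this.state = "OPEN_PERMANENT"; // only manual reset recovers from here
      return;
    }
    const s = this.getState();
    if (s === "HALF_OPEN" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = Date.now(); // a HALF_OPEN failure re-opens and resets the timer
    }
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    if (this.state === "HALF_OPEN") this.state = "CLOSED";
  }

  forceReset(): void {
    this.state = "CLOSED";
    this.consecutiveFailures = 0;
    this.lifetimeFailures = 0;
  }
}
```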
Wire into executeWithWorkerFallback:
if (!breaker.canAttempt()) {
// Optional: print one-line stderr if state changed during this call
return { continue: true, reason: 'circuit_breaker_open', [WORKER_FALLBACK_BRAND]: true };
}
const alive = await ensureWorkerAliveOnce();
if (!alive) { breaker.recordFailure('unreachable'); ... }
...
if (response.ok) breaker.recordSuccess();
Fail-loud stderr fix: The 2026-05-07 observation mentions a "stderr no-op replacement bug" in hookCommand. Investigate src/cli/hook-command.ts for any process.stderr.write shim that suppresses output. The breaker's diagnostic ("Worker unreachable; circuit breaker OPEN; will retry in Xs") MUST appear on the user's terminal so they know what's happening. Test by intentionally killing the worker and running a hook — message should appear on stderr.
Manual reset endpoint: POST /api/admin/breaker/reset (no body required). Restricted to 127.0.0.1 only. Logs SYSTEM WARN BreakerForceReset with caller info.
Lifetime cap: when lifetimeFailures > CLAUDE_MEM_BREAKER_LIFETIME_CAP, transition to OPEN_PERMANENT. The only way out is the manual-reset API or restarting the worker with a fresh state file. Print prominent stderr: claude-mem: 50 lifetime worker failures detected. Disabling memory hooks until reset. Run: claude-mem worker doctor.
Verification:
- Kill the worker and run repeated hooks → only CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD HTTP attempts are made; the rest short-circuit.
- After the breaker trips, POST /api/admin/breaker/reset clears it.
- State-file writes are atomic (same tmp-write-then-rename pattern as worker-utils.ts).

Logging:
- SYSTEM WARN BreakerOpened, fields {lifetime, consecutiveBefore}.
- SYSTEM INFO BreakerHalfOpen.
- SYSTEM INFO BreakerClosed, fields {recoveredAfterMs}.
- SYSTEM ERROR BreakerOpenedPermanent.
- /api/healthz fields (Phase 7): breaker_state, breaker_consecutive_failures, breaker_lifetime_failures, breaker_opened_at, breaker_total_trips.

Constraints:
- Leave RestartGuard alone; it keeps its own session-level limits.
- The reset endpoint must not depend on full worker initialization (/api/admin/breaker/reset should work even if initializationCompleteFlag === false).
- Grep for workerHttpRequest outside executeWithWorkerFallback and audit each caller — some integrations may need breaker.canAttempt() guards added.
- No new process.exit(1) introduced — breaker tripping returns WorkerFallback, not exit codes.

Phase 7: /api/healthz Endpoint with Concrete Metrics
Goal: Centralized observability so future regressions are detectable at a glance.
| File | Change |
|---|---|
src/services/worker/http/routes/HealthzRoutes.ts (new) | Implements RouteHandler. GET /api/healthz and /api/healthz?format=prom. |
src/services/worker-service.ts:registerRoutes | Register the new HealthzRoutes(...). |
src/services/worker/MetricsCollector.ts (new) | Aggregates metrics; refreshed on the supervisor's existing 30s health-check tick to avoid amplifying load. |
src/supervisor/health-checker.ts:runHealthCheck | Call MetricsCollector.refresh() after pruneDeadEntries. |
GET /api/healthz → 200 JSON:
{
"status": "ok|degraded|unhealthy",
"ts": "2026-05-07T21:30:00.000Z",
"uptime_sec": 12345,
"versions": {
"plugin": "12.7.5",
"worker": "12.7.5",
"matches": true
},
"process": {
"pid": 12345,
"rss_mb": 145.2,
"event_loop_lag_ms": 3.1,
"managed": true,
"platform": "darwin"
},
"pid_file": {
"path": "/Users/.../worker.pid",
"start_token": "Wed May 7 14:23:15 2026",
"daemon_lock_held": true
},
"db": {
"path": "/Users/.../claude-mem.db",
"size_bytes": 31457280,
"page_count": 7680,
"freelist_count": 12,
"free_ratio_pct": 0.16,
"last_vacuum_at": "2026-05-07T20:00:00.000Z",
"last_vacuum_freed_pages": 130000,
"last_maintenance_at": "2026-05-07T20:00:00.000Z",
"oldest_processing_pending_age_sec": 4,
"processing_pending_count": 1,
"pending_count_total": 12,
"sdk_sessions_total": 145,
"sdk_sessions_inactive": 13,
"sdk_sessions_by_project": { "claude-mem": 25, "...": 120 }
},
"child_processes": {
"chroma_mcp_pid_count": 1,
"chroma_mcp_last_call_at": "2026-05-07T21:25:11.000Z",
"chroma_mcp_state": "connected",
"chroma_mcp_backoff_remaining_ms": 0,
"sdk_process_count": 0,
"supervisor_registry_size": 2
},
"network": {
"hook_consecutive_failures": 0,
"breaker_state": "CLOSED",
"breaker_consecutive_failures": 0,
"breaker_lifetime_failures": 3,
"breaker_opened_at": null,
"breaker_total_trips": 1,
"last_request_at": "2026-05-07T21:29:55.000Z",
"request_rate_per_min": 12.3
},
"ai": {
"provider": "claude",
"auth_method": "...",
"last_interaction": { ... }
}
}
GET /api/healthz?format=prom → 200 text/plain with Prometheus text format. One metric per JSON leaf (e.g. claude_mem_db_free_ratio_pct 0.16).
status derivation:
unhealthy if breaker is OPEN_PERMANENT, OR DB initialization failed, OR chroma-mcp pid count > CLAUDE_MEM_CHROMA_MAX_CONCURRENT.degraded if breaker is OPEN, OR free_ratio > 0.4, OR oldest_processing_pending > 1 hour, OR worker version mismatches plugin version.ok otherwise.MetricsCollector class: a Map<string, unknown> snapshot. Public refresh() collects fresh data; public getSnapshot() returns the cached object. Refresh is called by the 30s health-check tick AND on-demand if last refresh > 5s ago (debounced).
DB metrics queries (use db.prepare + .get()):
- PRAGMA page_count → { page_count: number }
- PRAGMA freelist_count → { freelist_count: number }
- PRAGMA page_size → for the size_bytes computation
- SELECT MIN(updated_at) FROM pending_messages WHERE status='processing' (with julianday math for age in seconds)
- SELECT project, COUNT(*) FROM sdk_sessions GROUP BY project

Process metrics: process.memoryUsage().rss / 1024 / 1024. Event-loop lag via perf_hooks.monitorEventLoopDelay (Node API, available in bun) — sample over a 30s window.
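A minimal sketch of the event-loop lag sample using the Node API named above; whether bun's compat layer supports it is the spike flagged in the risk list, so treat this as an assumption.

```ts
import { monitorEventLoopDelay } from "node:perf_hooks";

// Histogram values are reported in nanoseconds; reset per window so each 30s sample is independent.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

export function eventLoopLagMs(): number {
  const meanMs = histogram.mean / 1e6;
  histogram.reset();
  return Number(meanMs.toFixed(1));
}
```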
Network metrics: maintain a rolling 1-min request counter in middleware (existing createMiddleware in Server.ts:156). Increment on each /api/* request.
Prometheus format: emit # HELP and # TYPE lines per metric. Use the same naming convention (claude_mem_<group>_<name>).
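A sketch of the JSON-to-Prometheus flattening under the naming convention above; it emits numeric leaves only and skips HELP text, so treat it as illustrative rather than the planned implementation.

```ts
export function toPromText(snapshot: Record<string, unknown>, prefix = "claude_mem"): string {
  const lines: string[] = [];
  const walk = (obj: Record<string, unknown>, path: string[]) => {
    for (const [key, value] of Object.entries(obj)) {
      if (typeof value === "number") {
        const name = [prefix, ...path, key].join("_"); // claude_mem_<group>_<name>
        lines.push(`# TYPE ${name} gauge`, `${name} ${value}`);
      } else if (value && typeof value === "object" && !Array.isArray(value)) {
        walk(value as Record<string, unknown>, [...path, key]);
      }
    }
  };
  walk(snapshot, []);
  return lines.join("\n") + "\n";
}
```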
Compatibility: leave /api/health UNCHANGED (existing integrations break otherwise). /api/healthz is the new richer endpoint.
Verification:
- curl 127.0.0.1:<port>/api/healthz | jq .status returns ok on a healthy worker.
- db.free_ratio_pct updates at the 30s cadence (verify by manually inflating the freelist).
- ?format=prom parses with promtool check metrics.

Logging:
- WORKER DEBUG MetricsRefresh, fields {durationMs}.
- WORKER WARN MetricsRefreshSlow if a refresh takes > 250ms (DB query stall signal).

Acceptance:
- /api/health response body unchanged byte-for-byte (regression test).
- ?format=prom output validates with promtool if available; otherwise visual inspection.
- The new endpoint follows the RouteHandler pattern (no direct app.get in worker-service.ts).

Phase 8
Goal: User-facing surface so operators can see what the new machinery did. Ordered last to allow phases 2-7 to stabilize.
| File | Change |
|---|---|
src/cli/handlers/worker-doctor.ts (new) | New CLI subcommand claude-mem worker doctor — fetches /api/healthz, formats it for terminals, includes recent reaper actions. |
src/services/worker-service.ts:main() | Register the worker doctor CLI route (alongside existing cursor, gemini-cli cases). |
plugin/scripts/worker-cli.js | Wire to the new doctor command. |
CLAUDE.md (project root) | Document new settings under a "Worker Maintenance" section. |
docs/public/ (optional) | User-facing explanation of the breaker, reaper, and health endpoint. |
worker doctor output (example):

claude-mem worker doctor
Status: OK
Version: plugin=12.7.5 worker=12.7.5 (match)
Uptime: 3h 25m
PID: 12345 (lock held: yes)
Database:
Size: 32 MB (free: 0.16%)
Last vacuum: 4h ago, freed 130k pages
Pending: 12 total / 1 processing (oldest 4s)
SDK sessions: 145 total / 13 inactive
Child processes:
chroma-mcp: 1 (last call: 5s ago, state: connected)
SDK processes: 0
Supervisor: 2 entries
Circuit breaker:
State: CLOSED
Consecutive: 0
Lifetime: 3
Total trips: 1
Recent maintenance (last 24h):
2026-05-07 20:00 Vacuum: freed 130k pages in 1.4s
2026-05-07 19:30 Reaper: 5 stuck-processing reset, 2 inactive marked
2026-05-07 18:00 Chroma orphan scan: 0 found
If status != ok, append a "Recommended actions" block:
- claude-mem worker reset-breaker
- claude-mem worker reap-chroma

Doctor command: GET /api/healthz via workerHttpRequest. Format as the table above. Color-code (red/yellow/green) using the existing chalk integration if present, otherwise plain text. JSON pass-through via a --json flag.
Recent-actions feed: store the last 50 maintenance events in a circular buffer in MetricsCollector (in-memory only — survives one worker lifetime; not persistent). Expose at /api/healthz/events (separate to avoid bloating the main response).
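A minimal sketch of that in-memory buffer; the type and class names are illustrative.

```ts
// In-memory only: survives one worker lifetime, capped at 50 events.
type MaintenanceEvent = { at: string; kind: string; detail: string };

export class RecentEventsBuffer {
  private events: MaintenanceEvent[] = [];

  constructor(private capacity = 50) {}

  push(kind: string, detail: string): void {
    this.events.push({ at: new Date().toISOString(), kind, detail });
    if (this.events.length > this.capacity) this.events.shift(); // drop the oldest
  }

  list(): MaintenanceEvent[] {
    return [...this.events]; // served by /api/healthz/events, newest last
  }
}
```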
Update CLAUDE.md: add a "Worker Maintenance" section with: settings reference table, the doctor command, a brief description of the reaper/breaker/vacuum behavior. Per CLAUDE.md "Important: No need to edit the changelog ever" — only edit CLAUDE.md, never CHANGELOG.
Rollout ordering (per problem statement constraint):
/api/healthz). Each wave can ship as a separate release. Inter-wave dependencies: Phase 7 depends on data sources from Phases 2/3/4/6 — but the endpoint can ship with partial data (fields gated by phase availability).
Verification:
- claude-mem worker doctor prints a green-OK summary on a healthy worker.
- claude-mem worker doctor --json returns valid JSON pipeable to jq.
- With the worker stopped, claude-mem worker doctor cleanly reports Worker unreachable instead of hanging.
- claude-mem worker doctor exits 0 on a healthy state, 1 on unhealthy, 2 if the worker is unreachable (mirrors the hook-exit-codes convention).

Goal: Prove the system works end-to-end before declaring victory.
Run the worker for 24 hours under realistic Claude Code usage. After 24h:
| Metric | Pass criterion |
|---|---|
| `ps aux \| grep chroma-mcp \| wc -l` | ≤ 1 |
| `ps aux \| grep claude-mem \| wc -l` | ≤ a small constant (1-2) |
| DB size growth rate | < 5 MB/hr; free_ratio < 20% |
/api/healthz breaker.lifetime_failures | < 10 (vs. the #1874 starting baseline) |
Stuck processing rows older than 10 min | 0 |
| Worker memory RSS | < 300 MB (no leak) |
| Inject | Expected behavior |
|---|---|
Kill worker via kill -9 | Lazy-respawn on next hook; PID file cleaned |
Two parallel claude-mem start | Exactly one daemon survives; lock log line visible |
| 100 stuck processing rows | Reaper resets all within REAPER_PROCESSING_STUCK_MS + REAPER_TICK_MS |
| Spawn fake listener on worker port | New --daemon exits 0 with diagnostic stderr (no silent exit) |
| Fork 5 chroma-mcp orphans | Worker startup reaps all 5 |
| Pull network during 10 hooks | Breaker opens after threshold; subsequent hooks short-circuit |
# No new always-on intervals
grep -rn "setInterval" src/ --include="*.ts" | grep -v "unref()" | grep -v "^src/.*test"
# No new process.exit(1) on hook paths
git diff main -- src/shared/worker-utils.ts src/cli/ | grep "process.exit(1)"
# No invented settings
git diff main -- src/shared/SettingsDefaultsManager.ts | grep "CLAUDE_MEM_"
# Cross-reference with all phases' settings tables.
# No hardcoded magic numbers in business logic
git diff main | grep -E "[0-9]{4,}" | grep -v SettingsDefaultsManager | grep -v test
Documentation:
- CLAUDE.md adds: Worker Maintenance section (Phase 8.3).
- docs/public/ (optional): user-facing explanation.

Acceptance:
- /api/healthz reports status: "ok" 24h after deployment.
- worker doctor on 3 production-like environments confirms expected output.
- Each phase's changes remain reviewable as a coherent diff (git log -p).

All settings declared in src/shared/SettingsDefaultsManager.ts:
| Setting | Phase | Default | Range |
|---|---|---|---|
CLAUDE_MEM_DAEMON_LOCK_TIMEOUT_MS | 5 | 5000 | 0–60000 |
CLAUDE_MEM_PID_PORT_RECHECK_MS | 5 | 2000 | 500–30000 |
CLAUDE_MEM_DB_MAINTENANCE_ENABLED | 6 | true | bool |
CLAUDE_MEM_DB_MAINTENANCE_INTERVAL_HOURS | 6 | 24 | 1–168 |
CLAUDE_MEM_DB_VACUUM_THRESHOLD_RATIO | 6 | 0.40 | 0.05–0.95 |
CLAUDE_MEM_DB_VACUUM_STARTUP_DELAY_MS | 6 | 300000 | 0–3600000 |
CLAUDE_MEM_CLEANUP_REGRESSION_CHECK | 6 | true | bool |
CLAUDE_MEM_REAPER_ENABLED | 2 | true | bool |
CLAUDE_MEM_REAPER_TICK_MS | 2 | 30000 | 5000–600000 |
CLAUDE_MEM_REAPER_PROCESSING_STUCK_MS | 2 | 300000 | 30000–86400000 |
CLAUDE_MEM_REAPER_INACTIVE_DAYS | 2 | 30 | 1–365 |
CLAUDE_MEM_REAPER_HARD_DELETE_INACTIVE_DAYS | 2 | 0 | 0–365 |
CLAUDE_MEM_CHROMA_IDLE_SHUTDOWN_MS | 3 | 900000 | 60000–86400000 |
CLAUDE_MEM_CHROMA_ORPHAN_SCAN_ON_START | 3 | true | bool |
CLAUDE_MEM_CHROMA_MAX_CONCURRENT | 3 | 1 | 1–4 |
CLAUDE_MEM_BREAKER_FAILURE_THRESHOLD | 4 | 5 | 1–50 |
CLAUDE_MEM_BREAKER_RESET_TIMEOUT_MS | 4 | 30000 | 1000–600000 |
CLAUDE_MEM_BREAKER_HALF_OPEN_MAX_PROBES | 4 | 1 | 1–10 |
CLAUDE_MEM_BREAKER_LIFETIME_CAP | 4 | 50 | 0–10000 |
| File | Phases that touch it |
|---|---|
src/services/worker-service.ts | 3 (initializeBackground), 5 (--daemon), 6 (maintenance wiring), 7 (route registration), 8 (CLI) |
src/services/worker-spawner.ts | 5 |
src/services/infrastructure/ProcessManager.ts | 5 (lock + start-token) |
src/services/infrastructure/HealthMonitor.ts | 5 (port-on-pid match) |
src/services/infrastructure/CleanupV12_4_3.ts | 6 (regression detection — read only) |
src/services/sync/ChromaMcpManager.ts | 3 |
src/supervisor/index.ts | 5 (validateWorkerPidFile) |
src/supervisor/process-registry.ts | 3 (orphan scan), 5 (start-token) |
src/supervisor/health-checker.ts | 2 (reaper), 7 (metrics refresh) |
src/services/worker/SessionManager.ts | 2 (delete hook), 6 (pause/resume) |
src/shared/worker-utils.ts | 4 (breaker integration) |
src/services/sqlite/Database.ts | 6 (auto_vacuum) |
src/services/sqlite/PendingMessageStore.ts | 2 (reapStuckProcessing) |
src/services/sqlite/SessionStore.ts | 2 (findInactiveSdkSessions) |
src/services/sqlite/migrations/runner.ts | 2 (inactive_at column) |
src/services/server/Server.ts | 4 (breaker reset), 7 (healthz route) |
src/shared/SettingsDefaultsManager.ts | 2-6 (settings keys) |
src/services/maintenance/DbMaintenance.ts | 6 (NEW) |
src/services/maintenance/SessionReaper.ts | 2 (NEW) |
src/shared/worker-circuit-breaker.ts | 4 (NEW) |
src/services/worker/MetricsCollector.ts | 7 (NEW) |
src/services/worker/http/routes/HealthzRoutes.ts | 7 (NEW) |
src/cli/handlers/worker-doctor.ts | 8 (NEW) |
CLAUDE.md | 8 (Worker Maintenance section) |
Spikes / open questions:
- bun:ffi flock support: confirm via spike before committing Phase 5.4. If unavailable, fall back to the flock(1) shell on Linux + an atomic mkdirSync sentinel on macOS/Windows.
- Confirm perf_hooks.monitorEventLoopDelay works in bun's Node-compat layer. If not, fall back to a setImmediate-based heuristic.
- Confirm an existing DB can adopt incremental auto-vacuum by running PRAGMA auto_vacuum = INCREMENTAL; VACUUM; manually. (It should — a full VACUUM with auto_vacuum already set takes effect.)
- Confirm /api/healthz does not duplicate any planned Pro endpoint. Per CLAUDE.md "Pro Features Architecture", the worker's local HTTP API stays open — /api/healthz is fine to add OSS-side.