Back to Claude Mem

Worker Restart: Single Source of Truth

plans/2026-06-10-worker-restart-single-source-of-truth.md

13.5.625.0 KB
Original Source

Worker Restart: Single Source of Truth

Created: 2026-06-10 Root-cause analysis: 12-agent diagnosis, adversarially verified (workflow wf_f07f3541-b05). Summary: worker-restart failures are caused by five redundant "who is the worker" oracles with uncoordinated writers, a sync-script "hot reload" mirror that writes version-N code into the version-(N-1) cache dir, a kill-only restart endpoint that races hook lazy-spawns, and a build chain that fires two uncoordinated restarts and never verifies the outcome. 98 version-recycle ping-pong events and six EADDRINUSE hard failures observed Jun 8–10.

Execution model: Each phase is self-contained and lands independently (one commit/PR per phase, in order). Phases 1–2 are the high-leverage, low-risk slice. Run npm test plus the phase's verification checklist before moving on.


Phase 0: Consolidated Documentation — Allowed APIs and Ground Truth

(Discovery already performed by three fact-extraction subagents on 2026-06-10; consolidated here. Executors: trust these refs, but re-read the cited lines before editing — line numbers drift.)

Allowed APIs (verified to exist — use ONLY these, with these exact signatures)

APILocationSignature / shape
ensureWorkerRunning()src/shared/worker-utils.ts:293-381(): Promise<boolean> — hook lazy-spawn + PR #2768 version-recycle
resolveWorkerScriptPath()src/shared/worker-utils.ts:206-215candidates: MARKETPLACE_ROOT/plugin/scripts/worker-service.cjs, then cwd()/plugin/scripts/worker-service.cjs
resolveBunRuntime()src/shared/worker-utils.ts:217-235hook-side resolver; MISSING ~/.bun/bin fallbacks
waitForWorkerPort / waitForWorkerReadinesssrc/shared/worker-utils.ts:237-273polls GET /api/readiness; budget env CLAUDE_MEM_HOOK_READINESS_TIMEOUT_MS
ensureWorkerStarted(port, workerScriptPath)src/services/worker-spawner.ts:70-153returns 'ready' | 'warming' | 'dead'; NO version guard
spawnDaemon(scriptPath, port, extraEnv?)src/services/infrastructure/ProcessManager.ts:408-472returns PID or undefined; setsid on Unix, PowerShell on Windows
resolveWorkerRuntimePath(options?)src/services/infrastructure/ProcessManager.ts:63-125full bun resolver chain (BUN, BUN_PATH, ~/.bun/bin, brew paths, which)
PID file APIsProcessManager.ts:134-168, 508-520writePidFile(info), readPidFile(): PidInfo|null, removePidFile(), touchPidFile(), cleanStalePidFile()
httpShutdown(port, reason?)src/services/infrastructure/HealthMonitor.ts:94-114POSTs /api/admin/shutdown?reason=restart
waitForPortFree(port, timeoutMs?)HealthMonitor.ts:85-92500ms poll
checkVersionMatch(port)src/services/infrastructure/HealthMonitor.ts:~120-161returns {matches, pluginVersion, workerVersion}; fail-open on ENOENT
performGracefulShutdown(config)src/services/infrastructure/GracefulShutdown.ts:30-58sequential closes, NO global deadline
flushResponseThen(res, payload, action)src/services/server/flushResponseThen.ts:3-16responds, runs action on 'finish', ALWAYS process.exit(0)
writeJsonFileAtomic()src/npx-cli/utils/paths.ts:124-205the ONLY atomic-write helper in the repo
MARKETPLACE_ROOTsrc/shared/paths.ts:43~/.claude/plugins/marketplaces/thedotmack
resolveDataDir() / DATA_DIRsrc/shared/paths.ts:18-40env CLAUDE_MEM_DATA_DIR wins (line 19-20); module-level const — computed at import time
Worker port defaultsrc/shared/SettingsDefaultsManager.ts:9137700 + (uid % 100) — NEVER hardcode 37777
/api/health responsesrc/services/server/Server.ts:212-233{status, version, workerPath, uptime, pid, initialized, mcpReady, ...} — has everything verification needs
/api/readiness responseServer.ts:235-247{status: 'ready'|'initializing', mcpReady}
/api/admin/restartServer.ts:282-294kill-only via flushResponseThenonRestart()shutdown('restart'); NOTHING respawns (macOS/Linux)
version: BUILT_IN_VERSIONbaked via __DEFAULT_PACKAGE_VERSION__ esbuild definescripts/build-hooks.js:312,365,421,469,495,540

Anti-patterns (verified NOT to exist — do not invent)

  • There is NO /api/admin/status, NO /api/version route in Server.ts (version comes from /api/health), NO respawn anywhere in the HTTP restart path.
  • There is NO flock/O_EXCL/wx-flag lockfile pattern anywhere in src/ — Phase 4 introduces the first one; copy writeJsonFileAtomic for the write discipline.
  • plugin/scripts/*.cjs are BUILT artifacts — never hand-edit; rebuild with npm run build.
  • package-lock.json is gitignored — do not commit it.
  • Tests are bun test (bunfig.toml preloads tests/preload.ts for the PostHog mock). Mock pattern: mock.module() + query-param cache-bust fresh import (tests/shared/worker-utils-version-recycle.test.ts:22-32).

Behavioral contracts that MUST keep passing

  • tests/shared/worker-utils-version-recycle.test.ts: on version mismatch ensureWorkerRunning() POSTs /api/admin/restart ≥1×; on match, 0×.
  • tests/integrations/spawn-contract-windows.test.ts: spawn-contract env overrides.
  • Full suite: 2203 pass / 0 fail as of v13.5.5.

Phase 1: Delete the cache-mirror; make the CLI restart the single initiator

Why first: The mirror (sync-marketplace.cjs:164-173) is the largest manufactured source of version disagreement (7 of 10 cache dirs hold off-by-one content); the double restart (HTTP POST + sleep 1 + CLI restart) is the race generator. Both fixes are deletions.

What to implement

  1. In scripts/sync-marketplace.cjs:
    • Delete the cache-mirror block (lines ~164-173: INSTALLED_CACHE_PATH + rsync + bun install into the cache dir).
    • Delete detectInstalledVersion() (lines ~79-114) and its call site — it exists only to feed the mirror.
    • Delete the HTTP restart trigger block (lines ~196-216: the http.request POST to /api/admin/restart and its success/error prints). The sync script's job ends at "files synced + bun install in the marketplace copy".
  2. In root package.json line ~67, simplify:
    • From: "build-and-sync": "npm run build && npm run sync-marketplace && sleep 1 && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)"
    • To: "build-and-sync": "npm run build && npm run sync-marketplace && (cd ~/.claude/plugins/marketplaces/thedotmack && npm run worker:restart)"
    • (drop the sleep 1 — it existed to let the now-deleted HTTP kill land before the CLI restart)
  3. Do NOT touch the existing one-time legacy-cache cleanup elsewhere in sync-marketplace.cjs (if present) — only the three blocks named above.

Documentation references

  • Verbatim quotes of all three blocks: Phase 0 table + the restart/shutdown extraction (sync-marketplace.cjs:79-114, 164-173, 196-216; package.json:64-79).

Verification checklist

  • grep -n "INSTALLED_CACHE_PATH\|detectInstalledVersion\|admin/restart" scripts/sync-marketplace.cjs → no matches.
  • grep -n "sleep 1" package.json → no match in build-and-sync.
  • npm run build-and-sync → completes; worker restarts (via CLI path only); curl -s http://127.0.0.1:$PORT/api/health shows the just-built version. (Resolve $PORT as 37700 + uid%100 or from ~/.claude-mem/settings.json.)
  • npm test → green.
  • Stale-cache regression check: ls ~/.claude/plugins/cache/thedotmack/claude-mem/*/package.json | xargs grep -h '"version"' — record current values; after the NEXT release, re-check that no cache dir's content changed (the mirror used to mutate them).

Anti-pattern guards

  • Do not "fix" the mirror by targeting workerPath instead — the feature is being removed, not repaired.
  • Do not add a new restart mechanism here; Phase 2 hardens the existing CLI restart.

Phase 2: Post-restart verification — restart proves itself or exits 1

Why: restart currently exits 0 after spawnDaemon returns a PID — fork success, not boot success. Four silent exit-0 daemon paths mean "✓" with a dead/stale worker.

What to implement

All in src/services/worker-service.ts:

  1. In case 'restart' (lines ~956-973):
    • BEFORE httpShutdown, capture the old worker: GET /api/health (2s timeout) → save oldPid (may be null if no worker).
    • Replace waitForPortFree(port, 5000) with waitForPortFree(port, getPlatformTimeout(15000)) — parity with stop (line ~946).
    • AFTER spawnDaemon, add a verification loop (new helper verifyRestartedWorker(port, oldPid, deadlineMs)): poll GET /api/health every 500ms until health.pid !== oldPid && health.version === EXPECTED_VERSION, where EXPECTED_VERSION is this process's own baked __DEFAULT_PACKAGE_VERSION__ (build-and-sync runs the marketplace copy, so its baked version IS the just-synced version). Deadline: getPlatformTimeout(30000).
    • On success: log Worker restart verified {pid, version} and exit 0. On deadline: console.error with the last observed health payload (or connection error) and exit 1.
  2. In the --daemon block (lines ~1167-1208): change EXIT PATH 4 only (generic start failure, lines ~1204-1206) from process.exit(0) to process.exit(1). Paths 1-3 are legitimate duplicate-suppression and stay exit 0.
  3. Spawn target: change spawnDaemon(__filename, port) (line ~965) to prefer the marketplace script — copy the candidate pattern from resolveWorkerScriptPath() (src/shared/worker-utils.ts:206-215), falling back to __filename when no marketplace copy exists (dev trees, CI).

Documentation references

  • /api/health shape: Server.ts:212-233 (pid, version fields confirmed).
  • getPlatformTimeout: used at worker-service.ts:946.
  • __DEFAULT_PACKAGE_VERSION__ availability inside worker-service bundle: scripts/build-hooks.js:312-313.

Verification checklist

  • New test tests/services/worker-restart-verify.test.ts: mock global.fetch (copy the fetchLog pattern from tests/shared/worker-utils-version-recycle.test.ts:34-50); assert verifyRestartedWorker returns success when health flips to {pid: newPid, version: expected}, failure on stale pid, failure on wrong version, failure on timeout.
  • Manual: npm run build-and-sync → output includes Worker restart verified; then kill -9 <worker pid> mid-restart-window and re-run to see a LOUD exit-1 path (or simulate by pointing the verify loop at a dead port in the test).
  • npm test → green, including the version-recycle contract.

Anti-pattern guards

  • Do NOT poll /api/version (doesn't exist) or invent new health fields.
  • Do NOT compare against marketplace package.json on disk for EXPECTED_VERSION — the baked constant is the truth for "the code I am running"; disk reads reintroduce a second oracle.
  • The verify deadline must not block forever: hard cap, then exit 1.

Phase 3: Self-replacing restart — old worker spawns its successor; recycle path stops spawning into corpses

Why: /api/admin/restart is kill-only; hooks that POST it then lazy-spawn race the dying worker (the ping-pong). If the OLD worker spawns its successor as its final act after the port closes, old and new never coexist and no third party spawns into a corpse.

What to implement

  1. src/services/worker-service.ts shutdown() (lines ~671-699):
    • Re-entrancy guard: the isShuttingDown field (line ~188) is write-only today; make shutdown() check-and-set it at entry (if (this.isShuttingDown) return; this.isShuttingDown = true;).
    • Hard deadline: wrap performGracefulShutdown(...) in Promise.race with a getPlatformTimeout(10000) timer; on deadline, log Graceful shutdown deadline exceeded — proceeding and continue (do not hang on unbounded session drain — drain today can run 35-40s).
    • Successor spawn: when reason === 'restart', after graceful shutdown completes/deadlines: resolve the marketplace script (same candidate pattern as Phase 2 step 3), await waitForPortFree(port, 5000), then spawnDaemon(marketplaceScript, port). If port never frees or spawn returns undefined, log loudly (logger.error) — the next hook's lazy-spawn is the safety net. Note flushResponseThen (flushResponseThen.ts:3-16) calls process.exit(0) after the action completes, so the spawn must be awaited inside the action.
  2. src/shared/worker-utils.ts ensureWorkerRunning() recycle path (lines ~305-330):
    • After POSTing /api/admin/restart, do NOT immediately lazy-spawn. Instead poll GET /api/health (500ms interval, HOOK_READINESS_TIMEOUT_MS budget) for the successor: healthy AND version === pluginVersion (already in hand from checkVersionMatch). Only fall through to the existing lazy-spawn if the successor never appears.
    • Amplifier fix: after waitForWorkerReadiness succeeds anywhere in this function, re-check the version once via /api/health; if still mismatched, log a warning (do NOT loop/recycle again in the same invocation — one recycle per hook event, the next hook retries; unbounded loops here re-create the storm).

Documentation references

  • onRestart wiring: worker-service.ts:255; route: Server.ts:282-294; flushResponseThen.ts:3-16.
  • Drain timings + unbounded awaits: GracefulShutdown.ts:30-75.
  • HOOK_READINESS_TIMEOUT_MS: worker-utils.ts:45-48.
  • Signal-path guard to mirror: src/supervisor/index.ts:56-60.

Verification checklist

  • tests/shared/worker-utils-version-recycle.test.ts still green (still POSTs restart on mismatch — the change is what happens AFTER the POST).
  • New test: shutdown re-entrancy — call shutdown() twice, assert performGracefulShutdown runs once (mock it via mock.module).
  • New test: recycle-no-corpse-spawn — fetch mock where /api/admin/restart 200s and /api/health returns the NEW version on poll N: assert NO spawn attempt; where health never recovers: assert lazy-spawn fallback fires.
  • Manual end-to-end: with a worker running, curl -X POST http://127.0.0.1:$PORT/api/admin/restart; within ~15s /api/health shows a NEW pid and the marketplace version, with no hook involvement. Check ~/.claude-mem/logs/claude-mem-$(date +%F).log for exactly one shutdown and one daemon start (no duplicate-refusal lines).
  • npm test → green.

Anti-pattern guards

  • The successor spawn happens ONLY on reason === 'restart'stop must stay kill-only.
  • Never spawn before the port is confirmed free — that recreates EXIT PATH 2/3 duplicate suicides.
  • Do not add respawn logic to flushResponseThen itself or to the Windows-managed IPC branch (process.send path, Server.ts:284-289) — Windows wrapper already owns restart there.

Phase 4: One spawn gate — shared ensureWorker() with a spawn lockfile; kill the resolver asymmetry

Why: Three spawn paths (hooks via worker-utils, MCP via worker-spawner, CLI via spawnDaemon) with two different bun resolvers and no mutual exclusion; logs show 3 launchers colliding within one second.

What to implement

  1. New module src/shared/worker-spawn-gate.ts:
    • acquireSpawnLock(): booleanwriteFileSync(join(DATA_DIR, 'spawn.lock'), JSON.stringify({pid: process.pid, startedAt: new Date().toISOString()}), {flag: 'wx'}) in try/catch. On EEXIST: statSync the lock; if mtimeMs older than 30_000ms, unlinkSync and retry ONCE; else return false.
    • releaseSpawnLock(): void — unlink, owner-checked (read it; only delete if pid === process.pid), errors swallowed.
    • Use DATA_DIR from src/shared/paths.ts — never homedir() directly.
  2. In src/shared/worker-utils.ts ensureWorkerRunning() (spawn section, lines ~332-351): wrap the spawn in the lock — if acquireSpawnLock() fails, skip the spawn and go straight to waitForWorkerPort/waitForWorkerReadiness (someone else is spawning; wait for their worker). Release in finally.
  3. Same wrap in src/services/worker-spawner.ts ensureWorkerStarted() around the spawnDaemon call (line ~132).
  4. Resolver unification: delete resolveBunRuntime() from worker-utils.ts (lines 217-235) and import/re-export resolveWorkerRuntimePath from ProcessManager.ts (already exported; it strictly supersedes — adds BUN_PATH, ~/.bun/bin, brew, snap fallbacks). This closes the kill-then-can't-respawn path.
  5. src/servers/mcp-server.ts (lines ~42-51): compute WORKER_SCRIPT_PATH preferring the marketplace copy — copy the candidate pattern from resolveWorkerScriptPath() with fallback to the current own-dir resolution. This stops MCP servers in stale cache dirs from spawning stale workers.

Documentation references

  • Write-discipline reference for the lock: writeJsonFileAtomic, src/npx-cli/utils/paths.ts:124-205 (the wx-flag + cleanup-on-error shape; the lock is simpler — no rename needed, wx IS the atomicity).
  • Spawn call sites table: Phase 0 / spawn-path report §7.
  • resolveWorkerRuntimePath chain: ProcessManager.ts:63-125.

Verification checklist

  • New test tests/shared/worker-spawn-gate.test.ts (temp dir via CLAUDE_MEM_DATA_DIR + dynamic import — see Phase 6 trap): second acquire fails while held; stale lock (backdate mtime via utimesSync) is broken and re-acquired; release is owner-only.
  • grep -rn "resolveBunRuntime" src/ → no definition in worker-utils (only the ProcessManager import).
  • grep -n "spawnHidden\|spawnDaemon" src/shared/worker-utils.ts src/services/worker-spawner.ts → every spawn site is inside the lock.
  • Race test (manual): run 5 concurrent node "$_P/scripts/bun-runner.js" .../worker-service.cjs start invocations with no worker running; logs must show exactly ONE Starting worker daemon and zero refusing to start duplicate storms.
  • npm test → green (version-recycle contract intact).

Anti-pattern guards

  • The lock gates SPAWNING only — never health checks; a held lock must never make a hook fail, only wait.
  • 30s staleness must use file mtime, not clock-in-content comparisons (test pollution wrote startedAt: 2024-01-01 once already).
  • Do not introduce a lock library; wx flag is sufficient and dependency-free.

Phase 5: Demote the PID file — port + /api/health become the only liveness oracle

Why: The dying worker's shutdown cascade deletes the NEW worker's PID file (src/supervisor/shutdown.ts:88), after which status reports a healthy worker as "not running" (status requires portInUse && pidInfo, worker-service.ts:975-988). /api/health already carries pid, version, workerPath — it subsumes the file.

What to implement

  1. Owner-guarded deletion (the clobber fix):
    • src/supervisor/shutdown.ts (~line 88): before rmSync(pidFilePath), read the file; delete ONLY if its pid === process.pid. A mismatch means a successor already wrote its own — log debug and leave it.
    • src/services/worker-service.ts case 'restart' (~line 964) and case 'stop' (~line 949): replace bare removePidFile() with the same owner-or-dead check: delete only if the recorded pid is the one we just shut down (captured in Phase 2's pre-shutdown health probe) or the recorded pid is not alive.
  2. case 'status' (lines ~975-988): source of truth becomes GET /api/health — report pid, version, uptime, workerPath from the response; fall back to "port in use but health unreachable (wedged?)" and "not running". Drop the readPidFile() requirement.
  3. --daemon duplicate gate (lines ~1167-1182): reorder — port/health probe FIRST (it's ground truth), PID file second (advisory only, for the no-port-bound-yet boot window). Keep writePidFile/touchPidFile as diagnostics — the worker itself remains the only writer.

Documentation references

  • Clobber chain: GracefulShutdown.ts:55supervisor.stop()runShutdownCascadeshutdown.ts:87-98.
  • verifyPidFileOwnership + startToken: ProcessManager.ts (PID APIs §3 of spawn-path report).
  • Health shape: Server.ts:212-233.

Verification checklist

  • Update tests/infrastructure/process-manager.test.ts expectations where deletion semantics changed (owner-guard) — coordinate with Phase 6 which relocates this file's data dir.
  • New test: old-worker-cleanup-spares-successor — write PID file as {pid: 99999...} (not own pid), run the shutdown cascade deletion step, assert file survives.
  • Manual: start worker, rm ~/.claude-mem/worker.pid, run worker-service.cjs status → must still print "Worker is running" with pid/version from health.
  • Full ping-pong scenario from Phase 3's manual check still converges.
  • npm test → green.

Anti-pattern guards

  • Do NOT delete writePidFile entirely — external tooling may read it; it's demoted to diagnostics, not removed.
  • status must not require BOTH oracles ever again — health wins, full stop.
  • Do not let the boot-window PID check (advisory) exit 1 — duplicate suppression stays exit 0 (Phase 2 changed only the generic-failure path).

Phase 6: Test hygiene — no test ever touches the real ~/.claude-mem again

Why: tests/infrastructure/process-manager.test.ts writes corrupt JSON and sentinel PIDs into the REAL ~/.claude-mem/worker.pid (snapshot-restore shrinks but doesn't close the race window, and a killed test run leaves corruption behind). It also pollutes the shared log, which contaminated this very diagnosis.

What to implement

  1. tests/infrastructure/process-manager.test.ts:
    • Replace the hardcoded DATA_DIR = path.join(homedir(), '.claude-mem') (lines 24-25) with: create mkdtempSync(join(tmpdir(), 'claude-mem-pm-test-')), set process.env.CLAUDE_MEM_DATA_DIR to it.
    • TRAP: DATA_DIR in src/shared/paths.ts:40 is a module-level const computed at import time, and ESM hoists static imports — setting the env var in beforeEach is too late. Copy the fresh-import pattern from tests/shared/worker-utils-version-recycle.test.ts:30-32 (query-param cache-bust dynamic import) OR set the env var at the very top of the file before any await import(...) of ProcessManager modules (convert the static imports of code-under-test to dynamic).
    • Delete the snapshot-restore of the real PID file (lines 28-43) — unnecessary once isolated; rmSync(tempDir, {recursive: true, force: true}) in afterAll (copy Pattern A: tests/write-json-file-atomic.test.ts:34-40).
  2. Sweep for other offenders: grep -rn "homedir()" tests/ | grep -v node_modules — any test resolving the real data dir gets the same treatment.
  3. Add a tripwire to tests/preload.ts (it already exists for the PostHog mock): if CLAUDE_MEM_DATA_DIR is unset, set it to a per-run mkdtempSync dir so NO test can fall through to ~/.claude-mem. (Env restoration discipline: copy tests/env-isolation.test.ts:31-90.)

Documentation references

  • Offending lines verbatim: process-manager.test.ts:24-25, 28-43, 109, 499-503 (test-hygiene report §1).
  • Copy-ready isolation patterns: Pattern A write-json-file-atomic.test.ts:34-40; Pattern C env restore env-isolation.test.ts:31-90.
  • Env override honored at src/shared/paths.ts:19-20.

Verification checklist

  • bun test tests/infrastructure/ green.
  • While the suite runs: ls -la ~/.claude-mem/worker.pid; shasum ~/.claude-mem/worker.pid before/after → byte-identical (or consistently absent).
  • grep -rn "homedir()" tests/ → no hit resolves a data-dir path for writes.
  • Full npm test green.

Anti-pattern guards

  • Do not mock fs to fake isolation — real temp dirs only, the tests exercise real file semantics.
  • Do not weaken assertions to dodge the trap; the fresh-import pattern is proven in-repo.

Phase 7: Final Verification

  1. Full suite + typecheck: npx tsc --noEmit and npm test — zero failures.
  2. Anti-pattern greps (all must be empty):
    • grep -rn "INSTALLED_CACHE_PATH\|detectInstalledVersion" scripts/
    • grep -n "admin/restart" scripts/sync-marketplace.cjs
    • grep -n "37777" src/ scripts/ -r (hardcoded port)
    • grep -rn "resolveBunRuntime" src/shared/worker-utils.ts
    • grep -rn "homedir(), '.claude-mem'" tests/
  3. Triple-restart soak: run npm run build-and-sync three times consecutively; every run must end Worker restart verified; /api/health pid changes each time, version stays the built version; grep -c "refusing to start duplicate\|Failed to start server" ~/.claude-mem/logs/claude-mem-$(date +%F).log shows no new occurrences during the soak.
  4. Stale-launcher convergence test (the original bug): manually spawn a worker from an OLD cache dir (node ~/.claude/plugins/cache/thedotmack/claude-mem/<old>/scripts/bun-runner.js .../worker-service.cjs start), then trigger any hook (or ensureWorkerRunning via a session). Expect in the log: exactly ONE Worker version mismatch — recycling stale worker, then the self-replacing restart, then a healthy marketplace-version worker — NO ping-pong (no second recycle within 5 minutes).
  5. Observation check: confirm claude-mem itself recorded observations during the soak (the worker was restarting underneath the recorder — the pipeline must survive its own surgery): query the 10 most recent rows in ~/.claude-mem/claude-mem.db observations and confirm fresh timestamps.
  6. Docs: update docs/public/troubleshooting.mdx if it documents the old restart semantics; CHANGELOG is auto-generated — do not edit.