[plan-03] Worker / Daemon Lifecycle Hardening — supervision, identity, resource bounds

plans/recluster-2026-06-04/plan-03.md

Defect

The worker/daemon has no robust lifecycle contract. Startup health is checked against the wrong PID, so start reports "Process died during startup" even when the process is alive. The PID file is never validated against process identity, so a recycled PID produces a permanent ghost-PID deadlock. The generator's spawned SDK child receives SIGTERM (exit 143) mid-run, so no observations are inserted and the queue backs up. Bun workers OOM-cascade when the host runs a heavy dev server. Observer transcripts grow without bound (a single 1.9 GB JSONL file). On Windows the cumulative effect is that zero observations are ever generated. These are all the same gap: no identity-validated supervision with bounded resources and honest health reporting.
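
To make the identity gap concrete, here is a minimal sketch of a PID-file check that compares both the PID and the process start time. Every name in it (`PidRecord`, `ProcessInfo`, `classifyPidRecord`, the 2-second tolerance) is an assumption for illustration, not the project's actual code, and how the live start time is obtained is platform-specific and deliberately left out.

```typescript
// Hypothetical PID-file contents: the PID plus the start time recorded at write.
interface PidRecord {
  pid: number;
  startTimeMs: number;
}

// Hypothetical result of looking up a live process; null means no such process.
type ProcessInfo = { pid: number; startTimeMs: number } | null;

type Verdict = "same-process" | "stale-dead" | "stale-recycled";

// Assumed tolerance for rounding differences between recorded and observed start times.
const START_TIME_TOLERANCE_MS = 2000;

function classifyPidRecord(record: PidRecord, live: ProcessInfo): Verdict {
  // Nothing is running under this PID: the file is stale and safe to remove.
  if (live === null) return "stale-dead";
  // A process exists, but it only counts as ours if its start time matches.
  const drift = Math.abs(live.startTimeMs - record.startTimeMs);
  return drift <= START_TIME_TOLERANCE_MS ? "same-process" : "stale-recycled";
}

// Only a "same-process" verdict justifies trusting the file or signalling the PID.
function shouldTrustPidFile(record: PidRecord, live: ProcessInfo): boolean {
  return classifyPidRecord(record, live) === "same-process";
}
```

With a check of this shape, a recycled PID resolves to `stale-recycled`, so the stale file can be discarded instead of blocking startup or causing an unrelated process to be killed.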

Children

  • #2747 — worker-cli start always fails 'Process died during startup' — waitForHealth checks the wrong PID
  • #2726 — Worker PID file not validated against process identity → permanent ghost-PID deadlock (Windows)
  • #2740 — Generator's spawned SDK child gets SIGTERM (exit 143) at ~3 min; no observations insert; queue drowns
  • #2720 — Bun workers OOM cascade on Windows when host project runs Next.js dev (Turbopack)
  • #2754 — Observer session transcripts grow unbounded — single 1.9 GB JSONL, 6.1 GB total
  • #2703 — 0 observations ever generated on Windows (cross-cutting worker defects)
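
For #2747, the following sketch shows one way a startup wait could tie liveness to the PID returned by the spawn call rather than to any other PID. The `waitForHealth` name is taken from the issue title; its signature here, the `HealthDeps` interface, and the default attempt count and interval are assumptions, and the probes are injected so the sketch does not depend on a particular platform.

```typescript
// Hypothetical dependencies, injected so the logic can run without real processes.
interface HealthDeps {
  isAlive: (pid: number) => boolean;   // liveness probe for one specific PID
  isHealthy: () => Promise<boolean>;   // e.g. a request to the worker's health endpoint
  sleep: (ms: number) => Promise<void>;
}

type StartResult = "healthy" | "died" | "timeout";

async function waitForHealth(
  spawnedPid: number,
  deps: HealthDeps,
  attempts = 20,
  intervalMs = 250,
): Promise<StartResult> {
  for (let i = 0; i < attempts; i++) {
    // Liveness is judged on the PID of the process that was actually spawned.
    if (!deps.isAlive(spawnedPid)) return "died";
    if (await deps.isHealthy()) return "healthy";
    await deps.sleep(intervalMs);
  }
  // Still alive but never healthy: report a timeout, not a death.
  return "timeout";
}
```

On Node-compatible runtimes, `isAlive` could be a wrapper around `process.kill(pid, 0)`, which tests for existence without delivering a signal. Separating `died` from `timeout` keeps the reported status honest when the process is alive but slow to come up.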

Fix sequence

Design doc: plans/03-worker-lifecycle.md. Steps, in order:

  • Health-check the actual spawned PID.
  • Validate PID-file identity (pid + start-time) before trusting or killing.
  • Supervise the SDK child with restart-on-unexpected-exit and queue drain protection.
  • Bound memory and transcript size, with rotation.
  • Converge the Windows zero-observation path on the above.
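
The restart-on-unexpected-exit step can be sketched as a pure decision function. This is an illustration under stated assumptions, not the design doc's algorithm: `ExitEvent`, `RestartPolicy`, `decideRestart`, and the exponential backoff are all hypothetical choices. It treats exit code 143 (128 + 15, the shell convention for SIGTERM) and signal-terminated exits as unexpected unless a shutdown was requested.

```typescript
// Hypothetical description of how the SDK child exited.
interface ExitEvent {
  code: number | null;        // exit code, or null when terminated by a signal
  signal: string | null;      // e.g. "SIGTERM", or null for a normal exit
  shutdownRequested: boolean; // true when the supervisor itself asked the child to stop
}

// Hypothetical policy values; a real design would choose its own.
interface RestartPolicy {
  maxRestarts: number; // restarts allowed inside the window
  windowMs: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

type Decision =
  | { action: "stop" }
  | { action: "restart"; delayMs: number }
  | { action: "give-up" };

function decideRestart(
  exit: ExitEvent,
  recentRestartsMs: number[], // timestamps of earlier restarts
  nowMs: number,
  policy: RestartPolicy,
): Decision {
  // Expected exits: a requested shutdown, or a clean exit with code 0.
  if (exit.shutdownRequested) return { action: "stop" };
  if (exit.code === 0 && exit.signal === null) return { action: "stop" };

  // Count restarts inside the sliding window to detect a crash loop.
  const inWindow = recentRestartsMs.filter((t) => nowMs - t <= policy.windowMs).length;
  if (inWindow >= policy.maxRestarts) return { action: "give-up" };

  // Exponential backoff, capped at maxDelayMs.
  const delayMs = Math.min(policy.maxDelayMs, policy.baseDelayMs * Math.pow(2, inWindow));
  return { action: "restart", delayMs };
}
```

Queue drain protection is outside this sketch. The assumption is that pending items stay queued while the child is down and are picked up again after a `restart` decision, and that `give-up` surfaces an error rather than silently dropping work.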

Test matrix

| Host | Scenario | Required behavior |
| --- | --- | --- |
| all | start | health checks the real PID; no false "died" |
| all | recycled PID | identity mismatch → no ghost deadlock |
| all | long generation | SDK child survives or restarts; queue drains |
| Windows | host Next.js dev running | no OOM cascade; observations land |
| all | long session | transcript rotates; bounded disk |
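
For the "long session" row, size-based rotation could be planned as below. This is a sketch in the style of numbered log rotation (`file`, `file.1`, `file.2`, ...). The function name, the policy fields, and the naming scheme are assumptions rather than the project's actual layout, and the function only computes a plan, leaving the filesystem calls to the caller.

```typescript
// Hypothetical policy: size cap per file and number of rotated generations to keep.
interface RotationPolicy {
  maxBytes: number;
  maxFiles: number;
}

// Apply `deletes` first, then `renames` in the listed order.
interface RotationPlan {
  rotate: boolean;
  deletes: string[];
  renames: Array<[string, string]>; // [from, to]
}

function planRotation(
  basePath: string,
  currentBytes: number,
  incomingBytes: number,
  existingGenerations: number, // how many of basePath.1 .. basePath.N exist
  policy: RotationPolicy,
): RotationPlan {
  const plan: RotationPlan = { rotate: false, deletes: [], renames: [] };
  // An empty file is never rotated, and no rotation is needed while under the cap.
  if (currentBytes === 0 || currentBytes + incomingBytes <= policy.maxBytes) return plan;

  plan.rotate = true;
  // Walk from the oldest generation down so renames never overwrite each other.
  for (let i = existingGenerations; i >= 1; i--) {
    if (i + 1 > policy.maxFiles) plan.deletes.push(`${basePath}.${i}`);
    else plan.renames.push([`${basePath}.${i}`, `${basePath}.${i + 1}`]);
  }
  // The active file becomes generation 1, or is dropped when no history is kept.
  if (policy.maxFiles >= 1) plan.renames.push([basePath, `${basePath}.1`]);
  else plan.deletes.push(basePath);
  return plan;
}
```

Under this scheme, disk use per transcript is bounded by roughly `(maxFiles + 1) * maxBytes`, provided no single record exceeds `maxBytes` on its own.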

Out of scope

Env contamination of the SDK subprocess (was plan-06); observer output parsing (plan-11).