plans/recluster-2026-06-04/plan-03.md
The worker/daemon has no robust lifecycle contract. Startup health is checked against the wrong PID, so start reports "process died during startup" even when the process is alive. The PID file is never validated against process identity, so a recycled PID produces a permanent ghost-PID deadlock. The generator's spawned SDK child is SIGTERM'd (exit 143) mid-run, leaving the queue undrained. Bun workers OOM-cascade when the host runs a heavy dev server. Observer transcripts grow unbounded (a single JSONL file reached 1.9 GB). On Windows, the cumulative effect is that no observations are ever generated. These are all the same gap: no identity-validated supervision with bounded resources and honest health.
start always fails 'Process died during startup': waitForHealth checks the wrong PID

Design doc: plans/03-worker-lifecycle.md.

Health-check the actual spawned PID; validate PID-file identity (pid + start-time) before trusting the file or killing the process it names; supervise the SDK child with restart-on-unexpected-exit and queue drain protection; bound memory and transcript size with rotation; converge the Windows zero-observation path on the above.
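The PID-file identity check (pid + start-time) can be sketched as a pure decision function. This is a minimal illustration, not the design doc's implementation: the JSON PID-file layout, the names `PidRecord`, `ProcessProbe`, `parsePidRecord`, and `classifyPid`, and the 2-second tolerance are all assumptions made for the example. How the live start time is obtained is platform-specific and is left behind the injected probe.

```typescript
// Hypothetical PID-file payload: the PID plus the start time recorded at spawn.
interface PidRecord {
  pid: number;
  startedAtMs: number;
}

// What the OS reports for a PID right now; null means no such process.
// The lookup itself is platform-specific and deliberately left abstract.
type ProcessProbe = (pid: number) => { startedAtMs: number } | null;

// "owned":    same PID and same start time, safe to health-check or signal.
// "stale":    no process with that PID, the file can be removed.
// "recycled": the PID now belongs to a different process, never signal it.
type PidVerdict = "owned" | "stale" | "recycled";

// Parse the (assumed JSON) PID file; anything malformed yields null.
function parsePidRecord(text: string): PidRecord | null {
  try {
    const raw = JSON.parse(text) as Partial<PidRecord>;
    if (Number.isInteger(raw.pid) && Number.isFinite(raw.startedAtMs)) {
      return { pid: raw.pid as number, startedAtMs: raw.startedAtMs as number };
    }
  } catch {
    // Unparseable content falls through to null.
  }
  return null;
}

// Compare the recorded identity with what the probe reports for that PID.
function classifyPid(
  record: PidRecord,
  probe: ProcessProbe,
  toleranceMs = 2000, // assumed slack for clock and rounding differences
): PidVerdict {
  const live = probe(record.pid);
  if (live === null) return "stale";
  const drift = Math.abs(live.startedAtMs - record.startedAtMs);
  return drift <= toleranceMs ? "owned" : "recycled";
}
```

Under this sketch, only an "owned" verdict would permit health-checking or killing the recorded PID; "stale" and "recycled" both lead to discarding the file, which is what removes the ghost-PID deadlock.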
| Host | Scenario | Required behavior |
|---|---|---|
| all | start | health checks the real PID; no false "died" |
| all | recycled PID | identity mismatch → no ghost deadlock |
| all | long generation | SDK child survives or restarts; queue drains |
| Windows | host Next.js dev running | no OOM cascade; observations land |
| all | long session | transcript rotates; bounded disk |
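For the "long generation" row, restart-on-unexpected-exit can be expressed as a small policy function that the supervisor consults whenever the SDK child exits. This is a sketch under stated assumptions: the types `ExitEvent`, `RestartPolicy`, and `Decision`, the function `decideOnExit`, and the exponential backoff with a cap are illustrative choices, not something the plan specifies.

```typescript
// Hypothetical description of how the SDK child ended.
interface ExitEvent {
  code: number | null;    // exit code, e.g. 143 when a shell reports SIGTERM
  signal: string | null;  // signal name if the runtime reports one
  requestedStop: boolean; // true only when the supervisor asked for shutdown
}

// Hypothetical restart limits; the values are chosen by the caller.
interface RestartPolicy {
  maxRestarts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

type Decision =
  | { action: "stop" }
  | { action: "give_up" }
  | { action: "restart"; delayMs: number };

// Decide what the supervisor does after the child exits.
function decideOnExit(
  exit: ExitEvent,
  restartsSoFar: number,
  policy: RestartPolicy,
): Decision {
  // A shutdown we asked for, or a clean exit, is not a failure.
  if (exit.requestedStop) return { action: "stop" };
  if (exit.code === 0 && exit.signal === null) return { action: "stop" };

  // Anything else (including exit 143 / SIGTERM mid-run) is unexpected.
  if (restartsSoFar >= policy.maxRestarts) return { action: "give_up" };

  // Exponential backoff, capped so a crash loop cannot stall indefinitely.
  const delayMs = Math.min(
    policy.maxDelayMs,
    policy.baseDelayMs * 2 ** restartsSoFar,
  );
  return { action: "restart", delayMs };
}
```

Keeping the decision separate from the spawning code makes the policy testable without starting a process. Queue drain protection would sit around it: the item that was in flight when the child died would need to be returned to the queue before the restart, which this sketch does not model.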
Env contamination of the SDK subprocess (was plan-06); observer output parsing (plan-11).