Process-tree teardown and event-driven supervision

Status: accepted; Stage 1 and Stage 2 implemented.

Context

When Bear supervises a build (bear -- make) and a termination signal arrives, the whole process subtree underneath the build must stop within the sub-one-second budget set by the signal-forwarding requirement, and Bear should still be able to write the partial compile_commands.json it has collected so far. The original supervise() only called child.kill() on the direct child, which sends SIGKILL. That (a) left grandchildren reparented to init and still running, and (b) because SIGKILL cannot be trapped, gave neither the build's own trap handler nor in-flight compilers any chance to wind down.

Three mechanisms can terminate an entire subtree, one per platform family:

Platform	Mechanism	A child can escape it?
any unix	process group (`setsid`/`setpgid`) + `killpg`	yes - by calling `setsid` itself
Linux	cgroup v2 `cgroup.kill`	no - unprivileged moves are denied
Windows	Job Object (`KILL_ON_JOB_CLOSE`)	no

Process groups are portable across unix and need no new dependency (libc is already present); the same group-kill technique is already proven in the version-probe watchdog (semantic/interpreters/compilers/probe.rs), though Stage 1 uses the lighter Command::process_group(0) (safe std, keeps the session) rather than the watchdog's unsafe setsid. Their one gap is a child that deliberately setsids away to daemonize. cgroups close that gap but require cgroup v2, a writable/delegated cgroup directory, and clone3(CLONE_INTO_CGROUP) or a pre_exec write to cgroup.procs - none exposed by std::process::Command - plus a runtime fallback. Job Objects need a windows-sys dependency, and Bear has too few Windows users to justify designing that path yet.
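A minimal sketch of the process-group mechanism, assuming a unix host: the child is spawned with Command::process_group(0) so that it leads a new group, and a later killpg reaches whatever is still in that group. The helper names (spawn_in_own_group, signal_group) are illustrative, not Bear's actual functions, and killpg is declared by hand only to keep the sketch free of crates; real code would call it through libc.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

// killpg(3) lives in the C library that std already links on unix.
// Hand-written declaration for the sketch only; real code would use `libc::killpg`.
// (`unsafe extern` needs Rust 1.82 or newer.)
unsafe extern "C" {
    fn killpg(pgrp: i32, sig: i32) -> i32;
}

const SIGKILL: i32 = 9; // same value on Linux, macOS, and the BSDs

/// Spawn `cmd` as the leader of a new process group (pgid == child pid).
/// Unlike `setsid`, this keeps the child in the caller's session.
fn spawn_in_own_group(cmd: &mut Command) -> io::Result<Child> {
    cmd.process_group(0).spawn()
}

/// Send `sig` to every process that is still in the group led by `leader`.
fn signal_group(leader: &Child, sig: i32) -> io::Result<()> {
    let pgid = leader.id() as i32;
    // SAFETY: killpg takes two plain integers and touches no memory we own.
    if unsafe { killpg(pgid, sig) } == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}
```

For example, spawning sh -c "sleep 30 & sleep 30" through spawn_in_own_group and then calling signal_group(&child, SIGKILL) stops the shell and both sleeps, whereas child.kill() alone would leave at least the background sleep running. A descendant that has called setsid is no longer in the group and is not reached.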

Two further forces shaped the design:

Waiting without polling. std::process::Child::wait() blocks uninterruptibly and cannot watch for a signal at the same time, which is why the original loop polled with try_wait() + sleep(100ms). A SIGCHLD-driven blocking loop (portable, reuses the already-present signal-hook) removes the poll and its latency; a Linux-only poll() over a pidfd + signalfd would be cleaner, but it requires Linux 5.3+ and more libc code.
Nested supervisors. In wrapper mode the chain is bear-driver -> make -> bear-wrapper -> real cc (the wrapper is a Rust binary on the same supervise() path, not a shell script). If every level created a new process group, the build would fragment into many groups and a top-level killpg would miss the deeper processes - re-opening the very escape hole grouping is meant to close.
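One way to keep nested supervisors from fragmenting the tree is to let each caller choose whether its child starts a new group. The sketch below assumes a hypothetical Grouping enum and spawn_with helper (neither name comes from Bear): the driver would pass NewGroup, the wrapper Inherit.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

/// Hypothetical per-caller policy; the names are illustrative, not Bear's.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Grouping {
    /// Outermost supervisor (the driver): the child leads a new process group.
    NewGroup,
    /// Nested supervisor (the wrapper): the child stays in the inherited group.
    Inherit,
}

/// Spawn `cmd`, creating a new process group only when the caller asks for one.
fn spawn_with(cmd: &mut Command, grouping: Grouping) -> io::Result<Child> {
    if grouping == Grouping::NewGroup {
        cmd.process_group(0);
    }
    cmd.spawn()
}
```

With this shape, everything spawned below the driver with Inherit shares the driver-created group, so a single group-wide signal from the top still reaches the real compiler at the bottom of the chain.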

Decision

Two-stage tree teardown. Stage 1: process_group(0) + killpg in the cfg-selected unix platform module. Stage 2: a Linux-gated cgroup module that places the build in a fresh cgroup v2 (the child joins via a pre_exec write to cgroup.procs) and, on teardown, writes to cgroup.kill to kill every process in the cgroup - including a descendant that setsids out of the process group. Stage 2 is best-effort: when cgroup v2 is unavailable or its directory is not writable/delegated it returns nothing and teardown falls back to the Stage 1 process-group SIGKILL. Both stages still go through the leader's single grace-then-force escalation; the graceful real-signal phase stays group-based because cgroup.kill can only SIGKILL. A Windows Job Object is a possible later third path; non-unix keeps single-process child.kill().
Only the outermost supervisor groups. The driver creates the group and owns the authoritative killpg; nested wrappers inherit the group and merely forward, so a single top-level killpg reaches the whole tree. Grouping is therefore a per-caller policy, not baked unconditionally into shared supervise().
Graceful, real-signal forwarding. Forward the signal Bear actually received (not a hardcoded one) to the group, give the tree a grace window to wind down and let Bear write the partial database, then escalate to SIGKILL.
SIGCHLD-driven event loop replaces the poll; the grace-then-SIGKILL escalation runs off a deadline inside that loop. pidfd + signalfd is a deferred Linux-only optimization behind the same wait function.
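The grace-then-force escalation in the list above can be sketched as a small state machine that the event loop advances on every wake-up. This is an illustration under assumptions, not Bear's implementation: the type and function names are hypothetical, the length of the grace window is left to the caller, and ignoring repeated signals during the grace window is a choice made here for brevity.

```rust
use std::time::{Duration, Instant};

/// What the supervisor should do after this wake-up of its wait loop.
#[derive(Debug, PartialEq)]
enum Step {
    /// Nothing to send: block until the next event, or until `deadline` if set.
    Wait { deadline: Option<Instant> },
    /// Forward this signal number (the one actually received) to the group.
    Forward(i32),
    /// Grace window expired: SIGKILL the group (or write cgroup.kill).
    ForceKill,
}

/// Teardown progress, carried between wake-ups.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Teardown {
    Running,
    Graceful { deadline: Instant },
    Forced,
}

/// Advance the teardown state. `received` is the signal seen on this wake-up, if any.
fn advance(
    state: Teardown,
    received: Option<i32>,
    now: Instant,
    grace: Duration,
) -> (Teardown, Step) {
    match (state, received) {
        // First termination signal: forward it and start the grace window.
        (Teardown::Running, Some(sig)) => {
            let deadline = now + grace;
            (Teardown::Graceful { deadline }, Step::Forward(sig))
        }
        (Teardown::Running, None) => (state, Step::Wait { deadline: None }),
        // Deadline reached while the tree is still alive: escalate once.
        (Teardown::Graceful { deadline }, _) if now >= deadline => {
            (Teardown::Forced, Step::ForceKill)
        }
        // Still inside the grace window; later signals do not reset it (assumption).
        (Teardown::Graceful { deadline }, _) => {
            (state, Step::Wait { deadline: Some(deadline) })
        }
        (Teardown::Forced, _) => (state, Step::Wait { deadline: None }),
    }
}
```

Keeping the deadline inside the returned Step lets the same loop block on SIGCHLD with a timeout instead of sleeping in fixed increments, and the caller stops calling advance once the child has been reaped.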

Consequences

No new dependency for Stage 1; libc and signal-hook are already in the tree (the latter needs its iterator feature enabled), and the group-kill technique is borrowed from the existing watchdog.
The poll and its up-to-100ms latency are gone; teardown reacts at signal speed, inside the budget.
The child leaves Bear's process group, so the tty no longer delivers Ctrl-C to the build directly - Bear becomes the sole conduit. This is what makes reliable tree-kill and real-signal forwarding possible and fixes trap support in the non-tty (CI SIGTERM) case; the trade-off is that any gap in Bear's forwarding loses the tty backstop. Accepted.
Stage 2 closes the setsid-escape hole on Linux hosts with a usable cgroup; where none is available the hole remains and the documented process-group fallback applies. A descendant that detaches gets no grace window (it left the group the graceful signal targets) - only the final cgroup.kill. Accepted: a daemon that deliberately detaches forfeits the graceful wind-down.
Each supervised build creates and removes one cgroup directory; a normal build leaves nothing behind, and the kill path's directory cleanup retries briefly because killed processes are reaped asynchronously by init.
The "only the outermost supervisor groups" rule keeps wrapper-mode nesting correct and keeps the wrapper's supervision simple (forward + propagate exit code); the wrapper inherits the leader's cgroup through the child, so one cgroup.kill reaches the whole tree.
The cgroup stays a Linux-only, runtime-detected layer; a Windows Job Object and a pidfd-based wait remain possible later additions.
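The best-effort cgroup lifecycle described above (create, kill, clean up with a short retry) might look roughly like the following. All function names are hypothetical, the parent directory is whatever delegated cgroup v2 directory the caller has detected, and the step that moves the child into the cgroup (the pre_exec write to cgroup.procs) is deliberately left out of this sketch.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;
use std::time::Duration;

/// Try to create a fresh child cgroup under `parent`.
/// Returns None on any failure (no cgroup v2, not writable, name taken),
/// which tells the caller to fall back to the process-group kill.
fn try_create_cgroup(parent: &Path, name: &str) -> Option<PathBuf> {
    let dir = parent.join(name);
    fs::create_dir(&dir).ok()?;
    Some(dir)
}

/// Kill every process in the cgroup by writing "1" to its cgroup.kill file.
fn kill_cgroup(dir: &Path) -> io::Result<()> {
    fs::write(dir.join("cgroup.kill"), "1")
}

/// Remove the cgroup directory, retrying a few times because the kernel
/// refuses the rmdir while killed processes have not yet gone away.
fn remove_cgroup(dir: &Path, attempts: u32, pause: Duration) -> io::Result<()> {
    let attempts = attempts.max(1); // always try at least once
    for attempt in 1..=attempts {
        match fs::remove_dir(dir) {
            Ok(()) => return Ok(()),
            Err(e) if attempt == attempts => return Err(e),
            Err(_) => thread::sleep(pause),
        }
    }
    unreachable!("the final attempt always returns")
}
```

Returning an Option from the creation step keeps the fallback decision in one place: a caller that gets None simply skips the cgroup path and relies on the group-wide SIGKILL.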

Rejected: unifying the probe watchdog on `process_group(0)`

It is tempting to drop the unsafe setsid in the version-probe watchdog (semantic/interpreters/compilers/probe.rs) and reuse Stage 1's safe Command::process_group(0), on the theory that setsid's extra controlling-terminal detach is a no-op once stdin is null and stdout/stderr are pipes. Testing refutes it: under the parallel probe suite, process_group(0) produced intermittent misclassification (3 of 8 runs failed) while setsid did not (0 of 13). setsid gives each short-lived probe its own session with no controlling terminal; process_group(0) leaves it a background group inside the test runner's session and terminal, and under concurrency that difference is observable. The two calls are therefore not interchangeable in general - the lighter one is right for a single supervised build (Stage 1) but wrong for the probe. The probe keeps setsid.

References

Requirement: interception-signal-forwarding
Prior art in-tree: the version-probe watchdog's setsid + killpg teardown (semantic/interpreters/compilers/probe.rs)
Plan: plan.md (repo root, transient)
