docs/design/polecat-lifecycle-patrol.md
Bead: gt-t6muy
Date: 2026-02-20
Author: capable (gastown polecat)
Status: Implemented — core lifecycle shipped, branch cleanup shipped, mayor notify pending
Updated: 2026-03-07 (gt-o8g8 implementation audit by bear)
Related: gt-dtw9u (Witness monitoring), gt-qpwv4 (Completion detection), gt-6qyt1 (Refinery queue), gt-budeb (Auto-nuke), gt-5j3ia (Swarm aggregation), gt-1dbcp (Polecat auto-start), w-gt-004 (Wasteland lifecycle item)
This document formalizes how Deacon, Witness, Refinery, and Polecats coordinate to move work through the Gas Town propulsion system. It captures the session-per-step model, defines the two cleanup stages, designs the per-rig lifecycle channel, and resolves open design questions about step granularity, recycling, and spawning.
Core insight: Polecats do NOT complete complex molecules end-to-end. Instead, each molecule step gets one polecat session. The sandbox (branch, worktree) persists across sessions. Sessions are the pistons; sandboxes are the cylinders.
See concepts/polecat-lifecycle.md for the relay race model.
Session cycles and step cycles are distinct concepts:
| Concept | Trigger | What Changes | What Persists |
|---|---|---|---|
| Session cycle | Handoff, compaction, crash | Claude context window | Branch, worktree, molecule state |
| Step cycle | Step bead closed | Current step focus | Branch, worktree, remaining steps |
A single step may span multiple session cycles (if the step is complex or compaction occurs). Multiple steps may fit in a single session (if steps are small and context permits). The session-per-step model is a design target, not a hard constraint.
Session cycle triggers:

| Trigger | Who Initiates | What Happens |
|---|---|---|
| Step completion | Polecat | bd close <step> then gt handoff for next step |
| Context filling | Claude Code | Auto-compaction; PreCompact hook saves state |
| Crash/timeout | Infrastructure | Witness detects, respawns session |
| `gt done` | Polecat | Final step; submit to merge queue, go idle (sandbox preserved) |
Between sessions, state is preserved through:
- `hook_bead` on the agent bead persists across sessions
- Agent bead fields: `agent_state`, `cleanup_status`, `hook_bead` (sketched below)

The new session discovers its position via:
gt prime --hook # Loads role context, reads hook
bd mol current # Discovers which step is next
bd show <step-id> # Reads step instructions
No explicit "handoff payload" is needed. The beads state IS the handoff.
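For concreteness, here is a minimal Go sketch of the agent-bead state that survives session cycles. The field names come from this document; the struct shape and Go types are assumptions.

```go
// Sketch of the durable agent-bead state. A new session reads this
// (via gt prime --hook) to rediscover its position in the molecule.
package polecat

type AgentState string

const (
	StateWorking AgentState = "working" // hooked to a molecule
	StateIdle    AgentState = "idle"    // parked, sandbox preserved
)

// AgentBead is the persistent identity record for a polecat.
// Sessions come and go; this record does not.
type AgentBead struct {
	Name          string     // polecat name, stable across sessions
	AgentState    AgentState // agent_state field
	CleanupStatus string     // cleanup_status: "clean" or "dirty"
	HookBead      string     // hook_bead: molecule bead ID, "" when cleared
}
```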
Step cleanup (intermediate) is triggered when a step completes but more steps remain in the molecule.
| Action | Result |
|---|---|
| Close step bead | bd close <step-id> |
| Session cycles | gt handoff (voluntary) or crash recovery |
| Sandbox persists | Branch, worktree, uncommitted work all survive |
| Molecule persists | Remaining steps still open, hook still set |
| Identity persists | Agent bead unchanged, CV accumulates |
Who handles it:
- The polecat, voluntarily (`gt handoff`)
- The infrastructure, on crash recovery (`SessionManager.Start`)
- The daemon, on external shutdown (`LIFECYCLE:Shutdown` → witness)

Molecule cleanup (final) is triggered when the molecule's final step completes and work is submitted.
| Action | Result |
|---|---|
| Polecat runs `gt done` | Pushes branch, submits MR, sets `cleanup_status=clean` |
| Polecat sets agent state | agent_state=idle, hook_bead cleared |
| Polecat kills session | Session terminated, sandbox preserved |
| Witness receives POLECAT_DONE | Acknowledges idle transition |
| Refinery merges | Squash-merge to main, closes MR and source issue |
| Identity survives | Agent bead still exists; CV chain has new entry; polecat ready for reuse |
STEP CLEANUP (intermediate)        MOLECULE CLEANUP (final)
┌────────────────────┐        ┌────────────────────────────┐
│ Step bead: closed  │        │ All step beads: closed     │
│ Session: terminated│        │ Session: terminated        │
│ Sandbox: ALIVE     │        │ Sandbox: PRESERVED (idle)  │
│ Molecule: ACTIVE   │        │ Molecule: SQUASHED         │
│ Hook: SET          │        │ Hook: CLEARED              │
│ Agent bead: working│        │ Agent bead: idle           │
│ Branch: ALIVE      │        │ Branch: PUSHED (idle)      │
└────────────────────┘        └────────────────────────────┘
The cleanup pipeline is a chain of handoffs, not a monolithic operation:
Polecat calls gt done
│
├── Sets cleanup_status=clean on agent bead
├── Pushes branch to origin
├── Creates MR bead (label: gt:merge-request)
├── Sends POLECAT_DONE mail to witness
└── Session exits
│
▼
Witness receives POLECAT_DONE
│
├── Checks cleanup_status (ZFC: trust polecat self-report)
├── If clean → sends MERGE_READY to refinery
├── If dirty → creates cleanup wisp (cannot auto-nuke)
└── Nudges refinery session
│
▼
Refinery processes MERGE_READY
│
├── Claims MR (sets assignee)
├── Acquires merge slot (serialized push lock)
├── Runs quality gates
├── Squash-merges to main
├── Closes MR bead and source issue
├── Sends MERGED mail to witness
└── Releases merge slot
│
▼
Witness receives MERGED
│
├── Verifies commit is on main (all remotes)
├── Checks cleanup_status
├── Acknowledges merge (polecat already idle, sandbox preserved)
└── If dirty → warns (shouldn't happen post-merge)
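The witness's side of this chain is small enough to sketch. This is a hedged illustration, not the real handler: the `Witness` type, `Mailer` interface, and helper names are assumptions, and the actual logic lives in `witness/handlers.go`.

```go
// Sketch of POLECAT_DONE handling per the pipeline above.
package witness

// Mailer abstracts gt mail; an assumption for this sketch.
type Mailer interface {
	Send(addr, subject string) error
}

type Witness struct {
	rig  string
	mail Mailer
}

func (w *Witness) handlePolecatDone(name, cleanupStatus string) error {
	// ZFC: trust the polecat's self-reported cleanup_status.
	if cleanupStatus == "clean" {
		// Clean exit: hand the merge request to the refinery.
		return w.mail.Send(w.rig+"/refinery", "MERGE_READY "+name)
	}
	// Dirty sandbox: cannot auto-nuke, so file a cleanup wisp
	// for an agent with judgment to inspect.
	return w.createCleanupWisp(name)
}

func (w *Witness) createCleanupWisp(name string) error {
	return nil // placeholder: the real code creates a wisp bead
}
```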
Each stage can fail independently. Recovery is handled by the next patrol cycle:
| Failure | Detection | Recovery |
|---|---|---|
| `gt done` fails mid-execution | Zombie state: session alive, done-intent label | Witness `DetectZombiePolecats()` finds stuck-in-done, recovers |
| POLECAT_DONE mail lost | Witness patrol: finds dead session with `hook_bead` | `DetectZombiePolecats()` with agent-dead-in-session |
| Merge conflict | Refinery `doMerge()` detects | Creates conflict resolution task, blocks MR |
| MERGED mail lost | Refinery closed the bead; witness patrol finds closed bead with live session | `DetectZombiePolecats()` bead-closed-still-running |
| Nuke fails | Session still running after kill attempt | Next patrol detects zombie, retries nuke |
The per-rig polecat channel is implemented using the existing gt mail system.
This was chosen over beads-based queues or state files because mail addresses (`gastown/witness`) already map to rig-level agents.

Each rig has implicit lifecycle channels via existing mail routing:
| Channel | Address | Purpose | Serviced By |
|---|---|---|---|
| Polecat lifecycle | <rig>/witness | Recycle, nuke, health requests | Witness patrol |
| Merge queue | <rig>/refinery | MERGE_READY, conflict reports | Refinery patrol |
| Rig coordination | <rig>/witness | Spawn requests, escalations | Witness |
| Town coordination | mayor/ | Cross-rig, strategic | Mayor |
Messages in the polecat lifecycle channel follow the existing witness protocol (`protocol.go`):
| Subject Pattern | Type | Sender | Action |
|---|---|---|---|
| `POLECAT_DONE <name>` | Completion | Polecat | Verify clean, forward to refinery |
| `LIFECYCLE:Shutdown <name>` | External shutdown | Daemon | Auto-nuke or cleanup wisp |
| `LIFECYCLE:Cycle <name>` | Session restart | Daemon | Kill and restart session |
| `HELP: <topic>` | Escalation | Polecat | Witness evaluates, relays if needed |
| `MERGED <id>` | Post-merge | Refinery | Verify merge, acknowledge (sandbox preserved) |
| `MERGE_FAILED <id>` | Merge failure | Refinery | Notify polecat, rework needed |
| `RECOVERED_BEAD <id>` | Orphan recovery | Witness | Deacon re-dispatches work |
| `GUPP_VIOLATION: <name>` | Stall detected | Daemon | Witness investigates |
| `ORPHANED_WORK: <name>` | Dead session + work | Daemon | Witness recovers or nukes |
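Dispatch on these subjects is a simple prefix match. A sketch in the spirit of `protocol.go` follows; the returned category names are illustrative, not the real parser's vocabulary.

```go
package witness

import "strings"

// dispatch classifies a lifecycle message by its subject prefix.
func dispatch(subject string) string {
	switch {
	case strings.HasPrefix(subject, "POLECAT_DONE "):
		return "completion" // verify clean, forward to refinery
	case strings.HasPrefix(subject, "LIFECYCLE:Shutdown "):
		return "shutdown" // auto-nuke or cleanup wisp
	case strings.HasPrefix(subject, "LIFECYCLE:Cycle "):
		return "cycle" // kill and restart session
	case strings.HasPrefix(subject, "HELP: "):
		return "escalation" // witness evaluates, relays if needed
	case strings.HasPrefix(subject, "MERGED "):
		return "post-merge"
	case strings.HasPrefix(subject, "MERGE_FAILED "):
		return "merge-failure"
	case strings.HasPrefix(subject, "GUPP_VIOLATION: "):
		return "stall"
	case strings.HasPrefix(subject, "ORPHANED_WORK: "):
		return "orphan"
	default:
		return "unknown" // leave in inbox for manual triage
	}
}
```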
The witness processes its channel during patrol cycles. Processing is first-come-first-served within each cycle. The patrol pattern:
Witness patrol cycle:
│
├── 1. Check inbox (gt mail inbox)
│ └── Process lifecycle messages in order
│
├── 2. Detect zombie polecats
│ └── For each zombie: nuke or escalate
│
├── 3. Detect orphaned beads
│ └── For each orphan: reset status, mail deacon
│
├── 4. Detect stalled polecats
│ └── For each stalled: nudge or escalate
│
├── 5. Check for pending spawns
│ └── Process spawn requests from daemon
│
└── 6. Write patrol receipt
└── Machine-readable summary of findings
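The same cycle as code, for orientation. Every function name and the `PatrolReceipt` shape below are assumptions that mirror the six steps above.

```go
package witness

// PatrolReceipt is the machine-readable summary written in step 6.
type PatrolReceipt struct {
	Mail, Zombies, Orphans, Stalled, Spawns int // items handled per step
}

// Stubs standing in for the real patrol checks.
func processInbox() int          { return 0 }
func detectZombiePolecats() int  { return 0 }
func detectOrphanedBeads() int   { return 0 }
func detectStalledPolecats() int { return 0 }
func processPendingSpawns() int  { return 0 }
func writeReceipt(PatrolReceipt) {}

func patrol() PatrolReceipt {
	var r PatrolReceipt
	r.Mail = processInbox()             // 1. lifecycle messages, in order
	r.Zombies = detectZombiePolecats()  // 2. nuke or escalate each zombie
	r.Orphans = detectOrphanedBeads()   // 3. reset status, mail deacon
	r.Stalled = detectStalledPolecats() // 4. nudge or escalate
	r.Spawns = processPendingSpawns()   // 5. spawn requests from daemon
	writeReceipt(r)                     // 6. audit trail
	return r
}
```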
The witness is the primary consumer, but the design supports opportunistic servicing by other patrol agents:
| Agent | When It Services | What It Can Do |
|---|---|---|
| Witness | Every patrol cycle | Full lifecycle: spawn, nuke, escalate |
| Deacon | During rig-wide patrol | Detect unserviced requests, nudge witness |
| Daemon | Every heartbeat tick | Detect dead sessions, send LIFECYCLE messages |
| Refinery | During merge processing | Send MERGED/MERGE_FAILED to witness |
This creates redundant monitoring: if the witness misses a message, the deacon or daemon detects the resulting state (dead session, orphaned bead) and either handles it directly or nudges the witness.
As long as three conditions hold, a molecule WILL eventually complete:
- The hook remains set (`hook_bead` set on agent bead)
- The sandbox (branch, worktree) remains intact
- The witness keeps patrolling

GUPP ensures that when a session starts with a hook, it executes. The hook persists across session cycles. The sandbox provides continuity. The witness provides resurrection. Together, these guarantee eventual completion.
┌─────────────────────────────────────────────┐
│ COMPLETION LOOP │
│ │
│ Session spawns → gt prime → discovers hook │
│ │ │
│ ▼ │
│ GUPP fires → execute current step │
│ │ │
│ ▼ │
│ Step complete → bd close → handoff │
│ │ │
│ ▼ │
│ More steps? ──yes──▶ Respawn session ──┐ │
│ │ │ │
│ no │ │
│ │ │ │
│ ▼ │ │
│ gt done → merge → nuke │ │
│ │ │
│ Session crashes? ──▶ Witness respawns ─┘ │
│ │
└─────────────────────────────────────────────┘
| Failure | Effect | Recovery |
|---|---|---|
| Witness down | No respawn on crash | Deacon detects, restarts witness |
| Sandbox corrupted | Branch or worktree broken | RepairWorktree() or nuke and respawn |
| Hook cleared accidentally | GUPP doesn't fire | Witness DetectOrphanedBeads() finds in-progress bead, resets for re-dispatch |
| Dolt server down | Cannot read beads state | Daemon auto-restarts Dolt; polecat retries |
| Crash loop (3+ crashes) | Same step keeps failing | Witness escalates to mayor; filed as bug |
The system prioritizes liveness (work eventually completes) over strict safety (no duplicate work).
Gas Town has four agents that perform patrol (periodic health monitoring):
| Agent | Scope | Frequency | Key Checks |
|---|---|---|---|
| Daemon | Town-wide | 3-minute heartbeat | Session liveness, GUPP violations, orphaned work |
| Boot/Deacon | Town-wide | Per daemon tick | Deacon health, witness health, cross-rig issues |
| Witness | Per-rig | Continuous | Polecat health, zombie detection, completion handling |
| Refinery | Per-rig | On demand | Merge queue processing, conflict detection |
Multiple agents observing overlapping state is intentional redundancy:
Daemon Deacon
(mechanical) (intelligent)
│ │
┌───────────┼───────────┐ ┌──────────┼──────────┐
│ │ │ │ │ │
Session GUPP Orphan Witness Refinery Cross-rig
liveness violations work health health convoy
│ │ │ │ │
└───────────┤ │ │ │
│ │ │ │
▼ ▼ ▼ ▼
Witness Witness Refinery
(per-rig patrol) (responds) (responds)
│
┌───────────┼───────────┐
│ │ │
Zombie Orphaned Stalled
detection beads polecats
Key property: If any single patrol agent fails, the others detect the resulting state degradation and compensate. The daemon detects dead sessions. The deacon detects dead witnesses. The witness detects dead polecats.
Daemon ───LIFECYCLE:──────▶ Witness inbox
Daemon ───GUPP_VIOLATION:─▶ Witness inbox
Daemon ───ORPHANED_WORK:──▶ Witness inbox
Deacon ◀──heartbeat.json──── Daemon
Deacon ───nudge────────────▶ Witness (if stale)
Deacon ───nudge────────────▶ Refinery (if stale)
Witness ──MERGE_READY:────▶ Refinery inbox
Witness ──RECOVERED_BEAD:─▶ Deacon (for re-dispatch)
Witness ──patrol receipt───▶ Beads (audit trail)
Refinery ─MERGED:─────────▶ Witness inbox
Refinery ─MERGE_FAILED:───▶ Witness inbox
Refinery ─convoy check─────▶ Deacon (for stranded convoys)
All patrol agents converge on the same observable state: beads (via Dolt), git (via branches and worktrees), and tmux (via session liveness). No agent maintains private state that others depend on. This is the "discover, don't track" principle applied to monitoring.
If state diverges (e.g., a message is lost), the next patrol cycle re-derives state from observables and self-heals.
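A sketch of what re-derivation looks like: classify a polecat purely from observables, with no private tracking state. The probe helpers and status names here are hypothetical.

```go
package patrol

// Status is derived fresh each patrol cycle; nothing is cached.
type Status int

const (
	StatusIdle            Status = iota // no session, no hook
	StatusWorking                       // session executing its hook
	StatusNeedsRespawn                  // dead session, hook still set
	StatusZombieCandidate               // session outlived its work
)

// Hypothetical probes over the shared observables.
func tmuxSessionAlive(name string) bool { return false } // tmux liveness
func hookBeadSet(name string) bool      { return false } // beads via Dolt

func observe(name string) Status {
	alive := tmuxSessionAlive(name)
	hook := hookBeadSet(name)
	switch {
	case alive && hook:
		return StatusWorking
	case !alive && hook:
		return StatusNeedsRespawn // witness/daemon restarts the session
	case alive && !hook:
		return StatusZombieCandidate // e.g. bead closed, session running
	default:
		return StatusIdle // parked sandbox awaiting reuse
	}
}
```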
Question: How many logical steps per physical molecule step? How many steps per polecat session?
Answer: Use formulas to define granularity, and let context pressure determine session boundaries.
Step granularity guidelines:
| Step Type | Granularity | Example |
|---|---|---|
| Setup / teardown | One physical step | "Set up working branch" |
| Implementation | One per logical unit | "Implement the solution" (may span sessions) |
| Verification | One per check type | "Run quality checks", "Self-review" |
| Handoff | One per lifecycle event | "Commit changes", "Submit work" |
The `mol-polecat-work` formula currently uses 10 steps, which is appropriate for most work.
Session-per-step is a guideline, not a rule. A polecat may complete multiple steps in one session if context permits. The key constraint is that each step is closed individually (no batch-closing — the Batch-Closure Heresy).
Anti-patterns:
- Steps as fine-grained as individual `git add` commands (overhead exceeds value)

Question: When is mechanical intervention (daemon-driven) appropriate vs agent-driven (polecat requests its own recycle)?
Answer: Prefer explicit self-recycling. Use mechanical intervention only as a safety net.
The spectrum:
AGENT-DRIVEN (preferred) MECHANICAL (safety net)
├── gt done (polecat goes idle) ├── Daemon detects dead session
├── gt handoff (polecat self-cycles) ├── Daemon detects GUPP violation
├── gt escalate (polecat asks help) ├── Witness zombie sweep
└── HELP mail (polecat signals) └── Deacon restart on stale heartbeat
Design principle: The polecat is the authority on its own state. External intervention should only occur when the polecat cannot speak for itself (dead session, hung process, stuck-in-done).
Concrete thresholds (agent-determined, not hardcoded):
The daemon uses broad thresholds for safety-net detection:
- Dead session with `hook_bead` but no progress
- Hung session (no output past `HungSessionThresholdMinutes`)
- Lingering `done-intent` label after `gt done`

These thresholds are intentionally generous. The goal is to catch truly stuck polecats, not polecats that are thinking hard. False positives (the "Deacon murder spree" bug) are worse than slow detection.
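Grouped as a config struct, the safety net might look like the sketch below. Only `HungSessionThresholdMinutes` is named in this document; the other field names and the specific default values are illustrative assumptions (60 seconds matches the done-intent grace described later; 3 matches the crash-loop rule).

```go
package daemon

// SafetyNetConfig holds the daemon's deliberately generous thresholds.
type SafetyNetConfig struct {
	HungSessionThresholdMinutes int // no pane output this long => hung
	DoneIntentGraceSeconds      int // done-intent older than this => stuck-in-done
	CrashLoopLimit              int // crashes on one step before escalating
}

// Defaults are illustrative; the real values are configuration.
var defaults = SafetyNetConfig{
	HungSessionThresholdMinutes: 30,
	DoneIntentGraceSeconds:      60,
	CrashLoopLimit:              3,
}
```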
The murder spree lesson: Mechanical detection of "stuck" is fragile because distinguishing "thinking deeply" from "hung" requires intelligence. This is why Boot exists (intelligent triage) and why the daemon's thresholds are conservative. Only the witness (an AI agent) should make judgment calls about whether a polecat is truly stuck.
Question: Mail-based, beads-based, or state file?
Answer: Mail-based. See Section 4 for full design.
Why not beads-based (special issue type)?
Why not state files (rig/polecat-queue.json)?
Question: After a polecat completes a step and hands off, who spawns the next session to continue the molecule?
Answer: The witness, triggered by either handoff detection or daemon lifecycle request.
The spawn chain:
Polecat completes step
│
├── Closes step bead
├── Calls gt handoff (creates handoff mail)
└── Session exits
│
▼
Daemon heartbeat tick
│
├── Detects dead polecat session
├── Finds hook_bead still set (work isn't done)
└── Triggers session restart
│
▼
SessionManager.Start()
│
├── Creates new tmux session in existing worktree
├── Injects env vars (GT_POLECAT, GT_RIG)
├── SessionStart hook fires: gt prime --hook
└── New session discovers next step via bd mol current
Current implementation: The daemon's processLifecycleRequests() handles
this. When a session dies but the hook is still set, the daemon either sends a
LIFECYCLE: message to the witness or directly restarts the session (depending
on configuration). Polecat startup is handled end-to-end by the GUPP/beacon
flow (SessionManager → StartupNudge → BuildStartupPrompt → SessionStart hook
→ gt prime).
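That decision reduces to a few lines. A hedged sketch: `processLifecycleRequests` is named above, but its body here, including the `directRestart` toggle, is an assumption about how the configuration choice might be expressed.

```go
package daemon

// Polecat is a minimal view of an agent bead for this sketch.
type Polecat struct {
	Name     string
	HookBead string // "" when the polecat is idle
}

// Hypothetical probes and actuators.
func sessionAlive(name string) bool { return false } // tmux liveness
func restartSession(name string)    {}               // SessionManager.Start
func sendMail(addr, subject string) {}               // gt mail

func processLifecycleRequests(rig string, polecats []Polecat, directRestart bool) {
	for _, p := range polecats {
		if sessionAlive(p.Name) || p.HookBead == "" {
			continue // healthy, or idle with no hooked work
		}
		// Dead session with the hook still set: the molecule is unfinished.
		if directRestart {
			restartSession(p.Name) // new session in the existing worktree
		} else {
			sendMail(rig+"/witness", "LIFECYCLE:Cycle "+p.Name)
		}
	}
}
```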
Future (AT integration): The witness spawns replacement teammates directly
via Teammate({ operation: "spawn" }). The SubagentStop hook detects teammate
death and triggers respawn. See docs/design/witness-at-team-lead.md for details.
A polecat runs gt done but the session hangs before cleanup completes.
Detection: Witness DetectZombiePolecats() checks for done-intent label
older than 60 seconds with a live session.
Recovery: Witness kills the session and continues the cleanup pipeline
(verify cleanup_status, forward to refinery if MR exists).
A polecat directory exists, but there is no tmux session and no `hook_bead`.
Detection: Manager.ReconcilePool() finds directories without sessions.
DetectStalePolecats() identifies sandboxes far behind main with no work.
Recovery: If no uncommitted work and no active MR, nuke the sandbox. If uncommitted work exists, escalate (someone needs to decide if the work matters).
The refinery starts merging while the polecat is still pushing.
Prevention: The cleanup_status=clean field on the agent bead serializes
this. The witness only sends MERGE_READY after verifying the polecat has
exited and the branch is clean. The merge slot provides additional serialization.
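The merge slot itself can be sketched as a lock around the pipeline stages. Whether the real slot is an in-process mutex or an external lock is not specified here; the mutex and the stage helpers below are assumptions.

```go
package refinery

import "sync"

// mergeSlot serializes merges to main: one MR in flight per rig.
var mergeSlot sync.Mutex

// Hypothetical stages of the merge pipeline described earlier.
func runQualityGates(mr string) error   { return nil }
func squashMergeToMain(mr string) error { return nil }
func closeMRAndSource(mr string)        {}
func sendMail(addr, subject string)     {}

func processMergeReady(rig, mr string) error {
	mergeSlot.Lock() // acquire the merge slot
	defer mergeSlot.Unlock()

	if err := runQualityGates(mr); err != nil {
		sendMail(rig+"/witness", "MERGE_FAILED "+mr)
		return err
	}
	if err := squashMergeToMain(mr); err != nil {
		sendMail(rig+"/witness", "MERGE_FAILED "+mr)
		return err
	}
	closeMRAndSource(mr) // close MR bead and source issue
	sendMail(rig+"/witness", "MERGED "+mr)
	return nil
}
```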
A step keeps failing and the session keeps restarting.
Detection: Track crash count per polecat (via ReconcilePool or
ephemeral state). Three crashes on the same step triggers escalation.
Recovery: Witness stops respawning, creates a bug bead, mails the mayor. The molecule stays in its current state (recoverable when the bug is fixed).
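The three-strikes rule is simple bookkeeping. The in-memory counter below is one possible implementation; the document leaves the mechanism open (`ReconcilePool` or ephemeral state), so treat this as a sketch.

```go
package witness

// crashes counts crashes per polecat/step pair (ephemeral state).
var crashes = map[string]int{}

// Hypothetical actuators for this sketch.
func respawn(polecat string)        {}
func fileBug(summary string) string { return "bug-bead-id" }
func mailMayor(subject string)      {}

func onCrash(polecat, step string) {
	key := polecat + "/" + step
	crashes[key]++
	if crashes[key] < 3 {
		respawn(polecat) // normal recovery: try the step again
		return
	}
	// Third crash on the same step: stop respawning and escalate.
	bug := fileBug("crash loop on " + key)
	mailMayor("crash loop on " + key + " (" + bug + ")")
}
```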
Two live sessions on the same sandbox should not happen, because the hook is exclusive (one hook_bead per agent bead, one agent bead per polecat name). But if it does:
Prevention: Git branch naming includes a unique suffix (@<timestamp>).
The TOCTOU guard in DetectZombiePolecats() (records detectedAt, re-verifies
before destructive action) prevents racing between detection and action.
Recovery: The second session fails to push (branch diverged) and escalates.
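A sketch of that guard, assuming a `detectedAt` timestamp recorded when the zombie was first seen; the re-check immediately before the kill is what closes the race. Helper names and the grace constant are assumptions.

```go
package witness

import "time"

const grace = 60 * time.Second // e.g. the done-intent grace period

// isZombie re-derives zombie status from observables; hypothetical.
func isZombie(name string) bool { return false }
func killSession(name string)   {}

func maybeKill(name string, detectedAt time.Time) {
	if time.Since(detectedAt) < grace {
		return // too soon: give the polecat a chance to finish
	}
	// TOCTOU guard: state may have changed since detection.
	if !isZombie(name) {
		return // it recovered; take no destructive action
	}
	killSession(name) // safe: the condition held across the re-check
}
```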
The Agent Teams (AT) integration (see docs/design/witness-at-team-lead.md)
changes the transport layer but preserves the lifecycle model:
| Aspect | Current (tmux) | Future (AT) |
|---|---|---|
| Session management | tmux sessions | AT teammates |
| Spawning | SessionManager.Start() | Teammate({ operation: "spawn" }) |
| Health monitoring | tmux liveness + pane output | AT lifecycle hooks (SubagentStop) |
| Messaging | gt nudge (tmux send-keys) | AT messaging |
| Cleanup | Session kill (sandbox preserved) | Teammate({ operation: "requestShutdown" }) (sandbox preserved) |
What stays the same:
- `gt done` as the polecat idle signal

What changes: the session transport layer, as summarized in the table above.
All core lifecycle operations are implemented and running in production:
| Operation | Command/Component | Key Implementation |
|---|---|---|
| Spawn/assign | gt sling | sling.go, polecat_spawn.go — finds idle polecat or allocates new slot |
| Work execution | gt prime --hook | Session discovers hook via bd mol current, GUPP fires |
| Session cycling | gt handoff | handoff.go — all roles, preserves sandbox and identity |
| Step completion | bd close + gt handoff | Step cleanup: session dies, sandbox lives |
| Work submission | gt done | done.go — push, MR, sandbox sync, set idle |
| Idle polecat reuse | gt sling | polecat/manager.go: FindIdlePolecat() + ReuseIdlePolecat() — branch-only repair |
| Zombie detection | Witness patrol | witness/handlers.go: DetectZombiePolecats() — restart-first, no auto-nuke |
| Stale detection | Witness patrol | polecat/manager.go: DetectStalePolecats() — tmux-based, protects paused states |
| Orphan recovery | Witness patrol | witness/handlers.go: DetectOrphanedBeads() — reset and re-dispatch |
| Cleanup pipeline | Mail-based | POLECAT_DONE → Witness → MERGE_READY → Refinery → MERGED |
| Merge queue | Refinery | Squash-merge, close MR and issue, convoy check |
| Feature | Description | Impact |
|---|---|---|
| Refinery notifies mayor after merge | PRs #2436/#2437 closed; branch cleanup shipped, mayor notify not yet implemented | Unblocks dependent work dispatch |
| Feature | Rationale for deferral |
|---|---|
| Pool size enforcement | On-demand allocation works; fixed pool is optimization, not correctness |
| `gt polecat pool init` | Polecats created naturally by first `gt sling`; pre-allocation unnecessary |
| `ReconcilePool()` | Witness patrol already detects state drift via zombie/stale/orphan checks |
See concepts/polecat-lifecycle.md for the complete lifecycle model (three layers, four states, persistent polecat design). This document covers the implementation details: cleanup stages, mail channels, patrol coordination, and edge case handling.