docs/design/dog-infrastructure.md
Autonomous health monitoring, recovery, and concurrent shutdown dances in Gas Town.
Gas Town uses a three-tier watchdog chain for autonomous health monitoring:
Daemon (Go process) <- Dumb transport, 3-min heartbeat
|
+-> Boot (AI agent) <- Intelligent triage, fresh each tick
|
+-> Deacon (AI agent) <- Continuous patrol, long-running
|
+-> Witnesses & Refineries <- Per-rig agents
Key insight: The daemon is mechanical (can't reason), but health decisions need intelligence (is the agent stuck or just thinking?). Boot bridges this gap.
The daemon needs to ensure the Deacon is healthy, but:
Daemon can't reason - It's Go code following the ZFC principle (don't reason about other agents). It can check "is session alive?" but not "is agent stuck?"
Waking costs context - Each time you spawn an AI agent, you consume context tokens. In idle towns, waking Deacon every 3 minutes wastes resources.
Observation requires intelligence - Distinguishing "agent composing large artifact" from "agent hung on tool prompt" requires reasoning.
Boot is a narrow, ephemeral AI agent that:
This gives us intelligent triage without the cost of keeping a full AI running.
We could have Deacon handle its own "should I be awake?" logic, but:
| Agent | Session Name | Location | Lifecycle |
|---|---|---|---|
| Daemon | (Go process) | ~/gt/daemon/ | Persistent, auto-restart |
| Boot | gt-boot | ~/gt/deacon/dogs/boot/ | Ephemeral, fresh each tick |
| Deacon | hq-deacon | ~/gt/deacon/ | Long-running, handoff loop |
Critical: Boot runs in gt-boot, NOT hq-deacon. This prevents Boot
from conflicting with a running Deacon session.
The daemon runs a heartbeat tick every 3 minutes:
func (d *Daemon) heartbeatTick() {
d.ensureBootRunning() // 1. Spawn Boot for triage
d.checkDeaconHeartbeat() // 2. Belt-and-suspenders fallback
d.ensureWitnessesRunning() // 3. Witness health (checks tmux directly)
d.ensureRefineriesRunning() // 4. Refinery health (checks tmux directly)
d.processLifecycleRequests() // 5. Cycle/restart requests
// Agent state derived from tmux, not recorded in beads (gt-zecmc)
}
The Deacon updates ~/gt/deacon/heartbeat.json at the start of each patrol cycle:
{
"timestamp": "2026-01-02T18:30:00Z",
"cycle": 42,
"last_action": "health-scan",
"healthy_agents": 3,
"unhealthy_agents": 0
}
| Age | State | Boot Action |
|---|---|---|
| < 5 min | Fresh | Nothing (Deacon active) |
| 5-15 min | Stale | Nudge if pending mail |
| > 15 min | Very stale | Wake (Deacon may be stuck) |
When Boot runs, it observes:
Then decides:
| Condition | Action | Command |
|---|---|---|
| Session dead | START | Exit; daemon calls ensureDeaconRunning() |
| Heartbeat > 15 min | WAKE | gt nudge deacon "Boot wake: check your inbox" |
| Heartbeat 5-15 min + mail | NUDGE | gt nudge deacon "Boot check-in: pending work" |
| Heartbeat fresh | NOTHING | Exit silently |
The Deacon runs continuous patrol cycles. After N cycles or high context:
End of patrol cycle:
|
+- Squash wisp to digest (ephemeral -> permanent)
+- Write summary to molecule state
+- gt handoff -s "Routine cycle" -m "Details"
|
+- Creates mail for next session
Next daemon tick:
Daemon -> ensureDeaconRunning()
|
+- Spawns fresh Deacon in gt-deacon
|
+- SessionStart hook: gt mail check --inject
|
+- Previous handoff mail injected
|
+- Deacon reads and continues
Boot is ephemeral - it exits after each tick. No persistent handoff needed.
However, Boot uses a marker file to prevent double-spawning:
~/gt/deacon/dogs/boot/.boot-running (TTL: 5 minutes)~/gt/deacon/dogs/boot/.boot-status.json (last action/result)If the marker exists and is recent, daemon skips Boot spawn for that tick.
When tmux is unavailable, Gas Town enters degraded mode:
| Capability | Normal | Degraded |
|---|---|---|
| Boot runs | As AI in tmux | As Go code (mechanical) |
| Observe panes | Yes | No |
| Nudge agents | Yes | No |
| Start agents | tmux sessions | Direct spawn |
Degraded Boot triage is purely mechanical:
Multiple layers ensure recovery:
Boot needs to run multiple shutdown-dance molecules concurrently when multiple death warrants are issued. All warrants need concurrent tracking, independent timeouts, and separate outcomes.
The shutdown-dance does NOT need Claude sessions. The dance is a deterministic state machine:
WARRANT -> INTERROGATE -> EVALUATE -> PARDON|EXECUTE
Each step is mechanical:
Decision: Dogs are lightweight Go routines, not Claude sessions.
+-----------------------------------------------------------------+
| BOOT |
| (Claude session in tmux) |
| |
| +-----------------------------------------------------------+ |
| | Dog Manager | |
| | | |
| | Pool: [Dog1, Dog2, Dog3, ...] (goroutines + state) | |
| | | |
| | allocate() -> Dog | |
| | release(Dog) | |
| | status() -> []DogStatus | |
| +-----------------------------------------------------------+ |
| |
| Boot's job: |
| - Watch for warrants (file or event) |
| - Allocate dog from pool |
| - Monitor dog progress |
| - Handle dog completion/failure |
| - Report results |
+-----------------------------------------------------------------+
// Dog represents a shutdown-dance executor
type Dog struct {
ID string // Unique ID (e.g., "dog-1704567890123")
Warrant *Warrant // The death warrant being processed
State ShutdownDanceState
Attempt int // Current interrogation attempt (1-3)
StartedAt time.Time
StateFile string // Persistent state: ~/gt/deacon/dogs/active/<id>.json
}
type ShutdownDanceState string
const (
StateIdle ShutdownDanceState = "idle"
StateInterrogating ShutdownDanceState = "interrogating" // Sent message, waiting
StateEvaluating ShutdownDanceState = "evaluating" // Checking response
StatePardoned ShutdownDanceState = "pardoned" // Session responded
StateExecuting ShutdownDanceState = "executing" // Killing session
StateComplete ShutdownDanceState = "complete" // Done, ready for cleanup
StateFailed ShutdownDanceState = "failed" // Dog crashed/errored
)
type Warrant struct {
ID string // Bead ID for the warrant
Target string // Session to interrogate (e.g., "gt-gastown-Toast")
Reason string // Why warrant was issued
Requester string // Who filed the warrant
FiledAt time.Time
}
Decision: Fixed pool of 5 dogs, configurable via environment (GT_DOG_POOL_SIZE).
Rationale:
const (
DefaultPoolSize = 5
MaxPoolSize = 20
)
type DogPool struct {
mu sync.Mutex
dogs []*Dog // All dogs in pool
idle chan *Dog // Channel of available dogs
active map[string]*Dog // ID -> Dog for active dogs
stateDir string // ~/gt/deacon/dogs/active/
}
+------------------------------------------+
| |
v |
+----------------------------+ |
| INTERROGATING | |
| | |
| 1. Send health check | |
| 2. Start timeout timer | |
+-------------+--------------+ |
| |
| timeout or response |
v |
+----------------------------+ |
| EVALUATING | |
| | |
| Check tmux output for | |
| ALIVE keyword | |
+-------------+--------------+ |
| |
+-------+-------+ |
| | |
v v |
[ALIVE found] [No ALIVE] |
| | |
| | attempt < 3? |
| +-----------------------------------+
| | yes: attempt++, longer timeout
| |
| | no: attempt == 3
v v
+---------+ +-----------+
| PARDONED| | EXECUTING |
| | | |
| Cancel | | Kill tmux |
| warrant | | session |
+----+----+ +-----+-----+
| |
+-------+-------+
|
v
+----------------+
| COMPLETE |
| |
| Write result |
| Release dog |
+----------------+
| Attempt | Timeout | Cumulative Wait |
|---|---|---|
| 1 | 60s | 60s |
| 2 | 120s | 180s (3 min) |
| 3 | 240s | 420s (7 min) |
[DOG] HEALTH CHECK: Session {target}, respond ALIVE within {timeout}s or face termination.
Warrant reason: {reason}
Filed by: {requester}
Attempt: {attempt}/3
The existing dog package (internal/dog/) manages Deacon's multi-rig helper dogs.
Those are different from shutdown-dance dogs:
| Aspect | Helper Dogs (existing) | Dance Dogs (new) |
|---|---|---|
| Purpose | Cross-rig infrastructure | Shutdown dance execution |
| Sessions | Claude sessions | Goroutines (no Claude) |
| Worktrees | One per rig | None |
| Lifecycle | Long-lived, reusable | Ephemeral per warrant |
| State | idle/working | Dance state machine |
Recommendation: Use different package to avoid confusion:
internal/dog/ - existing helper dogsinternal/shutdown/ - shutdown dance poolIf a dog crashes (Boot process restarts, system crash):
~/gt/deacon/dogs/active/| State | Recovery Action |
|---|---|
| interrogating | Restart from current attempt |
| evaluating | Check response, continue |
| executing | Verify kill, mark complete |
| pardoned/complete | Already done, clean up |
func (p *DogPool) RecoverOrphans() error {
files, _ := filepath.Glob(p.stateDir + "/*.json")
for _, f := range files {
state := loadDogState(f)
if state.State != StateComplete && state.State != StatePardoned {
dog := p.allocateForRecovery(state)
go dog.Resume()
}
}
return nil
}
If all dogs are busy when a new warrant arrives, the warrant is queued for later processing. When a dog completes and is released, the queue is checked for pending warrants.
~/gt/
├── daemon/
│ ├── daemon.log # Daemon activity log
│ └── daemon.pid # Daemon process ID
├── deacon/
│ ├── heartbeat.json # Deacon freshness (updated each patrol cycle)
│ ├── health-check-state.json # Agent health tracking (gt deacon health-check)
│ └── dogs/
│ ├── boot/ # Boot's working directory
│ │ ├── CLAUDE.md # Boot context
│ │ ├── .boot-running # Boot in-progress marker (TTL: 5 min)
│ │ └── .boot-status.json # Boot last action/result
│ ├── active/ # Active dog state files
│ │ ├── dog-123.json
│ │ └── ...
│ ├── completed/ # Completed dance records (for audit)
│ │ └── dog-789.json
│ └── warrants/ # Pending warrant queue
│ └── warrant-abc.json
# Check Deacon heartbeat
cat ~/gt/deacon/heartbeat.json | jq .
# Check Boot status
cat ~/gt/deacon/dogs/boot/.boot-status.json | jq .
# View daemon log
tail -f ~/gt/daemon/daemon.log
# Manual Boot run
gt boot triage
# Manual Deacon health check
gt deacon health-check
# Dog pool status
gt dog pool status
# View active shutdown dances
gt dog dances
# View warrant queue
gt dog warrants
Symptom: Boot runs in hq-deacon instead of gt-boot
Cause: Session name confusion in spawn code
Fix: Ensure gt boot triage specifies --session=gt-boot
Symptom: tmux session exists but Claude is dead
Cause: Daemon checks session existence, not process health
Fix: Kill zombie sessions before recreating: gt session kill hq-deacon
Symptom: gt status shows wrong state for agents
Cause: Previously bead state and tmux state could diverge
Fix: As of gt-zecmc, status derives state from tmux directly (no bead state for
observable conditions like running/stopped). Non-observable states (stuck, awaiting-gate)
are still stored in beads.
The watchdog chain provides autonomous recovery:
Boot exists because the daemon can't reason and Deacon can't observe itself. The separation costs complexity but enables:
The dog pool extends this with concurrent shutdown dances -- lightweight Go state machines that execute warrants without consuming Claude sessions.