docs/design/convoy/spec.md
Daemon-resident event-driven completion and stranded convoy recovery.
Status: Implementation complete (all stories DONE) Owner: Daemon subsystem Related: convoy-lifecycle.md | convoy-manager.md
Convoys group work but don't drive it. Completion depends on a single
poll-based Deacon patrol cycle running gt convoy check. When Deacon is down
or slow, convoys stall. Work finishes but the loop never lands:
Create -> Track -> Execute -> Issue closes -> ??? -> Convoy closes
The gap needs three capabilities:
Two goroutines inside gt daemon:
| Goroutine | Trigger | What it does |
|---|---|---|
| Event poll | GetAllEventsSince every 5s, all rig stores + hq | Detects EventClosed / EventStatusChanged(closed), calls CheckConvoysForIssue |
| Stranded scan | gt convoy stranded --json every 30s | Feeds first ready issue via gt sling, auto-closes empty convoys via gt convoy check |
Both goroutines are context-cancellable and coordinate shutdown via sync.WaitGroup.
The event poll opens beads stores for all known rigs (via routes.jsonl) plus
the town-level hq store. Parked/docked rigs are skipped during polling. Convoy
lookups always use the hq store since convoys are hq-* prefixed. Each store
has an independent high-water mark for event IDs.
convoy.CheckConvoysForIssue)Shared function called by the daemon's event poll:
| Observer | When | Entry point |
|---|---|---|
| Daemon event poll | Close event detected in any rig store or hq | convoy.CheckConvoysForIssue (hq store passed in) |
The shared function:
GetDependentsWithMetadata on hq store, filtered by tracks type)gt convoy check <id> for open convoysgt sling| Decision | Rationale |
|---|---|
| SDK polling (not CLI streaming) | Avoids subprocess lifecycle management, simpler restart semantics |
| High-water mark (atomic int64) | Monotonically advancing, no duplicate event processing |
| One issue fed per convoy per scan | Prevents batch overflow; next issue fed on next close event |
| Stranded scan as safety net | Catches convoys missed by event-driven path (crash recovery) |
| Nil store disables event poll only | Stranded scan still works without beads SDK (degraded mode) |
| Resolved binary paths (PATCH-006) | ConvoyManager resolves gt/bd at startup to avoid PATH issues |
| Status | Meaning |
|---|---|
| DONE | Implemented, tested, integrated |
| DONE-PARTIAL | Implemented but has known gaps |
| TODO | Not yet implemented |
These commands must pass for every implementation story in this spec:
go test ./...golangci-lint runDescription: When an issue closes, the daemon detects the close event via SDK polling and triggers convoy completion checks.
Implementation: ConvoyManager.runEventPoll + pollEvents in convoy_manager.go
Acceptance criteria:
GetAllEventsSince on a 5-second intervalEventClosed eventsEventStatusChanged where new_value == "closed"issue_idconvoy.CheckConvoysForIssue for each detected closeGetAllEventsSince logs and retries next intervalTests:
TestEventPoll_DetectsCloseEvents -- real beads store, creates+closes issue, verifies logTestEventPoll_SkipsNonCloseEvents -- create-only, no close detectionCorrective note: "Zero side effects" negative assertions have been added via
TestEventPoll_SkipsNonCloseEvents_NegativeAssertion (verifies no subprocess
calls, no close detection, and no convoy activity for non-close events). Originally
tracked in S-11; now resolved.
Description: Every 30 seconds, scan for stranded convoys (unassigned work or empty). Feed ready work or auto-close empties.
Implementation: ConvoyManager.runStrandedScan + scan + findStranded + feedFirstReady + closeEmptyConvoy in convoy_manager.go
Acceptance criteria:
scanIntervalgt convoy stranded --json and parses outputready_count > 0: dispatches first ready issue via gt sling <id> <rig> --no-bootready_count == 0: runs gt convoy check <id> to auto-closebeads.ExtractPrefix + beads.GetRigNameForPrefixTests:
TestScanStranded_FeedsReadyIssues -- mock gt, verify sling log fileTestScanStranded_ClosesEmptyConvoys -- mock gt, verify check log fileTestScanStranded_NoStrandedConvoys -- empty list: asserts sling log absent, check log absent, no convoy activity in logsTestScanStranded_DispatchFailure -- first sling fails, scan continuesTestConvoyManager_ScanInterval_Configurable -- 0 -> default, custom preservedTestStrandedConvoyInfo_JSONParsing -- JSON round-tripDescription: A shared function for checking convoy completion and feeding the next ready issue, callable from any observer.
Implementation: CheckConvoysForIssue + feedNextReadyIssue in convoy/operations.go
Acceptance criteria:
GetDependentsWithMetadata filtered by tracks typeblocks dependenciesgt convoy check <id> for open convoysgt slingexternal:prefix:id wrapper format via extractIssueIDGetIssuesByIDs for cross-rig accuracyTests:
TestGetTrackingConvoys_FiltersByTracksType -- real store, blocks filteredTestIsConvoyClosed_ReturnsCorrectStatus -- real store, open vs closedTestExtractIssueID -- all wrapper variantsTestFeedNextReadyIssue_SkipsNonOpenIssues -- filtering logicTestFeedNextReadyIssue_FindsReadyIssue -- first matchTestCheckConvoysForIssue_NilStore -- returns nilTestCheckConvoysForIssue_NilLogger -- no panicTestCheckConvoysForIssueWithAutoStore_NoStore -- non-existent path, nilDescription: Witness convoy observer removed. The daemon's multi-rig event poll (watching all rig databases + hq) provides event-driven coverage for close events from any rig. The stranded scan (30s) provides backup. The witness's core job is polecat lifecycle management -- convoy tracking is orthogonal.
History: Originally had 6 CheckConvoysForIssueWithAutoStore call sites in
handlers.go (1 post-merge, 5 zombie cleanup paths). All were pure side-effect
notification hooks. Removed when daemon gained multi-rig event polling.
Description: Refinery convoy observer removed. The daemon event poll (5s) and witness observer provide sufficient coverage. The refinery observer was silently broken (S-17: wrong root path) for the entire feature lifetime with no visible impact, confirming the other two observers are sufficient. Since beads unavailability disables the entire town (not just convoy checks), the "degraded mode" justification for a third observer does not hold.
History: Originally called CheckConvoysForIssueWithAutoStore after merge.
S-17 found it passed rig path instead of town root. S-18 fixed it. Subsequently
removed as unnecessary redundancy.
Description: ConvoyManager starts and stops cleanly with the daemon.
Implementation: Integrated in daemon.go Run() and shutdown() methods.
Acceptance criteria:
gtPath/bdPath to ConvoyManagerlogger.Printf for daemon log integrationTests:
TestDaemon_StartsManagerAndScanner -- start + stop with mock binariesTestDaemon_StopsManagerAndScanner -- stop completes within 5sDescription: Merge-request beads carry convoy tracking fields for priority scoring and starvation prevention.
Implementation: ConvoyID and ConvoyCreatedAt in MRFields struct in beads/fields.go
Acceptance criteria:
convoy_id field parsed and formattedconvoy_created_at field parsed and formattedDescription: Start/Stop are safe under edge conditions.
Acceptance criteria:
Stop() is idempotent (double-call does not deadlock)Stop() before Start() returns immediatelyStart() is guarded against double-call (atomic.Bool with CompareAndSwap at convoy_manager.go:50-51,80-83)Tests:
TestManagerLifecycle_StartStop -- basic start + stopTestConvoyManager_DoubleStop_Idempotent -- double stopTestStart_DoubleCall_Guarded -- second Start() is no-op, warning loggedDescription: All subprocess calls in ConvoyManager and observer propagate context cancellation so daemon shutdown is not blocked by hanging subprocesses.
Implementation: All exec.Command calls replaced with exec.CommandContext.
Process group killing via setProcessGroup + syscall.Kill(-pid, SIGKILL) prevents
orphaned child processes.
Acceptance criteria:
exec.Command calls in ConvoyManager use exec.CommandContext(m.ctx, ...) (convoy_manager.go:200,241,257)exec.Command calls in operations.go accept and use a context parametergt subprocess hangs (convoy_manager_integration_test.go:154-206)convoy_manager.go, operations.go)Description: Observer subprocess calls use resolved binary paths instead
of bare "gt" to avoid PATH-dependent behavior drift.
Implementation: CheckConvoysForIssue resolves via exec.LookPath("gt")
with fallback to bare "gt". Threads gtPath parameter to runConvoyCheck
and dispatchIssue in operations.go.
Acceptance criteria:
runConvoyCheck and dispatchIssue accept a gtPath parameterCheckConvoysForIssue threads resolved pathm.gtPath)"gt" if resolution failsDescription: Filled testing gaps for core invariants identified in the test plan analysis.
Tests added:
| Test | What it proves |
|---|---|
TestFeedFirstReady_MultipleReadyIssues_DispatchesOnlyFirst | 3 ready issues -> sling log contains only first issue ID |
TestFeedFirstReady_UnknownPrefix_Skips | Issue prefix not in routes.jsonl -> sling never called, error logged |
TestFeedFirstReady_UnknownRig_Skips | Prefix resolves but rig lookup fails -> sling never called |
TestFeedFirstReady_EmptyReadyIssues_NoOp | ReadyIssues=[] despite ReadyCount>0 -> no crash, no dispatch |
TestEventPoll_SkipsNonCloseEvents_NegativeAssertion | Asserts zero side effects (no subprocess calls, no convoy activity) |
Acceptance criteria:
Description: Covered error paths that previously had no test coverage.
Tests added:
| Test | What it proves |
|---|---|
TestFindStranded_GtFailure_ReturnsError | gt convoy stranded exits non-zero -> error returned |
TestFindStranded_InvalidJSON_ReturnsError | gt returns non-JSON stdout -> parse error returned |
TestScan_FindStrandedError_LogsAndContinues | scan() doesn't panic when findStranded fails |
TestPollEvents_GetAllEventsSinceError | GetAllEventsSince returns error -> logged, retried next interval |
Acceptance criteria:
Description: Covered lifecycle edge cases identified in the test plan.
Tests added:
| Test | What it proves |
|---|---|
TestScan_ContextCancelled_MidIteration | Large stranded list + cancel mid-loop -> exits cleanly |
TestScanStranded_MixedReadyAndEmpty | Heterogeneous stranded list routes ready->sling and empty->check correctly |
TestStart_DoubleCall_Guarded | Second Start() is no-op, warning logged |
Acceptance criteria:
Description: Improved test harness quality and reduced duplication.
Items:
| Item | Impact |
|---|---|
Extract mockGtForScanTest(t, opts) helper | Used by 5+ scan tests (convoy_manager_test.go:57-117) |
| Add side-effect logger to all mock scripts | All mock scripts write call logs for positive/negative assertions |
Fix DispatchFailure test logger to capture fmt.Sprintf(format, args...) | Assertions verify rendered messages with correct IDs |
Convert TestScanStranded_NoStrandedConvoys to negative test | Asserts sling/check logs absent |
Acceptance criteria:
Description: Update stale documentation to reflect current implementation.
Items:
| Document | Issue |
|---|---|
docs/design/daemon/convoy-manager.md | Mermaid diagram shows bd activity --follow but implementation uses SDK GetAllEventsSince polling |
docs/design/daemon/convoy-manager.md | Text says "Restarts with 5s backoff on stream error" -- no stream, no backoff; it's a poll-retry loop |
docs/design/convoy/testing.md | Row "Stream failure triggers backoff + retry loop" is stale (no stream) |
docs/design/convoy/testing.md | TestDoubleStop_Idempotent listed as gap but now exists |
docs/design/convoy/convoy-lifecycle.md | Observer table lists Deacon as primary third observer; implementation uses Refinery |
docs/design/convoy/convoy-lifecycle.md | "No manual close" claim is stale; gt convoy close --force exists |
docs/design/convoy/convoy-lifecycle.md | Relative link to convoy concepts doc is broken (../concepts/...) |
docs/design/convoy/spec.md | File map test counts drifted from current suite |
Acceptance criteria:
Completion note: Completed in this review pass; remaining ambiguity about refinery root-path semantics is tracked separately in S-17.
Description: Add explicit corrective tasks for inaccuracies discovered in stories marked DONE, without changing the implementation status itself.
Rationale: DONE stories can still contain stale supporting narrative or inventory details after nearby refactors. Corrections are tracked explicitly to avoid silently editing historical delivery claims.
Scope:
handlers.gotownRoot vs rig-path argument assumptions
for CheckConvoysForIssueWithAutoStoreAcceptance criteria:
Status note: All corrective notes updated. S-01 negative assertion test now exists (resolved). S-04 call sites already use semantic descriptions. S-05 note updated to reflect S-17 verification findings (incorrect path, fix in S-18).
Description: Verify whether refinery passing e.rig.Path into
CheckConvoysForIssueWithAutoStore is correct for convoy visibility.
Context:
<townRoot>/.beads/doltFindings:
The current behavior is incorrect. e.rig.Path is a rig-level path
(<townRoot>/<rigName>), set in rig/manager.go as filepath.Join(m.townRoot, name).
OpenStoreForTown constructs <path>/.beads/dolt, so the refinery opens
<townRoot>/<rigName>/.beads/dolt instead of <townRoot>/.beads/dolt.
The rig-level .beads/ directory typically contains either a redirect file
(pointing to mayor/rig/.beads) or rig-scoped metadata -- not the town-level
Dolt database that holds convoy data. As a result, beadsdk.Open either fails
(no dolt/ directory) or opens a rig-scoped store that does not contain convoy
tracking dependencies. In both cases CheckConvoysForIssueWithAutoStore silently
returns nil, effectively disabling convoy checks from the refinery observer.
Other observers handle this correctly:
workspace.Find(workDir) before callingd.config.TownRoot directlyFix required: Resolve town root from e.rig.Path using workspace.Find
before passing to CheckConvoysForIssueWithAutoStore, matching the witness pattern.
See S-18 for implementation.
Acceptance criteria:
Description: Fixed the refinery's CheckConvoysForIssueWithAutoStore call to
pass the town root instead of the rig path, so convoy checks actually open the
correct beads store.
Context: Identified by S-17 verification. The refinery was passing e.rig.Path
(<townRoot>/<rigName>) but the function expects the town root. This silently
disabled convoy observation from the refinery.
Implementation: engineer.go now resolves town root via workspace.Find(e.rig.Path)
before calling CheckConvoysForIssueWithAutoStore, matching the witness pattern.
Acceptance criteria:
workspace.Find(e.rig.Path) before calling CheckConvoysForIssueWithAutoStoreworkspace package added to engineer.goengineer.go removed after fix| # | Invariant | Category | Blast Radius | Story | Tested? |
|---|---|---|---|---|---|
| I-1 | Issue close triggers CheckConvoysForIssue | Data | High | S-01 | Yes |
| I-2 | Non-close events produce zero side effects | Safety | Low | S-01 | Yes (TestEventPoll_SkipsNonCloseEvents_NegativeAssertion) |
| I-3 | High-water mark advances monotonically | Data | High | S-01 | Implicit |
| I-4 | Convoy check is idempotent | Data | Low | S-03 | Yes |
| I-5 | Stranded convoys with ready work get fed | Liveness | High | S-02 | Yes |
| I-6 | Empty stranded convoys get auto-closed | Data | Medium | S-02 | Yes |
| I-7 | Scan continues after dispatch failure | Liveness | Medium | S-02 | Yes |
| I-8 | Context cancellation stops both goroutines | Liveness | High | S-06 | Yes |
| I-9 | One issue fed per convoy per scan | Safety | Medium | S-02 | Implicit |
| I-10 | Unknown prefix/rig skips issue (no crash) | Safety | Medium | S-02 | Yes (TestFeedFirstReady_UnknownPrefix_Skips, _UnknownRig_Skips) |
| I-11 | Stop() is idempotent | Safety | Low | S-08 | Yes |
| I-12 | Subprocess cancellation on shutdown | Liveness | High | S-09 | Yes (TestConvoyManager_ShutdownKillsHangingSubprocess) |
| Failure | Likelihood | Recovery | Tested? |
|---|---|---|---|
GetAllEventsSince error | Low | Retry next 5s interval | Yes (TestPollEvents_GetAllEventsSinceError) |
| Beads store nil | Medium | Event poll disabled, stranded scan continues | Yes |
Close event with empty issue_id | Low | Skipped | No |
CheckConvoysForIssue panics | Low | Daemon process crash -> restart | No |
| Failure | Likelihood | Recovery | Tested? |
|---|---|---|---|
gt convoy stranded error | Low | Logged, skip cycle | Yes (TestFindStranded_GtFailure_ReturnsError) |
Invalid JSON from gt | Low | Logged, skip cycle | Yes (TestFindStranded_InvalidJSON_ReturnsError) |
gt sling dispatch fails | Medium | Logged, continue to next convoy | Yes |
gt convoy check fails | Low | Logged, continue to next convoy | No |
| Unknown prefix for issue | Low | Logged, skip issue | Yes (TestFeedFirstReady_UnknownPrefix_Skips) |
| Unknown rig for prefix | Low | Logged, skip issue | Yes (TestFeedFirstReady_UnknownRig_Skips) |
gt subprocess hangs | Low | Context cancellation kills process group | Yes (TestConvoyManager_ShutdownKillsHangingSubprocess) |
| Failure | Likelihood | Recovery | Tested? |
|---|---|---|---|
Stop() before Start() | Low | wg.Wait() returns immediately | No |
Double Stop() | Low | Idempotent | Yes |
Double Start() | Low | Guarded (atomic.Bool, no-op) | Yes (TestStart_DoubleCall_Guarded) |
| Subprocess blocks shutdown | Low | Context cancellation kills process group | Yes (TestConvoyManager_ShutdownKillsHangingSubprocess) |
| File | Contents |
|---|---|
internal/daemon/convoy_manager.go | ConvoyManager: event poll + stranded scan goroutines |
internal/convoy/operations.go | Shared CheckConvoysForIssue, feedNextReadyIssue, getTrackingConvoys, IsSlingableType, isIssueBlocked |
internal/beads/routes.go | ExtractPrefix, GetRigNameForPrefix (prefix -> rig resolution) |
internal/beads/fields.go | MRFields.ConvoyID, MRFields.ConvoyCreatedAt (convoy tracking in MR beads) |
| File | How it uses convoy |
|---|---|
internal/daemon/daemon.go | Opens multi-rig beads stores, creates ConvoyManager in Run(), stops in shutdown() |
internal/witness/handlers.go | Convoy observer removed (S-04 REMOVED) |
internal/refinery/engineer.go | Convoy observer removed (S-05 REMOVED) |
internal/cmd/convoy.go | CLI: gt convoy create/status/list/add/check/stranded/close/land |
internal/cmd/sling_convoy.go | Auto-convoy creation during gt sling |
internal/cmd/formula.go | executeConvoyFormula for convoy-type formulas |
| File | What it tests |
|---|---|
internal/daemon/convoy_manager_test.go | ConvoyManager unit tests (22 tests) |
internal/daemon/convoy_manager_integration_test.go | ConvoyManager integration tests (2 tests, //go:build integration) |
internal/convoy/store_test.go | Observer store helpers (3 tests) |
internal/convoy/operations_test.go | Operations function edge cases + safety guard tests |
internal/daemon/daemon_test.go | Daemon-level manager lifecycle (2 convoy tests) |
| File | Contents |
|---|---|
docs/design/convoy/convoy-lifecycle.md | Problem statement, design principles, flow diagram |
docs/design/convoy/spec.md | This document (includes test harness scorecard and remaining gaps) |
docs/design/daemon/convoy-manager.md | ConvoyManager architecture diagram (SDK polling + stranded scan) |
| Finding | Story |
|---|---|
| Stream-based convoy-manager doc was stale | S-15 |
| Testing doc had stale stream/backoff and duplicate gap entries | S-15 |
| Lifecycle observer/manual-close claims were stale | S-15 |
| Spec file-map command/test counts drifted | S-15 |
| DONE stories needed explicit corrective handling | S-16 |
| Refinery observer root-path ambiguity remains | S-17 (verified) |
| Refinery root-path fix required | S-18 |
These are documented in convoy-lifecycle.md as future work but are not in scope for this spec:
due_at field, overdue surfacing) (P3 in lifecycle doc)| Dimension | Score (1-5) | Key Gap |
|---|---|---|
| Fixtures & Setup | 4 | mockGtForScanTest shared builder covers scan tests; processLine path has own setup |
| Isolation | 4 | Temp dirs + t.Setenv(PATH) is solid; Windows correctly skipped; no shared state |
| Observability | 4 | All mock scripts emit call logs; negative tests assert log files absent/empty |
| Speed | 4 | All convoy-manager tests run quickly; no long-running interval waits in current suite |
| Determinism | 4 | No real timing dependencies; ticker tests use long intervals to avoid races |
Problem: ConvoyManager uses time.Ticker with 30s default. Testing "runs at interval" requires waiting or injecting a clock.
Proposal: Add clock field to ConvoyManager (interface with NewTicker(d)) defaulting to real time. Tests inject fake clock with immediate tick.
Compound Value: All periodic daemon components benefit.
Status: Not implemented. Tests use long intervals (10min) to prevent ticker firing during test.
TestProcessLine_EmptyIssueID (close event with empty issue_id)