PR #752 Chaos Testing Review

PR: https://github.com/gastownhall/beads/pull/752
Author: jordanhubbard
Bead: bd-kx1j
Status: Under Review

Summary

Jordan proposes adding chaos testing and E2E test coverage to beads. The PR:

Adds 4849 lines, removes 511 lines
Introduces chaos testing framework (random corruption, disk space exhaustion, NFS-like failures)
Creates side databases for testing recovery scenarios
Adds E2E tests tracking documented user scenarios
Brings code coverage to ~48%

Key Question from Jordan

"Is this level of testing something you actually want with the current pace of progress? It comes with an implied obligation to update and add to the tests as well as follow the CICD feedback in github (very spammy if your tests don't pass!)"

Files Changed (Major Categories)

Chaos/Doctor Infrastructure

cmd/bd/doctor_repair_chaos_test.go (378 lines) - Core chaos testing
cmd/bd/doctor/fix/database_integrity.go (116 lines) - DB integrity fixes
cmd/bd/doctor/fix/jsonl_integrity.go (87 lines) - JSONL integrity fixes
cmd/bd/doctor/fix/fs.go (57 lines) - Filesystem fault injection
cmd/bd/doctor/fix/sqlite_open.go (52 lines) - SQLite open handling
cmd/bd/doctor/jsonl_integrity.go (123 lines) - JSONL checks
cmd/bd/doctor/git.go (168 additions) - Git hygiene checks

Test Coverage Additions

internal/storage/memory/memory_more_coverage_test.go (921 lines) - Memory storage tests
cmd/bd/cli_coverage_show_test.go (426 lines) - CLI show command tests
cmd/bd/daemon_autostart_unit_test.go (331 lines) - Server autostart tests
internal/rpc/client_gate_shutdown_test.go (107 lines) - RPC client tests
Various other test files

Bug Fixes Discovered During Testing

internal/storage/sqlite/migrations/021_migrate_edge_fields.go - Major migration fix
internal/storage/sqlite/migrations/022_drop_edge_columns.go - Column cleanup
internal/storage/sqlite/migrations_template_pinned_regression_test.go - Regression test

Tradeoffs

Costs

Maintenance burden: Coverage must stay above whatever threshold is set (the PR reaches ~48% but does not enforce it)
CI noise: Failing tests generate notification spam until they are fixed
Velocity tax: Every change needs test updates
Complexity: Chaos testing framework itself needs maintenance

Benefits

Robustness validation: Proves beads can recover from corruption
Bug discovery: Already found migration bugs (021, 022)
Confidence: If chaos tests pass, beads is more robust than feared
Documentation: E2E tests document expected user scenarios
Regression prevention: Future changes caught before release

Initial Assessment

Implementation Quality: HIGH

The chaos testing code is well-structured. Key observations:

What the Chaos Tests Actually Cover

From doctor_repair_chaos_test.go:

Complete DB corruption - Writes "not a database" garbage, verifies recovery from JSONL
Truncated DB without JSONL - Tests graceful failure when no recovery source exists
Sidecar file backup - Ensures -wal, -shm, -journal files are preserved during repair
Repair with running server - Tests recovery while server holds locks
JSONL integrity - Malformed lines, re-export from DB
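
To make the "malformed lines" case concrete, here is a minimal sketch of a JSONL scan that reports which lines fail to parse. It is not the PR's code: the helper name malformedLines and the sample field names (id, title) are assumptions for illustration, and the real check in cmd/bd/doctor/jsonl_integrity.go may validate more than JSON syntax.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// malformedLines returns the 1-based numbers of non-empty lines in a JSONL
// stream that are not syntactically valid JSON.
// Hypothetical helper for illustration; not taken from the PR.
func malformedLines(jsonl string) ([]int, error) {
	var bad []int
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	// Raise the per-line limit above the scanner's 64 KiB default.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // blank lines are skipped, not reported
		}
		if !json.Valid([]byte(line)) {
			bad = append(bad, n)
		}
	}
	return bad, sc.Err()
}

func main() {
	// Sample data with assumed field names: line 2 is truncated, line 4 is garbage.
	data := `{"id":"bd-1","title":"ok"}
{"id":"bd-2","title":
{"id":"bd-3","title":"also ok"}
not json
`
	bad, err := malformedLines(data)
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	fmt.Println("malformed lines:", bad)
}
```

A repair step could then drop or quarantine the reported lines before re-exporting from the DB, which is the recovery path the JSONL integrity item describes.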

Each chaos test:

Uses isolated temp directories
Builds a fresh bd binary for testing
Uses "side databases" (separate from real data)
Has proper cleanup
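
The "complete DB corruption" scenario and the isolation properties above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's test: the helper names, the side.db file name, and the header-only stand-in for a healthy database are all hypothetical. The one external fact relied on is that every SQLite 3 file begins with the 16-byte string "SQLite format 3\x00".

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// sqliteMagic is the 16-byte string that begins every SQLite 3 database file.
var sqliteMagic = []byte("SQLite format 3\x00")

// looksLikeSQLite reports whether the file at path starts with the SQLite
// header. Hypothetical helper for illustration; not taken from the PR.
func looksLikeSQLite(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	hdr := make([]byte, len(sqliteMagic))
	if _, err := io.ReadFull(f, hdr); err != nil {
		if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
			return false, nil // file is too short to be a database
		}
		return false, err
	}
	return bytes.Equal(hdr, sqliteMagic), nil
}

// corruptionScenario creates a side database in an isolated temp directory,
// overwrites it with garbage, and reports the header check before and after.
func corruptionScenario() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "chaos-sketch-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir) // cleanup, even on failure

	db := filepath.Join(dir, "side.db") // assumed file name
	// Stand-in for a healthy file: magic string padded to the 100-byte
	// header size. This is NOT a usable database, only enough for the check.
	healthy := append([]byte{}, sqliteMagic...)
	healthy = append(healthy, make([]byte, 84)...)
	if err := os.WriteFile(db, healthy, 0o600); err != nil {
		return false, false, err
	}
	before, err := looksLikeSQLite(db)
	if err != nil {
		return false, false, err
	}

	// Inject the fault: replace the file contents with garbage.
	if err := os.WriteFile(db, []byte("not a database"), 0o600); err != nil {
		return false, false, err
	}
	after, err := looksLikeSQLite(db)
	return before, after, err
}

func main() {
	before, after, err := corruptionScenario()
	if err != nil {
		fmt.Fprintln(os.Stderr, "scenario failed:", err)
		os.Exit(1)
	}
	fmt.Printf("header intact before corruption: %v, after: %v\n", before, after)
}
```

A real chaos test would go one step further and run the repair command against the corrupted side database, then assert that the data was rebuilt from JSONL; that step is omitted here because it depends on the bd binary.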

Bug Fixes Already Discovered

The PR includes fixes for bugs found during testing:

Migration 021/022: pinned and is_template columns were being clobbered
Regression test added to prevent recurrence

Test Coverage Structure

Tests are organized by build tags:

//go:build chaos - Chaos/corruption tests (run separately)
//go:build e2e - End-to-end CLI tests
Regular unit tests - No build tag required

This means chaos tests only run when explicitly requested, not on every go test.
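
To illustrate how a build tag gates a file, the sketch below evaluates a //go:build line against a set of -tags values using the standard go/build/constraint package. The enabled helper is an assumption for illustration; the real toolchain also considers implicit tags such as GOOS and GOARCH, which this sketch ignores.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// enabled reports whether a file carrying the given //go:build line would be
// selected when only the listed tags are set.
// Hypothetical helper; ignores implicit tags such as GOOS and GOARCH.
func enabled(line string, tags ...string) (bool, error) {
	expr, err := constraint.Parse(line)
	if err != nil {
		return false, err
	}
	set := make(map[string]bool, len(tags))
	for _, t := range tags {
		set[t] = true
	}
	return expr.Eval(func(tag string) bool { return set[tag] }), nil
}

func main() {
	const line = "//go:build chaos"
	for _, tags := range [][]string{nil, {"chaos"}, {"e2e"}} {
		on, err := enabled(line, tags...)
		if err != nil {
			fmt.Println("parse error:", err)
			return
		}
		fmt.Printf("tags=%v: chaos-tagged file included: %v\n", tags, on)
	}
}
```

Only the run with the chaos tag selects the file, which is why a plain go test never compiles the chaos tests.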

Deep Analysis (Ultrathink)

The Core Question

Is the testing worth the ongoing maintenance cost?

Argument FOR Merging

Beads is more robust than feared. If Jordan got these tests passing, it means:
- bd doctor actually recovers from corruption
- JSONL/DB sync is working correctly
- Migration edge cases are handled
This validates the core design: SQLite + JSONL + git backstop.

Bugs already found. The migration 021/022 bugs are exactly the kind of subtle issue that could cause data loss in production. Catching them before release is a concrete return on the testing effort.

Build tag isolation. Chaos tests won't slow down regular development:

```bash
go test ./...                    # Normal tests only
go test -tags=chaos ./...        # Include chaos tests
go test -tags=e2e ./...          # Include E2E tests
```

48% coverage is a starting point, not an enforced threshold. The PR doesn't enforce maintaining 48%; Jordan is asking whether this level of testing is worth it. We can always add more later, or let coverage drift if priorities change.

Documentation value. E2E tests document expected user scenarios. When an AI agent asks "what should happen when X?", the tests provide executable answers.

Argument AGAINST Merging

Velocity tax is real. Every behavior change needs test updates. This is especially painful during rapid iteration phases.
CI noise. Failed tests block merges. With multiple agents working, flaky tests become coordination bottlenecks.
Framework maintenance. The chaos testing framework itself (side databases, build tags, test helpers) becomes another thing to maintain.
False confidence. Passing tests don't mean beads is production-ready, only that the tested scenarios work. Edge cases that aren't covered can still fail silently.

The Real Question: What Phase Are We In?

If beads is still in "rapid prototype" phase: The testing overhead is premature. Focus on features, fix crashes as they happen, lean on git backstop.

If beads is approaching "reliable tool" phase: Testing is essential. Multi-agent workflows amplify bugs. Corruption during a 10-agent batch is expensive.

Current reality: Beads is being dogfooded seriously. Multiple agents, real work, real data loss when things break. We're closer to "reliable tool" than "prototype."

ROI Calculation

Cost of NOT testing, per corruption incident:

Agent loses context (30-60 min recovery)
Human has to debug (variable, often 15-60 min)
Trust erosion (hard to quantify)

Cost of testing:

Review this PR (1-2 hours, one time)
Update tests when behavior changes (5-15 min per change)
Fix flaky tests when they appear (variable)

If corruption happens once a month, testing ROI is marginal. If corruption happens weekly (or with each new feature), testing pays for itself.
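
The comparison can be worked through with the ranges listed above. The sketch below is a rough model, not a measurement: the figure of 20 behavior changes per month is an assumption that does not come from the PR, "weekly" is approximated as 4 incidents per month, and the model optimistically assumes the tests would have prevented every incident. The one-time review cost and flaky-test fixes are left out.

```go
package main

import "fmt"

// Range is a low/high estimate in minutes.
type Range struct{ Low, High int }

// Plus adds two ranges bound by bound.
func (r Range) Plus(o Range) Range { return Range{r.Low + o.Low, r.High + o.High} }

// Times scales both bounds by n.
func (r Range) Times(n int) Range { return Range{r.Low * n, r.High * n} }

// incidentCost is the per-incident cost from the list above:
// agent recovery (30-60 min) plus human debugging (15-60 min).
func incidentCost() Range {
	return Range{30, 60}.Plus(Range{15, 60})
}

// upkeepCost is the monthly cost of updating tests at 5-15 min per change.
func upkeepCost(changesPerMonth int) Range {
	return Range{5, 15}.Times(changesPerMonth)
}

func main() {
	const changesPerMonth = 20 // ASSUMPTION: not stated in the PR or this review
	upkeep := upkeepCost(changesPerMonth)

	for _, incidents := range []int{1, 4} { // monthly vs. roughly weekly
		avoided := incidentCost().Times(incidents)
		fmt.Printf("%d incident(s)/month: avoided %d-%d min, upkeep %d-%d min\n",
			incidents, avoided.Low, avoided.High, upkeep.Low, upkeep.High)
	}
}
```

Under these assumptions, one incident per month costs 45 to 120 minutes against 100 to 300 minutes of test upkeep, while four incidents cost 180 to 480 minutes. The conclusion is sensitive to the assumed change rate, so it is worth substituting real numbers before relying on it.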

Recommendation

MERGE WITH MODIFICATIONS

Why Merge

The implementation quality is high
Bugs already found justify the effort
Build tag isolation minimizes velocity impact
Beads is past the prototype phase

Suggested Modifications

No hard coverage threshold in CI. Let coverage drift naturally. The value is in the chaos tests catching corruption, not in hitting a percentage.
Chaos tests optional in CI. Run chaos tests on release branches, not every PR. This reduces CI noise during active development.
Clear ownership. Jordan should document how to add new chaos scenarios. Future contributors need to know when to add vs skip tests.

Decision Framework for User

If you answer YES to 2+ of these, merge:

Are you dogfooding beads for real work?
Has corruption caused you to lose time in the last month?
Do you expect multiple agents using beads concurrently?
Is beads approaching a "v1.0" milestone?

If you answer NO to all, defer the PR until beads stabilizes.