Testing Philosophy

This document covers what to test and what not to test. For how to run tests, see TESTING.md.

The Test Pyramid

                  ┌─────────────────┐
                  │   E2E Tests     │  ← PR/Deploy only (slow, expensive)
                  │   ~5% of tests  │
                  └────────┬────────┘
                           │
            ┌──────────────┴──────────────┐
            │     Integration Tests       │  ← PR gate (moderate)
            │       ~15% of tests         │
            └──────────────┬──────────────┘
                           │
  ┌────────────────────────┴────────────────────────┐
  │              Unit Tests (Fast)                  │  ← Every save/commit
  │                 ~80% of tests                   │
  └─────────────────────────────────────────────────┘

Tier 1: Fast Tests (< 5 seconds total)

When: Every file save, pre-commit hooks, continuous during development

Pure function tests (no I/O)
In-memory data structure tests
Business logic validation
Mock all external dependencies

In beads: Core logic tests using newTestStore() with in-memory SQLite

Tier 2: Integration Tests (< 30 seconds)

When: Pre-push, PR checks

Real file system operations
Git operations with temp repos
Config file parsing
CLI argument handling

In beads: Tests tagged with //go:build integration, server mode tests

Tier 3: E2E / Smoke Tests (1-5 minutes)

When: PR merge, pre-deploy, nightly

Full bd init → bd doctor → bd doctor --fix workflow
Real API calls (to staging)
Cross-platform verification

What Makes a Test "Right"

A good test:

Catches a bug you'd actually ship - not theoretical edge cases
Documents expected behavior - serves as living documentation
Runs fast enough to not skip - slow tests get disabled
Isn't duplicated elsewhere - tests one thing, one way

What to Test (Priority Matrix)

Priority	What	Why	Examples in beads
High	Core business logic	This is what users depend on	`sync`, `doctor`, `export`, `import`
High	Error paths that could corrupt data	Data loss is catastrophic	Config handling, git operations, database integrity
Medium	Edge cases from production bugs	Discovered through real issues	Orphan handling, ID collision detection
Low	Display/formatting	Visual output, can be manually verified	Table formatting, color output

What NOT to Test Extensively

Simple utility functions

Trust the language. Don't test that strings.TrimSpace works.

Every permutation of inputs

Use table-driven tests with representative cases instead of exhaustive permutations.

// BAD: 10 separate test functions
func TestPriority0(t *testing.T) { ... }
func TestPriority1(t *testing.T) { ... }
func TestPriority2(t *testing.T) { ... }

// GOOD: One table-driven test
func TestPriorityMapping(t *testing.T) {
    cases := []struct{ in, want int }{
        {0, 4}, {1, 0}, {5, 3}, // includes boundary
    }
    for _, tc := range cases {
        t.Run(fmt.Sprintf("priority_%d", tc.in), func(t *testing.T) {
            got := mapPriority(tc.in)
            if got != tc.want { t.Errorf(...) }
        })
    }
}

Obvious behavior

Don't test "if file exists, return true" - trust the implementation.

Same logic through different entry points

If you test a function directly, don't also test it through every caller.

Anti-Patterns to Avoid

1. Trivial Assertions

Testing obvious happy paths that would pass with trivial implementations.

// BAD: What bug would this catch?
func TestValidateBeadsWorkspace(t *testing.T) {
    dir := setupTestWorkspace(t)
    if err := validateBeadsWorkspace(dir); err != nil {
        t.Errorf("expected no error, got: %v", err)
    }
}

// GOOD: Test the interesting error cases
func TestValidateBeadsWorkspace(t *testing.T) {
    cases := []struct{
        name    string
        setup   func(t *testing.T) string
        wantErr string
    }{
        {"missing .beads dir", setupNoBeadsDir, "not a beads workspace"},
        {"corrupted db", setupCorruptDB, "database is corrupted"},
        {"permission denied", setupNoReadAccess, "permission denied"},
    }
    // ...
}

2. Duplicate Error Path Testing

Testing the same logic multiple ways instead of once with table-driven tests.

// BAD: Repetitive individual assertions
if config.PriorityMap["0"] != 4 { t.Errorf(...) }
if config.PriorityMap["1"] != 0 { t.Errorf(...) }
if config.PriorityMap["2"] != 1 { t.Errorf(...) }

// GOOD: Table-driven
for k, want := range expectedMap {
    if got := config.PriorityMap[k]; got != want {
        t.Errorf("PriorityMap[%q] = %d, want %d", k, got, want)
    }
}

3. I/O Heavy Tests Without Mocking

Unit tests that execute real commands or heavy I/O when they could mock.

// BAD: Actually executes external commands in unit test
func TestServerFix(t *testing.T) {
    exec.Command("bd", "dolt", "stop").Run()
    // ...
}

// GOOD: Mock the execution or use integration test tag
func TestServerFix(t *testing.T) {
    executor := &mockExecutor{}
    fix := NewServerFix(executor)
    // ...
}

4. Testing Implementation, Not Behavior

Tests that break when you refactor, even though behavior is unchanged.

// BAD: Tests internal state
if len(server.connectionPool) != 3 { t.Error(...) }

// GOOD: Tests observable behavior
if resp, err := server.HandleRequest(req); err != nil { t.Error(...) }

5. Missing Boundary Tests

Testing known-good values but not boundaries and invalid inputs.

// BAD: Only tests middle values
TestPriority(1)  // works
TestPriority(2)  // works

// GOOD: Tests boundaries and invalid
TestPriority(-1) // invalid - expect error
TestPriority(0)  // boundary - min valid
TestPriority(4)  // boundary - max valid
TestPriority(5)  // boundary - first invalid

Target Metrics

Metric	Target	Current (beads)	Status
Test-to-code ratio	0.5:1 - 1.5:1	0.85:1	Healthy
Fast test suite	< 5 seconds	3.8 seconds	Good
Integration tests	< 30 seconds	~15 seconds	Good
Compilation overhead	Minimize	180 seconds	Bottleneck

Interpretation

0.5:1 - Light coverage, fast iteration (acceptable for utilities)
1:1 - Solid coverage for most projects (our target)
1.5:1 - Heavy coverage for critical systems
2:1+ - Over-engineered, maintenance burden

Beads-Specific Guidance

Well-Covered (Maintain)

Area	Why It's Well-Tested
Sync/Export/Import	Data integrity critical - comprehensive edge cases
SQLite transactions	Rollback safety, atomicity guarantees
Merge operations	Dolt-native cell-level merge
Database locking	Prevents corruption from multiple instances

Needs Attention

Area	Gap	Priority
Server lifecycle	Shutdown/signal handling	Medium
Concurrent operations	Stress testing under load	Medium
Boundary validation	Edge inputs in mapping functions	Low

Skip These

Display formatting tests (manually verify)
Simple getters/setters
Tests that duplicate SQLite's guarantees

TESTING.md - How to run tests
README_TESTING.md - Fast vs integration test strategy
dev-notes/TEST_SUITE_AUDIT.md - Test refactoring progress

Testing Philosophy

Testing Philosophy

The Test Pyramid

Tier 1: Fast Tests (< 5 seconds total)

Tier 2: Integration Tests (< 30 seconds)

Tier 3: E2E / Smoke Tests (1-5 minutes)

What Makes a Test "Right"

What to Test (Priority Matrix)

What NOT to Test Extensively

Simple utility functions

Every permutation of inputs

Obvious behavior

Same logic through different entry points

Anti-Patterns to Avoid

1. Trivial Assertions

2. Duplicate Error Path Testing

3. I/O Heavy Tests Without Mocking

4. Testing Implementation, Not Behavior

5. Missing Boundary Tests

Target Metrics

Interpretation

Beads-Specific Guidance

Well-Covered (Maintain)

Needs Attention

Skip These

Related Docs