ADR-164.1 — BbsRoomBudgetTracker: Atomic Reserve-and-Commit Design

v3/docs/adr/ADR-164.1-budget-tracker-atomicity.md

ID: ADR-164.1
Status: Draft
Date: 2026-06-29
Authors: claude (drafted with rUv)
Companion to: ADR-164 (AgentBBS Federated Business-Management Autopilot)
Related:

  • ADR-097 (federation budget circuit breaker — per-call primitive this ADR layers over)
  • ADR-145 (plugin supply-chain integrity and memory namespace governance — audit log foundation)
  • better-sqlite3 documentation (WAL mode + BEGIN IMMEDIATE semantics)

1. Problem Statement

1.1 The race window

ADR-164 §5.1.5 introduces BbsRoomBudgetTracker with the following interface:

typescript
getRemainingUsd(roomId: string): Promise<number>
recordSpend(roomId: string, usdSpent: number): Promise<void>

federation_bbs_publish calls getRemainingUsd and, if the result is greater than zero, proceeds to call sendMessage (which incurs cost), then calls recordSpend. This is a classic read-then-write pattern. Two concurrent callers can both read a remaining balance above zero, both pass the gate, and collectively spend more than the monthly cap.

1.2 Concrete example

Room #sales has a monthly cap of $1.00. Committed spend so far this month: $0.95. Remaining: $0.05.

Ten concurrent calls to federation_bbs_publish each estimate a cost of $0.05:

T0  caller-1  getRemainingUsd("sales")  → 0.05  (passes gate)
T0  caller-2  getRemainingUsd("sales")  → 0.05  (passes gate — reads same committed total)
T0  caller-3  getRemainingUsd("sales")  → 0.05  (passes gate)
  ...
T0  caller-10 getRemainingUsd("sales")  → 0.05  (passes gate)

T1  all 10 send messages, each spending $0.05
T2  all 10 call recordSpend("sales", 0.05)
T3  committed total = $0.95 + $0.50 = $1.45  ← $0.45 over cap

In autopilot mode with a Finance pod running overnight, this race can accumulate into a multi-hundred-dollar surprise bill before the morning review window.
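
The interleaving above can be reproduced without a database. The following is a minimal in-memory sketch, not the plugin's code: NaiveTracker and simulateRace are hypothetical names, and amounts are held as integer cents (an assumption made here so the arithmetic is exact).

```typescript
// Hypothetical in-memory stand-in for the read-then-write tracker in 1.1.
// Amounts are integer cents so the totals below are exact.
class NaiveTracker {
  private readonly capCents: number;
  private committedCents: number;

  constructor(capCents: number, committedCents: number) {
    this.capCents = capCents;
    this.committedCents = committedCents;
  }

  getRemainingCents(): number {
    return this.capCents - this.committedCents;
  }

  recordSpend(cents: number): void {
    this.committedCents += cents;
  }

  get totalCents(): number {
    return this.committedCents;
  }
}

// Replays the T0..T3 timeline: every caller reads before any caller writes.
function simulateRace(callers: number, costCents: number): number {
  const tracker = new NaiveTracker(100, 95); // $1.00 cap, $0.95 committed

  // T0: all callers read the same remaining balance.
  const reads: number[] = [];
  for (let i = 0; i < callers; i++) {
    reads.push(tracker.getRemainingCents());
  }

  // T1/T2: every caller whose read was above zero sends, then records.
  for (const remaining of reads) {
    if (remaining > 0) {
      tracker.recordSpend(costCents);
    }
  }

  // T3: committed total after all writes land.
  return tracker.totalCents;
}

console.log(simulateRace(10, 5)); // 145 cents, i.e. $1.45 against a $1.00 cap
```

With a single caller the same function returns 100, which is why the problem only appears under concurrency.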

1.3 Why the existing enforceBudget doesn't solve this

enforceBudget in v3/@claude-flow/plugin-agent-federation/src/domain/value-objects/federation-budget.ts is a correct, synchronous, per-call primitive. It runs check-then-decrement with no awaits inside a single call. What it does NOT do is maintain a persistent monthly ledger. It is per-envelope, not per-room-per-month. The BbsRoomBudgetTracker fills that gap, but its naive implementation creates the race described above.


2. Decision

Use an atomic reserve-and-commit token bucket backed by better-sqlite3 in WAL mode.

SQLite's write-side serialization — specifically BEGIN IMMEDIATE, which acquires the write lock before any read — eliminates the race window. No compare-and-swap loops. No advisory locks. No file-system flocks. The SQLite write lock IS the serialization primitive.

The design has three operations:

  1. reserve() — Before calling sendMessage, a caller atomically checks budget availability AND inserts a reservation row in a single BEGIN IMMEDIATE transaction. If budget would be exceeded, the transaction rolls back with BUDGET_EXCEEDED.
  2. commit() — After sendMessage returns the actual cost, the caller commits the reservation with the real amount.
  3. release() — If the caller decides not to proceed (or if the call fails before incurring cost), the reservation is released, returning budget to the pool immediately.

This three-phase contract replaces the two-phase read-then-write in the original design.


3. Data Model

3.1 Schema

The tracker maintains two tables in a dedicated SQLite file at data/bbs-budget.db (configurable via CLAUDE_FLOW_BBS_BUDGET_DB). The file is distinct from the main AgentDB store to avoid write contention on the higher-volume pattern/memory tables.

sql
-- WAL mode MUST be set once on database open, before any transactions.
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;   -- safe with WAL; fsync only on checkpoint

-- One row per registered BBS room. The _lock_bump column exists solely as
-- a write target for BEGIN IMMEDIATE serialization (see Section 5).
CREATE TABLE IF NOT EXISTS bbs_budget_rooms (
  room_id         TEXT NOT NULL PRIMARY KEY,
  monthly_cap_usd REAL NOT NULL CHECK (monthly_cap_usd >= 0),
  billing_month   TEXT NOT NULL,   -- 'YYYY-MM', resets on month boundary
  _lock_bump      INTEGER NOT NULL DEFAULT 0
);

-- One row per reservation lifecycle.
CREATE TABLE IF NOT EXISTS bbs_budget_reservations (
  reservation_id    TEXT NOT NULL PRIMARY KEY,  -- UUID v7 (sortable timestamp prefix)
  room_id           TEXT NOT NULL REFERENCES bbs_budget_rooms(room_id),
  caller_node_id    TEXT NOT NULL,              -- federation node-id of the spending agent
  estimated_usd     REAL NOT NULL CHECK (estimated_usd >= 0),
  actual_usd        REAL,                       -- NULL until commit
  state             TEXT NOT NULL
                    CHECK (state IN ('reserved','committed','released','expired','committed_post_expiry')),
  reserved_at       INTEGER NOT NULL,           -- unix milliseconds
  expires_at        INTEGER NOT NULL,           -- reserved_at + 60_000 by default
  committed_at      INTEGER,                    -- unix ms, NULL until commit
  audit_envelope_id TEXT NOT NULL               -- FK to federation_spend audit entry
);

-- Index supporting the monthly window query in reserve(). It narrows the scan
-- to one room and month; the remaining columns are read from the table rows.
CREATE INDEX IF NOT EXISTS idx_reservations_room_month
  ON bbs_budget_reservations (room_id, reserved_at);

-- Index for the expiry sweeper.
CREATE INDEX IF NOT EXISTS idx_reservations_expiry
  ON bbs_budget_reservations (state, expires_at)
  WHERE state = 'reserved';

3.2 Field notes

reservation_id (UUID v7): UUID v7 carries a millisecond-precision timestamp in its first 48 bits, so sorting by reservation_id approximates creation order. This is a convenience only: queries that need a guaranteed order must still use ORDER BY, and correctness does not depend on UUID v7 specifically. Any opaque unique string suffices.

_lock_bump: This column exists as a belt-and-suspenders mechanism alongside BEGIN IMMEDIATE. Peer-review clarification (2026-06-29): serialization is primarily provided by BEGIN IMMEDIATE itself — which natively acquires a RESERVED lock on the entire SQLite database file the moment it executes, causing any concurrent BEGIN IMMEDIATE or write to block or throw SQLITE_BUSY. The _lock_bump UPDATE is an explicit write inside the transaction making the lock acquisition visible in code-review and any query-log analysis; it also defends against a future SQLite version changing BEGIN IMMEDIATE semantics. The increment value is never read by application logic — it's a write target by intent.

expires_at: Default is reserved_at + 60_000 (60 seconds). The default is appropriate for standard cloud LLM calls (sub-30s typical). For pods that run local, large, quantized models (e.g. Finance pod doing multi-step reconciliation), a single task can legitimately take 2-3 minutes — the 60s default would cause every commit to land post-expiry and trigger COMMIT_AFTER_EXPIRY warnings (mathematically correct, but operationally noisy). To accommodate, the pod template loader (per ADR-164 §3.3) MUST honour an optional reservationExpiryMs field in the pod's JSON template, clamped to [5_000, 300_000] (5 sec floor to defang trivially-tiny windows; 5 min ceiling to defang unbounded-runaway). When unset, CLAUDE_FLOW_BBS_RESERVATION_EXPIRY_MS is used; when that is unset, the 60_000 default applies. The hard ceiling means even a misconfigured local pod cannot exceed 5 minutes per reservation, bounding the worst-case post-expiry-commit window.
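
The fallback order and clamp described for expires_at can be sketched as a small pure function. This is an illustration under assumptions: resolveReservationExpiryMs is a hypothetical name, and applying the clamp to the environment-variable value as well as the pod-template value is this sketch's reading, since the text only states the clamp for the template field.

```typescript
const DEFAULT_EXPIRY_MS = 60_000;
const MIN_EXPIRY_MS = 5_000;   // floor from the field notes
const MAX_EXPIRY_MS = 300_000; // ceiling from the field notes

// podTemplateValue: the optional reservationExpiryMs field of the pod template.
// envValue: the raw CLAUDE_FLOW_BBS_RESERVATION_EXPIRY_MS string, if set.
function resolveReservationExpiryMs(
  podTemplateValue: number | undefined,
  envValue: string | undefined,
): number {
  const fromEnv =
    envValue === undefined || envValue.trim() === '' ? NaN : Number(envValue);

  let chosen: number;
  if (podTemplateValue !== undefined && Number.isFinite(podTemplateValue)) {
    chosen = podTemplateValue;   // 1. pod template wins
  } else if (Number.isFinite(fromEnv)) {
    chosen = fromEnv;            // 2. then the environment variable
  } else {
    chosen = DEFAULT_EXPIRY_MS;  // 3. then the built-in default
  }

  // Assumption: the clamp applies to whichever source supplied the value.
  return Math.min(MAX_EXPIRY_MS, Math.max(MIN_EXPIRY_MS, Math.trunc(chosen)));
}
```

For example, a template value of 900_000 resolves to 300_000, and an unset template with the environment variable set to "1000" resolves to 5_000.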

audit_envelope_id: Every reservation must be paired with a federation_spend audit entry, which is written before the reservation transaction opens (see Section 6.1). The two writes do not share a transaction: a failed reserve() can leave an orphaned audit entry, but a reservation row never exists without one. The audit_envelope_id column links each reservation to its audit entry, so spend history can be reconstructed with a JOIN (see Section 6.3).


4. API

4.1 TypeScript signatures

File: plugins/ruflo-bbs-federation/src/bbs-room-budget-tracker.ts

typescript
/**
 * Atomically reserve budget before calling sendMessage.
 *
 * If successful, returns a reservationId the caller MUST pass to commit()
 * or release() when the downstream call finishes.
 *
 * On BUDGET_EXCEEDED the reservation is not created; the caller must abort
 * the send and return { blocked: true, reason: 'MONTHLY_BUDGET_EXCEEDED' }
 * to federation_bbs_publish.
 */
reserve(
  roomId: string,
  callerId: string,
  estimatedUsd: number
): Promise<
  | { ok: true; reservationId: string; remainingAfterReserve: number }
  | { ok: false; error: 'BUDGET_EXCEEDED' | 'ROOM_NOT_FOUND' }
>;

/**
 * Commit the reservation with the actual cost returned by sendMessage.
 *
 * actualUsd may be greater than estimatedUsd (overrun); see Section 8.2.
 * The reservation transitions from 'reserved' to 'committed'. Once
 * committed, the row is permanent — it cannot be released.
 */
commit(
  reservationId: string,
  actualUsd: number
): Promise<
  // ok-and-clean: reservation was live when commit landed
  | { ok: true; committed: true; finalRemaining: number }
  // ok-but-late: reservation had already expired when commit landed, but
  // the actual API spend was REAL and is now recorded against the budget.
  // Callers / audit dashboards should highlight this. See §5.3 Expired Commit
  // Leak fix (peer review 2026-06-29).
  | { ok: true; warned: 'COMMIT_AFTER_EXPIRY'; finalRemaining: number }
  | { ok: false; error: 'NOT_FOUND' | 'ALREADY_FINALIZED' }
>;

/**
 * Release the reservation if the send was not attempted or failed before
 * incurring cost. Transitions 'reserved' → 'released'. Released reservations
 * free their estimated_usd immediately (they are excluded from the window
 * query on the next reserve() call).
 *
 * Calling release() on a committed or already-released reservation is an
 * error; it indicates a caller bug.
 */
release(
  reservationId: string
): Promise<
  | { ok: true; released: true }
  | { ok: false; error: 'NOT_FOUND' | 'ALREADY_FINALIZED' }
>;
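
A caller-side sketch of the three-phase contract follows. It is an illustration, not the plugin's implementation: publishWithBudget, PublishOutcome, and the sendMessage callback shape are hypothetical, the commit/release return types are simplified to { ok: boolean }, and a thrown sendMessage is assumed to mean that no cost was incurred.

```typescript
type ReserveResult =
  | { ok: true; reservationId: string; remainingAfterReserve: number }
  | { ok: false; error: 'BUDGET_EXCEEDED' | 'ROOM_NOT_FOUND' };

// Simplified view of the tracker API from 4.1.
interface BudgetTracker {
  reserve(roomId: string, callerId: string, estimatedUsd: number): Promise<ReserveResult>;
  commit(reservationId: string, actualUsd: number): Promise<{ ok: boolean }>;
  release(reservationId: string): Promise<{ ok: boolean }>;
}

type PublishOutcome =
  | { blocked: true; reason: string }
  | { blocked: false; actualUsd: number };

async function publishWithBudget(
  tracker: BudgetTracker,
  roomId: string,
  callerId: string,
  estimatedUsd: number,
  sendMessage: () => Promise<number>, // hypothetical: resolves to the actual USD cost
): Promise<PublishOutcome> {
  // Phase 1: reserve before any cost is incurred.
  const reserved = await tracker.reserve(roomId, callerId, estimatedUsd);
  if (!reserved.ok) {
    const reason =
      reserved.error === 'BUDGET_EXCEEDED' ? 'MONTHLY_BUDGET_EXCEEDED' : reserved.error;
    return { blocked: true, reason };
  }

  let actualUsd: number;
  try {
    actualUsd = await sendMessage();
  } catch (err) {
    // Phase 3 (failure path): assumed to have cost nothing, so free the budget.
    await tracker.release(reserved.reservationId);
    throw err;
  }

  // Phase 2: record the real cost, even if it exceeds the estimate.
  await tracker.commit(reserved.reservationId, actualUsd);
  return { blocked: false, actualUsd };
}
```

The point of the shape is that every successful reserve() is followed by exactly one of commit() or release(), so no reservation is left to expire on the normal paths.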

5. The Reserve Transaction (Load-Bearing Detail)

5.1 Why BEGIN IMMEDIATE is mandatory

SQLite has three transaction modes:

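| Mode | Write lock acquired | Risk |
| --- | --- | --- |
| BEGIN (deferred) | On first write | Two readers both pass the gate before either writes |
| BEGIN IMMEDIATE | Immediately on BEGIN | Only one caller holds the write lock; all others block |
| BEGIN EXCLUSIVE | Immediately on BEGIN (full database lock) | Correct, but in WAL mode it behaves the same as IMMEDIATE, and in rollback-journal modes it also blocks all readers; overkill here |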

The race in Section 1.2 exists precisely because deferred mode allows two readers to execute the "check committed + reserved total" SELECT before either performs the INSERT. BEGIN IMMEDIATE closes this window: the write lock is acquired atomically with the transaction start, so the second caller blocks at its own BEGIN IMMEDIATE until the first commits or rolls back.

5.2 Pseudocode for reserve()

function reserve(roomId, callerId, estimatedUsd):

  auditEnvelopeId = generateAuditEntry(roomId, callerId, estimatedUsd)
  // ^-- audit write runs BEFORE the budget transaction starts.
  //     If the audit write fails, we throw immediately: no budget state changes.
  //     (Audit and budget state must stay in sync, linked by auditEnvelopeId; see Section 6.)

  BEGIN IMMEDIATE;

    -- Step 1: Touch the room header row.
    -- This is the serialization primitive. SQLite grants the write lock
    -- to exactly one connection at a time. All other BEGIN IMMEDIATE
    -- callers block here until we commit or rollback.
    UPDATE bbs_budget_rooms
      SET _lock_bump = _lock_bump + 1
      WHERE room_id = roomId;

    -- Step 1b: If no rows updated, the room doesn't exist.
    IF rows_changed == 0 THEN
      ROLLBACK;
      RETURN { ok: false, error: 'ROOM_NOT_FOUND' };
    END IF;

    -- Step 2: Fetch room's monthly cap and verify billing month.
    SELECT monthly_cap_usd, billing_month
      FROM bbs_budget_rooms
      WHERE room_id = roomId;
    -- If billing_month != current YYYY-MM, reset (see Section 7.3).

    -- Step 3: Compute current financial position within this month.
    --   committed_total = sum of actual_usd for committed rows this month,
    --                     including 'committed_post_expiry' rows (Section 8.1)
    --   reserved_total  = sum of estimated_usd for live reserved rows this month
    --   "live" = state='reserved' AND expires_at > now_ms
    --   "this month" = reserved_at >= start_of_billing_month (unix ms)
    now_ms = Date.now();
    SELECT
      COALESCE(SUM(CASE WHEN state IN ('committed', 'committed_post_expiry')
                        THEN actual_usd ELSE 0 END), 0)
        AS committed_total,
      COALESCE(SUM(CASE WHEN state = 'reserved' AND expires_at > now_ms
                        THEN estimated_usd ELSE 0 END), 0)
        AS reserved_total
    FROM bbs_budget_reservations
    WHERE room_id = roomId
      AND reserved_at >= billing_month_start_ms;

    -- Step 4: Gate check.
    projected = committed_total + reserved_total + estimatedUsd;
    IF projected > monthly_cap_usd THEN
      ROLLBACK;
      RETURN { ok: false, error: 'BUDGET_EXCEEDED' };
    END IF;

    -- Step 5: Insert the reservation (reusing now_ms from Step 3).
    expiry   = now_ms + RESERVATION_EXPIRY_MS;  // default 60_000
    reservId = uuidv7();

    INSERT INTO bbs_budget_reservations
      (reservation_id, room_id, caller_node_id, estimated_usd,
       actual_usd, state, reserved_at, expires_at, committed_at,
       audit_envelope_id)
    VALUES
      (reservId, roomId, callerId, estimatedUsd,
       NULL, 'reserved', now_ms, expiry, NULL,
       auditEnvelopeId);

  COMMIT;

  remaining = monthly_cap_usd - (committed_total + reserved_total + estimatedUsd);
  RETURN { ok: true, reservationId: reservId, remainingAfterReserve: remaining };
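
The arithmetic of Steps 3 and 4 can be isolated as a pure function over reservation rows. This is a sketch under assumptions: gateCheck, ReservationRow, and GateResult are hypothetical names; the row shape mirrors the schema in 3.1; and committed_post_expiry rows are counted as committed spend, as the resolution in 8.1 requires. It models only the gate arithmetic, not the locking.

```typescript
type ReservationState =
  | 'reserved'
  | 'committed'
  | 'released'
  | 'expired'
  | 'committed_post_expiry';

interface ReservationRow {
  state: ReservationState;
  estimatedUsd: number;
  actualUsd: number | null; // null until commit
  reservedAt: number;       // unix ms
  expiresAt: number;        // unix ms
}

interface GateResult {
  allowed: boolean;
  projectedUsd: number;
}

function gateCheck(
  rows: ReservationRow[],
  capUsd: number,
  estimatedUsd: number,
  nowMs: number,
  monthStartMs: number,
): GateResult {
  let committedTotal = 0;
  let reservedTotal = 0;
  for (const row of rows) {
    if (row.reservedAt < monthStartMs) continue; // previous billing month
    if (row.state === 'committed' || row.state === 'committed_post_expiry') {
      committedTotal += row.actualUsd ?? 0;
    } else if (row.state === 'reserved' && row.expiresAt > nowMs) {
      reservedTotal += row.estimatedUsd; // live reservation
    }
    // released, expired, and lapsed-but-unswept rows contribute nothing
  }
  const projectedUsd = committedTotal + reservedTotal + estimatedUsd;
  return { allowed: projectedUsd <= capUsd, projectedUsd };
}
```

A request that brings the projection exactly to the cap is allowed, matching the strict `projected > monthly_cap_usd` rejection in Step 4.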

5.3 Pseudocode for commit()

function commit(reservationId, actualUsd):

  BEGIN IMMEDIATE;

    SELECT state, room_id, estimated_usd, reserved_at, expires_at
      FROM bbs_budget_reservations
      WHERE reservation_id = reservationId;

    IF no row THEN
      ROLLBACK; RETURN { ok: false, error: 'NOT_FOUND' };
    END IF;

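    -- A swept ('expired') reservation is deliberately NOT treated as finalized:
    -- rejecting it here would reopen the Expired Commit Leak whenever the
    -- sweeper runs before the late commit arrives (see Test 6, Step C).
    IF state IN ('committed', 'committed_post_expiry', 'released') THEN
      ROLLBACK; RETURN { ok: false, error: 'ALREADY_FINALIZED' };
    END IF;

    IF state = 'expired' OR expires_at <= Date.now() THEN
      -- Reservation expired (swept or not) before the caller committed.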
      -- *** PEER-REVIEW FIX (2026-06-29): never let a real API spend escape the ledger. ***
      -- Previously: marked 'expired' and returned ok:false. This created the
      -- "Expired Commit Leak" — the API was actually called (spend incurred)
      -- but the budget tracker recorded nothing, so the monthly cap could be
      -- silently bypassed by any slow-LLM loop.
      -- Now: accept the spend, transition state to 'committed_post_expiry',
      -- record actual_usd, AND emit an alert envelope to the exec room.
      UPDATE bbs_budget_reservations
        SET state = 'committed_post_expiry',
            actual_usd = actualUsd,
            committed_at = Date.now()
        WHERE reservation_id = reservationId;
      -- The spend-reporter MUST emit a `reservation.committed_post_expiry`
      -- audit event in the same transaction (see §6.1 atomicity requirement).
      -- The federation-breaker-service treats this event as a budget consumer
      -- — cumulative spend still includes it for circuit-breaker math.
      COMMIT;
      RETURN { ok: true, warned: 'COMMIT_AFTER_EXPIRY', finalRemaining: <recomputed> };
    END IF;

    -- On overrun (actualUsd > estimatedUsd), we charge the room the
    -- actual amount regardless. See Section 8.2.
    UPDATE bbs_budget_reservations
      SET state = 'committed',
          actual_usd = actualUsd,
          committed_at = Date.now()
      WHERE reservation_id = reservationId;

    -- Re-compute remaining after commit (for the return value).
    SELECT monthly_cap_usd FROM bbs_budget_rooms WHERE room_id = room_id;
    SELECT COALESCE(SUM(actual_usd), 0) AS committed_total
      FROM bbs_budget_reservations
      WHERE room_id = room_id
        AND state IN ('committed', 'committed_post_expiry')
        AND reserved_at >= billing_month_start_ms;
    SELECT COALESCE(SUM(estimated_usd), 0) AS live_reserved_total
      FROM bbs_budget_reservations
      WHERE room_id = room_id AND state = 'reserved'
        AND expires_at > Date.now()
        AND reserved_at >= billing_month_start_ms;

  COMMIT;

  finalRemaining = monthly_cap_usd - (committed_total + live_reserved_total);
  RETURN { ok: true, committed: true, finalRemaining };
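
The state decision inside commit() can be written as a pure function, which makes the late-commit path easy to test in isolation. This is a sketch: decideCommit and CommitDecision are hypothetical names, and treating a row already swept to 'expired' as a late commit (rather than ALREADY_FINALIZED) is this sketch's reading of Test 6, Step C.

```typescript
type State =
  | 'reserved'
  | 'committed'
  | 'released'
  | 'expired'
  | 'committed_post_expiry';

type CommitDecision =
  | { ok: true; nextState: 'committed' }
  | { ok: true; nextState: 'committed_post_expiry'; warned: 'COMMIT_AFTER_EXPIRY' }
  | { ok: false; error: 'NOT_FOUND' | 'ALREADY_FINALIZED' };

function decideCommit(
  row: { state: State; expiresAt: number } | undefined,
  nowMs: number,
): CommitDecision {
  if (row === undefined) {
    return { ok: false, error: 'NOT_FOUND' };
  }
  if (
    row.state === 'committed' ||
    row.state === 'committed_post_expiry' ||
    row.state === 'released'
  ) {
    return { ok: false, error: 'ALREADY_FINALIZED' };
  }
  // Swept ('expired') or lapsed-but-unswept: the spend is real, so record it.
  if (row.state === 'expired' || row.expiresAt <= nowMs) {
    return { ok: true, nextState: 'committed_post_expiry', warned: 'COMMIT_AFTER_EXPIRY' };
  }
  return { ok: true, nextState: 'committed' };
}
```

For a reservation with expiresAt = 61_000, a commit at 60_999 is clean, while a commit at 61_000 or later takes the committed_post_expiry path.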

5.4 Pseudocode for release()

function release(reservationId):

  BEGIN IMMEDIATE;

    SELECT state FROM bbs_budget_reservations
      WHERE reservation_id = reservationId;

    IF no row THEN
      ROLLBACK; RETURN { ok: false, error: 'NOT_FOUND' };
    END IF;

    IF state IN ('committed', 'committed_post_expiry', 'released', 'expired') THEN
      ROLLBACK; RETURN { ok: false, error: 'ALREADY_FINALIZED' };
    END IF;

    UPDATE bbs_budget_reservations
      SET state = 'released'
      WHERE reservation_id = reservationId;

  COMMIT;

  RETURN { ok: true, released: true };

6. Audit Log Integration

6.1 Atomicity requirement

Budget state and audit state must be kept in sync. A reservation that has no corresponding audit entry creates an unexplained gap in the spend history. An audit entry with no reservation row means the budget was potentially spent without being gated.

The sequencing is:

  1. Write audit entry first (via spend-reporter.ts → ProductionFederationSpendReporter). This targets the federation-spend namespace, key fed-spend-<roomId>-<ts>, as specified by DEFAULT_FEDERATION_SPEND_NAMESPACE in v3/@claude-flow/plugin-agent-federation/src/application/spend-reporter.ts.
  2. Capture the returned key as audit_envelope_id.
  3. Open BEGIN IMMEDIATE and insert the reservation row referencing audit_envelope_id.

If step 1 fails (memory backend error), throw immediately — no reservation is created. If step 3 fails (budget exceeded, database locked, etc.), the audit entry is orphaned but the budget is NOT charged. Orphaned audit entries with no matching reservation_id are a diagnostic signal, not a correctness violation; they will be caught by the audit reconciliation job (out of scope for Phase 1-3).

6.2 Overrun entries

When commit() is called with actualUsd > estimatedUsd, a second audit event is written before the commit transaction:

typescript
// emit a reservation_overrun audit entry via spend-reporter.ts
await spendReporter.reportSpend({
  peerId: roomId,
  taskId: reservationId,
  tokensUsed: 0,
  usdSpent: actualUsd - estimatedUsd,
  ts: new Date().toISOString(),
  success: true,
  // custom field consumed by ruflo-cost-tracker:
  eventKind: 'reservation_overrun',
});

This allows the FederationBreakerService (in v3/@claude-flow/plugin-agent-federation/src/application/federation-breaker-service.ts) to detect spend anomalies via its rolling sample buffer. If cumulative spend for the room crosses the 24h cap threshold (dailySpendCapUsd, default $5.00), the breaker can trip the room. See ADR-097 §Phase 2 for the breaker trip logic.

6.3 Cross-reference: ADR-097 §audit-log

The federation_spend events emitted here are consumed by ruflo-cost-tracker (ADR-097 Phase 3). The audit_envelope_id on each reservation row provides a direct JOIN path:

sql
SELECT r.reservation_id, r.state, r.actual_usd, a.ts, a.peerId
FROM bbs_budget_reservations r
JOIN federation_spend_events a
  -- audit_envelope_id already holds the full fed-spend-<roomId>-<ts> key
  -- returned by the audit write (Section 6.1, step 2).
  ON a.key = r.audit_envelope_id
WHERE r.room_id = 'finance'
ORDER BY r.reserved_at DESC
LIMIT 100;

Full month history for a room is reconstructable from this JOIN alone.


7. Expiry Handling

7.1 The sweeper

A background sweeper runs every 5 seconds (configurable via CLAUDE_FLOW_BBS_SWEEP_INTERVAL_MS). It transitions expired reservations from reserved to expired:

typescript
// Runs on setInterval, not BEGIN IMMEDIATE — this is intentionally deferred.
async function sweepExpiredReservations(db: Database): Promise<number> {
  const now = Date.now();
  const result = db
    .prepare(`
      UPDATE bbs_budget_reservations
        SET state = 'expired'
        WHERE state = 'reserved'
          AND expires_at < ?
    `)
    .run(now);
  return result.changes;
}

Note that the sweeper opens no explicit transaction: the single UPDATE runs in autocommit mode and holds the write lock only for the duration of that statement. The sweeper does not compete with reserve()/commit() for correctness because:

  • reserve() already excludes expired rows from its window query (expires_at > now_ms).
  • The sweeper merely formalizes the state; it does not change what reserve() would observe.

7.2 Phantom debt (intentional safety bias)

Reservations in state reserved where expires_at <= now() are excluded from the window query in reserve(). This means their estimated_usd is NOT counted against the room's remaining budget from the perspective of new reservations.

However, expired-but-unswept rows remain in state reserved until the sweeper runs, which may lag by up to SWEEP_INTERVAL_MS (default 5 seconds). During this window the rows are already excluded from the gate check (since expires_at <= now()), so they do not block new reservations.

In other words, budget is freed by the expiry timestamp, not by the sweep. The room's raw row totals (committed + reserved + expired) can briefly exceed the cap on paper, but the gate check counts only committed rows and live reservations, so new callers see the freed budget as soon as the expiry timestamp passes. The safety bias lies on the commit side: if the expired caller does commit later, its real spend is still charged to the room (see Section 8.1).

If the sweeper crashes permanently (e.g., the process exits), expired rows accumulate indefinitely. They are excluded from gate checks, so correctness is maintained, but storage grows. The sweeper health is reported by npx ruflo doctor --component bbs-budget.

7.3 Monthly billing reset

On the first reserve() call after a month boundary, the tracker detects billing_month != current-YYYY-MM and runs a reset inside the same BEGIN IMMEDIATE transaction before the gate check:

sql
UPDATE bbs_budget_rooms
  SET billing_month = ?, _lock_bump = _lock_bump + 1
  WHERE room_id = ?;

Old reservation rows from the previous month are NOT deleted on reset. They are retained for audit purposes until the retention policy purges them (configurable retentionDays per ADR-145 §namespace-governance). The window query is naturally scoped to reserved_at >= billing_month_start_ms, so old rows are ignored without deletion.
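
The billing-month key and the billing_month_start_ms bound used by the window query can be derived as follows. This is a sketch under one loud assumption: month boundaries are taken in UTC, which the ADR does not specify. The function names are hypothetical.

```typescript
// Returns the 'YYYY-MM' key for the instant nowMs, in UTC (assumption).
function billingMonthOf(nowMs: number): string {
  const d = new Date(nowMs);
  const month = d.getUTCMonth() + 1;
  const mm = month < 10 ? '0' + month : String(month);
  return d.getUTCFullYear() + '-' + mm;
}

// Returns the unix-ms instant at which the given 'YYYY-MM' month starts (UTC).
function billingMonthStartMs(billingMonth: string): number {
  const parts = billingMonth.split('-');
  const year = Number(parts[0]);
  const month = Number(parts[1]);
  return Date.UTC(year, month - 1, 1);
}

// True when the stored billing_month is stale and the reset UPDATE must run.
function needsReset(storedBillingMonth: string, nowMs: number): boolean {
  return storedBillingMonth !== billingMonthOf(nowMs);
}
```

A room whose stored billing_month is '2026-06' needs no reset one millisecond before 1 July 2026 00:00 UTC and does need one at that instant; rows reserved in June then fall below the new window bound without being deleted.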


8. Open Questions

8.1 Expired commit handling — RESOLVED (peer review 2026-06-29)

Original concern: if a caller takes longer than RESERVATION_EXPIRY_MS to commit (slow model, network timeout), its reservations expire mid-flight. The first draft rejected the late commit with EXPIRED, meaning the actual API cost was incurred but the budget tracker showed nothing — a silent budget bypass (the "Expired Commit Leak").

Resolution: commit() on an expired reservation now ACCEPTS the spend, transitions state to committed_post_expiry, records actual_usd against the room's budget, returns { ok: true, warned: 'COMMIT_AFTER_EXPIRY', finalRemaining }, AND emits a high-priority reservation.committed_post_expiry audit envelope to the room and (when configured) to #exec. This guarantees that every real API spend is reflected in the monthly cap math, even when execution windows are violated. The warning is the observability signal — repeated COMMIT_AFTER_EXPIRY for the same pod is a clear instruction to increase reservationExpiryMs for that pod template (or to investigate why the pod's calls are exceeding their expected window). See §3.2 expires_at field notes for per-pod tuning, §5.3 for the commit-path SQL, and §4.1 for the discriminated return shape.

A residual mitigation also applies: the per-pod reservationExpiryMs ceiling of 300_000 ms (5 minutes) bounds the worst-case window during which a single reservation can sit pre-commit, regardless of agent stall depth.

8.2 Commit with actualUsd > estimatedUsd (overrun)

Default behaviour: charge the room the actual amount AND emit a reservation_overrun audit entry. The breaker may trip if cumulative overruns move total spend past the dailySpendCapUsd threshold in FederationBreakerService.

An alternative — fail the commit — was considered and rejected. A failed commit would mean the API call completed successfully but the cost is unrecorded, which is worse for budget accuracy than charging the overrun.

8.3 Cross-pod budget pooling

Out of scope for this ADR. The Phase 1-3 model is independent per-pod per-room per-month ledgers. If a business eventually wants a shared budget pool (e.g., the #exec room borrows from the #sales room's unused balance), the data model supports it via a future bbs_budget_pool table, but the implementation is Phase 4. Do not design for it now.


9. Concurrency Test Plan

All six tests are implementable in vitest using the existing better-sqlite3 setup (the agentdb-tools.ts file in v3/@claude-flow/cli/src/mcp-tools/agentdb-tools.ts demonstrates the project's standard better-sqlite3 integration pattern).

Test 1: 100 concurrent reserves, $1.00 cap

Setup: register room "sales-test" with monthly_cap_usd = 1.00
       set billing_month = current YYYY-MM

Action: fire 100 concurrent Promise calls:
          reserve("sales-test", "caller-N", 0.05)
        where N = 0..99, all fired in parallel via Promise.all()

Assert:
  - exactly 20 reservations have state 'reserved'
  - exactly 80 calls returned { ok: false, error: 'BUDGET_EXCEEDED' }
  - SELECT SUM(estimated_usd) FROM bbs_budget_reservations
      WHERE room_id = 'sales-test' AND state = 'reserved'
    = 1.00  (not 1.00 + epsilon, not 0.95)
  - no row with state 'committed' or 'released'
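
The expected counts in Test 1 follow from serialization: once BEGIN IMMEDIATE forces the 100 calls into a sequence, the outcome is the same as running the gate in a loop. A minimal model of that sequence, using integer cents (an assumption of this sketch, chosen so the sum is exact), with a hypothetical function name:

```typescript
interface Test1Outcome {
  accepted: number;
  rejected: number;
  reservedCents: number;
}

// Models N serialized reserve() calls against an empty room.
function runSerializedReserves(
  callers: number,
  capCents: number,
  estimateCents: number,
): Test1Outcome {
  let reservedCents = 0;
  let accepted = 0;
  let rejected = 0;
  for (let i = 0; i < callers; i++) {
    if (reservedCents + estimateCents > capCents) {
      rejected++;               // BUDGET_EXCEEDED
    } else {
      reservedCents += estimateCents;
      accepted++;               // row inserted in state 'reserved'
    }
  }
  return { accepted, rejected, reservedCents };
}

console.log(runSerializedReserves(100, 100, 5));
// { accepted: 20, rejected: 80, reservedCents: 100 }
```

Whether the same equality holds bit-for-bit when SQLite sums REAL values is not something this model establishes; the real test may need a small tolerance on the SUM assertion.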

Test 2: Block, then release, then allow 20 of 50

Setup: register room "sales-test2" with monthly_cap_usd = 1.00

Step A: reserve("sales-test2", "initial", 1.00)
        → reservationIdA

Step B: fire 50 concurrent reserve("sales-test2", "caller-N", 0.05)
        Promise.all of 50 calls

Assert step B: all 50 return BUDGET_EXCEEDED
               (1.00 reserved + 0.05 requested > 1.00)

Step C: release(reservationIdA)

Step D: fire another 50 concurrent reserve("sales-test2", "caller-N", 0.05)

Assert step D:
  - exactly 20 succeed (1.00 freed, 20 × $0.05 = $1.00)
  - exactly 30 return BUDGET_EXCEEDED

Test 3: Reservation expiry frees budget for next caller

Setup: register room "hr-test" with monthly_cap_usd = 0.10
       set RESERVATION_EXPIRY_MS = 100  (100ms for test speed)

Step A: reserve("hr-test", "agent-1", 0.10)
        → reservationId (state: 'reserved')

Step B: reserve("hr-test", "agent-2", 0.05)
        → assert BUDGET_EXCEEDED (reservation still live)

Step C: wait 150ms (expiry passes)
        run sweepExpiredReservations() manually

Step D: reserve("hr-test", "agent-2", 0.05)
        → assert ok: true, state 'reserved'
           (expired row excluded from gate check)

Test 4: Crash mid-reserve (transaction atomicity)

Setup: use better-sqlite3 in-process (no separate process needed)
       intercept the COMMIT step by throwing after the INSERT
       and before db.exec('COMMIT')

Action: call reserve("ops-test", "agent", 0.05)
        → throws mid-transaction

Assert:
  - db.prepare('SELECT * FROM bbs_budget_reservations').all() returns []
    (transaction was rolled back; no orphan row)
  - bbs_budget_rooms._lock_bump is unchanged from pre-test value
    (write was rolled back with the transaction)

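Note: better-sqlite3's synchronous API means "mid-transaction crash" is modelled by an exception thrown between db.prepare('BEGIN IMMEDIATE').run() and db.prepare('COMMIT').run(). SQLite rolls back an uncommitted transaction on its own only when the connection closes; an exception propagating through JavaScript does not end a manually opened transaction. The implementation must therefore issue ROLLBACK in its error path, or wrap the body in better-sqlite3's db.transaction(fn) and invoke it via .immediate(), which rolls back when fn throws.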

Test 5: Clock skew backward 30 seconds

Setup: register room "finance-test" with monthly_cap_usd = 0.50
       set RESERVATION_EXPIRY_MS = 60_000

Step A: reserve at t=1000 (Date.now() = 1000)
        → reservation expires_at = 61_000

Step B: simulate clock jump: Date.now() = 1000 - 30_000 = -29_000
        (wrap in a clock-override fixture)

Step C: run sweepExpiredReservations()
        → WHERE expires_at < -29_000  — no rows match (61_000 is not < -29_000)
        Assert: reservation is NOT swept, state remains 'reserved'

Step D: call commit(reservationId, 0.05) at clock=-29_000
        → expires_at (61_000) > Date.now() (-29_000): not expired
        Assert: commit succeeds

This test validates that a clock moving backward does not revive an expired
reservation. The expires_at column stores the wall-clock instant of expiry
as an absolute unix-ms value; it does not store a relative duration.
A backward jump makes all unexpired reservations appear to have more time
remaining, not less. Expired reservations (expires_at already in the past)
cannot be moved back to 'reserved' because the sweeper only writes
'expired' — it never writes 'reserved'.
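
The sweeper's predicate can be modelled as a pure function to check the clock-skew reasoning. This is an in-memory sketch with hypothetical names (SweepRow, sweep); it mirrors the `expires_at < now` comparison of the sweeper in 7.1 and does not touch a database.

```typescript
interface SweepRow {
  id: string;
  state: 'reserved' | 'expired';
  expiresAt: number; // absolute unix ms
}

// Returns a new array in which every lapsed 'reserved' row is marked 'expired'.
// Rows are never moved back to 'reserved'.
function sweep(rows: SweepRow[], nowMs: number): SweepRow[] {
  return rows.map((row): SweepRow =>
    row.state === 'reserved' && row.expiresAt < nowMs
      ? { id: row.id, state: 'expired', expiresAt: row.expiresAt }
      : row,
  );
}

const live: SweepRow[] = [{ id: 'a', state: 'reserved', expiresAt: 61_000 }];

console.log(sweep(live, -29_000)[0].state); // 'reserved': backward jump, nothing swept
console.log(sweep(live, 61_001)[0].state);  // 'expired': clock has passed expiry
```

Because the comparison is against an absolute instant, a backward jump can only delay a sweep; sweeping an already expired row again leaves it 'expired'.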

Test 6: COMMIT_AFTER_EXPIRY captures the real spend

Added by peer review 2026-06-29 — verifies the §8.1 Expired Commit Leak fix.

Setup: room with monthly_cap_usd = $1.00, reservationExpiryMs = 100

Step A: reserve($0.30) → reservationId, expires_at = now + 100ms
Step B: wait 200ms (the reservation is now expired but not yet committed
        nor swept; this simulates a slow LLM call)
Step C: assert the row is not yet in state 'committed_post_expiry' (the
        window query excludes it; its state is still 'reserved', or has
        just been moved to 'expired' by the sweeper; either is acceptable)
Step D: call commit(reservationId, 0.30)
        Assert: returns { ok: true, warned: 'COMMIT_AFTER_EXPIRY',
                          finalRemaining: 0.70 }
        Assert: room's cumulative committed spend now includes the $0.30
        Assert: audit log contains a `reservation.committed_post_expiry`
                envelope referencing reservationId

Step E: reserve($0.70) → succeeds (since remaining is exactly 0.70)
Step F: reserve($0.01) → BUDGET_EXCEEDED (room is fully spent — the
        post-expiry commit counted)

This test is the load-bearing proof that the Expired Commit Leak is closed: the late commit MUST charge the room and MUST be visible in subsequent gate checks. Failing this test means the §8.1 fix has regressed.


10. Failure Modes

10.1 SQLite database is locked under contention

WAL mode greatly reduces lock contention compared to rollback-journal mode, because readers work from a snapshot while the single writer appends to the WAL. Writers still exclude each other: if BEGIN IMMEDIATE cannot acquire the write lock within the connection's busy timeout, the caller receives SQLITE_BUSY. SQLite's own default busy timeout is 0 ms (fail immediately); better-sqlite3 sets it to 5000 ms unless the timeout constructor option says otherwise.

The tracker applies exponential backoff on SQLITE_BUSY:

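typescript
const BACKOFF_SCHEDULE_MS = [10, 50, 250] as const;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// better-sqlite3 reports lock contention as an error whose code is 'SQLITE_BUSY'.
function isSqliteBusy(err: unknown): boolean {
  return typeof err === 'object' && err !== null &&
    (err as { code?: unknown }).code === 'SQLITE_BUSY';
}

async function withRetry<T>(fn: () => T): Promise<T> {
  // One initial attempt, then one retry after each backoff entry.
  for (let attempt = 0; ; attempt++) {
    try {
      return fn();  // synchronous better-sqlite3 call
    } catch (err: unknown) {
      // Give up on non-busy errors, or once every backoff slot has been used.
      if (!isSqliteBusy(err) || attempt >= BACKOFF_SCHEDULE_MS.length) {
        throw err;
      }
      await sleep(BACKOFF_SCHEDULE_MS[attempt]);
    }
  }
}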

Three retries (10ms / 50ms / 250ms) cover transient lock spikes. If all three fail, the error propagates to federation_bbs_publish, which returns { blocked: true, reason: 'DATABASE_BUSY' } rather than attempting the send. This is fail-closed: a busy database does not permit unbudgeted sends.

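Setting an explicit busy timeout on connection open, for example db.pragma('busy_timeout = 500'), is also recommended as a complement to WAL mode. It lets SQLite wait internally for up to 500ms before surfacing SQLITE_BUSY to Node, which reduces the frequency of application-level retries without changing the correctness guarantee. Two caveats apply: better-sqlite3 is synchronous, so that wait blocks the event loop and the value should stay small; and better-sqlite3's own default (the timeout constructor option) is 5000ms, so 500 shortens the wait rather than lengthening it.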

### 10.2 Database file corruption

If the WAL file or the main database file is corrupt (detected on the first `db.prepare()` or on an explicit `PRAGMA integrity_check`), the tracker fails closed: all `reserve()` calls return `{ ok: false, error: 'DATABASE_UNAVAILABLE' }`, and all `federation_bbs_publish` calls return `{ blocked: true, reason: 'BUDGET_DATABASE_UNAVAILABLE' }`.

Recovery requires operator intervention:

1. Stop all pods for the affected node.
2. Delete or rename `data/bbs-budget.db`, `data/bbs-budget.db-wal`, and `data/bbs-budget.db-shm`.
3. Re-run `npx ruflo doctor --fix --component bbs-budget` to recreate the schema.
4. Restart pods. The new database starts with zero committed spend, so the room may temporarily accept more spend than its cap before historical spends are re-accounted. This is a known limitation of Phase 1-3; a reconciliation import from the audit log is deferred to Phase 4.
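The detection step that triggers this fail-closed state could look roughly like the sketch below. It assumes a hypothetical `checkBudgetDb` helper and a minimal `PragmaRunner` interface, which a better-sqlite3 `Database` would satisfy through its `pragma()` method; only the `DATABASE_UNAVAILABLE` code is taken from this section.

```typescript
// Minimal surface needed from the connection. A better-sqlite3 Database
// satisfies it via db.pragma(), which returns an array of row objects.
interface PragmaRunner {
  pragma(source: string): unknown;
}

type DbHealth = { ok: true } | { ok: false; error: 'DATABASE_UNAVAILABLE' };

// Hypothetical helper: treats a thrown error, or any result other than a
// single 'ok' row, as corruption and fails closed.
function checkBudgetDb(db: PragmaRunner): DbHealth {
  try {
    const rows = db.pragma('integrity_check');
    const healthy =
      Array.isArray(rows) &&
      rows.length === 1 &&
      (rows[0] as { integrity_check?: unknown }).integrity_check === 'ok';
    return healthy ? { ok: true } : { ok: false, error: 'DATABASE_UNAVAILABLE' };
  } catch {
    return { ok: false, error: 'DATABASE_UNAVAILABLE' };
  }
}
```

Running the check once at startup, and again after any unexpected SQLite error, would keep the cost off the hot `reserve()` path.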

### 10.3 Audit log write failure before `reserve()`

If the `reportSpend()` call to `spend-reporter.ts` throws before the `BEGIN IMMEDIATE` transaction starts, the entire `reserve()` call throws. No reservation is created. The caller receives an error and must abort the send. This is the correct fail-closed behaviour: if we cannot audit the spend, we do not permit it.
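The ordering can be sketched as follows. `AuditReporter`, `reserveWithAudit`, and the `runReserveTxn` callback are hypothetical stand-ins (the real `reportSpend()` signature is not shown in this ADR); the point is only that the audit write completes before the transaction is started.

```typescript
// Hypothetical minimal contract; the real reporter lives in spend-reporter.ts
// and its exact signature is an assumption here.
interface AuditReporter {
  reportSpend(roomId: string, estimatedUsd: number): Promise<void>;
}

// Audit first, reserve second: if the audit write rejects, the reservation
// transaction is never started and the error reaches the caller.
async function reserveWithAudit(
  reporter: AuditReporter,
  runReserveTxn: () => string, // stands in for the BEGIN IMMEDIATE transaction
  roomId: string,
  estimatedUsd: number
): Promise<string> {
  await reporter.reportSpend(roomId, estimatedUsd);
  return runReserveTxn(); // returns the reservation id
}
```

Because nothing is written to the budget database until the audit call resolves, a failed audit needs no rollback.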


## 11. Performance Bound

### 11.1 Target

p99 latency for `reserve()` under N=50 concurrent callers on a Mac mini-class host (Apple M-series, NVMe SSD): less than 5ms.

### 11.2 Rationale

WAL mode eliminates reader-writer contention. `BEGIN IMMEDIATE` serializes writers, so N concurrent callers form a queue. Each queue slot consists of:

- One `UPDATE bbs_budget_rooms` (by primary key: index seek, ~0.1ms)
- One `SELECT` with a covering index on `(room_id, reserved_at)` (~0.2ms)
- One `INSERT` (append to WAL, ~0.3ms)
- WAL write + fsync at checkpoint cadence (amortized, not per-transaction, with `PRAGMA synchronous = NORMAL`)

Estimated per-transaction time in-process: ~0.5–1ms. If all N=50 callers arrive at the same instant, the last caller in the serial queue waits ~25–50ms (50 × 0.5–1ms), which is above the 5ms target. The target therefore depends on arrivals being staggered: in practice, N=50 concurrent HTTP handlers will not all land simultaneously, and arrival jitter shortens the effective queue considerably. Load testing using the pattern in `scripts/bench-agenticow.mjs` should be run in CI to validate the 5ms target.
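The worst-case arithmetic for a fully simultaneous burst can be worked through with a short sketch. The function names are illustrative, and the model is deliberately simple: a single writer, a fixed service time per transaction, and all callers arriving at the same instant.

```typescript
// Completion latency of each caller when all n arrive at the same instant
// and a single writer serves them one at a time (the k-th caller waits k slots).
function burstLatenciesMs(n: number, serviceMs: number): number[] {
  return Array.from({ length: n }, (_, k) => (k + 1) * serviceMs);
}

// Nearest-rank percentile over an ascending-sorted sample.
function percentile(sortedAsc: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  return sortedAsc[Math.max(rank, 1) - 1];
}

const fast = burstLatenciesMs(50, 0.5); // lower per-transaction estimate
const slow = burstLatenciesMs(50, 1);   // upper per-transaction estimate
console.log(percentile(fast, 99), percentile(slow, 99)); // 25 50
```

Under this model the p99 sits at 25–50ms, so the 5ms target can only be met when arrivals are spread out enough that the queue rarely holds more than a handful of callers. This is what the load test needs to confirm.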

### 11.3 Throughput ceiling and non-goal

SQLite write serialization means the throughput ceiling for `reserve()` is approximately 1000 calls/second on this hardware (one writer at a time, ~1ms per write). This is sufficient for Phase 1-3: a business autopilot across 7 pods, each firing at most one send every few seconds.

If pod density ever requires more than ~500 reserves/second per node, the correct resolution is to partition the database per pod (one `data/bbs-budget-<podName>.db` file per pod, rather than one shared file). This preserves the SQLite serialization guarantee within each pod while eliminating cross-pod contention.
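As a rough illustration of the per-pod layout, a hypothetical `budgetDbPathForPod` helper could derive the file name. The character-set restriction is an assumption of this sketch, not something this ADR specifies.

```typescript
// Hypothetical helper: maps a pod name to its own budget database file,
// following the data/bbs-budget-<podName>.db layout described above.
function budgetDbPathForPod(podName: string, dataDir = 'data'): string {
  // Restrict names to a conservative character set (an assumption of this
  // sketch) so a pod name can never escape the data directory.
  if (!/^[A-Za-z0-9_-]+$/.test(podName)) {
    throw new Error(`invalid pod name: ${podName}`);
  }
  return `${dataDir}/bbs-budget-${podName}.db`;
}
```

For example, `budgetDbPathForPod('finance')` yields `data/bbs-budget-finance.db`.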

Replacing SQLite with a distributed store (Redis, Postgres, Spanner) is explicitly a non-goal for Phase 1-3. It introduces network latency, operational complexity, and a new failure mode (network partition). The AgentBBS model is single-node-first; distributed budget pooling is Phase 4.


## 12. Migration

### 12.1 Current state

At the time this ADR is written, `BbsRoomBudgetTracker` is specified in ADR-164 §5.1.5 but not yet implemented. There is no existing implementation, atomic or non-atomic, to migrate from.

However, to permit incremental rollout (and to guard against regressions if this design has unforeseen issues), the implementation follows a three-step feature-flag pattern.

### 12.2 Step A: Schema-only migration

Ship a migration that creates the `bbs_budget_rooms` and `bbs_budget_reservations` tables in `data/bbs-budget.db` and is otherwise a no-op. It runs on node startup if the file does not exist. No application logic changes. The (to-be-written) non-atomic tracker remains the active code path.

### 12.3 Step B: Ship behind feature flag

Implement the atomic tracker as `AtomicBbsRoomBudgetTracker`. The tracker factory reads `CLAUDE_FLOW_BBS_ATOMIC_BUDGET`:

```typescript
export function createBudgetTracker(
  db: Database,
  spendReporter: SpendReporter
): BbsRoomBudgetTracker {
  if (process.env.CLAUDE_FLOW_BBS_ATOMIC_BUDGET === '1') {
    return new AtomicBbsRoomBudgetTracker(db, spendReporter);
  }
  // Fallback to the non-atomic implementation while the flag is off.
  return new NaiveBbsRoomBudgetTracker(db, spendReporter);
}
```

The flag defaults to `'0'` (disabled).

### 12.4 Step C: Flip the default

After one week of operation at a production site with `CLAUDE_FLOW_BBS_ATOMIC_BUDGET=1` showing zero budget-exceeded incidents attributable to the tracker, flip the default to `'1'` and deprecate the naive implementation. Remove `NaiveBbsRoomBudgetTracker` in the next minor version.
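One way to make the default flip a one-line change is to resolve the flag against an explicit default. The `isAtomicBudgetEnabled` helper below is a hypothetical sketch, not part of the specified factory.

```typescript
// Hypothetical helper: resolves the flag against a release-specific default.
// Before Step C the default is '0'; after Step C it becomes '1'.
function isAtomicBudgetEnabled(
  env: Record<string, string | undefined>,
  defaultValue: '0' | '1'
): boolean {
  const raw = env['CLAUDE_FLOW_BBS_ATOMIC_BUDGET'] ?? defaultValue;
  return raw === '1';
}
```

With this shape, the factory would call `isAtomicBudgetEnabled(process.env, '0')` during Step B and `isAtomicBudgetEnabled(process.env, '1')` after Step C. An explicit `CLAUDE_FLOW_BBS_ATOMIC_BUDGET=0` still selects the naive tracker until it is removed.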


## 13. Code Anchoring

The following ruflo source files are the primary anchor points for implementing this ADR:

| File | Role |
|------|------|
| `v3/@claude-flow/plugin-agent-federation/src/application/spend-reporter.ts` | Emits `federation_spend` events; `ProductionFederationSpendReporter.reportSpend()` is called before `BEGIN IMMEDIATE` |
| `v3/@claude-flow/plugin-agent-federation/src/domain/value-objects/federation-budget.ts` | Per-call `enforceBudget` primitive; this ADR does NOT modify this file, and the atomic tracker layers above it |
| `v3/@claude-flow/plugin-agent-federation/src/application/federation-breaker-service.ts` | Consumes `federation_spend` events from the cost-tracker; an overrun audit entry can trip the room's daily cap |
| `v3/@claude-flow/cli/src/mcp-tools/agentdb-tools.ts` | Demonstrates the project-standard better-sqlite3 setup: WAL mode, `pragma synchronous`, `prepare().run()` pattern |
| `plugins/ruflo-bbs-federation/src/bbs-room-budget-tracker.ts` | New file: the `AtomicBbsRoomBudgetTracker` class implementing this ADR's design |

## 14. References

- ADR-164: AgentBBS Federated Business-Management Autopilot (parent ADR; §5.1.5 specifies the `BbsRoomBudgetTracker` interface this ADR replaces with an atomic version)
- ADR-097: Federation Budget Circuit Breaker (§Phase 2: `FederationBreakerService`; §Phase 3: `spend-reporter.ts` consumer contract; §audit-log: federation-spend namespace layout)
- ADR-145: Plugin Supply-Chain Integrity and Memory Namespace Governance (audit log namespace retention, namespace ownership rules)
- better-sqlite3 documentation: `BEGIN IMMEDIATE` semantics, WAL mode configuration, `busy_timeout` pragma, synchronous-NORMAL safety guarantee