v3/docs/adr/ADR-097-federation-budget-circuit-breaker.md
Status: Proposed
Date: 2026-05-04
Version: target v3.6.x
Supersedes: nothing
Related: ADR-086 (Agent Federation), ADR-095 (Architectural gaps — G2 includes federation transport), issues #1723 and #1724 (#1724 closed as duplicate of #1723), commit 6f495369 (G2 Ed25519 signing in federation)
Agent Federation (ADR-086) lets agents on one Ruflo node delegate tasks to peer nodes across trust boundaries. The implementation in @claude-flow/plugin-agent-federation already covers identity (Ed25519 keypairs), trust scoring, and federation_send for cross-node delegation. What it does not cover today:
- The ruflo-cost-tracker plugin tracks local agent token usage, but federation traffic is invisible to it.

The original issue (#1723, #1724 dup) frames this as a stability + enterprise-readiness gap. The fix is a federated circuit breaker layered on top of the existing federation protocol: no breaking changes to the wire format, just additive metadata.
| Component | Path | Today |
|---|---|---|
| Federation MCP tool | v3/@claude-flow/plugin-agent-federation/src/mcp-tools.ts (federation_send) | Sends a task to a peer; no cost / hop awareness |
| Federation node entity | v3/@claude-flow/plugin-agent-federation/src/domain/entities/federation-node.ts | trustScore, trustLevel, lastSeen; no state: SUSPENDED |
| Cost tracker plugin | plugins/ruflo-cost-tracker/ | Tracks local model spend per agent / per session |
| Behavioral trust | ADR-086 §"Trust scoring" | Adjusts on protocol misbehavior; no cost-based decay yet |
Ship four cohesive but independently-shippable parts. Each part is one iteration; the parts compose into the full circuit breaker.
Extend the federation_send MCP tool input schema with three optional metadata fields:
```ts
{
  // existing
  peerId: string,
  taskId: string,
  payload: unknown,
  // new (all optional, defaults preserve current behavior)
  budget?: {
    maxTokens?: number, // hard cap on Σ tokens across the whole hop chain
    maxUsd?: number,    // hard cap on Σ USD spend; enforced via cost-tracker
  },
  maxHops?: number,     // default 8; 0 = no remote delegation allowed
}
```
These travel inside the federation envelope as a budget block alongside the payload. The receiving peer:
- Decrements maxHops on receive. If it drops below 0, the peer returns an error response (HOP_LIMIT_EXCEEDED) without invoking any agent. The originator gets the failure synchronously.
- Tracks tokensUsed and usdSpent; the running total is checked against the cap before each subsequent action. An overshoot returns BUDGET_EXCEEDED and refuses further work.

Budget defaults: when omitted, treat maxTokens and maxUsd as Infinity and maxHops as 8. The default hop limit alone closes the recursion-loop class without any caller change.
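The receive-side checks above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the names `Envelope`, `ResolvedLimits`, `resolveLimits`, `admit`, and `BudgetMeter` are hypothetical, and only the field names and the two error codes come from this ADR. It also assumes an overshoot means "strictly greater than the cap".

```typescript
// Hypothetical shapes; only budget, maxTokens, maxUsd, maxHops and the two
// error codes are taken from this ADR.
interface Budget { maxTokens?: number; maxUsd?: number }
interface Envelope {
  peerId: string;
  taskId: string;
  payload: unknown;
  budget?: Budget;
  maxHops?: number;
}
interface ResolvedLimits { maxTokens: number; maxUsd: number; maxHops: number }
type Rejection = "HOP_LIMIT_EXCEEDED" | "BUDGET_EXCEEDED";
type Admission =
  | { ok: true; limits: ResolvedLimits }
  | { ok: false; error: Rejection };

const DEFAULT_MAX_HOPS = 8;

// Apply the documented defaults: Infinity for both caps, 8 for maxHops.
function resolveLimits(env: Envelope): ResolvedLimits {
  return {
    maxTokens: env.budget?.maxTokens ?? Infinity,
    maxUsd: env.budget?.maxUsd ?? Infinity,
    maxHops: env.maxHops ?? DEFAULT_MAX_HOPS,
  };
}

// Decrement the hop counter on receive; a result below 0 rejects the task
// before any agent is invoked.
function admit(env: Envelope): Admission {
  const limits = resolveLimits(env);
  const remaining = limits.maxHops - 1;
  if (remaining < 0) return { ok: false, error: "HOP_LIMIT_EXCEEDED" };
  return { ok: true, limits: { ...limits, maxHops: remaining } };
}

// Running totals for one delegated task, checked before each further action.
class BudgetMeter {
  private tokensUsed = 0;
  private usdSpent = 0;
  private readonly limits: ResolvedLimits;

  constructor(limits: ResolvedLimits) {
    this.limits = limits;
  }

  record(tokens: number, usd: number): void {
    this.tokensUsed += tokens;
    this.usdSpent += usd;
  }

  // Returns BUDGET_EXCEEDED once either running total is over its cap.
  check(): Rejection | null {
    const over =
      this.tokensUsed > this.limits.maxTokens ||
      this.usdSpent > this.limits.maxUsd;
    return over ? "BUDGET_EXCEEDED" : null;
  }
}
```

With these semantics a caller passing `maxHops: 0` is rejected at the first peer, and the default of 8 admits eight hops before the ninth is refused.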
Extend federation-node.ts with a state field driven by the breaker:
| State | When transitioned | What it means |
|---|---|---|
| ACTIVE | default; healthy | federation_send accepts deliveries to this peer |
| SUSPENDED | breaker tripped (cost threshold or repeated failures) | sends to this peer return PEER_SUSPENDED immediately; receives still accepted but ignored for trust accumulation |
| EVICTED | manual or post-grace-period escalation from SUSPENDED | peer removed from registry; new delivery errors are emitted as PEER_EVICTED |
Transition triggers:
- ACTIVE → SUSPENDED when either:
  - the peer's trailing-24h spend exceeds the cost threshold (peer.costSuspensionUsd, default $5.00), or
  - repeated failures trip the breaker
- SUSPENDED → ACTIVE after a configurable cooldown (default 30 min) AND a successful health probe
- SUSPENDED → EVICTED after 24h continuous suspension OR an explicit federation_evict MCP call

Cooldown plus auto-recovery prevents the breaker from being a one-way door; this is the same shape as a typical Hystrix-style breaker.
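The transition rules above can be expressed as a pure function over the peer's current state and a few observed signals. This is a sketch under stated assumptions: `PeerBreaker`, `Signals`, `BreakerConfig`, and `nextState` are hypothetical names, timestamps are assumed to be epoch milliseconds, and the repeated-failure trigger is passed in as a precomputed boolean because this ADR does not state a failure threshold.

```typescript
type PeerState = "ACTIVE" | "SUSPENDED" | "EVICTED";

// Hypothetical config shape; the default values are the ones named in this ADR.
interface BreakerConfig {
  costSuspensionUsd: number; // trailing-24h spend threshold
  cooldownMs: number;        // minimum time in SUSPENDED before recovery
  evictAfterMs: number;      // continuous suspension before eviction
}

const DEFAULT_BREAKER_CONFIG: BreakerConfig = {
  costSuspensionUsd: 5.0,
  cooldownMs: 30 * 60 * 1000,
  evictAfterMs: 24 * 60 * 60 * 1000,
};

interface PeerBreaker { state: PeerState; suspendedAt?: number }

interface Signals {
  now: number;              // assumed epoch milliseconds
  trailing24hUsd: number;   // pulled from the cost tracker
  failuresTripped: boolean; // assumption: computed elsewhere
  probeOk: boolean;         // result of the latest health probe
  evictRequested?: boolean; // explicit federation_evict call
}

function nextState(
  peer: PeerBreaker,
  s: Signals,
  cfg: BreakerConfig = DEFAULT_BREAKER_CONFIG,
): PeerBreaker {
  switch (peer.state) {
    case "ACTIVE":
      if (s.trailing24hUsd > cfg.costSuspensionUsd || s.failuresTripped) {
        return { state: "SUSPENDED", suspendedAt: s.now };
      }
      return peer;
    case "SUSPENDED": {
      const elapsed = s.now - (peer.suspendedAt ?? s.now);
      // Eviction is checked first so a 24h-old suspension cannot recover.
      if (s.evictRequested || elapsed >= cfg.evictAfterMs) {
        return { state: "EVICTED" };
      }
      if (elapsed >= cfg.cooldownMs && s.probeOk) {
        return { state: "ACTIVE" };
      }
      return peer;
    }
    case "EVICTED":
      return peer; // terminal in this sketch
  }
}
```

Keeping the function pure makes each row of the transition list a one-line unit test. One consequence worth noting: a peer whose trailing-24h spend is still above the threshold when it recovers would be suspended again on the next evaluation.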
Wire the federation layer into ruflo-cost-tracker so federated spend appears in the same dashboards as local spend:
- A federation_spend event is published to the cost-tracker bus on every federation_send completion: { peerId, taskId, tokensUsed, usdSpent, ts }.
- Federated spend, grouped by peer, is surfaced in the cost-report skill.

This is one direction (federation → cost-tracker). Cost-tracker doesn't need to mutate federation state directly; the breaker pulls.
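A minimal sketch of the per-peer rolling aggregation the breaker would pull from. The event fields are the ones listed above; `PeerSpendLedger` and its methods are hypothetical, `ts` is assumed to be epoch milliseconds, and the in-memory array stands in for whatever storage the cost tracker actually uses.

```typescript
// Event shape from this ADR; ts is assumed to be epoch milliseconds.
interface FederationSpendEvent {
  peerId: string;
  taskId: string;
  tokensUsed: number;
  usdSpent: number;
  ts: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical in-memory ledger fed by federation_spend events.
class PeerSpendLedger {
  private events: FederationSpendEvent[] = [];

  record(event: FederationSpendEvent): void {
    this.events.push(event);
  }

  // USD totals per peer over the window (now - 24h, now].
  usdByPeer(now: number): Map<string, number> {
    const totals = new Map<string, number>();
    for (const e of this.events) {
      if (e.ts <= now - DAY_MS || e.ts > now) continue;
      totals.set(e.peerId, (totals.get(e.peerId) ?? 0) + e.usdSpent);
    }
    return totals;
  }

  // The value the breaker pulls when deciding whether to suspend a peer.
  trailing24hUsd(peerId: string, now: number): number {
    return this.usdByPeer(now).get(peerId) ?? 0;
  }
}
```

The breaker only ever reads from the ledger, which matches the one-directional flow described above: events go in from federation, and nothing here mutates federation state.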
- ruflo doctor reports the current state of every known peer (ACTIVE/SUSPENDED/EVICTED) and the trailing-24h cost. A peer pinned in SUSPENDED for >1h shows up as a yellow warning.
- federation_breaker_status returns the same info programmatically for swarm coordinators that want to route around suspended peers.
- Each state transition is logged as {prevState, newState, reason, peerId} so post-incident triage doesn't need a debugger.

Per the user directive, the implementation team owns this section. Each phase is one iteration to keep blast radius bounded:
| Phase | Scope | Lands in |
|---|---|---|
| P1 | Budget envelope + hop counter, no peer state changes | federation plugin + new tests |
| P2 | Peer state machine (ACTIVE/SUSPENDED/EVICTED) + transition rules | federation-node.ts + tests |
| P3 | Cost-tracker bus event + per-peer rolling aggregation | cost-tracker plugin |
| P4 | Doctor + federation_breaker_status MCP tool | doctor command + mcp-tools |
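For P4, the doctor warning rule and the "route around suspended peers" use case could look roughly like this. The `PeerStatus` shape and both function names are hypothetical; this ADR only fixes the three states, the trailing-24h cost, and the >1h warning rule. Timestamps are assumed to be epoch milliseconds.

```typescript
type PeerState = "ACTIVE" | "SUSPENDED" | "EVICTED";

// Hypothetical per-peer record, as federation_breaker_status might return it.
interface PeerStatus {
  peerId: string;
  state: PeerState;
  suspendedAt?: number; // assumed epoch milliseconds
  trailing24hUsd: number;
}

const HOUR_MS = 60 * 60 * 1000;

// Doctor rule from this ADR: a peer held in SUSPENDED for more than 1h warns.
function doctorSeverity(peer: PeerStatus, now: number): "ok" | "warn" {
  const pinned =
    peer.state === "SUSPENDED" &&
    peer.suspendedAt !== undefined &&
    now - peer.suspendedAt > HOUR_MS;
  return pinned ? "warn" : "ok";
}

// A coordinator that wants to avoid suspended or evicted peers keeps ACTIVE ones.
function routablePeers(statuses: PeerStatus[]): string[] {
  return statuses.filter((p) => p.state === "ACTIVE").map((p) => p.peerId);
}
```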
Test surface:

- BUDGET_EXCEEDED on the originator AND a transition to SUSPENDED after the threshold.
- maxHops ≤ 8.

| Decision | Alternative | Why this |
|---|---|---|
| Optional budget fields, default Infinity | Required budget on every federation_send | Backward compatible: existing code paths keep working. Callers opt in to limits. |
| Hop counter default 8 | Default unlimited | 8 is generous for legitimate multi-hop flows (typical ≤3) and cheap insurance against the recursion-loop class. |
| Peer state in federation plugin (not cost-tracker) | Centralized in cost-tracker | Federation owns peer identity + trust; the breaker is a federation concern. Cost-tracker provides numbers; federation decides actions. |
| Trailing-24h $5 default suspension threshold | Per-peer config only | Sane default avoids "gun, foot" for new installs. Override via plugin config. |
| EVICTED separate from SUSPENDED | Single INACTIVE state | Eviction is operationally permanent (admin removes the peer); suspension is a soft, auto-recovering state. They have different audit/log/recovery semantics. |
| Pull from cost-tracker for breaker decisions | Push spend events into federation | Federation already speaks events to cost-tracker; reverse direction would couple in both ways. Pull-on-decision is simpler and just-in-time. |
- Other federation fields (e.g. lastSeen) already rely on local clocks; same threat model.
- If many peers transition SUSPENDED → ACTIVE simultaneously after cooldown, they could all retry pending sends at once. Mitigation: jitter the cooldown by ±10% per peer.

The full implementation is done when:
- federation_send accepts and propagates budget + maxHops without breaking any existing test
- A delegation chain stops at maxHops with HOP_LIMIT_EXCEEDED on the originator
- The breaker moves a tripped peer to SUSPENDED, refuses sends for the cooldown window, and auto-recovers on a successful probe
- ruflo doctor shows peer states + trailing-24h cost
- federation_breaker_status MCP tool returns structured state per peer
- cost-report shows federated spend grouped by peer
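The ±10% cooldown jitter listed as a mitigation above could be sketched as follows. `jitteredCooldownMs` is a hypothetical helper, and the random source is injectable only so the behavior can be tested deterministically.

```typescript
// Scale the base cooldown by a factor drawn uniformly from [0.9, 1.1),
// so peers suspended at the same moment do not all recover together.
function jitteredCooldownMs(
  baseMs: number,
  rand: () => number = Math.random,
): number {
  const factor = 0.9 + rand() * 0.2;
  return Math.round(baseMs * factor);
}
```

With the 30-minute default, each peer's cooldown would land somewhere between 27 and 33 minutes.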