v3/docs/adr/ADR-163-multi-agent-performance-benchmarking-suite.md
--backend ruflo --confirm)scripts/benchmark-multiagent.mjs — two backends (mock for CI smoke at $0; ruflo for publishable numbers gated behind --confirm)docs/benchmarks/multi-agent/multiagent-mock-*.json — 500 mock runs, seed 42, overall pass-rate 72.2%. MOCK numbers, not publishable — Bernoulli over hand-picked per-task pass rates. Use this run to verify the pipeline, not to claim a result.As of June 2026, all major competing frameworks publish a task-completion-rate benchmark:
| Framework | Task Completion | Cost/Task | Source |
|---|---|---|---|
| LangGraph | 62% | $0.08 | Independent 2026 benchmark, 2,000 runs, Grade B |
| AutoGen | 58% | ~$0.10 est | Same source |
| CrewAI | 54% | ~$0.12 est | Same source |
| Ruflo | Not published | Not published | — |
Ruflo's CLAUDE.md documents performance targets (<100ms MCP, <500ms CLI startup) and internal micro-benchmarks (HNSW speedup, SONA adaptation time), but publishes no end-to-end multi-agent task completion rate, cost-per-task, or throughput-per-dollar figure comparable to what competitors report. This creates a marketing credibility gap and blocks data-driven tuning of the 3-tier routing thresholds.
Two 2026 papers further motivate action:
Implement a reproducible multi-agent performance benchmark suite in scripts/benchmark-multiagent.mjs (mirroring the existing scripts/benchmark-intelligence.mjs pattern), and publish results in CLAUDE.md under a new "Multi-Agent Benchmarks" table.
5-task corpus (same topology as the LangGraph/AutoGen/CrewAI 2026 independent benchmark):
| Task | Type | Success criterion |
|---|---|---|
| T1: Code generation | Single-agent Tier-2 | Correct output, ≤2 retries |
| T2: Multi-file refactor | Hierarchical swarm (3 agents) | All target files modified, tests pass |
| T3: Research synthesis | Mesh swarm (4 agents) | ≥5 cited sources, coherent output |
| T4: Security audit | Specialized swarm (reviewer+auditor) | ≥3 findings categorized |
| T5: End-to-end feature | Full pipeline (architect→coder→tester→reviewer) | Feature works + tests green |
Metrics per run:
Run configuration:
scripts/benchmark-intelligence.mjs output patternTarget: ≥65% overall task completion rate (beating LangGraph's 62%).
In a follow-up PR, explore replacing the fixed round-robin task assignment in swarm_init with a lightweight 3-iteration unfolded ADMM solver for workload distribution across agents. No production change without benchmark evidence.
Positive:
Negative:
Neutral: