docs/darwin/PLAN.md
Branch: darwin/capability-evolution-2026-06-26
Started: 2026-06-26
Drive ruflo capabilities toward SOTA across the dimensions we already
benchmark, using a /loop 5m autonomous loop. Each tick spawns one
claude -p (headless, Read/Edit/Bash only, --max-budget-usd capped) to
do a single optimization cycle, so this conversation stays focused on
orchestration and the per-tick spend is bounded.
A single tick = one claude -p invocation that does end-to-end:
- Read docs/darwin/log.jsonl (last N entries) and find the current champion scores per dimension.
- node scripts/bench-beir.mjs <dataset> --top-k 10
- ADR → node plugins/ruflo-adr/scripts/import.mjs --dry-run
- OIA → npx ruflo metaharness oia-audit --format json
- Append to docs/darwin/log.jsonl with:
  { iter, ts, dimension, change, deltaScore, action, commit }

Use claude -p --max-budget-usd 0.50 --model haiku for routine ticks.
Escalate to sonnet only when haiku reports "task too complex" 3x in
a row.
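
The log bookkeeping and the escalation rule above can be sketched in a few pure functions. This is a minimal illustration, not the project's implementation: the seven field names come from the plan, but the helper names (`lastEntries`, `formatEntry`, `pickModel`), the use of `null` for missing fields, and the exact report string are assumptions. File I/O is left to the caller.

```javascript
// Sketch only: field names come from the plan; everything else is assumed.
const LOG_FIELDS = ["iter", "ts", "dimension", "change", "deltaScore", "action", "commit"];

// Hypothetical marker: the exact string haiku emits is an assumption.
const TOO_COMPLEX = "task too complex";

// Parse JSONL text and return the last `n` entries (blank lines are skipped).
function lastEntries(jsonlText, n) {
  const entries = jsonlText
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
  // slice(-0) would return everything, so guard non-positive n explicitly.
  return n > 0 ? entries.slice(-n) : [];
}

// Serialize one tick result as a single JSONL line, keeping only the planned
// fields. Missing fields are written as null (an assumption, not a spec).
function formatEntry(result) {
  const entry = {};
  for (const field of LOG_FIELDS) {
    entry[field] = result[field] ?? null;
  }
  return JSON.stringify(entry) + "\n";
}

// Escalate to sonnet only when the most recent `threshold` reports are all
// TOO_COMPLEX; otherwise stay on haiku.
function pickModel(recentReports, threshold = 3) {
  const tail = recentReports.slice(-threshold);
  const escalate =
    tail.length === threshold && tail.every((report) => report === TOO_COMPLEX);
  return escalate ? "sonnet" : "haiku";
}
```

A caller would append the string returned by `formatEntry` to docs/darwin/log.jsonl after each tick and pass the most recent per-tick reports to `pickModel` before spawning the next one. Because `pickModel` only looks at the trailing reports, a single successful haiku tick resets the streak.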
A dimension is "proven SOTA" when: