v3/docs/adr/ADR-169-benchmark-reporting-integrity-standard.md
ID: ADR-169
Status: Accepted. The FRAMES ablation (metaharness, n=50, seed 42, 2026-06-28) already complies; this ADR makes the discipline binding for every benchmark number ruflo publishes.
Date: 2026-07-03
Authors: rUv (drafted with Claude Code)
Related ADRs:
runs/summary.json, scored by score-gaia.mjs) — every verdict below
was measured from committed artifacts, not asserted.

ADR-167 and ADR-168 cover earning integrity: was the score obtained by solving tasks, and is the evidence recorded? This ADR covers the third leg, reporting integrity: given honestly earned numbers, are they presented in a way that survives the Berkeley RDI lens?
RDI's April 2026 study broke eight major benchmarks not only through harness exploits but through reporting artifacts that inflate scores without any cheating in the run itself: relaxed/substring metrics presented as accuracy (normalization collisions), undisclosed best-of-N presented as single-attempt scores, no-work passes hidden in aggregates, and unreproducible cherry-picked runs.
The metaharness FRAMES ablation self-audit demonstrated that these vectors are cheap to close by construction — and that the same discipline was already implicitly present in the ablation's artifacts. What is missing is a binding standard so future numbers (GAIA, FRAMES, terminal-bench, whatever comes next) can't regress.
Every benchmark number ruflo publishes — README, release notes, leaderboard
submission, gist, blog — MUST satisfy five rules. summary.json-style scored
artifacts are the enforcement point.
The reported number is strict (gaia-style normalized) exact-match, tagged
view: "primary" in the artifact. Relaxed metrics (gold-tokens ⊆ prediction
or any substring-containment variant) MAY be computed as diagnostics but are
never the reported number, and if shown must carry the literal label
"relaxed (substring-contained) — diagnostic, not the score." Rationale:
substring containment is the normalization-collision vector RDI used to
inflate GAIA; score-gaia.mjs computes acc_relaxed for diagnosis and the
FRAMES audit confirms it never leaks into a headline.
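
The difference between the strict primary metric and the relaxed diagnostic can be sketched as follows. This is an illustration only, not the logic of score-gaia.mjs: `normalize` is an assumed stand-in for GAIA-style normalization, and `strictMatch`, `relaxedMatch`, `scoreRun`, and the `diagnostics` field are hypothetical names.

```javascript
// Assumed stand-in for GAIA-style normalization; the real scorer's rules may differ.
function normalize(text) {
  return String(text)
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // replace punctuation with spaces
    .split(/\s+/)
    .filter((tok) => tok && !["a", "an", "the"].includes(tok))
    .join(" ");
}

// Primary metric: the normalized strings must be identical.
function strictMatch(prediction, gold) {
  return normalize(prediction) === normalize(gold);
}

// Diagnostic only: every gold token appears somewhere in the prediction.
function relaxedMatch(prediction, gold) {
  const predTokens = new Set(normalize(prediction).split(" "));
  const goldTokens = normalize(gold).split(" ").filter(Boolean);
  return goldTokens.length > 0 && goldTokens.every((tok) => predTokens.has(tok));
}

// The headline number is always the strict one; the relaxed value is kept apart.
function scoreRun(records) {
  const n = records.length;
  const strict = records.filter((r) => strictMatch(r.prediction, r.gold)).length;
  const relaxed = records.filter((r) => relaxedMatch(r.prediction, r.gold)).length;
  return {
    view: "primary",
    n,
    acc: n ? strict / n : 0,
    diagnostics: { acc_relaxed: n ? relaxed / n : 0 },
  };
}
```

A verbose prediction such as "The answer is Paris, France" against gold "Paris" passes `relaxedMatch` but fails `strictMatch`, which is exactly the gap a headline must not paper over.
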
Any best-of-N, self-consistency, majority-vote, or verifier-reranked arm
carries an explicit view label (majority, verifier-bon, ps-bon,
sc-curve, …) in the artifact, and the label travels with the number into
prose. A scaled score quoted without its label is a reporting violation even
when the underlying run was honest. (FRAMES example: deepseek base 0.50 is
quotable as base; 0.56 exists only as view: "majority" and must say so.)
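
One way to keep the label attached when a number moves from artifact to prose is a small quoting helper. The sketch below assumes a hypothetical entry shape `{ arm, view, acc }` and a hypothetical function name `quoteScore`; it also assumes the base arm is the one tagged `view: "primary"`.

```javascript
// Hypothetical helper: renders a score for prose and refuses to drop its view label.
function quoteScore(entry) {
  if (typeof entry.view !== "string" || entry.view.length === 0) {
    throw new Error(`arm "${entry.arm}" has no view label; refusing to quote it`);
  }
  const value = entry.acc.toFixed(2);
  // Primary scores may be quoted plainly; every other view keeps its label attached.
  return entry.view === "primary"
    ? `${entry.arm}: ${value}`
    : `${entry.arm}: ${value} (view: ${entry.view})`;
}

// Usage with the FRAMES figures quoted above.
console.log(quoteScore({ arm: "deepseek base", view: "primary", acc: 0.5 }));
console.log(quoteScore({ arm: "deepseek", view: "majority", acc: 0.56 }));
```
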
Artifacts report mean_steps and empty_rate per arm, and every correct answer
must have a step count greater than zero. A correct-with-zero-work record
anywhere in the run fails the artifact (this is AUD-2's reporting-side mirror).
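
A minimal sketch of the per-arm work statistics, under stated assumptions: the record shape `{ correct, steps, prediction }` and the function name `workStats` are hypothetical, and `empty_rate` is taken here to mean the share of records with an empty prediction, which the scorer may define differently.

```javascript
// Assumed record shape: { correct: boolean, steps: number, prediction: string }.
function workStats(records) {
  if (records.length === 0) throw new Error("no records to summarize");

  // Fail closed: a correct answer with no recorded work invalidates the whole arm.
  const zeroWorkCorrect = records.filter((r) => r.correct && !(r.steps > 0));
  if (zeroWorkCorrect.length > 0) {
    throw new Error(`${zeroWorkCorrect.length} correct record(s) with zero steps`);
  }

  const totalSteps = records.reduce((sum, r) => sum + r.steps, 0);
  const empty = records.filter((r) => String(r.prediction ?? "").trim() === "").length;
  return {
    mean_steps: totalSteps / records.length,
    empty_rate: empty / records.length,
  };
}
```
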
The artifact header records the seed, n, and Wilson confidence intervals, and names the dataset revision and split. A number without its reproducibility block is not publishable.
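
The Wilson score interval is simple enough to compute inline. The sketch below uses the standard formula with z = 1.96 for a 95% interval; the function names and the shape of the reproducibility block (`dataset.revision`, `dataset.split`, `ci.method`) are assumptions for illustration, not the fields score-gaia.mjs is known to emit.

```javascript
// Wilson score interval for a binomial proportion (z = 1.96 gives about 95% coverage).
function wilsonInterval(successes, n, z = 1.96) {
  if (!Number.isInteger(n) || n <= 0) throw new Error("n must be a positive integer");
  if (successes < 0 || successes > n) throw new Error("successes must lie in [0, n]");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

// Hypothetical reproducibility block; refuses to build when a required field is missing.
function reproBlock({ seed, n, correct, dataset }) {
  if (seed === undefined || seed === null) throw new Error("seed is required");
  if (!dataset || !dataset.revision || !dataset.split) {
    throw new Error("dataset revision and split are required");
  }
  const { lower, upper } = wilsonInterval(correct, n);
  return {
    seed,
    n,
    dataset,
    acc: correct / n,
    ci: { method: "wilson", level: 0.95, lower, upper },
  };
}
```

At n=50 the interval is wide: 25 correct out of 50 gives roughly 0.37 to 0.63, which is why the interval belongs next to the point estimate.
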
Where the benchmark is retrieval-grounded (FRAMES: Wikipedia; GAIA: open web), retrieving the answer text can be legitimate. The integrity question is whether the artifact can distinguish reasoning-over-retrieval from verbatim surfacing. Until the ADR-168 evidence contract (serialized tool outputs, secret-redacted, size-bounded) reaches the benchmark's harness, its reports MUST carry the honest gap statement ("answer-leakage: not provable from the artifact") rather than an implied clean bill. Turning that ⚠️ into ✅ happens by recording evidence (ADR-168), never by softening the statement.
score-gaia.mjs (and successor scorers) keep emitting
view, mean_steps, empty_rate, seed/n/CI — these fields are the
machine-checkable surface of R1–R4.

gaia-audit.mjs
(ADR-167's registry) validates a scored artifact against R1–R4 before
/gaia submit packages it: headline view is primary; every non-primary
view labeled; no zero-step corrects; reproducibility block present.
Fail-closed, same posture as the existing checks.
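
The four checks listed above could take roughly the following shape. This is a sketch under assumptions, not the code of gaia-audit.mjs: the artifact layout (`headline_view`, `arms[].records`, `header.ci`, `header.dataset`) and the function name `auditArtifact` are hypothetical. Anything missing or malformed counts as a failure, which is the fail-closed posture.

```javascript
// Hypothetical audit over an assumed artifact layout; missing data always fails.
function auditArtifact(artifact) {
  const failures = [];
  const arms = Array.isArray(artifact?.arms) ? artifact.arms : [];
  if (arms.length === 0) failures.push("no arms found");

  // R1: the headline view must be the strict primary metric.
  if (artifact?.headline_view !== "primary") {
    failures.push("R1: headline view is not primary");
  }

  for (const arm of arms) {
    const name = arm.name ?? "(unnamed)";
    // R2: every arm carries an explicit view label.
    if (typeof arm.view !== "string" || arm.view === "") {
      failures.push(`R2: arm ${name} has no view label`);
    }
    // R3: work statistics are present and no correct record has zero steps.
    if (!Number.isFinite(arm.mean_steps) || !Number.isFinite(arm.empty_rate)) {
      failures.push(`R3: arm ${name} lacks mean_steps or empty_rate`);
    }
    const records = Array.isArray(arm.records) ? arm.records : [];
    if (records.some((r) => r.correct && !(r.steps > 0))) {
      failures.push(`R3: arm ${name} has a correct record with zero steps`);
    }
  }

  // R4: the reproducibility block must be complete.
  const header = artifact?.header ?? {};
  for (const key of ["seed", "n", "ci", "dataset"]) {
    if (header[key] === undefined || header[key] === null) {
      failures.push(`R4: header is missing ${key}`);
    }
  }

  return { ok: failures.length === 0, failures };
}
```

A packaging step would then refuse to proceed unless `auditArtifact(artifact).ok` is true, and would surface `failures` to the author.
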