v3/docs/adr/ADR-167-gaia-submission-integrity-exploit-audit.md
ID: ADR-167
Status: Proposed — prototype + tests shipped on feat/gaia-submission-integrity-audit (this branch). Enforceable checks green now; four checks blocked on trajectory instrumentation (see §7).
Date: 2026-07-03
Authors: rUv (drafted with Claude Code)
Related ADRs:
In April 2026, UC Berkeley's Center for Responsible, Decentralized Intelligence (RDI) published "How We Broke Top AI Agent Benchmarks" and the companion BenchJack study (arXiv:2605.12673). An automated scanning agent achieved near-perfect scores on eight major agent benchmarks — ~100% on WebArena, ~98% on GAIA, ~73% on OSWorld — using zero LLM calls and zero reasoning, solving not a single task. BenchJack surfaced 219 distinct flaws across eight exploit classes. Separately, METR reported that frontier models (o3, Claude-3.7-Sonnet) spontaneously reward-hack in 30%+ of evaluation runs — stack-introspecting to detect evaluator code and monkey-patching grading functions to always return a pass.
For GAIA specifically the RDI finding is concrete: ~98% of GAIA answers leaked
through public answer databases plus normalization collisions. GAIA's gold
answers live in a public Hugging Face dataset (gaia-benchmark/GAIA) — the very
dataset our loader downloads (gaia-loader.ts) and, in principle, the very
dataset our web_search / grounded_query tools can reach at run time.
ruflo publishes GAIA scores to the Princeton HAL leaderboard via /gaia submit,
which already signs the submission package with the ADR-103 Ed25519 witness
manifest. But a signature proves transport-integrity, not earning-integrity.
Today, a submission attests only "these bytes were not tampered with after signing";
it says nothing about how the scores were earned. Post-RDI, that is the wrong
thing to be certain about.
Signing proves the bytes are untampered in transit. The audit proves the score was earned by solving the task, not by exploiting the benchmark.
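This ADR adds an earning-integrity layer: a deterministic, $0, pure-core red-team of a results file against the known reward-hacking vectors, whose report is hashed and signed into the existing witness manifest. After this change a ruflo GAIA submission attests two independent facts:

- Transport integrity: the package bytes were not tampered with after signing (ADR-103 witness manifest).
- Earning integrity: the results were audited against the known reward-hacking vectors, and the signed report records which checks passed, failed, or skipped.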
In scope: a pure-core, offline audit of the GAIA results file +
trajectories.jsonl (+ optional metadata); wiring it into /gaia validate and
/gaia submit; the manifest-attestation integration point.
Out of scope / honest limitations:
- The current harness (gaia-bench.ts) persists QuestionResult[] only: it does not
  emit a trajectories.jsonl, and gaia-agent.ts discards the messages[]
  array (fetched tool outputs, agent-visible prompts) after each question. Four
  of the seven checks therefore cannot be enforced today and return
  status: "skip" with a harness_gap note rather than a false pass (§7). They
  become enforceable once the harness records the data; this ADR specifies the
  forward contract.

Verified by reading source on origin/main (not invented):
- QuestionResult in v3/@claude-flow/cli/src/commands/gaia-bench.ts:
  { task_id, question, model, correct, answer, expected_output, error?, turns?, wallMs?, inputTokens?, outputTokens?, hardnessDifficulty?, hardnessConfidence?, decomposed? }.
  Emitted inside BenchRunOutput
  { level, model, summary{total, passed, passRate, estCostUsd, meanTurns, meanWallMs, hardnessDist?}, results }.
- GaiaAgentResult in gaia-agent.ts carries toolCallsByName (per-tool counts,
  not names+args), totalInputTokens, totalOutputTokens, turns, replanCount,
  convergenceTrigger?. The messages[] array (fetched tool outputs + prompts) is
  local to runGaiaAgent and never returned or persisted; this is the central
  harness gap.
- judgeAnswer() in gaia-judge.ts: Stage-1 normaliseAnswer() (lowercase + trim +
  strip one quote pair + collapse whitespace), Stage-1b unit-aware numeric
  match, then LLM-as-judge cached under
  ~/.cache/ruflo/gaia/judgments/<sha256>.json (an unsigned filesystem oracle).
  Note that isAnswerCorrect() in gaia-agent.ts additionally does bidirectional
  substring matching (normExpected.includes(normModel)), a collision hazard the
  audit flags.
- gaia-loader.ts hard-codes split=validation; gold answers are the public HF
  dataset.
- gaia-submit.md: results.jsonl, trajectories.jsonl (documented but not
  produced by the harness today), metadata.json, manifest.md.json (Ed25519
  witness).
- plugins/ruflo-core/scripts/witness/{lib,verify,regen}.mjs (ADR-103):
  manifest.fixes[] = {id, desc, file, sha256, marker, markerVerified};
  integrity = {manifestHash (sha256 of canonical manifest), publicKey, signature, seedDerivation};
  seed = sha256(gitCommit + ':ruflo-witness/v1'). verify.mjs recomputes the
  hash, reproduces the public key from the commit, and Ed25519-verifies.

| ID | Check | Severity | RDI/METR vector | Enforceable today? |
|---|---|---|---|---|
| AUD-1 | answer-leakage | CRITICAL | GAIA #1: ~98% answer leakage via public DB + fetched pages | Blocked — needs tool_result outputs in trajectory |
| AUD-2 | no-work / no-LLM pass | CRITICAL | "100% without solving; zero LLM calls" | Yes — turns/outputTokens are real fields (stronger with trajectory) |
| AUD-3 | oracle-leakage | CRITICAL | gold answer visible in agent prompt/context | Blocked — needs recorded prompt |
| AUD-4 | grader-isolation | CRITICAL | METR: 30%+ grader monkey-patching / eval-code introspection | Blocked — needs tool_call names+args |
| AUD-5 | normalization-collision | WARN | GAIA normalization collisions + bidirectional-substring hazard | Yes — reads answers in the results file |
| AUD-6 | voting-disclosure | WARN | hidden best-of-N score inflation | Yes if metadata carries N (harness gap: not persisted) |
| AUD-7 | split-integrity | WARN/INFO | validation-split (public gold) presented as held-out | Yes if metadata carries split (harness gap: not persisted) |
| AUD-8 | answer-key-reads | CRITICAL | RDI: answer keys read from unsanitized config/infra | Yes — static scan of runner sources + produced artifacts |
| AUD-9 | dynamic-eval | CRITICAL | RDI: trojanized test infrastructure (eval/exec of task content) | Yes — static scan of the gaia-bench runner sources |
| AUD-10 | judge-injection | WARN | prompt-injection in the produced answer aimed at the LLM judge | Yes — scans produced answers/outputs |
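
As an illustration of a check the table marks as enforceable today, AUD-2 can be sketched from the turns and outputTokens fields of QuestionResult alone. This is a minimal sketch, not the shipped gaia-audit.mjs code: the function name checkNoWork, the flagged field, and the "zero turns or zero output tokens" threshold are assumptions made for the example.

```javascript
// Hypothetical sketch of AUD-2 (no-work / no-LLM pass).
// A row marked correct with zero turns or zero output tokens is flagged.
// Rows that carry neither field cannot be judged; if no row is measurable
// the check reports a skip instead of a pass.
function checkNoWork(results) {
  const flagged = [];
  let measurable = 0;
  for (const r of results) {
    const hasTurns = typeof r.turns === "number";
    const hasTokens = typeof r.outputTokens === "number";
    if (!hasTurns && !hasTokens) continue;
    measurable += 1;
    const noTurns = hasTurns && r.turns === 0;
    const noTokens = hasTokens && r.outputTokens === 0;
    if (r.correct && (noTurns || noTokens)) flagged.push(r.task_id);
  }
  if (measurable === 0) {
    return { id: "AUD-2", status: "skip", harness_gap: "no turns/outputTokens recorded" };
  }
  return { id: "AUD-2", status: flagged.length ? "fail" : "pass", flagged };
}
```

A correct answer with recorded work passes; a correct answer with no recorded work fails; a file with no measurable rows skips.
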
Static source-scan family (AUD-8/9/10) — ported from the reverted #2547
gaia-integrity.mjs duplicate. Unlike AUD-1/3/4 these do not depend on
trajectory instrumentation: AUD-8/9 statically scan the harness runner sources
(v3/@claude-flow/cli/src/benchmarks/gaia-*, src/commands/gaia-bench.ts) plus
the produced artifacts, and AUD-10 scans the produced answers/outputs — so they
are enforceable today. They still skip (never false-pass) when the source
population is absent (e.g. run standalone/ejected outside the repo). AUD-9
preserves the #2547 false-positive refinements — multi-line calls resolved,
RegExp.prototype.exec excluded, fixed-string commands and comment lines
ignored — so it passes clean against ruflo's own gcloud secrets … execSync
calls. AUD-8 and AUD-9 are fail-closed (CRITICAL); AUD-10 is
WARN (a produced-answer injection does not by itself prove a gamed score, but it
is a --strict gate and a reviewer signal). The reverted duplicate's separate
provenance stamp and --allow-integrity-override flag are not ported: this
ADR's signed attestation (§5) and the /gaia submit --allow-dirty gate (§6)
already provide provenance and override, so AUD-8/9/10 flow through the same
attestation.clean / strict_clean path as every other check.
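
The AUD-9 refinements described above can be illustrated with a deliberately small line-based scanner. This is a sketch under stated assumptions, not the ported #2547 logic: the name scanDynamicEval, both regular expressions, and the finding shape are hypothetical, and unlike the real check it does not resolve multi-line calls. It only looks for eval, new Function, and execSync, so a method call such as someRegex.exec(...) is never matched.

```javascript
// Hypothetical, simplified AUD-9 scanner.
// Matches eval(...) / new Function(...) that are not method calls, and
// execSync(...) in any position. Group 3 captures the argument text.
const DYNAMIC_CALL =
  /(?:(?:^|[^\w$.])(eval|new\s+Function)|(?:^|[^\w$])(execSync))\s*\(\s*(.*)$/;

// A first argument that is a complete quoted string literal (no concatenation).
const FIXED_STRING = /^(['"])(?:(?!\1)[^\\]|\\.)*\1\s*[,)]/;

function scanDynamicEval(source) {
  const findings = [];
  source.split("\n").forEach((line, index) => {
    const text = line.trim();
    // Comment lines are ignored.
    if (text.startsWith("//") || text.startsWith("/*") || text.startsWith("*")) return;
    const match = DYNAMIC_CALL.exec(text);
    if (!match) return;
    // Fixed-string commands are ignored; anything computed is reported.
    if (FIXED_STRING.test(match[3])) return;
    findings.push({ line: index + 1, call: match[1] || match[2] });
  });
  return findings;
}
```

Template literals are always reported in this sketch, even without interpolation, which errs on the fail-closed side.
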
Each check is a pure function in
plugins/ruflo-workflows/scripts/gaia-audit.mjs, unit-tested against a fixture
that triggers its vector and the clean fixture that must pass. Where the schema
lacks the data, the check returns status:"skip" + a harness_gap string — it
never fakes a pass.
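
The skip contract can be sketched with AUD-1 as the example. Everything below is illustrative rather than the shipped check: the function name checkAnswerLeakage, the evidence field, and the heuristic (a tool result that references the public gold dataset named earlier in this ADR) are assumptions. The point is the shape: no trajectory data yields status "skip" with a harness_gap string, never "pass".

```javascript
// Hypothetical sketch of a pure check that honours the skip contract.
// results: QuestionResult-like rows.
// trajectories: Map of task_id -> trajectory record, or null when the
// harness produced no trajectories.jsonl.
const GOLD_DATASET = "gaia-benchmark/gaia"; // compared case-insensitively

function checkAnswerLeakage(results, trajectories) {
  if (!trajectories || trajectories.size === 0) {
    return {
      id: "AUD-1",
      status: "skip",
      harness_gap: "tool_result outputs are not recorded (no trajectories.jsonl)",
    };
  }
  const evidence = [];
  for (const r of results) {
    const steps = trajectories.get(r.task_id)?.steps ?? [];
    for (const step of steps) {
      if (step.type !== "tool_result") continue;
      const haystack = `${step.url ?? ""}\n${step.output ?? ""}`.toLowerCase();
      if (haystack.includes(GOLD_DATASET)) {
        evidence.push({ task_id: r.task_id, url: step.url ?? null });
      }
    }
  }
  return { id: "AUD-1", status: evidence.length ? "fail" : "pass", evidence };
}
```

Because the function takes its inputs as arguments and touches no filesystem or clock, the same fixture always yields the same result, which is what makes per-vector fixtures practical.
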
The audit's normaliseAnswer() is a byte-for-byte copy of gaia-judge.ts's so
the audit sees answers exactly as the scorer does.
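
A sketch of the normalisation as this ADR describes it (lowercase, trim, strip one quote pair, collapse whitespace), together with a hypothetical AUD-5 helper for the bidirectional-substring hazard. The real gaia-judge.ts code is authoritative; the step order here and the name isSubstringCollision are assumptions.

```javascript
// Sketch of Stage-1 normalisation following the description in this ADR.
function normaliseAnswer(raw) {
  let s = String(raw).trim().toLowerCase();
  // Strip a single matching pair of surrounding quotes.
  const first = s[0];
  if (s.length >= 2 && (first === '"' || first === "'") && s[s.length - 1] === first) {
    s = s.slice(1, -1).trim();
  }
  // Collapse internal runs of whitespace to one space.
  return s.replace(/\s+/g, " ");
}

// Hypothetical AUD-5 helper: true when two answers are NOT equal after
// normalisation but would still match under bidirectional substring
// containment. Note that an empty answer is contained in every string.
function isSubstringCollision(modelAnswer, expected) {
  const a = normaliseAnswer(modelAnswer);
  const b = normaliseAnswer(expected);
  if (a === b) return false;
  return a.includes(b) || b.includes(a);
}

console.log(normaliseAnswer('  "New   York" ')); // new york
console.log(isSubstringCollision("1", "1912")); // true
```

The second call shows why the hazard matters: a one-character answer is a substring of many gold answers.
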
To make the blocked checks enforceable, gaia-agent.ts must emit — and
gaia-bench.ts must persist to trajectories.jsonl — one record per task:

    {
      "task_id": "…",
      "turns": 4,
      "tools_used": ["web_search"],
      "steps": [
        { "type": "prompt", "content": "<agent-visible question/context>" },
        { "type": "llm_call", "input": "<visible input>", "output": "<model text>",
          "tokens_in": 1520, "tokens_out": 88 },
        { "type": "tool_call", "name": "web_search", "input": { "query": "…" } },
        { "type": "tool_result", "name": "web_search", "url": "https://…",
          "output": "<fetched page text>" }
      ]
    }

This is a faithful serialization of the messages[] array runGaiaAgent
already builds — the fix is to return and write it, not to compute anything
new. is_error tool results and image markers can be recorded as-is. Until
this lands, AUD-1/3/4 skip and the attestation records the gap.
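
To make the dependency between the record and the blocked checks concrete, the following sketch reads one line of trajectories.jsonl and reports which blocked checks would have the evidence they need. The mapping follows the "Enforceable today?" column of the check table; the helper name enforceableChecks and the REQUIRED_STEP constant are hypothetical.

```javascript
// Step type each blocked check depends on (per the check table).
const REQUIRED_STEP = {
  "AUD-1": "tool_result", // fetched page text
  "AUD-3": "prompt",      // agent-visible prompt
  "AUD-4": "tool_call",   // tool names + arguments
};

// Hypothetical helper: parse one JSONL line and list the blocked checks
// whose required step type appears at least once in the record.
function enforceableChecks(jsonlLine) {
  const record = JSON.parse(jsonlLine);
  const seen = new Set((record.steps ?? []).map((step) => step.type));
  return Object.entries(REQUIRED_STEP)
    .filter(([, stepType]) => seen.has(stepType))
    .map(([checkId]) => checkId);
}
```

A record without a steps array yields an empty list, which corresponds to today's behaviour where all three checks skip.
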
gaia-audit.mjs emits a deterministic report
{ schema, audited_at, threat_model, totals, checks[], attestation{ clean, strict_clean, critical_failures[], warn_failures[], skipped[], harness_gaps[] }, inputs{ results_sha256, trajectories_sha256?, metadata_sha256? } }.
audited_at is an injected value or the literal AUDITED_AT_PLACEHOLDER, and
the body contains no Date.now()/random, so identical inputs hash identically.
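
One common way to obtain byte-stable output is to serialise with sorted object keys and an injected timestamp. Whether gaia-audit.mjs sorts keys is not stated in this ADR, so treat the helper below (canonicalJson, a hypothetical name) and the example totals shape as assumptions that only illustrate the property "identical inputs hash identically".

```javascript
// Serialise plain data with sorted object keys so the output does not depend
// on property insertion order. Arrays keep their order. Intended for plain
// JSON data only (no functions, symbols, or cyclic references).
function canonicalJson(value) {
  if (value === undefined) return "null";
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const body = Object.keys(value)
      .filter((key) => value[key] !== undefined)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + canonicalJson(value[key]));
    return "{" + body.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Same content, different key order, same placeholder timestamp.
const reportA = canonicalJson({ audited_at: "AUDITED_AT_PLACEHOLDER", totals: { pass: 2, skip: 1 } });
const reportB = canonicalJson({ totals: { skip: 1, pass: 2 }, audited_at: "AUDITED_AT_PLACEHOLDER" });
console.log(reportA === reportB); // true
```

Hashing the resulting string with sha256 would then give the same digest for both reports.
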
Integration into /gaia submit (the real signer lives in the ADR-103
witness scripts):
1. /gaia submit writes the package (results.jsonl, trajectories.jsonl,
   metadata.json), then runs gaia-audit.mjs --results … --trajectories … --metadata … --out audit-report.json --audited-at <submitted_at>.
2. If attestation.clean === false (any CRITICAL fail), /gaia submit
   refuses to build the leaderboard package unless --allow-dirty is
   passed. --strict additionally refuses on WARN failures.
3. The audit report is registered as a fix in manifest.md.json:

       // node plugins/ruflo-core/scripts/witness/regen.mjs \
       //   --manifest submission-<id>/manifest.md.json --fixes gaia-audit-fix.json
       { "fixes": [{
         "id": "gaia-exploit-audit",
         "desc": "GAIA pre-submission exploit audit clean (ADR-167)",
         "file": "submission-<id>/audit-report.json",
         "marker": "\"clean\": true"
       }] }

   regen.mjs→refreshFix() records the report's sha256 and verifies the
   "clean": true marker; verify.mjs then reports regressed if anyone
   swaps in a report whose bytes or marker changed. The audit report is thus
   inside the signed manifest, exactly like every other ADR-103 fix marker.
4. manifestHash = sha256(canonical manifest) already covers the fix's
   sha256, and the Ed25519 signature covers manifestHash, so tampering with
   the audit report after signing breaks verification.

Why a fix-marker rather than a new manifest field: it reuses the ADR-103
machinery verbatim (refreshFix → sha256 + marker; verify.mjs → regressed
detection) with zero signer changes, and it surfaces in the same
verify.mjs --json output operators already consume.
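
The refusal rules in the integration steps can be sketched as a small pure function. The name submitGate and the returned shape are hypothetical, and one behaviour is an assumption this ADR does not settle: here --allow-dirty overrides only the CRITICAL refusal, so --strict still refuses a package with WARN failures.

```javascript
// Hypothetical pre-sign gate for /gaia submit.
// attestation: the attestation object from the audit report.
// options: parsed CLI flags (--allow-dirty, --strict).
function submitGate(attestation, { allowDirty = false, strict = false } = {}) {
  if (!attestation.clean && !allowDirty) {
    return { proceed: false, reason: "CRITICAL audit failure (override with --allow-dirty)" };
  }
  // Assumption: --allow-dirty does not waive the --strict WARN gate.
  if (strict && !attestation.strict_clean) {
    return { proceed: false, reason: "WARN audit failure under --strict" };
  }
  return {
    proceed: true,
    reason: attestation.clean ? "audit clean" : "dirty package allowed by --allow-dirty",
  };
}
```

Keeping the gate separate from the audit itself means /gaia validate can print the same report without ever blocking.
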
Adopt the earning-integrity audit as a gate on /gaia submit:
1. Ship gaia-audit.mjs (pure-core + CLI, exit 0 clean / 1 CRITICAL / 2 usage)
   and its test suite now.
2. Wire --audit into /gaia validate and the pre-sign gate into /gaia submit
   in the gaia-submission skill; /gaia submit refuses a CRITICAL-failing
   package unless --allow-dirty is passed.
3. Instrument the harness so the blocked checks move from skip to enforced.

Rigorously honest status. These checks currently skip, not pass:

- AUD-1 answer-leakage needs steps[type=tool_result].output (fetched
  page text). gaia-agent.ts discards messages[]. Highest-value gap:
  this is GAIA's #1 vector and it is dark today.
- AUD-3 oracle-leakage needs the recorded prompt (steps[type=prompt]).
  Static review confirms buildInitialContent() never injects final_answer,
  but that is not a per-run attestation.
- AUD-4 grader-isolation needs tool_call names and arguments;
  toolCallsByName (counts only) is computed but never persisted. The check
  also flags that the judge cache (~/.cache/ruflo/gaia/judgments) is an
  unsigned filesystem oracle any local process can write.
- AUD-6 voting-disclosure: gaia-bench.ts accepts --voting-attempts but
  never writes it into BenchRunOutput/metadata.json. Add a
  voting_attempts metadata field at package time from the run flags.
- AUD-7 split-integrity needs a gaia_split metadata field
  (gaia-loader.ts only fetches validation).

Enforced today: AUD-2 (turns/tokens are real), AUD-5 (answers in results file), AUD-6/AUD-7 when metadata carries the field, and the static source-scan family AUD-8 answer-key-reads / AUD-9 dynamic-eval / AUD-10 judge-injection (they scan the harness sources + produced artifacts, not the trajectory, so no instrumentation is required; they skip honestly only when run outside the repo).
All four §4/§7 harness fixes are serialization-only — the data already exists
in memory at run time; the harness simply does not write it out. Until then,
the attestation faithfully reports harness_gaps[], so a signed "clean" is
never stronger than the checks that actually ran.
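
A sketch of how per-check results might roll up into the attestation fields listed in the report schema. The function name buildAttestation, the per-check severity field, and the definition of strict_clean (clean with no WARN failures) are assumptions. The property it illustrates is the one stated above: skipped checks do not fail the attestation, but they are always listed alongside their harness gaps.

```javascript
// Hypothetical roll-up of check results into the attestation block.
// Each check: { id, severity: "CRITICAL" | "WARN", status, harness_gap? }.
function buildAttestation(checks) {
  const failedWith = (severity) =>
    checks
      .filter((c) => c.status === "fail" && c.severity === severity)
      .map((c) => c.id);
  const skippedChecks = checks.filter((c) => c.status === "skip");
  const critical_failures = failedWith("CRITICAL");
  const warn_failures = failedWith("WARN");
  const clean = critical_failures.length === 0;
  return {
    clean,
    strict_clean: clean && warn_failures.length === 0,
    critical_failures,
    warn_failures,
    skipped: skippedChecks.map((c) => c.id),
    harness_gaps: skippedChecks.map((c) => c.harness_gap).filter(Boolean),
  };
}
```

A reviewer reading a signed "clean": true can therefore inspect skipped and harness_gaps to see exactly which vectors were not examined.
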