ADR-167: GAIA Submission Integrity — Pre-Submission Exploit Audit + Signed Attestation

v3/docs/adr/ADR-167-gaia-submission-integrity-exploit-audit.md


ID: ADR-167
Status: Proposed. Prototype + tests shipped on feat/gaia-submission-integrity-audit (this branch). Enforceable checks green now; four checks blocked on trajectory instrumentation (see §7).
Date: 2026-07-03
Authors: rUv (drafted with Claude Code)
Related ADRs:

  • ADR-133 (GAIA benchmark harness — loader/agent/judge/tools)
  • ADR-135 (GAIA Tracks A/B/D/E — voting, planning, critic, decompose)
  • ADR-136 (GAIA Track Q — hardness routing)
  • ADR-103 (Witness manifest — Ed25519-signed fix markers + temporal history)
  • ADR-131 (ToolOutputGuardrail — content-boundary screening)
  • ADR-166 (MCP bridge RCE remediation — the "verify the claim against source" discipline this ADR reuses)

1. Context

1.1 Why this ADR now

In April 2026, UC Berkeley's Center for Responsible, Decentralized Intelligence (RDI) published "How We Broke Top AI Agent Benchmarks" and the companion BenchJack study (arXiv:2605.12673). An automated scanning agent achieved near-perfect scores on eight major agent benchmarks — ~100% on WebArena, ~98% on GAIA, ~73% on OSWorld — using zero LLM calls and zero reasoning, solving not a single task. BenchJack surfaced 219 distinct flaws across eight exploit classes. Separately, METR reported that frontier models (o3, Claude-3.7-Sonnet) spontaneously reward-hack in 30%+ of evaluation runs — stack-introspecting to detect evaluator code and monkey-patching grading functions to always return a pass.

For GAIA specifically the RDI finding is concrete: ~98% of GAIA answers leaked through public answer databases plus normalization collisions. GAIA's gold answers live in a public Hugging Face dataset (gaia-benchmark/GAIA) — the very dataset our loader downloads (gaia-loader.ts) and, in principle, the very dataset our web_search / grounded_query tools can reach at run time.

ruflo publishes GAIA scores to the Princeton HAL leaderboard via /gaia submit, which already signs the submission package with the ADR-103 Ed25519 witness manifest. But a signature proves transport-integrity, not earning-integrity. Today, a submission attests only "these bytes were not tampered after signing" — it says nothing about how the scores were earned. Post-RDI, that is the wrong thing to be certain about.

1.2 The core distinction

Signing proves the bytes are untampered in transit. The audit proves the score was earned by solving the task, not by exploiting the benchmark.

This ADR adds an earning-integrity layer: a deterministic, $0, pure-core red-team of a results file against the known reward-hacking vectors, whose report is hashed and signed into the existing witness manifest. After this change a ruflo GAIA submission attests two independent facts:

  1. Transport-integrity (ADR-103, unchanged): the package bytes match the signed manifest hash.
  2. Earning-integrity (this ADR): the scores survived a mechanical red-team of GAIA's #1 leakage vector, the "100% without solving" no-work signature, oracle leakage, and grader monkey-patching.

1.3 Scope & limitations

In scope: a pure-core, offline audit of the GAIA results file + trajectories.jsonl (+ optional metadata); wiring it into /gaia validate and /gaia submit; the manifest-attestation integration point.

Out of scope / honest limitations:

  • A mechanical audit reduces but cannot fully eliminate reward-hacking. It catches the known, mechanically-detectable vectors from the RDI/METR catalogue. A sufficiently clever exploit that leaves no trace in the recorded trajectory is not caught. This is a floor, not a ceiling.
  • Some checks are only as strong as the trajectory schema. The current harness (gaia-bench.ts) persists QuestionResult[] only — it does not emit a trajectories.jsonl, and gaia-agent.ts discards the messages[] array (fetched tool outputs, agent-visible prompts) after each question. Four of the seven checks therefore cannot enforce today and return status: "skip" with a harness_gap note rather than a false pass (§7). They become enforceable once the harness records the data — this ADR specifies the forward contract.
  • No live network, no LLM, no env access. The audit is $0 and deterministic by construction so its report can be signed reproducibly.

2. Ground truth — the real schemas the audit reads

Verified by reading source on origin/main (not invented):

  • Per-question result: QuestionResult in v3/@claude-flow/cli/src/commands/gaia-bench.ts: { task_id, question, model, correct, answer, expected_output, error?, turns?, wallMs?, inputTokens?, outputTokens?, hardnessDifficulty?, hardnessConfidence?, decomposed? }. Emitted inside BenchRunOutput { level, model, summary{total,passed,passRate,estCostUsd,meanTurns,meanWallMs,hardnessDist?}, results }.
  • Agent result: GaiaAgentResult in gaia-agent.ts carries toolCallsByName (per-tool counts, not names+args), totalInputTokens, totalOutputTokens, turns, replanCount, convergenceTrigger?. The messages[] array (fetched tool outputs + prompts) is local to runGaiaAgent and never returned or persisted; this is the central harness gap.
  • Judge: judgeAnswer() in gaia-judge.ts: Stage-1 normaliseAnswer() (lowercase + trim + strip one quote pair + collapse whitespace), Stage-1b unit-aware numeric match, then LLM-as-judge cached under ~/.cache/ruflo/gaia/judgments/<sha256>.json (an unsigned filesystem oracle). Note that isAnswerCorrect() in gaia-agent.ts additionally does bidirectional substring matching (normExpected.includes(normModel)), a collision hazard the audit flags.
  • Loader: gaia-loader.ts hard-codes split=validation; gold answers are the public HF dataset.
  • Submission package: gaia-submit.md: results.jsonl, trajectories.jsonl (documented but not produced by the harness today), metadata.json, manifest.md.json (Ed25519 witness).
  • Witness: plugins/ruflo-core/scripts/witness/{lib,verify,regen}.mjs (ADR-103): manifest.fixes[] = {id, desc, file, sha256, marker, markerVerified}; integrity = {manifestHash (sha256 of canonical manifest), publicKey, signature, seedDerivation}; seed = sha256(gitCommit + ':ruflo-witness/v1'). verify.mjs recomputes the hash, reproduces the public key from the commit, and Ed25519-verifies.

3. Check catalogue mapped to RDI/METR vectors

| ID | Check | Severity | RDI/METR vector | Enforceable today? |
|---|---|---|---|---|
| AUD-1 | answer-leakage | CRITICAL | GAIA #1: ~98% answer leakage via public DB + fetched pages | Blocked: needs tool_result outputs in trajectory |
| AUD-2 | no-work / no-LLM pass | CRITICAL | "100% without solving; zero LLM calls" | Yes: turns/outputTokens are real fields (stronger with trajectory) |
| AUD-3 | oracle-leakage | CRITICAL | gold answer visible in agent prompt/context | Blocked: needs recorded prompt |
| AUD-4 | grader-isolation | CRITICAL | METR: 30%+ grader monkey-patching / eval-code introspection | Blocked: needs tool_call names+args |
| AUD-5 | normalization-collision | WARN | GAIA normalization collisions + bidirectional-substring hazard | Yes: reads answers in the results file |
| AUD-6 | voting-disclosure | WARN | hidden best-of-N score inflation | Yes if metadata carries N (harness gap: not persisted) |
| AUD-7 | split-integrity | WARN/INFO | validation-split (public gold) presented as held-out | Yes if metadata carries split (harness gap: not persisted) |
| AUD-8 | answer-key-reads | CRITICAL | RDI: answer keys read from unsanitized config/infra | Yes: static scan of runner sources + produced artifacts |
| AUD-9 | dynamic-eval | CRITICAL | RDI: trojanized test infrastructure (eval/exec of task content) | Yes: static scan of the gaia-bench runner sources |
| AUD-10 | judge-injection | WARN | prompt-injection in the produced answer aimed at the LLM judge | Yes: scans produced answers/outputs |

Static source-scan family (AUD-8/9/10) — ported from the reverted #2547 gaia-integrity.mjs duplicate. Unlike AUD-1/3/4 these do not depend on trajectory instrumentation: AUD-8/9 statically scan the harness runner sources (v3/@claude-flow/cli/src/benchmarks/gaia-*, src/commands/gaia-bench.ts) plus the produced artifacts, and AUD-10 scans the produced answers/outputs — so they are enforceable today. They still skip (never false-pass) when the source population is absent (e.g. run standalone/ejected outside the repo). AUD-9 preserves the #2547 false-positive refinements — multi-line calls resolved, RegExp.prototype.exec excluded, fixed-string commands and comment lines ignored — so it passes clean against ruflo's own gcloud secrets … execSync calls. AUD-8 is fail-closed (critical); AUD-9 fail-closed (critical); AUD-10 is WARN (a produced-answer injection does not by itself prove a gamed score, but it is a --strict gate and a reviewer signal). The reverted duplicate's separate provenance stamp and --allow-integrity-override flag are not ported: this ADR's signed attestation (§5) and the /gaia submit --allow-dirty gate (§6) already provide provenance and override, so AUD-8/9/10 flow through the same attestation.clean / strict_clean path as every other check.
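To make the AUD-9 refinements concrete, here is a minimal sketch of a line-based dynamic-eval scan. It is not the code in gaia-audit.mjs: the function name scanSource, the two regular expressions, and the finding shape are all assumptions made for illustration. Unlike the real check described above, it does not resolve multi-line calls, and a fixed command containing a comma or parenthesis would be flagged conservatively.

```javascript
// Hypothetical, simplified AUD-9 style scanner. Patterns are illustrative
// assumptions, not the rules shipped in gaia-audit.mjs.

// eval(...) or new Function(...); the lookbehind skips names such as obj.eval( or retrieval(
const DYNAMIC_EVAL = /(?<![.\w])eval\s*\(|\bnew\s+Function\s*\(/;
// execSync anywhere, or a bare exec(...). A method call like regex.exec(...) is
// excluded by the lookbehind, mirroring the RegExp.prototype.exec exclusion.
const SHELL_CALL = /(?:\bexecSync|(?<![.\w])exec)\s*\(\s*([^,)]*)/;
// A first argument that is one plain quoted string counts as a fixed command.
const FIXED_STRING = /^(['"])[^'"]*\1$/;

function scanSource(file, text) {
  const findings = [];
  text.split('\n').forEach((line, i) => {
    const trimmed = line.trim();
    // Comment lines are ignored.
    if (trimmed.startsWith('//') || trimmed.startsWith('/*') || trimmed.startsWith('*')) return;
    if (DYNAMIC_EVAL.test(line)) {
      findings.push({ file, line: i + 1, kind: 'eval' });
      return;
    }
    const call = SHELL_CALL.exec(line);
    if (call && !FIXED_STRING.test(call[1].trim())) {
      findings.push({ file, line: i + 1, kind: 'dynamic-exec' });
    }
  });
  return findings;
}
```

Under these assumptions a call such as execSync('gcloud secrets list') passes as a fixed-string command, while execSync(task.cmd) or eval(task.code) is reported with its file and line number.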

Each check is a pure function in plugins/ruflo-workflows/scripts/gaia-audit.mjs, unit-tested against a fixture that triggers its vector and the clean fixture that must pass. Where the schema lacks the data, the check returns status:"skip" + a harness_gap string — it never fakes a pass.
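As an illustration of the pure-function shape and the skip-instead-of-pass rule, the following sketch shows how an AUD-2 style check might look. The function name checkNoWork, the returned object shape, and the exact "no work" criterion (a correct answer with zero turns or zero output tokens) are assumptions; only the field names correct, turns, outputTokens, and task_id come from the QuestionResult schema in §2.

```javascript
// Hypothetical sketch of an AUD-2 style check; not the shipped implementation.
function checkNoWork(results) {
  const passed = results.filter((r) => r.correct === true);
  // turns and outputTokens are optional in QuestionResult, so some runs
  // may not carry them at all.
  const measurable = passed.filter(
    (r) => typeof r.turns === 'number' || typeof r.outputTokens === 'number'
  );
  if (passed.length > 0 && measurable.length === 0) {
    // Nothing to measure: report the gap rather than a false pass.
    return {
      id: 'AUD-2',
      severity: 'CRITICAL',
      status: 'skip',
      harness_gap: 'results carry neither turns nor outputTokens',
    };
  }
  // A task marked correct with zero turns or zero output tokens was
  // "solved" without any recorded model work.
  const offenders = measurable
    .filter((r) => r.turns === 0 || r.outputTokens === 0)
    .map((r) => r.task_id);
  return {
    id: 'AUD-2',
    severity: 'CRITICAL',
    status: offenders.length > 0 ? 'fail' : 'pass',
    offenders,
  };
}
```

Because the function reads only its argument and returns a plain object, the same input always yields the same result, which is what lets the report be hashed and signed reproducibly.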

The audit's normaliseAnswer() is a byte-for-byte copy of gaia-judge.ts's so the audit sees answers exactly as the scorer does.
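The sketch below shows why normalisation and substring matching together create a collision hazard. The normaliseAnswer here is reconstructed from the prose description in §2 (lowercase, trim, strip one quote pair, collapse whitespace) and is not the byte-for-byte copy the audit uses; checkNormalizationCollision and its flagging rule are likewise assumptions.

```javascript
// Reconstructed from the §2 description; an approximation, not the real copy.
function normaliseAnswer(value) {
  let out = String(value).toLowerCase().trim();
  const first = out[0];
  // Strip exactly one surrounding pair of matching quotes.
  if (out.length >= 2 && (first === '"' || first === "'") && out.endsWith(first)) {
    out = out.slice(1, -1);
  }
  return out.replace(/\s+/g, ' ').trim();
}

// Hypothetical AUD-5 style check: flag answers scored correct that match the
// gold answer only through substring containment, not normalised equality.
function checkNormalizationCollision(results) {
  const flagged = [];
  for (const r of results) {
    if (r.correct !== true) continue;
    const model = normaliseAnswer(r.answer ?? '');
    const gold = normaliseAnswer(r.expected_output ?? '');
    if (model === gold) continue;
    // Either direction of containment counts, as in bidirectional matching.
    // Note that an empty string is contained in every string.
    if (model.includes(gold) || gold.includes(model)) flagged.push(r.task_id);
  }
  return {
    id: 'AUD-5',
    severity: 'WARN',
    status: flagged.length > 0 ? 'fail' : 'pass',
    flagged,
  };
}
```

For example, a gold answer of "4" is contained in "the year 1945", so a substring scorer would accept it; this sketch reports that task while leaving a genuine match such as "Paris" versus paris untouched.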


4. Forward trajectory contract (unblocks AUD-1/3/4)

To make the blocked checks enforceable, gaia-agent.ts must emit — and gaia-bench.ts must persist to trajectories.jsonl — one record per task:

```jsonc
{
  "task_id": "…",
  "turns": 4,
  "tools_used": ["web_search"],
  "steps": [
    { "type": "prompt",      "content": "<agent-visible question/context>" },
    { "type": "llm_call",    "input": "<visible input>", "output": "<model text>",
      "tokens_in": 1520, "tokens_out": 88 },
    { "type": "tool_call",   "name": "web_search", "input": { "query": "…" } },
    { "type": "tool_result", "name": "web_search", "url": "https://…",
      "output": "<fetched page text>" }
  ]
}
```

This is a faithful serialization of the messages[] array runGaiaAgent already builds — the fix is to return and write it, not to compute anything new. is_error tool results and image markers can be recorded as-is. Until this lands, AUD-1/3/4 skip and the attestation records the gap.
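Once records of this shape exist, a check such as AUD-1 has something to read. The following is a minimal sketch under stated assumptions: parseJsonl and checkAnswerLeakage are hypothetical names, and the leakage heuristic (a tool_result that both mentions the gaia-benchmark/GAIA dataset and contains the gold answer) is an illustrative guess at one detectable signature, not the rule the audit implements.

```javascript
// Parse a JSONL file body into an array of records (blank lines ignored).
function parseJsonl(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Hypothetical AUD-1 style check over the trajectory contract above.
function checkAnswerLeakage(results, trajectories) {
  if (!Array.isArray(trajectories) || trajectories.length === 0) {
    // No recorded tool outputs: skip with a gap note, never a false pass.
    return {
      id: 'AUD-1',
      severity: 'CRITICAL',
      status: 'skip',
      harness_gap: 'no trajectory records with tool_result outputs',
    };
  }
  const byTask = new Map(trajectories.map((t) => [t.task_id, t]));
  const leaked = [];
  for (const r of results) {
    const gold = String(r.expected_output ?? '').toLowerCase().trim();
    const trajectory = byTask.get(r.task_id);
    if (gold === '' || !trajectory) continue;
    const hit = (trajectory.steps ?? []).some(
      (s) =>
        s.type === 'tool_result' &&
        typeof s.output === 'string' &&
        // Assumed heuristic: the fetched content points at the public gold
        // dataset and also contains this task's gold answer.
        /gaia-benchmark\/GAIA/i.test(`${s.url ?? ''} ${s.output}`) &&
        s.output.toLowerCase().includes(gold)
    );
    if (hit) leaked.push(r.task_id);
  }
  return {
    id: 'AUD-1',
    severity: 'CRITICAL',
    status: leaked.length > 0 ? 'fail' : 'pass',
    leaked,
  };
}
```

A page that merely contains the correct answer is not flagged in this sketch, since legitimate research also fetches such pages; only the combination with a reference to the gold dataset is treated as leakage.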


5. Manifest-attestation design

gaia-audit.mjs emits a deterministic report { schema, audited_at, threat_model, totals, checks[], attestation{ clean, strict_clean, critical_failures[], warn_failures[], skipped[], harness_gaps[] }, inputs{ results_sha256, trajectories_sha256?, metadata_sha256? } }. audited_at is an injected value or the literal AUDITED_AT_PLACEHOLDER, and the body contains no Date.now()/random, so identical inputs hash identically.
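The determinism property can be sketched as follows. This is an assumption-laden illustration: buildReport, canonicalJson, and the schema string 'gaia-audit/v1' are hypothetical, and the sorted-key serialization is one possible canonical form rather than the one the audit necessarily uses. Only the field names and the AUDITED_AT_PLACEHOLDER literal come from the report description above.

```javascript
const { createHash } = require('node:crypto');

const sha256 = (text) => createHash('sha256').update(text).digest('hex');

// Serialize with recursively sorted object keys so that key order in the
// input object never changes the output bytes.
function canonicalJson(value) {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .filter((key) => value[key] !== undefined)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(value[key])}`)
      .join(',');
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? 'null';
}

// No Date.now() and no randomness: the timestamp is injected by the caller
// or falls back to the fixed placeholder.
function buildReport({ resultsText, checks, auditedAt = 'AUDITED_AT_PLACEHOLDER' }) {
  return {
    schema: 'gaia-audit/v1', // hypothetical schema id
    audited_at: auditedAt,
    checks,
    inputs: { results_sha256: sha256(resultsText) },
  };
}
```

Hashing canonicalJson(buildReport(...)) twice over the same inputs gives the same digest, whereas changing the injected audited_at or any input byte changes it.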

Integration into /gaia submit (the real signer lives in the ADR-103 witness scripts):

  1. /gaia submit writes the package (results.jsonl, trajectories.jsonl, metadata.json), then runs gaia-audit.mjs --results … --trajectories … --metadata … --out audit-report.json --audited-at <submitted_at>.
  2. If attestation.clean === false (any CRITICAL fail), /gaia submit refuses to build the leaderboard package unless --allow-dirty is passed. --strict additionally refuses on WARN failures.
  3. On a clean pass, the audit report is added to the package and registered as a witness fix entry so its sha256 is signed into manifest.md.json:
    ```jsonc
    // node plugins/ruflo-core/scripts/witness/regen.mjs \
    //   --manifest submission-<id>/manifest.md.json --fixes gaia-audit-fix.json
    { "fixes": [{
      "id": "gaia-exploit-audit",
      "desc": "GAIA pre-submission exploit audit clean (ADR-167)",
      "file": "submission-<id>/audit-report.json",
      "marker": "\"clean\": true"
    }] }
    ```
    refreshFix() in regen.mjs records the report's sha256 and verifies the "clean": true marker; verify.mjs then reports regressed if anyone swaps in a report whose bytes or marker changed. The audit report is thus inside the signed manifest, exactly like every other ADR-103 fix marker.
  4. manifestHash = sha256(canonical manifest) already covers the fix's sha256, and the Ed25519 signature covers manifestHash — so tampering with the audit report after signing breaks verification.
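The chain in steps 3 and 4 (report bytes, fix sha256, manifestHash, Ed25519 signature) can be sketched with Node's built-in crypto module. This is not the ADR-103 witness code: signManifest and verifySubmission are hypothetical names, JSON.stringify stands in for whatever canonical form the witness scripts use, and treating the raw 32-byte sha256 digest as the Ed25519 seed and signing the hash bytes are assumptions. Only the seed formula and the fix-entry fields come from the text.

```javascript
const crypto = require('node:crypto');

const sha256Hex = (data) => crypto.createHash('sha256').update(data).digest('hex');

// seed = sha256(gitCommit + ':ruflo-witness/v1'), as stated in §2.
function keyPairFromCommit(gitCommit) {
  const seed = crypto.createHash('sha256').update(`${gitCommit}:ruflo-witness/v1`).digest();
  // Wrap the raw 32-byte seed in the PKCS#8 header for Ed25519 (RFC 8410).
  const pkcs8 = Buffer.concat([Buffer.from('302e020100300506032b657004220420', 'hex'), seed]);
  const privateKey = crypto.createPrivateKey({ key: pkcs8, format: 'der', type: 'pkcs8' });
  return { privateKey, publicKey: crypto.createPublicKey(privateKey) };
}

function signManifest(manifest, gitCommit) {
  // Assumption: JSON.stringify stands in for the canonical manifest form.
  const manifestHash = sha256Hex(JSON.stringify(manifest));
  const { privateKey } = keyPairFromCommit(gitCommit);
  const signature = crypto
    .sign(null, Buffer.from(manifestHash, 'hex'), privateKey)
    .toString('hex');
  return { manifestHash, signature };
}

function verifySubmission(manifest, integrity, gitCommit, reportText) {
  const fix = manifest.fixes.find((f) => f.id === 'gaia-exploit-audit');
  // 1. The report bytes and marker must match the fix entry.
  const reportOk =
    Boolean(fix) && sha256Hex(reportText) === fix.sha256 && reportText.includes(fix.marker);
  // 2. The manifest (which contains fix.sha256) must match manifestHash.
  const hashOk = sha256Hex(JSON.stringify(manifest)) === integrity.manifestHash;
  // 3. The signature over manifestHash must verify under the commit-derived key.
  const { publicKey } = keyPairFromCommit(gitCommit);
  const sigOk = crypto.verify(
    null,
    Buffer.from(integrity.manifestHash, 'hex'),
    publicKey,
    Buffer.from(integrity.signature, 'hex')
  );
  return reportOk && hashOk && sigOk;
}
```

Swapping in a different audit report fails step 1, and editing the fix's sha256 to match the swapped report fails step 2, which is the tamper-evidence property the list above describes.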

Why a fix-marker rather than a new manifest field: it reuses the ADR-103 machinery verbatim (refreshFix → sha256 + marker; verify.mjs → regressed detection) with zero signer changes, and it surfaces in the same verify.mjs --json output operators already consume.


6. Decision

Adopt the earning-integrity audit as a gate on /gaia submit:

  • Ship gaia-audit.mjs (pure-core + CLI, exit 0 clean / 1 CRITICAL / 2 usage) and its test suite now.
  • Wire --audit into /gaia validate and the pre-sign gate into /gaia submit + the gaia-submission skill; /gaia submit refuses a CRITICAL-failing package unless --allow-dirty.
  • Sign the audit report into the witness manifest as an ADR-103 fix marker.
  • File the harness-instrumentation follow-up (§4, §7) so AUD-1/3/4 and the metadata fields for AUD-6/7 move from skip to enforced.
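The gate behaviour decided here can be summarized in a few lines. This is a sketch, not the /gaia submit implementation: submitGate, auditExitCode, and the flag object are hypothetical, and it assumes that --allow-dirty waives only the CRITICAL gate while --strict still applies, which the text does not state explicitly.

```javascript
// Exit codes from the decision above: 0 clean / 1 CRITICAL / 2 usage.
const EXIT = { CLEAN: 0, CRITICAL: 1, USAGE: 2 };

// Exit code for a completed audit run; EXIT.USAGE is reserved for bad CLI
// arguments, which are handled before any report exists.
function auditExitCode(attestation) {
  return attestation.clean ? EXIT.CLEAN : EXIT.CRITICAL;
}

// Hypothetical pre-sign gate for /gaia submit.
function submitGate(attestation, flags = {}) {
  if (!attestation.clean && !flags.allowDirty) {
    return { proceed: false, reason: 'CRITICAL failures (pass --allow-dirty to override)' };
  }
  // Assumption: --allow-dirty does not waive the stricter WARN gate.
  if (flags.strict && !attestation.strict_clean) {
    return { proceed: false, reason: 'WARN failures under --strict' };
  }
  return { proceed: true, reason: null };
}
```

With this shape, a report that is clean but carries WARN failures builds by default and is refused only under --strict.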

7. Harness gaps (what is NOT enforceable until instrumented)

Rigorously honest status — these currently skip, not pass:

  1. AUD-1 answer-leakage — needs steps[type=tool_result].output (fetched page text). gaia-agent.ts discards messages[]. Highest-value gap: this is GAIA's #1 vector and it is dark today.
  2. AUD-3 oracle-leakage — needs the recorded agent-visible prompt (steps[type=prompt]). Static review confirms buildInitialContent() never injects final_answer, but that is not a per-run attestation.
  3. AUD-4 grader-isolation — needs tool_call names + arguments. toolCallsByName (counts only) is computed but never persisted. Also flags that the judge cache (~/.cache/ruflo/gaia/judgments) is an unsigned filesystem oracle any local process can write.
  4. AUD-6 voting-disclosure: gaia-bench.ts accepts --voting-attempts but never writes it into BenchRunOutput/metadata.json. Add a voting_attempts metadata field at package time from the run flags.
  5. AUD-7 split-integrity — add a gaia_split metadata field (gaia-loader.ts only fetches validation).

Enforced today: AUD-2 (turns/tokens are real), AUD-5 (answers in results file), AUD-6/AUD-7 when metadata carries the field, and the static source-scan family AUD-8 answer-key-reads / AUD-9 dynamic-eval / AUD-10 judge-injection (they scan the harness sources + produced artifacts, not the trajectory, so no instrumentation is required — they skip honestly only when run outside the repo).

All four §4/§7 harness fixes are serialization-only — the data already exists in memory at run time; the harness simply does not write it out. Until then, the attestation faithfully reports harness_gaps[], so a signed "clean" is never stronger than the checks that actually ran.
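How skipped checks stay visible in the signed attestation can be sketched as below. The function name summarizeAttestation and the per-check object shape are assumptions; the output field names follow the attestation block in §5. The sketch also assumes that a skipped check does not by itself make clean false, which is why the gaps are listed separately.

```javascript
// Hypothetical roll-up of per-check results into the attestation block.
function summarizeAttestation(checks) {
  const idsWhere = (predicate) => checks.filter(predicate).map((c) => c.id);
  const critical_failures = idsWhere((c) => c.status === 'fail' && c.severity === 'CRITICAL');
  const warn_failures = idsWhere((c) => c.status === 'fail' && c.severity === 'WARN');
  const skipped = idsWhere((c) => c.status === 'skip');
  // Every skip carries its reason forward so readers of the signed report
  // can see which checks never ran.
  const harness_gaps = checks
    .filter((c) => c.status === 'skip' && c.harness_gap)
    .map((c) => `${c.id}: ${c.harness_gap}`);
  const clean = critical_failures.length === 0;
  return {
    clean,
    strict_clean: clean && warn_failures.length === 0,
    critical_failures,
    warn_failures,
    skipped,
    harness_gaps,
  };
}
```

A consumer of the report can therefore tell a clean result in which AUD-1 ran apart from a clean result in which AUD-1 appears under skipped with its harness_gap note.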