plugins/ruflo-workflows/commands/gaia-submit.md
Build a submission-ready package from a completed benchmark run and sign it with the ruflo Ed25519 witness manifest.
/gaia submit
/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json
/gaia submit --results=./my-results.json --dry-run
| Flag | Default | Description |
|---|---|---|
| --results | ~/.cache/ruflo/gaia/results-latest.json | Path to the JSON results file from /gaia run |
| --run-id | auto (from git SHA) | Short identifier embedded in the package directory name |
| --dry-run | off | Build and validate the package but do not write it to disk |
| --no-sign | off | Skip Ed25519 signing (not recommended for leaderboard submissions) |
| --allow-dirty | off | Build the package even when the ADR-167 exploit audit reports a CRITICAL failure (records the failure in the package; not for leaderboard submissions) |
| --strict-audit | off | Also refuse to build when the audit reports a WARN finding |
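
The way --allow-dirty and --strict-audit interact can be sketched as a small decision function. This is an illustration only, not the plugin's code: the function name `shouldBuild` and the finding shape (`{ severity }`) are assumptions, and it assumes --allow-dirty overrides any blocking finding, as the audit step in this document suggests.

```javascript
// Hypothetical sketch: decide whether to build the package from audit findings.
// The finding shape ({ severity: "CRITICAL" | "WARN" | ... }) is assumed,
// not taken from the real audit-report.json schema.
function shouldBuild(findings, { allowDirty = false, strictAudit = false } = {}) {
  const hasCritical = findings.some((f) => f.severity === "CRITICAL");
  const hasWarn = findings.some((f) => f.severity === "WARN");

  // CRITICAL always blocks; WARN blocks only under --strict-audit.
  const blocked = hasCritical || (strictAudit && hasWarn);
  if (!blocked) return { build: true, dirty: false };

  // --allow-dirty builds anyway, but the package is flagged as dirty.
  return { build: allowDirty, dirty: true };
}

const exampleFindings = [{ id: "example-check", severity: "WARN" }];
console.log(shouldBuild(exampleFindings));                        // WARN alone does not block
console.log(shouldBuild(exampleFindings, { strictAudit: true })); // blocked under --strict-audit
```

Keeping the decision in one pure function makes the refusal rule easy to test separately from the audit itself.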
submission-<date>-<short-sha>/
├── results.jsonl — one JSON object per question (HAL-compatible)
├── trajectories.jsonl — full agent trajectory per question
├── metadata.json — model, harness version, tool catalogue, cost
├── audit-report.json — ADR-167 pre-submission exploit-audit report
├── manifest.md.json — Ed25519-signed witness manifest (signs audit-report.json's hash)
└── README.md — human-readable summary + comparison vs HAL baseline
Before signing, /gaia submit runs the pre-submission exploit audit — a
deterministic, $0 red-team of the known reward-hacking vectors that let UC
Berkeley RDI hit ~98% on GAIA without solving a single task (leaked answer
DBs, no-work passes, oracle leakage, grader monkey-patching).
Signing proves the package bytes are untampered; the audit proves the scores
were earned. The two are wired together: the audit report is registered as an
ADR-103 witness fix marker, so its sha256 + "clean": true marker are
signed into manifest.md.json.

- On a CRITICAL finding, /gaia submit refuses to build the leaderboard package unless --allow-dirty is passed.
- --strict-audit additionally refuses on WARN findings.
- Checks that skip with a harness_gap note (ADR-167 §7) do not block, but the gap is recorded in audit-report.json and thus in the signed manifest.

{
"task_id": "e1fc63a2-da7a-432f-be78-7c4a95598703",
"model_answer": "4",
"reasoning_trace": "[full agent trace]",
"tools_used": ["web_search", "python_exec"],
"turns": 5,
"wall_seconds": 12.4
}
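
- Locate the results file: default to ~/.cache/ruflo/gaia/results-latest.json; ask if multiple candidates exist.
- Validate that the results JSON has the required fields: level, model, summary, results array.
- Convert results[] → HAL-compatible results.jsonl (one JSON object per line).
- Build trajectories.jsonl from any trajectory fields in the results.
- Write metadata.json: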
{
"submitted_at": "<ISO-8601>",
"harness": "ruflo@<version> / @claude-flow/<package>@<version>",
"model": "<model-id>",
"gaia_level": 1,
"tool_catalogue": ["web_search","file_read","web_browse","image_describe","python_exec"],
"total_questions": 53,
"pass_rate": 0.208,
"est_cost_usd": 1.23,
"adrs": ["ADR-133","ADR-135","ADR-136"],
"git_sha": "<short-sha>"
}
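
The validation and conversion steps above could look roughly like the following. This is a minimal sketch under stated assumptions: `validateResults` and `toJsonl` are hypothetical names, and each results entry is serialized as-is, since the exact HAL field mapping is not spelled out here.

```javascript
// Hypothetical helpers. Only the required field names (level, model, summary,
// results) come from the steps above; everything else is assumed.
function validateResults(doc) {
  if (doc === null || typeof doc !== "object") {
    throw new Error("results file must contain a JSON object");
  }
  const missing = ["level", "model", "summary"].filter((key) => !(key in doc));
  if (!Array.isArray(doc.results)) missing.push("results (array)");
  if (missing.length > 0) {
    throw new Error(`results file is missing: ${missing.join(", ")}`);
  }
  return doc;
}

function toJsonl(results) {
  // JSON.stringify without an indent argument escapes newlines inside strings,
  // so every entry is guaranteed to occupy exactly one line.
  return results.map((entry) => JSON.stringify(entry) + "\n").join("");
}

const sampleDoc = validateResults({
  level: 1,
  model: "<model-id>",
  summary: {},
  results: [
    { task_id: "a", model_answer: "4" },
    { task_id: "b", model_answer: "line one\nline two" },
  ],
});
console.log(toJsonl(sampleDoc.results));
```

Validating before converting means a malformed results file fails early, before any package files are produced.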
- Run the ADR-167 pre-submission exploit audit:
node plugins/ruflo-workflows/scripts/gaia-audit.mjs \
--results <results> \
--trajectories submission-<id>/trajectories.jsonl \
--metadata submission-<id>/metadata.json \
--out submission-<id>/audit-report.json \
--audited-at "$SUBMITTED_AT" ${STRICT_AUDIT:+--strict}
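- If the audit reports a CRITICAL failure (or a WARN finding under --strict-audit) and --allow-dirty was NOT passed, stop and report the findings; do not build the package. --audited-at is set to metadata.submitted_at so the report is reproducible.
- Regenerate the signed witness manifest, registering the audit report as a fix marker:
node plugins/ruflo-core/scripts/witness/regen.mjs \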
--manifest submission-<id>/manifest.md.json \
--root submission-<id> \
--fixes gaia-audit-fix.json # {id:"gaia-exploit-audit", file:"audit-report.json", marker:"\"clean\": true"}
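
One detail worth checking when registering the fix marker: the marker contains a space after the colon. The sketch below assumes the marker is matched as a literal substring of the file, which this document does not confirm. `markerPresent` is a hypothetical helper and the report shape is made up for illustration.

```javascript
// Hypothetical check, ASSUMING the witness tooling looks for the marker as a
// literal substring of the registered file.
function markerPresent(fileText, fix) {
  return fileText.includes(fix.marker);
}

const auditFix = {
  id: "gaia-exploit-audit",
  file: "audit-report.json",
  marker: '"clean": true',
};

// Illustrative report object; the real audit-report.json schema may differ.
const report = { clean: true, findings: [] };

const pretty = JSON.stringify(report, null, 2); // emits "clean": true (with a space)
const compact = JSON.stringify(report);         // emits "clean":true (no space)

console.log(markerPresent(pretty, auditFix));  // true
console.log(markerPresent(compact, auditFix)); // false
```

Under that assumption, the audit report has to be written with indented (pretty-printed) JSON for the marker to be found.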
- Write README.md with a pass-rate table comparing against HAL baselines.
- With --dry-run, print the package tree, the audit summary, and the manifest hash without writing.

After generating the package:
zip -r submission-$(date +%Y%m%d).zip submission-<date>-<sha>/
# Upload at https://huggingface.co/spaces/gaia-benchmark/leaderboard