plugins/ruflo-workflows/commands/gaia-validate.md
Run pre-submission integrity checks before executing a benchmark or packaging results for the HAL leaderboard.
/gaia validate
/gaia validate --strict
/gaia validate --fix
| Flag | Default | Description |
|---|---|---|
--strict | off | Fail on warnings (not just errors); with --audit, also fails on WARN audit findings |
--fix | off | Attempt to auto-fix resolvable issues (e.g., install missing deps) |
--skip-hf | off | Skip the HF dataset connectivity check (useful offline) |
--skip-build | off | Skip the TypeScript build check |
--audit | off | Run the pre-submission exploit audit (ADR-167) when a results file exists |
ANTHROPIC_API_KEY — required for model inferenceHF_TOKEN — required to download the GAIA dataset from Hugging FaceGOOGLE_AI_API_KEY — optional; warn if absent (Gemini model support disabled)GOOGLE_CUSTOM_SEARCH_API_KEY + GOOGLE_CUSTOM_SEARCH_CX — optional; warn
if absent (web_search falls back to DuckDuckGo)cd v3/@claude-flow/cli && npx tsc --noEmit
All GAIA benchmark source files must be TS-error-free.
Perform a dry-run fetch of 1 question from the HF GAIA dataset to confirm the token and network path work.
Verify the witness manifest is up to date and valid:
node plugins/ruflo-core/scripts/witness/verify.mjs
Confirm all required benchmark source files exist:
v3/@claude-flow/cli/src/commands/gaia-bench.tsv3/@claude-flow/cli/src/benchmarks/gaia-agent.tsv3/@claude-flow/cli/src/benchmarks/gaia-judge.tsv3/@claude-flow/cli/src/benchmarks/gaia-loader.tsv3/@claude-flow/cli/src/benchmarks/gaia-tools/index.tsnode v3/@claude-flow/cli/bin/cli.js --version
Only runs when --audit is passed and a results file is available
(default ~/.cache/ruflo/gaia/results-latest.json). Motivated by UC Berkeley
RDI's April-2026 finding that all 8 major agent benchmarks — GAIA included —
were gamed to ~98-100% without solving any task (GAIA leaked ~98% of answers
via public answer DBs + normalization collisions; o3/Claude-3.7 monkey-patch
graders in 30%+ of runs, per METR).
Signing proves the bytes are untampered; this audit proves the score was earned. It is a deterministic, $0, offline red-team of the known reward-hacking vectors:
node plugins/ruflo-workflows/scripts/gaia-audit.mjs \
--results ~/.cache/ruflo/gaia/results-latest.json \
--trajectories <submission>/trajectories.jsonl \
--metadata <submission>/metadata.json \
[--repo-root <dir>] [--dataset-dir <dir>]... [--skip-source-scan]
| Check | Severity | RDI/METR vector |
|---|---|---|
| answer-leakage | CRITICAL | gold answer appears in a fetched tool output (GAIA's #1 vector) |
| no-work-pass | CRITICAL | correct with zero turns / zero output tokens / no LLM call |
| oracle-leakage | CRITICAL | gold answer present in the agent-visible input prompt |
| grader-isolation | CRITICAL | tool call targets judge/grader/witness paths (monkey-patching) |
| normalization-collision | WARN | degenerate/generic answer credited under aggressive normalization |
| voting-disclosure | WARN | hidden best-of-N not disclosed in metadata |
| split-integrity | WARN/INFO | validation-split (public gold) presented as held-out |
| answer-key-reads | CRITICAL | answer/gold/solution/ground-truth-shaped path read outside the sanctioned dataset dir (runner sources + produced artifacts) |
| dynamic-eval | CRITICAL | eval() / new Function() / exec of a non-literal (task-derivable) command in the gaia-bench runner sources |
| judge-injection | WARN | prompt-injection markers in the agent's produced answer/output aimed at the LLM judge |
The last three form the static source-scan family (ported from the reverted
#2547 duplicate): they scan the harness sources + produced artifacts rather than
the trajectory, so they enforce today with no instrumentation. answer-key-reads
and dynamic-eval are fail-closed; both skip (never false-pass) when the
runner sources cannot be located (e.g. run standalone) or under
--skip-source-scan. --repo-root overrides where the runner sources are found;
--dataset-dir (repeatable) sets the sanctioned dataset dirs (default
~/.cache/huggingface, ~/.cache/ruflo/gaia/dataset). dynamic-eval is tuned
to pass ruflo's own multi-line gcloud secrets … execSync calls (fixed-string
commands and RegExp.prototype.exec are not flagged). judge-injection is
output-side and distinct from oracle-leakage (input-side), so both run.
Exit 0 clean · exit 1 on any CRITICAL fail (or WARN fail under --strict) ·
exit 2 usage error. Checks whose data the trajectory schema does not yet
capture return skip with a harness_gap note (see ADR-167 §7) — they never
report a false pass.
Validating GAIA benchmark environment...
[PASS] ANTHROPIC_API_KEY set (sk-ant-...abc3)
[PASS] HF_TOKEN set (hf_...xyz9)
[WARN] GOOGLE_AI_API_KEY not set — Gemini routing disabled
[WARN] GOOGLE_CUSTOM_SEARCH_API_KEY not set — web_search using DuckDuckGo fallback
[PASS] TypeScript build clean (0 errors)
[PASS] HF dataset reachable (1 question fetched)
[PASS] Witness manifest valid (Ed25519 verified)
[PASS] All 5 benchmark source files present
[PASS] CLI binary resolves to v3.6.x
2 warnings (use --strict to fail on warnings)
Ready to run /gaia run
process.env first, then attempt
gcloud secrets versions access latest --secret=<name> silently.npx tsc --noEmit in the CLI package directory; capture stderr.node … gaia-bench run --smoke-only --limit=1 --dry-run.--audit is set and a results file exists, run
plugins/ruflo-workflows/scripts/gaia-audit.mjs (ADR-167); treat a CRITICAL
audit failure as an error and a WARN failure as a warning.--strict is set in which case warnings also cause exit 1.