plugins/ruflo-workflows/skills/gaia-architecture-comparison/SKILL.md
Compare ruflo's GAIA benchmark harness against the Princeton HAL reference implementation and other open-source harnesses to understand capability gaps and prioritize improvements.
gaia-bench run
└─ gaia-loader.ts — HF dataset download + cache
└─ gaia-agent.ts — multi-turn Anthropic Messages loop
└─ gaia-tools/ — web_search, file_read, web_browse, image_describe, python_exec
└─ gaia-voting.ts — Track A self-consistency (N attempts → majority vote)
└─ gaia-hardness/ — Track Q difficulty predictor (ADR-136)
└─ gaia-judge.ts — two-stage LLM-as-judge scorer
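The multi-turn loop in the tree above can be pictured roughly as follows. This is a minimal sketch, not the code in gaia-agent.ts: the type names (`Model`, `Tool`, `ModelTurn`), the string transcript, and the `runAgentLoop` function are all hypothetical, and the model call is injected so the loop can be exercised without an API key.

```typescript
// Hypothetical types for illustration; none of these names come from the ruflo source.
type ToolCall = { name: string; input: string };
type ModelTurn =
  | { kind: "tool_call"; call: ToolCall }
  | { kind: "final_answer"; text: string };

type Tool = (input: string) => Promise<string>;
type Model = (transcript: string[]) => Promise<ModelTurn>;

interface LoopResult {
  answer: string | null; // null when the turn budget runs out
  turnsUsed: number;
}

async function runAgentLoop(
  question: string,
  model: Model,
  tools: Map<string, Tool>,
  maxTurns: number,
): Promise<LoopResult> {
  const transcript: string[] = [`user: ${question}`];
  for (let turn = 1; turn <= maxTurns; turn++) {
    const step = await model(transcript);
    if (step.kind === "final_answer") {
      return { answer: step.text, turnsUsed: turn };
    }
    const tool = tools.get(step.call.name);
    // Unknown tools are reported back to the model instead of throwing,
    // so one bad call does not abort the whole question.
    const output = tool
      ? await tool(step.call.input)
      : `error: unknown tool ${step.call.name}`;
    transcript.push(`tool_call: ${step.call.name}(${step.call.input})`);
    transcript.push(`tool_result: ${output}`);
  }
  return { answer: null, turnsUsed: maxTurns };
}
```

Injecting the model as a function keeps the turn-budget logic testable in isolation; a real harness would replace it with a call to the Anthropic Messages API and structured tool-use blocks rather than a flat string transcript.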
HAL uses a similar loop but differs along the following dimensions:
| Dimension | ruflo | HAL reference | Gap |
|---|---|---|---|
| Question count | 53 (partial L1) | 165 (full validation set) | Use --limit 165 for the full set |
| Web search | DuckDuckGo / Google CSE | BrowserBase live | Add Playwright or Browserless |
| Code execution | python_exec stub | Real Jupyter kernel | Implement real sandbox |
| Image OCR | image_describe (Gemini) | GPT-4V / Gemini | Functionally equivalent |
| File handling | file_read | Full PDF/XLSX/ZIP parser | Expand file_read |
| Self-consistency | voting.ts (Track A) | Not in reference | ruflo advantage |
| Hardness routing | predictor.ts (Track Q) | Not in reference | ruflo advantage |
| Memory | AgentDB HNSW | None | ruflo advantage |
| Pass-rate L1 | ~20.8% (iter 23) | 74.6% (HAL Sonnet 4.5) | ~54 pp gap |
Real code execution — many L2/L3 questions require running Python to
compute a numerical answer. The current python_exec tool is a stub.
Implementing a real sandbox (E2B, Pyodide, or subprocess) is the single
highest-ROI change.
Full question set — running only 53 questions makes the score hard to
compare with HAL's, and because the first 53 skew easier, the partial run
likely overstates the true pass-rate. Run --limit 165 (the full set)
for a HAL-comparable score.
Real browser — web_browse currently fetches raw HTML. Replacing it
with Playwright/Browserless for JavaScript-rendered pages would unlock
many web navigation questions.
Structured file parsing — PDF, XLSX, and ZIP attachments require
dedicated parsers. file_read currently handles plain text and images only.
Turn budget — 12 turns may be insufficient for complex multi-step questions. HAL uses up to 20 turns for L3.
System prompt tuning — HAL's system prompt is more elaborate and explicitly instructs the model to use tools before answering.
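For the real code execution gap listed first above, one way to start is with the tool-side wrapper and leave the backend swappable. The sketch below is an assumption-heavy illustration: `pythonExec`, `PythonBackend`, `ExecResult`, and the default limits are hypothetical names and values, not ruflo's actual tool contract. The backend (E2B, Pyodide, or a subprocess) is injected as a function, so this shows only the timeout and output-truncation handling around it.

```typescript
// Hypothetical contract; the names and defaults here are illustrative only.
interface ExecResult {
  ok: boolean;
  output: string;
}

// A backend runs Python source and resolves with its stdout.
type PythonBackend = (code: string) => Promise<string>;

function truncate(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + `\n[truncated ${text.length - maxChars} chars]`;
}

async function pythonExec(
  code: string,
  backend: PythonBackend,
  timeoutMs = 30000,
  maxChars = 4000,
): Promise<ExecResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  try {
    // Note: losing the race does not stop the backend; a real sandbox
    // must also kill the underlying process or session.
    const stdout = await Promise.race([backend(code), timeout]);
    return { ok: true, output: truncate(stdout, maxChars) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // Errors are returned as tool output so the model can read and react to them.
    return { ok: false, output: truncate(`error: ${message}`, maxChars) };
  } finally {
    clearTimeout(timer);
  }
}
```

Returning failures as data rather than throwing lets the agent see a traceback or timeout message and retry with corrected code, which is usually what a benchmark run wants.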
Self-consistency voting (Track A) — running N attempts per question and taking the majority answer reduces variance on borderline questions. HAL does not implement this.
Hardness routing (Track Q) — routing each question to an appropriate model and turn budget based on predicted difficulty. This reduces cost on easy questions while providing more resources for hard ones.
AgentDB memory — storing patterns across runs enables the agent to recall successful strategies for similar question types.
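The self-consistency voting step (Track A) described above amounts to a majority vote over N attempts. The following is a minimal sketch under stated assumptions: `majorityVote`, `normalize`, and the tie-breaking rule (earliest answer wins) are illustrative choices, not taken from gaia-voting.ts, and the normalization is deliberately simpler than any official GAIA scorer.

```typescript
// Assumed normalization: trim, lowercase, collapse whitespace.
function normalize(answer: string): string {
  return answer.trim().toLowerCase().replace(/\s+/g, " ");
}

interface Tally {
  key: string;      // normalized form used for counting
  original: string; // first raw answer seen for this key
  count: number;
}

interface VoteResult {
  answer: string | null; // null when no attempt produced an answer
  votes: number;
  total: number;
}

function majorityVote(attempts: Array<string | null>): VoteResult {
  const tallies: Tally[] = [];
  for (const attempt of attempts) {
    if (attempt === null) continue; // failed attempts cast no vote
    const key = normalize(attempt);
    if (key === "") continue;
    let found = false;
    for (const tally of tallies) {
      if (tally.key === key) {
        tally.count += 1;
        found = true;
        break;
      }
    }
    if (!found) tallies.push({ key, original: attempt.trim(), count: 1 });
  }
  let best: Tally | null = null;
  // Strict ">" keeps the earliest answer on ties, since tallies are in first-seen order.
  for (const tally of tallies) {
    if (best === null || tally.count > best.count) best = tally;
  }
  return {
    answer: best ? best.original : null,
    votes: best ? best.count : 0,
    total: attempts.length,
  };
}
```

For example, `majorityVote(["Paris", " paris ", "London", null])` returns `{ answer: "Paris", votes: 2, total: 4 }`. The `votes / total` ratio can double as a rough confidence signal for borderline questions.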
| Priority | Change | Expected Lift | Effort |
|---|---|---|---|
| P0 | Real python_exec sandbox (E2B) | +15-25 pp | High |
| P0 | Full 165-Q L1 evaluation | Accurate baseline | Low |
| P1 | Playwright-based web_browse | +5-10 pp | Medium |
| P1 | PDF/XLSX file parser | +3-8 pp | Medium |
| P2 | Increase max-turns to 20 for L2/L3 | +2-5 pp | Low |
| P2 | System prompt tuning (iter 30 research) | +2-5 pp | Low |
| P3 | Google Grounding via Gemini (iter 32) | +3-7 pp | Medium |
| P3 | Multi-provider routing (Gemini Flash for cheap Q's) | Cost reduction | Medium |
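The max-turns and multi-provider rows in the table, combined with Track Q routing, suggest a per-question routing table keyed on predicted difficulty. The sketch below is purely illustrative: the model ids are placeholders, the `attempts` values are assumptions, and only the 12- and 20-turn budgets come from the text above. How the predictor produces a level is out of scope here.

```typescript
type GaiaLevel = 1 | 2 | 3;

interface Route {
  model: string;    // placeholder id, not a real model name
  maxTurns: number; // turn budget for the agent loop
  attempts: number; // N for self-consistency voting (assumed values)
}

// Hypothetical routing table: cheap and short for easy questions,
// stronger model and longer budget for harder ones.
const ROUTES: Record<GaiaLevel, Route> = {
  1: { model: "cheap-model", maxTurns: 12, attempts: 1 },
  2: { model: "strong-model", maxTurns: 20, attempts: 3 },
  3: { model: "strong-model", maxTurns: 20, attempts: 3 },
};

function routeQuestion(predictedLevel: GaiaLevel): Route {
  return ROUTES[predictedLevel];
}

// Upper bound on model turns for a batch, useful for a rough cost estimate.
function totalTurnBudget(levels: GaiaLevel[]): number {
  let total = 0;
  for (const level of levels) {
    const route = routeQuestion(level);
    total += route.maxTurns * route.attempts;
  }
  return total;
}
```

Keeping the table as plain data makes it easy to tune budgets between iterations without touching the agent loop.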
npx @claude-flow/cli@latest memory search \
  --namespace gaia-patterns \
  --query "architecture comparison HAL benchmark"

npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "architecture-comparison-$(date +%Y%m%d)" \
  --value "HAL gap: 54pp. Primary: python_exec stub. Secondary: browser, file parsing."