GAIA Architecture Comparison Skill

plugins/ruflo-workflows/skills/gaia-architecture-comparison/SKILL.md

Compare ruflo's GAIA benchmark harness against the Princeton HAL reference implementation and other open-source harnesses to understand capability gaps and prioritize improvements.

When to use

  • Planning the next iteration of GAIA work
  • Evaluating which architectural change has the highest pass-rate ROI
  • Onboarding a new contributor to the benchmark codebase

Architecture overview

ruflo harness (current)

gaia-bench run
  └─ gaia-loader.ts      — HF dataset download + cache
  └─ gaia-agent.ts       — multi-turn Anthropic Messages loop
       └─ gaia-tools/    — web_search, file_read, web_browse,
                           image_describe, python_exec
  └─ gaia-voting.ts      — Track A self-consistency (N attempts → majority vote)
  └─ gaia-hardness/      — Track Q difficulty predictor (ADR-136)
  └─ gaia-judge.ts       — two-stage LLM-as-judge scorer

HAL reference (Princeton)

HAL uses a similar loop but with:

  • OpenAI function calling as the tool interface
  • BrowserBase / Playwright for real browser automation
  • Code interpreter sandbox (Jupyter kernel)
  • Larger token budget per turn (4096+)
  • Full 300-question evaluation set

Key differences

| Dimension | ruflo | HAL reference | Gap |
| --- | --- | --- | --- |
| Question count | 53 (partial L1) | 300 (full L1) | Use --limit 165 for full L1 |
| Web search | DuckDuckGo / Google CSE | BrowserBase live | Add Playwright or Browserless |
| Code execution | python_exec stub | Real Jupyter kernel | Implement real sandbox |
| Image OCR | image_describe (Gemini) | GPT-4V / Gemini | Functionally equivalent |
| File handling | file_read | Full PDF/XLSX/ZIP parser | Expand file_read |
| Self-consistency | voting.ts (Track A) | Not in reference | ruflo advantage |
| Hardness routing | predictor.ts (Track Q) | Not in reference | ruflo advantage |
| Memory | AgentDB HNSW | None | ruflo advantage |
| Pass-rate L1 | ~20.8% (iter 23) | 74.6% (HAL Sonnet 4.5) | ~54 pp gap |

Gap analysis

Primary gaps (high impact)

  1. Real code execution — many L2/L3 questions require running Python to compute a numerical answer. The current python_exec tool is a stub. Implementing a real sandbox (E2B, Pyodide, or subprocess) is the single highest-ROI change.

  2. Full question set — running only 53/300 L1 questions gives an unrepresentative pass-rate: the first 53 skew easier, so the subset score likely overstates what the full set would show. Run --limit 165 (full L1) for a score comparable to HAL's.

  3. Real browser — web_browse currently fetches raw HTML. Replacing it with Playwright/Browserless for JavaScript-rendered pages would unlock many web navigation questions.
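The subprocess option from the first gap can be sketched as follows. This is a minimal illustration, not ruflo's implementation: `runSnippet`, `ExecResult`, and the `python3 -c` default are assumptions, and a child process with a timeout limits runaway code but does not isolate it from the host the way E2B or a container would.

```typescript
import { spawnSync } from "node:child_process";

// Result shape is hypothetical; ruflo's real tool contract may differ.
interface ExecResult {
  ok: boolean;
  stdout: string;
  stderr: string;
}

// Run a code snippet in a child process with a hard timeout.
// `command` and `args` are parameters so the interpreter can be swapped;
// "python3 -c" is an assumed default, not ruflo's configuration.
function runSnippet(
  code: string,
  command: string = "python3",
  args: string[] = ["-c"],
  timeoutMs: number = 10_000,
): ExecResult {
  const result = spawnSync(command, [...args, code], {
    encoding: "utf8",
    timeout: timeoutMs,
    maxBuffer: 1024 * 1024, // cap captured output at 1 MiB
  });
  if (result.error) {
    // Covers a missing interpreter (ENOENT), timeouts, and output overflow.
    return { ok: false, stdout: result.stdout ?? "", stderr: String(result.error) };
  }
  return { ok: result.status === 0, stdout: result.stdout, stderr: result.stderr };
}
```

Returning a structured result instead of throwing lets the agent loop feed stderr back to the model so it can correct its own code on the next turn.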

Secondary gaps (medium impact)

  1. Structured file parsing — PDF, XLSX, and ZIP attachments require dedicated parsers. file_read currently handles plain text and images only.

  2. Turn budget — 12 turns may be insufficient for complex multi-step questions. HAL uses up to 20 turns for L3.

  3. System prompt tuning — HAL's system prompt is more elaborate and explicitly instructs the model to use tools before answering.
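For the structured file parsing gap, one possible first step is to classify each attachment by extension before choosing a parser. The sketch below is an assumption about how that dispatch could look; `FileKind`, `classifyAttachment`, and the extension list are hypothetical names, and the parsers themselves are out of scope.

```typescript
// Hypothetical categories; only "text" and "image" are handled by file_read today.
type FileKind = "text" | "image" | "pdf" | "spreadsheet" | "archive" | "unsupported";

// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const KIND_BY_EXTENSION = new Map<string, FileKind>([
  ["txt", "text"],
  ["md", "text"],
  ["csv", "text"],
  ["png", "image"],
  ["jpg", "image"],
  ["jpeg", "image"],
  ["pdf", "pdf"],
  ["xlsx", "spreadsheet"],
  ["zip", "archive"],
]);

// Map a file name to the kind of parser it would need.
function classifyAttachment(fileName: string): FileKind {
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return "unsupported"; // no extension, or a dotfile like ".env"
  const extension = fileName.slice(dot + 1).toLowerCase();
  return KIND_BY_EXTENSION.get(extension) ?? "unsupported";
}
```

Reporting "unsupported" explicitly makes it possible to count how many benchmark questions fail only because of a missing parser.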

ruflo advantages

  1. Self-consistency voting (Track A) — running N attempts per question and taking the majority answer reduces variance on borderline questions. HAL does not implement this.

  2. Hardness routing (Track Q) — routing each question to an appropriate model and turn budget based on predicted difficulty. This reduces cost on easy questions while providing more resources for hard ones.

  3. AgentDB memory — storing patterns across runs enables the agent to recall successful strategies for similar question types.
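The first two advantages can be illustrated with a short sketch. It is not the code in gaia-voting.ts or gaia-hardness/: the function names, the 0.5 threshold, and the model labels are placeholders chosen for illustration. The turn budgets of 12 and 20 come from the figures quoted in the gap analysis.

```typescript
// Normalize an answer so trivially different strings count as the same vote.
function normalize(answer: string): string {
  return answer.trim().toLowerCase().replace(/\s+/g, " ");
}

// Return the most frequent normalized answer; ties go to the answer seen first.
function majorityVote(attempts: string[]): string | undefined {
  const counts = new Map<string, number>();
  for (const attempt of attempts) {
    const key = normalize(attempt);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best: string | undefined;
  let bestCount = 0;
  // Map.forEach visits keys in insertion order, which gives the tie-break.
  counts.forEach((count, key) => {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  });
  return best;
}

interface RunPlan {
  model: string; // placeholder label, not a real model identifier
  maxTurns: number;
}

// Pick a model and turn budget from a predicted difficulty in [0, 1].
// The 0.5 cut-off is an assumed value, not a tuned one.
function planFor(predictedDifficulty: number): RunPlan {
  return predictedDifficulty < 0.5
    ? { model: "cheap-model", maxTurns: 12 }
    : { model: "strong-model", maxTurns: 20 };
}
```

For example, `majorityVote(["Paris", " paris ", "London"])` returns `"paris"`, because the first two attempts normalize to the same string.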

Improvement roadmap

| Priority | Change | Expected Lift | Effort |
| --- | --- | --- | --- |
| P0 | Real python_exec sandbox (E2B) | +15-25 pp | High |
| P0 | Full 165-Q L1 evaluation | Accurate baseline | Low |
| P1 | Playwright-based web_browse | +5-10 pp | Medium |
| P1 | PDF/XLSX file parser | +3-8 pp | Medium |
| P2 | Increase max-turns to 20 for L2/L3 | +2-5 pp | Low |
| P2 | System prompt tuning (iter 30 research) | +2-5 pp | Low |
| P3 | Google Grounding via Gemini (iter 32) | +3-7 pp | Medium |
| P3 | Multi-provider routing (Gemini Flash for cheap Q's) | Cost reduction | Medium |
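As a sanity check on the roadmap, the lift ranges can be summed and compared with the L1 gap. The sketch below assumes the lifts are independent and additive, which is optimistic since several changes likely help the same questions; treat the total as an upper bound. The variable and function names are illustrative.

```typescript
// Lift ranges in percentage points, copied from the roadmap table.
// Rows whose benefit is not a pass-rate lift (baseline accuracy, cost) are omitted.
const liftRanges: Array<{ change: string; low: number; high: number }> = [
  { change: "Real python_exec sandbox", low: 15, high: 25 },
  { change: "Playwright-based web_browse", low: 5, high: 10 },
  { change: "PDF/XLSX file parser", low: 3, high: 8 },
  { change: "max-turns 20 for L2/L3", low: 2, high: 5 },
  { change: "System prompt tuning", low: 2, high: 5 },
  { change: "Google Grounding via Gemini", low: 3, high: 7 },
];

// ASSUMPTION: lifts are independent and additive.
function totalLift(
  items: Array<{ low: number; high: number }>,
): { low: number; high: number } {
  return items.reduce(
    (sum, item) => ({ low: sum.low + item.low, high: sum.high + item.high }),
    { low: 0, high: 0 },
  );
}

const gapPp = 74.6 - 20.8; // HAL L1 pass-rate minus ruflo L1 pass-rate
const total = totalLift(liftRanges);
console.log(
  `gap: ${gapPp.toFixed(1)} pp, naive roadmap total: +${total.low} to +${total.high} pp`,
);
// prints "gap: 53.8 pp, naive roadmap total: +30 to +60 pp"
```

Under this assumption the gap only closes if most items land near the top of their ranges, which supports treating the P0 sandbox work as the priority.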

Loading context from past research

```bash
npx @claude-flow/cli@latest memory search \
  --namespace gaia-patterns \
  --query "architecture comparison HAL benchmark"
```

Storing comparison findings

```bash
npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "architecture-comparison-$(date +%Y%m%d)" \
  --value "HAL gap: 54pp. Primary: python_exec stub. Secondary: browser, file parsing."
```