Back to Ruflo

GAIA Debugging Skill

plugins/ruflo-workflows/skills/gaia-debugging/SKILL.md

3.10.133.8 KB
Original Source

GAIA Debugging Skill

When a GAIA question fails, systematically diagnose the root cause and propose a targeted fix.

When to use

  • A specific task_id returns the wrong answer or times out
  • Pass-rate dropped between two runs and you need to find the regression
  • You want to understand why a particular question class is consistently failing

Failure mode taxonomy

CodeModeSymptomFix direction
TGTool GapAgent lacks a required tool (no image OCR, no PDF reader)Add tool to catalogue
RMReasoning MissAgent has the right data but draws wrong conclusionImprove system prompt, add CoT instruction
EBExtraction BugAnswer is in the trace but FINAL_ANSWER: regex failsFix answer extraction pattern
LILoop IssueAgent loops (re-asks same tool call) and hits turn limitIncrease max-turns or add loop-detection
DSDataset ShiftGround truth differs from what web currently showsFlag for HAL dataset audit
ATAPI TimeoutTool call times out; agent never gets the resultIncrease per-turn timeout

Diagnostic workflow

Step 1 — Load the question trace

bash
# Find the result for the task_id in the latest run
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
  const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
  const q = r.results.find(x => x.task_id === '$TASK_ID');
  console.log(JSON.stringify(q, null, 2));
"

Step 2 — Classify the failure

Look at the trace output:

  1. No tools called at all → RM or configuration issue
  2. Tool called but returned error → TG or AT
  3. Tool returned data, wrong answer → RM or EB
  4. Correct answer in trace but marked wrong → EB
  5. max-turns hit → LI or question too hard for current model

Step 3 — Re-run with extended logging

bash
node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
  --level 1 --limit 1 \
  --task-id $TASK_ID \
  --models claude-sonnet-4-6 \
  --max-turns 20 \
  --output json

Step 4 — Apply targeted fix

FailureAction
TG — missing web_browseVerify gaia-tools/index.ts exports web_browse; check tool registration
TG — missing image OCRAdd image_describe tool call; verify GOOGLE_AI_API_KEY
RM — reasoningAdd a system prompt instruction: "Before answering, list all facts you have gathered"
EB — extractionTest the FINAL_ANSWER_RE regex against the trace manually
LI — loopAdd a tool-call deduplication guard in gaia-agent.ts
AT — timeoutSet DEFAULT_PER_TURN_TIMEOUT_MS higher or use --max-turns flag

Step 5 — Verify fix and store pattern

bash
# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json

# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
  --value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"

Quick reference: tool catalogue check

bash
node -e "
  const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
  const cat = createDefaultToolCatalogue({});
  console.log('Tools registered:', cat.definitions.map(t => t.name));
"

Expected: web_search, file_read, web_browse, image_describe, python_exec

Pattern storage

After resolving a debugging session, store the finding:

bash
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "session-$(date +%Y%m%d-%H%M)" \
  --value '{"task_id":"$TASK_ID","failure_mode":"$CODE","fix":"$FIX","verified":true}'

Search for similar past failures:

bash
npx @claude-flow/cli@latest memory search \
  --namespace gaia-debug-patterns \
  --query "extraction bug final answer regex"