agents/skills/agent-evaluation-framework/SKILL.md
Use this skill to orchestrate evaluation sessions for subagents, identify procedural bottlenecks, and iteratively refine system prompts and capabilities utilizing Swarm intelligence principles.
agents/scripts/create_worktree.sh <task_id> for each test case beforehand.
Report where the worktrees were created to the user. Inside worktrees, builds
MUST use the tools/dev/gm.py tool INSIDE the worktree. gm.py will
automatically run setup_worktree_build.py to prepare the symlinks; manual
execution of setup_worktree_build.py is not required.
test/mjsunit/repro.js).
use_remoteexec = true in args.gn) before proceeding.
agent-meta-tests only.
agent-meta-tests directory cannot be changed.
SafeToAutoRun: true for ALL commands executed during meta-refinement. Approval must NEVER be asked of the user.
test/mjsunit/ or Buganizer).
The ultimate goal of evaluation is to harden the agent's skepticism and reasoning depth:
Architectural Skepticism: Require subagents to explicitly argue against a proposed fix before accepting it. Look at the problem from multiple orthogonal angles.
Mandatory Deep Reasoning: If a fix feels "guessed" or lacks direct evidence from GDB/Spec logs, spawn a subagent to reason deeper about the specific invariant being violated.
Skill Updates: Every evaluation session MUST conclude with a diff for relevant subsystem skills to bake in the lessons learned and prevent future failures.
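One way to produce the closing diff for a skill file is Python's standard `difflib` module. The sketch below is an illustration only: the helper name `skill_diff` and the default path `SKILL.md` are hypothetical, and the skill itself does not prescribe how the diff is generated.

```python
import difflib


def skill_diff(old_text, new_text, path="SKILL.md"):
    """Return a unified diff between two versions of a skill file.

    `path` is a hypothetical label used only in the diff headers.
    """
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    # unified_diff yields lines lazily; join them into one printable string.
    return "".join(diff)


if __name__ == "__main__":
    before = "Check the invariant.\n"
    after = "Check the invariant.\nArgue against the fix before accepting it.\n"
    print(skill_diff(before, after), end="")
```

An empty return value means the session produced no change to that skill, which is worth flagging given that every session is expected to end with a diff.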
analyze_brain.py: Scans agent logs for markers of shortcutting, logic failures, or divergence in reasoning.
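The kind of scan described for analyze_brain.py can be sketched as a pattern match over log lines. This is not the actual script: the marker phrases, category names, and function names below are all assumptions chosen for illustration, and the real tool may detect these conditions quite differently.

```python
import re
from collections import Counter

# Hypothetical marker patterns, one per category named in the description.
MARKERS = {
    "shortcutting": re.compile(
        r"\b(skipping|assume it works|should be fine)\b", re.IGNORECASE
    ),
    "logic_failure": re.compile(
        r"\b(contradiction|does not follow)\b", re.IGNORECASE
    ),
    "divergence": re.compile(
        r"\b(unrelated|off[- ]topic|changing approach)\b", re.IGNORECASE
    ),
}


def scan_log(text):
    """Return (line_number, category, line) for every line matching a marker."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for category, pattern in MARKERS.items():
            # A line is reported at most once per category.
            if pattern.search(line):
                hits.append((lineno, category, line.strip()))
    return hits


def summarize(hits):
    """Count hits per category."""
    return Counter(category for _, category, _ in hits)


if __name__ == "__main__":
    sample = (
        "Step 1: read the spec\n"
        "Skipping the GDB check, should be fine\n"
        "This is a contradiction with step 1\n"
    )
    for lineno, category, line in scan_log(sample):
        print(f"{lineno}: [{category}] {line}")
```

Keyword matching of this sort is a coarse first pass. Flagged lines would still need review, since a phrase like "should be fine" can appear in sound reasoning.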