agents/skills/agent-evaluation-framework/SKILL.md
Use this skill to orchestrate evaluation sessions for subagents, identify procedural bottlenecks, and iteratively refine system prompts and capabilities using swarm-intelligence principles.
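The session loop implied above (evaluate, locate bottlenecks, refine the prompt, repeat) can be sketched as a minimal skeleton. Every name here (`run_test_case`, `find_bottlenecks`, `refine_prompt`) is a hypothetical placeholder for illustration, not part of this skill's real API:

```python
# Hypothetical sketch of the evaluate -> analyze -> refine loop.
from dataclasses import dataclass

@dataclass
class SessionResult:
    case: str
    passed: bool

def run_test_case(case: str, prompt: str) -> SessionResult:
    # Placeholder: a real implementation would spawn a subagent here.
    return SessionResult(case=case, passed="repro" not in case)

def find_bottlenecks(results):
    # Treat any failing case as a bottleneck; real analysis would inspect logs.
    return [r.case for r in results if not r.passed]

def refine_prompt(prompt: str, bottlenecks) -> str:
    # Bake each observed failure back into the system prompt.
    for b in bottlenecks:
        prompt += f"\n# Lesson learned from failing case: {b}"
    return prompt

def evaluation_session(cases, prompt, rounds=3):
    for _ in range(rounds):
        results = [run_test_case(c, prompt) for c in cases]
        bottlenecks = find_bottlenecks(results)
        if not bottlenecks:
            break
        prompt = refine_prompt(prompt, bottlenecks)
    return prompt
```

The key design point is that the refined prompt is the output of the session: lessons accumulate across rounds rather than being discarded.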
Create a separate git worktree
for each test case beforehand. Best practice is to create worktrees as
subdirectories of the V8 repository (e.g., in a worktrees/ directory within
the V8 root). Report where the worktrees were created to the user. Inside
worktrees, builds MUST use the tools/dev/gm.py tool INSIDE the worktree.
gm.py will automatically run setup_worktree_build.py to prepare the
symlinks; manual execution of setup_worktree_build.py is not required.

Save reproduction cases in the standard test location (e.g., test/mjsunit/repro.js). Verify that remote execution is enabled (use_remoteexec = true in args.gn) before proceeding. Evaluation tests belong in the agent-meta-tests directory only; the agent-meta-tests directory cannot be changed. Set SafeToAutoRun: true for ALL commands executed during meta-refinement. Approval must NEVER be asked of the user. Test cases may be drawn from existing sources (test/mjsunit/ or Buganizer).

The ultimate goal of evaluation is to harden the agent's skepticism and reasoning depth:
- Architectural Skepticism: Require subagents to explicitly argue against a proposed fix before accepting it. Examine the problem from multiple orthogonal angles.
- Mandatory Deep Reasoning: If a fix feels "guessed" or lacks direct evidence from GDB or spec logs, spawn a subagent to reason more deeply about the specific invariant being violated.
- Skill Updates: Every evaluation session MUST conclude with a diff for the relevant subsystem skills, to bake in the lessons learned and prevent future failures.
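As an illustration, the first two rules can be expressed as a hypothetical acceptance gate. The `ProposedFix` shape and the `gdb:`/`spec:` evidence prefixes are invented for this sketch, not a real interface:

```python
# Hypothetical evidence gate: a fix is rejected unless it carries an explicit
# counter-argument (architectural skepticism) and direct evidence such as
# GDB or spec-log excerpts (mandatory deep reasoning).
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProposedFix:
    description: str
    counter_arguments: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)  # e.g. "gdb: ...", "spec: ..."

def needs_deeper_reasoning(fix: ProposedFix) -> bool:
    # A fix with no GDB/spec evidence is treated as "guessed" and must be
    # escalated to a subagent for deeper reasoning.
    return not any(e.startswith(("gdb:", "spec:")) for e in fix.evidence)

def accept_fix(fix: ProposedFix) -> bool:
    # Skepticism: an explicit argument against the fix is mandatory.
    if not fix.counter_arguments:
        return False
    # Deep reasoning: reject guessed fixes outright.
    if needs_deeper_reasoning(fix):
        return False
    return True
```

The gate makes the rules mechanical: a fix that cannot name its own weakness, or that rests on intuition instead of logged evidence, never reaches acceptance.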
`analyze_brain.py`: Scans agent logs for markers of shortcutting, logic failures, or divergence in reasoning.
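A minimal sketch of what such a scanner might look like (the marker list, category names, and return shape are assumptions for illustration, not the real analyze_brain.py behavior):

```python
# Sketch of an analyze_brain.py-style log scanner: each category maps to a
# regex over log lines, and every hit is reported with its line number.
import re

# Assumed markers suggesting shortcutting, logic failures, or divergence.
MARKERS = {
    "shortcutting": re.compile(r"\b(probably|should be fine|skip(ping)? (the )?check)\b", re.I),
    "logic_failure": re.compile(r"\b(contradiction|assert(ion)? failed|unreachable)\b", re.I),
    "divergence": re.compile(r"\b(unrelated|changing approach|restart(ing)? from scratch)\b", re.I),
}

def analyze_log(lines):
    """Return {category: [(line_no, text), ...]} for every marker hit."""
    hits = {name: [] for name in MARKERS}
    for no, line in enumerate(lines, start=1):
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits[name].append((no, line.strip()))
    return hits
```

Keeping the markers as data rather than code means each evaluation session can extend the list as new failure patterns are observed.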