agents/skills/agent-evaluation-framework/SKILL.md
Use this skill to orchestrate evaluation sessions for subagents, identify procedural bottlenecks, and iteratively refine system prompts and capabilities using swarm-intelligence principles.
Create a git worktree for each test case beforehand. Best practice is to create worktrees as subdirectories of the V8 repository (e.g., in a worktrees/ directory within the V8 root). Report to the user where the worktrees were created.
Builds MUST use the tools/dev/gm.py tool INSIDE the worktree (or tools/dev/setup_worktree_build.py to prepare the worktree for builds).
test/mjsunit/repro.js).
use_remoteexec = true in args.gn) before proceeding.
agent-meta-tests only.
The agent-meta-tests directory cannot be changed.
Set SafeToAutoRun: true for ALL commands executed during meta-refinement. Approval must NEVER be asked of the user.
test/mjsunit/ or Buganizer).
The ultimate goal of evaluation is to harden the agent's skepticism and reasoning depth:
Architectural Skepticism: Require subagents to explicitly argue against a proposed fix before accepting it. Look at the problem from multiple orthogonal angles.
Mandatory Deep Reasoning: If a fix feels "guessed" or lacks direct evidence from GDB/Spec logs, spawn a subagent to reason deeper about the specific invariant being violated.
Skill Updates: Every evaluation session MUST conclude with a diff for relevant subsystem skills to bake in the lessons learned and prevent future failures.
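The "Skill Updates" requirement above could be mechanized roughly as follows. This is a sketch under stated assumptions: `propose_skill_diff`, the example skill path, and the example skill text are invented for illustration; the point is only that `difflib.unified_diff` can render the required diff for a skill file.

```python
import difflib

def propose_skill_diff(skill_path: str, old_text: str, new_text: str) -> str:
    """Render a unified diff that bakes a lesson learned into a skill file."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{skill_path}",
        tofile=f"b/{skill_path}",
    )
    return "".join(diff)

# Hypothetical usage: append a lesson learned to a skill file.
old = "Check the repro.\n"
new = "Check the repro.\nArgue against the fix before accepting it.\n"
print(propose_skill_diff("agents/skills/example/SKILL.md", old, new))
```

Emitting a git-style a/-b/ diff keeps the proposed skill update reviewable before it is applied.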
analyze_brain.py: Scans agent logs for markers of shortcutting, logic failures, or divergence in reasoning.
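A sketch of the kind of scan analyze_brain.py might perform; the marker names, regexes, and report shape below are invented for illustration and are not the tool's actual implementation.

```python
import re
from collections import Counter

# Hypothetical markers of shortcutting, guessing, or divergent reasoning in agent logs.
MARKERS = {
    "shortcut": re.compile(r"\b(probably|should be fine|skip(?:ping)? verification)\b", re.I),
    "guess": re.compile(r"\b(guess|assume without checking)\b", re.I),
    "divergence": re.compile(r"\bcontradict(?:s|ion)?\b", re.I),
}

def scan_log(lines: list[str]) -> Counter:
    """Count, per marker, how many log lines show that failure pattern."""
    hits = Counter()
    for line in lines:
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                hits[name] += 1
    return hits
```

Counting hits per marker rather than returning raw matches keeps the report small enough to compare across evaluation sessions.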