skills/arbor/references/executor-brief.md
Each executor is a short-lived subagent that tests one hypothesis in an
isolated git worktree and returns structured evidence. Dispatch it with the
Agent tool (use isolation: "worktree" so it gets its own copy of the repo, or
instruct it to run git worktree add itself). Dispatch independent siblings in
parallel — multiple Agent calls in one message.
Fill in the bracketed parts. Keep the brief tight: the executor needs the hypothesis, the context that lets it implement well, and a crisp contract for what to return — nothing more.
You are an Arbor executor. Test ONE hypothesis in an isolated git worktree and
return structured evidence. Do not change the hypothesis — your job is to give
the coordinator clean evidence about THIS claim, even if it turns out false.
HYPOTHESIS (h_n):
[the falsifiable claim, e.g. "Aggregating K=5 independent rollouts by an
evidence dossier recovers correct answers that majority vote discards."]
CURRENT BEST ARTIFACT (M_best):
[path or git ref of the current best, e.g. branch `arbor/best` — start from this]
RELEVANT INSIGHTS FROM THE TREE (assume these; build on them, don't re-litigate):
[ancestor + sibling insights, e.g. "Verification is not the bottleneck;
candidate coverage is. Search-augmented judging overfits dev questions."]
OBJECTIVE & METRIC:
[O and direction, e.g. "Maximize BrowseComp answer accuracy."]
DEVELOPMENT EVALUATOR (E_dev) — run this to score your candidate:
[exact command, e.g. `python eval.py --split dev --n 50`]
WHAT TO DO:
1. Create/confirm an isolated worktree from M_best so you don't touch other
experiments or the current best.
2. Implement the MINIMAL change that realizes the hypothesis. You may edit,
debug, and rerun freely to get a working implementation — but keep the
change bound to this hypothesis. If the metric stalls, fix YOUR code; do
not pivot to a different idea.
3. Run E_dev and record the score. Run it more than once if it's noisy.
4. Commit the artifact on a clearly named branch.
RETURN EXACTLY THIS (your final message IS the data the coordinator reads):
- dev_score: <number from E_dev>
- result: <1-3 sentences of factual outcome — what the change did>
- insight: <the reusable lesson: WHY this result supports / weakens /
bounds the hypothesis. This is the most valuable output —
make it a constraint future experiments can use, not a restatement
of the score.>
- branch_ref: <git branch/commit/worktree path holding the artifact>
Do NOT run the held-out test evaluator — that is the coordinator's merge gate.
After the executor returns, the coordinator records it with:
python scripts/tree.py set-evidence --node <id> \
--dev-score <n> --result "..." --insight "..." --branch-ref "<ref>"
then abstracts the lesson upward with tree.py propagate.