skills/arbor/references/htr-methodology.md
Background reference for the arbor skill. Source: Toward Generalist
Autonomous Research via Hypothesis-Tree Refinement (Jin et al., 2026,
arXiv:2606.11926; code: github.com/RUC-NLPIR/Arbor). Read this when you want
the reasoning behind a design choice in the main loop.
AO is the operational core of autonomous research. An agent starts from an
initial artifact and a research objective, then improves the artifact through
experimental feedback without step-level human supervision. Formally a task
is a tuple P = (M_0, O, E_dev, E_test):
- M_0 is the mutable initial material (usually a codebase + its data).
- O is the objective: what "better" means, as a metric direction over the
  artifact's output.
- E_dev is the development evaluator the agent may use freely during search.
- E_test is the held-out test evaluator. Same objective, different evidence.

The goal is to return M* = argmax over candidates of S_test(M'), subject to
the constraint that hypotheses and implementation decisions are made without
using E_test as an exploration oracle. A candidate that exploits dev-split
idiosyncrasies may raise S_dev but is not a successful AO solution unless the
gain also transfers to S_test.
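The dev/test separation above can be sketched in a few lines. This is a minimal illustration, assuming evaluators are plain callables that return a score where higher is better; the names `Task`, `search_on_dev`, and `gain_transfers` are hypothetical and do not come from the paper or the Arbor codebase, and the toy scores are made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumption for this sketch: an artifact is a dict, and an evaluator maps
# an artifact to a float where higher is better under the objective O.
Artifact = dict
Evaluator = Callable[[Artifact], float]


@dataclass(frozen=True)
class Task:
    """P = (M_0, O, E_dev, E_test); O is folded into 'higher is better'."""
    m0: Artifact
    e_dev: Evaluator
    e_test: Evaluator


def search_on_dev(task: Task, candidates: Iterable[Artifact]) -> Artifact:
    """Pick the candidate with the best dev score; E_test is never called."""
    return max(candidates, key=task.e_dev, default=task.m0)


def gain_transfers(task: Task, candidate: Artifact) -> bool:
    """True only if the candidate beats M_0 on the held-out evaluator."""
    return task.e_test(candidate) > task.e_test(task.m0)


# Toy evaluators: dev rewards an overfit 'trick', test does not.
task = Task(
    m0={"quality": 1.0, "trick": 0.0},
    e_dev=lambda m: m["quality"] + m["trick"],
    e_test=lambda m: m["quality"],
)
overfit = {"quality": 0.9, "trick": 2.0}
honest = {"quality": 1.5, "trick": 0.0}

picked = search_on_dev(task, [overfit, honest])
```

Here `picked` is the overfit candidate because it has the higher dev score, yet `gain_transfers(task, picked)` is false. That is the case the constraint rules out: a dev-only gain is not a successful solution.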
Why this is hard: feedback is delayed, experiments are expensive, and failed attempts contain information that should guide later search. If an agent treats each trial as an independent local attempt, it loses the structure of the research process — what was tried, what evidence came back, how each result reshapes the space of future hypotheses.
HTR is built to satisfy three requirements that ordinary agentic tool use does not:
A rooted tree T = (V, E). Each node is a research unit n = <h_n, iota_n, mu_n>:
- h_n is a verifiable/falsifiable claim about how changing the
  material improves the objective. Granularity tracks depth: nodes near the root
  are broad directions; deeper nodes are concrete interventions an executor can
  implement and evaluate. This organizes exploration as progressive refinement
  rather than a flat sequence of independent trials.
- iota_n is the reusable interpretation of evidence. For an
  executed leaf: what was tried, what happened, and why the result supports,
  weakens, or constrains the hypothesis. For an internal node: an abstraction
  over its children's insights, i.e. the current understanding of that direction.
  It is not an execution transcript; it is compact semantic memory for later
  ideation and selection.
- mu_n connects the semantic hypothesis to executable evidence:
  node status, dev score, factual result, implementation reference (git branch
  or commit), optional background. The material itself is not duplicated in
  the tree; the tree holds only references to external artifact states produced
  in isolated worktrees. This keeps the state compact while every hypothesis
  stays grounded in a verifiable implementation.

Internal nodes hold abstract directions and accumulated lessons; leaves hold candidate interventions to dispatch. After a leaf executes, its score, result, artifact ref, and insight are written back, and the insight is propagated upward along the path to the root. Through this abstraction, local outcomes become direction-level lessons and eventually a compact global understanding.
The tree therefore plays three roles at once: a search frontier (which directions are active/validated/pruned), a long-term memory (reusable evidence from successes and failures), and an auditable record (each artifact change linked to the hypothesis and evidence that motivated it).
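The node structure and the write-back step described above might look roughly like the following. This is a sketch under stated assumptions, not the Arbor implementation: the field names, the `write_back` helper, and the string-joining `summarize` placeholder (a real system would presumably use a model to abstract child insights) are all hypothetical, as are the toy hypotheses and scores.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One research unit n = <h_n, iota_n, mu_n>."""
    hypothesis: str                      # h_n
    insight: str = ""                    # iota_n
    status: str = "pending"              # mu_n: node status
    dev_score: Optional[float] = None    # mu_n: dev score
    artifact_ref: Optional[str] = None   # mu_n: git branch or commit
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add_child(self, hypothesis: str) -> "Node":
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child


def summarize(insights: List[str]) -> str:
    """Placeholder abstraction step (assumption): just joins child insights."""
    return " | ".join(insights)


def write_back(leaf: Node, score: float, ref: str, insight: str) -> None:
    """Record leaf evidence, then refresh insights along the path to the root."""
    leaf.status, leaf.dev_score = "executed", score
    leaf.artifact_ref, leaf.insight = ref, insight
    node = leaf.parent
    while node is not None:
        node.insight = summarize([c.insight for c in node.children if c.insight])
        node = node.parent


# Toy tree: one broad direction with two concrete interventions.
root = Node("Improve validation accuracy")
direction = root.add_child("Better regularization helps")
leaf_a = direction.add_child("Add dropout 0.1")
leaf_b = direction.add_child("Add weight decay 0.01")

write_back(leaf_a, 0.81, "branch/dropout", "dropout 0.1 gave a small gain")
write_back(leaf_b, 0.78, "branch/wd", "weight decay hurt")
```

Note that the tree stores only `artifact_ref`, never the code itself, which matches the point that material is referenced rather than duplicated.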
The executor is given h_n, relevant ancestor insights, and the current best artifact; it
creates an isolated git worktree, implements the minimal change h_n requires,
evaluates on E_dev, repairs its own broken/inactive code, and returns
structured evidence.

The boundary is the point: exploratory code changes stay isolated until they pass the merge gate, and the tree records only decision-relevant evidence (scores, factual outcomes, artifact refs, distilled insights) rather than a raw log of tool calls. This is how transient execution traces become persistent research state.
An executor's local loop may involve many edits and reruns, but it stays bound
to the assigned hypothesis: h_n is fixed. If an executor were allowed to
change the hypothesis when the metric stalls, the returned score would no longer
be evidence about the assigned node, and ancestor insights built from it would
become impossible to interpret. Keeping executors hypothesis-bound preserves the
semantic meaning of every tree update while still allowing local engineering
flexibility.
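One way to picture the hypothesis-bound rule is an executor loop that may retry the implementation but never the hypothesis. The sketch below assumes hypothetical `Assignment` and `Evidence` types and caller-supplied `implement` and `evaluate_dev` callables; none of these names come from the paper or the Arbor codebase.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Assignment:
    """Frozen, so the executor cannot rewrite h_n mid-run."""
    node_id: str
    hypothesis: str


@dataclass(frozen=True)
class Evidence:
    node_id: str
    hypothesis: str
    dev_score: Optional[float]
    attempts: int


def run_executor(
    assignment: Assignment,
    implement: Callable[[str, int], dict],
    evaluate_dev: Callable[[dict], float],
    max_attempts: int = 3,
) -> Evidence:
    """Retry the implementation of the same hypothesis until it evaluates."""
    score: Optional[float] = None
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        try:
            artifact = implement(assignment.hypothesis, attempts)
            score = evaluate_dev(artifact)
            break
        except RuntimeError:
            continue  # local repair: another attempt at the same h_n
    return Evidence(assignment.node_id, assignment.hypothesis, score, attempts)


def flaky_implement(hypothesis: str, attempt: int) -> dict:
    """Toy implementer whose first attempt produces broken code."""
    if attempt == 1:
        raise RuntimeError("broken code")
    return {"hypothesis": hypothesis, "attempt": attempt}


evidence = run_executor(
    Assignment("n7", "Add dropout 0.1"), flaky_implement, lambda artifact: 0.8
)
```

Because the returned `Evidence` carries the original `node_id` and `hypothesis` unchanged, the score stays interpretable as evidence about the assigned node, however many repair attempts were needed.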
Each coordinator cycle is a controlled mutation of the tree through a narrow interface:
k child hypotheses, each a
refinement/alternative/correction. Ideation is conditioned on tree evidence:
validated insights are assumptions to build on, pruned nodes are negative
constraints, recent reports suggest what's feasible or under-tested.

E_test in a fresh worktree and merged into
M_best only if it improves under O. This separates exploratory success on
E_dev from verified artifact-level progress.

From the paper's experiments across six AO tasks (model training, harness engineering, data synthesis) plus MLE-Bench Lite:
M_0, evaluator, metric, and interface).