plans/EXPLORE_V2.md
Written 2026-06-09 after reviewing the candidate-selection implementation in
`src/pro/main/ipc/handlers/local_agent/tools/explore_code_subagent.ts` on the `explore-code-subagent` branch. Builds on `GENERALIZABLE_EXPLORER_AGENT.md` and `LONG_TERM_EXPLORE.md`; supersedes their implementation sequencing where they conflict.
The current implementation got the core invariant right: the value model selects observed candidate IDs and can never type a file path or line range into the final report. That fixed fabricated references.
But the implementation around that invariant has three structural problems:
- `scene`, `channel`, canvas-style target
  resolution, `/onboarding/` as a support path). These will misfire on mobile apps, backend
  services, CLIs, and non-React conventions.
- respond only: done), and then a separate selection
  call re-derives judgment from a lossy 180-char-evidence candidate packet. With the targeted
  and gap follow-up passes, one `explore_code` call can cost ~6 sequential LLM invocations,
  each rebuilding context the first one already had.

V2 collapses exploration and selection into one conversation that must end with a structured
submit_report tool call, demotes deterministic code from judge to validator, makes task
intent a tool argument supplied by the main model, and cuts the report to a single
representation at roughly a third of the current budget.
The fabricated-fact problem also moves into scope: the selection model currently authors causal "facts" from candidate metadata it never read. V2 grounds facts by requiring them to quote observed evidence, and earns main-model trust with short verbatim quotes instead of per-report prohibitions.
- path:line references.
- `resolveCandidateIds`-style resolution.
- `candidatesFrom*Result`), dedupe, overlap merging,
  range clamping (`clampRangeForReport`), and ranked-candidate ID assignment.
- `explore_code.ts`).
- `--arms` / `--resume-from` harness plumbing.
- `grep.ts`): genuinely generic.
- `/onboarding/`, `/tours/`).
- `selectExploreCandidates` and the standalone selection prompt/parse path
  (`parseExploreSelection`, `extractJsonObject`, and friends): replaced by a forced tool call.
- `runTargetedEvidencePass`, `runModelAuthoredGapEvidencePass`, `runMinimumEvidencePass`, and
  their query-synthesis helpers (`buildMinimumEvidenceQueries`, `buildTargetedIdentifierGrepQueries`,
  `extractTargetIdentifiers`): replaced by conversation continuation.
- `inferTaskIntent` and every intent-conditional branch: replaced by an `intent` tool argument.
- `normalizeRecommendedAction`: replaced by validation-only rules (see below).
- `supplementPrimaryCandidates` and `MIN_ANSWER_PRIMARY_FILES` padding.
- `normalizeSearchTerm` morphology table, `isActionSearchTerm`, `isLowSignalSearchTerm`,
  the scene/scenes blocklist entries, and `hasTargetResolutionEvidence`.
- `getEvidenceRoles` / `getStrongEvidenceRoles` / `hasStrongApiEvidence` regex haystacks: role
  judgment moves to the model with an open vocabulary.
- `target` evidence role (canvas-app residue).
- `DYAD_CODE_EXPLORER_REPORT_MODE` env-var control of production behavior: benchmark arms get
  explicit plumbing.
- `resolveSelection` (both branches return `fallbackReadTargets`).
- `routes?` check in `normalizeSearchTerm` (moot once deleted).
- `skip_explore_result` parses but silently renders as `targeted_gap_search`: make it a real
  rendered outcome ("explorer found nothing relevant; proceed without it") or drop it from the
  schema.
- `estimatedTokens = rangeWidth * 8` overestimates ~2x; use ~4 tokens/line or measure.

1. user query + intent (from main model)
2. sub-agent conversation (single `streamText` call, tool loop)
   - tools: `explore_code` (raw), `grep`, `read_file`, `list_files`, `submit_report`
   - every tool result is annotated inline with candidate IDs: `[c12]`
   - host accumulates typed candidates at the tool boundary (unchanged)
3. conversation MUST end by calling `submit_report(selection)`
   - selection schema = candidate IDs + roles + facts + action + confidence + gaps
4. deterministic validator
   - drop unknown IDs / unobserved ranges; clamp budgets; verify fact quotes;
     lower confidence; never upgrade or rewrite
5. if the validator finds a critical gap AND step budget remains:
   - append one user message naming the gap; continue the SAME conversation
6. deterministic renderer emits the final compact report
7. main model answers, reads listed ranges, or runs the listed bounded searches

Today candidates get IDs only after exploration ends, so the explorer model cannot reference them. V2 assigns IDs as observations arrive and annotates the tool result text the model sees:

    #### src/store/channels.ts [c7] - switchChannel (function:42)

The registry stays host-side and typed (unchanged from today). The inline annotation exists only so the model can select IDs it has actually seen, in the same conversation, with the full evidence in context. ID assignment must be stable across the conversation (monotonic counter; a duplicate observation maps to the first ID it was given).
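
The stable-ID requirement above can be sketched as a small host-side registry. This is a minimal illustration, assuming a candidate is identified by its path plus line range; `CandidateRegistry`, `ObservedCandidate`, the key format, and `annotateHeading` are hypothetical names, not the existing host code.

```typescript
// Hypothetical shape of an observed candidate; the real registry is typed host-side.
interface ObservedCandidate {
  path: string;
  startLine: number;
  endLine: number;
}

class CandidateRegistry {
  private nextId = 1; // monotonic counter, never reused within a conversation
  private idsByKey = new Map<string, string>();
  private candidatesById = new Map<string, ObservedCandidate>();

  // Returns the first ID for a duplicate observation, otherwise assigns the next one.
  register(candidate: ObservedCandidate): string {
    const key = `${candidate.path}:${candidate.startLine}-${candidate.endLine}`;
    const existing = this.idsByKey.get(key);
    if (existing !== undefined) return existing;
    const id = `c${this.nextId++}`;
    this.idsByKey.set(key, id);
    this.candidatesById.set(id, candidate);
    return id;
  }

  // Resolves an ID the model selected; IDs that were never observed resolve to undefined.
  resolve(id: string): ObservedCandidate | undefined {
    return this.candidatesById.get(id);
  }
}

// Annotates one rendered tool-result heading with the ID the model will see.
function annotateHeading(path: string, id: string, detail: string): string {
  return `#### ${path} [${id}] - ${detail}`;
}

const registry = new CandidateRegistry();
const first = registry.register({ path: "src/store/channels.ts", startLine: 42, endLine: 58 });
const second = registry.register({ path: "src/components/sidebar.tsx", startLine: 120, endLine: 163 });
const repeat = registry.register({ path: "src/store/channels.ts", startLine: 42, endLine: 58 });
console.log(first, second, repeat); // c1 c2 c1
console.log(annotateHeading("src/store/channels.ts", first, "switchChannel (function:42)"));
```

Because `resolve` only knows IDs that `register` handed out, an ID the model invents cannot be turned into a path or range.
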
### `submit_report` as a forced tool call
Replace the prose-JSON selection pass with a `submit_report` tool whose input schema is the
selection. Enforce completion: if the model stops without calling it, send one nudge message;
if it still fails, fall back to deterministic-only selection (current fallback behavior).
Schema validation happens at the tool-call layer, so `extractJsonObject` regex parsing and the
`selection_invalid` path disappear.

    interface ExploreSelectionV2 {
      primaryCandidateIds: CandidateId[]; // 1-5, no padding
      readTargets: Array<{
        candidateId: CandidateId;
        purpose: string; // tied to the caller's intent
        required: boolean;
      }>;
      flow: Array<{
        candidateId: CandidateId;
        role: string; // OPEN vocabulary; suggested list in prompt only
        fact: string; // must contain a quote from observed evidence (validated)
        quote: string; // <=2 lines, verbatim from an observed window/cluster
      }>;
      missingCoverage: string[]; // specific, <=3
      recommendedPrimaryAction:
        | "answer_from_report"
        | "read_targets"
        | "targeted_gap_search"
        | "skip_explore_result";
      searchTargets?: string[]; // only for targeted_gap_search; bounded terms+scopes
      confidence: "high" | "medium" | "low";
    }

Notes:
- `role` is a free string with a suggested vocabulary (entry, ui, handler, state, data/api,
  persistence, render/output, type, test) in the prompt. No closed enum, no regex scoring of
  roles. A mobile app can say "gesture recognizer"; a CLI can say "command dispatch".
- `quote` is the trust mechanism: <=2 verbatim lines per flow link (~30 tokens) that the
  validator string-matches against observed evidence. A verifiable quote does more to stop
  main-model re-reading than any amount of prohibition text.
- `flow` replaces `findings` + `causalChain` + `flowSummary`. One list, ordered, is the explanation.

Add to `exploreCodeSchema`:

    intent: z.enum(["explain", "locate", "edit", "debug"]).describe(
      "What the result will be used for. explain/locate: answer or point at code. " +
        "edit/debug: exact ranges will be read before changing code.",
    ),

The main model knows why it is calling the tool; inferring intent from English query regexes is
strictly worse. Intent flows into the sub-agent prompt and into validation thresholds (e.g.
answer_from_report is only legal for explain/locate). inferTaskIntent is deleted. Cache key
must include intent.
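
As a rough sketch of how intent could feed both the validation threshold and the cache key: the helper names `isActionLegalForIntent` and `buildCacheKey` are hypothetical, and the key format is an assumption (the real cache key in `explore_code.ts` may carry more fields).

```typescript
type Intent = "explain" | "locate" | "edit" | "debug";
type Action =
  | "answer_from_report"
  | "read_targets"
  | "targeted_gap_search"
  | "skip_explore_result";

// answer_from_report is only legal when the caller will not change code.
function isActionLegalForIntent(action: Action, intent: Intent): boolean {
  if (action !== "answer_from_report") return true;
  return intent === "explain" || intent === "locate";
}

// Assumed key shape: the same query under a different intent must not share an entry.
function buildCacheKey(query: string, intent: Intent): string {
  return JSON.stringify({ intent, query });
}

console.log(isActionLegalForIntent("answer_from_report", "edit")); // false
console.log(
  buildCacheKey("how does channel switching work", "explain") ===
    buildCacheKey("how does channel switching work", "edit"),
); // false
```
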
After submit_report, the validator checks the selection. If there is a critical, specific
gap (a flow link whose quote failed validation, a missing role the model itself named, zero
ranged candidates for an edit intent) and the budget allows, the host appends one user message
to the same conversation:

    Your report cited a missing link: "<gap>". You have N tool steps remaining.
    Find observed evidence for it, then call submit_report again.

At most 2 continuation rounds. Budgets: ~12 total tool steps, a total raw-observation cap (~60k chars), and wall-time. The model keeps everything it already learned; no packet rebuild, no re-selection call, no separate gap-pass system prompt. Stop reasons are recorded for the benchmark.
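
The continuation decision can be sketched as a pure function over the budgets named above. This is a minimal illustration under stated assumptions: `LoopState`, `decideNextStep`, and the stop-reason strings are hypothetical names, the exact budget values are the approximate ones from this plan, and the wall-time budget is omitted for brevity.

```typescript
// Approximate budgets taken from the plan; the constant names are hypothetical.
const MAX_CONTINUATION_ROUNDS = 2;
const MAX_TOOL_STEPS = 12;
const MAX_RAW_OBSERVATION_CHARS = 60_000;

interface LoopState {
  continuationRounds: number; // continuation rounds already used
  toolStepsUsed: number;
  rawObservationChars: number;
}

type StopReason = "no_gap" | "rounds_exhausted" | "steps_exhausted" | "observation_cap";

type NextStep =
  | { kind: "continue"; userMessage: string }
  | { kind: "stop"; reason: StopReason };

// Decides whether to append one user message to the SAME conversation or to stop.
function decideNextStep(state: LoopState, criticalGap: string | null): NextStep {
  if (criticalGap === null) return { kind: "stop", reason: "no_gap" };
  if (state.continuationRounds >= MAX_CONTINUATION_ROUNDS) {
    return { kind: "stop", reason: "rounds_exhausted" };
  }
  const remainingSteps = MAX_TOOL_STEPS - state.toolStepsUsed;
  if (remainingSteps <= 0) return { kind: "stop", reason: "steps_exhausted" };
  if (state.rawObservationChars >= MAX_RAW_OBSERVATION_CHARS) {
    return { kind: "stop", reason: "observation_cap" };
  }
  return {
    kind: "continue",
    userMessage:
      `Your report cited a missing link: "${criticalGap}". ` +
      `You have ${remainingSteps} tool steps remaining.\n` +
      "Find observed evidence for it, then call submit_report again.",
  };
}

const step = decideNextStep(
  { continuationRounds: 0, toolStepsUsed: 9, rawObservationChars: 10_000 },
  "no handler evidence for the click",
);
console.log(step.kind); // continue
```

Returning a stop reason rather than a bare boolean keeps the value available for the benchmark's stop-reason recording.
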
Allowed operations, in order:
- drop flow links whose `quote` does not appear (whitespace-normalized substring match) in that
  candidate's observed evidence; record `fact_unverified` for each;
- clamp ranges (`clampRangeForReport`);
- `answer_from_report` with intent edit/debug -> `read_targets` (or `targeted_gap_search`
  if no ranged targets survive);
- `read_targets` with zero surviving targets -> `targeted_gap_search`;
- `high` with any dropped link or non-empty `missingCoverage` -> `medium`;
- `medium` with zero surviving flow links -> `low`.

Forbidden operations: upgrading an action, padding primary files, synthesizing search queries, rewriting facts, reordering the model's flow. If validation guts the selection (no primary files survive), fall back to the deterministic low-confidence report.
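
The downgrade-only rules above can be sketched as follows. This is an illustration under stated assumptions, not the validator itself: `validateSelection`, the `evidenceById` map, and `rangedTargetCount` are hypothetical, range clamping and the gutted-selection fallback are left out, and unknown candidate IDs are simply dropped.

```typescript
type Intent = "explain" | "locate" | "edit" | "debug";
type Action = "answer_from_report" | "read_targets" | "targeted_gap_search" | "skip_explore_result";
type Confidence = "high" | "medium" | "low";

interface FlowLink { candidateId: string; role: string; fact: string; quote: string }
interface Selection {
  flow: FlowLink[];
  missingCoverage: string[];
  recommendedPrimaryAction: Action;
  confidence: Confidence;
}
interface ValidatedSelection extends Selection { factUnverified: string[] }

// Collapses every whitespace run to one space so formatting differences do not fail a quote.
const normalizeWhitespace = (text: string): string => text.replace(/\s+/g, " ").trim();

function validateSelection(
  selection: Selection,
  intent: Intent,
  evidenceById: Map<string, string>, // assumed: observed evidence text per candidate ID
  rangedTargetCount: number, // assumed: read targets that survived with a line range
): ValidatedSelection {
  const factUnverified: string[] = [];
  const flow: FlowLink[] = []; // order is preserved; links are only ever dropped
  for (const link of selection.flow) {
    const evidence = evidenceById.get(link.candidateId);
    if (evidence === undefined) continue; // never-observed ID
    const quote = normalizeWhitespace(link.quote);
    if (quote.length === 0 || !normalizeWhitespace(evidence).includes(quote)) {
      factUnverified.push(link.candidateId);
      continue;
    }
    flow.push(link);
  }
  const droppedAny = flow.length < selection.flow.length;

  // Actions may only move toward "do more work", never toward "trust the report more".
  let action: Action = selection.recommendedPrimaryAction;
  if (action === "answer_from_report" && (intent === "edit" || intent === "debug")) {
    action = "read_targets";
  }
  if (action === "read_targets" && rangedTargetCount === 0) {
    action = "targeted_gap_search";
  }

  let confidence: Confidence = selection.confidence;
  if (confidence === "high" && (droppedAny || selection.missingCoverage.length > 0)) {
    confidence = "medium";
  }
  if (confidence === "medium" && flow.length === 0) {
    confidence = "low";
  }

  return {
    flow,
    missingCoverage: selection.missingCoverage,
    recommendedPrimaryAction: action,
    confidence,
    factUnverified,
  };
}
```

Chaining the two action rules yields the parenthetical case for free: an edit-intent `answer_from_report` with no ranged targets ends at `targeted_gap_search`.
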
````md
## explore_code report
Query: "..." | Intent: explain | Confidence: high | Action: answer_from_report
Flow:
1. src/routes/app.tsx:18-44 (entry) - Route mounts <ChannelSidebar/>.
   > <Route path="/channels/:id" element={<ChannelSidebar/>} />
2. src/components/sidebar.tsx:120-163 (handler) - Click calls switchChannel(id).
   > onClick={() => switchChannel(channel.id)}
3. src/store/channels.ts:42-58 (state) - switchChannel dispatches setCurrentChannel.
   > dispatch(setCurrentChannel(id))
Missing: none
Read targets (only if editing): src/store/channels.ts:42-58 - edit the dispatch payload.
```json
{ ...compact machine block: paths/ranges/action/confidence only... }
```
````
Rules:
- Each path appears exactly once outside the JSON block.
- The JSON block carries only what machines need (cache invalidation in `explore_code.ts`
parses it; keep that contract but shrink it — paths, ranges, action, confidence).
- Budget: `MAX_REPORT_CHARS = 2_500` (down from 8,000).
- Zero imperative policy text. "Follow recommendedPrimaryAction", "don't re-explore after a
high/medium report" live only in the tool description and `local_agent_prompt.ts`, which are
cached per session. Update both to describe the V2 format and remove references to sections
that no longer exist (answer draft, findings).
- Quotes are <=2 lines each and are the *only* source text allowed in the report. The validator
rejects anything longer.
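
A minimal sketch of a renderer for the `Flow:` section that respects these rules. The names `renderFlow` and `RenderableLink` are hypothetical, and two behaviors are assumptions rather than decisions from this plan: a repeated path is skipped, and rendering stops at the first link that would exceed the budget.

```typescript
const MAX_REPORT_CHARS = 2_500;
const MAX_QUOTE_LINES = 2;

// Hypothetical input: a validated flow link already resolved to a path and range by the host.
interface RenderableLink {
  path: string;
  startLine: number;
  endLine: number;
  role: string;
  fact: string;
  quote: string;
}

function renderFlow(links: RenderableLink[]): string {
  const header = "Flow:";
  const lines: string[] = [header];
  const seenPaths = new Set<string>();
  let usedChars = header.length;

  for (const link of links) {
    if (seenPaths.has(link.path)) continue; // assumption: each path renders once, first wins
    const quoteLines = link.quote.split("\n").filter((line) => line.trim().length > 0);
    // Defensive re-check; over-long quotes should already have been rejected upstream.
    if (quoteLines.length === 0 || quoteLines.length > MAX_QUOTE_LINES) continue;

    const step = seenPaths.size + 1;
    const entry = [
      `${step}. ${link.path}:${link.startLine}-${link.endLine} (${link.role}) - ${link.fact}`,
      ...quoteLines.map((line) => `   > ${line.trim()}`),
    ];
    // Each rendered line costs its length plus one newline.
    const entryChars = entry.reduce((sum, line) => sum + line.length + 1, 0);
    if (usedChars + entryChars > MAX_REPORT_CHARS) break; // assumption: stop at the budget

    seenPaths.add(link.path);
    lines.push(...entry);
    usedChars += entryChars;
  }
  return lines.join("\n");
}
```
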
### 7. Ranking stays, word lists go
Until worker-side graph ranking (prior plan Phase 3) lands, keep the existing
`buildCandidate` scoring but reduce it to structural features only:
- source weight (compiler > read_file > grep > list_files);
- exact query-identifier match against path basename / symbol names / evidence — using the
raw query tokens split on non-alphanumerics and camelCase, **no morphology table, no
action/noun lists**;
- range tightness and estimated token cost (fixed: ~4 tokens/line);
- generic test/support/generated/docs penalty.
Role coverage disappears from scoring entirely (roles are now model-assigned labels, not
ranking features). Some relevant candidates may rank lower as a result; the explorer model
compensates because it now sees candidates inline and can keep exploring when the top of the
list looks wrong.
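
A minimal sketch of structure-only scoring as described in this section. Only the source ordering and the ~4 tokens/line figure come from the plan; the numeric weights, the penalty pattern, and the names `tokenize`, `estimateTokens`, and `scoreCandidate` are hypothetical, and evidence-text matching is left out to keep the example short.

```typescript
type Source = "compiler" | "read_file" | "grep" | "list_files";

// Hypothetical weights; the plan only fixes the ordering compiler > read_file > grep > list_files.
const SOURCE_WEIGHT: Record<Source, number> = { compiler: 4, read_file: 3, grep: 2, list_files: 1 };
const TOKENS_PER_LINE = 4;
// Hypothetical generic penalty for test/docs/generated paths; no product vocabulary.
const GENERIC_PENALTY_PATTERN = /(^|\/)(__tests__|tests?|docs?|generated)\/|\.(test|spec)\./;

interface RankedCandidate {
  source: Source;
  path: string;
  symbols: string[];
  startLine: number;
  endLine: number;
}

// Splits on non-alphanumerics and camelCase boundaries; no stemming, no word lists.
function tokenize(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .split(/[^A-Za-z0-9]+/)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

function estimateTokens(candidate: RankedCandidate): number {
  return (candidate.endLine - candidate.startLine + 1) * TOKENS_PER_LINE;
}

function scoreCandidate(candidate: RankedCandidate, queryTokens: string[]): number {
  const basename = candidate.path.slice(candidate.path.lastIndexOf("/") + 1);
  const haystack = new Set<string>(tokenize(basename));
  for (const symbol of candidate.symbols) {
    for (const token of tokenize(symbol)) haystack.add(token);
  }
  const exactMatches = queryTokens.filter((token) => haystack.has(token)).length;
  const penalty = GENERIC_PENALTY_PATTERN.test(candidate.path) ? 2 : 0;
  // Tighter ranges cost fewer tokens and therefore score higher.
  return SOURCE_WEIGHT[candidate.source] + 2 * exactMatches - estimateTokens(candidate) / 100 - penalty;
}

console.log(tokenize("How does switchChannel work?")); // [ 'how', 'does', 'switch', 'channel', 'work' ]
```

Note that `channel` does not match a `channels.ts` basename here: without a morphology table, only exact token matches count.
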
## Token Accounting (why this nets out)
Per `explore_code` call, V1 (candidate-followup) vs V2:
| Cost center | V1 | V2 |
| ------------------------------- | --------------------------- | -------------------------- |
| Sub-agent LLM invocations | up to 6 sequential | 1 (+<=2 continuations) |
| Candidate packet resends | up to 3 x 3-8k tokens | 0 (inline IDs) |
| Report into main context | ~2k tokens, 4x redundant | ~600 tokens, single-form |
| Per-call policy text (uncached) | 400-600 chars every call | 0 |
| Padding files | always 5 primary | only what the model picked |
| Rediscovery after distrust | common (unverifiable facts) | reduced (verbatim quotes) |
The dominant lever is the last row. A report the main model trusts replaces 5-20 broad main
reads; a report it distrusts is pure overhead on top of them. Quotes plus validated facts are
the trust mechanism; everything else is supporting cost reduction.
## Implementation Sequence
### Phase A: submit_report + inline IDs (the collapse)
- Add stable candidate-ID assignment at observation time; annotate rendered tool results.
- Add `submit_report` tool with the V2 selection schema; force completion with one nudge.
- Delete the selection pass, both follow-up passes, the minimum-evidence pass, and JSON
scraping. Wire the validator + continuation loop.
- Keep the deterministic report builder as the model-failure fallback only.
- Rebaseline `explore_code_subagent.spec.ts` around invariants: unknown IDs cannot render,
quotes must match evidence, budgets hold, downgrades-only validation, fallback works. Do not
assert specific winners.
Acceptance: one streamText conversation per explore call on the happy path; selection arrives
as a validated tool call; all existing invariant tests pass.
### Phase B: intent argument + word-list deletion
- Add `intent` to `exploreCodeSchema`, tool description, main prompt guidance, and cache key.
- Delete `inferTaskIntent`, morphology/action/low-signal lists, role regexes, `target` role,
product-shaped support paths.
- Reduce `buildCandidate` to structural features.
- Add the **domain-literal guard test**: production explorer files must not contain benchmark
repo names, product nouns, or task vocabulary (`scene`, `channel`, `invoice`, `excalidraw`,
`mattermost`, ...). This test would have caught today's residue.
Acceptance: guard test passes; no intent regexes remain; scoring is explainable from structure.
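
The domain-literal guard could look roughly like the sketch below. It assumes the test runner loads the production explorer files and passes their contents in; `findDomainLiterals`, `SourceFile`, and the plain case-insensitive substring match are illustrative choices, and the literal list is only the partial one quoted above.

```typescript
interface SourceFile {
  path: string;
  content: string;
}

interface LiteralViolation {
  path: string;
  line: number; // 1-based
  literal: string;
}

// Partial list copied from the plan; the full vocabulary is not specified here.
const FORBIDDEN_LITERALS = ["scene", "channel", "invoice", "excalidraw", "mattermost"];

// Reports every line of every file that contains a forbidden literal (case-insensitive).
function findDomainLiterals(files: SourceFile[], forbidden: string[]): LiteralViolation[] {
  const violations: LiteralViolation[] = [];
  for (const file of files) {
    file.content.split("\n").forEach((text, index) => {
      const lowered = text.toLowerCase();
      for (const literal of forbidden) {
        if (lowered.includes(literal.toLowerCase())) {
          violations.push({ path: file.path, line: index + 1, literal });
        }
      }
    });
  }
  return violations;
}

const found = findDomainLiterals(
  [
    { path: "explore_code_subagent.ts", content: 'const a = 1;\nconst SUPPORT = ["/onboarding/", "scene"];' },
    { path: "grep.ts", content: "export const x = 2;" },
  ],
  FORBIDDEN_LITERALS,
);
console.log(found); // one violation: explore_code_subagent.ts, line 2, "scene"
```

In a spec, the guard would assert that the returned array is empty for the production explorer files. Substring matching is deliberately strict; a whole-word match would miss identifiers such as `switchChannel`.
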
### Phase C: report V2 + prompt tightening
- New renderer (single representation, 2.5k budget, quotes, slim JSON block).
- Update `explore_code.ts` cache-stat extraction for the slim JSON block.
- Rewrite tool description and `local_agent_prompt.ts` guidance for the V2 contract; move all
policy there.
Acceptance: every report <= 2.5k chars (the `MAX_REPORT_CHARS` cap); every path renders once;
zero imperative sentences in report bodies.
### Phase D: measurement
- Benchmark arms: `explore-v1` (current candidate-followup) vs `explore-v2`, plus baseline.
- Headline metric: **main-model tool calls after a high/medium report** (broad grep/list_files
count, read_file calls outside read targets), on a held-out repo split (>=8 tasks from repos
never used to write any heuristic), repeats >= 3.
- Secondary: main uncached input p50, report tokens, sub-agent invocations and elapsed time,
`fact_unverified` rate, continuation-round distribution, answer quality rubric.
Acceptance: held-out quality >= V1; main uncached input and post-report broad calls decrease;
no production diff reintroduces domain literals.
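
The headline metric can be sketched as a count over the main model's tool-call transcript. The transcript shape (`MainToolCall`) and the function name are hypothetical; this sketch counts every `grep`/`list_files` call after the first `explore_code` call as broad, and leaves filtering to high/medium reports to the caller.

```typescript
type MainTool = "explore_code" | "grep" | "list_files" | "read_file";

// Hypothetical transcript entry for one main-model tool call.
interface MainToolCall {
  tool: MainTool;
  path?: string; // set for read_file calls
}

interface PostReportCounts {
  broadSearches: number; // grep / list_files after the report
  offTargetReads: number; // read_file calls outside the report's read targets
}

function countPostReportCalls(calls: MainToolCall[], readTargetPaths: string[]): PostReportCounts {
  const counts: PostReportCounts = { broadSearches: 0, offTargetReads: 0 };
  const reportIndex = calls.findIndex((call) => call.tool === "explore_code");
  if (reportIndex < 0) return counts; // no report was produced in this run

  const targets = new Set(readTargetPaths);
  for (const call of calls.slice(reportIndex + 1)) {
    if (call.tool === "grep" || call.tool === "list_files") {
      counts.broadSearches++;
    } else if (call.tool === "read_file" && (call.path === undefined || !targets.has(call.path))) {
      counts.offTargetReads++;
    }
  }
  return counts;
}

const counts = countPostReportCalls(
  [
    { tool: "grep" },
    { tool: "explore_code" },
    { tool: "read_file", path: "src/store/channels.ts" },
    { tool: "read_file", path: "src/unrelated.ts" },
    { tool: "grep" },
    { tool: "list_files" },
  ],
  ["src/store/channels.ts"],
);
console.log(counts); // { broadSearches: 2, offTargetReads: 1 }
```
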
## Risks
- **Cheap model can't drive a forced tool call reliably.** Mitigation: one nudge retry, then
deterministic fallback (already exists). Measure the fallback rate; if it exceeds ~10%,
revisit model choice for the sub-agent before adding orchestration back.
- **Quote validation is too strict** (whitespace/truncation mismatches drop real facts).
Normalize aggressively (collapse whitespace, strip line numbers), match against untruncated
observation text, and track `fact_unverified` rate before tightening further.
- **Deleting word lists drops benchmark recall.** Expected and acceptable in-sample; the
held-out split is the metric that matters. Do not add compensating vocabulary back.
- **Inline ID annotations confuse the explorer model.** Keep them terse (`[c7]`) and explain
them once in the system prompt; verify with a few manual traces before benchmarking.
- **Spec rebaseline is large.** It is — but it is the last rebaseline of this size if the tests
assert invariants instead of winners.
## Acceptance Criteria
- One sub-agent conversation per explore call; selection is a schema-validated tool call.
- The value model still cannot author paths or ranges; additionally, every rendered fact is
backed by a verbatim observed quote.
- No morphology tables, action/intent word lists, closed role enums, or domain literals in
production explorer code (enforced by a guard test).
- Reports are single-representation, <=2.5k chars, with policy text only in cached context.
- Held-out benchmark: post-report main broad-search calls and main uncached input decrease
versus V1 without quality regression, across >=3 repeats.