plans/LONG_TERM_EXPLORE.md
Written 2026-06-09 after the deterministic-report experiment in
src/pro/main/ipc/handlers/local_agent/tools/explore_code_subagent.ts and the follow-up discussion about generalizability, main-context density, and model-authored source references.
The long-term explore_code architecture should separate judgment from authority.
The value model is good at deciding what evidence matters after broad exploration. It is not a reliable authority for exact source references when it has to type paths, line ranges, and roles from memory. Deterministic code is good at preserving exact observations, validating budgets, and rendering canonical references. The right design is therefore:
user query
-> tools produce typed observed candidates
-> candidate registry normalizes, dedupes, scores, and assigns stable IDs
-> deterministic packer builds a bounded candidate packet
-> value model selects candidate IDs, roles, confidence, and next action
-> deterministic renderer emits the final verified report
-> main model reads only useful files/ranges when raw source is actually needed
This is not a quarantine around benchmark hacks. Benchmark-specific production logic should be deleted. The durable interface is observed candidates plus candidate-ID selection, not repo-specific path knowledge.
The current implementation has moved in the right direction by making deterministic,
host-rendered reports first-class. The sub-agent explores with tools, the final model text is
ignored, and buildDeterministicReport(...) emits a report from observed candidates only.
That fixed the source-reference reliability issue, but it is still an interim shape:
The full benchmark run before this change showed why this matters. explore_code improved overall
cost and main uncached input, but all 24 explorer trials fell back to deterministic reporting
because model-authored reports failed validation. The largest validation failures were unobserved
paths, unobserved ranges, too-wide ranges, unsupported role claims, and density-budget violations.
Those are symptoms of the wrong contract: the model should not author raw references.
- path:line references.

The value model never writes source references directly.
It can select candidateIds, assign roles to those IDs, describe missing coverage, and choose a
recommended action. Deterministic code resolves IDs to canonical path and range values. Unknown
IDs are rejected or dropped. Unobserved ranges are impossible to render.
This keeps the model's judgment while removing its ability to fabricate references.
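The ID-only contract can be sketched as a small resolver. This is a minimal sketch under stated assumptions, not the code in explore_code_subagent.ts: `ObservedRef`, `ResolvedRefs`, `resolveReferences`, and the `path:start-end` output format are hypothetical names and choices used only for illustration.

```typescript
// Trimmed-down, hypothetical shapes; the full ObservedCandidate is defined later in this plan.
type CandidateId = `c${number}`;

interface ObservedRef {
  id: CandidateId;
  path: string;
  range: { start: number; end: number } | null;
}

interface ResolvedRefs {
  references: string[]; // host-rendered "path" or "path:start-end" strings
  droppedIds: string[]; // IDs the model returned that were never observed
}

// The model supplies IDs only. Every path and range in the output is copied
// from an observed candidate, so an unobserved reference cannot be rendered.
function resolveReferences(
  registry: ReadonlyMap<string, ObservedRef>,
  selectedIds: readonly string[],
): ResolvedRefs {
  const references: string[] = [];
  const droppedIds: string[] = [];
  for (const id of selectedIds) {
    const candidate = registry.get(id);
    if (candidate === undefined) {
      droppedIds.push(id); // unknown ID: nothing to render
      continue;
    }
    references.push(
      candidate.range === null
        ? candidate.path
        : `${candidate.path}:${candidate.range.start}-${candidate.range.end}`,
    );
  }
  return { references, droppedIds };
}
```

Whether unknown IDs are dropped silently, as here, or cause the whole selection to be rejected is a policy choice this sketch does not settle.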
Every exploration tool should emit structured observations at the host boundary:
- explore_code_raw: compiler/graph candidates with symbols, declarations, references, and declaration ranges;
- grep: structured matches with path, line, match text, literal/regex diagnostics, and bounded context metadata;
- read_file: structured read ranges with canonical path, start/end lines, file size, and whether the range was truncated;
- list_files: structured path candidates and directory traits, but no read target unless source was actually read.

Rendered Markdown is for humans and logs. It should not be parsed back into evidence when a typed payload can exist.
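One way to picture typed observations at the host boundary is a discriminated union carried next to the rendered Markdown. The sketch below is illustrative only: `GrepMatch`, `ReadRange`, `ListedPath`, `ToolResult`, `grepResult`, and `observedReadRanges` are hypothetical names, and the fields cover only part of what the list above describes.

```typescript
// Hypothetical observation shapes; the real payloads may carry more fields.
interface GrepMatch {
  kind: "grep";
  path: string;
  line: number;
  matchText: string;
}

interface ReadRange {
  kind: "read_file";
  path: string;
  startLine: number;
  endLine: number;
  fileSize: number;
  truncated: boolean;
}

interface ListedPath {
  kind: "list_files";
  path: string;
  isDirectory: boolean;
}

type Observation = GrepMatch | ReadRange | ListedPath;

interface ToolResult {
  markdown: string; // for humans and logs only
  observations: Observation[]; // the evidence; never re-parsed from markdown
  diagnostics: string[];
}

// Build the typed payload first and derive the Markdown from it, not the reverse.
function grepResult(matches: GrepMatch[], diagnostics: string[] = []): ToolResult {
  const markdown = matches
    .map((m) => `- ${m.path}:${m.line} ${m.matchText}`)
    .join("\n");
  return { markdown, observations: matches, diagnostics };
}

// Only ranges that were actually read qualify as read targets;
// a listed path alone does not.
function observedReadRanges(observations: readonly Observation[]): ReadRange[] {
  return observations.filter((o): o is ReadRange => o.kind === "read_file");
}
```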
Normalize all tool observations into a registry:
type CandidateId = `c${number}`;

interface ObservedCandidate {
  id: CandidateId;
  path: string;
  range: { start: number; end: number } | null;
  source:
    | "compiler"
    | "grep"
    | "read_file"
    | "list_files"
    | "framework"
    | "index";
  symbols: Array<{ name: string; kind: string; line: number }>;
  evidence: {
    summary: string;
    matchedTerms: string[];
    observedTextKinds: Array<
      "symbol" | "path" | "line" | "import" | "call" | "route"
    >;
  };
  traits: {
    isTest: boolean;
    isSupport: boolean;
    isGenerated: boolean;
    isDocsExample: boolean;
    pathKinds: Array<
      | "route"
      | "component"
      | "hook"
      | "service"
      | "api"
      | "store"
      | "action"
      | "type"
      | "config"
      | "test"
    >;
  };
  roles: Array<
    | "entry"
    | "ui"
    | "handler"
    | "state"
    | "data"
    | "api"
    | "persistence"
    | "render"
    | "output"
    | "type"
    | "test"
  >;
  scores: {
    lexical: number;
    graph: number;
    pathTrait: number;
    roleCoverage: number;
    rangeTightness: number;
    genericPenalty: number;
  };
  provenance: Array<{ tool: string; reason: string }>;
  estimatedTokens: number;
}
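A minimal sketch of how observations might be normalized into such a registry follows. It assumes a much smaller candidate shape than `ObservedCandidate`, a deliberately simple path normalization, and dedup keyed on path plus range; `RawObservation`, `RegisteredCandidate`, and the `CandidateRegistry` methods shown are hypothetical, not the planned API.

```typescript
type CandidateId = `c${number}`;

interface RawObservation {
  path: string;
  range: { start: number; end: number } | null;
  tool: string;
  reason: string;
}

// Hypothetical, reduced stand-in for ObservedCandidate.
interface RegisteredCandidate {
  id: CandidateId;
  path: string;
  range: { start: number; end: number } | null;
  provenance: Array<{ tool: string; reason: string }>;
}

class CandidateRegistry {
  private byKey = new Map<string, RegisteredCandidate>();
  private nextId = 1;

  // Assumed normalization: forward slashes and no leading "./".
  private static normalizePath(path: string): string {
    return path.replace(/\\/g, "/").replace(/^\.\//, "");
  }

  // Adds an observation, merging provenance when the same path and range
  // were already observed. IDs are assigned once, in first-seen order.
  add(obs: RawObservation): RegisteredCandidate {
    const path = CandidateRegistry.normalizePath(obs.path);
    const key = obs.range ? `${path}:${obs.range.start}-${obs.range.end}` : path;
    const existing = this.byKey.get(key);
    if (existing !== undefined) {
      existing.provenance.push({ tool: obs.tool, reason: obs.reason });
      return existing;
    }
    const created: RegisteredCandidate = {
      id: `c${this.nextId++}` as CandidateId,
      path,
      range: obs.range,
      provenance: [{ tool: obs.tool, reason: obs.reason }],
    };
    this.byKey.set(key, created);
    return created;
  }

  all(): RegisteredCandidate[] {
    return Array.from(this.byKey.values());
  }
}
```

Scoring is left out of the sketch. Merging overlapping but non-identical ranges would need a policy of its own.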
The registry is responsible for:
The value model should receive a compact packet, not raw tool dumps:
interface CandidatePacket {
  query: string;
  budget: {
    maxPrimary: number;
    maxReadTargets: number;
    maxReportChars: number;
  };
  candidates: ObservedCandidateSummary[];
  coverageHints: string[];
  knownGaps: string[];
  toolStats: {
    toolCalls: number;
    candidatesSeen: number;
    diagnostics: string[];
  };
}
The packet can include more candidates than the final report, but it must still be bounded. A reasonable starting point:
The packer should enforce diversity. It should avoid sending 20 near-duplicate files from the same directory or role. It should also avoid support, fixture, generated, docs, and test files unless the query asks for those surfaces or they are necessary to understand the flow.
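The diversity rule could be enforced with a greedy pass over ranked candidates. The sketch below is an assumption-heavy illustration: `PackCandidate`, `PackOptions`, `packCandidates`, the single combined `score`, and the per-directory cap are all hypothetical, and it limits by directory only, not by role.

```typescript
interface PackCandidate {
  id: string;
  path: string;
  score: number; // hypothetical combined score; higher is better
  isAuxiliary: boolean; // support, fixture, generated, docs, or test file
}

interface PackOptions {
  maxCandidates: number;
  maxPerDirectory: number;
  includeAuxiliary: boolean; // true when the query asks about those surfaces
}

function directoryOf(path: string): string {
  const slash = path.lastIndexOf("/");
  return slash === -1 ? "" : path.slice(0, slash);
}

// Greedy selection: best score first, skipping auxiliary files unless
// requested and capping how many candidates one directory may contribute.
function packCandidates(
  candidates: readonly PackCandidate[],
  options: PackOptions,
): PackCandidate[] {
  const perDirectory = new Map<string, number>();
  const packed: PackCandidate[] = [];
  const ranked = candidates.slice().sort((a, b) => b.score - a.score);
  for (const candidate of ranked) {
    if (packed.length >= options.maxCandidates) break;
    if (candidate.isAuxiliary && !options.includeAuxiliary) continue;
    const dir = directoryOf(candidate.path);
    const used = perDirectory.get(dir) ?? 0;
    if (used >= options.maxPerDirectory) continue;
    perDirectory.set(dir, used + 1);
    packed.push(candidate);
  }
  return packed;
}
```

The sketch does not cover the case where an auxiliary file is necessary to understand the flow; that would need a signal beyond a single boolean option.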
The value model receives the packet and returns a structured selection:
interface ExploreSelection {
  primaryCandidateIds: CandidateId[];
  secondaryCandidateIds: CandidateId[];
  readTargetIds: CandidateId[];
  roleAssignments: Array<{
    candidateId: CandidateId;
    role: ObservedCandidate["roles"][number];
    reason: string;
  }>;
  findings: string[];
  flowSummary: string;
  recommendedPrimaryAction:
    | "answer_from_report"
    | "read_targets"
    | "targeted_gap_search"
    | "skip_explore_result";
  confidence: "high" | "medium" | "low";
  missingCoverage: string[];
  needMoreEvidence?: {
    roleGap: string;
    targetedQueries: string[];
    preferredTools: string[];
  };
}
This should be a schema-bound result or a dedicated tool call such as
select_explore_candidates. The model can be wrong about importance, but it cannot invent files.
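Host-side validation of the returned selection might look like the following. It is a sketch, not the planned validator: `SelectionInput` is a reduced stand-in for `ExploreSelection`, and `validateSelection`, `SelectionBudget`, and `rejectedIds` are hypothetical names. It assumes unknown IDs are dropped and over-budget IDs are truncated in the order the model listed them.

```typescript
interface SelectionInput {
  primaryCandidateIds: string[];
  readTargetIds: string[];
  roleAssignments: Array<{ candidateId: string; role: string; reason: string }>;
}

interface SelectionBudget {
  maxPrimary: number;
  maxReadTargets: number;
}

interface ValidatedSelection extends SelectionInput {
  rejectedIds: string[]; // IDs that were not in the packet
}

function validateSelection(
  selection: SelectionInput,
  knownIds: ReadonlySet<string>,
  budget: SelectionBudget,
): ValidatedSelection {
  const rejected = new Set<string>();

  // Keep known, de-duplicated IDs up to the budget; record unknown ones.
  const keepKnown = (ids: readonly string[], limit: number): string[] => {
    const kept: string[] = [];
    for (const id of ids) {
      if (!knownIds.has(id)) {
        rejected.add(id);
        continue;
      }
      if (kept.indexOf(id) === -1 && kept.length < limit) kept.push(id);
    }
    return kept;
  };

  // A role can only be assigned to a candidate that was actually observed.
  const roleAssignments = selection.roleAssignments.filter((assignment) => {
    if (knownIds.has(assignment.candidateId)) return true;
    rejected.add(assignment.candidateId);
    return false;
  });

  return {
    primaryCandidateIds: keepKnown(selection.primaryCandidateIds, budget.maxPrimary),
    readTargetIds: keepKnown(selection.readTargetIds, budget.maxReadTargets),
    roleAssignments,
    rejectedIds: Array.from(rejected),
  };
}
```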
The model may request more exploration when the packet is insufficient. That loop should be adaptive:
The sub-agent should have space to explore unfamiliar codebases. The guardrail is not "few calls"; the guardrail is "do not leak broad, low-density output into the main context."
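A bounded version of that loop could be sketched as below. Everything here is an assumption made for illustration: the stop conditions, the round budget, and the names `runAdaptiveExplore`, `SelectionRound`, and `LoopOutcome` do not come from the current implementation. The point it shows is that the loop returns only a compact outcome, while raw search output stays inside the sub-agent.

```typescript
type Confidence = "high" | "medium" | "low";

interface SelectionRound {
  confidence: Confidence;
  targetedQueries: string[]; // from needMoreEvidence; empty when none requested
}

interface LoopOutcome {
  rounds: number;
  confidence: Confidence;
  stoppedBecause: "sufficient" | "no_request" | "no_new_evidence" | "round_budget";
}

// selectRound asks the value model for a selection over the current packet.
// runTargetedSearch runs one gap query and returns how many new candidates it added.
function runAdaptiveExplore(
  selectRound: (round: number) => SelectionRound,
  runTargetedSearch: (query: string) => number,
  maxRounds: number,
): LoopOutcome {
  let confidence: Confidence = "low";
  for (let round = 1; round <= maxRounds; round++) {
    const selection = selectRound(round);
    confidence = selection.confidence;
    if (confidence !== "low") {
      return { rounds: round, confidence, stoppedBecause: "sufficient" };
    }
    if (selection.targetedQueries.length === 0) {
      return { rounds: round, confidence, stoppedBecause: "no_request" };
    }
    let added = 0;
    for (const query of selection.targetedQueries) {
      added += runTargetedSearch(query);
    }
    if (added === 0) {
      // Further rounds would see the same packet, so stop early.
      return { rounds: round, confidence, stoppedBecause: "no_new_evidence" };
    }
  }
  return { rounds: maxRounds, confidence, stoppedBecause: "round_budget" };
}
```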
The renderer resolves selected IDs to canonical references and emits the final report:
If selection validation fails, the renderer should degrade predictably:
The main model should treat a high/medium explorer report as a verified discovery map, not as a prompt to redo broad exploration.
Expected behavior:
This is realistic: the main model will often need several files. The optimization is that those files should be useful and bounded.
Candidate ranking must use generic features:
Candidate ranking must not use:
Tool diagnostics are not evidence.
Examples:
- read_file ranges;

These should help the explorer decide what to try next, but they should not appear as primary findings unless the user's task is about tool behavior. The final report can include a short diagnostic note only when it affects confidence or the recommended action.
Near-term robustness work:
- grep diagnostics separate from matches;
- code_search shrink, chunk, or prefilter payloads when a request is too large;

- grep, read_file, list_files, and explore_code_raw.
- ObservedCandidate, CandidateRegistry, and candidate IDs.
- CandidatePacket.

Primary metrics:
- read_file calls;

Secondary metrics:
Generalization guardrails:
- needMoreEvidence loops.