plans/GENERALIZABLE_EXPLORER_AGENT.md
Written 2026-06-08 after reviewing `explore_code_subagent.ts`, `BENCHMARK.md`, `IMPROVE_CODE_EXPLORE.md`, and consulting `claude -p` as an architecture advisor.
explore_code has drifted from a general code-reconnaissance tool into a benchmark-shaped
deterministic ranker. The value model, which is the component most likely to generalize across
repositories, is currently used mostly as a one-step tool-call planner. When observations exist,
the implementation discards the model's report and rebuilds a deterministic report from a large
pile of query/path heuristics in explore_code_subagent.ts.
That is backwards. Deterministic code should retrieve, constrain, and verify. The value model should judge which evidence matters and compress it into dense findings for the main model.
This plan is intentionally aggressive: delete benchmark-specific production logic rather than quarantining it behind a permanent compatibility layer. Keep the benchmark as a measurement tool, not as a source of product behavior.
`explore_code` should work on repositories and workflows never seen in the benchmark suite.

`grep` and `read_file`; the explorer should reduce broad discovery, not remove all targeted follow-up.

The production sub-agent contains explicit benchmark-shaped logic:
This creates two failures:
Invert the responsibility split.
| Layer | Owner | Responsibility |
|---|---|---|
| Retrieve | deterministic worker | Build graph/search candidates with structured provenance. |
| Assemble | deterministic main process | Merge typed tool observations, dedupe, and budget candidates. |
| Judge and write | value model | Pick the important evidence, explain the flow, and choose the next main action. |
| Validate | deterministic main process | Enforce schema, observed ranges, no raw source, tight refs, and generic safety rules. |
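The table above can be read as a typed pipeline in which deterministic code brackets the value model on both sides. The following is a minimal sketch under stated assumptions: `Candidate`, `Finding`, `Judge`, `assemblePacket`, `validateFindings`, and `runExplorer` are hypothetical names for illustration, and the `Judge` callback merely stands in for the value model call.

```typescript
// Hypothetical shapes for illustration; only the layer names come from the table.
interface LineRange { start: number; end: number }
interface Candidate { path: string; range: LineRange; score: number }
interface Finding { path: string; range: LineRange; note: string }

// Stand-in for the value model. Deterministic code never writes findings itself.
type Judge = (packet: Candidate[]) => Finding[];

// Assemble (deterministic): order by score and keep a bounded packet.
function assemblePacket(retrieved: Candidate[], maxCandidates: number): Candidate[] {
  return [...retrieved].sort((a, b) => b.score - a.score).slice(0, maxCandidates);
}

// Validate (deterministic): keep only findings whose range lies inside an observed candidate.
function validateFindings(findings: Finding[], observed: Candidate[]): Finding[] {
  return findings.filter((f) =>
    observed.some(
      (c) => c.path === f.path && f.range.start >= c.range.start && f.range.end <= c.range.end,
    ),
  );
}

// Retrieve -> assemble -> judge -> validate, in that order.
function runExplorer(retrieved: Candidate[], judge: Judge, maxCandidates: number): Finding[] {
  const packet = assemblePacket(retrieved, maxCandidates);
  const draft = judge(packet);
  return validateFindings(draft, packet);
}

// Usage with made-up paths: the judge cites one observed range and one unobserved file.
const demoRetrieved: Candidate[] = [
  { path: "src/a.ts", range: { start: 10, end: 40 }, score: 0.9 },
  { path: "src/b.ts", range: { start: 1, end: 20 }, score: 0.4 },
];
const demoJudge: Judge = () => [
  { path: "src/a.ts", range: { start: 12, end: 30 }, note: "handler" },
  { path: "src/z.ts", range: { start: 1, end: 5 }, note: "never observed" },
];
const demoFindings = runExplorer(demoRetrieved, demoJudge, 5);
```

In this sketch the unobserved `src/z.ts` finding is dropped by validation, while the judge remains the only component that decides which evidence matters.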
Do not assume the TypeScript graph alone is enough. Real applications wire behavior through JSX props, framework routes, server actions, generated clients, dependency injection, string keys, config files, and callback boundaries that a declaration/call graph will miss.
Use multiple generic retrieval channels and merge them into one candidate set:
These channels are allowed because they are structural, not benchmark-specific. They should not mention product names, benchmark repo names, or one-off file paths.
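Merging channels into one candidate set could look roughly like the sketch below. It assumes hypothetical `ChannelHit` and `MergedCandidate` shapes and a hypothetical `mergeChannelHits` helper; the `source` values reuse the ones listed in `ExplorerCandidate`. Sorting candidates confirmed by more channels first is an illustrative choice, not a rule taken from this plan.

```typescript
type ChannelSource = "compiler" | "grep" | "read_file" | "list_files";

interface ChannelHit {
  path: string;
  range: { start: number; end: number } | null;
  source: ChannelSource;
  score: number;
}

interface MergedCandidate {
  path: string;
  range: { start: number; end: number } | null;
  score: number;
  provenance: ChannelSource[];
}

// Dedupe hits by path and range, keep the best score, and record every channel that found them.
function mergeChannelHits(hits: ChannelHit[]): MergedCandidate[] {
  const merged = new Map<string, MergedCandidate>();
  for (const hit of hits) {
    const key = hit.range ? `${hit.path}:${hit.range.start}-${hit.range.end}` : hit.path;
    const existing = merged.get(key);
    if (!existing) {
      merged.set(key, {
        path: hit.path,
        range: hit.range,
        score: hit.score,
        provenance: [hit.source],
      });
      continue;
    }
    existing.score = Math.max(existing.score, hit.score);
    if (!existing.provenance.includes(hit.source)) existing.provenance.push(hit.source);
  }
  // Assumption: more confirming channels sorts first; score breaks ties.
  return Array.from(merged.values()).sort(
    (a, b) => b.provenance.length - a.provenance.length || b.score - a.score,
  );
}

// Usage with made-up paths: two channels agree on one range, one channel finds another file.
const demoMerged = mergeChannelHits([
  { path: "src/cart.ts", range: { start: 10, end: 30 }, source: "compiler", score: 0.6 },
  { path: "src/cart.ts", range: { start: 10, end: 30 }, source: "grep", score: 0.8 },
  { path: "src/util.ts", range: null, source: "grep", score: 0.9 },
]);
```

None of the keys or sort criteria mention a product name, repository name, or file path, which is the property the paragraph above asks for.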
Extend the worker result beyond rendered source windows. Return structured candidates:
```typescript
interface ExplorerCandidate {
  path: string;
  range: { start: number; end: number } | null;
  symbols: Array<{ name: string; kind: string; line: number }>;
  score: number;
  source: "compiler" | "grep" | "read_file" | "list_files";
  provenance: string[];
  graph?: {
    rootMatchScore?: number;
    distanceFromRoot?: number;
    inboundEdges?: number;
    outboundEdges?: number;
    edgeKinds?: string[];
  };
  traits: {
    isTest: boolean;
    isSupport: boolean;
    isGenerated: boolean;
    isDocsExample: boolean;
    pathKinds: Array<
      | "route"
      | "component"
      | "hook"
      | "service"
      | "api"
      | "store"
      | "action"
      | "type"
      | "config"
    >;
  };
  estimatedTokens: number;
  evidenceRoles: Array<
    | "entry"
    | "ui"
    | "handler"
    | "state"
    | "data"
    | "api"
    | "persistence"
    | "render"
    | "output"
    | "type"
    | "test"
  >;
}
```
Ranking should use generic features only:
The report should not only list files; it should explain which generic role each file plays. This is how we avoid a valid-looking but semantically shallow report.
Use repo-independent evidence roles:
The model may assign roles, but validation must check that every claimed role is backed by an observed candidate with matching generic traits or evidence text. Confidence must be based on role coverage, not only on model self-assessment.
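The role check described above might be sketched as follows. This is a simplification under stated assumptions: a claim counts as backed only when the observed candidate already carries that role in `evidenceRoles` (the plan also allows matching on traits or evidence text), and `requiredRoles`, `checkRoleCoverage`, and the 100% / 50% thresholds are hypothetical.

```typescript
type EvidenceRole =
  | "entry" | "ui" | "handler" | "state" | "data" | "api"
  | "persistence" | "render" | "output" | "type" | "test";

interface ObservedCandidate { path: string; evidenceRoles: EvidenceRole[] }
interface RoleClaim { path: string; role: EvidenceRole }
type CoverageConfidence = "high" | "medium" | "low";

interface RoleCoverage {
  backed: RoleClaim[];
  rejected: RoleClaim[];
  missing: EvidenceRole[];
  confidence: CoverageConfidence;
}

// Confidence is derived from how many required roles are backed by observations,
// not from the model's own self-assessment.
function checkRoleCoverage(
  claims: RoleClaim[],
  observed: ObservedCandidate[],
  requiredRoles: EvidenceRole[], // assumed to contain no duplicates
): RoleCoverage {
  const backed = claims.filter((claim) =>
    observed.some((c) => c.path === claim.path && c.evidenceRoles.includes(claim.role)),
  );
  const rejected = claims.filter((claim) => !backed.includes(claim));
  const coveredRoles = new Set(backed.map((claim) => claim.role));
  const missing = requiredRoles.filter((role) => !coveredRoles.has(role));
  const ratio =
    requiredRoles.length === 0 ? 1 : (requiredRoles.length - missing.length) / requiredRoles.length;
  // Hypothetical thresholds: full coverage is high, at least half is medium.
  const confidence: CoverageConfidence = ratio === 1 ? "high" : ratio >= 0.5 ? "medium" : "low";
  return { backed, rejected, missing, confidence };
}

// Usage with made-up paths: the persistence claim has no observed backing.
const demoCoverage = checkRoleCoverage(
  [
    { path: "app/page.tsx", role: "entry" },
    { path: "lib/save.ts", role: "persistence" },
    { path: "lib/save.ts", role: "handler" },
  ],
  [
    { path: "app/page.tsx", evidenceRoles: ["entry", "ui"] },
    { path: "lib/save.ts", evidenceRoles: ["handler"] },
  ],
  ["entry", "handler", "persistence"],
);
```

Here two of three required roles are backed, so the sketch reports medium confidence and lists `persistence` as missing regardless of what the model claimed.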
Stop using `formatRawExploreCodeResult(...)` as the data boundary for the sub-agent. Markdown is for display only. The sub-agent observation log should preserve typed results from:

- `grep`;
- `read_file`;
- `list_files`.

If a tool only returns text today, wrap it in a typed adapter at the observation boundary. The report builder should never regex-parse `#### path - symbol` or `path:line:` from rendered output when structured data is available.
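A typed adapter at the observation boundary could be sketched like this. The observation shapes and the helper names (`adaptGrepText`, `observedPaths`) are hypothetical, and the adapter assumes the text-only tool emits `path:line:text` rows with paths that contain no colon. Text is parsed once, in the adapter; everything downstream reads typed fields.

```typescript
interface GrepMatch { path: string; line: number; text: string }
interface GrepObservation { tool: "grep"; matches: GrepMatch[] }
interface ReadFileObservation {
  tool: "read_file";
  path: string;
  range: { start: number; end: number };
}
interface ListFilesObservation { tool: "list_files"; paths: string[] }

type ToolObservation = GrepObservation | ReadFileObservation | ListFilesObservation;

// Adapter for a text-only grep tool: the single place where output text is parsed.
function adaptGrepText(raw: string): GrepObservation {
  const matches: GrepMatch[] = [];
  for (const row of raw.split("\n")) {
    const m = /^([^:]+):(\d+):(.*)$/.exec(row);
    if (!m) continue; // skip rows that are not path:line:text
    const [, path = "", lineNo = "", text = ""] = m;
    matches.push({ path, line: Number(lineNo), text });
  }
  return { tool: "grep", matches };
}

// Downstream consumer: reads typed fields, never rendered markdown.
function observedPaths(log: ToolObservation[]): string[] {
  const paths = new Set<string>();
  for (const obs of log) {
    switch (obs.tool) {
      case "grep":
        obs.matches.forEach((match) => paths.add(match.path));
        break;
      case "read_file":
        paths.add(obs.path);
        break;
      case "list_files":
        obs.paths.forEach((p) => paths.add(p));
        break;
    }
  }
  return Array.from(paths).sort();
}

// Usage with made-up tool output.
const demoGrep = adaptGrepText(
  "src/a.ts:12:export function save() {}\nnot a match line\nsrc/b.ts:3:import { save } from './a'",
);
const demoPaths = observedPaths([
  demoGrep,
  { tool: "list_files", paths: ["src/c.ts", "src/a.ts"] },
]);
```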
The value model should receive a compact candidate packet, not full raw source. It should author the report:

- `recommendedPrimaryAction`;

Do not discard this report just because observations exist. Deterministic report generation should be a fallback for model failure, invalid JSON, or abort-safe degraded behavior.
Confidence is not whatever the value model says. It is validator-approved:
For answer-only tasks, high/medium can support answer_from_report. For edit/debug tasks,
high/medium can support read_targets, but the targets must be explicitly tied to the edit/debug
purpose.
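The gating rule above can be sketched as a small decision function. The names `TaskKind`, `ReadTarget`, and `choosePrimaryAction` are hypothetical, the `confidence` argument is assumed to be the validator-approved value rather than the model's claim, and falling back to `targeted_gap_search` when confidence is low or targets lack a purpose is an assumption, not something this plan states.

```typescript
type ExplorerConfidence = "high" | "medium" | "low";
type TaskKind = "answer" | "edit" | "debug";
type PrimaryAction = "answer_from_report" | "read_targets" | "targeted_gap_search";

interface ReadTarget {
  path: string;
  range: { start: number; end: number };
  purpose: string; // why the main model needs this range for the edit/debug task
}

function choosePrimaryAction(
  task: TaskKind,
  confidence: ExplorerConfidence, // validator-approved, not self-reported
  targets: ReadTarget[],
): PrimaryAction {
  // Assumption: low confidence never supports answering or reading targets directly.
  if (confidence === "low") return "targeted_gap_search";
  if (task === "answer") return "answer_from_report";
  // Edit/debug: every target must carry an explicit purpose.
  const allPurposeful =
    targets.length > 0 && targets.every((t) => t.purpose.trim().length > 0);
  return allPurposeful ? "read_targets" : "targeted_gap_search";
}

// Usage with made-up targets.
const demoTarget: ReadTarget = {
  path: "src/cart.ts",
  range: { start: 10, end: 30 },
  purpose: "confirm where the total is recomputed before editing it",
};
const demoAnswer = choosePrimaryAction("answer", "medium", []);
const demoEdit = choosePrimaryAction("edit", "high", [demoTarget]);
const demoVague = choosePrimaryAction("debug", "high", [{ ...demoTarget, purpose: "" }]);
const demoLow = choosePrimaryAction("answer", "low", []);
```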
The validator should be strict and repo-agnostic:

- `path:start-end` was observed;
- `recommendedPrimaryAction` is internally consistent:
  - `answer_from_report` has no read target and no missing critical coverage;
  - `read_targets` has observed, tightly ranged targets with concrete purposes;
  - `targeted_gap_search` has concrete terms and bounded scopes.

If validation fails, repair generically:
Delete benchmark-specific production logic from explore_code_subagent.ts, including:
Do not replace these with a renamed list of equivalent hints. If the behavior cannot be expressed as a generic structural feature, it should not be production ranking logic.
Keep or strengthen generic mechanisms:
- `recommendedPrimaryAction`;

The report sent to the main model should be treated as a bounded artifact.
Keep separate budgets for two different artifacts:
Default report budget:
The report must answer: "What does the main model now know that saves it from searching?"
If a token does not reduce future reads, remove it.
Read targets may span multiple files. Real code changes often require checking a component, handler, state update, API/persistence boundary, type contract, and test. The rule is not "one file only"; the rule is "no useless files and no huge reads."
Every read target must:
Do not make the explorer sub-agent artificially timid. The cost/context objective is not "fewest sub-agent tool calls"; it is "fewest unnecessary tokens entering the main model context." The sub-agent runs on the cheaper value model and its raw observations are discarded after compression, so it should have enough room to investigate before writing a dense report.
Default exploration budget:
Suggested caps:
Stop criteria should be evidence-driven:
`targeted_gap_search` guidance instead.

The strict budget belongs at the boundary back to the main model:

`recommendedPrimaryAction` is limited to validated read targets or bounded search guidance.

This deliberately trades more value-side exploration for less main-side rediscovery. If the sub-agent needs 10 cheap tool calls to prevent the primary model from doing 20 broad reads, that is the correct trade.
Add these to benchmark events and generated summaries:
- `reportTokens / rawObservationTokens`;
- `grep` / `list_files` calls after a high/medium report;
- `read_file` calls outside `recommendedPrimaryAction`;
- `answer_from_report` rate for answer-only tasks;

The current benchmark is contaminated by the production heuristics. Add a held-out split:
Add:
Start with a small but real held-out set: 8-12 tasks across repos that were not used to write the old heuristics. Label acceptable files by evidence role instead of exact single paths, e.g. "entry may be any of these route/page files; persistence may be any of these service/mutation files."
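Role-based labels for a held-out task might be encoded and scored as in the sketch below. The `HeldOutTask` and `ReportedFile` shapes, the `scoreRoleLabels` helper, and every path are hypothetical; the only idea taken from the paragraph above is that each role accepts any one of several files.

```typescript
interface HeldOutTask {
  id: string;
  // role -> any of these paths is an acceptable answer for that role
  acceptable: Record<string, string[]>;
}

interface ReportedFile { path: string; role: string }

interface RoleLabelScore { covered: string[]; missed: string[] }

// A role is covered when the report assigns it to at least one acceptable path.
function scoreRoleLabels(task: HeldOutTask, reported: ReportedFile[]): RoleLabelScore {
  const covered: string[] = [];
  const missed: string[] = [];
  for (const [role, paths] of Object.entries(task.acceptable)) {
    const hit = reported.some((r) => r.role === role && paths.includes(r.path));
    (hit ? covered : missed).push(role);
  }
  return { covered, missed };
}

// Usage with a made-up task: either route/page file satisfies "entry".
const demoTask: HeldOutTask = {
  id: "orders-create",
  acceptable: {
    entry: ["app/orders/page.tsx", "app/orders/route.ts"],
    persistence: ["lib/orders/service.ts", "lib/orders/mutations.ts"],
  },
};
const demoScore = scoreRoleLabels(demoTask, [
  { path: "app/orders/route.ts", role: "entry" },
  { path: "lib/db.ts", role: "persistence" },
]);
```

Scoring per role rather than per exact path keeps the held-out set from rewarding a single memorized file.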
`explore_code_subagent.ts`.

Acceptance:

`ExplorerCandidate` and typed observation result types.

`explore_code` sub-agent tool to record structured `CodeExplorerResult` data before formatting it for display.

Acceptance:
Acceptance:
Acceptance:
Acceptance:
Acceptance:
These risks are acceptable. The current alternative is a code explorer that looks good on its own benchmark and becomes less trustworthy as it accumulates special cases.
Start with Phase 1 and Phase 2 together:
Do not add new benchmark-specific compensating heuristics if scores dip. A dip is signal that the benchmark was being memorized.