Benchmark target report dry run

docs/plans/2026-06-01-benchmark-target-report-dry-run.md

Objective: Add target-owned benchmark reports and a dry-run flow. Done when the target report/history flow passes. Plan: docs/plans/2026-06-01-benchmark-target-report-dry-run.md.

Flow mode: one-shot execution

Goal plan: docs/plans/2026-06-01-benchmark-target-report-dry-run.md

Template: docs/plans/templates/task.md

Primary template: docs/plans/templates/task.md

Applied packs:

  • docs (docs/plans/templates/packs/docs.md)
  • agent-native (docs/plans/templates/packs/agent-native.md)

Task source:

  • type: user continuation
  • id / link: local thread
  • title: Generate target-owned benchmark reports/history and dry-run the E2E flow
  • acceptance criteria: target registry generates report/history without Evidence Kit, command checks pass, and a target id can dry-run through Autoresearch setup-plan

Completion threshold:

  • pnpm bench:targets:report writes target-owned report/history files.
  • pnpm bench:targets:report:check verifies those generated files are current.
  • pnpm bench:targets:dry-run -- react-active-typing-breakdown validates registry -> report model -> Autoresearch setup-plan without running expensive benchmarks.
  • slate-autoresearch generated skill text advertises report/dry-run flow.
  • Syntax, JSON, generated-skill, diff, and goal-plan checks pass.
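
The registry -> report model -> setup-plan handoff in the threshold above can be pictured with a minimal sketch. The registry shape (`targets`, `id`, `metric`, `command`), the helper names, and the placeholder command are all hypothetical; none of them are taken from benchmarks/targets/slate-v2.json or tooling/scripts/bench-targets.mjs. Treating typing_seconds as the metric name is also an assumption.

```javascript
// Hypothetical registry shape: { targets: [{ id, metric, command }] }.
// These field names are illustrative, not the real registry schema.
function findTarget(registry, targetId) {
  const target = registry.targets.find((entry) => entry.id === targetId);
  if (!target) {
    throw new Error(`unknown benchmark target: ${targetId}`);
  }
  return target;
}

// Collect what a setup-plan step would need without running the workload.
function buildDryRunPlan(registry, targetId) {
  const target = findTarget(registry, targetId);
  return {
    targetId: target.id,
    metric: target.metric,
    command: target.command, // recorded only, never executed in a dry run
    executed: false,
  };
}

const registry = {
  targets: [
    {
      id: 'react-active-typing-breakdown',
      metric: 'typing_seconds', // assumed to be the metric name
      command: 'pnpm bench:example', // placeholder, not a real script
    },
  ],
};

const plan = buildDryRunPlan(registry, 'react-active-typing-breakdown');
console.log(plan.metric); // typing_seconds
```

An unknown target id fails fast with an error, which mirrors the blocked condition that a setup-plan must be reachable from a target id.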

Verification surface:

  • command: pnpm bench:targets:check
  • command: pnpm bench:targets:report
  • command: pnpm bench:targets:report:check
  • command: pnpm bench:targets:report:dry-run
  • command: pnpm bench:targets:dry-run -- react-active-typing-breakdown
  • command: node --check tooling/scripts/bench-targets.mjs
  • command: JSON parse for package.json, benchmarks/targets/slate-v2.json, and benchmarks/targets/history/slate-v2-latest.json
  • source-audit: generated skill and README mention report/dry-run commands
  • command: git diff --check
  • command: node .agents/skills/autogoal/scripts/check-complete.mjs docs/plans/2026-06-01-benchmark-target-report-dry-run.md
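
As a rough sketch, the command checks on this surface could be run in order by a small fail-fast runner. The command strings are copied from the list above; the `runChecks` helper and its injected `run` callback are hypothetical and are not part of the repository.

```javascript
// Command strings come from the verification surface; the runner is a sketch.
const checks = [
  'pnpm bench:targets:check',
  'pnpm bench:targets:report',
  'pnpm bench:targets:report:check',
  'pnpm bench:targets:report:dry-run',
  'pnpm bench:targets:dry-run -- react-active-typing-breakdown',
  'node --check tooling/scripts/bench-targets.mjs',
  'git diff --check',
];

// `run` is injected: it receives a command string and returns an exit code.
// Injecting it keeps the sketch testable without spawning processes.
function runChecks(commands, run) {
  const results = [];
  for (const command of commands) {
    const exitCode = run(command);
    results.push({ command, exitCode });
    if (exitCode !== 0) {
      break; // stop at the first failing check
    }
  }
  const ok =
    results.length === commands.length &&
    results.every((result) => result.exitCode === 0);
  return { ok, results };
}

// Stub runner that pretends every command succeeds.
const outcome = runChecks(checks, () => 0);
console.log(outcome.ok, outcome.results.length); // true 7
```

In a real script, `run` could wrap `child_process.spawnSync` with `shell: true` and return its `status`; that wiring is left out here so the example stays free of side effects.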

Constraints:

  • Preserve existing benchmark runners.
  • Do not run expensive benchmark workloads for this dry-run slice.
  • Do not delete Evidence Kit yet; only replace active report/history path for target registry.
  • Do not commit, push, or PR.

Boundaries:

  • Source of truth: target registry from benchmarks/targets/slate-v2.json and the prior migration plan.
  • Allowed edit scope: tooling/scripts/bench-targets.mjs, package.json, benchmarks/targets/**, .agents/rules/slate-autoresearch.mdc, generated skill mirrors, this plan.
  • Browser surface: N/A, no UI changed.
  • Tracker sync: N/A.
  • Non-goals: benchmark execution, Evidence Kit deletion, moving runtime benchmark implementations.

Output budget strategy:

  • Use targeted sed, rg, command slices, and JSON summaries.
  • Do not stream full generated histories or benchmark artifacts.

Blocked condition:

  • Block only if generated report/history cannot be checked deterministically, or Autoresearch setup-plan cannot be invoked from a target id.

Task state:

  • task_type: tooling/report generation
  • task_complexity: normal
  • current_phase: closeout
  • current_phase_status: complete
  • next_phase: none
  • goal_status: complete after final goal-plan check passes

Current verdict:

  • verdict: target-owned report/history and dry-run flow are implemented
  • confidence: 0.9
  • next owner: later Evidence Kit freeze/delete once target report parity is accepted
  • reason: target registry now owns status history and Autoresearch setup can start from a target id

Completion rule:

  • Do not call update_goal(status: complete) until every completion threshold above is satisfied, final handoff evidence is recorded, and node .agents/skills/autogoal/scripts/check-complete.mjs docs/plans/2026-06-01-benchmark-target-report-dry-run.md passes.

Start Gates:

| Gate | Applies | Evidence |
| --- | --- | --- |
| Skill analysis before edits | yes | Used autogoal; task template selected with docs and agent-native packs. |
| Active goal checked or created | yes | get_goal returned no active goal; create_goal created this objective. |
| Source of truth read before edits | yes | Read target CLI, Evidence Kit report/health generation, target files, and current plan context. |
| Tracker comments and attachments read | no | N/A: no tracker. |
| Video transcript evidence required | no | N/A: no video. |
| docs/solutions checked for non-trivial existing-code work | no | N/A: task continues local benchmark architecture plan; source files were enough. |
| TDD decision before behavior change or bug fix | yes | N/A: deterministic CLI/report generation, verified by command checks rather than a new test suite. |
| Branch decision for code-changing task | yes | N/A: user did not ask for branch or PR. |
| Release artifact decision | yes | N/A: private tooling/docs change, no package release. |
| Browser tool decision for browser surface | no | N/A: no browser surface. |
| PR expectation decision | yes | N/A: no PR requested. |
| Tracker sync expectation decision | no | N/A: no tracker. |
| Output budget strategy recorded | yes | See Output budget strategy. |
| Docs pack selected | yes | README, report markdown, and plan changed. |
| docs-creator loaded | yes | Loaded in prior slice; current docs follow source-backed reference style. |
| Docs lane selected | yes | Supporting docs under tooling task. |
| Target docs and nearest sibling docs read | yes | Read target README and Evidence Kit report/health generation. |
| Docs style doctrine read | yes | Current-state docs style applied; no changelog voice. |
| Documented source owner identified | yes | benchmarks/targets/README.md documents target report/history ownership. |
| Agent-native pack selected | yes | Package scripts and skill rule changed. |
| Agent-facing action surface identified | yes | bench:targets:report, bench:targets:dry-run, and slate-autoresearch. |
| Source rule versus generated mirror boundary identified | yes | Edited .agents/rules/slate-autoresearch.mdc; ran pnpm install for generated mirrors. |
| agent-native-reviewer loaded or waiver recorded | yes | Loaded in prior slice; command discoverability and generated mirrors audited here. |

Work Checklist:

  • Short objective plus outcome, completion threshold, verification surface, constraints, boundaries, and blocked condition are concrete.
  • Task source classified with source type, title, task type, acceptance criteria, caveats, likely files, browser surface, and root-cause layer.
  • Required video or screen-recording evidence is marked N/A with reason.
  • Nearby repo instructions and implementation patterns read before edits.
  • Implementation fixes the right ownership boundary: target registry report/history, not Evidence Kit.
  • Release artifact requirement recorded as N/A.
  • Final handoff shape decided.
  • Branch handling recorded as N/A.
  • Local-env-rot retry policy recorded as N/A: no install-corruption failure.
  • Workspace authority recorded: all proof commands run from /Users/zbeyens/git/plate-2.
  • High-risk note recorded: command-contract changes are proven by CLI smoke and generated skill audit.
  • Review/autoreview target selected or marked N/A: scoped command proof is used for tooling-only slice; no app/runtime product behavior changed.
  • Agent-native review decision recorded.
  • Output budget discipline recorded and followed.
  • Docs pack: docs lane, target docs, nearest sibling docs, and source owner are recorded.
  • Docs pack: named commands and files are source-backed.
  • Docs pack: docs use current-state reference voice.
  • Docs pack: links, anchors, and previews are N/A.
  • Agent-native pack: source-of-truth rule files are edited instead of generated skill mirrors.
  • Agent-native pack: changed agent actions are discoverable from skill/rule text.
  • Agent-native pack: generated mirrors are synced.
  • Agent-native pack: accepted agent-native review findings are fixed or explicitly rejected with reason.

Completion Gates:

| Gate | Applies | Required action | Evidence |
| --- | --- | --- | --- |
| Named verification threshold | yes | Run target report/check/dry-run commands | Passed; see Verification evidence. |
| Bug reproduced before fix | no | Record N/A | N/A: feature/tooling slice. |
| Targeted behavior verification | yes | Run focused CLI proof | bench:targets:report:check and bench:targets:dry-run passed. |
| TypeScript or typed config changed | no | Record N/A | N/A: JS/JSON/Markdown only. |
| Package exports or file layout changed | no | Record N/A | N/A: no package exports. |
| Package manifests, lockfile, or install graph changed | yes | Run install/sync when needed | pnpm install passed and regenerated skills; lockfile already up to date. |
| Agent rules or skills changed | yes | Run pnpm install and verify generated skill sync | pnpm install passed; rg found report/dry-run guidance in source and generated skills. |
| Workspace authority proof | yes | Run verification in owning workspace | All commands run in /Users/zbeyens/git/plate-2. |
| Browser surface changed | no | Record N/A | N/A: no browser UI. |
| Browser final proof | no | Record N/A | N/A: no browser UI. |
| CI-controlled template output changed | no | Record N/A | N/A. |
| Package behavior or public API changed | no | Record N/A | N/A: private tooling. |
| Registry-only component work changed | no | Record N/A | N/A. |
| Docs or content changed | yes | Verify source-backed claims | README/report commands match package scripts and CLI. |
| High-risk mini gate | yes | Record failure mode/proof plan | Failure mode: agent runs stale/hidden Evidence Kit flow; proof: skill text points to target reports and dry-run commands. |
| Agent-native review for agent/tooling changes | yes | Verify command discoverability | rg audit passed for source/generated skills and README. |
| Local install corruption suspected | no | Record N/A | N/A: no corruption-shaped failure. |
| Autoreview for non-trivial implementation changes | no | Record N/A | N/A: scoped tooling/report slice verified by direct CLI, syntax, JSON, and generated-skill audit; no app/runtime behavior. |
| PR create or update | no | Record N/A | N/A: no PR requested. |
| Task-style PR body verified | no | Record N/A | N/A: no PR. |
| PR proof image hosting | no | Record N/A | N/A. |
| Tracker sync-back | no | Record N/A | N/A. |
| Final handoff contract | yes | Fill final handoff fields | Filled below. |
| Final lint | yes | Run scoped equivalent | node --check, JSON parse, and git diff --check are the scoped lint equivalents. |
| Output budget discipline | yes | Verify no unbounded output | Used capped command output and summaries; one shell quoting mistake corrected. |
| Goal plan complete | yes | Run goal checker | Final command runs after this edit. |
| Docs source-backed claim audit | yes | Verify docs claims against source | Report/README claims match CLI and scripts. |
| Docs links / routes / previews | no | Record N/A | N/A. |
| Docs MDX/content parser | no | Record N/A | N/A: Markdown only, not site content. |
| Plugin page specifics | no | Record N/A | N/A. |
| Agent source / generated sync | yes | Run pnpm install | Passed. |
| Agent action discoverability | yes | Source-audit skill/rule path | Passed. |
| Agent-native review | yes | Close accepted findings or N/A | No accepted findings after source/generated command audit. |

Phase / pass table:

| Phase | Status | Evidence | Next |
| --- | --- | --- | --- |
| Intake and source read | complete | Read target CLI and Evidence Kit report/health shape. | implementation |
| Implementation | complete | Added report/history/dry-run commands and generated outputs. | verification |
| Verification | complete | Target report/check/dry-run, syntax, JSON, generated-skill audits passed. | closeout |
| PR / tracker sync | complete | N/A: no PR/tracker requested. | final response |
| Closeout | complete | Final plan filled and final checks running. | final response |

Findings:

  • Existing Evidence Kit report generation is tied to benchmarks/editor/docs/perf/**.
  • Target registry can now generate benchmarks/targets/history/slate-v2-latest.json and benchmarks/targets/reports/slate-v2.md without running benchmark workloads.
  • Target report status correctly separates required-missing from optional-missing artifacts.
  • The dry run proves that the react-active-typing-breakdown target can create an Autoresearch setup-plan with typing_seconds.
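
The split between required-missing and optional-missing artifacts can be illustrated with a small classifier. This is a sketch under stated assumptions: the artifact shape (`path`, `required`) and the `missing-required-artifact` status name are hypothetical, while `missing-optional-artifact` and the summary field names follow the wording used elsewhere in this plan.

```javascript
// Hypothetical artifact shape: { path, required }.
// 'missing-required-artifact' is an assumed status name.
function classifyTarget(target, existingPaths) {
  const missing = target.artifacts.filter(
    (artifact) => !existingPaths.has(artifact.path),
  );
  if (missing.some((artifact) => artifact.required)) {
    return 'missing-required-artifact';
  }
  if (missing.length > 0) {
    return 'missing-optional-artifact';
  }
  return 'ok';
}

// Count targets per status so optional gaps are never folded into "ok".
function summarize(targets, existingPaths) {
  const summary = {
    targets: targets.length,
    missingOptionalArtifacts: 0,
    missingRequiredArtifacts: 0,
  };
  for (const target of targets) {
    const status = classifyTarget(target, existingPaths);
    if (status === 'missing-required-artifact') {
      summary.missingRequiredArtifacts += 1;
    } else if (status === 'missing-optional-artifact') {
      summary.missingOptionalArtifacts += 1;
    }
  }
  return summary;
}

// Illustrative data only; these paths do not exist in the repository.
const sampleTargets = [
  { id: 'a', artifacts: [{ path: 'a.json', required: true }] },
  {
    id: 'b',
    artifacts: [
      { path: 'b.json', required: true },
      { path: 'b-trace.json', required: false },
    ],
  },
  { id: 'c', artifacts: [{ path: 'c.json', required: true }] },
];
const onDisk = new Set(['a.json', 'b.json']);

const sampleSummary = summarize(sampleTargets, onDisk);
console.log(sampleSummary);
```

A required gap takes precedence over an optional one, so a target that is missing both kinds is reported once, under the more severe status.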

Decisions and tradeoffs:

  • Added target-native report/history instead of porting Evidence Kit report code. This avoids keeping Evidence Kit as the active control plane.
  • Report generation is deterministic and checkable; it does not include generated timestamps.
  • Dry-run invokes Autoresearch setup-plan, not setup, so it proves handoff without starting an optimization loop.
  • Optional missing artifacts are visible as missing-optional-artifact, not hidden as ok.
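
One way to read the "deterministic and checkable" decision: if the renderer sorts its input and embeds no timestamps, a check step can regenerate the output and compare it byte for byte with the file on disk. The sketch below assumes a hypothetical history shape (`targets` with `id` and `status`); it is not the format of benchmarks/targets/history/slate-v2-latest.json.

```javascript
// Render with stable ordering and no timestamps, so the output depends only
// on the input. The { id, status } shape is a hypothetical stand-in.
function renderHistory(targets) {
  const sorted = [...targets].sort((a, b) => {
    if (a.id < b.id) return -1;
    if (a.id > b.id) return 1;
    return 0;
  });
  const body = {
    targets: sorted.map(({ id, status }) => ({ id, status })),
  };
  return `${JSON.stringify(body, null, 2)}\n`;
}

// A check passes only when regenerating yields exactly the stored text.
function isCurrent(targets, storedText) {
  return renderHistory(targets) === storedText;
}

const entries = [
  { id: 'beta', status: 'ok' },
  { id: 'alpha', status: 'missing-optional-artifact' },
];
const stored = renderHistory(entries);

console.log(isCurrent([...entries].reverse(), stored)); // true
console.log(isCurrent([{ id: 'alpha', status: 'ok' }], stored)); // false
```

Because input order does not affect the output, the check stays stable across machines and runs; a generated timestamp would make every check fail.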

Implementation notes:

  • tooling/scripts/bench-targets.mjs now supports report and dry-run.
  • package.json exposes bench:targets:report, bench:targets:report:check, bench:targets:report:dry-run, and bench:targets:dry-run.
  • benchmarks/targets/README.md documents generated outputs and dry-run flow.
  • .agents/rules/slate-autoresearch.mdc tells agents to generate/check reports and dry-run before setup-target.
  • pnpm bench:targets:report generated the history and report files.

Review fixes:

  • Fixed report status model so optional missing artifacts are not shown as plain ok.
  • Fixed markdown table row formatting to use full pipe rows.

Error attempts:

| Error / failed attempt | Count | Next different move | Resolution |
| --- | --- | --- | --- |
| Shell command used an unescaped template literal and zsh expanded ${...} | 1 | Rerun Node summary with single-quoted command | Rerun passed and showed 2 optional-missing targets. |
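
The failure above is a general quoting hazard: inside double quotes, zsh and other POSIX-style shells expand `${...}` before Node ever sees the script. The exact command that failed is not recorded here, so the following is only an illustration of the single-quoting fix, using a hypothetical `shellSingleQuote` helper and a made-up inline script.

```javascript
// Quote a string for a POSIX-style shell. Inside single quotes nothing is
// expanded, so only embedded single quotes need special handling: close the
// quote, emit an escaped quote, and reopen.
function shellSingleQuote(text) {
  return "'" + String(text).replace(/'/g, "'\\''") + "'";
}

// An inline script containing a JavaScript template literal. The ${...} part
// must reach Node unchanged. This script text is a made-up example.
const inlineScript = 'const n = 2; console.log(`optionalMissing=${n}`)';

const shellCommand = `node -e ${shellSingleQuote(inlineScript)}`;
console.log(shellCommand);
// node -e 'const n = 2; console.log(`optionalMissing=${n}`)'
```

With double quotes instead, the shell would try to substitute `${n}` (and treat the backticks as command substitution) before Node runs.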

Verification evidence:

  • /Users/zbeyens/git/plate-2: pnpm bench:targets:report -> wrote benchmarks/targets/history/slate-v2-latest.json and benchmarks/targets/reports/slate-v2.md.
  • /Users/zbeyens/git/plate-2: pnpm bench:targets:check -> benchmark-targets ok: 23 targets.
  • /Users/zbeyens/git/plate-2: pnpm bench:targets:report:check -> checked generated history/report.
  • /Users/zbeyens/git/plate-2: pnpm bench:targets:report:dry-run -> targets=23 missingRequired=0.
  • /Users/zbeyens/git/plate-2: pnpm bench:targets:dry-run -- react-active-typing-breakdown -> dry-run ok, targets=23, missingOptionalArtifacts=2, missingRequiredArtifacts=0, autoresearchSetupOk=true.
  • /Users/zbeyens/git/plate-2: node --check tooling/scripts/bench-targets.mjs -> passed.
  • /Users/zbeyens/git/plate-2: JSON parse for package.json, target registry, and generated history -> passed.
  • /Users/zbeyens/git/plate-2: rg audit for report/dry-run commands in source/generated skills, README, and package scripts -> passed.
  • /Users/zbeyens/git/plate-2: git diff --check -> final run after this edit.
  • /Users/zbeyens/git/plate-2: final goal-plan check -> final run after this edit.
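
If the dry-run summary is printed as `key=value` pairs like the ones quoted above, it can be turned into a structured object for assertions. This is a minimal sketch: the exact output format of the CLI is an assumption based on the evidence lines in this plan, and `parseSummary` is a hypothetical helper.

```javascript
// Parse "key=value" pairs from a summary line into typed fields.
// Integers become numbers, true/false become booleans, the rest stay strings.
function parseSummary(line) {
  const fields = {};
  for (const [, key, raw] of line.matchAll(/([A-Za-z]\w*)=([^\s,]+)/g)) {
    if (raw === 'true' || raw === 'false') {
      fields[key] = raw === 'true';
    } else if (/^-?\d+$/.test(raw)) {
      fields[key] = Number(raw);
    } else {
      fields[key] = raw;
    }
  }
  return fields;
}

// Sample line assembled from the values recorded in the evidence above.
const summaryLine =
  'dry-run ok, targets=23, missingOptionalArtifacts=2, ' +
  'missingRequiredArtifacts=0, autoresearchSetupOk=true';

const parsed = parseSummary(summaryLine);
console.log(parsed.targets, parsed.missingRequiredArtifacts, parsed.autoresearchSetupOk);
// 23 0 true
```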

Final handoff contract:

  • PR line: N/A, no PR requested.
  • Issue / tracker line: N/A, no tracker.
  • Confidence line: high for target report/dry-run slice; Evidence Kit deletion remains later work.
  • Flow table:
    • Reproduced: N/A, feature/tooling slice
    • Verified: target check/report/dry-run/syntax/JSON/generated-skill audit
  • Browser check: N/A.
  • Outcome: target registry now owns generated report/history and can dry-run an Autoresearch setup by target id.
  • Caveat: benchmark workloads still mostly use wrapped timing until targets emit native METRIC/ARTIFACT lines.
  • Design:
    • Chosen boundary: target registry owns report/history; Autoresearch owns active loop state.
    • Why not quick patch: copying Evidence Kit report code would keep split-brain alive.
    • Why not broader change: deleting Evidence Kit waits until report parity is accepted.
  • Verified: see Verification evidence.
  • PR body verified: N/A.

Task-style PR body contract:

  • N/A: no PR requested.

Final handoff / sync:

  • PR: N/A.
  • Issue / tracker: N/A.
  • Browser proof: N/A.
  • Caveats: native benchmark METRIC/ARTIFACT output still needs per-target work.

Timeline:

  • 2026-06-01T10:28:10.054Z Task goal plan created.
  • 2026-06-01T10:29:00Z Active goal created.
  • 2026-06-01T10:31:00Z Read target CLI and Evidence Kit report/health code.
  • 2026-06-01T10:36:00Z Added target report/history and dry-run commands.
  • 2026-06-01T10:38:00Z Generated target report/history files.
  • 2026-06-01T10:41:00Z Synced generated skill mirrors with pnpm install.
  • 2026-06-01T10:44:00Z Ran E2E dry-run and corrected optional-artifact status model.

Reboot status:

| Question | Answer |
| --- | --- |
| Where am I? | Closeout complete. |
| Where am I going? | Run final diff and goal checks, then final response. |
| What is the goal? | Add target-owned report/history and prove E2E dry-run works. |
| What have I learned? | Target report/history can replace active Evidence Kit reporting without running benchmarks. |
| What have I done? | Implemented report/history generation, package scripts, docs, agent guidance, generated outputs, and dry-run proof. |

Open risks:

  • Native METRIC/ARTIFACT output remains per-target migration work.
  • Evidence Kit deletion remains deferred until the generated target report is accepted as sufficient replacement.