docs/plans/2026-05-28-evidence-control-plane-hard-cut.md
Objective: Replace the active editor benchmark workflow with an Evidence Kit control plane, registered active benchmark artifacts only, health/next-action reporting, refresh command, generated dashboard cockpit, and docs that say old one-off benchmarks are discarded unless registered.
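The "registered active benchmark artifacts only" rule in the objective can be sketched as a filter over whatever files are found on disk. This is a minimal illustration, not the actual ingest code in src/index.mjs: the registry shape (`artifacts`, `id`, `path`, `active`), the sample paths, and the `selectArtifacts` helper are all assumptions made for the example.

```javascript
// Hypothetical registry shape; the real benchmark-registry.json schema may differ.
const registry = {
  artifacts: [
    { id: 'rich-text-typing', path: 'results/rich-text-typing.json', active: true },
    { id: 'internals-normalize', path: 'results/internals-normalize.json', active: true },
    { id: 'old-experiment', path: 'tmp/old-experiment.json', active: false },
  ],
};

// Split discovered files into those the registry marks active and everything else.
function selectArtifacts(registry, discoveredPaths) {
  const activePaths = new Set(
    registry.artifacts.filter((entry) => entry.active).map((entry) => entry.path),
  );
  const ingested = discoveredPaths.filter((p) => activePaths.has(p));
  const discarded = discoveredPaths.filter((p) => !activePaths.has(p));
  return { ingested, discarded };
}

const result = selectArtifacts(registry, [
  'results/rich-text-typing.json',
  'tmp/old-experiment.json',
  'tmp/one-off-2025.json',
]);
console.log(result);
```

In this sketch an artifact that exists on disk but is unregistered, or registered but inactive, never reaches the ingest step, which is the boundary the objective describes.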
Completion threshold:
- benchmarks/editor/research/benchmark-registry.json.
- benchmarks/editor/src/index.mjs ingests active artifacts from the registry instead of hardcoded Slate v2 artifact lists/fallbacks.
- benchmarks/results/benchmark-health-latest.json with status counts, missing/stale/ignored artifacts, and ranked next actions.
- evidence:refresh and include health generation in full/check flows.
- docs/perf/index.html surfaces benchmark health and next actions.
- cd benchmarks/editor && npm run check passes.
- /index.html contains benchmark health content.
- node .agents/rules/autogoal/scripts/check-complete.mjs docs/plans/2026-05-28-evidence-control-plane-hard-cut.md passes.

Verification surface:
- benchmarks/editor/package.json
- benchmarks/editor/src/index.mjs
- benchmarks/editor/benchmarks/benchmark-health.mjs
- benchmarks/editor/research/benchmark-registry.json
- benchmarks/editor/docs/perf/*
- cd benchmarks/editor && npm run check
- http://127.0.0.1:8765/index.html

Constraints:
Boundaries:
- benchmarks/editor, generated perf docs, root benchmark script aliases, and this plan.
- http://127.0.0.1:8765/.

Blocked condition:
Task state:
Current verdict:
Completion rule:
Start Gates:
| Gate | Applies | Evidence |
|---|---|---|
| Skill analysis before edits | yes | Loaded major-task, hard-cut, and autogoal. |
| Active goal checked or created | yes | Active goal created for Evidence Kit control-plane replacement. |
| Source of truth read before edits | yes | Read current benchmarks/editor scripts, source map, package scripts, Slate v2 benchmark README, and prior memory note. |
| Tracker comments and attachments read | no | N/A: no tracker. |
| Video transcript evidence required | no | N/A: no video. |
| docs/solutions checked for non-trivial existing-code work | no | N/A: benchmark docs/source map and Slate v2 benchmark README were the direct source. |
| TDD decision before behavior change or bug fix | yes | No TDD; registry/health script checks and package check are the right proof. |
| Branch decision for code-changing task | yes | No branch action requested. |
| Release artifact decision | yes | No changeset; private benchmark lab only. |
| Browser tool decision for browser surface | yes | Used route-level static server proof. |
| PR expectation decision | yes | No PR requested. |
| Tracker sync expectation decision | yes | No tracker sync requested. |
Work Checklist:
- <video-transcripts> XML, or marked N/A with reason.
- /Users/zbeyens/git/plate-2/benchmarks/editor and route proof hit the local static docs server.

Completion Gates:
| Gate | Applies | Required action | Evidence |
|---|---|---|---|
| Named verification threshold | yes | Run package checks and served route proof. | npm run check passed; served /index.html proof passed. |
| Bug reproduced before fix | no | N/A: workflow replacement. | N/A. |
| Targeted behavior verification | yes | Verify registry ingest, health output, docs index. | npm run evidence:refresh passed; health has 23 active artifacts and 10 next actions; served index has health content. |
| TypeScript or typed config changed | no | N/A. | JS syntax checks passed through npm run check. |
| Package exports or file layout changed | no | N/A. | No barrels. |
| Package manifests, lockfile, or install graph changed | yes | Run benchmark package check. | cd benchmarks/editor && npm run check passed. |
| Agent rules or skills changed | no | N/A. | No agent sync. |
| Workspace authority proof | yes | Run checks in benchmarks/editor. | npm run evidence:refresh, npm run docs:perf:check, and npm run check ran in benchmarks/editor. |
| Browser surface changed | yes | Route proof for generated index. | HTTP proof: status 200, health/control-plane/discard text present, 2 primary cards, Evidence Kit not primary. |
| Browser final proof | yes | Record route proof or caveat. | Route-level static proof recorded; no screenshot needed. |
| CI-controlled template output changed | no | N/A. | No templates. |
| Package behavior or public API changed | no | N/A. | No changeset. |
| Registry-only component work changed | no | N/A. | No changelog. |
| Docs or content changed | yes | Regenerate and check docs. | npm run docs:perf:check and npm run check passed. |
| High-risk mini gate | yes | Record realistic failure mode and why chosen boundary is right. | Failure mode: random old tmp JSON becomes active evidence again; registry + ignored-artifact health count prevents it. |
| Agent-native review for agent/tooling changes | no | N/A. | No agent surfaces. |
| Local install corruption suspected | no | N/A. | No reinstall. |
| Autoreview for non-trivial implementation changes | no | N/A: scoped benchmark lab workflow. | Full package check and route proof passed. |
| PR create or update | no | N/A. | No PR. |
| PR proof image hosting | no | N/A. | No PR. |
| Tracker sync-back | no | N/A. | No tracker. |
| Final handoff contract | yes | Fill final handoff fields. | Final handoff fields completed below. |
| Final lint | yes | Run scoped Biome/check. | npx biome check ... --fix passed after correcting the root package path. |
| Goal plan complete | yes | Run autogoal checker. | To run after this update. |
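The high-risk gate above relies on the health report counting ignored artifacts so that stray old JSON cannot silently become active evidence. A rough sketch of such a report follows. It is not the contents of benchmark-health.mjs: the input shapes, the 14-day staleness window, the status names, and the `buildHealth` helper are assumptions chosen for illustration.

```javascript
// Hypothetical inputs: registered entries plus a map of on-disk path -> age in days.
function buildHealth(entries, ageDaysByPath, maxAgeDays = 14) {
  const registered = new Set(entries.map((entry) => entry.path));

  // Classify every registered artifact as fresh, stale, or missing.
  const statuses = entries.map((entry) => {
    const age = ageDaysByPath[entry.path];
    let status = 'fresh';
    if (age === undefined) status = 'missing';
    else if (age > maxAgeDays) status = 'stale';
    return { id: entry.id, path: entry.path, status };
  });

  // Files on disk that the registry does not know about are ignored, not ingested.
  const ignored = Object.keys(ageDaysByPath).filter((p) => !registered.has(p));

  const counts = { fresh: 0, stale: 0, missing: 0, ignored: ignored.length };
  for (const s of statuses) counts[s.status] += 1;

  // Rank next actions: missing artifacts first, then stale ones.
  const priority = { missing: 0, stale: 1 };
  const nextActions = statuses
    .filter((s) => s.status !== 'fresh')
    .sort((a, b) => priority[a.status] - priority[b.status])
    .map((s) => `${s.status === 'missing' ? 'generate' : 'rerun'} ${s.id}`);

  return { counts, ignored, nextActions };
}

const health = buildHealth(
  [
    { id: 'a', path: 'results/a.json' },
    { id: 'b', path: 'results/b.json' },
    { id: 'c', path: 'results/c.json' },
  ],
  { 'results/a.json': 1, 'results/b.json': 30, 'tmp/old.json': 100 },
);
console.log(JSON.stringify(health, null, 2));
```

Reporting the ignored files rather than dropping them silently keeps the discard decision visible in the health output.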
Phase / pass table:
| Phase | Status | Evidence | Next |
|---|---|---|---|
| Intake and source read | complete | Skills, memory, benchmark docs/scripts read. | implementation |
| Implementation | complete | Registry, health script, refresh scripts, docs/index cockpit, README/source-map/iteration updated. | verification |
| Verification | complete | evidence:refresh, docs:perf:check, check, and served route proof passed. | closeout |
| PR / tracker sync | complete | N/A. | final response |
| Closeout | complete | Plan ready for autogoal checker. | final response |
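The served route proof recorded in the Verification phase amounts to a status check plus a few content assertions on the generated index. The sketch below shows one way to express that as a pure function. The marker strings, the `class="card primary"` markup, and the `checkIndex` helper are hypothetical; the generated index.html may use different text and structure.

```javascript
// Hypothetical markers; adjust to whatever the generated index.html actually contains.
function checkIndex(status, html) {
  const text = html.toLowerCase();
  const primaryCards = (html.match(/class="card primary"/g) || []).length;
  return {
    statusOk: status === 200,
    hasHealth: text.includes('benchmark health'),
    hasNextAction: text.includes('next action'),
    primaryCards,
  };
}

// Sample document standing in for the served page.
const sampleHtml = `
  <section><h2>Benchmark health</h2><p>Next action: rerun b</p></section>
  <a class="card primary" href="rich-text.html">Rich text</a>
  <a class="card primary" href="internals.html">Internals</a>
  <a class="card" href="evidence.html">Evidence</a>
`;
console.log(checkIndex(200, sampleHtml));
```

Against the local static server, the same function could be fed the status and body returned by `fetch('http://127.0.0.1:8765/index.html')` on a Node version that ships the global `fetch`.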
Findings:
Decisions and tradeoffs:
Implementation notes:
- research/benchmark-registry.json with active artifacts and workload mappings.
- benchmarks/benchmark-health.mjs.
- evidence:health, evidence:refresh, and root bench:editor:health / bench:editor:refresh aliases.
- index.html to show health, next actions, and ignored old artifacts.

Review fixes:
Error attempts:
| Error / failed attempt | Count | Next different move | Resolution |
|---|---|---|---|
| Biome path used ../package.json from benchmarks/editor, resolving to missing benchmarks/package.json | 1 | Reran Biome from repo root with explicit paths | Passed. |
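The path mistake in the table above is easy to reproduce: one `..` from benchmarks/editor climbs only to benchmarks/, not to the repo root. The sketch below demonstrates this with the WHATWG URL parser. The `/repo` root and the `resolveFrom` helper are hypothetical and used only for illustration.

```javascript
// Resolve a relative path against a working directory using the WHATWG URL parser.
// "/repo" is a hypothetical repository root used only for illustration.
function resolveFrom(cwd, relativePath) {
  return new URL(relativePath, `file://${cwd}/`).pathname;
}

const cwd = '/repo/benchmarks/editor';
console.log(resolveFrom(cwd, '../package.json'));    // /repo/benchmarks/package.json
console.log(resolveFrom(cwd, '../../package.json')); // /repo/package.json
```

Running the tool from the repo root with explicit paths, as the resolution column records, avoids counting `..` segments at all.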
Verification evidence:
- npx biome check benchmarks/editor/src/index.mjs benchmarks/editor/benchmarks/benchmark-health.mjs benchmarks/editor/benchmarks/render-perf-index.mjs benchmarks/editor/package.json package.json benchmarks/editor/research/benchmark-registry.json benchmarks/editor/README.md benchmarks/editor/research/evidence-source-map.md benchmarks/editor/iterations/003-evidence-control-plane.md --fix passed from repo root.
- node --check src/index.mjs, node --check benchmarks/benchmark-health.mjs, and node --check benchmarks/render-perf-index.mjs passed in benchmarks/editor.
- npm run bench:rich-text:check && npm run evidence:health passed in benchmarks/editor.
- npm run evidence:refresh passed in benchmarks/editor.
- npm run docs:perf:check && npm run check passed in benchmarks/editor.
- http://127.0.0.1:8765/index.html: status 200, health/control-plane/discard text present, 2 primary cards, rich-text and internals primary links present, evidence not primary, next action present.

Final handoff contract:
- evidence:refresh, docs:perf:check, check, and HTTP route proof passed.

Final handoff / sync:
Timeline:
Reboot status:
| Question | Answer |
|---|---|
| Where am I? | Closeout |
| Where am I going? | Final response |
| What is the goal? | Evidence Kit becomes the active benchmark control plane and old unregistered benchmarks are discarded. |
| What have I learned? | Registry-driven ingest is the missing boundary. |
| What have I done? | Added registry, health, refresh command, docs cockpit, and proof. |
Open risks: