docs/plans/2026-05-28-refresh-five-stale-benchmark-artifacts.md
Objective:
Refresh the five stale active Evidence Kit artifacts for react-huge-document-browser-trace, core-normalization-current, core-query-ref-observation, core-refs-projection, and history-compare; regenerate benchmark health/docs; verify those five refresh actions disappear while the dashboard stays scoped to Slate and Slate v2.
Goal plan: docs/plans/2026-05-28-refresh-five-stale-benchmark-artifacts.md
Task source:
- refresh-react-huge-document-browser-trace, refresh-core-normalization-current, refresh-core-query-ref-observation, refresh-core-refs-projection, refresh-history-compare
- rich-text-data.json remains Slate-only.

Completion threshold:
- .tmp/slate-v2 runs all five active registry commands successfully.
- benchmarks/editor runs npm run check successfully after refresh.
- benchmarks/editor/benchmarks/results/benchmark-health-latest.json has none of the five refresh action IDs.
- http://127.0.0.1:8765/rich-text-data.json contains no Plate, ProseMirror, Lexical, TipTap, chunk-off, or runtime-adapter scope.
- node .agents/rules/autogoal/scripts/check-complete.mjs docs/plans/2026-05-28-refresh-five-stale-benchmark-artifacts.md passes.

Verification surface:
- .tmp/slate-v2: bun run bench:react:huge-document:browser-trace:local
- .tmp/slate-v2: bun run bench:core:normalization:local
- .tmp/slate-v2: bun run bench:core:query-ref-observation:local
- .tmp/slate-v2: bun run bench:core:refs-projection:local
- .tmp/slate-v2: HISTORY_BENCH_LEGACY_REPO=../../../slate bun run bench:history:compare:local
- .tmp/slate-v2: bun lint
- benchmarks/editor: npm run check
- rich-text-data.json Slate-only fetch audit

Constraints:
- slate and slate-v2.

Boundaries:
- benchmarks/editor/research/benchmark-registry.json, generated Evidence Kit health output, and registered JSON artifacts under .tmp/slate-v2/tmp.
- http://127.0.0.1:8765/.

Blocked condition:
Task state:
Current verdict:
Start Gates:
| Gate | Applies | Evidence |
|---|---|---|
| Skill analysis before edits | yes | Used autogoal because the task had five measurable health-action removals. |
| Active goal checked or created | yes | Created goal for refreshing five stale active artifacts and verifying health/docs/data. |
| Source of truth read before edits | yes | Read current health next actions and benchmarks/editor/research/benchmark-registry.json. |
| Tracker comments and attachments read | no | No tracker or attachment was part of this request. |
| Video transcript evidence required | no | No video or screen recording supplied. |
| docs/solutions checked for non-trivial existing-code work | no | Focused benchmark refresh; local registry/scripts were the source of truth. |
| TDD decision before behavior change or bug fix | yes | No new test added; target benchmark commands and Evidence Kit health are the regression proof. |
| Branch decision for code-changing task | yes | No branch action; user requested local execution only. |
| Release artifact decision | yes | No package release artifact; benchmark harness/registry only. |
| Browser tool decision for browser surface | yes | Browser MCP was not exposed after tool discovery; direct HTTP JSON fetch verified served dashboard data. |
| PR expectation decision | yes | No PR requested. |
| Tracker sync expectation decision | yes | No tracker sync requested. |
Work Checklist:
- .tmp/slate-v2, benchmarks/editor, or repo root.

Completion Gates:
| Gate | Applies | Required action | Evidence |
|---|---|---|---|
| Named verification threshold | yes | Run all five benchmark commands, Evidence Kit check, health audit, artifact audit, and live JSON audit | Completed. |
| Bug reproduced before fix | yes | Record failing command before fix | bun run bench:history:compare:local failed on missing .tmp/slate/package.json; env override then exposed stale withHistory import against current Slate v2. |
| Targeted behavior verification | yes | Run each target benchmark command | All five target commands passed after fixing history compare. |
| TypeScript or typed config changed | no | No typecheck required | Changed JS benchmark script and JSON registry only. |
| Package exports or file layout changed | no | No barrel action | No exported layout changed. |
| Package manifests, lockfile, or install graph changed | no | No install needed | No manifest or lockfile changed. |
| Agent rules or skills changed | no | No skill sync | No .agents source edited. |
| Workspace authority proof | yes | Run verification in owning workspaces | Target commands and lint ran in .tmp/slate-v2; Evidence Kit suite ran in benchmarks/editor; HTTP data audit ran from repo root. |
| Browser surface changed | yes | Verify generated/served dashboard data | Direct fetch of served rich-text-data.json passed with no forbidden non-Slate terms. |
| Browser final proof | yes | Record caveat | Browser MCP was not exposed; live HTTP JSON proof covers the dashboard data source. |
| CI-controlled template output changed | no | No template output | No templates/** changed. |
| Package behavior or public API changed | no | No changeset | Benchmark harness only. |
| Registry-only component work changed | no | No component changelog | Not registry component work. |
| Docs or content changed | yes | Verify generated docs | npm run check ran docs:perf, docs:rich-text:check, and docs:index:check successfully. |
| High-risk mini gate | yes | Record failure mode, proof plan, and boundary | Failure mode was stale command/API drift making refresh impossible; boundary is benchmark harness and Evidence Kit command registry. |
| Agent-native review for agent/tooling changes | no | No agent-native review | No agent/tooling changes. |
| Local install corruption suspected | no | No reinstall | Failures were deterministic path/API errors, not local install corruption. |
| Autoreview for non-trivial implementation changes | no | No autoreview | Focused benchmark harness repair; target commands and Evidence Kit check are sufficient. |
| PR create or update | no | No PR | User did not ask for PR. |
| PR proof image hosting | no | No PR body | Not applicable. |
| Tracker sync-back | no | No tracker | Not applicable. |
| Final handoff contract | yes | Fill final handoff fields | Completed below. |
| Final lint | yes | Run scoped equivalent | .tmp/slate-v2: bun lint passed; benchmarks/editor: npm run check passed. |
| Goal plan complete | yes | Run autogoal checker | To be run after this file update. |
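
The live JSON audit behind the "Browser surface changed" gate could look roughly like the sketch below. It is not the audit that was actually run: `findForbiddenTerms` and `auditServedData` are hypothetical names, and the case-insensitive whole-term matching is an assumption chosen so that a word like "template" is not mistaken for "Plate". Only the list of forbidden terms comes from the completion threshold in this plan.

```javascript
// Forbidden scope terms, copied from the completion threshold of this plan.
const FORBIDDEN_TERMS = ['Plate', 'ProseMirror', 'Lexical', 'TipTap', 'chunk-off', 'runtime-adapter'];

// Escape regex metacharacters so each term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: return the forbidden terms found in the payload.
// Assumption: matching is case-insensitive, and a term only counts when it is
// not embedded in a longer alphanumeric word.
function findForbiddenTerms(payload, terms = FORBIDDEN_TERMS) {
  const text = typeof payload === 'string' ? payload : JSON.stringify(payload);
  return terms.filter((term) => {
    const pattern = new RegExp(`(?<![A-Za-z0-9])${escapeRegExp(term)}(?![A-Za-z0-9])`, 'i');
    return pattern.test(text);
  });
}

// Hypothetical wrapper: fetch the served dashboard data and audit the raw body.
// Assumes a Node.js version that provides a global fetch.
async function auditServedData(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
  return findForbiddenTerms(await response.text());
}
```

Under these assumptions, calling `auditServedData('http://127.0.0.1:8765/rich-text-data.json')` while the dashboard server is running should resolve to an empty array when the data is Slate-only.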
Phase / pass table:
| Phase | Status | Evidence | Next |
|---|---|---|---|
| Intake and source read | complete | Read five refresh actions and registry commands. | none |
| Implementation | complete | Refreshed four artifacts directly; fixed history registry command and compare runner drift. | none |
| Verification | complete | All target benchmark commands passed, Evidence Kit check passed, health/live-data audits passed. | none |
| PR / tracker sync | complete | No PR or tracker requested. | none |
| Closeout | complete | Durable evidence recorded; autogoal checker remains final mechanical gate. | final response |
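
The health audit named in the Verification phase reduces to a set-membership check, sketched below. The helper name `remainingRefreshActions` is hypothetical, and the shape of benchmark-health-latest.json is not assumed here: the caller is expected to extract the next-action IDs from that file in whatever form it actually uses.

```javascript
// The five refresh action IDs listed under "Task source" in this plan.
const REFRESH_ACTION_IDS = [
  'refresh-react-huge-document-browser-trace',
  'refresh-core-normalization-current',
  'refresh-core-query-ref-observation',
  'refresh-core-refs-projection',
  'refresh-history-compare',
];

// Hypothetical helper: given the IDs of the current health next actions,
// return the targeted refresh IDs that are still present. An empty result
// means all five refresh actions have disappeared.
function remainingRefreshActions(actionIds, targets = REFRESH_ACTION_IDS) {
  const present = new Set(actionIds);
  return targets.filter((id) => present.has(id));
}
```

An empty return value corresponds to the "none of the five refresh IDs remain" outcome; any returned ID names an artifact that still needs a refresh.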
Findings:
- history-compare defaulted to .tmp/slate, but this repo workflow compares .tmp/slate-v2 against ../../../slate.
- history-compare still used old withHistory assumptions. Current Slate v2 uses history() extensions and tx.history.undo()/redo().

Decisions and tradeoffs:
- HISTORY_BENCH_LEGACY_REPO=../../../slate, matching the established Slate vs Slate-v2 checkout layout.

Implementation notes:
- history-compare uses HISTORY_BENCH_LEGACY_REPO=../../../slate.
- history() and withHistory() APIs.
- benchmarks/editor.

Review fixes:
Error attempts:
| Error / failed attempt | Count | Next different move | Resolution |
|---|---|---|---|
| history-compare looked for .tmp/slate/package.json | 1 | Pin HISTORY_BENCH_LEGACY_REPO=../../../slate | Fixed in registry and command rerun. |
| withHistory missing from current slate-history | 1 | Support both current history() extension and legacy withHistory() plugin | Fixed in compare runner. |
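
The two resolutions above can be illustrated with a small sketch. It is not the compare runner's actual code: `resolveLegacyRepo` and `detectHistoryApi` are hypothetical helpers, the `.tmp/slate` fallback mirrors the default described under Findings, and preferring `history()` when a module exposes both entry points is an assumption.

```javascript
// Hypothetical helper: pick the legacy repo path, letting the
// HISTORY_BENCH_LEGACY_REPO environment variable override the default.
// Assumption: '.tmp/slate' is the fallback, as described under Findings.
function resolveLegacyRepo(env, fallback = '.tmp/slate') {
  const override = env.HISTORY_BENCH_LEGACY_REPO;
  return override && override.trim() !== '' ? override : fallback;
}

// Hypothetical helper: feature-detect which history API a loaded
// slate-history module exposes, so one runner can drive both checkouts.
// Assumption: history() wins when both exports are present.
function detectHistoryApi(historyModule) {
  if (typeof historyModule.history === 'function') return 'extension';
  if (typeof historyModule.withHistory === 'function') return 'plugin';
  throw new Error('slate-history exposes neither history() nor withHistory()');
}
```

A runner built this way would call `resolveLegacyRepo(process.env)` once at startup and branch on the result of `detectHistoryApi` for each loaded module, rather than hard-coding either the path or the API.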
Verification evidence:
- .tmp/slate-v2: bun run bench:react:huge-document:browser-trace:local passed and wrote tmp/slate-react-huge-document-browser-trace-benchmark-surfaces-defaultAuto-stagedDomPresent-blocks-5000-iters-3-ops-10.json.
- .tmp/slate-v2: bun run bench:core:normalization:local passed and wrote tmp/slate-normalization-benchmark.json.
- .tmp/slate-v2: bun run bench:core:query-ref-observation:local passed and wrote tmp/slate-query-ref-observation-benchmark.json.
- .tmp/slate-v2: bun run bench:core:refs-projection:local passed and wrote tmp/slate-refs-projection-benchmark.json.
- .tmp/slate-v2: HISTORY_BENCH_LEGACY_REPO=../../../slate bun run bench:history:compare:local passed and wrote tmp/slate-history-compare-benchmark.json.
- 2026-05-28T21:59:39.539Z; normalization 2026-05-28T21:59:45.787Z; query/ref observation 2026-05-28T21:59:54.777Z; refs projection 2026-05-28T22:00:05.880Z; history compare 2026-05-28T22:01:58.315Z.
- benchmarks/editor: npm run check passed. It regenerated benchmark results, scope, docs, health, and research list output.
- nextActionCount=3; remaining titles are Decide whether core-transaction-current should stay optional, Decide whether history-retained-memory should stay optional, and Delete or ignore historical unregistered artifacts; none of the five refresh IDs remain.
- rowCount=463, groupCount=11, libraries slate, slate-v2, slate-v2:browser-replay, slate-v2:current, slate-v2:default-render-auto, slate-v2:dom-present, slate:baseline, slate:browser-replay; forbidden terms found [].
- .tmp/slate-v2: bun lint passed.

Final handoff contract:
- rich-text-data.json; Browser MCP unavailable.
- npm run check, health audit, artifact audit, live JSON audit, bun lint.

Final handoff / sync:
- 127.0.0.1:8765.

Timeline:
- npm run check passed and health dropped to three next actions.

Reboot status:
| Question | Answer |
|---|---|
| Where am I? | Closeout. |
| Where am I going? | Run autogoal checker, close goal, final response. |
| What is the goal? | Refresh the five stale active benchmark artifacts and remove their health actions. |
| What have I learned? | History compare needed both the correct legacy repo path and current-vs-legacy history API compatibility. |
| What have I done? | Ran all five refresh commands, repaired history compare, regenerated Evidence Kit output, and verified health/data. |
Open risks: