refresh five stale benchmark artifacts

docs/plans/2026-05-28-refresh-five-stale-benchmark-artifacts.md

Objective: Refresh the five stale active Evidence Kit artifacts for react-huge-document-browser-trace, core-normalization-current, core-query-ref-observation, core-refs-projection, and history-compare; regenerate benchmark health/docs; verify those five refresh actions disappear while the dashboard stays scoped to Slate and Slate v2.

Goal plan: docs/plans/2026-05-28-refresh-five-stale-benchmark-artifacts.md

Task source:

  • type: benchmark health next actions
  • id / link: refresh-react-huge-document-browser-trace, refresh-core-normalization-current, refresh-core-query-ref-observation, refresh-core-refs-projection, refresh-history-compare
  • title: refresh the five remaining stale active artifacts
  • acceptance criteria: all five registry commands pass, all five registered artifact files are fresh, Evidence Kit health/docs regenerate, health no longer lists those five refresh actions, and live rich-text-data.json remains Slate-only.

Completion threshold:

  • .tmp/slate-v2 runs all five active registry commands successfully.
  • benchmarks/editor runs npm run check successfully after refresh.
  • benchmarks/editor/benchmarks/results/benchmark-health-latest.json has none of the five refresh action IDs.
  • Served http://127.0.0.1:8765/rich-text-data.json contains no Plate, ProseMirror, Lexical, TipTap, chunk-off, or runtime-adapter scope.
  • node .agents/rules/autogoal/scripts/check-complete.mjs docs/plans/2026-05-28-refresh-five-stale-benchmark-artifacts.md passes.
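
The health threshold above can be checked mechanically. The following is a minimal sketch, assuming the parsed health file exposes a top-level `nextActions` array whose entries carry an `id` string. That shape is an assumption, not something this plan confirms, and `remainingRefreshActions` is a hypothetical helper name.

```javascript
// The five refresh action IDs named in the task source.
const REFRESH_IDS = [
  "refresh-react-huge-document-browser-trace",
  "refresh-core-normalization-current",
  "refresh-core-query-ref-observation",
  "refresh-core-refs-projection",
  "refresh-history-compare",
];

// Returns the refresh IDs still present in a parsed health document.
// ASSUMPTION: `health.nextActions` is an array of objects with an `id` string.
function remainingRefreshActions(health) {
  const actions = Array.isArray(health?.nextActions) ? health.nextActions : [];
  const present = new Set(actions.map((action) => action.id));
  return REFRESH_IDS.filter((id) => present.has(id));
}

// Inline stand-in for the parsed benchmark-health-latest.json.
const sampleHealth = {
  nextActions: [
    { id: "refresh-history-compare" },
    { id: "hypothetical-non-refresh-action" }, // illustrative only
  ],
};
console.log(remainingRefreshActions(sampleHealth)); // [ 'refresh-history-compare' ]
```

An empty result would correspond to the threshold being met; any returned ID names a refresh action that still needs work.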

Verification surface:

  • .tmp/slate-v2: bun run bench:react:huge-document:browser-trace:local
  • .tmp/slate-v2: bun run bench:core:normalization:local
  • .tmp/slate-v2: bun run bench:core:query-ref-observation:local
  • .tmp/slate-v2: bun run bench:core:refs-projection:local
  • .tmp/slate-v2: HISTORY_BENCH_LEGACY_REPO=../../../slate bun run bench:history:compare:local
  • .tmp/slate-v2: bun lint
  • benchmarks/editor: npm run check
  • repo root: health-next-action JSON audit
  • repo root: artifact mtime audit
  • repo root: live rich-text-data.json Slate-only fetch audit
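
The artifact mtime audit in the list above could be sketched as follows. The freshness rule used here (an artifact counts as fresh if it was modified at or after the start of the refresh run) is an assumption, since this plan does not state how Evidence Kit decides staleness, and `findStaleArtifacts` is a hypothetical helper.

```javascript
// Flags artifacts whose modification time predates the start of the refresh run.
// `artifacts` is a list of { name, mtime } with ISO-8601 mtime strings; in a real
// audit the mtime would come from something like fs.statSync(path).mtime.toISOString().
function findStaleArtifacts(artifacts, refreshStartedAt) {
  const cutoff = Date.parse(refreshStartedAt);
  if (Number.isNaN(cutoff)) {
    throw new Error(`Invalid cutoff timestamp: ${refreshStartedAt}`);
  }
  return artifacts
    // An unparseable mtime yields NaN, which fails the comparison and counts as stale.
    .filter(({ mtime }) => !(Date.parse(mtime) >= cutoff))
    .map(({ name }) => name);
}

// Cutoff taken from the goal plan creation time recorded in the timeline.
const planCreatedAt = "2026-05-28T21:59:07.963Z";
const stale = findStaleArtifacts(
  [
    { name: "history-compare", mtime: "2026-05-28T22:01:58.315Z" },
    { name: "hypothetical-old-artifact", mtime: "2026-05-01T00:00:00.000Z" },
  ],
  planCreatedAt,
);
console.log(stale); // [ 'hypothetical-old-artifact' ]
```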

Constraints:

  • Keep active benchmark scope to slate and slate-v2.
  • Do not reintroduce Plate, ProseMirror, Lexical, TipTap, or chunk-off lanes.
  • Do not repair optional artifact decisions or unregistered historical artifact policy in this refresh.
  • Do not create commits, PRs, pushes, or tracker comments.

Boundaries:

  • Source of truth: benchmarks/editor/research/benchmark-registry.json, generated Evidence Kit health output, and registered JSON artifacts under .tmp/slate-v2/tmp.
  • Allowed edit scope: stale benchmark harness drift, Evidence Kit registry command contract, generated Evidence Kit outputs, and this goal plan.
  • Browser surface: static dashboard served at http://127.0.0.1:8765/.
  • Tracker sync: not applicable, no external tracker requested.
  • Non-goals: optional artifact policy, unregistered artifact cleanup, non-Slate adapters, and unrelated Slate v2 repo contract test cleanup.

Blocked condition:

  • A real blocker would be any of the five benchmark commands failing after source-level harness repair, or Evidence Kit still reporting one of the five refresh actions after regenerated docs/health. Neither blocker remains.

Task state:

  • task_type: benchmark-refresh
  • task_complexity: normal
  • current_phase: closeout
  • current_phase_status: complete
  • next_phase: final-response
  • goal_status: active-until-autogoal-close

Current verdict:

  • verdict: complete after autogoal checker
  • confidence: high
  • next owner: user
  • reason: all five refresh commands passed, Evidence Kit check passed, and health now has only the three non-refresh actions.

Start Gates:

| Gate | Applies | Evidence |
| --- | --- | --- |
| Skill analysis before edits | yes | Used autogoal because the task had five measurable health-action removals. |
| Active goal checked or created | yes | Created goal for refreshing five stale active artifacts and verifying health/docs/data. |
| Source of truth read before edits | yes | Read current health next actions and benchmarks/editor/research/benchmark-registry.json. |
| Tracker comments and attachments read | no | No tracker or attachment was part of this request. |
| Video transcript evidence required | no | No video or screen recording supplied. |
| docs/solutions checked for non-trivial existing-code work | no | Focused benchmark refresh; local registry/scripts were the source of truth. |
| TDD decision before behavior change or bug fix | yes | No new test added; target benchmark commands and Evidence Kit health are the regression proof. |
| Branch decision for code-changing task | yes | No branch action; user requested local execution only. |
| Release artifact decision | yes | No package release artifact; benchmark harness/registry only. |
| Browser tool decision for browser surface | yes | Browser MCP was not exposed after tool discovery; direct HTTP JSON fetch verified served dashboard data. |
| PR expectation decision | yes | No PR requested. |
| Tracker sync expectation decision | yes | No tracker sync requested. |

Work Checklist:

  • Objective includes outcome, completion threshold, verification surface, constraints, boundaries, and blocked condition.
  • Task source classified with source type, id/link, title, task type, acceptance criteria, caveats, likely files/routes/packages, browser surface, and root-cause layer.
  • Required video or screen-recording evidence is cached/read as normalized XML, or marked not applicable with reason.
  • Nearby repo instructions and implementation patterns read before edits.
  • Implementation fixes the right ownership boundary, or the narrower choice is recorded with reason.
  • Release artifact requirement recorded: no changeset or registry changelog applies to benchmark harness refresh.
  • Final handoff shape decided: concise summary with five refreshed artifacts, checks, and remaining health actions.
  • Branch handling recorded for code-changing work: no branch action because none was requested.
  • Local-env-rot retry policy recorded for any surprising repo-wide failure: not env rot; history failure was source/API drift and wrong legacy path.
  • Workspace authority recorded: every proof command names .tmp/slate-v2, benchmarks/editor, or repo root.
  • High-risk note recorded for command-contract/runtime changes: history compare needed current extension-based history API support and legacy path pinning.
  • Review/autoreview target selected from actual diff state: not run; focused benchmark harness repair with target checks.
  • Agent-native review decision recorded: not applicable, no agent tooling changed.

Completion Gates:

| Gate | Applies | Required action | Evidence |
| --- | --- | --- | --- |
| Named verification threshold | yes | Run all five benchmark commands, Evidence Kit check, health audit, artifact audit, and live JSON audit | Completed. |
| Bug reproduced before fix | yes | Record failing command before fix | bun run bench:history:compare:local failed on missing .tmp/slate/package.json; env override then exposed stale withHistory import against current Slate v2. |
| Targeted behavior verification | yes | Run each target benchmark command | All five target commands passed after fixing history compare. |
| TypeScript or typed config changed | no | No typecheck required | Changed JS benchmark script and JSON registry only. |
| Package exports or file layout changed | no | No barrel action | No exported layout changed. |
| Package manifests, lockfile, or install graph changed | no | No install needed | No manifest or lockfile changed. |
| Agent rules or skills changed | no | No skill sync | No .agents source edited. |
| Workspace authority proof | yes | Run verification in owning workspaces | Target commands and lint ran in .tmp/slate-v2; Evidence Kit suite ran in benchmarks/editor; HTTP data audit ran from repo root. |
| Browser surface changed | yes | Verify generated/served dashboard data | Direct fetch of served rich-text-data.json passed with no forbidden non-Slate terms. |
| Browser final proof | yes | Record caveat | Browser MCP was not exposed; live HTTP JSON proof covers the dashboard data source. |
| CI-controlled template output changed | no | No template output | No templates/** changed. |
| Package behavior or public API changed | no | No changeset | Benchmark harness only. |
| Registry-only component work changed | no | No component changelog | Not registry component work. |
| Docs or content changed | yes | Verify generated docs | npm run check ran docs:perf, docs:rich-text:check, and docs:index:check successfully. |
| High-risk mini gate | yes | Record failure mode, proof plan, and boundary | Failure mode was stale command/API drift making refresh impossible; boundary is benchmark harness and Evidence Kit command registry. |
| Agent-native review for agent/tooling changes | no | No agent-native review | No agent/tooling changes. |
| Local install corruption suspected | no | No reinstall | Failures were deterministic path/API errors, not local install corruption. |
| Autoreview for non-trivial implementation changes | no | No autoreview | Focused benchmark harness repair; target commands and Evidence Kit check are sufficient. |
| PR create or update | no | No PR | User did not ask for PR. |
| PR proof image hosting | no | No PR body | Not applicable. |
| Tracker sync-back | no | No tracker | Not applicable. |
| Final handoff contract | yes | Fill final handoff fields | Completed below. |
| Final lint | yes | Run scoped equivalent | .tmp/slate-v2: bun lint passed; benchmarks/editor: npm run check passed. |
| Goal plan complete | yes | Run autogoal checker | To be run after this file update. |

Phase / pass table:

| Phase | Status | Evidence | Next |
| --- | --- | --- | --- |
| Intake and source read | complete | Read five refresh actions and registry commands. | none |
| Implementation | complete | Refreshed four artifacts directly; fixed history registry command and compare runner drift. | none |
| Verification | complete | All target benchmark commands passed, Evidence Kit check passed, health/live-data audits passed. | none |
| PR / tracker sync | complete | No PR or tracker requested. | none |
| Closeout | complete | Durable evidence recorded; autogoal checker remains final mechanical gate. | final response |

Findings:

  • Four artifacts refreshed without code changes.
  • history-compare defaulted to .tmp/slate, but this repo workflow compares .tmp/slate-v2 against ../../../slate.
  • After the legacy path was corrected, history-compare still used old withHistory assumptions. Current Slate v2 uses history() extensions and tx.history.undo()/redo().
  • Evidence Kit health now has only three non-refresh next actions.

Decisions and tradeoffs:

  • Patched the Evidence Kit registry command to pin HISTORY_BENCH_LEGACY_REPO=../../../slate, matching the established Slate vs Slate-v2 checkout layout.
  • Patched the history compare runner to support both current extension-based history and legacy plugin-based history.
  • Did not touch optional artifact decisions or historical unregistered artifacts; those are separate next actions.
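
The path decision above can be illustrated with a small resolver. This is a sketch only: the real runner's logic is not shown in this plan, `resolveLegacyRepo` is a hypothetical name, the `.tmp/slate` fallback mirrors the default described in the findings, and `exists` is an injected predicate (for example `fs.existsSync`) so the sketch stays self-contained.

```javascript
// Resolves the legacy Slate checkout used by the history compare benchmark.
// ASSUMPTIONS: the env var name comes from this plan; the fallback and the
// package.json probe are illustrative stand-ins for the real runner's behavior.
function resolveLegacyRepo(env, exists, fallback = ".tmp/slate") {
  const repo = env.HISTORY_BENCH_LEGACY_REPO || fallback;
  const manifest = `${repo.replace(/\/+$/, "")}/package.json`;
  if (!exists(manifest)) {
    throw new Error(
      `Legacy Slate repo not found: missing ${manifest}. Set HISTORY_BENCH_LEGACY_REPO.`,
    );
  }
  return { repo, manifest };
}

// Stand-in filesystem where only the pinned checkout exists.
const onlyPinnedExists = (file) => file === "../../../slate/package.json";

console.log(
  resolveLegacyRepo({ HISTORY_BENCH_LEGACY_REPO: "../../../slate" }, onlyPinnedExists).manifest,
); // ../../../slate/package.json
```

Failing loudly on a missing manifest surfaces a wrong default immediately, which is the failure mode recorded for the unpinned command.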

Implementation notes:

  • Updated benchmark-registry.json so history-compare uses HISTORY_BENCH_LEGACY_REPO=../../../slate.
  • Updated history.mjs to dynamically support history() and withHistory() APIs.
  • Regenerated benchmark results and docs under benchmarks/editor.
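
The dual-API support noted above might be approached by duck typing on the loaded history module. This sketch is an assumption about the approach, not the actual contents of history.mjs: only the export names `history` and `withHistory` come from this plan, while `detectHistoryApi` and the stub modules are hypothetical.

```javascript
// Picks a history strategy from whatever the loaded history module exports.
// ASSUMPTION: the current API is recognized by a `history` function export and
// the legacy API by a `withHistory` function export; the current API wins if both exist.
function detectHistoryApi(historyModule) {
  if (typeof historyModule?.history === "function") return "extension";
  if (typeof historyModule?.withHistory === "function") return "plugin";
  throw new Error("History module exports neither history() nor withHistory()");
}

// Stub modules standing in for the current and legacy packages.
const currentStub = { history: () => ({}) };
const legacyStub = { withHistory: (editor) => editor };

console.log(detectHistoryApi(currentStub)); // extension
console.log(detectHistoryApi(legacyStub)); // plugin
```

In a runner, the module object would typically come from a dynamic `import()` of each checkout, with the returned label selecting how undo and redo are driven.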

Review fixes:

  • None from external review.

Error attempts:

| Error / failed attempt | Count | Next different move | Resolution |
| --- | --- | --- | --- |
| history-compare looked for .tmp/slate/package.json | 1 | Pin HISTORY_BENCH_LEGACY_REPO=../../../slate | Fixed in registry and command rerun. |
| withHistory missing from current slate-history | 1 | Support both current history() extension and legacy withHistory() plugin | Fixed in compare runner. |

Verification evidence:

  • .tmp/slate-v2: bun run bench:react:huge-document:browser-trace:local passed and wrote tmp/slate-react-huge-document-browser-trace-benchmark-surfaces-defaultAuto-stagedDomPresent-blocks-5000-iters-3-ops-10.json.
  • .tmp/slate-v2: bun run bench:core:normalization:local passed and wrote tmp/slate-normalization-benchmark.json.
  • .tmp/slate-v2: bun run bench:core:query-ref-observation:local passed and wrote tmp/slate-query-ref-observation-benchmark.json.
  • .tmp/slate-v2: bun run bench:core:refs-projection:local passed and wrote tmp/slate-refs-projection-benchmark.json.
  • .tmp/slate-v2: HISTORY_BENCH_LEGACY_REPO=../../../slate bun run bench:history:compare:local passed and wrote tmp/slate-history-compare-benchmark.json.
  • Fresh artifact mtimes: browser trace 2026-05-28T21:59:39.539Z; normalization 2026-05-28T21:59:45.787Z; query/ref observation 2026-05-28T21:59:54.777Z; refs projection 2026-05-28T22:00:05.880Z; history compare 2026-05-28T22:01:58.315Z.
  • benchmarks/editor: npm run check passed. It regenerated benchmark results, scope, docs, health, and research list output.
  • Health audit: nextActionCount=3; remaining titles are Decide whether core-transaction-current should stay optional, Decide whether history-retained-memory should stay optional, and Delete or ignore historical unregistered artifacts; none of the five refresh IDs remain.
  • Live dashboard data audit: rowCount=463, groupCount=11, libraries slate, slate-v2, slate-v2:browser-replay, slate-v2:current, slate-v2:default-render-auto, slate-v2:dom-present, slate:baseline, slate:browser-replay; forbidden terms found [].
  • .tmp/slate-v2: bun lint passed.
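
The live dashboard data audit above amounts to scanning the served JSON for out-of-scope terms. A minimal sketch follows, assuming a case-insensitive substring match over the serialized payload. The actual audit script is not shown in this plan, and `findForbiddenTerms` is a hypothetical name.

```javascript
// Terms that must not appear in the Slate-only dashboard data, per the completion threshold.
const FORBIDDEN_TERMS = ["plate", "prosemirror", "lexical", "tiptap", "chunk-off", "runtime-adapter"];

// Returns the forbidden terms that occur anywhere in the serialized payload.
// ASSUMPTION: a case-insensitive substring match over the raw JSON text.
function findForbiddenTerms(payload) {
  const raw = typeof payload === "string" ? payload : JSON.stringify(payload) ?? "";
  const text = raw.toLowerCase();
  return FORBIDDEN_TERMS.filter((term) => text.includes(term));
}

// Library labels copied from the audit result above.
const libraries = ["slate", "slate-v2", "slate-v2:browser-replay", "slate:baseline"];
console.log(findForbiddenTerms({ libraries })); // []
console.log(findForbiddenTerms({ libraries: [...libraries, "TipTap"] })); // [ 'tiptap' ]

// Against the served file, one could fetch and scan the body text, e.g.:
//   const body = await (await fetch("http://127.0.0.1:8765/rich-text-data.json")).text();
//   findForbiddenTerms(body);
```

A plain substring match is deliberately strict: "plate" would also match a word such as "template", so a stricter audit may prefer word-boundary matching if such false positives appear.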

Final handoff contract:

  • PR line: no PR requested.
  • Issue / tracker line: no tracker requested.
  • Confidence line: high.
  • Flow table:
    • Reproduced: five stale refresh actions existed; history compare initially failed on wrong legacy path and stale history API.
    • Verified: all five target commands passed, Evidence Kit check passed, health refresh actions disappeared, live dashboard data stayed Slate-only.
  • Browser check: direct HTTP fetch against served rich-text-data.json; Browser MCP unavailable.
  • Outcome: all five stale active artifacts are fresh and health is down to the three non-refresh tasks.
  • Caveat: remaining health tasks are optional-artifact decisions and unregistered artifact cleanup, not refreshes.
  • Design:
    • Chosen boundary: benchmark harness plus Evidence Kit registry command.
    • Why not quick patch: editing health JSON directly would fake the refresh.
    • Why not broader change: optional artifact policy and historical cleanup are separate choices.
  • Verified: five benchmark commands, npm run check, health audit, artifact audit, live JSON audit, bun lint.

Final handoff / sync:

  • PR: not requested.
  • Issue / tracker: not requested.
  • Browser proof: live HTTP JSON audit against 127.0.0.1:8765.
  • Caveats: three non-refresh health next actions remain by design.

Timeline:

  • 2026-05-28T21:59:07.963Z Task goal plan created.
  • 2026-05-28T21:59:39.539Z Browser trace artifact refreshed.
  • 2026-05-28T21:59:45.787Z Normalization artifact refreshed.
  • 2026-05-28T21:59:54.777Z Query/ref observation artifact refreshed.
  • 2026-05-28T22:00:05.880Z Refs projection artifact refreshed.
  • 2026-05-28T22:01:58.315Z History compare artifact refreshed after harness repair.
  • 2026-05-28T22:02Z Evidence Kit npm run check passed and health dropped to three next actions.

Reboot status:

| Question | Answer |
| --- | --- |
| Where am I? | Closeout. |
| Where am I going? | Run autogoal checker, close goal, final response. |
| What is the goal? | Refresh the five stale active benchmark artifacts and remove their health actions. |
| What have I learned? | History compare needed both the correct legacy repo path and current-vs-legacy history API compatibility. |
| What have I done? | Ran all five refresh commands, repaired history compare, regenerated Evidence Kit output, and verified health/data. |

Open risks:

  • The three remaining health next actions are intentionally not handled in this refresh: two optional artifact decisions and one unregistered artifact cleanup.