docs/plans/2026-06-01-history-compare-ar-perf.md
Objective:
Run and repair history-compare under Slate AR until target-backed evidence is
green, plateaued, or blocked by correctness/architecture proof.
Completion threshold:
history-compare emits a benchmark-native primary METRIC line and
one of these is true: two correctness-green repeat packets satisfy
history_compare_worst_p95_ratio <= 2.0, two correctness-green packets
plateau under 5% improvement, or a remaining optimization is blocked by
concrete correctness or architecture evidence.
Verification surface:
pnpm bench:targets:check, pnpm bench:targets:dry-run -- history-compare, and target report refresh.
HISTORY_BENCH_LEGACY_REPO=../../../slate bun run bench:history:compare:local in .tmp/slate-v2.
.tmp/slate-v2 bun check for evidence used to close the lane.
Constraints:
Boundaries:
benchmarks/targets/slate-v2.json target history-compare and .tmp/slate-v2/scripts/benchmarks/core/compare/history.mjs.
slate-history patch changeset because runtime history replay behavior changed.
Blocked condition:
.tmp/slate-v2 correctness fails from an unrelated owner that cannot be isolated, or further improvement needs a public history API/runtime architecture decision.
Work Checklist:
Phase / pass table:
| Phase | Status | Evidence | Next |
|---|---|---|---|
| Intake | complete | target registry and history benchmark read | implementation done |
| Implementation | complete | native metrics, stable iteration count, and history replay optimization added | verification done |
| Verification | complete | AR runs 11 and 12 under 2.0, checks green | closeout done |
| Closeout | complete | plan updated and mechanical check passed | final response |
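The Verification row's "under 2.0" claim maps onto the completion threshold, which can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Packet` shape, the `evaluate` name, and the reading of "plateau" as less than 5% relative improvement between the last two packets are hypothetical, not taken from the target tooling. The "blocked" outcome depends on human-reviewed evidence and is not modeled.

```typescript
// Hypothetical packet shape; field names are assumptions, not the benchmark's real output.
interface Packet {
  correctnessGreen: boolean;
  worstP95Ratio: number; // value of history_compare_worst_p95_ratio
}

type Verdict = "green" | "plateaued" | "continue";

const TARGET_RATIO = 2.0; // threshold from the plan
const PLATEAU_IMPROVEMENT = 0.05; // "under 5% improvement"

function evaluate(packets: Packet[]): Verdict {
  // Only the two most recent packets count, and both must be correctness-green.
  if (packets.length < 2) return "continue";
  const [prev, curr] = packets.slice(-2);
  if (!prev.correctnessGreen || !curr.correctnessGreen) return "continue";

  // Green: both repeat packets are at or below the target ratio.
  if (prev.worstP95Ratio <= TARGET_RATIO && curr.worstP95Ratio <= TARGET_RATIO) {
    return "green";
  }

  // Plateau (assumed definition): relative improvement of the latest packet
  // over the previous one is below 5%. Lower ratios are better, so a
  // regression also counts as "no meaningful improvement" here.
  const improvement = (prev.worstP95Ratio - curr.worstP95Ratio) / prev.worstP95Ratio;
  return improvement < PLATEAU_IMPROVEMENT ? "plateaued" : "continue";
}
```

With the two ratios recorded in the verification evidence (1.9 and 0.38), both correctness-green, this sketch returns `"green"`.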
Findings:
benchmark_seconds; that was not a useful history comparison metric.
2.63x, 15-iteration 3.1x, worst on typing undo.
Decisions and tradeoffs:
history_compare_worst_p95_ratio as the primary metric and kept mean ratio/delta as secondary evidence.
slate-history patch changeset because this is package runtime behavior, not just benchmark plumbing.
Verification evidence:
node --check .tmp/slate-v2/scripts/benchmarks/core/compare/history.mjs passed.
pnpm bench:targets:check passed.
pnpm bench:targets:dry-run -- history-compare passed with metric=history_compare_worst_p95_ratio.
pnpm bench:targets:report regenerated target history/report files.
history_compare_worst_p95_ratio metric.
cd packages/slate-history && bun test ./test/history-contract.ts ./test/integrity-contract.ts ./test/document-state-history-contract.ts with 51 pass, 0 fail.
HISTORY_BENCH_TYPE_OPS=200 measured history_compare_worst_p95_ratio=1.61.
history_compare_worst_p95_ratio=1.9, history_compare_worst_mean_ratio=0.57; benchmark exited 0 and bun check passed.
history_compare_worst_p95_ratio=0.38, history_compare_worst_mean_ratio=0.37; benchmark exited 0 and bun check passed.
bash ./autoresearch.checks.sh passed in .tmp/slate-v2: lint had one non-blocking React Hook warning in site/examples/ts/pagination.tsx, typecheck passed, Bun tests 1172 pass, 95 skip, 0 fail, slate-layout 41 pass, slate-react Vitest 56 files, 590 tests passed.
Reboot status:
| Question | Answer |
|---|---|
| Where am I? | Closeout complete |
| Where am I going? | Final response |
| What is the goal? | Make history-compare truthful and decide whether it needs optimization |
| What have I learned? | The red lane was real history replay overhead, then later p95 noise after the fix |
| What have I done? | Upgraded the metric contract, optimized historic replay, added a changeset, and proved two green AR packets |
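The gap between the primary p95 ratio and the secondary mean ratio can be illustrated with a minimal sketch. It assumes the metric is the largest per-scenario ratio of current to legacy p95 timings, and it uses a nearest-rank percentile. The `Scenario` shape, the function names, and the sample timings are hypothetical and are not taken from history.mjs.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of the data at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Hypothetical scenario shape: timing samples (ms) for the current and legacy builds.
interface Scenario {
  name: string;
  current: number[];
  legacy: number[];
}

// Worst (largest) current/legacy ratio across scenarios, for p95 and for the mean.
function worstRatios(scenarios: Scenario[]): { worstP95: number; worstMean: number } {
  let worstP95 = 0;
  let worstMean = 0;
  for (const s of scenarios) {
    worstP95 = Math.max(worstP95, percentile(s.current, 95) / percentile(s.legacy, 95));
    worstMean = Math.max(worstMean, mean(s.current) / mean(s.legacy));
  }
  return { worstP95, worstMean };
}

// Illustrative data only: two slow outliers in 20 samples dominate the p95
// even though the mean stays well below the legacy mean.
const example: Scenario = {
  name: "typing-undo",
  current: [...Array(18).fill(2), ...Array(2).fill(12)],
  legacy: Array(20).fill(8),
};
const ratios = worstRatios([example]); // { worstP95: 1.5, worstMean: 0.375 }
```

In this made-up data the mean ratio is well under 1 while the p95 ratio is above 1. A few slow iterations can therefore push the primary metric up without any change in typical cost.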
Open risks:
The site/examples/ts/pagination.tsx hook dependency warning appears in bun check, but the check still exits 0.