Slate AR Perf

Handle $ARGUMENTS.

Use this for Slate v2 performance work where "try another optimization" needs a measured loop, not another plan essay. This is the performance lane on top of slate-ar, which wraps codex-autoresearch:codex-autoresearch.

Do not duplicate generic packet/dashboard/finalization mechanics here. Load slate-ar for Slate wrapper behavior and codex-autoresearch:codex-autoresearch for the underlying Autoresearch state machine.

Use When

The user invokes slate-ar-perf.
The user says fast, fastest, max perf, pagination, virtualization, benchmark, or asks to make a Slate v2 surface faster.
Pagination, virtualization, huge-document, table, render, layout, selection, typing, paste, scroll, or mount performance needs iterative optimization.
There is, or should be, a benchmark command that prints METRIC name=value.
The next move depends on measured results, not just architecture judgment.

Do Not Use When

The bug is primarily correctness and needs a direct fix. Use slate-patch.
The target is too vague to infer a benchmark and correctness contract. Use slate-ar-recipe for target discovery or slate-plan for architecture framing.
The output is an architecture/API proposal for user review. Use slate-plan.
The target is Plate product code instead of raw Slate v2.

Natural Modes

fast, fastest, max perf, make it fastest: fastest-safe mode. Pick or resume the matching target and keep running packets until target parity, plateau, correctness blocker, architecture blocker, unsafe finalization/dirty-tree boundary, or user interruption.
pagination, virtualization: use the pagination default contract unless the user gives a sharper target.
continue, resume, status, dashboard, finalize: delegate to slate-ar operator modes, then apply perf policy to any next packet.

Plateau means three consecutive valid correctness-green packets improve the primary metric by less than 5% and no safe P0/P1 profiler hypothesis remains. Do not stop at the first win.

Target Registry

Use benchmarks/targets/slate-v2.json as the migration spine when it exists. It is the source of truth for benchmark questions, cohorts, commands, metrics, correctness checks, artifacts, and supporting docs.

Default path:

list targets with pnpm bench:targets:list;
check registry health with pnpm bench:targets:check;
generate or check target reports with pnpm bench:targets:report;
dry-run the target with pnpm bench:targets:dry-run -- <target-id>;
initialize the real .tmp/slate-v2/autoresearch.* session only when needed: node tooling/scripts/bench-targets.mjs autoresearch-init <target-id>;
use slate-ar / Codex Autoresearch for setup inspection, benchmark lint, checks inspection, packets, stale-run detection, ASI, dashboard, keep/discard decisions, and final evidence.

Missing Target Policy

A specific target id is enough instruction. Do not force the user to write a long prompt like "create target if missing".

If <target-id> is missing from benchmarks/targets/slate-v2.json and the name is specific enough to infer the surface, create the first-class target contract in the same pass. Examples of specific-enough names:

react-huge-document-select-all
core-observation-compare
pagination-virtualized-fast-scroll
history-fragment-undo-redo

When creating a target:

add the registry entry before initializing Autoresearch;
reuse or extend the nearest existing benchmark script when possible;
create a small benchmark owner only when no existing command can expose the metric honestly;
make the benchmark print METRIC lines from the start;
use a primary metric that names the real surface, not generic benchmark_seconds;
compare against ../slate/../../../slate when legacy parity is the claim;
add a correctness command that covers the native editor behavior at risk;
dry-run the target and run pnpm bench:targets:check before AR init.

If the target name is ambiguous, stop and recommend one or two concrete target ids. Do not create another slate-ar-* wrapper skill for a missing benchmark target. slate-ar-perf owns perf-target bootstrapping.

The old Slate v2 bench:* package scripts remain workload owners during migration. The clean split is: target registry owns the decision contract, benchmark scripts own runtime workload, Autoresearch owns active optimization state, and target reports/history own historical status.

Exactness Gate

Performance wins do not count when the editor is less correct.

Before pagination, virtualization, hidden DOM, model-backed selection, or staged-render optimization:

identify the exact correctness oracle or browser proof command;
if no oracle exists, add it first with slate-patch or tdd;
classify each native behavior as preserved, intentionally degraded, or out of scope before using it as a benchmark cohort;
keep cold-path estimates as scaffold hints only, not authoritative layout or selection truth;
if a packet improves speed but breaks selection, input ordering, IME, copy, paste, undo, focus, or follow-up typing, log checks_failed or discard, never keep.

Pagination Default Contract

For pagination or page-level virtualization, start from this contract unless the user gives a sharper one.

Target route:

txt

http://localhost:3100/examples/pagination?page_layout=single&strategy=virtualized&rows=800

Required cohorts:

small: rows=8, staged and virtualized
table-large: rows=500, staged and virtualized
stress: rows=800 or rows=1000, virtualized
table-span: table spans at least 10 pages

Primary metrics, lower is better:

fast typing burst p95 or total interaction latency
initial interactive time for the route
strategy switch latency from staged to virtualized
fast scroll recovery time

Secondary metrics:

dropped or reordered characters count
DOM node count
mounted page count
page overscan count
React commit count or render count when available
heap estimate when cheap to gather

Correctness checks:

no skipped/reordered characters during fast typing bursts;
insert break keeps following typed characters after the caret;
click left margin selects start of line;
click right margin selects end of line;
double click selects a word;
drag selection autoscroll works near top and bottom;
text selection across visible page content works;
native copy/paste/select-all behavior is preserved or explicitly classified.

Benchmark output must print METRIC lines. Example:

txt

METRIC typing_p95_ms=42
METRIC dropped_chars=0
METRIC dom_nodes=1840

Huge Document Select-All Default Contract

For react-huge-document-select-all, create or use a target with this contract.

Required cohorts:

5k blocks against legacy Slate;
the main Slate v2 huge-document React surface;
staged/DOM-present or virtualized surface when that is the product path;
native keyboard select-all (Mod+A), not only programmatic model selection.

Primary metrics, lower is better:

react_huge_doc_select_all_p95_ms;
react_huge_doc_select_all_worst_p95_ratio versus legacy;
react_huge_doc_select_all_failure_count.

Secondary metrics:

DOM node count after select-all;
React commit count when cheap to collect;
selection export/import time when visible separately;
copy latency for the selected document when cheap to collect.

Correctness checks:

Mod+A selects the full editor document;
typing after select-all replaces the selected content once;
undo restores the previous document and selection coherently;
copy returns complete plain text for the selected document;
broad selection stays valid in staged, partial-DOM, or virtualized rendering;
no hidden debounce or delayed correctness.

Promotion target:

first pass: worst p95 ratio <=1.5 with failure count 0;
final target: worst p95 ratio <=1.0 or plateau after three correctness-green packets with less than 5% gain and no safe P0/P1 profiler hypothesis left.

Handoff

Report:

benchmark command and primary metric;
baseline, latest, and best values;
kept, discarded, crashed, and checks-failed packets;
correctness checks used;
files changed by kept work;
dashboard URL, if served;
next recommended packet or blocker.