Back to Plate

Slate AR Perf

.agents/skills/slate-ar-perf/SKILL.md

53.0.88.2 KB
Original Source

Slate AR Perf

Handle $ARGUMENTS.

Use this for Slate v2 performance work where "try another optimization" needs a measured loop, not another plan essay. This is the performance lane on top of slate-ar, which wraps codex-autoresearch:codex-autoresearch.

Do not duplicate generic packet/dashboard/finalization mechanics here. Load slate-ar for Slate wrapper behavior and codex-autoresearch:codex-autoresearch for the underlying Autoresearch state machine.

Use When

  • The user invokes slate-ar-perf.
  • The user says fast, fastest, max perf, pagination, virtualization, benchmark, or asks to make a Slate v2 surface faster.
  • Pagination, virtualization, huge-document, table, render, layout, selection, typing, paste, scroll, or mount performance needs iterative optimization.
  • There is, or should be, a benchmark command that prints METRIC name=value.
  • The next move depends on measured results, not just architecture judgment.

Do Not Use When

  • The bug is primarily correctness and needs a direct fix. Use slate-patch.
  • The target is too vague to infer a benchmark and correctness contract. Use slate-ar-recipe for target discovery or slate-plan for architecture framing.
  • The output is an architecture/API proposal for user review. Use slate-plan.
  • The target is Plate product code instead of raw Slate v2.

Natural Modes

  • fast, fastest, max perf, make it fastest: fastest-safe mode. Pick or resume the matching target and keep running packets until target parity, plateau, correctness blocker, architecture blocker, unsafe finalization/dirty-tree boundary, or user interruption.
  • pagination, virtualization: use the pagination default contract unless the user gives a sharper target.
  • continue, resume, status, dashboard, finalize: delegate to slate-ar operator modes, then apply perf policy to any next packet.

Plateau means three consecutive valid correctness-green packets improve the primary metric by less than 5% and no safe P0/P1 profiler hypothesis remains. Do not stop at the first win.

Target Registry

Use benchmarks/targets/slate-v2.json as the migration spine when it exists. It is the source of truth for benchmark questions, cohorts, commands, metrics, correctness checks, artifacts, and supporting docs.

Default path:

  1. list targets with pnpm bench:targets:list;
  2. check registry health with pnpm bench:targets:check;
  3. generate or check target reports with pnpm bench:targets:report;
  4. dry-run the target with pnpm bench:targets:dry-run -- <target-id>;
  5. initialize the real .tmp/slate-v2/autoresearch.* session only when needed: node tooling/scripts/bench-targets.mjs autoresearch-init <target-id>;
  6. use slate-ar / Codex Autoresearch for setup inspection, benchmark lint, checks inspection, packets, stale-run detection, ASI, dashboard, keep/discard decisions, and final evidence.

Missing Target Policy

A specific target id is enough instruction. Do not force the user to write a long prompt like "create target if missing".

If <target-id> is missing from benchmarks/targets/slate-v2.json and the name is specific enough to infer the surface, create the first-class target contract in the same pass. Examples of specific-enough names:

  • react-huge-document-select-all
  • core-observation-compare
  • pagination-virtualized-fast-scroll
  • history-fragment-undo-redo

When creating a target:

  • add the registry entry before initializing Autoresearch;
  • reuse or extend the nearest existing benchmark script when possible;
  • create a small benchmark owner only when no existing command can expose the metric honestly;
  • make the benchmark print METRIC lines from the start;
  • use a primary metric that names the real surface, not generic benchmark_seconds;
  • compare against ../slate/../../../slate when legacy parity is the claim;
  • add a correctness command that covers the native editor behavior at risk;
  • dry-run the target and run pnpm bench:targets:check before AR init.

If the target name is ambiguous, stop and recommend one or two concrete target ids. Do not create another slate-ar-* wrapper skill for a missing benchmark target. slate-ar-perf owns perf-target bootstrapping.

The old Slate v2 bench:* package scripts remain workload owners during migration. The clean split is: target registry owns the decision contract, benchmark scripts own runtime workload, Autoresearch owns active optimization state, and target reports/history own historical status.

Exactness Gate

Performance wins do not count when the editor is less correct.

Before pagination, virtualization, hidden DOM, model-backed selection, or staged-render optimization:

  • identify the exact correctness oracle or browser proof command;
  • if no oracle exists, add it first with slate-patch or tdd;
  • classify each native behavior as preserved, intentionally degraded, or out of scope before using it as a benchmark cohort;
  • keep cold-path estimates as scaffold hints only, not authoritative layout or selection truth;
  • if a packet improves speed but breaks selection, input ordering, IME, copy, paste, undo, focus, or follow-up typing, log checks_failed or discard, never keep.

Pagination Default Contract

For pagination or page-level virtualization, start from this contract unless the user gives a sharper one.

Target route:

txt
http://localhost:3100/examples/pagination?page_layout=single&strategy=virtualized&rows=800

Required cohorts:

  • small: rows=8, staged and virtualized
  • table-large: rows=500, staged and virtualized
  • stress: rows=800 or rows=1000, virtualized
  • table-span: table spans at least 10 pages

Primary metrics, lower is better:

  • fast typing burst p95 or total interaction latency
  • initial interactive time for the route
  • strategy switch latency from staged to virtualized
  • fast scroll recovery time

Secondary metrics:

  • dropped or reordered characters count
  • DOM node count
  • mounted page count
  • page overscan count
  • React commit count or render count when available
  • heap estimate when cheap to gather

Correctness checks:

  • no skipped/reordered characters during fast typing bursts;
  • insert break keeps following typed characters after the caret;
  • click left margin selects start of line;
  • click right margin selects end of line;
  • double click selects a word;
  • drag selection autoscroll works near top and bottom;
  • text selection across visible page content works;
  • native copy/paste/select-all behavior is preserved or explicitly classified.

Benchmark output must print METRIC lines. Example:

txt
METRIC typing_p95_ms=42
METRIC dropped_chars=0
METRIC dom_nodes=1840

Huge Document Select-All Default Contract

For react-huge-document-select-all, create or use a target with this contract.

Required cohorts:

  • 5k blocks against legacy Slate;
  • the main Slate v2 huge-document React surface;
  • staged/DOM-present or virtualized surface when that is the product path;
  • native keyboard select-all (Mod+A), not only programmatic model selection.

Primary metrics, lower is better:

  • react_huge_doc_select_all_p95_ms;
  • react_huge_doc_select_all_worst_p95_ratio versus legacy;
  • react_huge_doc_select_all_failure_count.

Secondary metrics:

  • DOM node count after select-all;
  • React commit count when cheap to collect;
  • selection export/import time when visible separately;
  • copy latency for the selected document when cheap to collect.

Correctness checks:

  • Mod+A selects the full editor document;
  • typing after select-all replaces the selected content once;
  • undo restores the previous document and selection coherently;
  • copy returns complete plain text for the selected document;
  • broad selection stays valid in staged, partial-DOM, or virtualized rendering;
  • no hidden debounce or delayed correctness.

Promotion target:

  • first pass: worst p95 ratio <=1.5 with failure count 0;
  • final target: worst p95 ratio <=1.0 or plateau after three correctness-green packets with less than 5% gain and no safe P0/P1 profiler hypothesis left.

Handoff

Report:

  • benchmark command and primary metric;
  • baseline, latest, and best values;
  • kept, discarded, crashed, and checks-failed packets;
  • correctness checks used;
  • files changed by kept work;
  • dashboard URL, if served;
  • next recommended packet or blocker.