docs/performance/performance-benchmark-spec.md
This file defines the benchmark program for Plate's editor-performance work.
It is not a public summary. It is the contract behind the benchmark runner, the results UI, and any public claims.
If a benchmark page, doc, or chart says something stronger than this file supports, that output is wrong.
This benchmark needs two levels of truth:
Performance work here starts from the editor-behavior authority stack:
That means:
These docs are the best current source for shaping the first benchmark model.
The lasting source of truth for cross-editor benchmarking is not Plate docs. It is the benchmark-owned registry.
That registry must:
Rules:
The benchmark must inherit the scenario model from the editor-behavior work, but it must not stay editorially Plate-owned forever.
This benchmark should copy the best parts of
js-framework-benchmark, not the shallow parts:
What it must not copy:
The benchmark is valid only when correctness and performance are both visible.
This benchmark targets rich-text markdown editing plus the mainstream editor surfaces that surround it.
In scope:
Out of scope:
The benchmark model is multi-editor by design.
Current editors:
Future editors can include:
The registry and UI must not hardcode Plate vs Slate assumptions.
This is the missing fairness boundary.
There is no honest single comparison universe for all editor products. The benchmark must define profiles and rank editors only inside the relevant profile.
Required profiles:
core-markdown-editor
extended-editor-surface
core-markdown-editorRules:
N/A outside a profile and must not count
against the editorThis kills the main fairness bug: no editor should lose a markdown benchmark because another editor also ships callouts, columns, or media blocks.
The benchmark program has four suites.
This is the hard behavioral gate.
It is derived from editor-protocol-matrix.md.
Every protocol row is a benchmarkable scenario record with:
familyentitycontextselectioncaret_or_edgeinputexpectedauthorityspec_idevidencestatusThe conformance suite reports:
This suite is the difference between “fast” and “correct.”
This measures the latency of concrete editing operations.
It is not enough to benchmark mount and typing. The benchmark must measure the operations users actually hit while editing:
This measures whole-workflow behavior across realistic document families and sizes.
It answers:
This measures everything outside direct editing latency that still matters in a production editor:
The benchmark registry has two layers:
The scenario registry comes from the protocol matrix and remains exhaustive.
Its job is coverage and correctness.
Long term, this registry must become benchmark-owned and editor-neutral.
Minimum scenario-registry fields:
scenario_idprofile_idsprotocol_familyfeature_familyentitycontextselection_shapecaret_or_edgeinputexpectedauthority_primaryauthority_secondaryauthority_syntaxstatustimed_candidatetimed_priorityThe measurement registry defines benchmark lanes with stable ids.
Its job is timing, memory, startup, and payload.
The benchmark UI should render measurement lanes directly and use the scenario registry for correctness gating and drill-down.
This is mandatory. Without it, the spec becomes a combinatorial joke.
Rules:
Each timed lane must declare:
Hard selection formula:
For each active benchmark profile, for each supported feature family:
structural-edittext-mutationselection-navigationclipboardhistorycollapsed representative if the family supports collapsed
editingexpanded representative if the family supports expanded
editingSo the minimum timed set for one feature family is:
operation-class × relevant selection classes+ 1 scaling lane+ 1 stress lane when warrantedEscalation rules:
Every protocol family needs:
Escalation rule:
That is how we get exhaustive correctness without drowning in meaningless timed cells.
The benchmark must recognize these protocol families from the editor-behavior source docs:
markdown-nativemarkdown-extensionblock-editor-nativestyling-layoutcollaborationOnly the first four are part of the current benchmark claim.
collaboration stays visible as deferred or excluded, not silently omitted.
The measurement registry must cover these feature families because they are the real content surfaces people expect from a Typora-grade markdown editor:
Not every editor will support every family. Unsupported families must be shown as unsupported, not dropped from the model.
But unsupported does not mean “ranking loser.” Whether unsupported families matter depends on the active benchmark profile.
Every interaction lane should be taggable by these axes:
familyentitycontextselection_shapecaret_or_edgeinput_sourcedocument_familydocument_sizeRequired document families:
plain-paragraphsmixed-markdownquote-heavylist-heavytask-list-heavycode-heavytable-heavyheavy-marksmixed-rich-textRequired sizes:
1k5k10k50kRequired shapes:
collapsedexpanded-inlineexpanded-multiblockbackward-expandedcell-rangenode-selectedRequired sources:
“Same scenario” must mean the same semantic document, not two hand-built docs that feel similar.
The benchmark needs a neutral fixture layer.
Every benchmark lane must point at a canonical fixture id.
Each fixture entry needs:
fixture_idprofile_idsdocument_familydocument_sizesemantic_sourceserialization_variantsrequired_feature_familiesFixtures should come from:
Each fixture can have multiple source representations:
Rules:
Editors may adapt the fixture into their internal model, but they may not:
unsupportedIf an editor cannot represent a required construct for the active profile, the
lane is N/A or disqualifying according to the profile rules. It is not
allowed to mutate the corpus into a different workload and call it fair.
The interaction suite should mirror the stable-id style of
js-framework-benchmark.
01_ready-empty
02_mount-1k03_mount-10k04_mount-50k05_replace-same-size06_append-1k-to-1k07_append-5k-to-10k08_remove-single-block09_clear-document10_type-middle11_type-start12_type-end13_type-inside-marked-text14_partial-update-every-10th-block15_partial-update-every-10th-leaf16_enter-split-paragraph17_backspace-merge-block18_delete-forward-merge19_tab-indent20_shift-tab-outdent21_toggle-mark-selection22_toggle-block-selection23_select-single-caret24_shift-arrow-expand-inline25_shift-arrow-expand-cross-block26_mouse-drag-range27_arrow-nav-cross-block28_select-table-range29_paste-plain-text30_paste-html-rich-text31_paste-markdown32_paste-large-fragment33_paste-duplicate-id-fragment34_undo-single-change35_redo-single-change36_undo-after-large-paste37_redo-after-structural-edit38_move-block-up39_move-list-item40_swap-adjacent-blocksThe benchmark ids above are generic operation ids. They must be instantiated across feature families.
For a Typora-grade benchmark, the interaction registry must include at least these concrete feature + operation combinations:
These are the minimum canonical timed representatives. The protocol matrix remains broader than this list.
CPU and interaction benchmarks must reserve these slices:
totalscriptlayoutpaintotherThe current runner may not expose every slice yet. That is an implementation gap, not a reason to shrink the spec.
Every timed lane must record:
Every lane must also declare:
The benchmark UI must support these display modes:
meanmedianbox-plotOptional later:
p95worstThe result payload should still store p95 and worst even if the first UI
pass does not surface them yet.
Like js-framework-benchmark, not every lane should run at the same CPU
profile.
Heavy interaction lanes should support throttled runs where that makes the difference visible:
The benchmark artifact must record the exact throttle factor for every run.
Memory is not one number.
The required memory lanes are:
51_ready-memory
52_mount-1k-memory53_mount-10k-memory54_mount-50k-memory55_typing-churn-memory
56_paste-clear-memory
57_history-churn-memory
58_table-selection-memoryMemory results should include:
The benchmark must explicitly list real editor dimensions that are deferred instead of pretending they do not exist.
Current active clipboard lanes cover direct editor-side paste costs.
Deferred here:
Current active selection lanes cover basic range and table selection cost.
Deferred here:
Deferred here:
Deferred here:
Rules:
The startup suite should match the rigor of the startup and Lighthouse lanes in
js-framework-benchmark.
Required startup lanes:
61_startup-time62_consistently-interactive63_script-bootup64_main-thread-work65_first-paint66_first-contentful-paint67_editor-readyeditor-ready is editor-specific and required. A page that painted is not
necessarily an editor that is ready for input.
Headline benchmark publication uses one environment first.
Required primary environment:
Chrome stableMacBook Pro 16-inchSpace BlackApple M5 Max18-core40-core16-core128 GB unified memory2 TB SSDThis machine profile must be recorded exactly in published artifacts and result metadata.
Required captured environment metadata:
Deferred environments:
Those are explicitly deferred, not forgotten. The benchmark UI should expose the current browser/environment selector model even if only one environment is populated at first.
Required payload lanes:
71_size-uncompressed72_size-compressed73_editor-route-js74_editor-route-css75_total-byte-weightPayload should be reported per editor route, not as a vague app bundle total.
Correctness is a first-class suite, not a note beside performance.
Required correctness metrics:
81_protocol-coverage82_protocol-pass-rate83_open-critical-regressions84_open-major-regressions85_family-completenessAn editor with unresolved correctness failures must be visibly flagged.
The UI must support:
Correctness-clean means:
No editor should appear as a clean performance leader when it fails protocol rows in the same family.
The benchmark system should maintain a known-issues registry similar to
js-framework-benchmark.
Each issue entry needs:
The results UI must show:
The homepage must be a dense results table.
Not a dashboard. Not cards first. Not charts first.
Table model:
Sortable keys:
Required sticky surfaces:
Required editor metadata rows:
The result app should expose these controls:
Which editors?Which benchmarks?Which profile?Which protocol families?Which document families?Which sizes?Which environment?Display modeDuration sliceCompare withHide flaggedShow only correctness-cleanCopy / paste current selection stateIf a control does not map to a meaningful benchmark dimension, it should not exist.
Charts are secondary views. They must help someone understand the table, not replace it.
Required chart families:
Do not add:
Allowed:
Not allowed:
If an editor is missing support for a family, that family stays visible as unsupported.
Ranking rules:
N/A outside the active profile: visible, excluded from rankingN/A inside the active profile: visible, disqualifying for clean rankingThe benchmark pipeline has three stages, just like
js-framework-benchmark:
Every raw result file must capture:
The compiled results payload must include:
The UI should not compute benchmark identity from ad hoc labels. The ids and family groupings belong in the registry.
Any headline comparison must satisfy all of these:
If any of those drift, the lane is no longer headline material.
Public benchmark pages may claim:
Public benchmark pages may not claim:
A benchmark implementation in this repo is only credible when it has:
Anything less is a probe, not a benchmark lane.