docs/internal/scenario-meta-testing.md
Status: accepted (initial), 2026-05-20
Owner: test-infra
Related: e2e-test-migration-design.md, scenario-migration-status.md
The semantic test suite expresses tests as data: each scenario is a
value (initial state, events, expected state) consumed by a
fallible runner check_*(s) -> Result<(), ScenarioFailure> (and a
panicking assert_* wrapper). The mod.rs doc already frames these
values as consumed by multiple drivers: the regression runner,
proptest generators (property.rs), and shadow differentials
(shadow.rs).
Two recurring quality problems motivated this work (both surfaced by the PR #2061 migration audit):
anti_* tests +
check-semantic-migration-conventions.sh guard this manually,
per file.
We want automated meta-tests (a fourth driver) that consume the existing corpus and prove properties about the tests themselves.
cargo-mutants on production code (the complementary
axis: "do tests cover the code?" vs. "are the tests
load-bearing?").
FRESH_MUTATION=1) so the
normal cargo test fast path is unchanged; a dedicated CI job runs
the meta pass.
events/actions vector. Not the 2ⁿ powerset.
check(candidate) passes. (No assertion-flip guard, since
validation is dropped.)
original_len, minimal_len, and the dropped actions. Sort by
original_len - minimal_len.
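The one-at-a-time removal and the ranked report can be sketched as follows. This is a minimal illustration, not the suite's implementation: `MinimizationReport`, `minimize_actions`, and `rank_by_slack` are hypothetical names, `check` stands in for the fallible runner reduced to a boolean, a single greedy pass is assumed, and the descending sort direction is an assumption.

```rust
/// Outcome of minimizing one scenario's action list (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct MinimizationReport<A> {
    pub original_len: usize,
    pub minimal_len: usize,
    pub dropped: Vec<A>,
}

/// Greedy one-at-a-time removal: try dropping each action in turn and keep
/// the drop whenever `check` still passes on the shortened candidate.
/// Each original action is tried exactly once, so `check` runs n times,
/// not once per subset of the 2^n powerset.
pub fn minimize_actions<A: Clone>(
    actions: &[A],
    check: impl Fn(&[A]) -> bool,
) -> MinimizationReport<A> {
    let mut kept: Vec<A> = actions.to_vec();
    let mut dropped = Vec::new();
    let mut i = 0;
    while i < kept.len() {
        let mut candidate = kept.clone();
        let removed = candidate.remove(i);
        if check(&candidate) {
            // The expectation still holds without this action: drop it.
            // Do not advance `i`; it now points at the next action.
            kept = candidate;
            dropped.push(removed);
        } else {
            // The action is load-bearing: keep it and move on.
            i += 1;
        }
    }
    MinimizationReport {
        original_len: actions.len(),
        minimal_len: kept.len(),
        dropped,
    }
}

/// Sort reports so the largest `original_len - minimal_len` comes first
/// (assumed direction: most removable actions at the top).
pub fn rank_by_slack<A>(reports: &mut [MinimizationReport<A>]) {
    reports.sort_by_key(|r| std::cmp::Reverse(r.original_len - r.minimal_len));
}
```

A report whose `minimal_len` is 0 means `check` passed on the empty action list.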
minimal_len == 0 → the expectation holds with no actions →
vacuous test (this is the FAKE-test catch). Flag loudly.
minimal_len ≪ original_len → setup bloat / over-specification.
BufferScenario (its check is cheap: no render).
Active reset = actions that reverse the editor to the scenario's initial state, driven through the production action path, not a fresh harness. For the buffer layer:
Esc (cancel modal/selection)
RemoveSecondaryCursors
SelectAll
InsertChar × initial_text (first char replaces the selection)
MoveDocumentStart
clear selection
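The reset sequence listed above can be written as a function of the scenario's initial text. This is a sketch under stated assumptions: the `Action` enum below is a hypothetical stand-in for the production action type, and the variant names `Escape` and `ClearSelection` are guesses at how "Esc" and "clear selection" are spelled.

```rust
/// Hypothetical stand-in for the production action type.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Escape,
    RemoveSecondaryCursors,
    SelectAll,
    InsertChar(char),
    MoveDocumentStart,
    ClearSelection,
}

/// Build the active-reset sequence for a scenario whose initial buffer
/// content is `initial_text`.
pub fn reset_actions(initial_text: &str) -> Vec<Action> {
    let mut actions = vec![
        Action::Escape,                 // cancel any modal or selection
        Action::RemoveSecondaryCursors, // back to a single cursor
        Action::SelectAll,              // select the current content
    ];
    // Retype the initial text; the first inserted char replaces the selection.
    actions.extend(initial_text.chars().map(Action::InsertChar));
    actions.push(Action::MoveDocumentStart);
    actions.push(Action::ClearSelection);
    actions
}
```

Note that with an empty `initial_text` this sketch inserts nothing, so the selected content is never replaced; a real reset would need a separate path for that case.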
Honest limits (these bound where combination applies):
Undo-to-empty (fragile).
Instead, combination is scoped to BufferScenarios whose
observable is text/cursor/selection and that do not assert on
event_log_len / is_modified. Persistence / workspace / modal /
temporal scenarios cannot use action-reset (fs, multi-buffer, clock)
and are out of scope for combination.
Procedure:
reset,
and assert the observable equals the fresh-harness baseline. If
SelectAll + retype + MoveDocumentStart doesn't reproduce a clean
baseline on its own, reset is buggy: finding #0.
S1; reset; S2; reset; … on one
long-lived harness, asserting each Sᵢ's expectation at its
checkpoint.
Findings: Sᵢ passes alone but fails after Sⱼ ⇒ either
Sᵢ reads (extend reset, or
document the leaked state), or
Sᵢ secretly depends on ambient state (test smell).
Both are bugs the fresh-harness model can never surface, which is the entire reason for using an active reset instead of a fresh harness.
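The S1; reset; S2; reset; … procedure can be illustrated on a toy harness. Everything here is hypothetical (`ToyHarness`, `ToyAction`, `ToyScenario`, and `run_with_reset_between` are not the suite's types); the toy has a hidden caps flag that the reset does not clear, which is enough to reproduce the "passes alone but fails after Sⱼ" finding.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ToyAction {
    Type(char),
    ClearText,
    ToggleCaps,
}

/// Toy editor: the observable is `text`; `caps` is ambient state.
#[derive(Debug, Default)]
pub struct ToyHarness {
    text: String,
    caps: bool,
}

impl ToyHarness {
    pub fn apply(&mut self, action: ToyAction) {
        match action {
            ToyAction::Type(c) => {
                let c = if self.caps { c.to_ascii_uppercase() } else { c };
                self.text.push(c);
            }
            ToyAction::ClearText => self.text.clear(),
            ToyAction::ToggleCaps => self.caps = !self.caps,
        }
    }

    pub fn observable(&self) -> &str {
        &self.text
    }
}

#[derive(Debug, Clone)]
pub struct ToyScenario {
    pub name: &'static str,
    pub actions: Vec<ToyAction>,
    pub expected: &'static str,
}

/// Run every scenario on ONE long-lived harness, applying `reset` after each.
/// Returns one finding per failed expectation or failed reset.
pub fn run_with_reset_between(scenarios: &[ToyScenario], reset: &[ToyAction]) -> Vec<String> {
    // Baseline observable as seen on a fresh harness.
    let baseline = ToyHarness::default().observable().to_string();
    let mut harness = ToyHarness::default();
    let mut findings = Vec::new();
    for s in scenarios {
        for &a in &s.actions {
            harness.apply(a);
        }
        // Checkpoint: assert this scenario's expectation.
        if harness.observable() != s.expected {
            findings.push(format!(
                "{}: expected {:?}, got {:?}",
                s.name,
                s.expected,
                harness.observable()
            ));
        }
        for &a in reset {
            harness.apply(a);
        }
        // The reset itself must reproduce the fresh-harness baseline.
        if harness.observable() != baseline {
            findings.push(format!("reset after {} did not restore baseline", s.name));
        }
    }
    findings
}
```

In this toy, a scenario that types `b` passes alone but fails when it runs after a scenario that toggled caps, because the reset only clears text. The reset check still passes, since the leaked flag is not part of the observable.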
There is exactly one way to evaluate a BufferScenario's actions:
run_buffer_actions(harness, &[Action]) (in buffer_scenario.rs). It
always renders — a frame before the first action and after each
action — exactly as the real event loop renders before every keystroke.
Every consumer routes through it:
check_buffer_scenario (the per-test regression runner),
run_scenarios_with_reset_between (the combination driver),
evaluate_actions (proptest generators + the shadow differential).
So the drivers differ ONLY along explicit axes, namely harness lifetime (fresh-per-call vs one shared harness + active reset) and inputs (fixed vs proptest-generated), never in dispatch/render semantics.
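The render-before-first-action, render-after-each-action contract can be made concrete with a small sketch. The trait and names below (`RenderingHarness`, `run_actions_always_render`, `RecordingHarness`) are hypothetical and only model the call order; they are not the signature of `run_buffer_actions`.

```rust
/// Hypothetical minimal harness interface: render a frame, dispatch an action.
pub trait RenderingHarness {
    type Action;
    fn render(&mut self);
    fn dispatch(&mut self, action: &Self::Action);
}

/// Single evaluation path: one frame before the first action and one frame
/// after every action, so n actions always produce n + 1 frames.
pub fn run_actions_always_render<H: RenderingHarness>(harness: &mut H, actions: &[H::Action]) {
    harness.render();
    for action in actions {
        harness.dispatch(action);
        harness.render();
    }
}

/// Toy harness that records the order of calls for inspection.
#[derive(Debug, Default)]
pub struct RecordingHarness {
    pub trace: Vec<String>,
}

impl RenderingHarness for RecordingHarness {
    type Action = &'static str;

    fn render(&mut self) {
        self.trace.push("render".to_string());
    }

    fn dispatch(&mut self, action: &Self::Action) {
        self.trace.push(format!("dispatch:{}", action));
    }
}
```

Because every driver would call the same function, a fresh-per-call driver and a shared-harness driver can differ only in when the harness is constructed, not in how actions are dispatched or rendered.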
Why always render (not no-render-for-speed). The real editor always
has a rendered frame before handling input, so a no-render harness is
strictly less faithful: layout-dependent actions (MoveDown,
MoveLineEnd, SelectLineEnd, …) resolve against the rendered line
structure and silently no-op without a render — cursor never moves,
no selection forms. That silent no-op is a footgun: it bit a real change
(two corpus scenarios appeared to fail; the actual cause was the missing
render). Always rendering removes the footgun, lets the single unified
corpus hold any scenario (logical or layout-dependent), and means
LayoutScenario is distinguished only by what it asserts (rendered
rows / viewport / cell colors), not by whether it renders.
Cost: rendering is ~4 ms/frame; always-render added ~15% to the semantic suite (≈133 s → ≈155 s) — accepted for the faithfulness + unification.
Shadow note: the corpus also feeds the pure-state shadow differential. A
real (non-delegating) shadow can't model layout-dependent cursor
movement, so it should skip those scenarios via the existing
supports_scenario capability filter — a reason to filter, not to keep
them out of the corpus.
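A capability filter of this kind might look like the sketch below. The real `supports_scenario` signature is not shown in this document, so the trait, the `Scenario` struct, and the `Action` enum here are all hypothetical; only the set of layout-dependent actions (MoveDown, MoveLineEnd, SelectLineEnd) is taken from the text above, and it is not exhaustive.

```rust
/// Hypothetical stand-in for the production action type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    InsertChar(char),
    MoveDown,
    MoveLineEnd,
    SelectLineEnd,
    SelectWord,
}

/// Actions that resolve against the rendered line structure.
pub fn is_layout_dependent(action: &Action) -> bool {
    matches!(
        action,
        Action::MoveDown | Action::MoveLineEnd | Action::SelectLineEnd
    )
}

pub struct Scenario {
    pub name: &'static str,
    pub actions: Vec<Action>,
}

/// Hypothetical capability interface for a shadow model.
pub trait Shadow {
    fn supports_scenario(&self, scenario: &Scenario) -> bool;
}

/// A pure-state shadow that cannot model layout-dependent movement.
pub struct PureStateShadow;

impl Shadow for PureStateShadow {
    fn supports_scenario(&self, scenario: &Scenario) -> bool {
        !scenario.actions.iter().any(is_layout_dependent)
    }
}

/// The corpus stays unified; each shadow picks the scenarios it can model.
pub fn scenarios_for<'a>(shadow: &impl Shadow, corpus: &'a [Scenario]) -> Vec<&'a Scenario> {
    corpus
        .iter()
        .filter(|s| shadow.supports_scenario(s))
        .collect()
}
```

The point of the design is that the filter lives with the shadow, so adding a layout-dependent scenario to the corpus never requires touching the corpus structure.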
BufferScenario (self-contained; no reset needed) →
produce the ranked report; triage vacuous/bloated tests.
cargo-mutants nightly job.
minimal_len == 0 report to
a hard CI failure (would replace the manual anti_* convention for
the buffer layer).
select_word_selections_at_each_grapheme (the only remaining
bespoke buffer-eval loop) onto run_buffer_actions for full
consistency; currently left as-is since SelectWord is logical and
the per-grapheme loop wants to skip render cost.