karate-js-test262

ECMAScript test262 conformance harness for karate-js. Reproducible pass/fail matrix across the ES surface area, declarative skip list (etc/expectations.yaml), and the roadmap for what to tackle next. Not published to Maven Central.

The bar: can karate-js run real-world JavaScript written in the wild, especially by LLMs? test262 is the scorecard; the goal is pragmatic ES6 coverage of idiomatic code, not spec-lawyer compliance for its own sake.

See also: ../karate-js/README.md — what karate-js is · ../docs/JS_ENGINE.md — engine architecture, slot family, prototype machinery, spec invariants, benchmarks. Design reference for every TODO below. · ../docs/DESIGN.md — wider project design · test262 INTERPRETING.md — authoritative test-runner spec.

This file is the roadmap. For why a TODO exists or how a subsystem is shaped, follow the JS_ENGINE.md anchors below.

Working principles

Operating-mode maxims for the test262 conformance loop. Treat as load-bearing.

1. Real-world JS first; test262 is the scorecard, spec is ground truth. A fix that unblocks 500 idiomatic tests beats one that tightens a rare spec corner. Existing JUnit tests can be wrong: when the spec disagrees, the spec wins, so fix the test along with the engine.
2. Errors must look like JavaScript, not Java. A raw IndexOutOfBoundsException or an at io.karatelabs.js.Interpreter.eval(...) frame escaping Engine.eval(...) is a correctness bug, not cosmetic noise. See JS_ENGINE.md § Exception Handling and § Error routing & shape.
3. Fix friction before moving on. Bad error messages in results.jsonl, parse-vs-runtime classification gaps, missing report fields, --single -vv not showing what you need: stop and fix the tooling rather than working around it.
4. Protect the hot path; pay edge-case cost on the edge case. Sentinels over thrown signals, type-check rare cases after the common-case miss, parse-time analysis over inner-loop checks. After any non-trivial engine change, run EngineBenchmark profile and compare against JS_ENGINE.md § Performance Benchmarks.
5. Code should be DRY and aligned with the JS spec. Near-duplicate dispatch and wrong-layer workarounds are clues that the layer below is wrong; collapse to a single spec-shaped seam. Fix it inline or file a Deferred TODO with concrete pointers (file, method, what the unification looks like); vague "this could be cleaner" notes are worthless.
6. Batched commits are fine if the message enumerates the changes. What matters is that the commit message lets a future bisect attribute regressions.
7. Aggregate, don't dump; context is precious. A full run is ~53k JSONL rows. Treat run output as files to query, not streams to tail. Full rules in Context discipline.
8. Playbook hygiene is the work, not a chore. Stale counts, "past wins" narration, log patterns that flood context, JSONL the queries can't parse: fix the rot inline in the session that surfaced it. Fix it at the writer, not in a workaround. A playbook future sessions can trust is worth more than a museum piece.
9. Refactor (or rewrite) boldly; the regression net is the license. This repo carries an unusually strong safety net: the test262 language slice with byte-for-byte FAIL-set diffing (Diff two run-dirs), 1086+ unit tests with SpecPinTest spec-invariant pins, 2224+ karate-core consumer tests, and JIT-stable benchmarks. That net exists so you can do the right structural thing instead of accreting another local workaround. When a subsystem is fighting you (near-duplicate traversals, a check at the wrong layer, a seam that every new feature has to special-case), you are empowered to restructure or rewrite it, not just patch around it. This is the active form of principle #5: #5 says spot the wrong-layer smell; #9 says act on it. The discipline that makes boldness safe, not reckless: (a) state the smell and the target shape before cutting; (b) keep behavior-preserving refactors and new behavior in separate commits; (c) gate every such change on the full net (unit tests, test/language/** 0-regression diff, EngineBenchmark profile within budget, karate-core consumer check) and quote the before/after in the commit. A refactor that the net certifies as behavior-identical is always cheaper than the compounding cost of the workaround it removes. (Worked example: the 2026-05-30 fused early-error walk, where three full-tree validation passes collapsed to one, ~13% of parse CPU reclaimed, FAIL set byte-for-byte identical. See Engine — cleanup → Fuse the early-error parse walks.)

Per-session ritual

Each session that touches the engine should:

Re-probe the slice baseline with --only before scoping. Old slice numbers go stale fast — record fresh before/after pass counts in the commit message and pin the run-dir. If target/test262/ has no run-* dirs yet (clean clone, or after mvn clean), your first --only invocation is the baseline; pin its run-dir in the commit so the next session has a diff target.
Unit tests: mvn -f pom.xml -pl karate-js -o test → Tests run: 1086+, Failures: 0, Errors: 0, Skipped: 2 (count grows as SpecPinTest accretes invariants).
test262 built-ins probe: diff results.jsonl against the previous run. Zero regressions (PASS → FAIL). Document any flip in the commit message.
EngineBenchmark profile: within ±10% of the JS_ENGINE.md reference; ±5% on hot-path refactors. If unavoidable (correctness > speed), update the reference table in the same commit.

karate-core consumer check:

mvn -f pom.xml -pl karate-js -o install -DskipTests
mvn -f pom.xml -o test -pl karate-core

Expect Tests run: 2224+, Failures: 0, Errors: 0, Skipped: 3.

Update this file's TODOs in the same commit. This is a roadmap, not a changelog. For each item the commit addressed (active priority bullet, background sweep, deferred TODO, or implicit assumption a bug broke), strike or rewrite it here. If the work surfaced a new architectural invariant — a contract that future code must respect — push the why into JS_ENGINE.md under the relevant spec-invariant anchor, then leave a one-line pointer here from any TODO that still depends on it. Yesterday's done work doesn't belong in this file; the commit log is the audit trail.
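The zero-regression gate in the ritual above reduces to a set difference over failing paths. A minimal sketch, assuming each run has already been reduced to a list of failing test paths over the same slice; the `diffFailSets` helper and the sample paths are hypothetical, and the capped Diff two run-dirs recipe remains the tool to use in practice:

```javascript
// Hypothetical helper: compares the FAIL sets of two runs over the same slice.
// Inputs are arrays of failing test paths, one array per run.
// Caveat: a path missing from the "before" FAIL set is assumed to have passed
// there; a test that was skipped before would need separate handling.
function diffFailSets(beforeFails, afterFails) {
  const before = new Set(beforeFails);
  const after = new Set(afterFails);
  // Regression: failing now, not failing before (PASS -> FAIL).
  const regressed = [...after].filter((p) => !before.has(p)).sort();
  // New pass: failing before, no longer failing (FAIL -> PASS).
  const newPass = [...before].filter((p) => !after.has(p)).sort();
  return { regressed, newPass };
}

// Illustrative sample data, not real run output.
const beforeFails = ["test/language/a.js", "test/language/b.js", "test/language/c.js"];
const afterFails = ["test/language/b.js", "test/language/d.js"];

const diff = diffFailSets(beforeFails, afterFails);
// Capped summary: counts plus at most three representative paths.
console.log(`regressed=${diff.regressed.length} new-pass=${diff.newPass.length}`);
console.log("regressed sample:", diff.regressed.slice(0, 3));
```

Any non-empty `regressed` list is what the commit message has to document.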

Context discipline

A full conformance run is ~53k JSONL rows. A slice (test/language/**) emits one FAIL <path> — <type>: <msg> line per failure on stdout plus a growing results.jsonl.partial. Per-test -vv dumps full source. Pulling any of this raw into your context burns the budget you need for the actual engineering work. Treat run output as files to be queried, not streams to be tailed.

Rules:

Never tail -f or cat a full progress.log / results.jsonl. For liveness, tail -n 1 <progress.log> returns the last heartbeat (processed N pass M fail K skip L @ rate), which is the authoritative tests-done count in either mode. (wc -l <partial> counts only FAIL+SKIP rows in dev mode, so don't use it for total-processed.) For slicing, use the Failure triage jq one-liners.
Default --single to -v, not -vv. -v prints metadata + classification + the engine's location: <path>:<line>:<col> — usually enough to find the call site. Escalate to -vv (full test source) only after -v fails to localize the cause.
Cap diff output. When comparing two run-dirs, emit counts + top-N representative paths + per-slice cluster breakdown. Never the full regressed / new-pass lists. The Diff two run-dirs recipe is already capped — use it as written.
Delegate slice runs to a sub-agent with a strict return contract. Spawn a general-purpose agent (it has Bash) and require a ≤200-word digest: pass/fail/skip counts, top 3 failure clusters with one example each, anything surprising. The agent reads the full output; you receive the digest. See Delegate a slice run for the exact prompt template.
Prefer reading engine source over reading log streams. A FAIL line tells you what threw; the engine source tells you why. Once you have one representative failing path and the call site from --single -v, close the JSONL and work from the code — the slice re-run to confirm the fix is a single etc/run.sh --only away (delegate it).
Mvn output is verbose — pipe to tail -n 30. Unit tests, benchmark, and karate-core consumer check from the per-session ritual all dump compile noise before the summary. mvn ... -o test 2>&1 | tail -n 30 is enough to see Tests run: ... and any failures. Use -q to suppress compile chatter when you don't need it.
etc/expectations.yaml is 175 lines — fine to Read whole when editing the skip list. Long-form files in target/test262/run-*/ are not — query them.

Active priorities

Strict mode + onlyStrict — DONE and un-skipped. The keystone landed: the parser tracks lexical strictness (JsParser.checkStrictEarlyErrors, a strict-gated post-parse walk: program prologue → function-body prologue → always-strict class bodies) and enforces the simple-binding early errors (legacy/non-octal-decimal literals 0755/08, eval/arguments as assign/update target or function-name / param / var-binding name) plus the full BoundNames walk over binding patterns (collectBoundNames — duplicate names in arrow params / non-simple parameter lists / catch params, and eval/arguments bound inside any pattern). The runner prepends a "use strict" directive for flags: [onlyStrict] (Test262Runner.evaluate), and the flag: onlyStrict skip is removed. Measured onlyStrict un-skip (test/language/**): 282 PASS / 146 FAIL, 0 regressions (lang pass 5433 → 5715). ⚠️ This only worked once a latent tooling bug was fixed — etc/run.sh ran the runner via exec:java, which does not recompile, so edits to Test262Runner (the strict-prepend) silently never took effect; run.sh now test-compiles the module first. The prepend had measured as a no-op (71/357) for a full cycle because of this.

Next up — negative parse-phase early errors (the dominant test/language cluster). A probe of test/language/statements/** buckets these under the new MissingParseError error type (negative phase: parse tests the engine parses instead of rejecting — see Results schema): 141 remain in the statements slice (was 183; −42 from the declaration-in-statement-position sweep below). Use it to scope: jq -r 'select(.error_type=="MissingParseError").path'. Function- AND lexical/class-declaration-in-statement-position, AND per-scope lexical-redeclaration are DONE (see Background sweeps — the latter cleared the whole switch/syntax/redeclaration/** cluster, 24→0). The residual is the long tail of other early errors: escaped-keyword / reserved-word misuse, break/continue to undefined labels, new.target / super outside a method, getter/setter arity, etc. Pick the next sub-cluster by count from the MissingParseError histogram. The remaining for-(of|in)/dstr/** cluster (~56) is a fragmented long tail of distinct destructuring-pattern rules — lower leverage per unit effort. One known un-enforced parser corner carried over: a "use strict" prologue inside a non-simple-parameter-list function is itself an early error. (Lexical duplicate-BoundNames for let/const patterns — let {a,a}=… — is now covered by the redeclaration walk's per-VAR_DECL BoundNames collection.)

Also residual from the onlyStrict un-skip: (2) ~16 runtime SyntaxError not thrown; (3) ~14 strict-assignment runtime TypeError (arguments-object write guards et al.). with-statement early error stays deferred (with lexes as a call; statements/with/** path-skipped — negligible payoff). Details: Deferred TODOs → Strict mode.

Beyond that, remaining work is concentrated in test/language/**, dominated by destructuring-assignment pattern parsing (see Background sweeps). Symbol stays parked — real-world JS doesn't use Symbol(...), and the well-known symbols (@@iterator / @@toPrimitive / @@toStringTag) already work as string stand-ins. For current pass/fail/skip counts, query the latest run-dir (Recipes → Failure triage) — counts go stale fast and don't belong in this file.
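Picking the next sub-cluster by count from the MissingParseError histogram could look roughly like the sketch below. It assumes only the two row fields the jq filter above already relies on (`path` and `error_type`); the sample rows, the `missingParseHistogram` helper, and the bucketing depth are hypothetical.

```javascript
// Sample JSONL text standing in for results.jsonl (hypothetical rows;
// only the `path` and `error_type` fields are assumed to exist).
const jsonl = [
  '{"path":"test/language/statements/labeled/x.js","error_type":"MissingParseError"}',
  '{"path":"test/language/statements/labeled/y.js","error_type":"MissingParseError"}',
  '{"path":"test/language/statements/break/z.js","error_type":"MissingParseError"}',
  '{"path":"test/language/statements/for/w.js","error_type":"TypeError"}',
].join("\n");

// Count MissingParseError rows per sub-cluster, keyed by the first
// `depth` path segments, and return [bucket, count] pairs, largest first.
function missingParseHistogram(text, depth = 4) {
  const counts = new Map();
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines
    const row = JSON.parse(line);
    if (row.error_type !== "MissingParseError") continue;
    const bucket = row.path.split("/").slice(0, depth).join("/");
    counts.set(bucket, (counts.get(bucket) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const histogram = missingParseHistogram(jsonl);
for (const [bucket, n] of histogram) console.log(n, bucket);
```

The output is a handful of lines per slice, which keeps it within the Context discipline rules.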

Built-in health — the business-rules / logic-scripting surface

Qualitative verdict from a scoped probe of the data-type built-ins (the methods business-rules and logic scripts actually lean on). Counts rot, so they're omitted — re-probe with --only 'test/built-ins/<X>/**' for fresh numbers. The shape of what's solid vs. gapped is the durable part:

Number, Date, Object: solid. toFixed/parseInt/parseFloat/toString(radix)/isNaN/isInteger; Date parse/format/getTime/getFullYear/arithmetic; keys/values/entries/assign/freeze/create/getPrototypeOf/hasOwn/spread/fromEntries all work. Residual fails are spec-corner arg-validation (missing TypeError on bad args), descriptor-attribute edges, and Symbol gates, not core method behavior. Object had zero Java-leak rows.
String, Array: solid on the common path. split/replace/slice/substring/indexOf/includes/trim/pad/case and push/map/filter/reduce/slice/concat/find/sort/from/spread all work. The low raw pass-% is dominated by strict coercion-error semantics and Symbol/feature gates, not everyday breakage. Caveats are a few principle-#2 Java leaks: String.prototype.replaceAll/endsWith Range […) out of bounds, and Array at near-2³³ lengths (Index out of bounds / VM-size / heap). Narrow but real; see Cleanup residuals.
RegExp: solid for everyday patterns; advanced-pattern tail remains. test/exec/match/replace (string + function replacer) / split / search, g/i/m flags, and named-group capture (m.groups.name, $<name> substitution, the function-replacer groups arg) all work, with Java Pattern as the backend. Remaining gaps: lookbehind, unicode property escapes, /v flag, group-name early-error validation; plus null-arg Java leaks in exec/test and one catastrophic-backtracking timeout. The Symbol.{replace,match,matchAll,split,search} protocol fails are conformance-only: the everyday str.replace(re, fn) path does not route through them and works.

Bottom line for the target workload: String/Number/Date/Array/Object are dependable, and RegExp now covers the common path including named groups; the residual RegExp tail (lookbehind / unicode escapes / /v / early-error validation) is advanced-pattern territory.
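For reference, the named-group surface described above corresponds to the following standard ECMAScript behavior. This is a small sketch of what a script author would expect to see, with illustrative patterns and inputs, not an excerpt from the test suite.

```javascript
const re = /(?<year>\d{4})-(?<month>\d{2})/;

// exec attaches a `groups` object keyed by group name.
const m = re.exec("released 2026-05");
const year = m.groups.year;
const month = m.groups.month;

// $<name> substitution in a string replacement.
const swapped = "2026-05".replace(re, "$<month>/$<year>");

// A function replacer receives the groups object as its last argument.
const viaFn = "2026-05".replace(re, (...args) => {
  const groups = args[args.length - 1];
  return `${groups.month}.${groups.year}`;
});

// A named group that does not participate in the match is undefined.
const alt = /(?<a>x)|(?<b>y)/.exec("y");
const nonParticipating = alt.groups.a;

// When the pattern has no named groups, `groups` itself is undefined.
const noGroups = /(\d+)/.exec("42").groups;

console.log(year, month, swapped, viaFn, nonParticipating, noGroups);
```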

Slice	What's blocking it
`test/language/statements/for-of`	IteratorClose done — on destructuring (normal/throwing/non-object `return()`, rest-skips-close) AND on abrupt loop exit (break/return/throw closes the outer iterator); body-skip-on-abrupt-binding; member-expression LHS (`for (x.attr of …)`) is now PutValue (invokes setters) not a var declaration — this also fixed the `body-put-error.js` infinite-loop hang. (`Interpreter.destructurePattern`/`evalForStmt` + `JsIterator.close`.) Remaining: assignment-pattern target-eval-order (`[ obj[sideEffect()] ] of …` must evaluate the target reference before stepping the iterator — the `thrw-close` family, a rare spec corner); fn-name inference for `[x = (function(){})] of …`; negative-parse tightenings; `array-elem-put-let.js`-style ReferenceError-on-bad-target (now fires under in-body `"use strict"`; `onlyStrict` variants stay SKIP).
`test/language/expressions/object`	Escaped-keyword cover-name dominates; `__proto__`-duplicate edges; computed-key / spread / method-def tail.
`test/language/expressions/assignment`	Iterator-return semantics on default-expr throw.
`test/language/{statements,expressions}/function` + `arrow-function`	fn-name inference for `[x = (function(){})]`-style defaults; IteratorClose-on-throw; rest-element edges.
`test/language/expressions/compound-assignment`	Strict-mode ReferenceError on undeclared LHS now fires under in-body `"use strict"` (the `onlyStrict`-flagged variants stay SKIP until the runner runs a strict pass); `valueOf` / ToNumeric ordering for `+=` / `=`; `A5._T2/T3` family (non-identifier LHS — Annex-B carve-out).
`test/language/statements/{try,for,switch}`	Control-flow tail; abrupt-completion handles headline cases.
`test/built-ins/Array/**`	`splice` / `concat` `Symbol.species` (Symbol-gated).
`test/built-ins/RegExp/**`	Named-group capture access done (`result.groups` + `$<name>` + function-replacer `groups` arg; see Background sweeps). Residual: group-name early-error validation, `Symbol.{match,replace,search,split,matchAll}` protocol (Symbol-gated, conformance-only — everyday `str.replace(re,fn)` doesn't use it), lookbehind / unicode-property-escapes / `/v` flag (feature-gated), one functional-replace-global ordering case. Null-arg Java leaks + one catastrophic-backtracking timeout in `exec`/`test` (principle #2 — see Cleanup residuals).
`test/built-ins/String/**`	`substring` / `lastIndexOf` / `charAt` ToInteger corners; parser-blocked; Symbol-gated tail; `replaceAll`/`endsWith` `Range […) out of bounds` Java leaks (principle #2). See JS_ENGINE.md § Spec preamble at built-in entry points.
`test/built-ins/Object/**`	Descriptor edges; `seal` (TypedArray-gated); Annex-B `arguments` aliasing. See JS_ENGINE.md § Property attributes.
`test/built-ins/JSON/**`	`JSON.parse` reviver / `JSON.stringify` replacer 2-arg semantics; `-0`/`__proto__` parser tail. Calibration: run JSONTestSuite (see JS_ENGINE.md § Future TODO Items).
`test/built-ins/Number/**`	`[object Number]` (Symbol-gated) + a literal-form parser edge.
`test/built-ins/Date/**`	ISO format edges + invalid-date propagation. See JS_ENGINE.md § Date.
`test/built-ins/Symbol/**` (parked)	Symbol primitive. Deprioritized — no real-world code uses it. Pick up after the language work.
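The for-of row above packs several behaviors into one cell. The sketch below shows the spec behavior for two of them, IteratorClose on abrupt exit and PutValue on a member-expression LHS, as a standard engine runs it. It uses the standard `Symbol.iterator` spelling; whether karate-js accepts that spelling or only its string stand-in is not confirmed here, and `makeIterable` is a hypothetical helper.

```javascript
const log = [];

// An iterable whose iterator records when it is closed.
function makeIterable(values) {
  return {
    [Symbol.iterator]() {
      let i = 0;
      return {
        next() {
          return i < values.length
            ? { value: values[i++], done: false }
            : { value: undefined, done: true };
        },
        return() {
          log.push("closed");
          return {};
        },
      };
    },
  };
}

// Abrupt exit: `break` must call the iterator's return().
for (const v of makeIterable([1, 2, 3])) {
  if (v === 2) break;
}

// Normal exhaustion: return() is not called.
for (const v of makeIterable([1])) {
  log.push("saw " + v);
}

// Member-expression LHS: each step assigns through the setter
// instead of declaring a variable.
const seen = [];
const target = {
  set attr(v) {
    seen.push(v);
  },
};
for (target.attr of [10, 20]) {
}

console.log(log, seen);
```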

Background sweeps

Picked off opportunistically when nearby — not session-sized on their own.

Function-declaration-in-statement-position early error — DONE. JsParser now rejects a FunctionDeclaration used as the sole body of a Statement clause (its body STATEMENT directly wraps an FN_EXPR; a braced body is a BLOCK and stays legal). Loop bodies (for / for-in / for-of / while / do-while) are an early error in BOTH modes, with no Annex B carve-out, so they live in validateEarlyErrors (checkNoFunctionDeclarationBody). The if/else clause is sloppy-legal (Annex B.3.4) but a strict-mode early error, so it rides the strict-gated checkStrictEarlyErrors walk. Labelled-function declarations (label: function f(){}) are NOT covered because the parser has no LABELLED node type. Slice delta (test/language/statements/**): ~13 PASS, 0 regressions (if 8, plus one each in for/while/do-while/for-in/for-of). Pinned by SpecPinTest.functionDeclAsLoopBody_* / functionDeclAsIfBody_*. Invariant recorded in JS_ENGINE.md § Strict Mode Policy.
Lexical/class-declaration-in-statement-position early error — DONE. JsParser now also rejects a LexicalDeclaration (let/const) or a ClassDeclaration used as the sole body of an if/else/loop clause (checkNoLexicalOrClassDeclarationBody, beside the function-decl helper, called from the mode-independent validateEarlyErrors). Unlike FunctionDeclaration these have no Annex B carve-out, so they are an early error in BOTH modes for every clause including if/else (§13.6/§14.x). var hoists and stays legal; a braced body is a BLOCK. The let-vs-var distinction lives on the VAR_STMT keyword token, not VAR_DECL (isLexicalVarStmt). One sloppy-mode subtlety handled: let is not reserved, so if (x) let\n y is let-the-identifier + ASI (the only forbidden let-form at ExpressionStatement start is let [); a LineTerminator after a let keyword (lineTerminatorFollows, scanning across WS/comments to the next primary token) means it is NOT a lexical declaration here — const is reserved and has no such escape. Slice delta (test/language/statements/**): +42 PASS, 0 regressions (MissingParseError 183 → 141), dominated by the let/syntax + const/syntax statement-position families. Pinned by SpecPinTest.lexicalOrClassDeclAsClauseBody_isAlwaysParseError / letAsIdentifierWithLineTerminator_isNotALexicalDeclaration.
Per-scope lexical-redeclaration early error — DONE. JsParser now enforces, per lexical scope (Script, function body, plain Block, switch CaseBlock), that LexicallyDeclaredNames has no duplicates and does not intersect VarDeclaredNames (§14.2.1 / §14.12.1 / §16.1.1). It rides the strict-aware checkStrictEarlyErrors walk (checkScopeRedeclaration + checkSwitchCaseBlockRedeclaration + collectVarNames + declarationName + directStatements) because the only Annex B.3.3 relaxation is strict-gated: a duplicate bound solely by FunctionDeclarations is sloppy-legal but a strict early error (e.g. { function f(){} function f(){} }). The lexical∩var clash has no carve-out (always an error). FunctionDeclarations are lexical in a Block/CaseBlock but var-scoped at Script / function-body top level (the topLevel flag, derived for a BLOCK from whether its parent is a function via Node.getParent). A hot-path guard returns immediately when a scope has no lexical declarations (keeps the benchmark flat). Error message aligned to the existing runtime wording (identifier 'X' has already been declared, CoreContext) so it reads the same whichever layer catches it; the two JUnit tests that asserted the old runtime message (EvalTest.testConstRedeclare, EngineTest.testConstRedeclareAcrossEvals) — correct in spirit, these ARE phase: parse early errors — still pass unchanged (REPL cross-eval redeclaration stays legal since each eval parses independently). Also subsumes the previously-deferred let {a,a}=… duplicate-pattern rule (per-VAR_DECL BoundNames). Slice delta (test/language/**): switch redeclaration cluster 24 → 0, +134 PASS overall, 0 regressions. Pinned by SpecPinTest.duplicateLexicalNamesInScope_* / lexicalNameClashingWithVar_* / duplicateFunctionDeclarations_areSloppyLegalStrictError. 
⚠️ Three positive tests still FAIL with the same has already been declared message but from the runtime CoreContext check, not the parser — pre-existing env-scoping gaps unrelated to this change (see Deferred TODOs → spec alignment): indirect (0,eval)(...) must get a distinct declarative environment, and a switch CaseBlock must get its own block environment at runtime (scope-lex-close-case.js).
C-style for per-iteration let/const environment — DONE. Interpreter.evalForStmt now models §14.7.4.3 ForBodyEvaluation properly: the test + body run in one iteration scope (so a body closure captures that iteration's distinct binding), then a fresh scope is seeded from the body's end-of-iteration values and the increment runs in it. Fixes the infinite-loop hang on an in-body update with no increment clause (for (let x = 0; x < 10;) { x++; } — previously the per-iteration scope discarded x++ and the snapshot reset to 0). The old code wrote the body's values back to the LOOP_INIT slot via update(), which corrupted closures created in the initializer (for (let i = 0, f = () => i; …) must keep returning 0); the rewrite threads values through an explicit carry list, never resolving back to the captured LOOP_INIT slots. Also fixed: loopVarNames collected initializer identifiers (for (let i = digits.length - 1; …) wrongly captured digits) — now only binding targets. Slice delta (test/language/statements/**): +7 PASS, 0 regressions (4 continue/ timeout hangs + 3 let/for closure-scope tests). Pinned by SpecPinTest.forLet_*.
String iterator splits surrogate pairs — DONE. IterUtils.stringIterator now walks code-points (codePointAt / Character.charCount) per spec §22.1.5.1, so for-of over a string with astral chars / emoji yields one element per code point. test262 for-of/string-astral.js now passes.
Array.prototype.values() returns raw List — DONE. Now returns a spec Array Iterator object via IterUtils.toIteratorObject(IterUtils.listIterator(...)), so arr.values().next() works. listIterator exposed package-private. test262 Array/prototype/values/{iteration,returns-iterator,returns-iterator-from-object}.js now pass; JsArrayTest.testArrayApi updated to spec iterator semantics. Note: keys() / entries() still return raw List (same class of bug, lower-value — arr.keys().next() is rare); apply the same fix when a workload surfaces it.
Object-literal spread of arbitrary expressions — DONE. {...fn()} / {...obj.method()} / {...{x:1}} now parse: object_elem() parses expr(-1, true) after ... (mirrors array_elem). evalLitObject evaluates the operand and merges own-enumerable props via spreadInto (Map / JsArray index keys / String code-unit keys / null+undefined no-op), which also fixed the latent {...array} / {...string} cases. EvalTest.testObjectLiteralSpread covers it.
.length / .name rollout to remaining prototypes — JsBuiltinMethod infra in place; most residual name.js fails are Symbol-gated.
RegExp named-group capture access (result.groups) — DONE. JsRegex.exec / JsStringPrototype.match / matchAll now attach a spec-shaped groups object (null prototype; name→value, undefined for non-participating groups, undefined when the pattern has no named groups); function replacers receive the trailing groups arg per spec §22.1.3.18. Names derived once at construction via JsRegex.groupNames (scans the source for (?<name>, skipping (?<=/(?<! lookbehind and escaped/char-class contexts). feature: regexp-named-groups skip removed. Slice delta (run-2026-05-30-003211 vs …-001414): +12 PASS, 0 regressions, covered in JsRegexTest.testNamedGroups*. Residual named-groups/** tail (still failing, separate concerns): group-name early-error validation ((?<__proto__>…) / (?<_>…) should SyntaxError — engine accepts; part of the parser-tightening sweep), the Symbol.replace/match protocol (Symbol-gated), and one global functional-replace argument-ordering case (named-groups/functional-replace-global.js — «badc» vs «bacd»; worth a focused look, likely pre-dates this change).
Destructuring BoundNames early-error walk — DONE. JsParser now has a spec-shaped BoundNames walk over binding patterns (collectBoundNames + collectObjectElemBoundNames + collectBindingBoundNames) that mirrors the binding structure: {a: x = y} binds only {x}, so keys / defaults / renamed targets do not false-positive (verified: 0 regressions across test/language/**). Wired into checkFormalParameters (arrow params and non-simple parameter lists: duplicate BoundNames always SyntaxError; plain duplicate simple params in a sloppy non-arrow fn stay legal) and a new checkCatchParameter (duplicate catch BoundNames always SyntaxError; eval/arguments bound in any pattern under strict). try/early-catch-duplicates.js un-skipped → PASS. Slice delta: +17 PASS, 0 regressions (13 arrow, 2 function, 1 object-method, 1 catch). Pinned by SpecPinTest.dupParams_* / boundNames_mirrorStructure_noFalsePositive / dupCatchParam_* / evalArguments_boundInsidePattern_strictOnly. Residual (deferred): object-method simple-param dup in sloppy code (rare). The lexical duplicate-BoundNames rule for let/const patterns (let {a,a}=…) was deferred here as a distinct rule (VAR_DECL doesn't carry the let-vs-var distinction) and has since landed with the per-scope lexical-redeclaration walk (per-VAR_DECL BoundNames).
Cleanup residuals — occasional "null" NPE paths, IllegalName JDK lambda leak, Java heap space OOM in array-slice paths. Built-in probe (2026-05-30) added concrete principle-#2 leaks to chase: String.prototype.replaceAll/endsWith throwing Java Range […) out of bounds instead of behaving/throwing-as-JS; RegExp exec/test surfacing Cannot invoke Object.toString() because args[N] is null on null args + one catastrophic-backtracking Timeout (RegExp/.../S15.10.2.8_A3_T17.js); Array at near-2³³ lengths leaking Index out of bounds / Requested array size exceeds VM limit / heap (unshift/splice/reverse). All confined to edge/pathological inputs.
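Several of the sweeps above pin behavior that is easy to state as a few lines of script. The following is a sketch of the standard ECMAScript results those sweeps target, written as an illustration; it is not taken from test262 or from SpecPinTest.

```javascript
// Per-iteration binding: each closure captures its own `i`.
const fns = [];
for (let i = 0; i < 3; i++) {
  fns.push(() => i);
}
const captured = fns.map((f) => f());

// In-body update with no increment clause carries over and terminates.
let steps = 0;
for (let x = 0; x < 10; ) {
  x++;
  steps++;
}

// A closure created in the initializer keeps seeing the initial binding.
let fromInit;
for (let i = 0, f = () => i; i < 2; i++) {
  fromInit = f();
}

// String iteration yields one element per code point, so an astral
// character (here U+1F600) is a single element, not two surrogates.
const chars = [..."a\u{1F600}b"];

// Array.prototype.values() returns an iterator object with next().
const first = ["p", "q"].values().next();

// Object-literal spread of arbitrary expressions: call results, arrays
// (index keys), strings (code-unit keys), and null as a no-op.
const make = () => ({ x: 1 });
const merged = { ...make(), ...["z"], ..."hi", ...null };

console.log(captured, steps, fromInit, chars.length, first, merged);
```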

Deferred TODOs

Tracked but un-scheduled. Each item: a one-line what + why parked + file pointer. For how the subsystem is shaped, read the file. For spec invariants worth honoring, see JS_ENGINE.md § Spec Invariants.

Engine — feature gaps

Strict mode — runtime semantics + simple-binding parser early-errors DONE. "use strict" activates the spec runtime flips: this→undefined in plain calls, ReferenceError on implicit-global assign, and TypeError on write-to-frozen / read-only / getter-only / non-extensible and delete of non-configurable. Strictness is lexical, cached on JsFunctionNode.strict, threaded via CoreContext.strict. See JS_ENGINE.md § Strict Mode Policy for the flip table; pinned by SpecPinTest.strict_*. The parser now also tracks lexical strictness (JsParser.checkStrictEarlyErrors, a strict-gated post-parse walk: program prologue → function-body prologue → always-strict class bodies) and raises parse-phase SyntaxError for legacy/non-octal-decimal literals (0755/08), eval/arguments as assign/update target or as a function-name / param / var-binding name, and duplicate simple params. Pinned by SpecPinTest.strict_octalLiteral* / *EvalOrArguments* / *duplicateParameters* / *classBodyIsAlwaysStrict* / *parenthesizedDirective*. The runner prepends a strict directive for flags: [onlyStrict] (Test262Runner.evaluate), the BoundNames walk over binding patterns landed (collectBoundNames), and the flag: onlyStrict skip is removed; measured 282 PASS / 146 FAIL, 0 regressions (lang pass 5433 → 5715). Remaining, as measured at the un-skip (the 146 residual, now visible in probes): (1) ~99 negative parse-phase early errors: function-declaration in statement position (if (x) function f(){}; since landed, see Background sweeps) and block-scope function-decl rules, with the current residual tracked under Active priorities; (2) ~16 runtime SyntaxError; (3) ~14 strict-assignment runtime TypeError (arguments-object write guards). with-statement early error deferred (path-skipped, lexes as a call). One known un-enforced parser corner: a "use strict" prologue inside a non-simple-parameter-list function is itself an early error. The lexical duplicate-BoundNames rule for let/const patterns (let {a,a}=…) is now covered by the per-scope redeclaration walk.
Promises / async / await / setTimeout. Skipped (feature: Promise, async-functions, Symbol.asyncIterator). karate-js is synchronous. Viable path: sync subset first — Promise as eager thenable, async function runs sync, await sync-unwraps.
Class syntax (ES6) — Phases 1+2+3 DONE; only the conformance tail remains. class declarations + expressions parse and evaluate: constructor, instance methods, static methods, get/set accessors, computed method names, default-constructor synthesis, always-strict bodies, constructor-without-new TypeError, extends + super(...) + super.method(), and public instance + static fields (x = 1 / static n = …, computed names, ASI, enumerable own props, derived-class fields init after super(); see JsFunctionNode.instanceFields + Interpreter.runInstanceFieldInitializers). Desugared at eval time onto the existing constructor-function + prototype machinery (Interpreter.evalClassExpr → constructor JsFunctionNode whose .prototype holds the methods; statics on the constructor; methods/accessors non-enumerable). extends links both chains: Child.__proto__ = Parent (static inheritance + the super(...) target) and Child.prototype.__proto__ = Parent.prototype (instance inheritance). super dispatch uses a JsFunctionNode.homeObject ([[HomeObject]]) + a CoreContext.activeFunction seam set per non-arrow call (arrows inherit their defining method's): a super.m() reads off homeObject.getPrototype() with this=current receiver; super(...) runs the parent constructor against the instance under construction (Interpreter.runSuperConstructor; no invokeCallable refactor needed, since the derived instance is created normally and super() only initializes it). extends Error/built-ins works via a pragmatic copy-own-props shim. Public fields ride defineOwn/putMember (enumerable, unlike the non-enumerable methods); computed field names are resolved once at class-definition time, the value per instance. New tokens CLASS/EXTENDS/SUPER + NodeTypes CLASS_EXPR/CLASS_METHOD/CLASS_FIELD/SUPER_EXPR (CLASS_METHOD also carries fields; eval distinguishes by the trailing FN_EXPR). Covered by JsClassTest (44 cases).
Remaining conformance tail (deferred): private #x fields/methods, generator/async methods, decorators, static-init blocks, class early-errors, object-literal-method super (needs object [[HomeObject]]), two super edge cases (this-TDZ before super(), super() return-override), numeric/string-literal method-name canonicalization (get 0x10(){} → key "16"; shared with object literals' NUMBER-key path), escaped-keyword method names. Most have existing feature:-tag skips (class-fields-private / class-methods-private / generators / async-functions / decorators); see the Skip list note for the path-skip un-skip plan.
Symbol primitive. Gates a long tail across String / Array / RegExp / Object. Deprioritized — real-world code doesn't use it.
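
The super dispatch described in the Class syntax entry above (look up from the home object's prototype, keep this bound to the current receiver) can be illustrated with a toy model. This is a Python sketch of the idea only, not engine code: Obj, Method, define, call and super_call are hypothetical names, with home_object standing in for JsFunctionNode.homeObject and the active argument standing in for CoreContext.activeFunction.

```python
class Obj:
    """A minimal object: own properties plus a prototype link."""

    def __init__(self, proto=None, **props):
        self.proto = proto
        self.props = dict(props)

    def lookup(self, name):
        # Walk the prototype chain until the property is found.
        obj = self
        while obj is not None:
            if name in obj.props:
                return obj.props[name]
            obj = obj.proto
        raise AttributeError(name)


class Method:
    """A function plus the object it was defined on (its home object)."""

    def __init__(self, fn, home_object):
        self.fn = fn
        self.home_object = home_object


def define(home, name, fn):
    home.props[name] = Method(fn, home)


def call(receiver, name):
    method = receiver.lookup(name)
    return method.fn(method, receiver)


def super_call(active, receiver, name):
    # Start one step above the active method's home object, but keep
    # `this` bound to the original receiver.
    method = active.home_object.proto.lookup(name)
    return method.fn(method, receiver)


# class A { who() { return "A:" + this.tag } }
# class B extends A { who() { return "B>" + super.who() } }
# class C extends B { who() { return "C>" + super.who() } }
a_proto = Obj()
b_proto = Obj(proto=a_proto)
c_proto = Obj(proto=b_proto)
define(a_proto, "who", lambda me, this: "A:" + this.props["tag"])
define(b_proto, "who", lambda me, this: "B>" + super_call(me, this, "who"))
define(c_proto, "who", lambda me, this: "C>" + super_call(me, this, "who"))

instance = Obj(proto=c_proto, tag="x")
result = call(instance, "who")
```

The three-level chain is the reason a home object is needed: resolving super from the receiver's own prototype would make B's super.who() find B's method again and recurse forever.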

Engine — cleanup

Benchmark-gated or coordinated with other work.

Fuse the early-error parse walks — DONE (2026-05-30). JsParser.parse ran three full post-parse traversals (validateEarlyErrors, validateCoverInitializedNames, checkStrictEarlyErrors); a JFR profile showed the three walks at ~13% of CPU on both EngineBenchmark and RealisticBenchmark — as costly as the entire interpreter. They are now a single descent: earlyErrors(node, strict, inPattern) threads the two pieces of top-down state (strict, inPattern) and calls per-node helpers (earlyErrorNodeChecks + the inlined CoverInitializedName/rest-element checks + strictNodeChecks, which returns the propagated childStrict). Per-node check order mirrors the former pass order so multi-error messages are unchanged. Behavior-preserving: test/language/** FAIL set byte-for-byte identical before/after (5849 PASS / 2397 FAIL, 0 regressions), all 1167 unit tests + 2235 karate-core tests green. Perf: EngineBenchmark array 1.41→1.33 ms / object 0.62→0.57 ms (+6.7% iters); RealisticBenchmark 68.6→62.3 µs/feature. This is now the single seam for new parse-phase early errors — the dominant MissingParseError backlog (escaped-keyword, undefined labels, new.target/super placement, getter/setter arity, regex group-name, the non-simple-param "use strict" corner) should each be added as another per-node helper in earlyErrors, never another whole-tree walk. Reference table updated in JS_ENGINE.md § Performance Benchmarks.
Prototype.toMap() rebuilds per call — memoize on slot-map mod stamp or expose a non-materializing iterator. Defer until benchmark shows it matters.
HOLE → tombstone full elimination. Sparse-array storage rework; pair with parser in support. Pinned in SpecPinTest. ~6–8 h.
HOLE leak audit at JsArray Java-interop seams — iterator() / toArray() / subList() / contains() / indexOf() / lastIndexOf() route raw; only get(int) translates HOLE→null. Centralize on one unwrap helper. ~30 min. Pairs with above.
PropertyKey abstraction. Symbol prep — YAGNI until Symbol lands.
Arguments → spec exotic Arguments object. Cached JsArray today; missing arguments.callee (strict TypeError), non-strict alias-to-formal-parameters, and [object Arguments] toStringTag. Subclass when a workload demands.
CreateDataPropertyOrThrow + ArraySpeciesCreate. Array result-allocation (slice / concat / splice / map / filter / flat / flatMap) bypasses spec sequence; depends on Symbol.species. Defer until Symbol.

Engine — spec alignment

Observably non-spec; pick up when the owning slice surfaces them.

Runtime block/eval environments for lexical bindings. Surfaced by the per-scope redeclaration sweep: three positive tests FAIL because the runtime CoreContext redeclaration check (identifier 'X' has already been declared, not the parser) fires where the spec wants a fresh declarative environment. (1) A switch CaseBlock needs its own block environment so let x inside a case does not collide with an outer let x (statements/switch/scope-lex-close-case.js). (2) Indirect (0,eval)('const x…') must evaluate in a NewDeclarativeEnvironment off the global env, so the eval'd lexical binding does not collide with an outer global const x (eval-code/indirect/lex-env-distinct-{const,let}.js). Both are runtime scoping gaps, independent of the parse-phase early-error work.
JsArray.handleLengthAssign strict TypeError on non-writable length. Strict-mode plumbing has landed (CoreContext.strict), but the length write still routes through handleLengthAssign(value, ctx) with no strict arg — PropertyAccess.setByName special-cases "length" before the strict-aware putMember, so 'use strict'; arr.length = 0 on a non-writable length is still a silent no-op. Thread strict into handleLengthAssign to finish the flip; everyday code doesn't hit it.
ToObject for non-empty string descriptor sources — short-circuits to TypeError (correct end-state), skips wrapper pipeline.
JsArray.jsEntries vs [[OwnPropertyKeys]] asymmetry — jsEntries is indices only; for-in / Object.keys / defineProperties want indices + named. JsObjectConstructor.ownKeys works around it. Split into arrayEntries(ctx) / ownEntries(ctx) when a 4th caller surfaces.
ToPropertyKey no-ctx callers — JsObjectConstructor.hasOwn and getOwnPropertyDescriptor still on the no-ctx path. Migrate when a workload passes non-string keys.
Integer-index accessors beyond JsArray.list.size() — high-index accessor via defineProperty is missed by jsEntries. Current workaround: defineOwnAccessor HOLE-pads. Real fix: merge integer-index namedProps into Phase 1. Pairs with HOLE elimination.
JsGlobalThis two-store reads — data in BindingsStore, accessors in JsObject.props. Extend BindingSlot with accessor side-table OR commit to a unified two-store contract. ~2 h.
Symbol.toPrimitive not dispatched — matches minimal Symbol surface; fix with Symbol.
(0, fn)() indirect-call this-binding — comma should drop reference base (→ this = undefined); today falls through to globalThis. Audit evalCallExpr for the parenthesized-comma case.

Harness quality

FIXED: etc/run.sh now compiles the runner. exec:java does not trigger compilation, so for one full work cycle the runner ran against stale target/classes and edits to Test262Runner (the onlyStrict strict-prepend) silently never took effect: the prepend measured as a no-op (71/357) until a test-compile step was added before exec:java. Lesson for harness edits: changes under karate-js-test262/src need the module recompiled; only karate-js engine changes are picked up by the install step alone.
Replace hand-rolled YAML parser with SnakeYAML (Expectations.java / Test262Metadata.java — breaks on # in quoted reasons, block scalars).
--resume echoes records for deleted / now-SKIP'd tests — gate or rename to --resume-crash-only.
Cache parsed harness ASTs in HarnessLoader (~50k re-parses per run).
Plumb per-test console capture into ResultRecord (currently wired and discarded by evaluate(...)).
phase: resolution (module-resolution) negatives conflated with runtime — latent (modules skipped).
$262 surface stubs (AbstractModuleSource, IsHTMLDDA, agent.*) — add when a feature unblocks.
Parallel execution: prior attempts showed no speedup, and the engine doesn't poll the thread's interrupt flag (Thread.interrupted()), so a stuck worker can't be cancelled. Revisit when per-test cost grows.
Test262Runner.readHeadSha walks parent chain — prefer git rev-parse HEAD or --karate-sha.
Commit target/test262/results.jsonl once engine churn slows.

Running

All commands run from karate-js-test262/ (the runner resolves etc/expectations.yaml and test262/ relative to cwd). Use -f ../pom.xml so Maven finds the parent reactor. After any change under karate-js/, re-install it first — the runner uses the karate-js jar from your local Maven repo, not from the reactor.

Quick start

etc/run.sh does install + run (+ HTML on --full):

cd karate-js-test262
etc/fetch-test262.sh                                       # first time only — shallow clone
etc/run.sh                                                 # dev mode, full suite
etc/run.sh --only 'test/language/**' --max-duration 300000 # scoped, 5-min cap
etc/run.sh --full                                          # PASS rows + HTML

Each run writes a fresh target/test262/run-<timestamp>/ (the runner prints the path) containing results.jsonl, results.jsonl.partial, run-meta.json, progress.log; html/ only with --full. Old runs are immutable; mvn clean wipes them.

Dev mode (default) keeps results.jsonl to FAIL+SKIP only; the pass count is in run-meta.json (counts.pass). --full adds PASS rows (for CI artifacts / audits / HTML).

Liveness sampling (never tail -f — see Context discipline):

tail -n 1 <run-dir>/progress.log    # last heartbeat: processed N pass M fail K skip L
tail -n 5 <run-dir>/progress.log

Driving by hand

If you need to invoke the runner or HTML report without etc/run.sh, read etc/run.sh — it documents the install step and the -am gotcha (exec:java is a direct goal; with -am the reactor includes karate-parent, which has no mainClass, and aborts before this module). Install karate-js separately, then run without -am.

Flags

Most-used flags below. Full set + defaults: read main(...) in Test262Runner.

Flag	Purpose
`--only <glob>`	restrict to matching paths
`--single <path> [-v] [-vv]`	run one test, no file writes. `-v` prints metadata + classification + engine location; `-vv` adds full source
`--full`	write PASS rows (default is FAIL+SKIP only); also gates HTML render in `etc/run.sh`
`--max-duration <ms>`	overall wall-clock cap (default unlimited); writes partial results + prints `Aborted:` on hit
`--timeout-ms <n>`	per-test watchdog (default 10s)
`--run-dir <path>`	output dir (default `target/test262/run-<ts>/`)

Runs are silent except FAIL lines + periodic [progress]. FAIL lines on stdout are capped at 20 (footer (… N more FAILs, see results.jsonl) fires after). [progress] lines emit every 5000 tests or 60 s and are mirrored to <run-dir>/progress.log. Per-FAIL detail lives only in JSONL — sample progress.log for liveness, never tail -f results.jsonl.partial.
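
The output policy above (capped FAIL lines, a progress line every N tests or T seconds, whichever comes first) can be sketched as follows. This is an illustration in Python, not the runner's Java code: the Reporter class, its field names, the injected clock, and emitting the footer once at the end are all assumptions.

```python
class Reporter:
    """Collects the lines a run would print, applying the cap and cadence."""

    def __init__(self, clock, fail_cap=20, every_tests=5000, every_seconds=60.0):
        self.clock = clock                # injected so tests need no real waiting
        self.fail_cap = fail_cap
        self.every_tests = every_tests
        self.every_seconds = every_seconds
        self.lines = []                   # stands in for stdout
        self.processed = 0
        self.fails = 0
        self._last_tests = 0
        self._last_time = clock()

    def record(self, path, status):
        self.processed += 1
        if status == "FAIL":
            self.fails += 1
            if self.fails <= self.fail_cap:
                self.lines.append(f"FAIL {path}")
        due_tests = self.processed - self._last_tests >= self.every_tests
        due_time = self.clock() - self._last_time >= self.every_seconds
        if due_tests or due_time:
            self.lines.append(f"[progress] processed {self.processed}")
            self._last_tests = self.processed
            self._last_time = self.clock()

    def finish(self):
        hidden = self.fails - self.fail_cap
        if hidden > 0:
            self.lines.append(f"(… {hidden} more FAILs, see results.jsonl)")
```

Whatever the exact implementation, the consequence for tooling is the same: stdout is bounded regardless of how many tests fail, so per-FAIL detail has to come from the JSONL.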

Hang handling

The runner uses a single-thread ExecutorService to enforce --timeout-ms per test. The karate-js engine doesn't poll the thread's interrupt flag (Thread.interrupted() / isInterrupted()), so Future.cancel(true) can't stop the underlying thread. When a timeout fires, the runner retires the executor (shuts it down, creates a fresh one) so subsequent tests don't queue behind the stuck thread. Net cost of a genuine hang: one abandoned daemon thread, one Timeout row in results.jsonl, a few ms of recreate overhead.
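
The retire-on-timeout pattern can be sketched with Python's concurrent.futures as an analogue of the Java ExecutorService. The Watchdog class and its field names are hypothetical, and the "hang" is a bounded sleep so the example can exit; the real runner abandons a daemon thread instead.

```python
import concurrent.futures as cf
import time


class Watchdog:
    """Runs one task at a time with a timeout; retires the executor on a hang."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.retired = 0  # how many executors were abandoned
        self._pool = cf.ThreadPoolExecutor(max_workers=1)

    def run(self, task):
        future = self._pool.submit(task)
        try:
            return ("OK", future.result(timeout=self.timeout_s))
        except cf.TimeoutError:
            # The worker cannot be stopped, so abandon its executor and start
            # a fresh one; otherwise the next task would queue behind it.
            self._pool.shutdown(wait=False)
            self._pool = cf.ThreadPoolExecutor(max_workers=1)
            self.retired += 1
            return ("Timeout", None)


def stuck():
    time.sleep(0.5)  # stands in for a hang; bounded so the demo can exit
    return "late"


dog = Watchdog(timeout_s=0.1)
results = [dog.run(lambda: 1 + 1), dog.run(stuck), dog.run(lambda: "next")]
```

The third task completes on the fresh executor while the stuck worker is still running, which is the property the runner relies on.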

For scripts / agents driving the runner: pass --max-duration <ms> as a safety net and follow Context discipline.

Skip list

There is only one concept: SKIP. A test matching any rule in etc/expectations.yaml is not run and appears as {"status":"SKIP",...} in results. Everything else is attempted; failures are failures.

Match order: paths → flags → features → includes. First match wins. Every entry requires a reason.

Precedence example. A test at test/language/statements/class/foo.js with flags: [module] and features: [Symbol] is skipped with the module reason, assuming no paths: rule matches it first (the flags match fires before features is consulted). If you want the features: [Symbol] reason to win, there must be no matching flags: rule.
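
A minimal sketch of first-match-wins evaluation in the documented order. The rule shape (match / reason keys), the sample reasons, and the use of fnmatchcase for path globs are assumptions for illustration; the real schema is whatever etc/expectations.yaml and Expectations.java define.

```python
from fnmatch import fnmatchcase

# Hypothetical rule shape; the real schema lives in etc/expectations.yaml.
RULES = {
    "paths": [{"match": "test/intl402/**", "reason": "Intl not supported"}],
    "flags": [{"match": "module", "reason": "modules not supported"}],
    "features": [{"match": "Symbol", "reason": "Symbol not supported"}],
    "includes": [],
}
ORDER = ("paths", "flags", "features", "includes")


def skip_reason(test, rules=RULES):
    """Return the reason of the first matching rule, or None to run the test."""
    for section in ORDER:
        for rule in rules[section]:
            if section == "paths":
                hit = fnmatchcase(test["path"], rule["match"])
            else:
                hit = rule["match"] in test.get(section, [])
            if hit:
                return rule["reason"]
    return None


example = {
    "path": "test/language/statements/class/foo.js",
    "flags": ["module"],
    "features": ["Symbol"],
}
reason = skip_reason(example)
```

With these sample rules the example test is skipped for the module reason; dropping the flags entry from the test lets the Symbol rule fire instead.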

Starter set covers Symbol, BigInt, generators, class syntax, Proxy, Reflect, Promises, async/await, Temporal, TypedArray beyond Uint8Array, WeakRef, ArrayBuffer, and the suite directories test/intl402/, test/staging/, test/annexB/. To add a skip: edit the YAML under the right section with a reason. To remove a skip: delete the entry, re-run the relevant --only glob, debug failures with --single -v.

Adding a new unimplemented feature. If you hit FAILs for an ES surface the engine genuinely doesn't implement (e.g. JSON.rawJSON / isRawJSON from the JSON.parse source text access proposal), add a features: rule with the test262 feature name, not a paths: rule. The feature names match what the tests declare in their YAML frontmatter (features: [json-parse-with-source]), which is also what --single -v prints under features:. See existing entries for the exact shape; precedence rules above still apply.

Results schema

Two JSONL files during a run:

<run-dir>/results.jsonl.partial — appended per test as results arrive, flushed per write. Run order, not sorted. Deleted on clean exit; preserved on abort (--max-duration hit, Ctrl-C, JVM kill).
<run-dir>/results.jsonl — canonical output, sorted alphabetically by path, atomically written at end-of-run (tmp + rename). This is what tooling reads.
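
The end-of-run step (sort by path, write to a temporary file, rename into place, drop the live feed) can be sketched like this. It is a Python illustration under assumptions, not the runner's Java code: write_results and the .tmp suffix are hypothetical, and the sort here is by code point.

```python
import json
import os
import tempfile


def write_results(run_dir, records):
    """Write records sorted by path to results.jsonl via tmp file + rename."""
    final = os.path.join(run_dir, "results.jsonl")
    tmp = final + ".tmp"
    with open(tmp, "w", encoding="utf-8") as out:
        for rec in sorted(records, key=lambda r: r["path"]):
            out.write(json.dumps(rec) + "\n")
    os.replace(tmp, final)  # readers see the old file or the new one, never half
    partial = final + ".partial"
    if os.path.exists(partial):
        os.remove(partial)  # clean exit: the live feed is no longer needed
    return final


with tempfile.TemporaryDirectory() as run_dir:
    open(os.path.join(run_dir, "results.jsonl.partial"), "w").close()
    path = write_results(run_dir, [
        {"path": "test/b.js", "status": "FAIL"},
        {"path": "test/a.js", "status": "SKIP"},
    ])
    with open(path, encoding="utf-8") as fh:
        written = [json.loads(line)["path"] for line in fh]
    leftovers = sorted(os.listdir(run_dir))
```

Because the rename is the last step, a process that dies mid-write leaves only the .partial feed behind, which matches the abort behaviour described above.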

Dev mode (default): only FAIL and SKIP rows are written. The pass count comes from run-meta.json (counts.pass). The Failure triage and Diff two run-dirs recipes are designed to work without PASS rows.
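
Since dev-mode JSONL has no PASS rows, totals have to combine both files. A small sketch, assuming only what this section states about the schema (counts.pass in run-meta.json, a status field per JSONL row); summarize is a hypothetical helper, and the sample numbers are the test/language/** figures quoted earlier in this file.

```python
from collections import Counter


def summarize(run_meta, result_rows):
    """Combine counts.pass from run-meta.json with FAIL/SKIP rows from results.jsonl."""
    counts = Counter(row["status"] for row in result_rows)
    passed = run_meta["counts"]["pass"]  # authoritative in both dev and --full mode
    attempted = passed + counts["FAIL"]
    return {
        "pass": passed,
        "fail": counts["FAIL"],
        "skip": counts["SKIP"],
        "pass_rate": passed / attempted if attempted else 0.0,
    }


meta = {"counts": {"pass": 5849}}
rows = [{"status": "FAIL"}] * 2397 + [{"status": "SKIP"}] * 3
summary = summarize(meta, rows)
```

SKIP rows are excluded from the pass rate here because skipped tests are never attempted.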

--full mode: PASS rows are also written, one per attempted test that didn't fail or get skipped. Use when you need the canonical full record (CI artifact, deep audit) or want the HTML report (which etc/run.sh gates on --full).

Example line shape (same in both):

{"path":"test/language/expressions/addition/S11.6.1_A1.js","status":"PASS"}
{"path":"test/.../something.js","status":"FAIL","error_type":"TypeError","message":"foo is not a function"}
{"path":"test/.../bigint-test.js","status":"SKIP","reason":"BigInt not supported"}

(The PASS row only appears in --full mode.)

Error types are classified into: SyntaxError | TypeError | ReferenceError | RangeError | Error | Timeout | Harness | Unknown by inspecting message prefixes (the engine emits "TypeError: ..." style messages at most failure sites). The classifier itself is in ErrorUtils. Two buckets are assigned by the runner (not the classifier) for negative tests: ExpectedThrow (a non-parse negative test completed normally) and MissingParseError (a phase: parse negative test parsed instead of being rejected — the engine is missing that early error; the code then ran and usually tripped the harness $DONOTEVALUATE() marker). Keeping MissingParseError distinct stops the unimplemented-early-error backlog from hiding inside Unknown alongside genuine engine crashes.
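
Prefix-based bucketing of the JS error names can be sketched as below. This is not the ErrorUtils implementation: classify and KNOWN are hypothetical, and the sketch deliberately leaves out Timeout, Harness, ExpectedThrow and MissingParseError, which it assumes are assigned elsewhere.

```python
KNOWN = ("SyntaxError", "TypeError", "ReferenceError", "RangeError", "Error")


def classify(message):
    """Map a failure message to an error_type bucket by its prefix."""
    head = message.split(":", 1)[0].strip()
    if head in KNOWN:
        return head
    return "Unknown"


bucket = classify("TypeError: foo is not a function")
```

Comparing the whole prefix (rather than using startswith) keeps "Error: ..." from swallowing longer names that merely end in "Error".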

Recipes

Debug one failing test

# Default: -v gives metadata + classification + location, usually enough
# to find the engine call site without dumping test source into context.
mvn -f ../pom.xml -pl karate-js-test262 -o exec:java \
    -Dexec.args="--single <path> -v" 2>&1 | tail -n 40

# Escalate to -vv (full source) only if -v didn't pinpoint the cause:
mvn -f ../pom.xml -pl karate-js-test262 -o exec:java \
    -Dexec.args="--single <path> -vv" 2>&1 | tail -n 200

-v prints parsed YAML metadata (description / flags / features / includes / negative), the classification, and — if the engine attached a position — a location: <path>:<line>:<col> line. -vv additionally prints the full test source. --single does no file writes. No HTML drill-down page is generated — the details.html report shows path + error_type + message inline.

Location-line caveat. location: only appears when the engine itself threw and attached a position. Two common FAIL shapes carry no location; for these, skip straight to reading the relevant built-in source:

Test262Error: <expectation> — the harness assertion fired inside the test's own JS, not the engine. Find the failure inside the test source (or look at what the test is asserting) and trace back to the engine method that built the wrong value.
Unknown: java.lang.StackOverflowError / NullPointerException / other Java exceptions — uncaught Java throwables surface without a JS-level position. Grep the stack for the engine class.

Failure triage

Compact rollups over results.jsonl. All return tens of lines, not thousands. Use these instead of reading the raw JSONL when scoping a slice or hunting for clusters.

RD=target/test262/run-<ts>            # the run-dir to analyze
JSONL=$RD/results.jsonl               # use .partial during an in-progress run

# PASS / FAIL / SKIP counts.
jq -r .status "$JSONL" | sort | uniq -c

# FAIL histogram by error_type — which classifier buckets dominate.
jq -r 'select(.status=="FAIL").error_type' "$JSONL" \
  | sort | uniq -c | sort -rn

# Top 20 FAIL message clusters (numbers normalized so near-duplicates merge).
jq -r 'select(.status=="FAIL").message' "$JSONL" \
  | sed 's/[0-9][0-9]*/N/g' \
  | sort | uniq -c | sort -rn | head -20

# FAIL counts per slice (two path components deep).
jq -r 'select(.status=="FAIL") | .path | split("/")[1:3] | join("/")' "$JSONL" \
  | sort | uniq -c | sort -rn | head -30

# One example failing path per error_type — for `--single -v` follow-up.
jq -r 'select(.status=="FAIL") | "\(.error_type)\t\(.path)"' "$JSONL" \
  | sort -u -k1,1 | head -20

# All FAILs under a specific slice — bounded with head, never raw.
jq -r 'select(.status=="FAIL" and (.path|startswith("test/language/statements/for-of"))) | .path' \
  "$JSONL" | head -30

Diff two run-dirs (regression check)

FAIL-set difference — works in dev mode (no PASS rows needed). Capped output: counts + first 10 of each list + per-slice cluster breakdown. Assumes both runs covered the same --only scope (recorded in each run-meta.json if you want to verify).

PREV=target/test262/run-<prev>/results.jsonl
CURR=target/test262/run-<curr>/results.jsonl

python3 - "$PREV" "$CURR" <<'PY'
import json, sys, collections
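def fails(p):
    # Parse each line once; skip blank lines so a stray newline is harmless.
    with open(p) as fh:
        rows = (json.loads(l) for l in fh if l.strip())
        return {r['path'] for r in rows if r['status'] == 'FAIL'}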
prev, curr = fails(sys.argv[1]), fails(sys.argv[2])
regr = sorted(curr - prev)   # newly failing — likely regressions
fixed = sorted(prev - curr)  # newly passing (or removed/skipped)
def by_slice(paths):
    c = collections.Counter('/'.join(p.split('/')[1:3]) for p in paths)
    return c.most_common(10)
def show(label, paths):
    print(f'{label}: {len(paths)}')
    for p in paths[:10]: print(f'  {p}')
    if len(paths) > 10: print(f'  ... {len(paths)-10} more')
    if paths: print(f'  by slice: {by_slice(paths)}')
show('Regressed (newly FAIL)', regr)
show('Fixed   (no longer FAIL)', fixed)
PY

Per-session safety check against the slice baseline. If Regressed is non-zero, drill into a couple of representative paths with --single -v — do not paste the full list into context. Note: a path appearing under "Fixed" could mean it now PASSes or that it was moved to SKIP / removed from scope; cross-check with the SKIP set if ambiguous.
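
The SKIP cross-check can be done mechanically. A sketch, assuming the row shape shown under Results schema; load and explain_fixed are hypothetical helpers. In dev mode a path absent from the current run either passes now or fell out of scope, so that bucket stays ambiguous by design.

```python
import json


def load(path):
    """Return {test path: status} for one results.jsonl (blank lines ignored)."""
    with open(path, encoding="utf-8") as fh:
        return {r["path"]: r["status"] for r in map(json.loads, filter(str.strip, fh))}


def explain_fixed(prev, curr):
    """Split paths that stopped FAILing by what happened to them."""
    fixed = sorted(p for p, s in prev.items() if s == "FAIL" and curr.get(p) != "FAIL")
    return {
        "now_skip": [p for p in fixed if curr.get(p) == "SKIP"],
        "now_pass": [p for p in fixed if curr.get(p) == "PASS"],  # --full runs only
        "absent": [p for p in fixed if p not in curr],  # dev mode: PASS or out of scope
    }


prev = {"test/a.js": "FAIL", "test/b.js": "FAIL", "test/c.js": "FAIL", "test/d.js": "SKIP"}
curr = {"test/a.js": "FAIL", "test/b.js": "SKIP", "test/d.js": "SKIP"}
report = explain_fixed(prev, curr)
```

Pass the two results.jsonl paths through load to run this against real run-dirs; only the counts per bucket need to go into context.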

Delegate a slice run

For re-probing a slice or triaging a cluster, spawn a general-purpose sub-agent and require a digest. The sub-agent reads the full output; you receive only the summary.

Prompt template (copy, fill in <glob>, paste into Agent):

Run etc/run.sh --only '<glob>' --max-duration 600000 from karate-js-test262/. After it completes, query the run-dir's results.jsonl (the runner prints Run dir: <path> on completion) and return ≤200 words:

PASS / FAIL / SKIP counts for the slice.

Top 3 FAIL clusters (group by error_type + normalized message prefix). For each: count, one example path, one example message.

Anything surprising: Timeouts, NPE-shaped errors, Java heap space, IllegalName lambda leaks, parse-vs-runtime classification gaps.

Do not paste raw FAIL lines, full test source, or JSONL contents. If you need to inspect a specific test, use --single -v and quote ≤3 relevant lines.

Use it for: slice probes, cluster triage, "did my engine change regress anything" checks, post-edit slice re-runs. Skip for small targeted lookups (one test, one symbol) — run those inline.

Check performance after an engine change

The conformance suite allocates a fresh Engine per test (~50k tests); small regressions compound into minutes of wall time. Prefer profile mode — the 30 s warm loop is JIT-stable and directly comparable to the reference table in JS_ENGINE.md.

# Run from the repository root: the classpath entries below are relative to it.
mvn -pl karate-js -q test-compile

# Profile mode (30 s warm loop; JIT-stable, ~16k iterations averaged).
java -cp "karate-js/target/classes:karate-js/target/test-classes:$(find ~/.m2/repository -name 'slf4j-api-*.jar' | head -1)" \
    io.karatelabs.parser.EngineBenchmark profile

# Fast mode (median of 10 cold runs) — noisy, gut-check only
java -cp "…same classpath…" io.karatelabs.parser.EngineBenchmark

If averages move >±10%, understand why before merging. If unavoidable (correctness > speed), update the reference table in JS_ENGINE.md in the same commit.
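
The ±10% rule is a relative change against the reference value. A worked sketch using the before/after figures quoted in the "Fuse the early-error parse walks" entry of this file; pct_change and needs_explanation are hypothetical helpers, not part of EngineBenchmark.

```python
def pct_change(reference, measured):
    """Signed percentage change of measured relative to reference."""
    return (measured - reference) / reference * 100.0


def needs_explanation(reference, measured, threshold=10.0):
    """True when the move exceeds the threshold in either direction."""
    return abs(pct_change(reference, measured)) > threshold


# (reference, measured) pairs: array ms, object ms, RealisticBenchmark µs/feature
samples = {"array": (1.41, 1.33), "object": (0.62, 0.57), "realistic": (68.6, 62.3)}
changes = {name: round(pct_change(ref, new), 2) for name, (ref, new) in samples.items()}
flagged = [name for name, (ref, new) in samples.items() if needs_explanation(ref, new)]
```

All three of those moves are improvements of under 10%, so none would be flagged; a move from 1.41 ms to 1.60 ms (about +13.5%) would be.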

Bump the pinned test262 SHA

Edit TEST262_SHA=... at the top of etc/fetch-test262.sh, delete the local test262/ directory, re-run the script. All subsequent runs use the new commit. Coordinate bumps with whoever else is iterating — the suite itself evolves.

Troubleshooting

Symptom	Likely cause / fix
`expectations file not found: etc/expectations.yaml`	Wrong directory. `cd karate-js-test262` first.
`test262 directory not found: test262`	Haven't run `etc/fetch-test262.sh` yet.
`Failed to execute goal ... exec-maven-plugin ... on project karate-parent: 'mainClass' ... missing`	Used `-am` with `exec:java`. Don't — install `karate-js` separately and run without `-am`.
Engine change has no effect on test262 output	Forgot `mvn ... -pl karate-js -o install -DskipTests`. The runner uses the local Maven repo jar, not the reactor classpath.
`Test262Report` says `--run-dir <path> is required`	Pass the path the runner printed on completion: `--run-dir target/test262/run-<ts>`. `etc/run.sh` does this for you.
Where's my report?	The runner prints `Run dir: <path>` on completion. Look in `<path>/html/index.html`. Each invocation creates a fresh `run-<timestamp>/` dir; nothing is overwritten.
Suite hangs on one test	Infinite loop; watchdog kicks in at `--timeout-ms`. The inner executor is retired and replaced; a genuine hang leaks one daemon thread and keeps going. Bisect with `--only`, or add `--max-duration` as a safety net.
Driving from a script that must not block	Pass `--max-duration <ms>`. On hit, partial results written and `Aborted:` replaces `Summary:`.
Tests that used to pass now fail	Run `EngineBenchmark` too: a perf regression can surface as `Timeout` failures rather than as wrong results.
`target/test262/` growing unbounded across iteration sessions	No auto-pruning; each run writes its own `run-<ts>/`. `mvn clean` wipes the lot.

Directory layout

karate-js-test262/
├── TEST262.md                         # this file
├── pom.xml                            # Maven module (deploy explicitly disabled)
├── etc/
│   ├── expectations.yaml              # declarative SKIP list (committed)
│   ├── fetch-test262.sh               # shallow clone of tc39/test262 at pinned SHA
│   └── run.sh                         # one-shot: install + run + HTML
├── src/main/java/…/test262/           # runner + report + helpers
├── src/test/java/…/test262/           # unit tests for the harness itself
├── src/main/resources/report/         # HTML/CSS/JS templates for the report
├── src/main/resources/logback.xml     # logger config (file appender → target/test262/)
├── test262/                           # [gitignored] the cloned suite
└── target/test262/                    # [gitignored] one subdir per run
    └── run-<timestamp>/               # self-contained per-run dir
        ├── results.jsonl              # per-test pass/fail/skip, sorted by path (end of run)
        ├── results.jsonl.partial      # live feed — appended per test, flushed; deleted on clean exit, kept on abort
        ├── run-meta.json              # per-run context (test262 SHA, karate-js ver+SHA, JDK, OS, started/ended, counts)
        ├── progress.log               # banner + [progress] lines + final summary
        └── html/                      # two-file static HTML report
            ├── index.html             # tree + per-slice summary tiles
            └── details.html           # full per-test list with search + status filter

Each run is self-contained and immutable; old runs persist until mvn clean. The CI workflow uploads target/test262/ (parent) as a single artifact.

CI

A workflow_dispatch-only workflow at .github/workflows/test262.yml runs etc/fetch-test262.sh + the runner + the report, and uploads the whole target/test262/ directory as a single artifact. Never triggered automatically — kick off from the Actions tab when you want a fresh run. Two inputs (only and timeout_ms) default to full-suite / 10 s per test.

The module's pom.xml sets maven.deploy.skip=true / gpg.skip=true / skipPublishing=true so the release workflow does not publish this module to Maven Central.

References

tc39/test262 — the suite
test262 INTERPRETING.md — authoritative runner spec
../docs/JS_ENGINE.md — engine architecture, slot family, prototype machinery, spec invariants, benchmarks
../karate-js/README.md — what karate-js is and isn't
../docs/DESIGN.md — wider project design principles