Debugging E2E Tests

This skill investigates a failed test in the Opik E2E suite (tests_end_to_end/e2e/). You give it a failure from wherever you noticed it; it gathers the evidence, decides whether it's a real regression or a flake, and proposes a fix.

Announce at start: "I'm using the debugging-e2e-tests skill to investigate X."

What this does — and doesn't

It diagnoses and proposes, grounded in cited evidence (the trace, the error, the history). It is read-only: it does not edit tests, and it does not re-run the suite as part of investigating.
To apply a proposed fix, hand off to the writing-e2e-tests skill (or just say "apply it") — that's a separate, deliberate act with its own run-until-green loop.

Where the evidence lives

Local run — traces under tests_end_to_end/e2e/test-results/ (retained on failure), Allure results under allure-results/.
CI run — the suite uploads three artifacts per run (7-day retention): test-results-v2 (Playwright traces + videos), playwright-report-v2 (the HTML report), allure-results-v2. Download with gh run download <run-id> -n test-results-v2 -D <dir>.
Allure TestOps (comet.testops.cloud, project id 1) — results stream live during CI. Launches are named Opik v2 … <tier> - <run_id> (the trailing number is the GitHub Actions run id; the env segment varies — E2E, Post-Merge, Local, staging, production).

Tooling

allure-testops MCP (already connected) — the richest source. Validated calls:
- list_launches(projectId: 1, search: "<run_id or name fragment>", sort: ["createdDate,DESC"]) or search_launches(rql: …) — find the launch.
- list_test_results(launchId) — per-test name, fullName (spec path + line, e.g. datasets/dataset-crud-smoke.spec.ts:8:7), status, a TestOps-computed flaky flag, muted/known, tags, jobRun.url (the GitHub Actions run), and the result id. Use search to filter to the failing test.
- get_test_result_history(id) — the pass/fail timeline for that test across recent launches. This is the flake signal.
gh — gh run view <run-id> to find the failed job; gh run download <run-id> -n test-results-v2 -D <dir> for the trace artifact. A launch's jobRun.url gives you the run id.
npx playwright show-trace <trace.zip> (from tests_end_to_end/e2e/) — open the trace to see the exact step that failed, the DOM snapshot, and console/network at that moment.
git — diff the suspected change against the failing test's code path.

The loop

dot

digraph debugging_e2e {
    rankdir=TB;
    "1. Resolve entry point" [shape=box];
    "2. Gather evidence" [shape=box];
    "3. Classify" [shape=box];
    "4. Diagnose" [shape=box];
    "5. Report + propose (no edits)" [shape=box];

    "1. Resolve entry point" -> "2. Gather evidence";
    "2. Gather evidence" -> "3. Classify";
    "3. Classify" -> "4. Diagnose";
    "4. Diagnose" -> "5. Report + propose (no edits)";
}

Step 1 — Resolve the entry point

Normalize whatever you were given into "a failed test + where its evidence lives":

A red CI check / Actions run — take the run id. gh run view <run-id> for the failed job; find the matching launch via list_launches(projectId: 1, search: "<run-id>"); the trace is in the test-results-v2 artifact (gh run download).
A TestOps launch — query it directly: list_test_results(launchId), filter to the failed results.
A test name — list_test_results with search across a recent launch, or search launches, to find the result id; then pull its history.
A local failure — use the local test-results/ trace and allure-results/ directly; TestOps may have nothing for an uncommitted local run, which is fine.

Step 2 — Gather evidence

The failed assertion and error message (from the trace, the report, or the TestOps result).
The trace: npx playwright show-trace on the retained/downloaded .zip. Read the failing step, the DOM snapshot at that point, and console/network around it.
Screenshot / video if present (only-on-failure / retain-on-failure).
The test's history via get_test_result_history(id), plus the TestOps flaky flag on the result. Skip history gracefully when TestOps isn't reachable (e.g. a purely local run) and fall back to trace + diff reasoning.

Step 3 — Classify

Decide: real regression, flake, or environment / selector drift.

History when available: a clean pass streak that broke right after a related change → lean regression. Intermittent pass/fail with no related change, or a TestOps flaky: true → lean flake.
Diff correlation: does a recent change touch the code path the failed assertion exercises (the page/component, the POM method, the fixture)? If yes → regression is likely. If the failed area is untouched → flake or environment is likely.
Default to "flake / uncertain" when history is intermittent and no related diff exists — don't over-call a regression without evidence.

Step 4 — Diagnose

Root cause, grounded in cited evidence (the specific trace step, the error, the history pattern) — not speculation. Apply the suite's lenses:

Verify the test render before blaming the backend. A "X didn't appear" failure is often a DOM race (a loading spinner still up, an eventually-consistent write not yet landed), not a backend regression. Check the trace's DOM snapshot at the failing step.
Selector drift — the FE changed an accessible name / removed a data-testid, so a locator no longer resolves.
Eventually-consistent state — async scoring/ingestion that needed a poll, not a fixed wait.
Fixture seed-shape mismatch — the page rendered an empty/partial state because the seed didn't match what the assertion expects.

Step 5 — Report + propose (no edits)

Produce:

Verdict — classification (regression / flake / environment-or-selector) + a confidence level.
Evidence — the trace step, the error, the history pattern, and the correlated change (if any), each cited.
Proposed fix — specific. For a regression: the code/selector/poll change to make. For a flake: a poll instead of a fixed wait, a quarantine, or "no code fix — known flaky, retry."

Do not edit anything. If the developer wants the fix applied, hand off to writing-e2e-tests.

Boundaries

Read-only: no test edits, no investigation-driven re-runs.
Works from all four entry points; degrades gracefully without TestOps (local failures use the trace + diff alone).
Distinct from authoring: writing-e2e-tests makes a new test; this explains a red one.