.agents/skills/debugging-e2e-tests/SKILL.md
This skill investigates a failed test in the Opik E2E suite (tests_end_to_end/e2e/). You give it a failure from wherever you noticed it; it gathers the evidence, decides whether it's a real regression or a flake, and proposes a fix.
Announce at start: "I'm using the debugging-e2e-tests skill to investigate X."
writing-e2e-tests skill (or just say "apply it"): that's a separate, deliberate act with its own run-until-green loop.

- tests_end_to_end/e2e/test-results/ (retained on failure), Allure results under allure-results/.
- test-results-v2 (Playwright traces + videos), playwright-report-v2 (the HTML report), allure-results-v2. Download with gh run download <run-id> -n test-results-v2 -D <dir>.
- comet.testops.cloud, project id 1): results stream live during CI. Launches are named Opik v2 … <tier> - <run_id> (the trailing number is the GitHub Actions run id; the env segment varies: E2E, Post-Merge, Local, staging, production).

- allure-testops MCP (already connected): the richest source. Validated calls:
  - list_launches(projectId: 1, search: "<run_id or name fragment>", sort: ["createdDate,DESC"]) or search_launches(rql: …): find the launch.
  - list_test_results(launchId): per-test name, fullName (spec path + line, e.g. datasets/dataset-crud-smoke.spec.ts:8:7), status, a TestOps-computed flaky flag, muted/known, tags, jobRun.url (the GitHub Actions run), and the result id. Use search to filter to the failing test.
  - get_test_result_history(id): the pass/fail timeline for that test across recent launches. This is the flake signal.
- gh: gh run view <run-id> to find the failed job; gh run download <run-id> -n test-results-v2 -D <dir> for the trace artifact. A launch's jobRun.url gives you the run id.
- npx playwright show-trace <trace.zip> (from tests_end_to_end/e2e/): open the trace to see the exact step that failed, the DOM snapshot, and console/network at that moment.
- git: diff the suspected change against the failing test's code path.

```dot
digraph debugging_e2e {
  rankdir=TB;
  "1. Resolve entry point" [shape=box];
  "2. Gather evidence" [shape=box];
  "3. Classify" [shape=box];
  "4. Diagnose" [shape=box];
  "5. Report + propose (no edits)" [shape=box];
  "1. Resolve entry point" -> "2. Gather evidence";
  "2. Gather evidence" -> "3. Classify";
  "3. Classify" -> "4. Diagnose";
  "4. Diagnose" -> "5. Report + propose (no edits)";
}
```
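
The identifiers mentioned above (a launch name ending in the run id, a jobRun.url, a fullName such as datasets/dataset-crud-smoke.spec.ts:8:7) often need to be turned into a run id or a spec location before calling gh or opening a file. A minimal sketch of that parsing follows. The helper names are hypothetical and not part of the skill or of any library, and it assumes the last number in fullName is a column and that jobRun.url has the usual GitHub form ending in /actions/runs/<id>.

```typescript
// Hypothetical helpers for illustration; none of these names come from the skill.

/** Location parsed from a TestOps fullName, e.g. "datasets/dataset-crud-smoke.spec.ts:8:7". */
interface SpecLocation {
  specPath: string;
  line: number;
  column: number; // assumption: the trailing number is the column
}

/** Returns the trailing run id of a launch name ("… - <run_id>"), or null if absent. */
function runIdFromLaunchName(launchName: string): string | null {
  const match = /-\s*(\d+)\s*$/.exec(launchName);
  return match ? match[1] : null;
}

/** Returns the run id from a URL containing "/actions/runs/<id>", or null if absent. */
function runIdFromJobUrl(jobRunUrl: string): string | null {
  const match = /\/actions\/runs\/(\d+)/.exec(jobRunUrl);
  return match ? match[1] : null;
}

/** Splits "<spec path>:<line>:<column>" into its parts, or returns null if it does not fit. */
function parseFullName(fullName: string): SpecLocation | null {
  const match = /^(.+):(\d+):(\d+)$/.exec(fullName);
  if (!match) return null;
  return { specPath: match[1], line: Number(match[2]), column: Number(match[3]) };
}
```

For example, parseFullName("datasets/dataset-crud-smoke.spec.ts:8:7") yields the spec path plus line 8, and the run id returned by either runId helper is what gh run view and gh run download expect as <run-id>.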
Normalize whatever you were given into "a failed test + where its evidence lives":
- gh run view <run-id> for the failed job; find the matching launch via list_launches(projectId: 1, search: "<run-id>"); the trace is in the test-results-v2 artifact (gh run download).
- list_test_results(launchId), filter to the failed results.
- list_test_results with search across a recent launch, or search launches, to find the result id; then pull its history.
- test-results/ trace and allure-results/ directly; TestOps may have nothing for an uncommitted local run, which is fine.

- npx playwright show-trace on the retained/downloaded .zip. Read the failing step, the DOM snapshot at that point, and console/network around it.
- only-on-failure / retain-on-failure).
- get_test_result_history(id), plus the TestOps flaky flag on the result. Skip history gracefully when TestOps isn't reachable (e.g. a purely local run) and fall back to trace + diff reasoning.

Decide: real regression, flake, or environment / selector drift.

- flaky: true → lean flake.

Root cause, grounded in cited evidence (the specific trace step, the error, the history pattern), not speculation. Apply the suite's lenses:

- data-testid, so a locator no longer resolves.

Produce:
Do not edit anything. If the developer wants the fix applied, hand off to writing-e2e-tests.
writing-e2e-tests makes a new test; this explains a red one.
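
As an illustration of the classification step, the history signal and the TestOps flaky flag could be combined roughly as below. This is a sketch under stated assumptions, not the skill's actual rule set. The HistoryEntry shape, the newest-first ordering, the two-status simplification, and the thresholds are all hypothetical; only "flaky: true leans flake" comes from the text above.

```typescript
// Hypothetical types: the real get_test_result_history payload may look different.
type Status = "passed" | "failed";
type Verdict = "likely flake" | "likely regression" | "inconclusive";

interface HistoryEntry {
  status: Status;
}

/**
 * Leans toward a verdict from pass/fail history (assumed newest first)
 * and the TestOps flaky flag. Thresholds are illustrative assumptions.
 */
function classifyFromHistory(history: HistoryEntry[], flakyFlag: boolean): Verdict {
  // From the skill text: flaky: true leans flake.
  if (flakyFlag) return "likely flake";

  let flips = 0; // status changes between consecutive runs
  let failStreak = 0; // consecutive failures counted from the newest run
  let streakOpen = true;
  let previous: Status | undefined;

  for (const entry of history) {
    if (previous !== undefined && entry.status !== previous) flips++;
    if (streakOpen && entry.status === "failed") failStreak++;
    else streakOpen = false;
    previous = entry.status;
  }

  // Repeated alternation between pass and fail looks like a flake.
  if (flips >= 2) return "likely flake";
  // A clean passing run of history followed by repeated failures looks like a regression.
  if (flips === 1 && failStreak >= 2) return "likely regression";
  // A single failure, or no passing baseline, needs the trace and the diff.
  return "inconclusive";
}
```

History alone cannot separate environment or selector drift from a regression, so an "inconclusive" or "likely regression" result here would still need the trace step and the diff before anything is reported.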