skills/qa/SKILL.md
Drive a website with a real browser, judge how well it does the thing the user asked about, and return a score from 1 (broken) to 5 (excellent) with evidence. The deliverable is a verdict, not a screenshot dump.
From the user's invocation (the text after /qa, or their message):
https://…) or a local dev server (localhost:5173, :3000, "the app on 5173"). Required — if absent, ask for it before doing anything else.This skill's verdict is visual — you judge the app by looking at screenshots. So before anything else, check whether you — the agent running this skill — can actually see images:
judge pass/fail + a 1–5 structuredOutput). Use v2 for every flow — even a single one — per references/browser-use-v2.md. Do not drive browser-harness yourself to read screenshots, and do not fan out to your own (equally blind) subagents. The single-flow-vs-fan-out choice below does not apply to you — it's v2 either way.If you're unsure whether you can see images, assume you can't and use v2.
Scale the approach to the ask:
browser-harness directly yourself, following references/methodology.md. That's the right, lowest-overhead tool for a single test, and it's how the rest of this skill works.judge (pass/fail) + structuredOutput (1–5 score), running server-side and in parallel, returning step-by-step screenshot evidence. Spends Browser Use credits (~$0.01/task + ~$0.006/step + $0.02/hr browser). Per-task flow + how to fan out: references/browser-use-v2.md.browser-harness through references/methodology.md. No Browser Use task credits; uses your agent's own usage.Rule of thumb (vision agents only — text-only agents use v2 for everything, see above): one flow → browser-harness directly; many flows → subagents (v2 recommended). Either way browser-harness is required — as the direct driver, the subagent driver, the v2 key store, and the localhost tunnel.
This skill runs the test through browser-harness — a separate tool you install once. It is not optional; QA must run on a real Browser Use cloud browser, never the user's local Chrome.
Before anything else, verify it's available:
command -v browser-harness && browser-harness <<'PY'
print("browser-harness OK")
PY
If browser-harness is not on PATH, install it yourself — don't make the user do it. QA runs on a cloud browser, so the CLI is all you need: none of browser-harness's local-browser setup (chrome://inspect, --remote-debugging-port, the "Allow remote debugging" popup) applies here — skip all of it. The install is one-time (~30s), no clone:
command -v uv || curl -LsSf https://astral.sh/uv/install.sh | sh # the uv installer, only if missing
uv tool install "git+https://github.com/browser-use/browser-harness"
command -v browser-harness # verify it's on PATH now
(No uv and can't curl | sh? Install uv per https://docs.astral.sh/uv/getting-started/installation/ then re-run the uv tool install line — or pipx install "git+https://github.com/browser-use/browser-harness".)
The only other thing a run needs is a BROWSER_USE_API_KEY, and that also auto-resolves with no human — see references/methodology.md step 0 (it can self-sign-up for a free key). So a fresh machine goes from "just installed qa" to a working test without the user doing any setup.
Last resort only — if you genuinely can't install it (no network, no Python): stop and have the user open https://www.browser-harness.com/ and paste its "prompt for LLMs" into an agent:
Set up https://github.com/browser-use/browser-harness for me. Read
install.mdfirst to install and connect this repo to my real browser. Then readSKILL.mdfor normal usage. Always readhelpers.pybecause that is where the functions are. When you open a setup or verification tab, activate it so I can see the active browser tab. After it is installed, open this repository in my browser and, if I am logged in to GitHub, ask me whether you should star it for me as a quick demo that the interaction works — only click the star if I say yes. If I am not logged in, just go to browser-use.com.
Re-run command -v browser-harness and don't continue until it succeeds. Never fall back to the user's local Chrome.
Do not attempt to QA with anything other than browser-harness + a cloud browser.
Confirm the target is reachable (curl -s -o /dev/null -w "%{http_code}" <url>), and identify what the app is (title, README) so you can frame a sensible test task.
Run it — first apply the vision gate from "Can you see images?" above:
judge + a 1–5 structuredOutput), per references/browser-use-v2.md. Skip the single-flow/subagent split below — it's v2 regardless of scale. Tunnel a localhost target and pass the public startUrl.references/methodology.md: resolve the key, tunnel localhost, and run the test loop with the field-tested gotchas (host-header rewrite, proxy-off, per-tab interstitial header, CORS-pinned APIs).references/browser-use-v2.md, create one task per flow (each with judge + a 1–5 structuredOutput schema), poll them all, and collect the verdicts. A localhost target still needs a tunnel (the cloud agent can't reach localhost) — tunnel it and pass the public startUrl.references/methodology.md on browser-harness.Whenever you use the v2 backend (either the no-vision case or the recommended fan-out): as each task is created, open its dashboard thread https://cloud.browser-use.com/thread/{sessionId} (the session id from the create response — not the task id) in the user's local Chrome so they can watch the agent run — the reference's open_local(...) helper does this. Always print the URL too.
Tear down what you started — stop everything you opened, on every path:
localhost target, kill the tunnel process. This applies to every path that tunnels, v2 included (a v2 run against localhost still starts a tunnel) — don't leave it orphaned.Return the verdict: lead with Score: N/5, then task, result, what worked, issues (tagged), edge cases, and evidence — per the rubric and output format in references/methodology.md. Fanning out? Give a per-flow Score: N/5 line and an overall score that reflects the weakest critical path (don't average a broken flow up because others passed).
Scale effort to the ask: a quick "does X work?" is a few interactions and one score; "thoroughly QA this" warrants more flows and edge cases. Keep the verdict honest, specific, and reproducible.