skills/qa/references/methodology.md
The full test loop, gotchas, rubric, and output format for the qa skill. Everything below runs through browser-harness (the browser-harness <<'PY' … PY heredoc form — each call is a fresh shell, daemon auto-starts).
Run every QA test on a Browser Use cloud browser — never the user's local Chrome. This is the default for all targets, public or local. A cloud browser is a clean, real-user environment: no logged-in sessions, no extensions, no clobbering the tab the user is working in, and it produces a liveUrl the user (and the report) can point to. Testing on the user's own Chrome contaminates the result with their state and is not what QA wants.
This means a localhost site is not an exception that lets you skip the cloud browser — it's the case that requires the tunnel. A cloud browser lives on the public internet and cannot reach localhost, so you expose the dev server with a tunnel first, then point the cloud browser at the public URL. Do not fall back to the local daemon just because the site is local; tunnel it out.
0. Get a Browser Use API key — the only credential this skill uses. The cloud browser authenticates with BROWSER_USE_API_KEY, and browser-harness is the single source of it: it auto-loads a .env (from its repo root and agent-workspace/) on every call, with the process env winning over that. Use only that key — never substitute another credential, and never fall back to the user's local Chrome if it's absent. Don't assume it's missing just because it isn't echoed in your shell; the authoritative test is whether a cloud browser starts (step 2's start_remote_daemon(...) returns a liveUrl).
If no key is resolvable, do not proceed on local Chrome. Pick one of exactly two paths (ask the user which, if it's unclear), then make the key available and retry step 2:
- Ask the user for a key (they can create one at cloud.browser-use.com/new-api-key).
- Complete the challenge described at docs.browser-use.com/llms.txt (the browser-harness README links it under setup) to get a free key. The challenge is a randomized, often obfuscated word problem (leetspeak, foreign-language numerals, multi-step), not always simple arithmetic. If one is hard to decode confidently, just re-request a fresh challenge; difficulty varies and re-rolling is the fastest unblock.

To make the key available, prefer the inline form: `export BROWSER_USE_API_KEY=bu_…` at the top of every browser-harness call. This always works and sidesteps a real trap: the package that runs (a uv/pip install) often lives in a different tree than the skill/docs directory you're reading, so "browser-harness's .env" is ambiguous. If you do want to persist it via .env, write to the running package's repo root (`parents[2]` of the executing helpers.py/admin.py, e.g. ~/Developer/browser-harness/.env), which is not necessarily the folder these skill files live in.
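The lookup order described above (process environment first, then a .env file) can be sketched as follows. This is an illustrative reimplementation, not browser-harness's actual loader: `parse_dotenv` and `resolve_api_key` are hypothetical names, and the rule that the first listed .env file defining the key wins is an assumption.

```python
from pathlib import Path

KEY_NAME = "BROWSER_USE_API_KEY"


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(environ, dotenv_paths):
    """Return (key, source). The process environment wins over any .env file."""
    if environ.get(KEY_NAME):
        return environ[KEY_NAME], "process env"
    for path in dotenv_paths:  # assumption: first file that defines the key wins
        path = Path(path)
        if path.is_file():
            value = parse_dotenv(path.read_text()).get(KEY_NAME)
            if value:
                return value, str(path)
    return None, None
```

Calling `resolve_api_key(os.environ, [...])` with the candidate .env paths shows which source would supply the key. A `(None, None)` result only means this sketch found nothing; the authoritative test remains whether a cloud browser starts.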
A tunnel binary must also be installed and authed — but only for a localhost target (a public URL skips the tunnel entirely, so don't block on this for public sites). Check before you rely on it, and recover if it's missing:
```bash
# ngrok is the default; cloudflared is the no-account fallback.
command -v ngrok && ngrok config check   # installed AND authed?
command -v cloudflared                   # fallback that needs no account/auth
```
- ngrok is installed and `ngrok config check` passes → use it (step 1).
- `ngrok config check` fails (no authtoken) → it's installed but unauthed. Either ask the user to run `ngrok config add-authtoken <token>` (free token at dashboard.ngrok.com), or fall through to cloudflared.
- ngrok is not usable → install cloudflared (`brew install cloudflared` on macOS, or the binary from github.com/cloudflare/cloudflared/releases) and tunnel with `cloudflared tunnel --url http://localhost:PORT`. If you can't install either, stop and tell the user which one to install (don't fall back to local Chrome; that violates the cloud-browser rule).

1. Tunnel the local port, with host-header rewrite. ngrok is the default when it is on $PATH and authed; cloudflared is a friction-free alternative if installed. Use `--host-header=rewrite`: modern dev servers (Vite, Next, webpack, CRA) reject requests whose Host is an unknown public domain with a 403 Blocked request / host not allowed (Vite's server.allowedHosts). Rewriting the Host to localhost:PORT makes the dev server see a local request. Start the tunnel in the background and read the assigned URL from ngrok's agent API; don't scrape stdout.
```bash
# dev server is on, say, http://localhost:3000
ngrok http 3000 --host-header=rewrite --log=stdout > /tmp/qa-ngrok.log 2>&1 &
sleep 3
PUBLIC_URL=$(curl -s http://localhost:4040/api/tunnels \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["tunnels"][0]["public_url"])')
echo "$PUBLIC_URL" | tee /tmp/qa-public-url.txt
# Verify the tunnel reaches the APP, not a 403/interstitial, before spending a cloud browser:
curl -s -H "ngrok-skip-browser-warning: true" "$PUBLIC_URL" | head -c 200
# cloudflared alternative (no account, no interstitial): cloudflared tunnel --url http://localhost:3000
```
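Reading the URL from the agent API can also be done by polling instead of a fixed `sleep 3`. The sketch below is one way to do that, assuming the agent API is on its default port 4040 and returns a `tunnels` list whose entries carry `public_url` and `proto`. The helper names, the 15-second timeout, and the preference for the https tunnel are illustrative choices, not part of the skill.

```python
import json
import time
import urllib.request


def pick_public_url(payload, prefer_proto="https"):
    """Return the preferred-protocol tunnel URL, else the first tunnel, else None."""
    tunnels = payload.get("tunnels", [])
    if not tunnels:
        return None
    for tunnel in tunnels:
        if tunnel.get("proto") == prefer_proto:
            return tunnel["public_url"]
    return tunnels[0]["public_url"]


def wait_for_public_url(api="http://localhost:4040/api/tunnels",
                        timeout=15.0, interval=0.5):
    """Poll the agent API until a tunnel is listed or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(api, timeout=2) as resp:
                url = pick_public_url(json.load(resp))
            if url:
                return url
        except (OSError, ValueError):
            pass  # agent not listening yet, or body is not valid JSON yet
        time.sleep(interval)
    raise TimeoutError(f"no tunnel reported by {api} within {timeout}s")
```

Polling avoids the race where the tunnel takes longer than three seconds to register and the `tunnels` list is still empty.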
For a public target, skip the tunnel — just use the URL directly in step 2.
2. Spin up a cloud browser — with the BU proxy DISABLED — and drive the public URL. start_remote_daemon creates the cloud browser, prints its liveUrl, and wires the daemon to BU_NAME. Pass proxyCountryCode=None: Browser Use's default residential proxy mangles ngrok's TLS, so the cloud browser lands on chrome-error:// / ERR_SSL_PROTOCOL_ERROR even though curl and other sites work fine. Disabling the proxy fixes it.
```bash
browser-harness <<'PY'
start_remote_daemon("qa", proxyCountryCode=None)  # proxy off: required for ngrok TLS to work
PY

PUBLIC_URL=$(cat /tmp/qa-public-url.txt)
BU_NAME=qa browser-harness <<PY
new_tab("about:blank")
# ngrok FREE shows a one-time interstitial; skip it by sending this header before navigating.
cdp("Network.setExtraHTTPHeaders", headers={"ngrok-skip-browser-warning": "true"})
goto_url("$PUBLIC_URL")
wait_for_load()
print(page_info())  # MUST show the app's real title, not chrome-error, 403, or interstitial
PY
```
The bash PUBLIC_URL and the browser-harness Python heredoc are separate processes — interpolate the URL into the heredoc (note the unquoted <<PY above so $PUBLIC_URL expands), or read it from the temp file. On the cloud browser there's no user tab to clobber, so goto_url is fine. If you must change daemon options later (e.g. toggle the proxy), restart_daemon("qa") first — start_remote_daemon errors if a daemon for that name is already running.
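The check that the page shows the app's real title can be made mechanical. The sketch below maps the failure modes this document describes to their fixes. `diagnose_landing` is a hypothetical helper that takes plain strings; how you pull the URL, title, and body text out of `page_info()` is left to the caller, since its return shape is not specified here.

```python
def diagnose_landing(url, title, body_text=""):
    """Map what the cloud browser landed on to a (cause, suggested fix) pair."""
    haystack = f"{title}\n{body_text}"
    if url.startswith("chrome-error://") or "ERR_SSL_PROTOCOL_ERROR" in haystack:
        return "proxy", "restart the daemon with proxyCountryCode=None"
    if "ERR_NGROK_6024" in haystack or "You are about to visit" in haystack:
        return "interstitial", "set the ngrok-skip-browser-warning header on this target"
    if "Blocked request" in haystack or "host not allowed" in haystack.lower():
        return "host-header", "restart ngrok with --host-header=rewrite"
    return "ok", "page looks like the app; continue with the QA loop"
```

An "ok" result only means none of the known tunnel symptoms matched; still confirm the title is the one you expect before scoring anything.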
3. Run the QA loop below with BU_NAME=qa on every browser-harness call.
4. Tear down when done. Kill the tunnel (pkill -f "ngrok http 3000") and stop the cloud browser — it bills until its timeout. Call stop_remote_daemon("qa") (that's the function name — not stop_daemon); it PATCHes the browser to stop via the saved BU_BROWSER_ID. Only touch the daemon/browser you created: stale numbered daemons or sockets from prior runs (bu-qa1…) may belong to other sessions — pick a clean, unique BU_NAME and don't stop browsers you didn't start.
Gotchas specific to this path (all field-hit — don't relearn them live):
- 403 host not allowed → use `--host-header=rewrite` (see step 1). The most common first failure against a Vite/Next/webpack dev server.
- chrome-error:// / SSL error but curl works → disable the BU proxy with `proxyCountryCode=None` (see step 2). To diagnose, navigate the cloud browser to https://example.com first: if that loads but your tunnel doesn't, it's the proxy.
- If the app hardcodes http://localhost:3000/... or pins its API/CORS to the localhost:PORT origin, those calls fail from the ngrok origin. That is a tunnel artifact, not a real bug; don't score it against the app. When the real backend is unreachable this way, look for a mock/demo mode (e.g. a ?mock=N query param or a fixtures flag) so you can still exercise the UI, and say in the report which paths were mock-only.
- A local instance may point at a non-production backend through its .env (e.g. VITE_BU_API_BASE), so a prod credential 401s. Confirm which backend the instance targets before scoring an auth failure as a bug; the app may be faithfully rendering a legitimate upstream error.
- The ngrok interstitial comes back on new tabs. `Network.setExtraHTTPHeaders` applies only to the current target, so any new tab the app opens (clicking a tile, target=_blank, OAuth popups, window.open) starts without it and lands on ngrok's ERR_NGROK_6024 "You are about to visit" page. Re-apply `cdp("Network.setExtraHTTPHeaders", headers={"ngrok-skip-browser-warning":"true"})` on each new target before reading it, or sidestep the whole class by tunneling with cloudflared, which has no interstitial. Either way, don't QA the warning page.
- To run several checks in parallel, give each its own cloud browser: `start_remote_daemon(name)` with a distinct BU_NAME each, run as separate subagents (browser-harness's SKILL.md covers the remote-browser fan-out pattern).

Run every browser-harness call below against the cloud browser you started above (i.e. with its BU_NAME, e.g. `BU_NAME=qa browser-harness <<'PY' …`), pointed at the tunnel's public URL (or the public site), not the bare local daemon.
- `new_tab(url)` → `wait_for_load()` → `capture_screenshot()`, then click/type your way through. Screenshot after every meaningful action and verify the page actually changed the way you expected; don't assume a click worked. Two reliability notes for cloud browsers:
  - (a) Click from DOM coordinates, not scaled-screenshot pixels. If you downscale a shot (e.g. max_dim=1800 on a 1920-wide viewport) the pixels no longer map 1:1, so clicks miss; read targets with `js("…getBoundingClientRect()…")` and `click_at_xy` those.
  - (b) Clear an input before typing (select-all + delete). Typing into an already-filled field concatenates and produces doubled values (a bu_…bu_… key), which then looks like an app auth failure but is your own artifact.
- Call `drain_events()` and scan for JS errors (Runtime.exceptionThrown), console errors (Runtime.consoleAPICalled with type == "error"), and failed requests (Network.responseReceived with status ≥ 400, Network.loadingFailed). A page that looks fine but throws on every click is not a 5.

```bash
browser-harness <<'PY'
new_tab("https://example.com/signup")
wait_for_load()
capture_screenshot("/tmp/qa-01-landing.png", max_dim=1800)
# ... interact: click_at_xy, type_text, press_key("Enter") ...
# pull non-visual failures
errs = [e for e in drain_events() if (
    e["method"] == "Runtime.exceptionThrown"
    or (e["method"] == "Runtime.consoleAPICalled" and e["params"].get("type") == "error")
    or (e["method"] == "Network.loadingFailed")
    or (e["method"] == "Network.responseReceived"
        and e["params"]["response"]["status"] >= 400)
)]
print(len(errs), "error events")
for e in errs[:10]:
    print(e["method"], e["params"])
PY
```
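To see why clicks taken from a downscaled screenshot miss, the arithmetic can be written out. This sketch assumes `max_dim` caps the longer side of the shot while preserving aspect ratio, and that the device pixel ratio is 1; both are assumptions about `capture_screenshot`, and the helper names are hypothetical. Reading `getBoundingClientRect()` remains the recommended way to get click targets.

```python
def screenshot_scale(viewport_w, viewport_h, max_dim=None):
    """Scale factor of a screenshot whose longer side is capped at max_dim."""
    longest = max(viewport_w, viewport_h)
    if max_dim is None or longest <= max_dim:
        return 1.0
    return max_dim / longest


def screenshot_to_viewport(x, y, viewport_w, viewport_h, max_dim=None):
    """Convert a pixel position in the (possibly downscaled) shot to viewport coordinates."""
    scale = screenshot_scale(viewport_w, viewport_h, max_dim)
    return x / scale, y / scale
```

With a 1920x1080 viewport and max_dim=1800 the scale is 0.9375, so a point seen at (900, 450) in the screenshot sits at (960, 480) in the viewport. Clicking the raw screenshot pixel would land 60 px left and 30 px above the target.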
| Score | Meaning |
|---|---|
| 5 | Task completes flawlessly. No errors, no friction, responsive and polished. A real user would have zero complaints. |
| 4 | Task completes. Minor cosmetic or UX nits (a slow load, awkward copy, one console warning) but nothing that blocks or confuses. |
| 3 | Task completes, but with real friction — a confusing step, a workaround needed, a non-blocking error, or a rough edge case. Usable, not good. |
| 2 | Task only partially works. A significant bug blocks part of the flow, or success requires luck/retries. Most users would get stuck. |
| 1 | Task cannot be completed. Critical failure: dead button, hard crash, infinite spinner, page won't load, data lost. |
Anchor the score to task completion first, then modify for errors and polish. "It worked but threw three console errors" is a 3–4, not a 5. "It looked beautiful but the submit button does nothing" is a 1, not a 4 — looks don't rescue a broken flow.
When the prompt asks about several things (e.g. "test search and filters"), score each sub-task, then report an overall score that reflects the weakest critical path — don't average a broken checkout up to a 3 because the homepage was nice.
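The "weakest critical path" rule can be expressed as a minimum rather than a mean. The sketch below is one strict reading of it: `SubtaskScore` and `overall_score` are hypothetical names, and falling back to the minimum over all sub-tasks when none is flagged critical is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SubtaskScore:
    name: str
    score: int            # 1-5 on the rubric
    critical: bool = True


def overall_score(subtasks):
    """Overall score is the weakest critical sub-task, never an average."""
    if not subtasks:
        raise ValueError("no sub-tasks scored")
    for s in subtasks:
        if not 1 <= s.score <= 5:
            raise ValueError(f"{s.name}: score {s.score} is outside 1-5")
    critical = [s.score for s in subtasks if s.critical]
    # assumption: with nothing flagged critical, every sub-task counts
    pool = critical or [s.score for s in subtasks]
    return min(pool)
```

A polished homepage scored 5 and a broken checkout scored 1 yield an overall 1 here, where an average would have reported 3.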
Return a compact, skimmable verdict. Lead with the number.
```
Score: 3/5
Task: Sign up with a new email and reach the dashboard.
Result: Completed, but with friction.
What worked:
- Form accepted valid input, account created, redirected to dashboard.
Issues:
- [blocker?] no: "Email already in use" error rendered as raw "[object Object]" (saw it on retry).
- [console] TypeError in analytics.js on every page (Runtime.exceptionThrown, see /tmp/qa-03-error.png).
- [ux] no loading indicator on submit; looked frozen for ~4s.
Edge cases tried: empty email (handled, inline error ✓), 8s submit latency (no spinner ✗).
Evidence: /tmp/qa-01-landing.png, /tmp/qa-03-error.png
```
Keep it honest and specific — "the submit button at the bottom of the form did nothing and logged a 500" beats "signup is broken". Cite screenshots and the actual error text so the score is defensible.
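If you assemble the verdict programmatically, a small formatter keeps the layout consistent between runs. This is a minimal sketch of the format shown above; `format_verdict` and its parameters are hypothetical, and issues are passed as (tag, text) pairs.

```python
def format_verdict(score, task, result, worked, issues, edge_cases, evidence):
    """Render the compact verdict, leading with the score."""
    lines = [
        f"Score: {score}/5",
        f"Task: {task}",
        f"Result: {result}",
        "What worked:",
    ]
    lines += [f"- {item}" for item in worked]
    lines.append("Issues:")
    lines += [f"- [{tag}] {text}" for tag, text in issues]
    lines.append("Edge cases tried: " + ", ".join(edge_cases))
    lines.append("Evidence: " + ", ".join(evidence))
    return "\n".join(lines)
```

Keeping the evidence paths in one list makes it easy to check that every screenshot cited in an issue actually appears in the final Evidence line.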
Always check `drain_events()`: many failures (analytics crashes, 4xx/5xx APIs, unhandled promise rejections) are invisible in pixels.
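Once events are drained, grouping them by kind makes the report's issue list easier to write. The sketch below assumes each event is a dict with `method` and `params` keys, as in the loop example earlier; the category labels and helper names are illustrative.

```python
from collections import Counter


def classify_event(event):
    """Return a failure category for a drained event, or None if it is benign."""
    method = event.get("method")
    params = event.get("params", {})
    if method == "Runtime.exceptionThrown":
        return "js-exception"
    if method == "Runtime.consoleAPICalled" and params.get("type") == "error":
        return "console-error"
    if method == "Network.loadingFailed":
        return "request-failed"
    if method == "Network.responseReceived":
        if params.get("response", {}).get("status", 0) >= 400:
            return "http-error"
    return None


def summarize_events(events):
    """Count non-visual failures by category."""
    return Counter(kind for kind in map(classify_event, events) if kind)
```

An empty counter after a full flow is part of the evidence for a 5; any non-empty one should show up under Issues with the actual error text.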