skills/qa/references/browser-use-v2.md
Run the QA test as an autonomous Browser Use cloud agent instead of driving browser-harness step by step. It's purpose-built for QA: a judge evaluates pass/fail against expected behavior, and structured output forces the 1–5 score. It runs server-side, parallelizes, and returns step-by-step evidence (screenshots + actions).
Cost / credits: the v2 agent spends Browser Use credits — about $0.01 per task + ~$0.006 per
step (LLM) + $0.02/hr browser, drawn from the account's monthly allowance. (The Claude-subagent
backend in methodology.md spends no Browser Use task credits.) Recommend v2 for real QA; fall
back to the Claude subagent to avoid credits.
Note: the docs label the v2 API "legacy" and steer new projects to v3 — but the
judge+ structured-output evaluation features QA needs live on v2 (POST /api/v2/tasks), so that's what this backend uses.
POST https://api.browser-use.com/api/v2/tasks → 202 {id, sessionId}GET https://api.browser-use.com/api/v2/tasks/{id} → status ∈ created → started → finished | failed | stopped, plus output, judgeVerdict, judgement, steps[], cost.X-Browser-Use-API-Key.The v2 API authenticates with BROWSER_USE_API_KEY — the same key methodology.md step 0 resolves
(browser-harness's .env, the process env, or self-signup). The cleanest way to use
browser-harness's stored key is to run the calls inside a browser-harness heredoc, where
the key is already loaded into os.environ — no separate plumbing, no re-exporting. (Plain curl
with $BROWSER_USE_API_KEY also works if it's exported. The v2 task itself runs on a Browser Use
cloud browser, so no local Chrome is needed for the test — browser-harness here is just the key
store + HTTP runtime.)
Fill in task, startUrl (the public URL — tunnel a localhost target first), and
judgeGroundTruth (what success looks like), then run:
browser-harness <<'PY'
import os, json, time, urllib.request, urllib.error, subprocess, sys
KEY = os.environ.get("BROWSER_USE_API_KEY")
assert KEY, "no BROWSER_USE_API_KEY — resolve it per methodology.md step 0"
BASE = "https://api.browser-use.com/api/v2"
def open_local(url):
"""Open a URL in the user's LOCAL browser (Chrome first) so they can watch the cloud agent run.
This views the dashboard thread — it does NOT run the test locally; the task still runs in the cloud."""
cmds = ([["open", "-a", "Google Chrome", url], ["open", url]] if sys.platform == "darwin"
else [["google-chrome", url], ["xdg-open", url]])
for c in cmds:
try:
subprocess.Popen(c, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
return
except FileNotFoundError:
continue
def call(method, path, body=None):
req = urllib.request.Request(
BASE + path,
data=json.dumps(body).encode() if body is not None else None,
method=method,
headers={"X-Browser-Use-API-Key": KEY, "Content-Type": "application/json"},
)
try:
with urllib.request.urlopen(req) as r:
return json.load(r)
except urllib.error.HTTPError as e:
raise SystemExit(f"v2 API {e.code}: {e.read().decode()[:300]}")
# 1-5 score schema (structuredOutput must be a *stringified* JSON schema)
SCORE_SCHEMA = json.dumps({
"type": "object",
"properties": {
"score": {"type": "integer", "minimum": 1, "maximum": 5},
"verdict": {"type": "string"},
"worked": {"type": "array", "items": {"type": "string"}},
"issues": {"type": "array", "items": {"type": "string"}},
},
"required": ["score", "verdict"],
})
created = call("POST", "/tasks", {
"task": "QA TASK HERE — e.g. 'Add an item to the cart, go to checkout, and report whether "
"it completes. Score 1-5 (5=flawless, 1=broken) with what worked and any issues.'",
"startUrl": "https://PUBLIC-URL-UNDER-TEST", # tunnel localhost first; pass the public URL
"judge": True,
"judgeGroundTruth": "SUCCESS LOOKS LIKE — e.g. 'An order-confirmation / thank-you page is shown.'",
"structuredOutput": SCORE_SCHEMA,
"maxSteps": 50, # a CEILING, not a target — the agent stops when done, so this doesn't inflate cost;
# 50 gives room for a multi-step flow. Raise for long flows; cost = steps actually taken.
# optional: "llm": "browser-use-2.0", "vision": True,
# "sessionSettings": {"proxyCountryCode": "us", "enableRecording": True}
})
tid = created["id"]
sid = created["sessionId"]
watch_url = f"https://cloud.browser-use.com/thread/{sid}" # dashboard thread = the SESSION (not the task id)
print("created task", tid, "session", sid, flush=True)
print("watch:", watch_url, flush=True)
open_local(watch_url) # pop it open in the user's local Chrome to watch
while True: # poll to a terminal state
t = call("GET", "/tasks/" + tid)
if t["status"] in ("finished", "failed", "stopped"):
break
time.sleep(5)
print(json.dumps({
"status": t["status"],
"score_output": t.get("output"), # the structuredOutput JSON → the 1-5 score object
"judgeVerdict": t.get("judgeVerdict"), # True = passed the ground-truth check, False = failed
"judgement": t.get("judgement"), # judge's reasoning (stringified JSON report)
"cost_usd": t.get("cost"), # what this run spent
"num_steps": len(t.get("steps") or []),
}, indent=2))
PY
As soon as a task is created, open its dashboard thread in the user's local Chrome so they can watch the agent work — that's the open_local(...) call above. The watch URL is built from the session id:
https://cloud.browser-use.com/thread/{sessionId}
That /thread/<uuid> page is the v2 agent's run: the live browser plus the agent's step-by-step reasoning, screenshots, and (when finished) the judge verdict. Notes:
{id} is the sessionId from POST /tasks — not the task id. (Verified: GET /tasks/{sessionId} 404s; the dashboard groups a run by its session.) Always print the URL too, so the user has it even if the browser doesn't auto-open.liveUrl, the raw single-session CDP viewer on live.browser-use.com that dies when the session stops — the dashboard thread persists after the run, so it's the better link to hand the user.)open_local targets the user's local Chrome (open -a "Google Chrome" on macOS, google-chrome/xdg-open on Linux), falling back to the default browser.Report exactly as methodology.md's output format, sourced from the agent's result:
Score: N/5 ← the score field of the structured output — but judgeVerdict overrides it.
The structured score is the agent's self-report and can be wrong (an agent will happily score a
blank page 5/5). If judgeVerdict is False, the flow FAILED regardless of the self-score — cap
it at ≤2 and lead with the judge's failure_reason. (Real example: an agent self-scored x.ai/pricing
5/5 "fully functional"; the judge saw the page rendered blank and returned false. Trust the judge.)judgeVerdict (true = met the ground truth, false = didn't). The agent's
score/isSuccess are self-reports and are less reliable — judgeVerdict is authoritative.worked / issues arrays, cross-checked against
judgement (the judge's reasoning).steps[]: each has url, screenshotUrl, actions (and sometimes nextGoal —
may be empty). Cite the screenshotUrls of the key moments.cost so the user sees what the run spent.Report it in this format (the same one the Claude backend uses — self-contained here so you don't
need to open methodology.md):
Score: N/5
Task: <what you asked the agent to verify>
Result: <pass/fail from judgeVerdict + one line>
What worked:
- <from the structured `worked` array / judgement>
Issues:
- [tag] <from `issues` / judgement; empty if none>
Evidence: <key steps[].screenshotUrl links>
Cost: $X.XX (Browser Use v2 agent, <n> steps)
The whole point of v2 subagents is parallel coverage. To test several flows at once, create all
the tasks first (each POST /tasks returns immediately with an id), then poll them all —
they run concurrently in Browser Use cloud:
flows = [
{"task": "Test signup: …", "startUrl": URL, "judgeGroundTruth": "Account created, lands on dashboard."},
{"task": "Test checkout: …", "startUrl": URL, "judgeGroundTruth": "Order confirmation shown."},
{"task": "Test search + filters: …", "startUrl": URL, "judgeGroundTruth": "Filtered results update."},
]
ids = []
for f in flows:
c = call("POST", "/tasks", {**f, "judge": True, "structuredOutput": SCORE_SCHEMA, "maxSteps": 50})
open_local(f"https://cloud.browser-use.com/thread/{c['sessionId']}") # one Chrome tab per flow, to watch
ids.append((f["task"][:40], c["id"]))
results = {}
while len(results) < len(ids):
for label, tid in ids:
if tid in results: continue
t = call("GET", "/tasks/" + tid)
if t["status"] in ("finished", "failed", "stopped"):
self_score = (json.loads(t["output"]).get("score") if t.get("output") else None)
passed = t.get("judgeVerdict") is True # JUDGE is authoritative, not the self-score
results[tid] = {"label": label, "passed": passed, "self_score": self_score,
"score": (self_score if passed else min(self_score or 2, 2)), # judge=False caps it
"cost": t.get("cost")}
time.sleep(5)
# Per flow: PASS only if judgeVerdict is True. A flow where the agent self-scored high but
# judgeVerdict is False is a *caught failure* — flag it and score it low.
# Overall = the weakest flow (min of the judge-corrected scores) — never average a failed flow up.
Watch the concurrent-session cap (Free = 3): creating more than the cap at once yields 429 —
batch the creates to stay under it.
structuredOutput is a string — pass json.dumps(schema), not the schema object.methodology.md) and
pass the public startUrl; for a free-ngrok host, tell the agent in the task to click through
any "You are about to visit" interstitial.429 TooManyConcurrentActiveSessionsError — the account hit its concurrent-session cap
(Free = 3); wait or stop other sessions.maxSteps is a safety ceiling, not a cost lever — the agent stops when the task is complete, so the cap doesn't drive cost (this run capped at 15 but used 5 steps). Keep ~50 for headroom; raise for long flows.assert KEY does this; if it's missing, resolve it per methodology.md step 0 before calling POST /tasks.