Phase 8 — Manual QA by Actually Using It

Tests cover cases you thought of. Real usage covers the ones you didn't.

The single fastest way to ship a broken fix is to stop at "tests pass". Manual QA means interacting with the running system the way the user does, then comparing observed behavior to the original bug report.

Product-type playbook

Pick the row that matches the product. Do what it says. Do not substitute.

Product type	QA means…
CLI tool	Open `tmux`, run the actual command end-to-end, capture output. Paste the session transcript into the journal. Include exit code, stdout, stderr, side-effect check (files created/modified).
HTTP API	Start the real server, hit endpoints with `curl` or `httpie`, inspect response status + body + headers. Hit the specific endpoint that reproduced the bug. If there's auth, use real auth.
Browser-served web app	Drive a real browser via Playwright CLI. See tools/playwright-cli.md. Navigate the exact page/flow that reproduced the bug. Capture screenshot + DOM + network evidence. Do not substitute with curl — browsers have state (cookies, localStorage, service workers, client-side JS, viewport-dependent CSS) that curl does not have.
Agent / LLM pipeline	Run the same user prompt that originally failed. Capture the full turn — tool calls, messages, usage counters. Confirm non-zero usage (zero usage = still failing silently, see silent-failure check below).
Background worker / job queue	Trigger the job through the normal entry point (API call, cron tick, message publish), tail the worker logs, observe completion state in the queue or DB. Don't just call the worker function directly — the trigger path matters.
MCP server	Invoke the tool via its actual client (Claude Desktop, Cursor, etc. if available) or `mcp-cli`, not just the HTTP probe endpoint. The MCP handshake itself is sometimes where bugs live.
Native binary	Re-run the exact command that crashed / misbehaved. If the input was a file, use the same file. If the bug was exploitable, confirm the exploit repro via pwntools (see tools/pwntools.md). Capture exit code, signal if any, core dump if generated.
Bundled-app binary (Bun SEA, Node SEA, Electron, etc.)	Re-run the exact command. If the operation requires paid quota / blocked network, capture the app's debug log (`APP_DEBUG=1 APP_LOG_LEVEL=debug APP_LOG_FILE=/tmp/trace.log`) which usually emits the assembled request before sending. See methodology/partial-runtime-evidence.md for combining partial signals into a defensible verification.
Long-running daemon	Start fresh, let it run for the amount of time the bug originally took to manifest (not less), capture resource usage (memory, fd, cpu) throughout. Short-running QA misses resource leaks and cumulative state bugs.

Journal format

Every QA run goes in the journal under "Findings":

markdown

### Manual QA — <product type> (<ISO timestamp>)
- Scenario: <one line describing what you did>
- Command: `<exact invocation>`
- Observed output:

- Expected output: <what correct behavior looks like>
- Fix verified: yes / no / partial — <details>

If any QA step shows partial or regressed behavior, this is not "mostly done" — it's incomplete. Return to Phase 6.

The silent-failure check (always run)

Regardless of product type, audit the fix against these silent-failure patterns. If the original bug was a silent failure, the same pattern may exist in adjacent code that you haven't tested yet.

Universal silent-failure signals

HTTP 2xx with empty or default body
Response ok: true but a sub-field contains an error token (e.g. stopReason: "error", status: "failed")
usage.totalTokens === 0 on an LLM response
Process exit code 0 but stderr contains an exception traceback
Panic recovered and logged but ignored
Goroutine / task / promise rejection with no top-level handler
try { ... } catch { /* swallowed */ } or except: pass
Success response shape but semantic field indicates failure (e.g. error: null actually being error: "..." with falsy check)
Write returned success but read-back shows stale data
Job marked complete but side-effect did not happen
Cache hit path returned stale data and no refresh was triggered

Language-specific silent-failure signals

Check the runtime reference for additional patterns:

runtimes/python.md — asyncio task exceptions, bare except, logging.exception that goes nowhere
runtimes/node.md — unhandled promise rejections, void on async, swallowed .catch(() => {})
runtimes/rust.md — .unwrap_or_default(), let _ = result, error variants discarded
runtimes/go.md — if err != nil { return err } that never reaches user output, recovered panics, buffered channels that block silently
runtimes/native-binary.md — ignored return codes from libc, missing perror, alarm() / signal masks
runtimes/bundled-js-binary.md — process.env.X baked at build time, dead code from tree-shaking failures, worker sub-bundles diverging from main bundle

What to do when you find another silent-failure spot

Don't fix it. This is out of scope for the current bug.

Note it in the journal under a "Follow-ups" section with:

File:line
Pattern matched
Proposed fix sketch (one line)
Risk level (what happens if left unfixed)

Surface these to the user in the final message under "Next steps I didn't take".

The "fix verified" bar

"Fix verified" means: the exact original failing scenario, re-run, now produces the correct output. Not a similar scenario. Not a unit test of the fix. The original scenario.

If you can't re-run the original scenario (e.g. it required a specific data state that's gone), construct the closest equivalent and document the difference in the journal. Escalate to the user if the equivalent is materially different.