.agents/shared/test-loop.md
The portable core of autonomous, tested implementation. Both /implement (Claude Code)
and $implement (Codex) read and follow THIS file verbatim for the testing phase, so the
impl⇄test loop behaves identically across harnesses. The harness-specific wrappers own
project setup, task splitting, and the spawn/wait mechanics; this file owns everything from
"a single task's implementation is committed" onward.

- Overlay: #ifdef _DEBUG test code for the current task. Never part of an implementation commit. Lives as a patch under the task folder between rounds.
- TASK_DIR: .ai/<project>/<letter>/ for this task.
- TASK_ID: stable id used in commit trailers (e.g. the project + letter).
- The task description (implementing.md) and its referenced images (images/<file>: design mockups / screenshots / graphic resources for this isolated task). This is half of what the tests are designed against (the diff is the other half); the test design READS the images, because they show what the result should look like.
- BUILD (build command), EXE (built binary path), MAX_ATTEMPTS (default 4). The test account lives in out/Debug/ as the portable-data folders described under "Test account" below; the wrapper has already confirmed the golden one exists (launch gate). All paths are relative to the current checkout: no worktrees are created, and the run happens in whatever repository slot it was launched from.

Precondition: the implementation for this task is committed in the current checkout (impl agents commit; they do not stash). Record that commit's SHA as IMPL_SHA; the reset after each test run returns the checkout to exactly it. The runner tracks the attempt number as its own state (attempt starts at 1); the commit message carries no attempt marker. Commits follow "Commit message" below.

```text
TEST_AUTHOR -> RUN -> ASSESS (adversarial; see "Assessing"):
    APPROVED      -> reset to the impl commit (drop overlay); delete the test binary; return DONE up.
    TEST_FLAW     -> fix the overlay only; back to RUN. Does NOT cost an impl attempt.
    IMPL_BUG      -> spawn impl-fix agent (input = test.md, latest attempt's Root cause / Fix hint);
                     it commits a NEW attempt; re-apply overlay (--3way, else re-author); RUN. attempt++
    UNRECOVERABLE -> delete the test binary; return BLOCKED up with the reason. Stop.
    attempt > MAX -> delete the test binary; return BLOCKED up with test.md + "improve" notes. Stop.
```
On every TERMINAL exit (APPROVED / BLOCKED / UNRECOVERABLE / cap) "delete the test binary" means the
step in "Leave no test binary behind" below.
Early-escalation rule: if two consecutive ASSESS rounds produce the same failure signature (same step fails the same way after a fix), stop and return BLOCKED — do not burn the rest of the attempt budget chasing it.
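
The transitions and the early-escalation rule above can be sketched as one small decision function. This is a minimal illustration, not the runner's actual code: `Action`, `LoopState` and `Next` are hypothetical names, and it assumes the attempt cap is checked right after the `attempt++` of an IMPL_BUG round.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Verdicts ASSESS can produce, as named in the loop above.
enum class Verdict { Approved, TestFlaw, ImplBug, Unrecoverable };

// What the runner does next. Hypothetical labels for this sketch.
enum class Action { ReturnDone, FixOverlayAndRun, SpawnImplFix, ReturnBlocked };

struct LoopState {
    int attempt = 1;      // runner-owned state; never part of a commit message
    int maxAttempts = 4;  // MAX_ATTEMPTS default
    std::optional<std::string> lastSignature;  // failure signature of the previous round
};

// Decides the next action for one ASSESS round and updates the state.
// `signature` is the one-line failure signature (unused on APPROVED).
inline Action Next(LoopState &state, Verdict verdict, const std::string &signature) {
    if (verdict == Verdict::Approved) return Action::ReturnDone;
    if (verdict == Verdict::Unrecoverable) return Action::ReturnBlocked;

    // Early escalation: the same failure signature twice in a row.
    if (state.lastSignature && *state.lastSignature == signature) {
        return Action::ReturnBlocked;
    }
    state.lastSignature = signature;

    if (verdict == Verdict::TestFlaw) {
        return Action::FixOverlayAndRun;  // does not cost an impl attempt
    }
    // IMPL_BUG costs one attempt; stop once the budget is exhausted.
    ++state.attempt;
    if (state.attempt > state.maxAttempts) return Action::ReturnBlocked;
    return Action::SpawnImplFix;
}
```

Every `ReturnDone` and `ReturnBlocked` here is a terminal exit, so the caller would still perform the "delete the test binary" step before returning.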
UNRECOVERABLE conditions: the app reaches a login screen / AUTH_KEY_DUPLICATED and re-copying the
test account does not recover it; a file-lock build error (LNK1104, C1041) that persists after
the path-scoped kill; test_TelegramForcePortable missing when SETUP runs; or a crash with no usable
diagnostic after one retry.

- Commit with git add -A && git commit per "Commit message" below (and, if submodules changed, commit inside each submodule first, then bump the superproject pointer in the same logical attempt; real commits, never stash). The runner records the resulting SHA as that attempt's IMPL_SHA.
- result<n>.md) is the only thing handed back to a fix agent and the only thing the runner reads to decide. See format below.

## Commit message

Impl commits must read like the repository's own history, never marked as autonomous. Match the style of recent git log subjects.

No Autotask:/attempt marker; no Co-Authored-By: or any tool/assistant attribution line. This explicitly OVERRIDES any harness default that would append one: a freshly spawned committing sub-agent may add Co-Authored-By unless told not to, so pass this rule to it. The attempt number is the runner's own state, never part of the message.

## Test account

The debug build runs in portable mode out of out/Debug/. Three sibling folders matter:

- test_TelegramForcePortable: the golden test account, prepared by the user. Read-only SOURCE, never modified by tests. (Its presence is the launch gate; the wrapper aborts if it is missing.)
- TelegramForcePortable: the LIVE folder the app actually uses (its presence is what puts the build in portable mode). Disposable; recreated fresh each run.
- real_TelegramForcePortable: the user's real data, preserved once so manual use survives.

SETUP: run at the START of every test run, with NO app instance alive. Idempotent: it guarantees a clean test account no matter how the previous run ended.

1. If TelegramForcePortable exists AND real_TelegramForcePortable does NOT, rename TelegramForcePortable -> real_TelegramForcePortable. (Captures the user's real data exactly once; guarded so it is never overwritten afterward.)
2. If TelegramForcePortable still exists, delete it. (Safe: real_... now holds the real data, so this only discards a leftover live/test copy.)
3. Copy test_TelegramForcePortable -> TelegramForcePortable. The live folder is now a fresh copy of the golden test account, ready to launch.

CLEANUP: optional, after a run. The SETUP steps already self-heal, so cleanup exists only to leave the user's real data live for manual use:

1. Delete TelegramForcePortable.
2. Rename real_TelegramForcePortable -> TelegramForcePortable.

Why this is safe: real_... is written exactly once (step 1 is guarded by "real does not exist") and test_... is only ever a copy source, so both the user's real data and the golden test account are structurally protected; only TelegramForcePortable is ever destroyed. Use robocopy /MIR (or Copy-Item -Recurse / Remove-Item -Recurse -Force) for the folder ops.

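
The SETUP and CLEANUP steps can also be expressed with the C++17 filesystem library, which makes the guards easy to see. This is only a sketch: `SetupTestAccount` and `CleanupTestAccount` are hypothetical names, `dir` stands for out/Debug/, and the early return in cleanup when nothing was preserved is an assumption added here.

```cpp
#include <cassert>
#include <filesystem>
#include <stdexcept>

namespace fs = std::filesystem;

// SETUP: leaves dir/TelegramForcePortable as a fresh copy of the golden account.
// Must run with no app instance alive.
inline void SetupTestAccount(const fs::path &dir) {
    const auto live = dir / "TelegramForcePortable";
    const auto real = dir / "real_TelegramForcePortable";
    const auto golden = dir / "test_TelegramForcePortable";

    if (!fs::is_directory(golden)) {
        // The loop treats this as UNRECOVERABLE.
        throw std::runtime_error("test_TelegramForcePortable missing");
    }
    // 1. Capture the user's real data exactly once.
    if (fs::exists(live) && !fs::exists(real)) {
        fs::rename(live, real);
    }
    // 2. Discard any leftover live/test copy.
    if (fs::exists(live)) {
        fs::remove_all(live);
    }
    // 3. Fresh copy of the golden account; the golden folder is only read.
    fs::copy(golden, live, fs::copy_options::recursive);
}

// CLEANUP: put the user's real data back as the live folder.
inline void CleanupTestAccount(const fs::path &dir) {
    const auto live = dir / "TelegramForcePortable";
    const auto real = dir / "real_TelegramForcePortable";
    if (!fs::exists(real)) return;  // assumption: nothing preserved, leave as is
    fs::remove_all(live);
    fs::rename(real, live);
}
```

Calling `SetupTestAccount` twice in a row gives the same result as calling it once, which is the idempotence the steps rely on.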
Serialize app runs. Never have two Telegram.exe instances alive against this account at once —
concurrent reuse of one auth key can trigger a server-side session reset. Before SETUP, launching, or
rebuilding, kill any straggler of THIS checkout's binary only — the one whose full executable
path is EXE (out/Debug/Telegram.exe in this checkout). Match on the full path; do NOT blanket-kill
every Telegram.exe on the machine. The user may be running a system-installed client or another
checkout's build against unrelated accounts — those use different auth keys, never conflict with this
account, and MUST be left alive. On Windows, scope the kill by path:
```powershell
$exe = (Resolve-Path "$EXE").Path
Get-CimInstance Win32_Process -Filter "Name = 'Telegram.exe'" |
    Where-Object { $_.ExecutablePath -eq $exe } |
    ForEach-Object { Stop-Process -Id $_.ProcessId -Force }
```
taskkill /IM Telegram.exe /F is forbidden here and anywhere else in this loop — it is image-name-wide
and takes down the user's unrelated clients. Every "kill stragglers" / "taskkill" step below means
this path-scoped kill.
Avoid destructive calls. The overlay must never trigger logout / session-termination / account-deletion. Tests that genuinely need those use a separate burner account, not this one. (If a permanent destructive-call fuse is later added to the debug build, this is enforced in code; until then it is the test-author's responsibility.)
The single most important rule: tests are derived from what THIS task changed — not from generic project navigation, and not reused from a previous task. Different change → different checks. If two tasks produce the same screenshots and the same assertions, the second test is a no-op. Before writing any overlay:

- (a) implementing.md and its referenced design-mockup images (images/<file>); READ the images, they are the source of truth for what the result should look like. (b) The change under test: git show <IMPL_SHA> (the actual diff) and <TASK_DIR>/plan.md. List every concrete thing the diff changed and every surface the task (description + "Observable result") says it affects.
- .ai/<project>/...; the committed new file is git show <IMPL_SHA>:<path>; the old is git show <IMPL_SHA>^:<path>). Render those references to PNG and compare the tight crop against both. If the rendered target matches the OLD art, or you cannot tell them apart, that is a FAIL, not a pass. (This is the check that catches a change that never took effect, e.g. an asset that wasn't rebuilt into the binary.) For a task the wrapper marked Visual: layout, matching-the-art is necessary but NOT sufficient: also verify the numeric design contract in <TASK_DIR>/visual.md (sizes, spacings, alignment); see "Visual contract".
- Write the checks to <TASK_DIR>/test.md BEFORE running (format under "Test report"), so the design is explicit and Actual/Result can be filled in per check afterward.

## Visual contract

When the wrapper marks a task Visual: layout, "looks right" is not a vibe; it is a small
computation, and the test MEASURES it. The wrapper's design-spec phase writes the contract to
<TASK_DIR>/visual.md; impl builds to it; this loop verifies it. (Tasks marked Visual: appearance
or unmarked use the ordinary visual/asset check above — this section does not apply.)
A mockup (usually mobile) gives RELATIONSHIPS, never pixels. The contract re-expresses those
relationships in desktop units by anchoring every quantity to a font metric or an existing tdesktop
.style token — so it auto-adjusts to desktop and reuses real components. The strongest anchor is an
existing widget: "the count badge IS the dialogs-list unread badge" pins font + height + padding to
st::dialogsUnread* and is self-correcting — far better than "a blue circle ~24px".
Write it as an ORDERED DERIVATION: each step resolves one quantity the next consumes, so impl and test are both mechanical. Example — a glyph-on-rounded-square icon + title + count, in a bubble:
```text
Anchor: T = st::<title>.font->height ; Badge := the dialogs unread-badge metrics
1. glyphH = 1.4·T ±2px             // white glyph box height (from T)
2. square = glyphH ÷ (2/3) ±2px    // accent rounded-square side ; iconR = square·0.28
3. margin m (equal on square's top/left/bottom) ; bubbleH = square + 2·m ±1px
   bubbleR = bubbleH/2 ; iconR : bubbleR must read as in-sync (icon proportionally smaller)
4. titleY = (bubbleH − T)/2 ±1px   // title vertically centered in the bubble
5. badge  = Badge (font+height+padding) ; vertically centered ; margins top=right=bottom equal ±1px
```
Then the RELATIONSHIP checks that catch what existence-checks miss — each falsifiable: square ≤ bubbleH (no overflow/overlap), the square's three margins equal, the two corner radii in sync, the
badge identical to a real chat-row unread badge. Note every mobile→desktop adjustment and which token
replaced each mobile measurement.
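
The example derivation above is arithmetic, so checking it can be mechanical. The sketch below shows one way to do that on plain numbers. `Measured`, `Within` and `CheckContract` are hypothetical names, and in a real overlay the values would come from the live widgets rather than literals. The badge and corner-radius lines are left out to keep it short.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Measured geometry in pixels (hypothetical struct for this sketch).
struct Measured {
    double T;        // title font height, the anchor
    double glyphH;   // glyph box height
    double square;   // rounded-square side
    double margin;   // margin between the square and the bubble edge
    double bubbleH;  // bubble height
    double titleY;   // title top, relative to the bubble
};

inline bool Within(double actual, double target, double tolerance) {
    return std::fabs(actual - target) <= tolerance;
}

// Returns the contract lines that are out of tolerance; empty means pass.
inline std::vector<std::string> CheckContract(const Measured &m) {
    auto failed = std::vector<std::string>();
    if (!Within(m.glyphH, 1.4 * m.T, 2.0)) {
        failed.push_back("1. glyphH = 1.4*T");
    }
    if (!Within(m.square, m.glyphH / (2.0 / 3.0), 2.0)) {
        failed.push_back("2. square = glyphH / (2/3)");
    }
    if (!Within(m.bubbleH, m.square + 2.0 * m.margin, 1.0)) {
        failed.push_back("3. bubbleH = square + 2*m");
    }
    if (!Within(m.titleY, (m.bubbleH - m.T) / 2.0, 1.0)) {
        failed.push_back("4. titleY = (bubbleH - T)/2");
    }
    // Relationship check: the square must not overflow the bubble.
    if (m.square > m.bubbleH) {
        failed.push_back("square <= bubbleH");
    }
    return failed;
}
```

Returning the failing lines, rather than a single boolean, gives the measured-vs-target detail an IMPL_BUG report needs.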
How TEST verifies it (numbers over eyes):

- Log font->height and the QRect of each piece (glyph, square, bubble, title, badge), and assert each derivation line arithmetically within tolerance. Live-widget geometry is the primary oracle; it deterministically catches "icon taller than the bubble", "square overflows", "badge oversized / cramped". Where a rect can't be logged, measure it from a tight crop by colour (accent square, badge, bubble outline are separable).
- A Visual: layout check APPROVES only when the measured geometry satisfies the contract; any line out of tolerance is an IMPL_BUG (report measured-vs-target) and loops like any other.

The overlay is ad-hoc, authored fresh against the CURRENT implementation, injected at the highest level that still exercises the change (often a direct data-layer call like item->applyEdition(...) rather than a faked MTP response). It must:

- Live inside #ifdef _DEBUG blocks.
- Choose a strategy: live-data (use real account data) · live-mutate (really create an entity; prefer a throwaway target, clean up after) · inject (build fake local state without the network) · mock-api (intercept specific requests, return canned responses; for payments/destructive). Prefer inject over live-mutate to avoid account/server accumulation and flake.
- Log to <TASK_DIR>/test_log.txt (open Append|Text, flush after each write) and save screenshots to <TASK_DIR>/screenshots/. Delete the old log at the first step.
- screenshots/<name>_old.png, <name>_new.png) so the assessment is a direct three-way comparison, not a memory test.
- Log these markers: TEST_STEP: <desc> · TEST_RESULT: PASS: <what> / TEST_RESULT: FAIL: <what> - <details> · SCREENSHOT: <full path> · TEST_COMPLETE (immediately before quit).
- Arm a QTimer at scenario start that force-quits (Core::Quit(), and if needed std::abort after a flush) at a hard wall-clock cap (default 120s). This guarantees the app never hangs holding a lock on the exe, independent of the runner's own timeout.
- End with TEST_COMPLETE then Core::Quit().

Telegram's custom widgets (Ui::InputField, Ui::FlatLabel, Ui::RpWidget, boxes, buttons, …)
do NOT declare Q_OBJECT — they have no own meta-object. So QObject::findChildren<T*>() does
not filter by type for them: with no distinct meta-object it matches the nearest moc'd base
(QWidget), i.e. it returns every child widget blindly cast to T*. The moment you use one as
T (e.g. call InputField::setFocused() / rawTextEdit() on what is really a VerticalLayout) you
get a raw SIGSEGV — the debugger shows this with the wrong dynamic type. A clean rebuild does NOT
fix it; it is a real bug in the overlay, not a stale build.

Do not call findChildren<Ui::SomeCustomWidget*>(). Instead enumerate findChildren<QWidget*>() (QWidget is Q_OBJECT, so that call is sound and returns all descendants) and dynamic_cast<Ui::SomeCustomWidget*>() each, keeping the non-null results; C++ RTTI identifies the real type regardless of Q_OBJECT. A reusable helper:

```cpp
// Collects every descendant of root whose dynamic type is T (or derived from T).
template <typename T>
[[nodiscard]] std::vector<T*> FindWidgets(QWidget *root) {
    auto out = std::vector<T*>();
    for (const auto w : root->findChildren<QWidget*>()) {
        if (const auto t = dynamic_cast<T*>(w)) out.push_back(t);
    }
    return out;
}
```

Q_OBJECT types (QWidget, QLabel, QLineEdit, …) are safe to pass directly to findChildren<T*>().

The Windows launcher changes the working directory to the exe folder before the app runs, so a
relative overlay log path (<TASK_DIR>/test_log.txt) silently fails to write (QFile won't
create missing parents) — the run looks "clean" but produces no evidence. Resolve <TASK_DIR> to an
absolute path up front (e.g. QDir::current().absoluteFilePath(...) computed at inject time, or an
absolute path baked into the overlay) so flushes actually land; likewise for screenshots.

- Save the overlay: git diff > <TASK_DIR>/test-overlay.patch. Then reset the checkout back to the implementation commit so it stays impl-only: git reset --hard <IMPL_SHA> (and git submodule update --init --recursive if the overlay touched submodules). The overlay never enters an impl commit.
- Re-apply after an impl fix: git apply --3way <TASK_DIR>/test-overlay.patch. This succeeds ~90% of the time when the tail change was small.
- If it does not apply, re-author from test<n>.md (which records intent: injection point, fake values, assertions) rather than fighting the conflict markers. Scenario steps that only call public APIs should live in their own block so they never conflict; only true in-situ injections land inside impl files.

Build with BUILD. A single changed TU compiles fast; only the overlay-touched files + link
rebuild between rounds. On LNK1104/C1041, run the path-scoped kill (Test account → "Serialize
app runs"), wait, retry once; if it persists -> UNRECOVERABLE.
Codegen does not track resource mtimes. If the task changed only a resource the style codegen
consumes (an icon .svg, etc.) without touching a .style, an incremental build will NOT re-pack
it and the binary keeps the OLD asset. Before building such a task force regeneration — touch the
referencing .style (or clean the codegen output) — so the change actually ships. A render that
shows no difference from before is the symptom of skipping this.
Run, in order:

1. Run the SETUP steps (Test account).
2. Launch EXE with -testagent in the background, redirecting stderr to <TASK_DIR>/app_stderr.txt and stdout to <TASK_DIR>/app_stdout.txt (see "Crashes & assertions" below: this flag is what stops a crash from hanging on a modal dialog, and the stderr redirect is what captures the assertion text).
3. Start a hard wall-clock deadline (~90s) from launch.
4. Poll test_log.txt every ~5s; on each SCREENSHOT: read the image and judge it.
5. Stop on TEST_COMPLETE (success), on process death (crash), or on a hang (no new output for the watchdog cap, or the hard deadline elapsing).
6. Path-scoped kill of any straggler (Test account → "Serialize app runs").
7. Optional CLEANUP.
8. Save the overlay (git diff > <TASK_DIR>/test-overlay.patch), THEN git reset --hard <IMPL_SHA> (back to impl-only; the patch must be saved before this reset).

On Windows, launch and capture both streams like:
```powershell
$exe = (Resolve-Path "$EXE").Path
Start-Process -FilePath $exe -ArgumentList '-testagent' `
    -RedirectStandardError "$TASK_DIR/app_stderr.txt" `
    -RedirectStandardOutput "$TASK_DIR/app_stdout.txt" -PassThru
```
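
The polling step reduces to reading the marker lines and deciding whether the run is still going. The following is a rough sketch of that logic on an in-memory log. `LogSummary`, `ParseTestLog` and `Classify` are hypothetical names, and the "no new output for the watchdog cap" condition is omitted, so only the hard deadline counts as a hang here.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct LogSummary {
    int steps = 0;
    int passed = 0;
    int failed = 0;
    bool complete = false;
    std::vector<std::string> screenshots;  // paths to read and judge
};

inline bool StartsWith(const std::string &line, const std::string &prefix) {
    return line.compare(0, prefix.size(), prefix) == 0;
}

// Counts the overlay's marker lines in the text of test_log.txt.
inline LogSummary ParseTestLog(const std::string &text) {
    auto out = LogSummary();
    std::istringstream stream(text);
    for (std::string line; std::getline(stream, line);) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (StartsWith(line, "TEST_STEP: ")) {
            ++out.steps;
        } else if (StartsWith(line, "TEST_RESULT: PASS: ")) {
            ++out.passed;
        } else if (StartsWith(line, "TEST_RESULT: FAIL: ")) {
            ++out.failed;
        } else if (StartsWith(line, "SCREENSHOT: ")) {
            out.screenshots.push_back(line.substr(12));  // text after "SCREENSHOT: "
        } else if (line == "TEST_COMPLETE") {
            out.complete = true;
        }
    }
    return out;
}

enum class RunOutcome { Running, Completed, Crashed, Hung };

// Decides the run state from the log, process liveness and elapsed time.
inline RunOutcome Classify(
        const LogSummary &log,
        bool processAlive,
        double secondsSinceLaunch,
        double deadlineSeconds = 90.0) {
    if (log.complete) return RunOutcome::Completed;
    if (!processAlive) return RunOutcome::Crashed;
    if (secondsSinceLaunch >= deadlineSeconds) return RunOutcome::Hung;
    return RunOutcome::Running;
}
```

A `Completed` outcome only says the scenario reached its end; whether the checks passed is still decided in ASSESS from the results and screenshots.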

## Crashes & assertions (-testagent)

A Debug build normally turns a failed std::vector bounds check, a bad iterator, an assert(), a pure-virtual call, or abort() into a modal Abort / Retry / Ignore dialog. That dialog blocks the process forever: the agent sees no TEST_COMPLETE, no process death, just a hang until the watchdog cap, and learns nothing about the cause. -testagent removes those dialogs. With it set, the binary:

- abort() message box (no button to press, never hangs);
- <TASK_DIR>/app_stderr.txt, tagged [testagent];
- -testagent implies -debug).

Do NOT key the crash decision on exit code. Breakpad handles the crash and the process usually
exits 0 — exactly as tdesktop's own crash detection assumes. The reliable crash signals are: the
process is gone WITHOUT a TEST_COMPLETE marker, AND a fresh non-empty
<workdir>/tdata/working exists. So always pass -testagent, and on a crash gather diagnostics
in this order before deciding the verdict:

1. <TASK_DIR>/app_stderr.txt: the [testagent] assert: … line gives the failed expression and file:line (e.g. vector(1931) : … vector subscript out of range). Usually enough to localize.
2. <workdir>/tdata/working: the crash report the reporter wrote, with the Assertion: / CrtAssert: annotations, the failed file:line, and Caught signal … / minidump id. Plain text; read it directly. <workdir> is the launch -workdir (in portable test runs, out/Debug/TelegramForcePortable/).
3. <workdir>/tdata/dumps/*.dmp: the minidump (full stack, needs symbols to read; note its path in test.md, don't try to symbolize inline).

A crash is an IMPL_BUG (the implementation tripped an assertion / dereferenced out of range), not
a TEST_FLAW, unless the overlay itself is what reached out of bounds — quote the [testagent] line
and the tdata/working excerpt in test.md as evidence, and feed the expression + file:line to the
impl-fix agent as the Root cause / Fix hint. Only a crash with NO usable diagnostic after one retry
is UNRECOVERABLE.
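
The crash signal described above (process gone, no TEST_COMPLETE, non-empty tdata/working) can be checked without looking at the exit code. Below is a small sketch under stated assumptions: `ReadCrashReport` and `LooksLikeCrash` are hypothetical names, and the freshness of the report is not verified, since SETUP is assumed to have recreated the live folder before the run.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Reads <workdir>/tdata/working; returns an empty string if it is missing or empty.
inline std::string ReadCrashReport(const fs::path &workdir) {
    std::ifstream file(workdir / "tdata" / "working", std::ios::binary);
    if (!file) return std::string();
    return std::string(
        std::istreambuf_iterator<char>(file),
        std::istreambuf_iterator<char>());
}

// True when the run ended in a crash by the signals above.
// The exit code is deliberately not an input.
inline bool LooksLikeCrash(
        bool processAlive,
        bool sawTestComplete,
        const fs::path &workdir) {
    return !processAlive
        && !sawTestComplete
        && !ReadCrashReport(workdir).empty();
}
```

The text returned by `ReadCrashReport` is also what would be excerpted into test.md as evidence.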
A run that never reaches TEST_COMPLETE and never dies is a hang. Two independent guards catch it:

- -testagent force-enables the built-in DeadlockDetector: a ping thread that, if the main/event loop stops responding (a genuine deadlock or an infinite loop on the UI thread), raises Unexpected("Deadlock found!") from a side thread. That crashes through the same reporter, so the frozen main-thread stack is captured in the minidump and the process exits on its own (key on the tdata/working report, not the exit code), following the same diagnostics path as a crash above. No agent action needed beyond reading tdata/working / the dump. Detection is within ~30–90s of the stall.
- Core::Quit(). For that the runner enforces a hard wall-clock deadline (~90s) from launch and, when it elapses, does the path-scoped kill regardless of output. No legitimate auto-test runs anywhere near a minute, so this cap is pure backstop, but it is what guarantees the agent can never wedge forever.

Classify by which guard tripped: a DeadlockDetector crash with a real main-thread stack in app code is an IMPL_BUG; the external cap firing is almost always a TEST_FLAW (the overlay didn't drive to TEST_COMPLETE/quit), so re-author the overlay, unless the captured stack/log shows the implementation itself wedged, in which case it is an IMPL_BUG. Two external-cap kills in a row with the same signature → BLOCKED (early-escalation rule).

## Leave no test binary behind

The on-disk EXE (out/Debug/Telegram.exe) always contains the compiled overlay after a test run: git reset --hard only reverts the source, not the built binary. So when the loop reaches a TERMINAL verdict (APPROVED, BLOCKED, UNRECOVERABLE, or attempt cap), after the final path-scoped kill and git reset --hard <IMPL_SHA>, delete the built EXE so no overlay-laden test binary is left for the user to launch by mistake:

```powershell
Remove-Item -Force "$EXE"
```

A clean, feature-ready binary is one BUILD away on demand. (Delete only on terminal exit; between attempts the next round rebuilds the overlay, so the binary is reused there.)

## Assessing

ASSESS decides APPROVED / TEST_FLAW / IMPL_BUG. Default to not approved; a check passes only on positive, specific evidence (in the captured pixels or the log) that the change is present AND correct.

- _old.png vs _new.png). Do not narrate expectations.
- Visual: layout task you DO assert the rendered widget's measured geometry against the desktop-unit contract in visual.md ("Visual contract"). That is numeric and falsifiable, and it is exactly the check that catches the wrong proportions/spacings an eye waves through.

## Test report (<TASK_DIR>/test.md): human-readable, append per attempt

The file the human opens to see how testing went. The test-author writes the checks (Expected / Oracle / Observed via) BEFORE running; ASSESS fills Actual / Result and the verdict. Append a new ## Attempt section each round; never overwrite prior attempts.

```markdown
# Test report — <project>/<letter>: <title>

## Attempt <n> — commit <sha> — strategy <...> — verdict: <APPROVED|TEST_FLAW|IMPL_BUG|UNRECOVERABLE>

### Test 1 — <aspect of THIS change>
- Expected: <observable effect the change should produce>
- Oracle: <what would make this check FAIL>
- Observed via: <surface + how captured: tight crop of widget X; refs _old/_new>
- Actual: <what is literally visible / logged>
- Screenshots: screenshots/<after>.png (refs: _old.png, _new.png)
- Result: PASS | FAIL

### Test 2 — ...

### Verdict reasoning
<1-3 lines tying the checks to the verdict>

### Root cause / Fix hint (only if IMPL_BUG — the impl-fix agent reads this)

### Failure signature (one line, for early-escalation comparison)
```

```text
TASK: <TASK_ID>
STATUS: <DONE|BLOCKED>
VERDICT: <APPROVED|reason if blocked>
ATTEMPTS: <n>
TOUCHED: <repo paths or none>
DISCOVERED: <new follow-up tasks to append to implementing.md, or none>
NOTES: <one or two lines, or none>
```

Detailed reasoning stays in .ai/ artifacts. The chat reply is only this block.