skills/terminal-bench-loop/SKILL.md
A repeatable operating skill for driving one Terminal-Bench problem to a passing smoke through Paperclip, with explicit issue topology, bounded runs, board-gated product fixes, and worktree continuity.
This skill is operational + diagnostic, not engineering. It coordinates issues, artifacts, and approvals around a Terminal-Bench loop. It does not authorize code changes — every accepted product fix lands as a separate implementation child issue after a board confirmation.
Canonical execution model: read doc/execution-semantics.md before starting a loop or moving any loop issue. Every loop issue must rest in a state the doc allows: terminal (done/cancelled), explicitly live (active run / queued wake), explicitly waiting (in_review with participant/interaction/approval), or explicit recovery/blocker (blocked with blockedByIssueIds and a named owner).
Trigger on an assignment whose title or body matches any of:
Also use when the user hands you an existing top-level loop issue and asks for the next iteration, diagnosis, or rerun.
- paperclip-bench itself (Harbor adapter, wrapper, telemetry). Use normal engineering flow on that repo.

Every loop iteration and every proposed product fix must hold these three invariants together. They come from /diagnose-why-work-stopped and the user has restated them across the liveness work:
- in_review with nothing waiting on it.

If a proposed iteration violates any of the three, drop it or rework it. State explicitly in the loop issue how each invariant is held this iteration.
Collect these on the top-level loop issue before iteration 1. Any input that cannot be supplied is a blocker — name the unblock owner and stop.
- terminal-bench/fix-git). Multi-task suites are out of scope for this skill.
- inheritExecutionWorkspaceFromIssueId or equivalent.
- paperclip-bench invocation, including the PAPERCLIPAI_CMD (or equivalent) binding pinned to the Paperclip App worktree under test. Record verbatim on the loop issue.
- PAPERCLIP_HARBOR_RUNNER_CONFIG JSON (or equivalent config file) verbatim enough to preserve: assignee, heartbeat_strategy, agent_adapter / agent_adapters, reuse_host_home when local credentials are intentionally needed, and the stop budget. A bare Harbor command that creates BEN-1 as unassigned todo with zero heartbeat-enabled agents is a harness/setup failure, not a valid product diagnosis.
- paperclip-bench writes run artifacts (manifest, results.jsonl, Harbor raw job folders, redacted telemetry). Each iteration appends; nothing is overwritten.
- request_confirmation; CTO if delegated; never the loop driver alone).

Record each input on the top-level loop issue (description or a dedicated inputs document). If any input changes mid-loop, note the change and the iteration it took effect.
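As a rough illustration of the dispatch-config input, the sketch below checks that a recorded PAPERCLIP_HARBOR_RUNNER_CONFIG JSON string still carries the fields this skill asks you to preserve. The field names assignee, heartbeat_strategy, agent_adapter / agent_adapters, and reuse_host_home come from the list above. The key name stop_budget, every sample value, and the helper name missing_dispatch_fields are assumptions made for illustration, not the real schema.

```python
import json

# Field names taken from the inputs list; "stop_budget" is an assumed key name.
REQUIRED_FIELDS = ("assignee", "heartbeat_strategy", "stop_budget")
# Either spelling of the adapter field is treated as sufficient.
ADAPTER_FIELDS = ("agent_adapter", "agent_adapters")


def missing_dispatch_fields(raw_config: str) -> list:
    """Return the dispatch fields that are absent from a recorded runner config."""
    config = json.loads(raw_config)
    missing = [name for name in REQUIRED_FIELDS if name not in config]
    if not any(name in config for name in ADAPTER_FIELDS):
        missing.append("agent_adapter")
    return missing


# Hypothetical values, for illustration only.
recorded = json.dumps({
    "assignee": "example-agent",
    "heartbeat_strategy": "example-strategy",
    "agent_adapter": "example-adapter",
    "reuse_host_home": False,
    "stop_budget": {"max_minutes": 30},
})

# An empty config stands in for a bare Harbor command with no dispatch config.
bare = json.dumps({})
```

A non-empty result from a check like this would be recorded as a harness/setup miss on the loop issue rather than diagnosed as a product failure.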
The loop must be representable as a tree, not as prose in comments:
- in_progress while an iteration is running, in_review only when a typed waiter sits directly on the loop parent (execution-policy participant, request_confirmation / ask_user_questions / suggest_tasks interaction, approval, or named human owner), blocked with blockedByIssueIds while a child issue is the gating work (iteration child holding the fix-proposal request_confirmation, or implementation, QA, or CTO review children), done on pass, or cancelled on board-rejection / budget exhaustion.
- /diagnose-why-work-stopped), a fix-proposal document with a request_confirmation interaction, and, only after acceptance, implementation, QA, CTO review, and rerun children. Iteration children are blocked by their predecessors so the executor wakes them in order.
- inheritExecutionWorkspaceFromIssueId so the same worktree is amended and tested.

Wire dependencies with blockedByIssueIds, never with prose like "blocked by X". When a dependent child is done, the executor auto-wakes the next.
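The ordering rule above can be sketched as a small helper that turns an ordered list of child issues into blockedByIssueIds values, so each child is blocked by its predecessor. The issue ids and the function name chain_blocked_by are hypothetical; only the blockedByIssueIds field name comes from this skill.

```python
def chain_blocked_by(child_ids: list) -> dict:
    """Map each child id to the blockedByIssueIds list that serialises the chain."""
    wiring = {}
    previous = None
    for child_id in child_ids:
        # The first child has no predecessor; every later child waits on the one before it.
        wiring[child_id] = [previous] if previous is not None else []
        previous = child_id
    return wiring


# Hypothetical ids for the post-acceptance children of one iteration.
children = ["impl-1", "qa-1", "cto-review-1", "rerun-1"]
wiring = chain_blocked_by(children)
```

Expressing the chain as data keeps the dependency visible to the executor instead of burying it in comments.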
Before opening or advancing a loop, read doc/execution-semantics.md. Use that document's terms intact when classifying loop-issue state: live path / waiting path / recovery path; post-run disposition; bounded continuation; productivity review; pause-hold; watchdog. Do not invent a new state.
- `Terminal-Bench loop: <task-name>`. Description captures the inputs above, the iteration budget, and a link to the source issue.
- `Iteration N: <task-name>`. Its description repeats the inputs and references the loop parent. Block it on the prior iteration's terminal child (if any) so the executor cannot start two iterations in parallel.
- cancelled (budget exhausted) or in_review if the user must decide whether to extend the budget.
- PAPERCLIPAI_CMD (or the equivalent command binding) to the CLI entrypoint inside that worktree. Never let the smoke run against the operator's current Paperclip checkout.
- PAPERCLIP_HARBOR_RUNNER_CONFIG with the intended assignee, heartbeat strategy, agent adapter, credential/home mode, and stop budget. Do not treat a bare uvx harbor run ... as the canonical smoke if it omits the dispatch config; record that as a harness/setup miss and rerun with the recorded config.
- run document:
  - results.jsonl row, Harbor raw job folder
  - PAPERCLIP_HARBOR_RUNNER_CONFIG or equivalent), including assignee and adapter type

Apply the /diagnose-why-work-stopped pattern to the iteration's run, scoped to this loop only; do not pull in unrelated forensic boilerplate. Specifically:
- (issue, status) combination that stopped progress. Quote evidence: run ids, comment timestamps, status transitions.

Record the diagnosis on the iteration child as a diagnosis document. Do not propose code yet.
Based on the diagnosis, the iteration ends in exactly one of these terminal-for-iteration states:
- plan document on the iteration child, then go to Step 6.
- blocked, set blockedByIssueIds to the blocker issue (creating one if needed), and name the unblock owner. Stop.
- cancelled with a comment that summarizes the run history and the reason for stopping.

When the iteration ends in product fix proposed:
- plan document with the proposed contract, the three-invariant check, the affected Paperclip surfaces, and the phased subtasks (implementation, QA, CTO review, rerun), but do not create those subtasks.
- request_confirmation interaction on the iteration child (the same issue that owns the plan document), targeting the latest plan revision. Idempotency key: confirmation:{iterationIssueId}:plan:{revisionId}. Set continuationPolicy to wake_assignee.
- in_review. The typed waiter (the request_confirmation interaction) sits directly on it, so its in_review is healthy. Comment links the plan document and names the pending confirmation.
- blocked with blockedByIssueIds: [iterationChildId] and a comment naming the board (or whichever approver the approval policy designates) as the unblock owner. Do not move the loop parent to in_review here: the typed waiter lives on the iteration child, not on the parent, so the parent's wait path is the child blocker. This matches the topology rule that the loop parent only sits in in_review when a typed waiter is attached directly to the parent.
- blockedByIssueIds already points at the iteration child, so it does not need to change.
- blockedByIssueIds wired in order, and update the loop parent's blockedByIssueIds to point at the new gating child (typically the implementation child) so the parent stays blocked against real downstream work. The implementation child must inherit the Paperclip App execution workspace (inheritExecutionWorkspaceFromIssueId to the worktree-owning issue) so the fix lands in the same isolated worktree the smoke ran against.

After implementation and QA complete (or immediately, in the non-product failure with retry case), the rerun child runs the same paperclip-bench invocation with PAPERCLIPAI_CMD still pinned to the Paperclip App worktree under test.
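The confirmation step above fixes two concrete values: the idempotency key format and the continuation policy. The sketch below builds both. The key format and the continuationPolicy value come from this skill; every other payload field name (kind, issueId, targetRevisionId, idempotencyKey), the sample ids, and the function names are assumptions for illustration and may not match the real Paperclip API.

```python
def confirmation_idempotency_key(iteration_issue_id: str, revision_id: str) -> str:
    """Build the key in the form confirmation:{iterationIssueId}:plan:{revisionId}."""
    return f"confirmation:{iteration_issue_id}:plan:{revision_id}"


def build_confirmation_request(iteration_issue_id: str, revision_id: str) -> dict:
    """Assemble an illustrative request_confirmation payload for the iteration child."""
    return {
        "kind": "request_confirmation",        # assumed field name
        "issueId": iteration_issue_id,         # assumed field name
        "targetRevisionId": revision_id,       # assumed field name
        "idempotencyKey": confirmation_idempotency_key(iteration_issue_id, revision_id),
        "continuationPolicy": "wake_assignee",  # value named in this skill
    }


# Hypothetical ids.
request = build_confirmation_request("iter-7", "rev-3")
```

Because the key includes the plan revision, re-sending the request for the same revision repeats the same key, while a revised plan produces a new one.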
When the smoke passes:
- blocked with blockedByIssueIds set to the QA / CTO review chain, and post a comment that names QA and CTO as the unblock owners and links the children. The loop parent stays blocked, not in_review, because the typed waiter lives on the children, not on the parent.
- in_review during this phase (for example because a board user has explicitly volunteered to drive the review), put a typed waiter directly on the parent (execution-policy participant, request_confirmation / ask_user_questions / suggest_tasks interaction, approval, or named human owner) and do not rely on the child chain alone. Do not combine in_review on the parent with QA/CTO children acting as the blocker; that is the ambiguous review shape this skill exists to prevent.
- results.jsonl, Harbor raw job, redacted telemetry) and the rerun reproducibility against the same worktree.

The loop must stop, with state explicitly recorded on the loop issue, when any of these is true:
- done.
- cancelled. Comment names the rejected proposal and the reason.
- cancelled (or in_review if the user must decide whether to extend the budget). Never silently start iteration N+1.
- blocked with blockedByIssueIds to the blocker issue and the unblock owner named.

A loop must never end on a prose comment alone. Every stop is a status transition with a named next-action owner.
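One way to read the stop rules above is as a mapping from a stop reason to a status transition. The sketch below encodes that reading. The statuses and blockedByIssueIds come from this skill; the StopReason member names, the unblockOwner key, and the function signature are illustrative assumptions.

```python
from enum import Enum


class StopReason(Enum):
    # Member names are illustrative labels, not identifiers from Paperclip.
    SMOKE_PASSED = "smoke_passed"
    PROPOSAL_REJECTED = "proposal_rejected"
    BUDGET_EXHAUSTED = "budget_exhausted"
    BLOCKER = "blocker"


def stop_transition(reason, user_decides_budget=False, blocker_issue_id=None, owner=None):
    """Return the status transition a loop issue should take for a stop reason."""
    if reason is StopReason.SMOKE_PASSED:
        return {"status": "done"}
    if reason is StopReason.PROPOSAL_REJECTED:
        return {"status": "cancelled"}
    if reason is StopReason.BUDGET_EXHAUSTED:
        # in_review is only valid here when the user is the waiter deciding on more budget.
        return {"status": "in_review" if user_decides_budget else "cancelled"}
    if reason is StopReason.BLOCKER:
        if blocker_issue_id is None or owner is None:
            raise ValueError("a blocked stop needs a blocker issue and a named unblock owner")
        return {
            "status": "blocked",
            "blockedByIssueIds": [blocker_issue_id],
            "unblockOwner": owner,  # assumed key; the skill only requires the owner be named
        }
    raise ValueError(f"unknown stop reason: {reason!r}")
```

Raising on a blocker without an issue or owner mirrors the rule that a stop is never a prose comment alone.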
The loop must not test whatever Paperclip checkout happens to be current for the heartbeat. It must test the same isolated Paperclip App worktree where proposed fixes are applied.
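A pre-launch check for this rule might look like the sketch below, which tests whether a CLI entrypoint path sits inside the worktree under test. It assumes PAPERCLIPAI_CMD holds a plain filesystem path, which this skill does not confirm; the paths and the function name is_pinned_to_worktree are hypothetical.

```python
from pathlib import Path


def is_pinned_to_worktree(cli_entrypoint: str, worktree_root: str) -> bool:
    """Return True when the entrypoint path resolves to a location inside the worktree."""
    entry = Path(cli_entrypoint).resolve()
    root = Path(worktree_root).resolve()
    # Comparing path components avoids the prefix trap where /wt/app-2 looks like /wt/app.
    return root in entry.parents


# Hypothetical paths, for illustration only.
pinned = is_pinned_to_worktree("/tmp/wt/app/bin/cli.js", "/tmp/wt/app")
sibling = is_pinned_to_worktree("/tmp/wt/app-2/bin/cli.js", "/tmp/wt/app")
```

A component-wise comparison is used instead of a string prefix test so that a sibling checkout with a similar name is not mistaken for the worktree under test.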
- inheritExecutionWorkspaceFromIssueId to that worktree-owning issue, so all subsequent loop work shares one workspace.
- PAPERCLIPAI_CMD (or the equivalent command binding) to the CLI entrypoint inside that worktree, and it carries the recorded dispatch runner config (PAPERCLIP_HARBOR_RUNNER_CONFIG or equivalent) needed to assign the benchmark issue and start the heartbeat. The benchmark command stored on the loop issue is the source of truth: if a heartbeat needs to run the smoke from a different shell, it copies the recorded command block verbatim, not only the Harbor invocation line.
- blocked and name the unblock owner (typically CodexCoder or the Paperclip App owner).

Every loop issue, at the end of every heartbeat, must rest in one of:
- done or cancelled. No further action.
- in_progress with an active run, an upcoming queued wake, or a child issue actively executing under it.
- in_review with a typed waiter: execution-policy participant, request_confirmation / ask_user_questions / suggest_tasks interaction, approval, or a named human owner.
- blocked with blockedByIssueIds set to a real blocking issue, plus a comment naming the unblock owner and the action needed.

If a loop issue does not fit one of these on exit, the heartbeat is not done. Fix the state before exiting.
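The four resting states above lend themselves to a mechanical exit check. The following is a minimal sketch under stated assumptions: the LoopIssue shape and its field names are invented for illustration, and only the status values and the four rules come from this skill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoopIssue:
    """Illustrative snapshot of a loop issue at heartbeat exit (field names are assumed)."""
    status: str
    has_active_run: bool = False
    has_queued_wake: bool = False
    has_executing_child: bool = False
    typed_waiters: List[str] = field(default_factory=list)
    blocked_by_issue_ids: List[str] = field(default_factory=list)
    unblock_owner: Optional[str] = None


def resting_state_ok(issue: LoopIssue) -> bool:
    """Return True when the issue rests in one of the four allowed states."""
    if issue.status in ("done", "cancelled"):
        return True
    if issue.status == "in_progress":
        return issue.has_active_run or issue.has_queued_wake or issue.has_executing_child
    if issue.status == "in_review":
        # in_review is only healthy with at least one typed waiter attached.
        return bool(issue.typed_waiters)
    if issue.status == "blocked":
        return bool(issue.blocked_by_issue_ids) and issue.unblock_owner is not None
    # Any other status is not an allowed resting state for a loop issue.
    return False
```

For example, `resting_state_ok(LoopIssue("in_review"))` is false because nothing is waiting on the issue, which is the silent-review shape this skill treats as a liveness violation.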
- PAPERCLIPAI_CMD and verify the path before launching the run.
- PAPERCLIP_HARBOR_RUNNER_CONFIG (or equivalent) may boot Paperclip and create BEN-1, but leave it unassigned with zero heartbeat-enabled agents. That is not a Terminal-Bench product signal. Preserve and rerun the full command block, including assignee and adapter config.
- plan document. Do not push code in the diagnostic phase.
- in_review mean done. A loop or iteration child sitting in in_review with no participant, no interaction, no approval, and no human owner is a stop, not progress. Treat it as a liveness violation and route it.
- /diagnose-why-work-stopped for a product-rule fix.
- PAPERCLIPAI_CMD binding, and dispatch runner config.
- results.jsonl, Harbor raw job folder, and stop reason.
- /diagnose-why-work-stopped pattern, classifies every non-progressing issue, and checks the three invariants.
- request_confirmation is open against the latest plan revision.

Run this smoke after installing or changing the skill, before treating it as operational for a live Terminal-Bench loop:
pnpm smoke:terminal-bench-loop-skill
The command uses the current Paperclip API token and company from PAPERCLIP_API_URL, PAPERCLIP_API_KEY, and PAPERCLIP_COMPANY_ID. When PAPERCLIP_TASK_ID is set, it attaches the smoke issues under that source issue and inherits its project/goal context. By default it cancels the short-lived smoke issues after verification; pass -- --keep to leave the verified blocked loop parent, in_review iteration child, and pending confirmation available for manual inspection.
The smoke is deterministic and intentionally non-comparable. It does not start Terminal-Bench, Harbor, an agent model, or a provider runtime. It verifies only the control-plane shape:
- skills/terminal-bench-loop/SKILL.md contains the loop contract terms;
- run document;
- diagnosis document names the exact stop point and next-action owner;
- request_confirmation interaction is created and the iteration child rests in in_review with a typed waiting path rather than silent review.