Back to Plate

Auto Review

templates/plate-playground-template/.agents/skills/autoreview/SKILL.md

53.1.320.9 KB
Original Source

Auto Review

Run the bundled structured review helper as a closeout check. This is code review, not Guardian auto_review approval routing.

Codex review is the default when no engine is set. It uses gpt-5.5 by default, usually delivers the best review results, and should remain the normal final closeout engine. Claude review is optional and uses claude-fable-5 by default.

Use when:

  • user asks for Codex review / Claude review / Pi review / Droid review / OpenCode review / autoreview / second-model review
  • after non-trivial code edits, before final/commit/ship
  • reviewing a local branch or PR branch after fixes

Contract

  • Treat review output as advisory. Never blindly apply it.
  • Verify every finding by reading the real code path and adjacent files.
  • Read dependency docs/source/types when the finding depends on external behavior.
  • Reject unrealistic edge cases, speculative risks, broad rewrites, and fixes that over-complicate the codebase.
  • Prefer small fixes at the right ownership boundary; no refactor unless it clearly improves the bug class.
  • When an accepted finding shows a bug class or repeated pattern, inspect the current PR scope for sibling instances before fixing.
  • Fix the scoped bug class at once when practical; stop at touched surfaces, owner boundaries, and clear follow-up territory.
  • Keep going until structured review returns no accepted/actionable findings.
  • If a review-triggered fix changes code, rerun focused tests and rerun the structured review helper.
  • For security-audit suppression changes, verify accepted findings remain auditable: suppressed findings stay in structured output, active output keeps an unsuppressible suppression notice, and aggregate findings cannot hide unrelated active risk.
  • Never switch or override the requested review engine/model. If the review hits model capacity, retry the same command a few times with the same engine/model.
  • Be patient with large bundles. Structured review can take up to 30 minutes while the model call is active, especially with Codex tools or web search.
  • Treat heartbeat lines like review still running: ... elapsed=... pid=... as healthy progress, not a hang. Let the helper continue while heartbeats are advancing. Pass --stream-engine-output when live engine text is useful; Codex and Claude filter tool/file chatter, other engines pass raw output through.
  • Do not kill a review just because it has been quiet for 2-5 minutes, or because it is still running under the 30-minute window. Inspect the process only after missing multiple expected heartbeats, after 30 minutes, or after an obviously failed subprocess; prefer letting the same helper command finish.
  • Tools are useful in review mode. The helper allows read-only inspection tools and web search by default so reviewers can check dependency contracts, upstream docs, and current behavior.
  • Security perspective is always included, but it should not cripple legitimate functionality. Report security findings only when the change creates a concrete, actionable risk or removes an important safety check.
  • For regression provenance, keep roles separate: blamed code author, blamed PR author, PR merger/committer, current PR author, and PR/date. If no blamed PR is traceable, use the blamed commit as the provenance: commit SHA, date, and author username. Do not guess a merger or frame missing PR metadata as a separate finding.
  • If the blamed PR was merged by clawsweeper[bot] or another automation, identify the human trigger when practical. Check timeline/comments first; if rate-limited, use gitcrawl/cache or public PR HTML. Look for maintainer commands such as @clawsweeper automerge, /landpr, or labels/status comments that armed automerge. Report automerge triggered by @login; if not found, say trigger unknown.
  • Do not invoke built-in codex review, nested reviewers, or reviewer panels from inside the review. The helper builds one bundle, calls one selected engine, validates one structured result, and stops.
  • Stop as soon as the helper exits 0 with no accepted/actionable findings. Do not run an extra review just to get a nicer "clean" line, a second opinion, or clearer closeout wording.
  • Treat the helper's successful exit plus absence of actionable findings as the clean review result, even if the underlying Codex CLI output is terse.
  • Multi-reviewer panels are opt-in only. Use them when explicitly requested or when risk justifies the extra spend; the main agent still verifies every accepted finding before fixing.
  • If rejecting a finding as intentional/not worth fixing, add a brief inline code comment only when it explains a real invariant or ownership decision that future reviewers should know.
  • If gh/Gitcrawl reports database disk image is malformed, run gitcrawl doctor --json once to let the portable cache repair before retrying review; do not bypass the shim unless repair fails and freshness requires live GitHub.
  • If Gitcrawl reports a portable manifest mismatch, source/runtime DB health error, or stale portable-store checkout, run gitcrawl doctor --json and inspect source_db_health, runtime_db_health, and portable_store_status before falling back to live GitHub.
  • Do not push just to review. Push only when the user requested push/ship/PR update.

Skill Path (set once)

Set the skill script paths once, then use "$AUTOREVIEW" and "$AUTOREVIEW_HARNESS" in the examples below.

Choose one:

bash
# Project-local skill in the current repo:
export AUTOREVIEW=".agents/skills/autoreview/scripts/autoreview"
export AUTOREVIEW_HARNESS=".agents/skills/autoreview/scripts/test-review-harness"
bash
# Source checkout of openclaw/agent-skills:
export AUTOREVIEW="skills/autoreview/scripts/autoreview"
export AUTOREVIEW_HARNESS="skills/autoreview/scripts/test-review-harness"
bash
# Global skill:
export AGENTS_HOME="${AGENTS_HOME:-$HOME/.agents}"
export AUTOREVIEW="$AGENTS_HOME/skills/autoreview/scripts/autoreview"
export AUTOREVIEW_HARNESS="$AGENTS_HOME/skills/autoreview/scripts/test-review-harness"

When using Claude Code, set AGENTS_HOME="$HOME/.claude" for global skills. Project-local skills live under .claude/skills/ in the current repo.

Pick Target

Dirty local work:

bash
"$AUTOREVIEW" --mode local

Use this only when the patch is actually unstaged/staged/untracked in the current checkout. --mode uncommitted is accepted as an alias for --mode local. For committed, pushed, or PR work, point the helper at the commit or branch diff instead; do not force dirty modes just because the helper docs mention dirty work first. A clean local review only proves there is no local patch.

Branch/PR work:

bash
"$AUTOREVIEW" --mode branch --base origin/main

Optional review context is first-class:

bash
"$AUTOREVIEW" --mode branch --base origin/main --prompt-file /tmp/review-notes.md --dataset /tmp/evidence.json

If an open PR exists, use its actual base:

bash
base=$(gh pr view --json baseRefName --jq .baseRefName)
"$AUTOREVIEW" --mode branch --base "origin/$base"

Committed single change:

bash
"$AUTOREVIEW" --mode commit --commit HEAD

Use commit review for already-landed or already-pushed work on main. Reviewing clean main against origin/main is usually an empty diff after push. For a small stack, review each commit explicitly or review the branch before merging with --base.

Parallel Closeout

Format first if formatting can change line locations. Then it is OK to run tests and review in parallel:

bash
"$AUTOREVIEW" --parallel-tests "<focused test command>"

On Windows, the default --parallel-tests shell preserves the platform cmd.exe semantics used by Python shell=True. Use --parallel-tests-shell powershell or --parallel-tests-shell pwsh when the focused test command is PowerShell-specific.

Tradeoff: tests may force code changes that stale the review. If tests or review lead to code edits, rerun the affected tests and rerun review until no accepted/actionable findings remain. Once that rerun exits cleanly, stop; do not spend another long review cycle on redundant confirmation.

Review Panels

Run multiple reviewers against one frozen bundle:

bash
"$AUTOREVIEW" --reviewers codex,claude,pi,droid

--panel is shorthand for Codex plus Claude unless --engine changes the first reviewer:

bash
"$AUTOREVIEW" --panel

Set reviewer models and thinking/effort explicitly:

bash
"$AUTOREVIEW" --reviewers codex,claude --model codex=gpt-5.5 --thinking codex=high --model claude=claude-fable-5 --thinking claude=max

Inline syntax is also supported for simple model IDs:

bash
"$AUTOREVIEW" --reviewers codex:gpt-5.5:high,claude:claude-fable-5:max

For models with slashes or extra colons, prefer keyed form:

bash
"$AUTOREVIEW" --engine pi --model anthropic/claude-sonnet-4 --thinking high
"$AUTOREVIEW" --engine opencode --model opencode/north-mini-code-free --thinking high
"$AUTOREVIEW" --engine droid --model claude-opus-4-8 --thinking low
"$AUTOREVIEW" --reviewers codex,pi --model codex=gpt-5.5 --model pi=anthropic/claude-sonnet-4
"$AUTOREVIEW" --reviewers codex,opencode --model codex=gpt-5.5 --model opencode=opencode/north-mini-code-free
"$AUTOREVIEW" --reviewers codex,droid --model codex=gpt-5.5 --model droid=claude-opus-4-8

Models and thinking

The helper accepts --model globally or per engine (engine=model) and --thinking globally or per engine (engine=level). Repeat either flag for multiple reviewers.

Recommended model defaults:

EngineDefault modelSource note
codex (default)gpt-5.5OpenAI's current GPT-5.5 alias
claudeclaude-fable-5Anthropic's most capable widely released Claude model

CLI flags and environment variables override these defaults. Droid, Copilot, Pi, and OpenCode do not get built-in model defaults here because their provider catalogs are external to the Codex/Claude closeout path and may vary by installation.

EngineModel flagExample model IDsThinking flagAccepted levels
codex (default)codex --model X exec ...gpt-5.5, gpt-5.5-2026-04-23-c model_reasoning_effort=Ynone, minimal, low, medium, high, xhigh
claudeclaude --model Xclaude-fable-5, claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5--effort Ylow, medium, high, xhigh, max
droiddroid exec --model Xclaude-opus-4-8, Factory model IDs-r, --reasoning-effort Yoff, none, low, medium, high
copilotcopilot --model Xgpt-5.2, Copilot model aliasesnot supportedn/a
pipi --model Xanthropic/claude-sonnet-4, openai/gpt-4o--thinking Yoff, minimal, low, medium, high, xhigh
opencodeopencode run -m Xopencode/north-mini-code-free, OpenCode provider/model IDs--variant Yminimal, low, medium, high, max

Claude also supports --fallback-model a,b for availability-based fallback chains (model-config). Current Claude docs note that auth, billing, rate-limit, request-size, and transport errors do not trigger fallback, and the changelog documents interactive-session support in v2.1.166.

Examples matching current main behavior:

bash
# Codex with explicit model and reasoning
"$AUTOREVIEW" --engine codex --model gpt-5.5 --thinking high

# Claude Code aliases or full model names, with optional availability fallback
"$AUTOREVIEW" --engine claude --model claude-fable-5 --thinking max
"$AUTOREVIEW" --engine claude --model claude-fable-5 --fallback-model claude-opus-4-8,claude-sonnet-4-6

# Factory Droid with explicit model and reasoning effort
"$AUTOREVIEW" --engine droid --model claude-opus-4-8 --thinking low

# GitHub Copilot (model only; no thinking knob)
"$AUTOREVIEW" --engine copilot --model gpt-5.2

# Pi with explicit model and thinking level
"$AUTOREVIEW" --engine pi --model anthropic/claude-sonnet-4 --thinking high --pi-bin pi

# OpenCode with explicit provider/model and variant
"$AUTOREVIEW" --engine opencode --model opencode/north-mini-code-free --thinking high

Environment defaults

CLI flags take precedence over environment variables.

VariablePurpose
AUTOREVIEW_MODELOverride the built-in default --model for all engines
AUTOREVIEW_THINKINGDefault --thinking for all engines
AUTOREVIEW_FALLBACK_MODELDefault Claude --fallback-model chain
AUTOREVIEW_<ENGINE>_MODELPer-engine model override, for example AUTOREVIEW_CODEX_MODEL=gpt-5.5
AUTOREVIEW_<ENGINE>_THINKINGPer-engine thinking override
AUTOREVIEW_CLAUDE_FALLBACK_MODELClaude-only fallback chain

Codex maps thinking to model_reasoning_effort. Claude maps thinking to --effort. Droid maps thinking to -r, --reasoning-effort. Pi maps thinking to --thinking. OpenCode maps thinking to --variant. Copilot rejects --thinking. Only Claude accepts --fallback-model; global CLI/env fallback requires at least one Claude reviewer, and engine-specific fallback overrides require that reviewer to be selected. Non-Claude fallback overrides, including AUTOREVIEW_<NONCLAUDE>_FALLBACK_MODEL, fail closed instead of being silently ignored.

Review engine isolation

When autoreview runs inside the repository under review, external reviewer CLIs must not load project-local trust or configuration that the branch controls.

EngineIsolation flagsReference
codex-c project_doc_max_bytes=0, repo trust_level="untrusted", exec --ignore-user-config --ignore-rules, plus read-only sandboxCodex CLI exec --help
claude--safe-mode --setting-sources user --strict-mcp-config --disallowedTools mcp__* plus explicit --allowedTools (--safe-mode requires Claude Code v2.1.169+)Claude Code CLI reference
pi--no-approve --no-session --no-context-files --no-extensions --no-skills --no-prompt-templates --no-themes, plus read-only tool allowlistPi CLI --help; requires Pi v0.79.0+
opencodeopencode run --dir <repo> --pure --format json, prompt over stdin, neutral subprocess cwd, injected deny-by-default permissions, project config disabledOpenCode CLI --help

Codex --ignore-user-config skips config loading for the exec run while keeping CODEX_HOME auth usable. The explicit repo trust override and zero project-doc budget keep reviewed-repo AGENTS.md and .codex/ trust surfaces out of the review prompt. --ignore-rules skips user/project execpolicy rules. Claude --safe-mode disables project hooks, skills, plugins, MCP servers, and CLAUDE.md while preserving normal authentication, model selection, built-in tools, and permissions; managed settings policy can still apply. --setting-sources user avoids project/local settings from the reviewed checkout, and current Claude Code docs note the project-skill blocking behavior was fixed in v2.1.69. --strict-mcp-config and --disallowedTools mcp__* keep MCP unavailable to the review run. --bare is not used here because Claude's headless docs say it skips OAuth and keychain reads. Pi --no-approve ignores project-local files for one run; the helper requires Pi v0.79.0+ plus help output that advertises every required isolation flag because older legacy binaries can ignore unknown flags. The current package is @earendil-works/pi-coding-agent; deprecated @mariozechner/pi-coding-agent 0.73.x is intentionally rejected. Pi version/help probes and the review command run from neutral temporary directories, not the reviewed repo. Pi --no-context-files removes AGENTS.md/CLAUDE.md, the resource-disable flags keep .pi extensions, skills, prompts, and themes out of the run, --no-session avoids writing review sessions, and the read-only allowlist omits bash, edit, and write. OpenCode starts from a neutral temporary directory, points at the reviewed repo with --dir, disables project config through OPENCODE_DISABLE_PROJECT_CONFIG=1, and injects OPENCODE_CONFIG_CONTENT; permissions default to deny, allow read/grep/glob, preserve OpenCode's .env ask rules, and gate websearch/webfetch with --no-web-search. The injected config also clears command/instruction/plugin arrays and disables write/edit/bash/task/skill/todowrite tools without changing user auth storage. The helper sends the review prompt over stdin rather than argv and extracts the final structured JSON from type: "text" events. OpenCode rejects --no-tools.

Context Efficiency

Run the helper directly so target selection, engine choice, structured validation, and exit status all stay in one path. If output is noisy, summarize the completed helper output after it returns; do not ask another agent or reviewer to rerun the review.

Helper

After setting AUTOREVIEW and AUTOREVIEW_HARNESS above:

bash
"$AUTOREVIEW" --help

The smoke harness has thin shell wrappers over a shared Python implementation:

bash
"$AUTOREVIEW_HARNESS" --fixture benign --engine codex

On native Windows, invoke the extensionless Python helper through Python:

powershell
python skills\autoreview\scripts\autoreview --help

and the smoke harness:

powershell
skills\autoreview\scripts\test-review-harness.ps1 -Fixture benign -Engine codex

The helper:

  • chooses dirty local changes first
  • accepts --mode uncommitted as an alias for --mode local
  • otherwise uses current PR base if gh pr view works
  • otherwise uses origin/main for non-main branches
  • supports --engine codex, claude, droid, copilot, pi, and opencode; default is AUTOREVIEW_ENGINE or codex; Codex should remain the default when nothing is set
  • resolves bare git, gh, reviewer, and PowerShell shell commands from absolute PATH entries only, never from the reviewed checkout; explicit relative --*-bin paths are resolved from the reviewed repository root
  • use --mode commit --commit <ref> for already-committed work, especially clean main after landing
  • should be left in --mode auto or forced to --mode branch for PR/branch work; do not force --mode local after committing
  • writes only to stdout unless --output, --json-output, or live streamed engine stderr is set
  • supports --dry-run, --parallel-tests, --parallel-tests-shell, --prompt, --prompt-file, --dataset, --no-tools, --no-web-search, and commit refs
  • supports --stream-engine-output or AUTOREVIEW_STREAM_ENGINE_OUTPUT=1 for live engine text while preserving structured validation; Codex and Claude hide tool/file event details, emit compact activity summaries, and report usage at turn completion
  • supports opt-in review panels with --panel / --reviewers, plus per-engine --model, --thinking, and Claude --fallback-model
  • uses built-in model defaults codex=gpt-5.5 and claude=claude-fable-5; honors AUTOREVIEW_MODEL, AUTOREVIEW_THINKING, AUTOREVIEW_FALLBACK_MODEL, and per-engine AUTOREVIEW_<ENGINE>_MODEL / AUTOREVIEW_<ENGINE>_THINKING environment overrides when CLI flags are omitted
  • allows read-only tools and web search by default where the selected CLI supports them; forbids nested review in the prompt; Codex is run through codex exec with read-only sandbox, reviewed-repo instruction/config/rule isolation flags, and structured output
  • runs Claude with --safe-mode (v2.1.169+), --setting-sources user, MCP disabled, explicit allowed tools, and --fallback-model when set, so reviewed-repo hooks/skills/MCP do not affect the review run while normal auth still works; managed settings policy can still apply
  • runs Droid with droid exec in read-only mode, forwards --model and -r, --reasoning-effort, and switches --output-format to stream-json when streaming is enabled
  • runs Pi v0.79.0+ from neutral temporary directories with --no-approve, --no-session, disabled Pi context/resource loading, and built-in read-only tools (read,grep,find,ls) when tools are enabled
  • runs OpenCode with opencode run --dir <repo> --pure --format json from a neutral temporary directory, forwards --model and --variant, injects deny-by-default permissions, disables project config loading, and passes the review prompt over stdin
  • prints review still running: <engine> elapsed=<seconds>s pid=<pid> to stderr at long-running intervals while waiting for the selected review engine, unless streamed output or compact Codex activity has been visible recently
  • prints autoreview clean: no accepted/actionable findings reported when the selected review command exits 0
  • exits nonzero when accepted/actionable findings are present

Final Report

Include:

  • review command used
  • tests/proof run
  • findings accepted/rejected, briefly why
  • the clean review result from the final helper/review run, or why a remaining finding was consciously rejected

Do not run another review solely to improve the final report wording. If the final helper run exited 0 and produced no accepted/actionable findings, report that exact run as clean.