docs/concepts/mantis.md
Mantis is the OpenClaw end-to-end verification system for bugs that need a real runtime, a real transport, and visible proof. It runs a scenario against a known bad ref, captures evidence, runs the same scenario against a candidate ref, and publishes the comparison as artifacts that a maintainer can inspect from a PR or from a local command.
Mantis starts with Discord because Discord gives us a high-value first lane: real bot auth, real guild channels, reactions, threads, native commands, and a browser UI where humans can visually confirm what the transport showed.
Mantis lives in the OpenClaw QA stack.
pnpm openclaw qa mantis.This boundary keeps transport knowledge in OpenClaw, machine scheduling in Crabbox, and maintainer workflow glue in ClawSweeper.
The first local command verifies the Discord bot, guild, channel, message send, reaction send, and artifact path:
pnpm openclaw qa mantis discord-smoke \
--output-dir .artifacts/qa-e2e/mantis/discord-smoke
The local before and after runner accepts this shape:
pnpm openclaw qa mantis run \
--transport discord \
--scenario discord-status-reactions-tool-only \
--baseline origin/main \
--candidate HEAD \
--output-dir .artifacts/qa-e2e/mantis/local-discord-status-reactions
The runner creates detached baseline and candidate worktrees under the output
directory, installs dependencies, builds each ref, runs the scenario with
--allow-failures, then writes baseline/, candidate/, comparison.json,
and mantis-report.md. For the first Discord scenario, a successful verification
means baseline status is fail and candidate status is pass.
The first VM/browser primitive is the desktop smoke:
pnpm openclaw qa mantis desktop-browser-smoke \
--output-dir .artifacts/qa-e2e/mantis/desktop-browser
It leases or reuses a Crabbox desktop machine, starts a visible browser inside the
VNC session, captures the desktop, pulls artifacts back to the local output
directory, and writes the reconnect command into the report. The command defaults
to the Hetzner provider because it is the first provider with working desktop/VNC
coverage in the Mantis lane. Override it with --provider, --crabbox-bin, or
OPENCLAW_MANTIS_CRABBOX_PROVIDER when running against another Crabbox fleet.
Useful desktop smoke flags:
--lease-id <cbx_...> or OPENCLAW_MANTIS_CRABBOX_LEASE_ID reuses a warmed desktop.--browser-url <url> changes the page opened in the visible browser.--html-file <path> renders a repo-local HTML artifact in the visible browser. Mantis uses this to capture the generated Discord status-reaction timeline through a real Crabbox desktop.--keep-lease or OPENCLAW_MANTIS_KEEP_VM=1 keeps a newly created passing lease open for VNC inspection. Failed runs keep the lease by default when one was created so an operator can reconnect.--class, --idle-timeout, and --ttl tune machine size and lease lifetime.The first full desktop transport primitive is the Slack desktop smoke:
pnpm openclaw qa mantis slack-desktop-smoke \
--output-dir .artifacts/qa-e2e/mantis/slack-desktop \
--gateway-setup \
--scenario slack-canary \
--keep-lease
It leases or reuses a Crabbox desktop machine, syncs the current checkout into
the VM, runs pnpm openclaw qa slack inside that VM, opens Slack Web in the VNC
browser, captures the visible desktop, and copies both the Slack QA artifacts and
the VNC screenshot back to the local output directory. This is the first Mantis
shape where the SUT OpenClaw gateway and the browser both live inside the same
Linux desktop VM.
With --gateway-setup, the command prepares a persistent disposable OpenClaw
home at $HOME/.openclaw-mantis/slack-openclaw, patches Slack Socket Mode
configuration for the selected channel, starts openclaw gateway run on port
38973, and keeps Chrome running in the VNC session. This is the "leave me a
Linux desktop with Slack and a claw running" mode; the bot-to-bot Slack QA lane
remains the default when --gateway-setup is omitted.
Required inputs for --credential-source env:
OPENCLAW_QA_SLACK_CHANNEL_IDOPENCLAW_QA_SLACK_DRIVER_BOT_TOKENOPENCLAW_QA_SLACK_SUT_BOT_TOKENOPENCLAW_QA_SLACK_SUT_APP_TOKENOPENCLAW_LIVE_OPENAI_KEY for the remote model lane. If only
OPENAI_API_KEY is set locally, Mantis maps it to OPENCLAW_LIVE_OPENAI_KEY
before invoking Crabbox so Crabbox's OPENCLAW_* env forwarding can carry it
into the VM.Useful Slack desktop flags:
--lease-id <cbx_...> reruns against a machine where an operator already logged in to Slack Web through VNC.--gateway-setup starts a persistent OpenClaw Slack gateway in the VM instead of only running the bot-to-bot QA lane.--slack-url <url> opens a specific Slack Web URL. Without it, Mantis derives https://app.slack.com/client/<team>/<channel> from Slack auth.test when the SUT bot token is available.--slack-channel-id <id> controls the Slack channel allowlist used by gateway setup.OPENCLAW_MANTIS_SLACK_BROWSER_PROFILE_DIR controls the persistent Chrome profile inside the VM. The default is $HOME/.config/openclaw-mantis/slack-chrome-profile, so a manual Slack Web login survives reruns on the same lease.--credential-source convex --credential-role ci uses the shared credential pool instead of direct Slack env tokens.--provider-mode, --model, --alt-model, and --fast pass through to the Slack live lane.The GitHub smoke workflow is Mantis Discord Smoke. The before and after GitHub
workflow for the first real scenario is Mantis Discord Status Reactions. It
accepts:
baseline_ref: the ref expected to reproduce queued-only behavior.candidate_ref: the ref expected to show queued -> thinking -> done.It checks out the workflow harness ref, builds separate baseline and candidate
worktrees, runs discord-status-reactions-tool-only against each worktree, and
uploads baseline/, candidate/, comparison.json, and mantis-report.md as
Actions artifacts. It also renders each lane's timeline HTML in a Crabbox
desktop browser and publishes those VNC screenshots beside the deterministic
timeline PNGs in the PR comment. The workflow builds the Crabbox CLI from
openclaw/crabbox main so it can use the current desktop/browser lease flags
before the next Crabbox binary release is cut.
You can also trigger the status-reactions run directly from a PR comment:
@Mantis discord status reactions
The comment trigger is intentionally narrow. It only runs on pull request comments from users with write, maintain, or admin access, and it only recognizes Discord status-reaction requests. By default it uses the known bad baseline ref and the current PR head SHA as the candidate. Maintainers can override either ref:
@Mantis discord status reactions baseline=origin/main candidate=HEAD
ClawSweeper command examples:
@clawsweeper mantis discord discord-status-reactions-tool-only
@clawsweeper verify e2e discord
The first command is explicit and scenario-focused. The second can later map a PR or issue to recommended Mantis scenarios from labels, changed files, and ClawSweeper review findings.
The scenario should be able to fail in two different ways:
The final report must separate these cases so maintainers do not confuse a flaky environment with product behavior.
The first scenario should target Discord status reactions in guild channels where
the source reply delivery mode is message_tool_only.
Why it is a good Mantis seed:
Expected scenario shape:
id: discord-status-reactions-tool-only
transport: discord
baseline:
expect:
reproduced: true
candidate:
expect:
fixed: true
config:
messages:
ackReaction: "👀"
ackReactionScope: "group-mentions"
groupChat:
visibleReplies: "message_tool"
statusReactions:
enabled: true
timing:
debounceMs: 0
discord:
requireMention: true
notifyChannel: operator-notify
evidence:
rest:
messageReactions: true
browser:
screenshotMessageRow: true
Baseline evidence should show the queued acknowledgement reaction but no
lifecycle transition in tool-only mode. Candidate evidence should show lifecycle
status reactions running when messages.statusReactions.enabled is explicitly
true.
The executable first slice is the opt-in Discord live QA scenario:
pnpm openclaw qa discord \
--scenario discord-status-reactions-tool-only \
--provider-mode live-frontier \
--model openai/gpt-5.4 \
--alt-model openai/gpt-5.4 \
--fast \
--output-dir .artifacts/qa-e2e/mantis/discord-status-reactions-candidate
It configures the SUT with always-on guild handling, visibleReplies: "message_tool", ackReaction: "👀", and explicit status reactions. The oracle
polls the real Discord triggering message and expects the observed sequence
👀 -> 🤔 -> 👍. Artifacts include discord-qa-reaction-timelines.json,
discord-status-reactions-tool-only-timeline.html, and
discord-status-reactions-tool-only-timeline.png.
Mantis should build on the existing private QA stack instead of starting from zero:
pnpm openclaw qa discord already runs a live Discord lane with driver and
SUT bots..artifacts/qa-e2e/.The first Mantis implementation can be a thin before/after runner over these pieces, plus one visual evidence layer.
Every run writes a stable artifact directory:
.artifacts/qa-e2e/mantis/<run-id>/
mantis-report.md
mantis-summary.json
baseline/
summary.json
discord-message.json
screenshot-message-row.png
gateway-debug/
candidate/
summary.json
discord-message.json
screenshot-message-row.png
gateway-debug/
comparison.json
run.log
mantis-summary.json should be the machine-readable source of truth. The
Markdown report is for PR comments and human review.
The summary must include:
Screenshots are evidence, not secrets. They still need redaction discipline: private channel names, user names, or message content may appear. For public PRs, prefer GitHub Actions artifact links over inline images until the redaction story is stronger.
The browser lane has two modes:
The Discord observer browser profile should be persistent enough to avoid logging in for every run, but isolated from personal browser state. A profile belongs to the Mantis machine pool, not to a developer laptop.
When Mantis gets stuck, it posts a Discord status message with:
The first private deployment can post these messages to the existing operator channel and move to a dedicated Mantis channel later.
Mantis should prefer AWS through Crabbox for the first remote implementation. Crabbox gives us warmed machines, lease tracking, hydration, logs, results, and cleanup. If AWS capacity is too slow or unavailable, add a Hetzner provider behind the same machine interface.
Minimum VM requirements:
The VM should not keep long-lived raw secrets outside the expected credential or browser profile stores.
Secrets live in GitHub organization or repository secrets for remote runs, and in a local operator-controlled secret file for local runs.
Recommended secret names:
OPENCLAW_QA_DISCORD_MANTIS_BOT_TOKENOPENCLAW_QA_DISCORD_DRIVER_BOT_TOKENOPENCLAW_QA_DISCORD_SUT_BOT_TOKENOPENCLAW_QA_DISCORD_GUILD_IDOPENCLAW_QA_DISCORD_CHANNEL_IDOPENCLAW_QA_DISCORD_NOTIFY_CHANNEL_IDOPENCLAW_QA_REDACT_PUBLIC_METADATA=1 for public GitHub artifact uploadsOPENCLAW_QA_CONVEX_SITE_URLOPENCLAW_QA_CONVEX_SECRET_CIOPENCLAW_QA_MANTIS_CRABBOX_COORDINATOROPENCLAW_QA_MANTIS_CRABBOX_COORDINATOR_TOKENLong term, the Convex credential pool should remain the normal source for live
transport credentials. GitHub secrets bootstrap the broker and fallback lanes.
The Discord status-reactions workflow maps the Mantis Crabbox secrets back to
the CRABBOX_COORDINATOR and CRABBOX_COORDINATOR_TOKEN environment variables
that the Crabbox CLI expects. The plain CRABBOX_* GitHub secret names remain
accepted as a compatibility fallback.
The Mantis runner must never print:
Public artifact uploads should also redact Discord target metadata such as bot,
guild, channel, and message ids. The GitHub smoke workflow enables
OPENCLAW_QA_REDACT_PUBLIC_METADATA=1 for this reason.
If a token is accidentally pasted into an issue, PR, chat, or log, rotate it after the new secret has been stored.
Mantis workflows should upload the full evidence bundle as a short-lived Actions
artifact. When the workflow is run for a bug report or fix PR, it should also
publish the redacted PNG screenshots to the qa-artifacts branch and upsert a
comment on that bug or fix PR with inline before/after screenshots. Do not post
the primary proof only on a generic QA automation PR. Raw logs, observed
messages, and other bulky evidence stay in the Actions artifact.
Production workflows should post those comments with the Mantis GitHub App, not
with github-actions[bot]. Store the app id and private key as
MANTIS_GITHUB_APP_ID and MANTIS_GITHUB_APP_PRIVATE_KEY GitHub Actions
secrets. The workflow uses a hidden marker as the upsert key, updates that
comment when the token can edit it, and creates a new Mantis-owned comment when
an older bot-owned marker cannot be edited.
The PR comment should be short and visual:
Mantis Discord Status Reactions QA
Summary: Mantis reran the reported Discord status-reaction bug against the known
bad baseline and the candidate fix. The baseline reproduced the bug, while the
candidate showed the expected queued -> thinking -> done sequence.
- Scenario: `discord-status-reactions-tool-only`
- Run: <workflow run link>
- Artifact: <artifact link>
- Baseline: `<status>` at `<sha>`
- Candidate: `<status>` at `<sha>`
| Baseline | Candidate |
| ------------------- | ------------------- |
| <inline screenshot> | <inline screenshot> |
When the run fails because the harness failed, the comment must say that instead of implying the candidate failed.
A private deployment may already have a Mantis Discord application. Reuse that application instead of creating another app when it has the right bot permissions and can be safely rotated.
Set the initial operator notification channel through secrets or deployment configuration. It can point at an existing maintainer or operations channel first, then move to a dedicated Mantis channel once one exists.
Do not put guild ids, channel ids, bot tokens, browser cookies, or VNC passwords in this document. Store them in GitHub secrets, the credential broker, or the operator's local secret store.
A Mantis scenario should declare:
Scenarios should prefer small, typed oracles:
Vision checks should be additive. If a platform API can prove the bug, use the API as the pass/fail oracle and keep screenshots for human confidence.
After Discord, the same runner can add:
gog where connectors are not
enough.Each transport should have one cheap smoke scenario and one or more bug-class scenarios. Expensive visual scenarios should stay opt-in.