docs/help/testing.md
OpenClaw has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners. This doc is a "how we test" guide:
`pnpm openclaw qa matrix`.

This page covers running the regular test suites and Docker/Parallels runners. The QA-specific runners section below lists the concrete `qa` invocations and points back at the references above.
</Note>
Most days:

- `pnpm build && pnpm check && pnpm check:test-types && pnpm test`
- `pnpm test:max`
- `pnpm test:watch`
- `pnpm test extensions/discord/src/monitor/message-handler.preflight.test.ts`
- `pnpm qa:lab:up`
- `pnpm openclaw qa suite --runner multipass --scenario channel-chat-baseline`

When you touch tests or want extra confidence:

- `pnpm test:coverage`
- `pnpm test:e2e`

When debugging real providers/models (requires real creds):

- `pnpm test:live`
- `pnpm test:live -- src/agents/models.profiles.live.test.ts`
- `OpenClaw Performance` with `live_gpt54=true` for a real `openai/gpt-5.4` agent turn or `deep_profile=true` for Kova CPU/heap/trace artifacts. Daily scheduled runs publish mock-provider, deep-profile, and GPT 5.4 lane artifacts to `openclaw/clawgrit-reports` when `CLAWGRIT_REPORTS_TOKEN` is configured. The mock-provider report also includes source-level gateway boot, memory, plugin-pressure, repeated fake-model hello-loop, and CLI startup numbers.
- `pnpm test:docker:live-models`
  - image input also run a tiny image turn. Disable the extra probes with `OPENCLAW_LIVE_MODEL_FILE_PROBE=0` or `OPENCLAW_LIVE_MODEL_IMAGE_PROBE=0` when isolating provider failures.
  - `OpenClaw Scheduled Live And E2E Checks` and manual `OpenClaw Release Checks` both call the reusable live/E2E workflow with `include_live_suites: true`, which includes separate Docker live model matrix jobs sharded by provider.
  - `OpenClaw Live And E2E Checks (Reusable)` with `include_live_suites: true` and `live_models_only: true`.
  - `scripts/ci-hydrate-live-auth.sh` plus `.github/workflows/openclaw-live-and-e2e-checks-reusable.yml` and its scheduled/release callers.
- `pnpm test:docker:live-codex-bind`
  - `/codex bind`, exercises `/codex fast` and `/codex permissions`, then verifies a plain reply and an image attachment route through the native plugin binding instead of ACP.
- `pnpm test:docker:live-codex-harness`
  - `/codex status` and `/codex models`, and by default exercises image, cron MCP, sub-agent, and Guardian probes. Disable the sub-agent probe with `OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_PROBE=0` when isolating other Codex app-server failures. For a focused sub-agent check, disable the other probes: `OPENCLAW_LIVE_CODEX_HARNESS_IMAGE_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_MCP_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_GUARDIAN_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_PROBE=1 pnpm test:docker:live-codex-harness`. This exits after the sub-agent probe unless `OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_ONLY=0` is set.
- `pnpm test:live:crestodian-rescue-channel`
  - `/crestodian status`, queues a persistent model change, replies `/crestodian yes`, and verifies the audit/config write path.
- `pnpm test:docker:crestodian-planner`
  - `PATH` and verifies the fuzzy planner fallback translates into an audited typed config write.
- `pnpm test:docker:crestodian-first-run`
  - `openclaw` to Crestodian, applies setup/model/agent/Discord plugin + SecretRef writes, validates config, and verifies audit entries. The same Ring 0 setup path is also covered in QA Lab by `pnpm openclaw qa suite --scenario crestodian-ring-zero-setup`.
- `MOONSHOT_API_KEY` set, run `openclaw models list --provider moonshot --json`, then run an isolated `openclaw agent --local --session-id live-kimi-cost --message 'Reply exactly: KIMI_LIVE_OK' --thinking off --json` against `moonshot/kimi-k2.6`. Verify the JSON reports Moonshot/K2.6 and the assistant transcript stores normalized `usage.cost`.

These commands sit beside the main test suites when you need QA-lab realism:
CI runs QA Lab in dedicated workflows. Agentic parity is nested under `QA-Lab - All Lanes` and release validation, not a standalone PR workflow. Broad validation should use `Full Release Validation` with `rerun_group=qa-parity` or the release-checks QA group. Stable/default release checks keep exhaustive live/Docker soak behind `run_release_soak=true`; the `full` profile forces soak on. `QA-Lab - All Lanes` runs nightly on `main` and from manual dispatch with the mock parity lane, live Matrix lane, Convex-managed live Telegram lane, and Convex-managed live Discord lane as parallel jobs. Scheduled QA and release checks pass Matrix `--profile fast` explicitly, while the Matrix CLI and manual workflow input defaults remain `all`; manual dispatch can shard `all` into `transport`, `media`, `e2ee-smoke`, `e2ee-deep`, and `e2ee-cli` jobs. `OpenClaw Release Checks` runs parity plus the fast Matrix and Telegram lanes before release approval, using `mock-openai/gpt-5.5` for release transport checks so they stay deterministic and avoid normal provider-plugin startup. These live transport gateways disable memory search; memory behavior stays covered by the QA parity suites.

Full release live media shards use `ghcr.io/openclaw/openclaw-live-media-runner:ubuntu-24.04`, which already has `ffmpeg` and `ffprobe`. Docker live model/backend shards use the shared `ghcr.io/openclaw/openclaw-live-test:<sha>` image, which is built once per selected commit, and pull it with `OPENCLAW_SKIP_DOCKER_BUILD=1` instead of rebuilding inside every shard.
- `pnpm openclaw qa suite`
  - `qa-channel` defaults to concurrency 4 (bounded by the selected scenario count). Use `--concurrency <count>` to tune the worker count, or `--concurrency 1` for the older serial lane.
  - `--allow-failures` when you want artifacts without a failing exit code.
  - `live-frontier`, `mock-openai`, and `aimock`. `aimock` starts a local AIMock-backed provider server for experimental fixture and protocol-mock coverage without replacing the scenario-aware `mock-openai` lane.
- `pnpm test:plugins:kitchen-sink-live`
  - `/healthz` and `/readyz`, records gateway CPU/RSS evidence, runs a live OpenAI turn, and checks adversarial diagnostics. Requires live OpenAI auth such as `OPENAI_API_KEY`. In hydrated Testbox sessions it automatically sources the Testbox live-auth profile when the `openclaw-testbox-env` helper is present.
- `pnpm test:gateway:cpu-scenarios`
  - `channel-chat-baseline`, `memory-failure-fallback`, `gateway-restart-inflight-run`) and writes a combined CPU observation summary under `.artifacts/gateway-cpu-scenarios/`.
  - `--cpu-core-warn` plus `--hot-wall-warn-ms`), so short startup bursts are recorded as metrics without looking like the minutes-long gateway peg regression.
  - `dist` artifacts; run a build first when the checkout does not already have fresh runtime output.
- `pnpm openclaw qa suite --runner multipass`
  - `qa suite` on the host.
  - `qa suite`.
  - `CODEX_HOME` when present.
  - `.artifacts/qa-e2e/...`.
- `pnpm qa:lab:up`
- `pnpm test:docker:npm-onboard-channel-agent`
  - `OPENCLAW_NPM_ONBOARD_CHANNEL=discord` to run the same packaged-install lane with Discord.
- `pnpm test:docker:session-runtime-context`
  - `openclaw doctor --fix` rewrites it to the active branch with a backup.
- `pnpm test:docker:npm-telegram-live`
  - `OPENCLAW_NPM_TELEGRAM_PACKAGE_SPEC=openclaw@beta`; set `OPENCLAW_NPM_TELEGRAM_PACKAGE_TGZ=/path/to/openclaw-current.tgz` or `OPENCLAW_CURRENT_PACKAGE_TGZ` to test a resolved local tarball instead of installing from the registry.
  - `pnpm openclaw qa telegram`. For CI/release automation, set `OPENCLAW_NPM_TELEGRAM_CREDENTIAL_SOURCE=convex` plus `OPENCLAW_QA_CONVEX_SITE_URL` and the role secret. If `OPENCLAW_QA_CONVEX_SITE_URL` and a Convex role secret are present in CI, the Docker wrapper selects Convex automatically.
  - `OPENCLAW_NPM_TELEGRAM_SKIP_CREDENTIAL_PREFLIGHT=1` only when deliberately debugging pre-credential setup.
  - `OPENCLAW_NPM_TELEGRAM_CREDENTIAL_ROLE=ci|maintainer` overrides the shared `OPENCLAW_QA_CREDENTIAL_ROLE` for this lane only.
  - `NPM Telegram Beta E2E`. It does not run on merge. The workflow uses the `qa-live-shared` environment and Convex CI credential leases.
- `Package Acceptance` for side-run product proof against one candidate package. It accepts a trusted ref, published npm spec, HTTPS tarball URL plus SHA-256, or tarball artifact from another run, uploads the normalized `openclaw-current.tgz` as `package-under-test`, then runs the existing Docker E2E scheduler with `smoke`, `package`, `product`, `full`, or custom lane profiles. Set `telegram_mode=mock-openai` or `live-frontier` to run the Telegram QA workflow against the same `package-under-test` artifact.
  ```bash
  # Published npm spec, product profile, mocked Telegram QA
  gh workflow run package-acceptance.yml --ref main \
    -f source=npm \
    -f package_spec=openclaw@beta \
    -f suite_profile=product \
    -f telegram_mode=mock-openai

  # HTTPS tarball URL plus SHA-256
  gh workflow run package-acceptance.yml --ref main \
    -f source=url \
    -f package_url=https://registry.npmjs.org/openclaw/-/openclaw-VERSION.tgz \
    -f package_sha256=<sha256> \
    -f suite_profile=package

  # Tarball artifact from another run
  gh workflow run package-acceptance.yml --ref main \
    -f source=artifact \
    -f artifact_run_id=<run-id> \
    -f artifact_name=<artifact-name> \
    -f suite_profile=smoke
  ```
- `pnpm test:docker:plugins`
  - `openclaw update --tag <candidate>`, and verifies the candidate's post-update doctor cleans legacy plugin dependency debris without a harness-side postinstall repair.
- `pnpm test:parallels:npm-update`
  - Runs the native packaged-install update smoke across Parallels guests. Each selected platform first installs the requested baseline package, then runs the installed `openclaw update` command in the same guest and verifies the installed version, update status, gateway readiness, and one local agent turn.
  - Use `--platform macos`, `--platform windows`, or `--platform linux` while iterating on one guest. Use `--json` for the summary artifact path and per-lane status.
  - The OpenAI lane uses `openai/gpt-5.5` for the live agent-turn proof by default. Pass `--model <provider/model>` or set `OPENCLAW_PARALLELS_OPENAI_MODEL` when deliberately validating another OpenAI model.
  - Wrap long local runs in a host timeout so Parallels transport stalls cannot consume the rest of the testing window:

    ```bash
    timeout --foreground 150m pnpm test:parallels:npm-update -- --json
    timeout --foreground 90m pnpm test:parallels:npm-update -- --platform windows --json
    ```

  - The script writes nested lane logs under `/tmp/openclaw-parallels-npm-update.*`. Inspect `windows-update.log`, `macos-update.log`, or `linux-update.log` before assuming the outer wrapper is hung.
  - Windows update can spend 10 to 15 minutes in post-update doctor and package update work on a cold guest; that is still healthy when the nested npm debug log is advancing.
  - Do not run this aggregate wrapper in parallel with individual Parallels macOS, Windows, or Linux smoke lanes. They share VM state and can collide on snapshot restore, package serving, or guest gateway state.
  - The post-update proof runs the normal bundled plugin surface because capability facades such as speech, image generation, and media understanding are loaded through bundled runtime APIs even when the agent turn itself only checks a simple text response.
- `pnpm openclaw qa aimock`
- `pnpm openclaw qa matrix`
  - `qa-lab`.
- `pnpm openclaw qa telegram`
  - `OPENCLAW_QA_TELEGRAM_GROUP_ID`, `OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN`, and `OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN`. The group id must be the numeric Telegram chat id.
  - `--credential-source convex` for shared pooled credentials. Use env mode by default, or set `OPENCLAW_QA_CREDENTIAL_SOURCE=convex` to opt into pooled leases.
  - `--allow-failures` when you want artifacts without a failing exit code.
  - `@BotFather` for both bots and ensure the driver bot can observe group bot traffic.
  - `.artifacts/qa-e2e/...`. Replying scenarios include RTT from driver send request to observed SUT reply.

Live transport lanes share one standard contract so new transports do not drift; the per-lane coverage matrix lives in QA overview → Live transport coverage. `qa-channel` is the broad synthetic suite and is not part of that matrix.
When `--credential-source convex` (or `OPENCLAW_QA_CREDENTIAL_SOURCE=convex`) is enabled for
`openclaw qa telegram`, QA lab acquires an exclusive lease from a Convex-backed pool, heartbeats
that lease while the lane is running, and releases the lease on shutdown.
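The acquire, heartbeat, release sequence above can be sketched as three request bodies plus one response check. This is only an illustration: the endpoint paths and field names are taken from the endpoint contract listed further down this page, while every type and function name (`LeaseIdentity`, `acquireBody`, `toActiveLease`, `heartbeatBody`, `releaseBody`) is hypothetical and not the QA lab implementation.

```typescript
// Sketch only. Field names follow the endpoint contract on this page;
// all type and function names here are hypothetical.
type Role = "ci" | "maintainer";

interface LeaseIdentity {
  kind: string;
  ownerId: string;
  actorRole: Role;
}

interface ActiveLease extends LeaseIdentity {
  credentialId: string;
  leaseToken: string;
  payload: unknown;
}

type AcquireResponse =
  | { status: "ok"; credentialId: string; leaseToken: string; payload: unknown }
  | { status: "error"; code: string };

// Body for POST /acquire.
export function acquireBody(id: LeaseIdentity, leaseTtlMs: number, heartbeatIntervalMs: number) {
  return { ...id, leaseTtlMs, heartbeatIntervalMs };
}

// Turn an /acquire response into an active lease, or throw with the broker's error code.
export function toActiveLease(id: LeaseIdentity, res: AcquireResponse): ActiveLease {
  if (res.status !== "ok") {
    throw new Error(`lease not acquired: ${res.code}`);
  }
  return { ...id, credentialId: res.credentialId, leaseToken: res.leaseToken, payload: res.payload };
}

// Body for POST /release: identifies the lease and never echoes the credential payload.
export function releaseBody(lease: ActiveLease) {
  return {
    kind: lease.kind,
    ownerId: lease.ownerId,
    actorRole: lease.actorRole,
    credentialId: lease.credentialId,
    leaseToken: lease.leaseToken,
  };
}

// Body for POST /heartbeat: the release fields plus the TTL to extend to.
export function heartbeatBody(lease: ActiveLease, leaseTtlMs: number) {
  return { ...releaseBody(lease), leaseTtlMs };
}
```

A caller would send `acquireBody(...)` once, send `heartbeatBody(...)` on a timer while the lane runs, and send `releaseBody(...)` from its shutdown path so the credential returns to the pool.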
Reference Convex project scaffold:

- `qa/convex-credential-broker/`

Required env vars:

- `OPENCLAW_QA_CONVEX_SITE_URL` (for example `https://your-deployment.convex.site`)
- `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER` for `maintainer`
- `OPENCLAW_QA_CONVEX_SECRET_CI` for `ci`
- `--credential-role maintainer|ci`
- `OPENCLAW_QA_CREDENTIAL_ROLE` (defaults to `ci` in CI, `maintainer` otherwise)

Optional env vars:

- `OPENCLAW_QA_CREDENTIAL_LEASE_TTL_MS` (default `1200000`)
- `OPENCLAW_QA_CREDENTIAL_HEARTBEAT_INTERVAL_MS` (default `30000`)
- `OPENCLAW_QA_CREDENTIAL_ACQUIRE_TIMEOUT_MS` (default `90000`)
- `OPENCLAW_QA_CREDENTIAL_HTTP_TIMEOUT_MS` (default `15000`)
- `OPENCLAW_QA_CONVEX_ENDPOINT_PREFIX` (default `/qa-credentials/v1`)
- `OPENCLAW_QA_CREDENTIAL_OWNER_ID` (optional trace id)
- `OPENCLAW_QA_ALLOW_INSECURE_HTTP=1` allows loopback `http://` Convex URLs for local-only development.

`OPENCLAW_QA_CONVEX_SITE_URL` should use `https://` in normal operation.
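To make the defaults above concrete, here is a minimal sketch of resolving them from an environment map. The variable names and default values come from the lists above. The function name `resolveBrokerConfig`, the result shape, the `isCi` parameter, and the exact loopback host pattern are assumptions made for illustration, not the QA lab's actual code.

```typescript
// Sketch only: `resolveBrokerConfig`, `BrokerConfig`, and `isCi` are hypothetical.
type Env = Record<string, string | undefined>;
type Role = "ci" | "maintainer";

interface BrokerConfig {
  siteUrl: string;
  role: Role;
  leaseTtlMs: number;
  heartbeatIntervalMs: number;
  acquireTimeoutMs: number;
  httpTimeoutMs: number;
  endpointPrefix: string;
}

// Read a positive integer from env, falling back to the documented default.
function positiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

export function resolveBrokerConfig(env: Env, isCi: boolean): BrokerConfig {
  const siteUrl = env["OPENCLAW_QA_CONVEX_SITE_URL"] ?? "";
  // Assumption: "loopback" means localhost, 127.0.0.1, or [::1].
  const loopbackHttp = /^http:\/\/(localhost|127\.0\.0\.1|\[::1\])(:\d+)?(\/|$)/.test(siteUrl);
  const allowInsecure = env["OPENCLAW_QA_ALLOW_INSECURE_HTTP"] === "1";
  if (!/^https:\/\//.test(siteUrl) && !(allowInsecure && loopbackHttp)) {
    throw new Error("OPENCLAW_QA_CONVEX_SITE_URL must be an https:// URL");
  }
  const role = env["OPENCLAW_QA_CREDENTIAL_ROLE"] ?? (isCi ? "ci" : "maintainer");
  if (role !== "ci" && role !== "maintainer") {
    throw new Error(`unknown credential role: ${role}`);
  }
  return {
    siteUrl,
    role,
    leaseTtlMs: positiveInt(env, "OPENCLAW_QA_CREDENTIAL_LEASE_TTL_MS", 1200000),
    heartbeatIntervalMs: positiveInt(env, "OPENCLAW_QA_CREDENTIAL_HEARTBEAT_INTERVAL_MS", 30000),
    acquireTimeoutMs: positiveInt(env, "OPENCLAW_QA_CREDENTIAL_ACQUIRE_TIMEOUT_MS", 90000),
    httpTimeoutMs: positiveInt(env, "OPENCLAW_QA_CREDENTIAL_HTTP_TIMEOUT_MS", 15000),
    endpointPrefix: env["OPENCLAW_QA_CONVEX_ENDPOINT_PREFIX"] ?? "/qa-credentials/v1",
  };
}
```

Passing the environment in as a plain map keeps the sketch testable without touching the real process environment.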
Maintainer admin commands (pool add/remove/list) require
`OPENCLAW_QA_CONVEX_SECRET_MAINTAINER` specifically.

CLI helpers for maintainers:

```bash
pnpm openclaw qa credentials doctor
pnpm openclaw qa credentials add --kind telegram --payload-file qa/telegram-credential.json
pnpm openclaw qa credentials list --kind telegram
pnpm openclaw qa credentials remove --credential-id <credential-id>
```

Use `doctor` before live runs to check the Convex site URL, broker secrets,
endpoint prefix, HTTP timeout, and admin/list reachability without printing
secret values. Use `--json` for machine-readable output in scripts and CI
utilities.
Default endpoint contract (`OPENCLAW_QA_CONVEX_SITE_URL` + `/qa-credentials/v1`):

- `POST /acquire`
  - Request: `{ kind, ownerId, actorRole, leaseTtlMs, heartbeatIntervalMs }`
  - Success: `{ status: "ok", credentialId, leaseToken, payload, leaseTtlMs?, heartbeatIntervalMs? }`
  - Error: `{ status: "error", code: "POOL_EXHAUSTED" | "NO_CREDENTIAL_AVAILABLE", ... }`
- `POST /heartbeat`
  - Request: `{ kind, ownerId, actorRole, credentialId, leaseToken, leaseTtlMs }`
  - Success: `{ status: "ok" }` (or empty `2xx`)
- `POST /release`
  - Request: `{ kind, ownerId, actorRole, credentialId, leaseToken }`
  - Success: `{ status: "ok" }` (or empty `2xx`)
- `POST /admin/add` (maintainer secret only)
  - Request: `{ kind, actorId, payload, note?, status? }`
  - Success: `{ status: "ok", credential }`
- `POST /admin/remove` (maintainer secret only)
  - Request: `{ credentialId, actorId }`
  - Success: `{ status: "ok", changed, credential }`
  - Error: `{ status: "error", code: "LEASE_ACTIVE", ... }`
- `POST /admin/list` (maintainer secret only)
  - Request: `{ kind?, status?, includePayload?, limit? }`
  - Success: `{ status: "ok", credentials, count }`

Payload shape for Telegram kind:

- `{ groupId: string, driverToken: string, sutToken: string }`
- `groupId` must be a numeric Telegram chat id string.
- `admin/add` validates this shape for `kind: "telegram"` and rejects malformed payloads.

The architecture and scenario-helper names for new channel adapters live in QA overview → Adding a channel. The minimum bar: implement the transport runner on the shared `qa-lab` host seam, declare `qaRunners` in the plugin manifest, mount as `openclaw qa <runner>`, and author scenarios under `qa/scenarios/`.
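As an illustration of the Telegram payload shape described in the endpoint contract on this page, the check could look roughly like the sketch below. `parseTelegramPayload` is a hypothetical helper, not the broker's code, and accepting a leading minus sign in `groupId` is an assumption (Telegram group chat ids are commonly negative).

```typescript
// Sketch only: a hypothetical validator for the documented Telegram payload shape.
interface TelegramCredentialPayload {
  groupId: string;
  driverToken: string;
  sutToken: string;
}

export function parseTelegramPayload(input: unknown): TelegramCredentialPayload {
  if (typeof input !== "object" || input === null) {
    throw new Error("payload must be an object");
  }
  const record = input as Record<string, unknown>;
  const groupId = record["groupId"];
  const driverToken = record["driverToken"];
  const sutToken = record["sutToken"];
  // Assumption: "numeric chat id string" allows an optional leading minus sign.
  if (typeof groupId !== "string" || !/^-?\d+$/.test(groupId)) {
    throw new Error("groupId must be a numeric Telegram chat id string");
  }
  if (typeof driverToken !== "string" || driverToken.length === 0) {
    throw new Error("driverToken must be a non-empty string");
  }
  if (typeof sutToken !== "string" || sutToken.length === 0) {
    throw new Error("sutToken must be a non-empty string");
  }
  // Return only the documented fields so extra keys are not carried along.
  return { groupId, driverToken, sutToken };
}
```

Running a check like this on the payload file before `pnpm openclaw qa credentials add` would catch a malformed `groupId` locally instead of at the broker.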
Think of the suites as “increasing realism” (and increasing flakiness/cost):

- `pnpm test`
  - `vitest.full-*.config.ts` shard set and may expand multi-project shards into per-project configs for parallel scheduling
  - `src/**/*.test.ts`, `packages/**/*.test.ts`, and `test/**/*.test.ts`; UI unit tests run in the dedicated `unit-ui` shard
  - `api.js` and `runtime-api.js` fallback behavior with generated tiny plugin fixtures, not real bundled plugin source APIs. Real plugin API loads belong in plugin-owned contract/integration suites.
- Untargeted `pnpm test` runs twelve smaller shard configs (`core-unit-fast`, `core-unit-src`, `core-unit-security`, `core-unit-ui`, `core-unit-support`, `core-support-boundary`, `core-contracts`, `core-bundled`, `core-runtime`, `agentic`, `auto-reply`, `extensions`) instead of one giant native root-project process. This cuts peak RSS on loaded machines and avoids auto-reply/extension work starving unrelated suites.
- `pnpm test --watch` still uses the native root `vitest.config.ts` project graph, because a multi-shard watch loop is not practical.
- `pnpm test`, `pnpm test:watch`, and `pnpm test:perf:imports` route explicit file/directory targets through scoped lanes first, so `pnpm test extensions/discord/src/monitor/message-handler.preflight.test.ts` avoids paying the full root project startup tax.
- `pnpm test:changed` expands changed git paths into cheap scoped lanes by default: direct test edits, sibling `*.test.ts` files, explicit source mappings, and local import-graph dependents. Config/setup/package edits do not broad-run tests unless you explicitly use `OPENCLAW_TEST_CHANGED_BROAD=1 pnpm test:changed`.
- `pnpm check:changed` is the normal smart local check gate for narrow work. It classifies the diff into core, core tests, extensions, extension tests, apps, docs, release metadata, live Docker tooling, and tooling, then runs the matching typecheck, lint, and guard commands. It does not run Vitest tests; call `pnpm test:changed` or explicit `pnpm test <target>` for test proof. Release metadata-only version bumps run targeted version/config/root-dependency checks, with a guard that rejects package changes outside the top-level version field.
- Live Docker ACP harness edits run focused checks: shell syntax for the live Docker auth scripts and a live Docker scheduler dry-run. `package.json` changes are included only when the diff is limited to `scripts["test:docker:live-*"]`; dependency, export, version, and other package-surface edits still use the broader guards.
- Import-light unit tests from agents, commands, plugins, auto-reply helpers, `plugin-sdk`, and similar pure utility areas route through the `unit-fast` lane, which skips `test/setup-openclaw-runtime.ts`; stateful/runtime-heavy files stay on the existing lanes.
- Selected `plugin-sdk` and `commands` helper source files also map changed-mode runs to explicit sibling tests in those light lanes, so helper edits avoid rerunning the full heavy suite for that directory.
- `auto-reply` has dedicated buckets for top-level core helpers, top-level `reply.*` integration tests, and the `src/auto-reply/reply/**` subtree. CI further splits the reply subtree into agent-runner, dispatch, and commands/state-routing shards so one import-heavy bucket does not own the full Node tail.
- Normal PR/main CI intentionally skips the extension batch sweep and release-only `agentic-plugins` shard. Full Release Validation dispatches the separate `Plugin Prerelease` child workflow for those plugin/extension-heavy suites on release candidates.
- When you change message-tool discovery inputs or compaction runtime
context, keep both levels of coverage.
- Add focused helper regressions for pure routing and normalization
boundaries.
- Keep the embedded runner integration suites healthy:
`src/agents/pi-embedded-runner/compact.hooks.test.ts`,
`src/agents/pi-embedded-runner/run.overflow-compaction.test.ts`, and
`src/agents/pi-embedded-runner/run.overflow-compaction.loop.test.ts`.
- Those suites verify that scoped ids and compaction behavior still flow
through the real `run.ts` / `compact.ts` paths; helper-only tests are
not a sufficient substitute for those integration paths.
- Base Vitest config defaults to `threads`.
- The shared Vitest config fixes `isolate: false` and uses the
non-isolated runner across the root projects, e2e, and live configs.
- The root UI lane keeps its `jsdom` setup and optimizer, but runs on the
shared non-isolated runner too.
- Each `pnpm test` shard inherits the same `threads` + `isolate: false`
defaults from the shared Vitest config.
- `scripts/run-vitest.mjs` adds `--no-maglev` for Vitest child Node
processes by default to reduce V8 compile churn during big local runs.
Set `OPENCLAW_VITEST_ENABLE_MAGLEV=1` to compare against stock V8
behavior.
- `pnpm changed:lanes` shows which architectural lanes a diff triggers.
- The pre-commit hook is formatting-only. It restages formatted files and
does not run lint, typecheck, or tests.
- Run `pnpm check:changed` explicitly before handoff or push when you
need the smart local check gate.
- `pnpm test:changed` routes through cheap scoped lanes by default. Use
`OPENCLAW_TEST_CHANGED_BROAD=1 pnpm test:changed` only when the agent
decides a harness, config, package, or contract edit really needs broader
Vitest coverage.
- `pnpm test:max` and `pnpm test:changed:max` keep the same routing
behavior, just with a higher worker cap.
- Local worker auto-scaling is intentionally conservative and backs off
when the host load average is already high, so multiple concurrent
Vitest runs do less damage by default.
- The base Vitest config marks the projects/config files as
`forceRerunTriggers` so changed-mode reruns stay correct when test
wiring changes.
- The config keeps `OPENCLAW_VITEST_FS_MODULE_CACHE` enabled on supported
hosts; set `OPENCLAW_VITEST_FS_MODULE_CACHE_PATH=/abs/path` if you want
one explicit cache location for direct profiling.
- `pnpm test:perf:imports` enables Vitest import-duration reporting plus
import-breakdown output.
- `pnpm test:perf:imports:changed` scopes the same profiling view to
files changed since `origin/main`.
- Shard timing data is written to `.artifacts/vitest-shard-timings.json`.
Whole-config runs use the config path as the key; include-pattern CI
shards append the shard name so filtered shards can be tracked
separately.
- When one hot test still spends most of its time in startup imports,
keep heavy dependencies behind a narrow local `*.runtime.ts` seam and
mock that seam directly instead of deep-importing runtime helpers just
to pass them through `vi.mock(...)`.
- `pnpm test:perf:changed:bench -- --ref <git-ref>` compares routed
`test:changed` against the native root-project path for that committed
diff and prints wall time plus macOS max RSS.
- `pnpm test:perf:changed:bench -- --worktree` benchmarks the current
dirty tree by routing the changed file list through
`scripts/test-projects.mjs` and the root Vitest config.
- `pnpm test:perf:profile:main` writes a main-thread CPU profile for
Vitest/Vite startup and transform overhead.
- `pnpm test:perf:profile:runner` writes runner CPU+heap profiles for the
unit suite with file parallelism disabled.
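For intuition about the changed-mode routing described in the list above, here is a deliberately small sketch of two of the listed inputs: direct test edits and sibling `*.test.ts` files. `siblingTestTargets` is a hypothetical helper written for this page; it does not reproduce the explicit source mappings or import-graph dependents, and it is not the logic in `scripts/test-projects.mjs`.

```typescript
// Sketch only: a hypothetical approximation of "direct test edits" plus
// "sibling *.test.ts files". The real routing covers more cases.
export function siblingTestTargets(changedPaths: string[]): string[] {
  const targets: string[] = [];
  const add = (target: string) => {
    if (targets.indexOf(target) === -1) targets.push(target);
  };
  for (const path of changedPaths) {
    if (/\.test\.ts$/.test(path)) {
      // A changed test file is its own target.
      add(path);
    } else if (/\.ts$/.test(path) && !/\.d\.ts$/.test(path)) {
      // A changed source file maps to its sibling test candidate.
      add(path.replace(/\.ts$/, ".test.ts"));
    }
    // Anything else (docs, JSON, config) produces no scoped target here.
  }
  return targets.sort();
}
```

The returned paths are only candidates; a caller would still need to drop the ones that do not exist on disk before handing them to `pnpm test <target>`.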
- `pnpm test:stability:gateway`
  - `vitest.gateway.config.ts`, forced to one worker
  - `diagnostics.stability` over the Gateway WS RPC
- `pnpm test:e2e`
  - `vitest.e2e.config.ts`
  - `src/**/*.e2e.test.ts`, `test/**/*.e2e.test.ts`, and bundled-plugin E2E tests under `extensions/`
  - `threads` with `isolate: false`, matching the rest of the repo.
  - `OPENCLAW_E2E_WORKERS=<n>` to force worker count (capped at 16).
  - `OPENCLAW_E2E_VERBOSE=1` to re-enable verbose console output.
- `pnpm test:e2e:openshell`
  - `extensions/openshell/src/backend.e2e.test.ts`
  - `sandbox ssh-config` + SSH exec
  - `pnpm test:e2e` run
  - `openshell` CLI plus a working Docker daemon
  - `HOME` / `XDG_CONFIG_HOME`, then destroys the test gateway and sandbox
  - `OPENCLAW_E2E_OPENSHELL=1` to enable the test when running the broader e2e suite manually
  - `OPENCLAW_E2E_OPENSHELL_COMMAND=/path/to/openshell` to point at a non-default CLI binary or wrapper script
- `pnpm test:live`
  - `vitest.live.config.ts`
  - `src/**/*.live.test.ts`, `test/**/*.live.test.ts`, and bundled-plugin live tests under `extensions/`
  - `pnpm test:live` (sets `OPENCLAW_LIVE_TEST=1`)
  - `~/.profile` to pick up missing API keys.
  - `HOME` and copy config/auth material into a temp test home so unit fixtures cannot mutate your real `~/.openclaw`.
  - `OPENCLAW_LIVE_USE_REAL_HOME=1` only when you intentionally need live tests to use your real home directory.
  - `pnpm test:live` now defaults to a quieter mode: it keeps `[live] ...` progress output, but suppresses the extra `~/.profile` notice and mutes gateway bootstrap logs/Bonjour chatter. Set `OPENCLAW_LIVE_TEST_QUIET=0` if you want the full startup logs back.
  - `*_API_KEYS` with comma/semicolon format or `*_API_KEY_1`, `*_API_KEY_2` (for example `OPENAI_API_KEYS`, `ANTHROPIC_API_KEYS`, `GEMINI_API_KEYS`) or per-live override via `OPENCLAW_LIVE_*_KEY`; tests retry on rate limit responses.
  - `vitest.live.config.ts` disables Vitest console interception so provider/gateway progress lines stream immediately during live runs.
  - `OPENCLAW_LIVE_HEARTBEAT_MS`.
  - `OPENCLAW_LIVE_GATEWAY_HEARTBEAT_MS`.

Use this decision table:
- `pnpm test` (and `pnpm test:coverage` if you changed a lot)
- `pnpm test:e2e`
- `pnpm test:live`

For the live model matrix, CLI backend smokes, ACP smokes, Codex app-server harness, and all media-provider live tests (Deepgram, BytePlus, ComfyUI, image, music, video, media harness), plus credential handling for live runs, see Testing live suites. For the dedicated update and plugin validation checklist, see Testing updates and plugins.
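The key-rotation formats listed for `pnpm test:live` (`*_API_KEYS` with comma/semicolon separators, or numbered `*_API_KEY_1`, `*_API_KEY_2`) can be pictured with a small collector. This is a sketch under stated assumptions: `collectApiKeys` is a hypothetical name, and the ordering, de-duplication, and "stop at the first missing number" behavior are guesses for illustration rather than documented behavior of the live test harness.

```typescript
// Sketch only: a hypothetical reader for the two documented key formats.
type Env = Record<string, string | undefined>;

export function collectApiKeys(env: Env, provider: string): string[] {
  const keys: string[] = [];
  const add = (value: string | undefined) => {
    const trimmed = (value ?? "").trim();
    if (trimmed !== "" && keys.indexOf(trimmed) === -1) keys.push(trimmed);
  };
  // Format 1: PROVIDER_API_KEYS="key-a,key-b;key-c"
  for (const part of (env[`${provider}_API_KEYS`] ?? "").split(/[,;]/)) {
    add(part);
  }
  // Format 2: PROVIDER_API_KEY_1, PROVIDER_API_KEY_2, ...
  // Assumption: numbering starts at 1 and stops at the first gap.
  for (let i = 1; env[`${provider}_API_KEY_${i}`] !== undefined; i += 1) {
    add(env[`${provider}_API_KEY_${i}`]);
  }
  return keys;
}
```

For example, `collectApiKeys(env, "OPENAI")` reads `OPENAI_API_KEYS` and `OPENAI_API_KEY_1` onward, giving a retry loop a list to rotate through after a rate limit response.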
These Docker runners split into two buckets:

- `test:docker:live-models` and `test:docker:live-gateway` run only their matching profile-key live file inside the repo Docker image (`src/agents/models.profiles.live.test.ts` and `src/gateway/gateway-models.profiles.live.test.ts`), mounting your local config dir and workspace (and sourcing `~/.profile` if mounted). The matching local entrypoints are `test:live:models-profiles` and `test:live:gateway-profiles`.
- `test:docker:live-models` defaults to `OPENCLAW_LIVE_MAX_MODELS=12`, and `test:docker:live-gateway` defaults to `OPENCLAW_LIVE_GATEWAY_SMOKE=1`, `OPENCLAW_LIVE_GATEWAY_MAX_MODELS=8`, `OPENCLAW_LIVE_GATEWAY_STEP_TIMEOUT_MS=45000`, and `OPENCLAW_LIVE_GATEWAY_MODEL_TIMEOUT_MS=90000`. Override those env vars when you explicitly want the larger exhaustive scan.
- `test:docker:all` builds the live Docker image once via `test:docker:live-build`, packs OpenClaw once as an npm tarball through `scripts/package-openclaw-for-docker.mjs`, then builds/reuses two `scripts/e2e/Dockerfile` images. The bare image is only the Node/Git runner for install/update/plugin-dependency lanes; those lanes mount the prebuilt tarball. The functional image installs the same tarball into `/app` for built-app functionality lanes. Docker lane definitions live in `scripts/lib/docker-e2e-scenarios.mjs`; planner logic lives in `scripts/lib/docker-e2e-plan.mjs`; `scripts/test-docker-all.mjs` executes the selected plan. The aggregate uses a weighted local scheduler: `OPENCLAW_DOCKER_ALL_PARALLELISM` controls process slots, while resource caps keep heavy live, npm-install, and multi-service lanes from all starting at once. If a single lane is heavier than the active caps, the scheduler can still start it when the pool is empty and then keeps it running alone until capacity is available again. Defaults are 10 slots, `OPENCLAW_DOCKER_ALL_LIVE_LIMIT=9`, `OPENCLAW_DOCKER_ALL_NPM_LIMIT=10`, and `OPENCLAW_DOCKER_ALL_SERVICE_LIMIT=7`; tune `OPENCLAW_DOCKER_ALL_WEIGHT_LIMIT` or `OPENCLAW_DOCKER_ALL_DOCKER_LIMIT` only when the Docker host has more headroom. The runner performs a Docker preflight by default, removes stale OpenClaw E2E containers, prints status every 30 seconds, stores successful lane timings in `.artifacts/docker-tests/lane-timings.json`, and uses those timings to start longer lanes first on later runs. Use `OPENCLAW_DOCKER_ALL_DRY_RUN=1` to print the weighted lane manifest without building or running Docker, or `node scripts/test-docker-all.mjs --plan-json` to print the CI plan for selected lanes, package/image needs, and credentials.
- `Package Acceptance` is the GitHub-native package gate for "does this installable tarball work as a product?"
  It resolves one candidate package from `source=npm`, `source=ref`, `source=url`, or `source=artifact`, uploads it as `package-under-test`, then runs the reusable Docker E2E lanes against that exact tarball instead of repacking the selected ref. Profiles are ordered by breadth: `smoke`, `package`, `product`, and `full`. See Testing updates and plugins for the package/update/plugin contract, published-upgrade survivor matrix, release defaults, and failure triage.
- `scripts/check-cli-bootstrap-imports.mjs` after `tsdown`. The guard walks the static built graph from `dist/entry.js` and `dist/cli/run-main.js` and fails if pre-dispatch startup imports package dependencies such as Commander, prompt UI, undici, or logging before command dispatch; it also keeps the bundled gateway run chunk under budget and rejects static imports of known cold gateway paths. Packaged CLI smoke also covers root help, onboard help, doctor help, status, config schema, and a model-list command.
- `2026.4.25` (`2026.4.25-beta.*` included). Through that cutoff, the harness tolerates only shipped-package metadata gaps: omitted private QA inventory entries, missing `gateway install --wrapper`, missing patch files in the tarball-derived git fixture, missing persisted `update.channel`, legacy plugin install-record locations, missing marketplace install-record persistence, and config metadata migration during `plugins update`. For packages after `2026.4.25`, those paths are strict failures.
- `test:docker:openwebui`, `test:docker:onboard`, `test:docker:npm-onboard-channel-agent`, `test:docker:update-channel-switch`, `test:docker:upgrade-survivor`, `test:docker:published-upgrade-survivor`, `test:docker:session-runtime-context`, `test:docker:agents-delete-shared-workspace`, `test:docker:gateway-network`, `test:docker:browser-cdp-snapshot`, `test:docker:mcp-channels`, `test:docker:pi-bundle-mcp-tools`, `test:docker:cron-mcp-cleanup`, `test:docker:plugins`, `test:docker:plugin-update`, `test:docker:plugin-lifecycle-matrix`, and `test:docker:config-reload` boot one or more real containers and verify higher-level integration paths.

The live-model Docker runners also bind-mount only the needed CLI auth homes (or all supported ones when the run is not narrowed), then copy them into the container home before the run so external-CLI OAuth can refresh tokens without mutating the host auth store:
- `pnpm test:docker:live-models` (script: `scripts/test-live-models-docker.sh`)
- `pnpm test:docker:live-acp-bind` (script: `scripts/test-live-acp-bind-docker.sh`; covers Claude, Codex, and Gemini by default, with strict Droid/OpenCode coverage via `pnpm test:docker:live-acp-bind:droid` and `pnpm test:docker:live-acp-bind:opencode`)
- `pnpm test:docker:live-cli-backend` (script: `scripts/test-live-cli-backend-docker.sh`)
- `pnpm test:docker:live-codex-harness` (script: `scripts/test-live-codex-harness-docker.sh`)
- `pnpm test:docker:live-gateway` (script: `scripts/test-live-gateway-models-docker.sh`)
- `pnpm qa:otel:smoke` is a private QA source-checkout lane. It is intentionally not part of package Docker release lanes because the npm tarball omits QA Lab.
- `pnpm test:docker:openwebui` (script: `scripts/e2e/openwebui-docker.sh`)
- `pnpm test:docker:onboard` (script: `scripts/e2e/onboard-docker.sh`)
- `pnpm test:docker:npm-onboard-channel-agent` installs the packed OpenClaw tarball globally in Docker, configures OpenAI via env-ref onboarding plus Telegram by default, runs doctor, and runs one mocked OpenAI agent turn. Reuse a prebuilt tarball with `OPENCLAW_CURRENT_PACKAGE_TGZ=/path/to/openclaw-*.tgz`, skip the host rebuild with `OPENCLAW_NPM_ONBOARD_HOST_BUILD=0`, or switch channel with `OPENCLAW_NPM_ONBOARD_CHANNEL=discord` or `OPENCLAW_NPM_ONBOARD_CHANNEL=slack`.
- `pnpm test:docker:update-channel-switch` installs the packed OpenClaw tarball globally in Docker, switches from package `stable` to git `dev`, verifies the persisted channel and plugin post-update work, then switches back to package `stable` and checks update status.
- `pnpm test:docker:upgrade-survivor` installs the packed OpenClaw tarball over a dirty old-user fixture with agents, channel config, plugin allowlists, stale plugin dependency state, and existing workspace/session files. It runs package update plus non-interactive doctor without live provider or channel keys, then starts a loopback Gateway and checks config/state preservation plus startup/status budgets.
- `pnpm test:docker:published-upgrade-survivor` installs `openclaw@latest` by default, seeds realistic existing-user files, configures that baseline with a baked command recipe, validates the resulting config, updates that published install to the candidate tarball, runs non-interactive doctor, writes `.artifacts/upgrade-survivor/summary.json`, then starts a loopback Gateway and checks configured intents, state preservation, startup, `/healthz`, `/readyz`, and RPC status budgets. Override one baseline with `OPENCLAW_UPGRADE_SURVIVOR_BASELINE_SPEC`, ask the aggregate scheduler to expand exact baselines with `OPENCLAW_UPGRADE_SURVIVOR_BASELINE_SPECS` such as `all-since-2026.4.23`, and expand issue-shaped fixtures with `OPENCLAW_UPGRADE_SURVIVOR_SCENARIOS` such as `reported-issues`; the `reported-issues` set includes `configured-plugin-installs` for automatic external OpenClaw plugin install repair. Package Acceptance exposes those as `published_upgrade_survivor_baseline`, `published_upgrade_survivor_baselines`, and `published_upgrade_survivor_scenarios`; Full Release Validation uses the default latest baseline in the blocking path and expands to all-since/reported-issues only for `run_release_soak=true` or `release_profile=full`.
- `pnpm test:docker:session-runtime-context` verifies hidden runtime context transcript persistence plus doctor repair of affected duplicated prompt-rewrite branches.
- `bash scripts/e2e/bun-global-install-smoke.sh` packs the current tree, installs it with `bun install -g` in an isolated home, and verifies `openclaw infer image providers --json` returns bundled image providers instead of hanging.
Reuse a prebuilt tarball with OPENCLAW_BUN_GLOBAL_SMOKE_PACKAGE_TGZ=/path/to/openclaw-*.tgz, skip the host build with OPENCLAW_BUN_GLOBAL_SMOKE_HOST_BUILD=0, or copy dist/ from a built Docker image with OPENCLAW_BUN_GLOBAL_SMOKE_DIST_IMAGE=openclaw-dockerfile-smoke:local.bash scripts/test-install-sh-docker.sh shares one npm cache across its root, update, and direct-npm containers. Update smoke defaults to npm latest as the stable baseline before upgrading to the candidate tarball. Override with OPENCLAW_INSTALL_SMOKE_UPDATE_BASELINE=2026.4.22 locally, or with the Install Smoke workflow's update_baseline_version input on GitHub. Non-root installer checks keep an isolated npm cache so root-owned cache entries do not mask user-local install behavior. Set OPENCLAW_INSTALL_SMOKE_NPM_CACHE_DIR=/path/to/cache to reuse the root/update/direct-npm cache across local reruns.OPENCLAW_INSTALL_SMOKE_SKIP_NPM_GLOBAL=1; run the script locally without that env when direct npm install -g coverage is needed.pnpm test:docker:agents-delete-shared-workspace (script: scripts/e2e/agents-delete-shared-workspace-docker.sh) builds the root Dockerfile image by default, seeds two agents with one workspace in an isolated container home, runs agents delete --json, and verifies valid JSON plus retained workspace behavior. 
Reuse the install-smoke image with OPENCLAW_AGENTS_DELETE_SHARED_WORKSPACE_E2E_IMAGE=openclaw-dockerfile-smoke:local OPENCLAW_AGENTS_DELETE_SHARED_WORKSPACE_E2E_SKIP_BUILD=1.pnpm test:docker:gateway-network (script: scripts/e2e/gateway-network-docker.sh)pnpm test:docker:browser-cdp-snapshot (script: scripts/e2e/browser-cdp-snapshot-docker.sh) builds the source E2E image plus a Chromium layer, starts Chromium with raw CDP, runs browser doctor --deep, and verifies CDP role snapshots cover link URLs, cursor-promoted clickables, iframe refs, and frame metadata.pnpm test:docker:openai-web-search-minimal (script: scripts/e2e/openai-web-search-minimal-docker.sh) runs a mocked OpenAI server through Gateway, verifies web_search raises reasoning.effort from minimal to low, then forces the provider schema reject and checks the raw detail appears in Gateway logs.pnpm test:docker:mcp-channels (script: scripts/e2e/mcp-channels-docker.sh)pnpm test:docker:pi-bundle-mcp-tools (script: scripts/e2e/pi-bundle-mcp-tools-docker.sh)pnpm test:docker:cron-mcp-cleanup (script: scripts/e2e/cron-mcp-cleanup-docker.sh)file:, npm registry with hoisted dependencies, git moving refs, ClawHub kitchen-sink, marketplace updates, and Claude-bundle enable/inspect): pnpm test:docker:plugins (script: scripts/e2e/plugins-docker.sh)
Set OPENCLAW_PLUGINS_E2E_CLAWHUB=0 to skip the ClawHub block, or override the default kitchen-sink package/runtime pair with OPENCLAW_PLUGINS_E2E_CLAWHUB_SPEC and OPENCLAW_PLUGINS_E2E_CLAWHUB_ID. Without OPENCLAW_CLAWHUB_URL/CLAWHUB_URL, the test uses a hermetic local ClawHub fixture server.pnpm test:docker:plugin-update (script: scripts/e2e/plugin-update-unchanged-docker.sh)pnpm test:docker:plugin-lifecycle-matrix installs the packed OpenClaw tarball in a bare container, installs an npm plugin, toggles enable/disable, upgrades and downgrades it through a local npm registry, deletes the installed code, then verifies uninstall still removes stale state while logging RSS/CPU metrics for each lifecycle phase.pnpm test:docker:config-reload (script: scripts/e2e/config-reload-source-docker.sh)pnpm test:docker:plugins covers install/update smoke for local path, file:, npm registry with hoisted dependencies, git moving refs, ClawHub fixtures, marketplace updates, and Claude-bundle enable/inspect. pnpm test:docker:plugin-update covers unchanged update behavior for installed plugins. pnpm test:docker:plugin-lifecycle-matrix covers resource-tracked npm plugin install, enable, disable, upgrade, downgrade, and missing-code uninstall.To prebuild and reuse the shared functional image manually:
OPENCLAW_DOCKER_E2E_IMAGE=openclaw-docker-e2e-functional:local pnpm test:docker:e2e-build
OPENCLAW_DOCKER_E2E_IMAGE=openclaw-docker-e2e-functional:local OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:mcp-channels
Suite-specific image overrides such as OPENCLAW_GATEWAY_NETWORK_E2E_IMAGE still win when set. When OPENCLAW_SKIP_DOCKER_BUILD=1 points at a remote shared image, the scripts pull it if it is not already local. The QR and installer Docker tests keep their own Dockerfiles because they validate package/install behavior rather than the shared built-app runtime.
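
The override precedence described above can be pictured with a small shell sketch. It is an illustration under stated assumptions, not the logic of the actual runner scripts: the resolve_image helper and the fallback tag are hypothetical, and only the two environment variable names come from this page.

```shell
# Hypothetical sketch of image selection; the real runner scripts may differ.
# resolve_image SUITE_OVERRIDE: print the image tag a suite would run against.
resolve_image() {
  suite_override="$1"
  if [ -n "$suite_override" ]; then
    # A suite-specific override wins whenever it is set.
    printf '%s\n' "$suite_override"
  else
    # Otherwise fall back to the shared image (the default tag here is assumed).
    printf '%s\n' "${OPENCLAW_DOCKER_E2E_IMAGE:-openclaw-docker-e2e-functional:local}"
  fi
}

# Example: resolve the image for the gateway-network suite.
resolve_image "${OPENCLAW_GATEWAY_NETWORK_E2E_IMAGE:-}"
```

Passing the suite value as an argument keeps the helper free of per-suite variable names, so the same function could serve any suite that follows this pattern.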
The live-model Docker runners also bind-mount the current checkout read-only and
stage it into a temporary workdir inside the container. This keeps the runtime
image slim while still running Vitest against your exact local source/config.
The staging step skips large local-only caches and app build outputs such as
.pnpm-store, .worktrees, __openclaw_vitest__, and app-local .build or
Gradle output directories so Docker live runs do not spend minutes copying
machine-specific artifacts.
They also set OPENCLAW_SKIP_CHANNELS=1 so gateway live probes do not start
real Telegram/Discord/etc. channel workers inside the container.
test:docker:live-models still runs pnpm test:live, so pass through
OPENCLAW_LIVE_GATEWAY_* as well when you need to narrow or exclude gateway
live coverage from that Docker lane.
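
The staging step described above can be approximated with a tar pipeline that never copies the excluded directories. This is a minimal sketch, assuming tar is available in the container. The stage_checkout helper is hypothetical, the exclude list covers only the directory names given on this page (Gradle output directories are not named here, so they are left out), and the real runner may stage files differently.

```shell
# Hypothetical sketch: stage a read-only checkout into a writable workdir
# while skipping local-only caches and build outputs.
# stage_checkout SRC DEST
stage_checkout() {
  src="$1"
  dest="$2"
  mkdir -p "$dest"
  # Stream the tree through tar so excluded directories are never read or copied.
  tar -cf - \
    --exclude='.pnpm-store' \
    --exclude='.worktrees' \
    --exclude='__openclaw_vitest__' \
    --exclude='.build' \
    -C "$src" . | tar -xf - -C "$dest"
}

# Usage (paths are placeholders):
#   stage_checkout /src "$(mktemp -d)"
```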
test:docker:openwebui is a higher-level compatibility smoke: it starts an
OpenClaw gateway container with the OpenAI-compatible HTTP endpoints enabled,
starts a pinned Open WebUI container against that gateway, signs in through
Open WebUI, verifies /api/models exposes openclaw/default, then sends a
real chat request through Open WebUI's /api/chat/completions proxy.
The first run can be noticeably slower because Docker may need to pull the
Open WebUI image and Open WebUI may need to finish its own cold-start setup.
This lane expects a usable live model key, and OPENCLAW_PROFILE_FILE
(~/.profile by default) is the primary way to provide it in Dockerized runs.
Successful runs print a small JSON payload like { "ok": true, "model": "openclaw/default", ... }.

test:docker:mcp-channels is intentionally deterministic and does not need a
real Telegram, Discord, or iMessage account. It boots a seeded Gateway
container, starts a second container that spawns openclaw mcp serve, then
verifies routed conversation discovery, transcript reads, attachment metadata,
live event queue behavior, outbound send routing, and Claude-style channel +
permission notifications over the real stdio MCP bridge. The notification check
inspects the raw stdio MCP frames directly so the smoke validates what the
bridge actually emits, not just what a specific client SDK happens to surface.

test:docker:pi-bundle-mcp-tools is deterministic and does not need a live
model key. It builds the repo Docker image, starts a real stdio MCP probe server
inside the container, materializes that server through the embedded Pi bundle
MCP runtime, executes the tool, then verifies coding and messaging keep
bundle-mcp tools while minimal and tools.deny: ["bundle-mcp"] filter them.

test:docker:cron-mcp-cleanup is deterministic and does not need a live model
key. It starts a seeded Gateway with a real stdio MCP probe server, runs an
isolated cron turn and a /subagents spawn one-shot child turn, then verifies
the MCP child process exits after each run.
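
The "child process exits after each run" check can be sketched in plain shell. This is only an illustration of the idea, not the smoke's implementation: child_exited is a hypothetical helper, and a short sleep stands in for the MCP probe server.

```shell
# Hypothetical sketch: confirm a spawned child process is gone after a run.
# child_exited PID: succeed when no process with that PID is alive any more.
child_exited() {
  # kill -0 sends no signal; it only reports whether the PID can be signalled.
  ! kill -0 "$1" 2>/dev/null
}

# Stand-in for one run: start a short-lived child and wait for it to finish.
sleep 0 &
child_pid=$!
wait "$child_pid"

if child_exited "$child_pid"; then
  echo "child exited"
else
  echo "child still running" >&2
fi
```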
Manual ACP plain-language thread smoke (not CI):

- bun scripts/dev/discord-acp-plain-language-smoke.ts --channel <discord-channel-id> ...

Useful env vars:

- OPENCLAW_CONFIG_DIR=... (default: ~/.openclaw) mounted to /home/node/.openclaw
- OPENCLAW_WORKSPACE_DIR=... (default: ~/.openclaw/workspace) mounted to /home/node/.openclaw/workspace
- OPENCLAW_PROFILE_FILE=... (default: ~/.profile) mounted to /home/node/.profile and sourced before running tests
- OPENCLAW_DOCKER_PROFILE_ENV_ONLY=1 to verify only env vars sourced from OPENCLAW_PROFILE_FILE, using temporary config/workspace dirs and no external CLI auth mounts
- OPENCLAW_DOCKER_CLI_TOOLS_DIR=... (default: ~/.cache/openclaw/docker-cli-tools) mounted to /home/node/.npm-global for cached CLI installs inside Docker
- External CLI auth dirs and files under $HOME are mounted read-only under /host-auth..., then copied into /home/node/... before tests start
  - .minimax
  - ~/.codex/auth.json, ~/.codex/config.toml, .claude.json, ~/.claude/.credentials.json, ~/.claude/settings.json, ~/.claude/settings.local.json
  - OPENCLAW_LIVE_PROVIDERS / OPENCLAW_LIVE_GATEWAY_PROVIDERS
  - OPENCLAW_DOCKER_AUTH_DIRS=all, OPENCLAW_DOCKER_AUTH_DIRS=none, or a comma list like OPENCLAW_DOCKER_AUTH_DIRS=.claude,.codex
- OPENCLAW_LIVE_GATEWAY_MODELS=... / OPENCLAW_LIVE_MODELS=... to narrow the run
- OPENCLAW_LIVE_GATEWAY_PROVIDERS=... / OPENCLAW_LIVE_PROVIDERS=... to filter providers in-container
- OPENCLAW_SKIP_DOCKER_BUILD=1 to reuse an existing openclaw:local-live image for reruns that do not need a rebuild
- OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1 to ensure creds come from the profile store (not env)
- OPENCLAW_OPENWEBUI_MODEL=... to choose the model exposed by the gateway for the Open WebUI smoke
- OPENCLAW_OPENWEBUI_PROMPT=... to override the nonce-check prompt used by the Open WebUI smoke
- OPENWEBUI_IMAGE=... to override the pinned Open WebUI image tag

Run docs checks after doc edits: pnpm check:docs.
Run full Mintlify anchor validation when you need in-page heading checks too: pnpm docs:check-links:anchors.
These are “real pipeline” regressions without real providers:

- src/gateway/gateway.test.ts (case: "runs a mock OpenAI tool call end-to-end via gateway agent loop")
- Wizard over WS (wizard.start/wizard.next, writes config + auth enforced): src/gateway/gateway.test.ts (case: "runs wizard over ws and writes auth token config")

We already have a few CI-safe tests that behave like “agent reliability evals”:

- (src/gateway/gateway.test.ts).
- (src/gateway/gateway.test.ts).

What’s still missing for skills (see Skills):

- SKILL.md before use and follow required steps/args?

Future evals should stay deterministic first:

Contract tests verify that every registered plugin and channel conforms to its
interface contract. They iterate over all discovered plugins and run a suite of
shape and behavior assertions. The default pnpm test unit lane intentionally
skips these shared seam and smoke files; run the contract commands explicitly
when you touch shared channel or provider surfaces.
- pnpm test:contracts
- pnpm test:contracts:channels
- pnpm test:contracts:plugins

Located in src/channels/plugins/contracts/*.contract.test.ts:

Located in src/plugins/contracts/*.contract.test.ts.

Located in src/plugins/contracts/*.contract.test.ts:

Contract tests run in CI and do not require real API keys.
When you fix a provider/model issue discovered in live:

src/secrets/exec-secret-ref-id-parity.test.ts derives one sampled target per SecretRef class from registry metadata (listSecretTargetRegistryEntries()), then asserts traversal-segment exec ids are rejected.

If you add a new includeInPlan SecretRef target family in src/secrets/target-registry-data.ts, update classifyTargetClass in that test. The test intentionally fails on unclassified target ids so new classes cannot be skipped silently.