Back to Lobehub

Agent Testing (Agentic End-to-End Verification)

.agents/skills/agent-testing/SKILL.md

2.2.418.6 KB
Original Source

Agent Testing (Agentic End-to-End Verification)

One skill for all agentic end-to-end testing — local-first today, designed to also run as full cloud automation. Every test session follows the same four-step contract:

Step -1: Plan approval  →  Step 0: Env + Auth  →  Step 1: Pick surface  →  Step 2: Run  →  Step 3: Structured report

Step -1 — Plan approval for non-trivial tests

Skip directly to Step 0 if: the test is a single re-run after a fix, the plan was already agreed on, or the user gave exact commands.

Otherwise, propose a test plan (surface, cases, expected evidence, assumptions) and use the runtime structured question tool (request_user_input / ask-user-question equivalent) with two fixed choices:

  1. 开始执行 (Recommended) — 测试方案没问题,开始执行
  2. 先讨论下 — 方案有问题,先讨论下

Wait for the user's choice before proceeding.

Step 0 — Environment setup + auth check (mandatory)

Step 0 is about getting the environment ready: dependencies are healthy and auth is green. A test run that dies halfway on a missing dependency or a login wall wastes the whole session — clear both gates BEFORE writing a single test step.

0.0 Resolve the current test environment

Before starting a dev server, checking auth, opening agent-browser, or writing test steps, print and confirm the current local test environment:

bash
./.agents/skills/agent-testing/scripts/test-env.sh

This command is the source of truth for local test ports. It reads the current shell plus .env files using the same precedence as scripts/runWithEnv.mts, then prints:

  • APP_URL
  • PORT
  • SERVER_URL
  • AUTH_TRUSTED_ORIGINS
  • SPA_PORT
  • MOBILE_SPA_PORT
  • DESKTOP_PORT

For commands that need these values, export them from the same resolver:

bash
eval "$(./.agents/skills/agent-testing/scripts/test-env.sh --exports)"

Do not rely on hard-coded port tables. If the printed values do not match the running dev server, fix/export the env first, then continue.

0.1 Dependencies are installed — root AND standalone apps

The root pnpm workspace does NOT cover every app: pnpm-workspace.yaml lists packages/**, e2e, apps/server, and only apps/desktop/src/mainapps/desktop and apps/cli are standalone, each keeping its own node_modules with its own links into packages/. A root install does not refresh them, so install in every app the test will touch:

bash
pnpm install                    # root workspace
cd apps/desktop && pnpm install # Electron surface
cd apps/cli && pnpm install     # CLI surface

Symptom of a stale standalone install: the build/launch fails to resolve a recently added workspace package — Rolldown failed to resolve import "@lobechat/<pkg>" (Electron) or Cannot find module '@lobechat/<pkg>' (CLI).

0.2 Run scripts from the repo root

All paths in this skill (./.agents/skills/agent-testing/...) are repo-root-relative, and background commands inherit the current working directory — a script launched while cwd is apps/desktop fails with No such file or directory. Verify pwd is the repo root before launching long-running scripts.

0.3 Init local dev env without .env

For Web smoke against local code, start a normal local dev environment. First check the repo root for .env:

  • If .env exists, use the existing local configuration and start the dev server normally.
  • If .env does not exist, use the agent-testing env bootstrap.

Do not start the standalone e2e server as the product under test.

Use scripts/init-dev-env.sh. It follows the e2e setup pattern — Postgres, migrations, auth/key-vault/S3 test env, seed user — but it is owned by this skill and starts the repo's dev server (pnpm run dev:next / bun run dev), not e2e/scripts/setup.ts --start. The script hard-blocks when root .env exists, so it cannot accidentally override a user's local config. When .env exists, do not call any init-dev-env.sh subcommand.

Decision flow:

bash
if [[ -f .env ]]; then
  bun run dev
else
  ./.agents/skills/agent-testing/scripts/init-dev-env.sh setup-db
  ./.agents/skills/agent-testing/scripts/init-dev-env.sh seed-user
  ./.agents/skills/agent-testing/scripts/init-dev-env.sh dev
fi

Bootstrap flow when no .env exists:

bash
# From repo root. Managed DB flow requires Docker Desktop.
./.agents/skills/agent-testing/scripts/init-dev-env.sh setup-db
./.agents/skills/agent-testing/scripts/init-dev-env.sh seed-user
./.agents/skills/agent-testing/scripts/init-dev-env.sh dev

If using an existing Postgres instead of the managed Docker DB, set DATABASE_URL and skip setup-db:

bash
DATABASE_URL=postgresql://... ./.agents/skills/agent-testing/scripts/init-dev-env.sh migrate
DATABASE_URL=postgresql://... ./.agents/skills/agent-testing/scripts/init-dev-env.sh seed-user
DATABASE_URL=postgresql://... ./.agents/skills/agent-testing/scripts/init-dev-env.sh dev

For backend-only checks, dev-next is available, but Web smoke needs the full-stack dev command so Next can proxy the SPA HTML from Vite:

bash
./.agents/skills/agent-testing/scripts/init-dev-env.sh dev-next

Useful subcommands:

bash
./.agents/skills/agent-testing/scripts/init-dev-env.sh env       # print exports
./.agents/skills/agent-testing/scripts/init-dev-env.sh write     # write .records/env/agent-testing-dev.env
./.agents/skills/agent-testing/scripts/init-dev-env.sh migrate   # migrations only
./.agents/skills/agent-testing/scripts/init-dev-env.sh seed-user # seed user + CLI API key
./.agents/skills/agent-testing/scripts/init-dev-env.sh qstash    # local QStash for workflow paths
./.agents/skills/agent-testing/scripts/init-dev-env.sh clean-db  # remove managed DB container

Default script env:

  • APP_URL=http://localhost:3010
  • DATABASE_URL=postgresql://postgres:postgres@localhost:5433/postgres
  • DATABASE_DRIVER=node
  • FEATURE_FLAGS=-agent_self_iteration so local smoke does not require QStash
  • Local QStash defaults (QSTASH_URL, QSTASH_TOKEN, signing keys) are exported; run init-dev-env.sh qstash in a separate terminal when the path under test triggers QStash/Workflow.
  • KEY_VAULTS_SECRET, AUTH_SECRET, auth verification off
  • S3 mock vars
  • Managed DB container: lobehub-agent-testing-postgres

seed-user creates [email protected] / TestPassword123! with onboarding already completed, plus a local API key in .records/env/agent-testing-cli.env for CLI automation. When running Cucumber against this dev server, pass the same script env into the test process too; Cucumber has its own BeforeAll seed path and it must see DATABASE_URL instead of silently skipping setup:

bash
cd e2e
# Only in the no-.env branch.
eval "$(../.agents/skills/agent-testing/scripts/init-dev-env.sh env)"
BASE_URL=http://localhost:3010 HEADLESS=true bun run test:smoke

0.4 Auth is green for the selected surface

Auth is the gate for automated testing, but the gate is surface-scoped. Pick the intended surface first when it is already clear from the task, then check only that surface. Do not block a Web test on CLI device-code auth or an Electron login state unless the test spans those surfaces.

bash
./.agents/skills/agent-testing/scripts/setup-auth.sh status --surface web

Use status with no --surface only for cross-surface test plans.

SurfaceMechanismOne-key pathStandard check
CLISeeded API key, device-code fallbacksetup-auth.sh cli-seedsetup-auth.sh status --surface cli
WebSeeded better-auth login into agent-browsersetup-auth.sh web-seedsetup-auth.sh status --surface web
ElectronApp's own persistent login stateLog in once in the appsetup-auth.sh status --surface electron
BotNative apps already logged inper-platform screenshot

Login-state checks are standardized — do NOT hand-roll window.__LOBE_STORES eval snippets; use scripts/app-probe.sh auth (returns { isSignedIn, userId }, works for Electron CDP and web sessions via AB_TARGET).

For Web tests, the test surface is always agent-browser --session lobehub-dev. Use setup-auth.sh web-seed first in the seeded local env. The user's normal Chrome is only a source for copying the Cookie header when seed auth is not available or status --surface web still fails. If Chrome is already logged in, do not open a login page; verify agent-browser first, then request the Network Cookie: header only if that verification fails. Full background and failure modes: references/auth.md.

Step 1 — Pick the surface by change scope

Change scopeDefault surfaceWhyGuide
Backend (TRPC router / service / model / migration)CLIFastest loop, text-assertable output, zero UI flakinesscli/index.md
Pure frontend (components, store, styles, UX)Electron (agent-browser + CDP)Primary product shape; __LOBE_STORES state introspectionui/electron.md
Full-stack (new API + UI consuming it)Web (browser + local dev server)One surface where network requests and UI are observable togetherui/web.md
Bot channels (Discord / WeChat / Lark / …)Native app via osascript / bridgeOnly way to exercise the real channel end-to-endbot/<platform>/index.md

Escalate, don't duplicate: verify a backend change with the CLI first; only add a UI pass when the change actually affects the UI.

Environment support (local macOS vs cloud Linux)

The decisive constraint per surface is how evidence (screenshots) is captured: CDP-based capture (agent-browser screenshot) renders from the browser engine and needs no real display; OS-level capture (screencapture, osascript) is macOS-only.

SurfacemacOS (local)Linux / cloud (headless)Screenshot mechanism
CLIn/a — text output
Web✅ headless Chromium works nativelyCDP — no display needed
Electron⚠️ runs, but needs a display server: wrap with xvfb-runCDP works under Xvfb; capture-app-window.sh does NOT
Bot❌ osascript + native apps are macOS-onlymacOS screencapture only

When a test must stay cloud-portable, prefer CDP-based evidence over OS-level capture wherever both exist.

Bot platforms

PlatformGuideQuick switcher
Discordbot/discord/index.mdCmd+K
Slackbot/slack/index.mdCmd+K
Telegrambot/telegram/index.mdCmd+F
WeChat / 微信bot/wechat/index.mdCmd+F
Lark / 飞书bot/lark/index.mdCmd+K
QQbot/qq/index.mdCmd+F
iMessagebot/imessage/index.mdbridge (no osascript)

Each platform folder contains an index.md (activation, navigation, send-message, verification snippets) and a test-<platform>-bot.sh script sharing the interface:

bash
./.agents/skills/agent-testing/bot/<platform>/test-<platform>-bot.sh <channel_or_contact> <message> [wait_seconds] [screenshot_path]

New to osascript automation? Read references/osascript.md first — it is a general macOS-automation asset (activate, type, paste, screenshot, accessibility reads, gotchas), not bot-specific.

Step 2 — Run

Surface guides above carry the detailed workflows. Shared infrastructure:

NeedWhere
Start / restart the local dev serverreferences/dev-server.md
agent-browser command referencereferences/agent-browser.md
osascript patterns (general macOS)references/osascript.md
Agent gateway probingreferences/agent-gateway.md
Screen recordingreferences/record-app-screen.md

Scripts

All under .agents/skills/agent-testing/scripts/:

ScriptUsage
test-env.shPrint/export the resolved local test env and ports
setup-auth.shOne-stop auth setup & status check (status / cli / web)
init-dev-env.shSelf-contained local dev env (setup-db / seed-user / dev-next / dev)
app-probe.shLobeHub app probes: auth / route / ops / goto <path> / errors
record-gif.shFrame-sequence → GIF for time-based behavior (streaming, timers, animations)
report-init.shScaffold a structured test report (Step 3)
electron-dev.shManage Electron dev env (start/stop/status/restart, CDP 9222)
capture-app-window.shScreenshot a specific app window (general; used by bot tests)
record-app-screen.shRecord app screen (video + periodic screenshots)
record-electron-demo.shRecord Electron app demo with ffmpeg
agent-gateway/Gateway probe / dump / analyze tools

app-probe.sh is the LobeHub-specific fast path into app state — auth check, current route, running operations, and goto <path> quick navigation (/agent/<agentId>/<topicId>, /task/<taskId>, /settings, …) so a test can jump straight to the state under test instead of clicking through the UI. See ui/electron.md for usage.

Step 3 — Structured report (mandatory deliverable)

Every automated test session ends with a structured, evidence-backed report — not a chat-only summary. Scaffold it up front and fill it as you test:

bash
DIR=$(./.agents/skills/agent-testing/scripts/report-init.sh my-feature "Verify my feature")
# ... test, saving screenshots / CLI transcripts into $DIR/assets/ ...
# fill $DIR/report.md (scope, case table with inline evidence, verdict, score) and $DIR/result.json

Reports live in .records/reports/<timestamp>-<slug>/ (gitignored): report.md (human-readable, with screenshots/GIFs embedded directly in the case table), result.json (machine-readable pass/fail + score), assets/ (evidence). Format spec and evidence rules: references/report.md.

Two hard rules worth front-loading:

  • Report language = the user's conversation language. Write the ENTIRE report.md (headings included) in the language the user is conversing in — no mixed English. result.json keys/status values stay English.
  • The case table is the main reading surface. Prefer the compact # | case | result | key observation | evidence shape and embed the screenshot/GIF in the evidence cell. Use separate evidence sections only for long CLI transcripts, HAR summaries, or supplemental detail.
  • Visual evidence must render inline. Screenshots and GIFs in report.md must use Markdown image syntax like ![case 1](assets/case1.png). Do not use bare file paths, Markdown links, or local file links as the primary visual evidence; those make the report unreadable without opening each asset.
  • Final replies must include visual evidence links. When a run includes UI screenshots or GIFs, include the report directory and the most important visual artifacts in the final chat response. Each item must include a stable label, an evidence caption describing the observed UI outcome, and a repo-relative path, for example: [Image #1 - error toast shows provider auth failure](<report-dir>/assets/foo.png). Use repo-relative paths, not absolute paths.
  • Time-based behavior needs a GIF, not a screenshot. If a case asserts change over time (streaming output, a ticking timer, loading states, animations), record it with scripts/record-gif.sh and embed the GIF — a static screenshot cannot prove the behavior.

Directory map

agent-testing/
├── SKILL.md            # this router
├── cli/index.md        # backend verification via the LobeHub CLI
├── ui/electron.md      # pure-frontend verification in the desktop app
├── ui/web.md           # full-stack verification in the browser
├── bot/<platform>/     # bot-channel verification (osascript / bridge)
├── references/         # shared knowledge: auth, dev-server, agent-browser, osascript, report
└── scripts/            # setup-auth, report-init, electron-dev, capture, recording, gateway

Gotchas