showcase/TESTING.md
Tagline: bin/showcase test reference, cell red→green SOP, per-demo coverage
matrix, and CI gating matrix.
Three concerns live here:
bin/showcase test CLI reference — how to run the
harness locally and how to drive a cell from red to green.Operational gotchas (aimock fixture caching, --isolate slot collisions, etc.)
live in GOTCHAS.md. Per-failure-mode debugging strategies and
production-side ops (probe triggering, Railway log access, isolated stack
cleanup) live in DEBUGGING.md.
bin/showcase test invocation semantics--d5 / --d6 route through the production-equivalent control-plane pipeline
(producer → queue → worker, same as Railway). --direct switches to a legacy
in-process driver (bypasses control-plane).
| Invocation | Family | Pipeline | Per-demo scoping (:demo) |
|---|---|---|---|
--d5 (no :demo) | d5 representative | control-plane | hardcoded agentic-chat |
--d5 :demo | d5 single demo | control-plane | honored (post-A18) |
--d5 --direct | d5 family | in-process | honored via buildDeepInputs |
--d6 (no :demo) | d6 full sweep | control-plane | full demo list, aggregate validation |
--d6 :demo | d6 single demo | control-plane | honored (post-A18) |
--d6 --direct | d6 family | in-process | honored via buildFullInputs |
--direct (no --d5/--d6) | d5+d6 default | in-process | honored |
Use control-plane (no --direct) for production-equivalent testing. It
exercises the same producer→queue→worker pipeline Railway uses, so local
results are apples-to-apples with staging. --direct is opt-out legacy:
useful for fast in-process debugging when you don't need the queue.
--isolate (post-A21+A21b) scopes the rebuild to the target slug — infra
services (aimock, pocketbase, dashboard, harness, harness-pool-worker) reuse
cached images from the local Docker store. Cold-build is ~30s–2 min per slug
instead of 10+ min full-stack rebuild.
Run the harness LOCALLY FIRST. Capture the RED log via the production-equivalent path BEFORE any code change:
bin/showcase test <slug>:<demo> --d5 --isolate
No theory, no "should work" — observe the actual failure.
One change. Verify GREEN locally on the same probe. Re-run the same invocation; it must go green. Iterate if not. Don't commit on red.
LGP regression check every time. Gold-standard cell must stay green:
bin/showcase test langgraph-python:tool-rendering-custom-catchall --d5 --isolate
Diagnose by failure mode (use the aimock /journal endpoint + docker logs showcase-iso<N>-aimock + DOM/probe text):
tool_result → backend fix (add tool handler, fix
agentId routing). See crewai-crews for the canonical example.toolCallId-gated narration fixture doesn't match → backend rewrites IDs
(Anthropic toolu_*, TanStack fc-*). Fix: swap toolCallId discriminator
for turnIndex (or hasToolResult / sequenceIndex) — backend-id-invariant.
See built-in-agent and claude-sdk-typescript for canonical match-key tunes.userMessage → add the entry following
the existing match-shape conventions.Layer boundaries:
response.content (canonical
narration is fixture-author truth — the d5 probe asserts on it; a real LLM won't
reliably emit the verbatim phrase). Tune match keys instead.tool_result. Don't expand
backend logic when a fixture match-key change suffices.langgraph-python's fixture/backend
pattern, not the sibling-of-the-day.aimock matcher semantics (for /v1/responses after responsesInputToMessages
transform; identical to /v1/chat/completions):
| Match key | Semantics |
|---|---|
userMessage | substring on the last role:"user" message |
toolCallId | strict equality vs last role:"tool" tool_call_id (FRAGILE — backend ID-rewrite breaks this) |
hasToolResult | boolean: any role:"tool" message present |
turnIndex | integer: count of role:"assistant" messages |
context | equality vs x-aimock-context header |
First-match-wins. Order entries specific-before-generic.
aimock caches fixtures at container startup. Editing a fixture in a live
stack requires docker restart showcase-iso<N>-aimock to reload. Fresh
--isolate slots cold-start aimock from the volume mount, so the first
post-edit run picks up the change automatically; warm-slot reuse will not.
--isolate slot collisions with foreign Docker projects. The slot registry
only knows about showcase-* compose projects. If a sibling project (e.g.,
ag2mm-*) owns the same host ports for the auto-picked slot, health checks
cross to the wrong containers and results misroute. Either pre-reserve the
conflicting slot dirs in ~/.local/state/copilotkit/showcase/slots/ or tear
down the foreign stack first.
Cell-color flip claims MUST be empirically value-tested via the production-equivalent control-plane path on ≥3 candidate cells before merge. No "should flip N." Use:
bin/showcase test <slug>:<feature> --d6 --isolate
(or --d5 --isolate for single-pill e2e). DO NOT use --direct for
value-test — it bypasses the queue/worker pipeline staging actually runs and
has misled investigations in the past.
No fixture rewrite from real-LLM record/replay. The canonical-phrase probe
is anti-record/replay-against-real-LLM by construction: a real LLM won't emit
the verbatim assertion phrase. Fixture content is authored truth; tune the
match keys to fire on the right turn.
Tagline: what test surface exists per demo. PASS=covered, WARN=partial, FAIL=none, STUB=needs aimock fixture.
| Demo | Manual QA | Vitest Unit | Playwright E2E (smoke) | Playwright E2E (interaction) | Per-Package E2E | Aimock Fixtures | CI Auto |
|---|---|---|---|---|---|---|---|
| Agentic Chat | PASS 17 packages | FAIL | PASS load + suggestions | WARN suggestion click only | PASS weather card, background change, multi-turn | WARN background, weather matches | WARN validate only (no Playwright in CI by default) |
| Human in the Loop | PASS 17 packages | FAIL | PASS load + suggestions | WARN suggestion click only | PASS step selector, approve/reject | WARN plan/steps/mars matches (text only, no interrupt) | WARN validate only |
| Tool Rendering | PASS 17 packages | FAIL | PASS load + suggestions | WARN suggestion click only | PASS WeatherCard with stats grid | WARN weather match (tool call) | WARN validate only |
| Gen UI (Tool-Based) | PASS 17 packages | FAIL | FAIL | FAIL | PASS sidebar, haiku card, pie/bar chart | FAIL no haiku-specific fixture | WARN validate only |
| Gen UI (Agent) | PASS 17 packages | FAIL | FAIL | FAIL | PASS task progress tracker, progress bar | FAIL no gen-ui-agent fixture | WARN validate only |
| Shared State (Read) | PASS 17 packages | FAIL | FAIL | FAIL | PASS recipe card, sidebar, pipeline | FAIL no shared-state fixture | WARN validate only |
| Shared State (Write) | PASS 17 packages (stub) | FAIL | FAIL | FAIL | PASS pipeline, deal CRUD, agent state writes | FAIL no shared-state fixture | WARN validate only |
| Shared State (Streaming) | PASS 17 packages (stub) | FAIL | FAIL | FAIL | PASS document editor, confirm/reject changes | FAIL no streaming fixture | WARN validate only |
| Sub-Agents | PASS 17 packages (stub) | FAIL | FAIL | FAIL | PASS travel planner, agent indicators, sections | FAIL no subagent fixture | WARN validate only |
| Feature | Manual QA | Vitest Unit | Playwright E2E (smoke) | Playwright E2E (interaction) | Aimock Fixtures | CI Auto |
|---|---|---|---|---|---|---|
| Sales Dashboard (page load) | FAIL | PASS extract-starter tests (on-demand extraction) | PASS header, 4 renderer pills | PASS pill switching, content verification | WARN sales/todo/deal matches | WARN validate + aimock-e2e (manual trigger) |
| Renderer Selector | FAIL | FAIL | PASS 4 pills visible, default selection | PASS mutual exclusion, content changes per mode | FAIL | WARN validate only |
| Tool-Based mode | FAIL | FAIL | PASS pipeline heading, KPI cards | PASS Add a deal, multiple deals, empty state | WARN sales/todo matches | WARN validate only |
| A2UI Catalog mode | FAIL | FAIL | PASS same pipeline content | FAIL | FAIL | WARN validate only |
| json-render mode | FAIL | FAIL | PASS fallback note + pipeline | FAIL | FAIL | WARN validate only |
| HashBrown mode | FAIL | FAIL | PASS pipeline content | FAIL | FAIL | WARN validate only |
| Depth | Probe Name | Cadence | What "Green" Means |
|---|---|---|---|
| D5 | e2e-demos | hourly :10 | All per-integration smoke probes pass |
| D6 | d6-all-pills-e2e | hourly :40 | All demo cells pass across every integration pill |
showcase/integrations/*/qa/*.md (153 files, 17 packages × 9 demos).showcase/scripts/__tests__/*.test.ts — registry/constraint/bundle/integration generators.showcase/scripts/__tests__/e2e/ — starter-e2e.spec.ts, demo-e2e.spec.ts, screenshots.spec.ts.showcase/integrations/*/tests/e2e/ — 9 demo specs per package; langgraph-python adds renderer-selector.spec.ts.showcase/aimock/ — feature-parity.json, smoke.json, d4/<slug>/, d6/<slug>/..github/workflows/showcase_*.yml (see Test-Gating Matrix below)./test-aimock comment.langgraph-python.This matrix documents which CI workflows fire on which triggers, what they test, and whether they gate merges.
Scope: all testing-related workflows (unit, integration, e2e, smoke) across the monorepo. Data read directly from .github/workflows/*.yml on the current branch.
| Workflow file | Name (CI UI) | Trigger | Path filter | Required? | What it tests |
|---|---|---|---|---|---|
.github/workflows/test_unit.yml | test / unit | push (main), pull_request (main), workflow_dispatch | paths-ignore: docs/**, README.md, examples/** | No | Vitest unit suite across Node 20/22/24 for all TS packages |
.github/workflows/test_unit-python-sdk.yml | test / unit / python-sdk | push (main), pull_request (main) | sdk-python/**, this workflow | No | pytest against sdk-python/ under Python 3.12 + Poetry |
.github/workflows/test_integration-runtime.yml | test / integration / runtime | push (main), pull_request (main), workflow_dispatch | packages/runtime/**, this workflow | No | Runtime server integration tests (Node, possibly others) |
.github/workflows/test_integration-docs.yml | test / integration / docs | push (main), pull_request | docs/** | No | Extracts code blocks from docs, runs them against aimock (model-name + doc-test) |
.github/workflows/test_e2e-dojo.yml | test / e2e / dojo | push (main), pull_request (main), workflow_dispatch | packages/**, sdk-python/**, this workflow, .changeset | No | ag-ui dojo end-to-end matrix on Depot runners |
.github/workflows/test_e2e-legacy-v1.yml | test / e2e / legacy-v1 | push (main), pull_request (main), workflow_dispatch | examples/**, this workflow, .changeset | No | Legacy v1.x examples (form-filling, travel, research-canvas, chat-with-your-data, state-machine) |
.github/workflows/showcase_validate.yml | Showcase: Validate | push (main), pull_request | showcase/**, examples/integrations/**/fixtures/**, scripts/doc-tests/fixtures/** | No | Build-pipeline Vitest + manifest/registry validation + shell build |
.github/workflows/test_e2e-showcase-on-demand.yml | test / e2e / showcase / on-demand | issue_comment (/test-aimock), workflow_dispatch | n/a (comment-gated) | No | aimock-backed Playwright E2E on demand per-package |
.github/workflows/test_smoke-starter.yml | test / smoke / starter | schedule (0 */6 * * *), workflow_run (publish / release), pull_request, workflow_dispatch | examples/integrations/**, this workflow | No | Docker-compose smoke for 12 starter integrations (build + curl) |
.github/workflows/test_smoke-starter-deployed.yml | test / smoke / starter-deployed | schedule (0 */6 * * *), workflow_run (Showcase: Build & Deploy), workflow_dispatch | n/a (scheduled / post-deploy) | No | Playwright E2E against live deployed starter URLs (@starter-health/-agent/-chat) |
packages/** (runtime/SDK): test / unit, test / integration, test / e2e / dojo, static / quality, static / check binaries, plus static / danger if packages/sdk-js/src/langgraph.ts is touched.sdk-python/**: test / unit / python-sdk, test / e2e / dojo, static / check binaries, plus static / danger if copilotkit/langgraph_agent.py is touched. test / unit also fires (paths-ignore does not exclude sdk-python).showcase/**: Showcase: Validate, test / unit (paths-ignore does not exclude showcase), static / quality, static / check binaries. No Playwright E2E runs automatically -- comment /test-aimock on the PR to trigger test / e2e / showcase / on-demand.examples/** (legacy v1.x): test / e2e / legacy-v1, static / check binaries. test / unit is excluded via paths-ignore.examples/integrations/** (starters): test / smoke / starter (Docker), Showcase: Validate (for fixtures only), static / check binaries.docs/**: test / integration / docs only. test / unit and static / quality are excluded via paths-ignore..github/workflows/**: each workflow that lists its own path in its trigger runs (most do). No single "workflows changed" catch-all.None of the workflows above are enforced as required status checks. The active PROTECT_OUR_MAIN ruleset on main requires zero status contexts -- merges are gated only by review approval, not by CI outcome.
The legacy classic branch protection required_status_checks.contexts array contains stale entries (e.g. test / unit, Showcase: Validate) that appear in GitHub's API responses but are not evaluated by the active ruleset. If you see a CI workflow marked as "required" in an older doc or script, treat that as ghost data: nothing in GitHub's current enforcement path consumes it.
Practical consequence: a red CI run does not block merge. Reviewers must eyeball gh pr checks before approving.
test / unit matrix is Node 20/22/24; the other workflows pin a single Node version each (22 most common).test / e2e / dojo uses Depot runners (depot-ubuntu-24.04); all others use standard GitHub runners.test / smoke / starter and test / smoke / starter-deployed both run every 6h; the former validates Docker-build integrity of examples/integrations/, the latter validates the deployed Railway services.workflow_run triggers fire after another workflow completes -- they do not gate the triggering PR, they run post-merge.