showcase/RUNBOOK.md
Operational documentation for debugging and fixing showcase D5 probe failures locally. Intended audience: engineers and AI agents working on showcase integrations.
ALWAYS use bin/showcase for all operations. Never raw docker compose or docker build.
bin/showcase up <slug> # start container
bin/showcase rebuild <slug> # code changes (new image)
bin/showcase test <slug> --d5 # run D5 probe
bin/showcase test <slug> --d6 --isolate <name> # canonical d6 verification
rebuild handles symlink dereferencing that raw docker build cannot (tools/ and shared-tools/ are symlinks to ../../shared/).recreate for env/config changes (same image, new container).rebuild for code changes (new image).To verify an integration's D6 state, run:
bin/showcase test <slug> --d6 --isolate <name>
This is THE default way to verify a slug. --isolate <name> brings up a fully
isolated stack — its own aimock + PocketBase + dashboard + integration +
harness control-plane and pool-worker — on offset ports in its own docker
compose project, then runs the canonical harness/src/probes/drivers/d6-all-pills.ts
driver: it enqueues per-pill jobs, the isolated worker claims them, and asserts
per-pill. This is identical to the non-isolate path — the same driver, the
same per-pill assertions — just on isolated ports/project so it doesn't disturb
the shared long-lived showcase-* stack.
Verifiers use THIS flow rather than hand-driving the browser. Hand-driving breaks the identical-tests invariant: the whole point is that the same driver runs and asserts per-pill the same way across every integration, so a result is comparable to every other integration and to production. Manual clicking is non-reproducible and tests something subtly different per run.
See Isolated Verification Runs (--isolate)
below for the full mechanics and cleanup.
D5 fixtures use hasToolResult (not turnIndex) for multi-turn disambiguation.
| Field | Meaning |
|---|---|
hasToolResult: false | First LLM call -- no tool result in messages yet |
hasToolResult: true | Follow-up call -- tool result present |
turnIndex counts assistant messages, which varies across frameworks. Do not use turnIndex in new fixtures.
The mcp-subagents supervisor chain should also avoid turnIndex; chain each
follow-up on the prior tool call's toolCallId so repeated and interleaved pill
clicks replay the same way.
showcase/harness/fixtures/d5/*.json -- edit these, then rebuild bundle.showcase/aimock/d5-all.json -- aggregate of all source fixtures. Rebuild after any source edit.Add --log-level debug to the aimock command in docker-compose.local.yml to see fixture match/miss per request.
These are canonical. Do not deviate.
Backend agent has tools=[] (no backend tools). Frontend registers tools via useHumanInTheLoop or useFrontendTool. CopilotKit injects frontend tool definitions into the LLM call. Every HITL integration follows this pattern without exception.
langgraph-python uses the chart pattern (useComponent with render_pie_chart). All other integrations should also demonstrate meaningful custom generative UI. Do not replace charts/data-viz with trivial text-only components just to pass tests.
Frontend registers useRenderTool for get_weather. The v2 API uses parameters (not args) in the render callback. Backend has the actual tool.
Backend calls set_notes tool, must forward tool result back to LLM for the follow-up text response. Frameworks that don't auto-cycle (crewai, langroid) need explicit tool-execution loops.
All providers must be routed through aimock. Required env vars in x-integration-defaults:
OPENAI_API_KEY / OPENAI_BASE_URL -> http://aimock:4010/v1
ANTHROPIC_API_KEY / ANTHROPIC_BASE_URL -> http://aimock:4010
GOOGLE_API_KEY / GOOGLE_GEMINI_BASE_URL -> http://aimock:4010
SPRING_AI_OPENAI_BASE_URL -> http://aimock:4010
Missing any of these means that provider's integrations bypass aimock and hit real APIs (or fail with an empty key).
docker compose -f showcase/docker-compose.local.yml psdocker logs showcase-<slug> --tail 30--log-level debug to compose commandbin/showcase test <slug> --d5 --verbosedocker logs showcase-aimock 2>&1 | grep "Fixture matched\|No fixture"hasToolResult, userMessage substring matchingThe harness emits structured logs at INFO level for probe lifecycle events:
probe.tick-start / probe.tick-complete — probe run lifecycleprobe.target-start / probe.target-complete — per-service resultsprobe.d6-all-pills.service-start / probe.d6-all-pills.service-complete — D5 per-serviceprobe.d6-all-pills.feature-complete — per-feature pass/fail with error detailsprobe.run-summary — single line with all service resultsprobe.d6-all-pills.pool-abort-release — browser pool starvation eventsView with: RAILWAY_PROJECT_ID=6f8c6bff-a80d-4f8f-b78d-50b32bcf4479 railway logs --service showcase-harness --tail 200
For detailed conversation-runner traces (selector resolution, DOM text extraction, settle polling, per-turn lifecycle), set LOG_LEVEL=debug on the showcase-harness Railway service. This enables console.debug(...) output from the conversation runner and D5 scripts.
To enable temporarily: set the env var in Railway dashboard → showcase-harness → Variables → LOG_LEVEL=debug. The service auto-restarts. Remember to unset after debugging — debug output is verbose.
curl -sf -X POST "https://showcase-harness-production.up.railway.app/api/probes/probe:d6-all-pills-e2e/trigger" \
-H "Authorization: Bearer $OPS_TRIGGER_TOKEN" \
-H "Content-Type: application/json"
Retrieve OPS_TRIGGER_TOKEN: RAILWAY_PROJECT_ID=6f8c6bff-a80d-4f8f-b78d-50b32bcf4479 railway variables --service showcase-harness --json | python3 -c "import json,sys; print(json.load(sys.stdin)['OPS_TRIGGER_TOKEN'])"
Rate limit: 5 minutes per probe ID.
When package.json changes (new deps, version bumps), volume mounts don't cover node_modules. You MUST rebuild the Docker image: bin/showcase rebuild <slug>, then re-test. A passing bin/showcase test against a volume-mounted container does NOT validate the build.
--isolate)--isolate <name> is the default way to verify a slug (see Verifying a Slug's
D6 State). It brings up a fully
isolated stack on offset ports in its own docker compose project and runs the
canonical d6-all-pills driver against it — identical to the non-isolate path,
just namespaced so it never touches the shared showcase-* stack.
bin/showcase test <slug> --d6 --isolate <name>
<name>: names the isolated compose project. It must start with a
lowercase letter or digit, followed by lowercase letters, digits, - or _
([a-z0-9][a-z0-9_-]*, a docker compose project-name constraint); uppercase
is normalized to lowercase with a warning. The name showcase is reserved —
it is the default stack's own compose project name, so the CLI refuses it.
Use a distinct name per run so concurrent runs never collide.(slot + 1) * 200 — slot 0 → +200,
slot 1 → +400, and so on. Do not assign slots or offsets manually; the
CLI claims and frees them for you. Up to 46 concurrent isolated runs are
supported.<name> compose project.showcase-*
project is left completely alone, so an isolated run is safe to launch
alongside it (or alongside other isolated runs).[email protected] / showcase-local-dev) matches the
POCKETBASE_SUPERUSER_EMAIL the compose stack seeds, so a fresh isolated PB
authenticates with no manual setup. (A mismatch here is what previously 400'd
on pb-auth and left the control-plane enqueuing zero jobs.)docker-compose.local.yml and shared/local-ports.json into a per-run
scratch dir at
${XDG_STATE_HOME:-$HOME/.local/state}/copilotkit/showcase/runs/<name>/.
If the process crashes, the originals are untouched.Per-pill FAILs in an isolated run reflect real demo/feature issues for that integration — not artifacts of isolation. The driver runs and asserts per-pill identically across every integration, so a FAIL is the same signal you'd get from the shared stack or production for that pill.
By default an isolated run tears its stack down on exit (via trap EXIT) and
frees its slot automatically — the normal, no-cleanup-needed case. Each fresh
run gets a clean PocketBase volume, which is what makes pb-auth deterministic.
To leave the stack up for inspection after the run, pass --keep: the isolated
stack survives the run (success or failure), and the run dir and slot are
preserved — the live containers keep the slot from being reaped. That
protection lasts only while the containers are RUNNING: if they stop (manual
docker stop, a daemon restart, a host reboot), the next isolate run's sweep
reclaims the slot — composing the stopped containers and named volumes down and
removing the run dir. Inspect a kept stack before stopping it, and don't expect
it to survive a reboot. At exit the CLI prints a survival notice with the
stack's host ports (aimock, dashboard, PocketBase) and the manual teardown
command:
docker compose -p <name> down --remove-orphans --volumes && rm -rf <run-dir> <slot-dir>
The teardown includes --volumes: isolated stacks are ephemeral, so the
project-scoped named volumes (e.g. <name>_showcase-pb-data) are removed along
with the containers and networks — the same flags the automatic (non---keep)
teardown uses, so a kept stack leaves nothing behind once torn down. The
rm -rf clears the per-run scratch dir and the slot reservation (the notice
prints the real paths).
If a run was killed with SIGKILL (so the trap never fired), the next isolate
run auto-reaps the dead slot (a slot whose compose project has no live
containers is reclaimed). To clean up the slot reservations manually:
rm -rf "${XDG_STATE_HOME:-$HOME/.local/state}/copilotkit/showcase/slots"/*
.iso-bak files: if you see docker-compose.local.yml.iso-bak or
local-ports.json.iso-bak, those are leftovers from an old isolate behavior
that mutated files in place. The current code auto-restores them on startup,
but you can also clean up manually with git checkout on the affected files.Earned by bugs. Do not repeat.
turnIndex in new fixtures. Use hasToolResult.docker build. Symlinks break. Use bin/showcase rebuild.When adding or modifying fixture files in showcase/aimock/, the showcase-aimock image must be rebuilt so production picks up the changes. CI handles this automatically -- any push to main that touches showcase/aimock/** triggers the Build & Deploy workflow to rebuild and redeploy the image.
For manual iteration (e.g. testing a fixture change before merging), build and push directly:
docker build --platform linux/amd64 -f showcase/aimock/Dockerfile -t ghcr.io/copilotkit/showcase-aimock:latest showcase/aimock/ --push
After pushing, redeploy the Railway service so it pulls the new image (the CI workflow does this automatically via serviceInstanceRedeploy).
When adding a new fixture file, update both:
showcase/docker-compose.local.yml -- add a volume mount for the new fileshowcase/aimock/Dockerfile -- add a COPY line for the new fileEach integration service bind-mounts its host src/ directory into the container via the volumes entry in docker-compose.local.yml:
volumes:
- ./integrations/<slug>/src:/app/src
This means source edits take effect on container restart without rebuilding the Docker image. The workflow becomes:
showcase/integrations/<slug>/src/bin/showcase restart <slug>bin/showcase test <slug> --d5Use bin/showcase rebuild <slug> only when you change dependencies (requirements.txt, package.json) or non-src files (Dockerfile, entrypoint). For pure src/ changes, restart is sufficient and much faster.