docs/plans/2026-02-24-e2e-infrastructure-design.md
Date: 2026-02-24

Status: Approved

Goal: Deterministic browser-level E2E tests for the IronClaw web gateway using Python + Playwright, with a mock LLM backend for CI reliability.
| Decision | Choice | Rationale |
|---|---|---|
| Assertion style | Deterministic DOM-first | Claude vision optional later; DOM assertions are fast, cheap, reliable |
| Language | Python + pytest + Playwright | Rich browser automation ecosystem, async/await, separate from Rust tests |
| LLM backend | Mock HTTP server | Canned OpenAI-compat responses; deterministic, fast, zero cost |
| Initial scope | 3 scenarios | Connection + Chat + Skills; covers highest-bug-rate areas |
| Architecture | Subprocess + Playwright | Tests the real binary end-to-end; proven pattern from existing ws_gateway tests |

                         pytest
                            |
                 +----------+-----------+
                 |                      |
            mock_llm.py          ironclaw binary
        (canned responses)  (cargo build --features libsql)
         127.0.0.1:{port}        127.0.0.1:{port}
                 |                      |
                 +----------+-----------+
                            |
                       Playwright
                   (headless Chromium)
                     DOM assertions

Flow:

    tests/e2e/
      conftest.py              # pytest fixtures: build binary, start ironclaw, mock LLM, browser
      mock_llm.py              # OpenAI-compat HTTP server with canned responses
      helpers.py               # Shared utilities (wait_for_ready, selectors)
      scenarios/
        __init__.py
        test_connection.py     # Auth, tab navigation, connection status
        test_chat.py           # Send message, SSE streaming, response rendering
        test_skills.py         # Search, install, remove lifecycle
      pyproject.toml           # Dependencies
      README.md                # How to run locally and in CI

A minimal async HTTP server that speaks the OpenAI Chat Completions API.
Endpoint: POST /v1/chat/completions
Behavior:
- Parses the `messages` array from the request body
- Returns a `ChatCompletionResponse` with `id`, `choices[0].message`, `usage`
- Supports `stream: true` by returning SSE chunks with `delta` objects (critical: IronClaw streams responses via SSE to the browser)

Canned response table:

| Pattern (regex) | Response |
|---|---|
| `hello\|hi\|hey` | Hello! How can I help you today? |
| `2\+2\|2 \+ 2\|two plus two` | The answer is 4. |
| `skill\|install` | I can help you with skills management. |
| `.*` (default) | I understand your request. |
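
The table lookup above can be sketched in a few lines. This is a minimal illustration, not the actual `mock_llm.py`: the name `pick_response`, matching against the most recent user message, first-match-wins ordering, and case-insensitive search are all assumptions (the last one is suggested by the chat scenario sending "Hello" against the lowercase `hello` pattern).

```python
import re

# Ordered (pattern, response) pairs copied from the table; first match wins.
CANNED = [
    (r"hello|hi|hey", "Hello! How can I help you today?"),
    (r"2\+2|2 \+ 2|two plus two", "The answer is 4."),
    (r"skill|install", "I can help you with skills management."),
    (r".*", "I understand your request."),
]


def pick_response(messages):
    """Return the canned reply for the most recent user message.

    Assumes each message is a dict with "role" and a plain-string "content".
    """
    last_user = next(
        (m.get("content") or "" for m in reversed(messages) if m.get("role") == "user"),
        "",
    )
    for pattern, reply in CANNED:
        # Case-insensitive search is an assumption, not stated in the design.
        if re.search(pattern, last_user, re.IGNORECASE):
            return reply
    return CANNED[-1][1]
```

Because the patterns are unanchored, `hi` also matches inside words such as "this". If that ever causes a scenario to pick the greeting by accident, adding `\b` word boundaries to the first pattern would be one way to tighten it.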
Streaming format:

    data: {"id":"mock-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant","content":"The "},"finish_reason":null}]}
    data: {"id":"mock-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"answer is 4."},"finish_reason":null}]}
    data: {"id":"mock-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
    data: [DONE]

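
As an illustration of how a canned reply could be turned into the stream shown above, here is a small generator. The function name `sse_chunks` and the word-by-word split are assumptions made for the sketch; the design only fixes the chunk shape and the final `[DONE]` sentinel. Each event is terminated by a blank line, as Server-Sent Events framing requires.

```python
import json


def sse_chunks(text, completion_id="mock-1"):
    """Yield SSE-framed chat.completion.chunk events for `text`, then [DONE]."""

    def frame(delta, finish_reason=None):
        payload = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
        }
        # One "data:" line per event, followed by the blank line that ends it.
        return f"data: {json.dumps(payload, separators=(',', ':'))}\n\n"

    # Split on spaces but keep them, so the pieces concatenate back to `text`.
    words = text.split(" ")
    pieces = [w + " " for w in words[:-1]] + words[-1:]

    for i, piece in enumerate(pieces):
        delta = {"content": piece}
        if i == 0:
            # Only the first chunk carries the assistant role.
            delta = {"role": "assistant", **delta}
        yield frame(delta)

    yield frame({}, finish_reason="stop")
    yield "data: [DONE]\n\n"
```

A streaming handler would write each yielded string to the response body in order; a non-streaming handler can ignore this helper and return the full text in `choices[0].message`.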
Implementation: aiohttp.web (async, lightweight). No tool call support needed for initial 3 scenarios.
Health check: GET /v1/models returns {"data": [{"id": "mock-model"}]}.
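
A fixture can use this health check to decide when the mock server is up. The helper below is a sketch under assumptions: the name `wait_until_ready` and its parameters are hypothetical, and it uses only the standard library rather than `httpx` to stay self-contained. Proxies are bypassed explicitly because the target is always a loopback address.

```python
import json
import time
import urllib.request

# Loopback targets should never go through an HTTP proxy from the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_until_ready(url, timeout=10.0, interval=0.1):
    """Poll `url` until it returns HTTP 200 with a JSON body, then return it.

    Raises TimeoutError if the endpoint is not ready within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            with _OPENER.open(url, timeout=1.0) as resp:
                if resp.status == 200:
                    return json.load(resp)
        except (OSError, ValueError) as exc:
            # OSError covers connection refused and URLError; ValueError covers bad JSON.
            last_error = exc
        time.sleep(interval)
    raise TimeoutError(f"{url} not ready after {timeout}s (last error: {last_error})")
```

For the mock LLM this would be called as `wait_until_ready(f"{url}/v1/models", timeout=10)`; the same helper could poll the gateway's `GET /api/status` with a 60s timeout.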
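ironclaw_binary

- Checks whether `./target/debug/ironclaw` exists
- If it does not, runs `cargo build --no-default-features --features libsql`

mock_llm_server

- Starts `mock_llm.py` as a subprocess on 127.0.0.1:0 (OS-assigned port)
- Reads the port from stdout (`Mock LLM listening on 127.0.0.1:{port}`)
- Polls `GET /v1/models` until ready (timeout 10s)
- Yields `(process, url)`

ironclaw_server(ironclaw_binary, mock_llm_server)

- Starts the binary with the following environment: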

      GATEWAY_ENABLED=true
      GATEWAY_HOST=127.0.0.1
      GATEWAY_PORT=0
      GATEWAY_AUTH_TOKEN=e2e-test-token
      GATEWAY_USER_ID=e2e-tester
      CLI_ENABLED=false
      LLM_BACKEND=openai_compatible
      LLM_BASE_URL={mock_llm_url}
      LLM_MODEL=mock-model
      DATABASE_BACKEND=libsql
      LIBSQL_PATH=:memory:
      SANDBOX_ENABLED=false
      SKILLS_ENABLED=true
      ROUTINES_ENABLED=false
      HEARTBEAT_ENABLED=false

- Reads the port from stdout (`Gateway listening on 127.0.0.1:XXXX`)
- Polls `GET /api/status` until ready (timeout 60s)
- Yields the base URL (`http://127.0.0.1:{port}`)

page(ironclaw_server)

- Navigates to `{base_url}/?token=e2e-test-token`
- Yields the Playwright `Page` object

test_connection.py

Tests auth, initial page load, and tab switching.

test_page_loads_and_connects:
1. Assert page title or main container is visible
2. Assert connection status indicator shows "Connected" (or equivalent)
3. Assert all 6 tab buttons visible: Chat, Memory, Jobs, Routines, Extensions, Skills
test_tab_navigation:
1. For each tab in [Chat, Memory, Jobs, Routines, Extensions, Skills]:
   a. Click the tab button
   b. Assert the corresponding panel container becomes visible
   c. Assert no error toasts appear
2. Return to Chat tab
3. Assert chat input is visible and focusable
test_auth_rejection:
1. Navigate to base_url without token (no ?token= param)
2. Assert auth screen / login prompt appears (not the main app)

test_chat.py

Tests the full message flow: user input -> gateway -> mock LLM -> SSE -> browser rendering.

test_send_message_and_receive_response:
1. Locate chat input element
2. Type "What is 2+2?"
3. Press Enter (or click Send button)
4. Wait for assistant message to appear (timeout 15s)
5. Assert user message bubble contains "What is 2+2?"
6. Assert assistant message bubble contains "4"
7. Assert no error toasts visible
test_multiple_messages:
1. Send "Hello"
2. Wait for response containing "Hello" or "help"
3. Send "What is 2+2?"
4. Wait for response containing "4"
5. Assert message count >= 4 (2 user + 2 assistant)
test_empty_message_not_sent:
1. Focus chat input
2. Press Enter with empty input
3. Assert no new messages appear after 2s

test_skills.py

Tests ClawHub search, install, and remove through the browser UI.

Note: The ClawHub registry blocks non-browser TLS fingerprints, but Playwright drives a real browser, so registry requests made from the page succeed. Tests are skipped if ClawHub is unreachable.

test_skills_tab_visible:
1. Click Skills tab
2. Assert skills panel is visible
3. Assert search input is present
test_skills_search:
1. Click Skills tab
2. Type "markdown" in search input
3. Click Search (or press Enter)
4. Wait for results (timeout 15s)
5. Assert at least one result card is visible
6. Assert result cards contain: name, version, description fields
test_skills_install_and_remove:
1. Search for a skill
2. Override window.confirm to auto-accept: page.evaluate("window.confirm = () => true")
3. Click Install on first result
4. Wait for installed skills list to update (timeout 15s)
5. Assert skill appears in installed section
6. Click Remove on the installed skill
7. Wait for installed section to update
8. Assert skill is gone from installed list
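
The "skipped if ClawHub is unreachable" condition for this scenario could be decided with a plain TCP probe. This is only a sketch: the helper name `is_reachable` is hypothetical, and the registry hostname is not given in this design, so it is left as a parameter. A TCP connect does not perform a TLS handshake, so fingerprint-based blocking should not affect it; whether a successful connect is a sufficient signal is an assumption.

```python
import socket


def is_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

In `test_skills.py` the result could feed `pytest.mark.skipif(not is_reachable(host), reason="ClawHub unreachable")`, evaluated once at import time so the whole module is skipped together.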
IronClaw logs `Gateway listening on 127.0.0.1:XXXX` at startup. The fixture reads stdout line by line until it finds this pattern, then extracts the port.
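
    import asyncio
    import re
    import time

    async def wait_for_port(process, pattern=r"Gateway listening on .+:(\d+)", timeout=60):
        """Read process stdout until we find the listening port."""
        deadline = time.monotonic() + timeout
        while (remaining := deadline - time.monotonic()) > 0:
            try:
                line = await asyncio.wait_for(process.stdout.readline(), timeout=remaining)
            except asyncio.TimeoutError:
                break  # fall through to the TimeoutError below
            if not line:  # EOF: stdout closed, so the process has most likely exited
                raise RuntimeError("process exited before reporting a listening port")
            if match := re.search(pattern, line.decode(errors="replace")):
                return int(match.group(1))
        raise TimeoutError("ironclaw did not report listening port")
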
Same pattern for the mock LLM server.
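
To make the two patterns concrete, the sketch below applies them to sample log lines. The helper `extract_port`, the constant names, and the sample lines (including the timestamp prefix) are made up for illustration; the mock LLM pattern is inferred from the `Mock LLM listening on 127.0.0.1:{port}` message and is an assumption.

```python
import re

GATEWAY_PATTERN = r"Gateway listening on .+:(\d+)"
MOCK_LLM_PATTERN = r"Mock LLM listening on .+:(\d+)"  # assumed, mirrors the gateway pattern


def extract_port(line, pattern):
    """Return the port captured by `pattern` in `line`, or None if it does not match."""
    match = re.search(pattern, line)
    return int(match.group(1)) if match else None


# Hypothetical log lines; real output may carry different prefixes.
gateway_line = "2026-02-24T06:00:00Z INFO Gateway listening on 127.0.0.1:49152"
mock_line = "Mock LLM listening on 127.0.0.1:50123"

gateway_port = extract_port(gateway_line, GATEWAY_PATTERN)
mock_port = extract_port(mock_line, MOCK_LLM_PATTERN)
unrelated = extract_port("Loading skills registry", GATEWAY_PATTERN)
```

Note that `.+` is greedy, so the capture is taken after the last colon that is followed by digits on the line. If a log line ever appends something like `pid:123` after the address, a stricter pattern such as `[\d.]+:(\d+)` would be safer.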

    # tests/e2e/pyproject.toml
    [project]
    name = "ironclaw-e2e"
    version = "0.1.0"
    requires-python = ">=3.11"
    dependencies = [
        "pytest>=8.0",
        "pytest-asyncio>=0.23",
        "pytest-timeout>=2.2",  # provides the --timeout flag used in the CI run step
        "playwright>=1.40",
        "aiohttp>=3.9",
        "httpx>=0.27",
    ]

    [project.optional-dependencies]
    vision = [
        "anthropic>=0.40",
    ]


    # .github/workflows/e2e.yml
    name: E2E Tests
    on:
      schedule:
        - cron: "0 6 * * 1"  # Weekly Monday 6 AM UTC
      workflow_dispatch:
      pull_request:
        paths:
          - 'src/channels/web/**'
          - 'tests/e2e/**'
    jobs:
      e2e:
        runs-on: ubuntu-latest
        timeout-minutes: 30
        steps:
          - uses: actions/checkout@v4
          - uses: dtolnay/rust-toolchain@stable
          - uses: actions/cache@v4
            with:
              path: target
              key: e2e-${{ hashFiles('Cargo.lock') }}
          - name: Build ironclaw
            run: cargo build --no-default-features --features libsql
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
          - name: Install E2E dependencies
            run: |
              cd tests/e2e
              pip install -e .
              playwright install chromium
          - name: Run E2E tests
            run: pytest tests/e2e/ -v --timeout=120

Trigger policy: Weekly + manual + PRs touching web gateway or E2E tests. Not on every PR.
Claude vision assertions are not in the initial scope. The design accommodates them via:

- A `conftest.py` fixture `claude_vision` wrapping `anthropic.Anthropic()`
- A helper `assert_visually(page, prompt)`: takes a screenshot, sends it to the Claude vision API, and asserts on the response
- Tests marked `@pytest.mark.vision`, which only run when `ANTHROPIC_API_KEY` is set

`pytest tests/e2e/ -v` passes locally with a pre-built ironclaw binary