packages/app-core/src/benchmark/README.md
HTTP bridge exposing the Eliza runtime to Python benchmark runners.
Python Benchmark Runner
| (imports eliza-adapter)
eliza-adapter (Python client)
| (HTTP requests)
server.ts (this directory)
| (canonical message pipeline)
elizaOS AgentRuntime
This directory contains:
| File | Purpose |
|---|---|
| `server.ts` | HTTP server for benchmark traffic. Initializes `AgentRuntime`, handles benchmark sessions, and routes each message through `runtime.messageService.handleMessage(...)`. |
| `mock-plugin-base.ts` | Tracked deterministic mock plugin used by benchmark unit tests and CI smoke checks. |
| `mock-plugin.ts` | Optional local override (gitignored) loaded first when `ELIZA_BENCH_MOCK=true`. |
| `TESTING_PROTOCOL.md` | Benchmark action/testing protocol (required checks + CUA-bench compatibility commands). |
The Python client side can live in a local adapter directory such as benchmarks/eliza-adapter/.
# from the eliza package root
npm run benchmark:server
# or directly
node --import tsx src/benchmark/server.ts
The server prints ELIZA_BENCH_READY port=<port> when ready.
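
A runner that spawns the server itself can wait for this line before sending traffic. The following is a minimal Python sketch, assuming the line has exactly the form `ELIZA_BENCH_READY port=<port>` with a decimal port. `parse_ready_port` is a hypothetical helper for illustration, not part of eliza-adapter.

```python
import re
from typing import Optional

# Assumed format, based on the line the server prints:
# "ELIZA_BENCH_READY port=<port>"
_READY_RE = re.compile(r"ELIZA_BENCH_READY port=(\d+)")


def parse_ready_port(line: str) -> Optional[int]:
    """Return the announced port, or None if this is not a readiness line."""
    match = _READY_RE.search(line)
    return int(match.group(1)) if match else None
```

A runner could read the server's stdout line by line and proceed once this helper returns a port.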
# benchmark-focused unit tests
bunx vitest run src/benchmark/*.test.ts
# watch a live benchmark smoke run end-to-end
bun run benchmark:watch
# watch live CUA execution in LUME VM (requires CUA_HOST + model credentials)
CUA_HOST=localhost:8000 OPENAI_API_KEY=sk-... CUA_COMPUTER_USE_MODEL=computer-use-preview bun run benchmark:cua:watch
# see the full benchmark testing/checklist protocol
cat src/benchmark/TESTING_PROTOCOL.md
GET /api/benchmark/health

Returns readiness and runtime metadata.
{ "status": "ready", "agent_name": "Kira", "plugins": 3 }
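
A client might poll this endpoint until the server reports `"status": "ready"`. The sketch below assumes that any other status value, or a connection error, means "not ready yet"; the README does not list other status values. `fetch_health` and `wait_until_ready` are hypothetical names.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str) -> dict:
    """GET /api/benchmark/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/benchmark/health", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(fetch: Callable[[], dict], attempts: int = 30, delay: float = 1.0) -> dict:
    """Call fetch() until it reports status == "ready" or attempts run out."""
    for _ in range(attempts):
        try:
            health = fetch()
            if health.get("status") == "ready":
                return health
        except OSError:
            # Connection refused or timed out: the server may still be starting.
            pass
        time.sleep(delay)
    raise TimeoutError("benchmark server did not report ready")
```

Passing the fetch function as an argument keeps the retry loop independent of the transport, for example `wait_until_ready(lambda: fetch_health("http://localhost:3939"))`.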
POST /api/benchmark/reset

Starts a fresh benchmark session (new room/user context).
Request:
{ "task_id": "webshop-42", "benchmark": "agentbench" }
Response:
{ "status": "ok", "room_id": "<uuid>", "task_id": "webshop-42", "benchmark": "agentbench" }
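
As an illustration of the reset call, here is a minimal Python sketch using only the standard library. The `Content-Type: application/json` header and the helper names `build_reset_request` and `reset_session` are assumptions; only the path and the JSON fields come from the example above.

```python
import json
import urllib.request


def build_reset_request(base_url: str, task_id: str, benchmark: str) -> urllib.request.Request:
    """Build (but do not send) a POST /api/benchmark/reset request."""
    body = json.dumps({"task_id": task_id, "benchmark": benchmark}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/benchmark/reset",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def reset_session(base_url: str, task_id: str, benchmark: str) -> str:
    """Send the reset request and return the room_id of the new session."""
    req = build_reset_request(base_url, task_id, benchmark)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))["room_id"]
```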
GET /api/benchmark/cua/status

Returns CUA service status when CUA benchmark mode is enabled (ELIZA_ENABLE_CUA=1).

POST /api/benchmark/cua/run

Runs a live CUA task in the configured LUME/cloud sandbox.
Request:
{
  "goal": "Open ChatGPT and close extra tabs",
  "room_id": "optional-room-id",
  "auto_approve": true,
  "include_screenshots": false
}
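
The request body could be assembled as in the sketch below. It assumes `room_id` may be omitted, as the placeholder value suggests, and reuses the example's values as defaults for `auto_approve` and `include_screenshots`; the server's own defaults are not documented here. `build_cua_run_payload` is a hypothetical helper.

```python
from typing import Optional


def build_cua_run_payload(
    goal: str,
    room_id: Optional[str] = None,
    auto_approve: bool = True,
    include_screenshots: bool = False,
) -> dict:
    """Build the JSON body for POST /api/benchmark/cua/run."""
    payload = {
        "goal": goal,
        "auto_approve": auto_approve,
        "include_screenshots": include_screenshots,
    }
    # Assumption: room_id is optional, so it is only sent when provided.
    if room_id is not None:
        payload["room_id"] = room_id
    return payload
```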
GET /api/benchmark/cua/screenshot

Captures the current sandbox screenshot (base64 PNG) via the CUA service.

POST /api/benchmark/message

Sends benchmark input through the canonical message pipeline.
Request:
{
  "text": "Find a laptop under $500",
  "context": {
    "benchmark": "agentbench",
    "task_id": "webshop-42",
    "goal": "Buy a laptop under $500",
    "observation": { "page": "search results" },
    "action_space": ["search[query]", "click[id]", "buy[id]"]
  }
}
Response:
{
  "text": "Searching for options under $500...",
  "thought": "I should issue a search action first",
  "actions": ["BENCHMARK_ACTION"],
  "params": { "command": "search[laptop under $500]" }
}
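
On the runner side, the command to execute can be read from a response of this shape. The sketch below assumes the command always sits under `params.command` and is only meaningful when `BENCHMARK_ACTION` appears in `actions`, as in the example; `extract_command` is a hypothetical helper.

```python
from typing import Optional


def extract_command(response: dict) -> Optional[str]:
    """Return params["command"] when the agent chose BENCHMARK_ACTION, else None."""
    actions = response.get("actions") or []
    if "BENCHMARK_ACTION" not in actions:
        return None
    command = (response.get("params") or {}).get("command")
    # Guard against a missing or non-string command.
    return command if isinstance(command, str) else None
```

Returning `None` instead of raising lets the runner decide how to score a step in which the agent produced no benchmark action.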
| Environment variable | Default | Description |
|---|---|---|
| `ELIZA_BENCH_PORT` | `3939` | Port to listen on |
| `COMPUTER_USE_ENABLED` | unset | Set to `1` to load the local computeruse plugin |
| `ELIZA_ENABLE_CUA` | unset | If set (or if the CUA environment is configured), loads `@elizaos/plugin-cua` |
| `CUA_HOST` | unset | Local CUA/LUME host (e.g. `localhost:8000`) |
| `CUA_API_KEY` + `CUA_SANDBOX_NAME` | unset | Cloud CUA mode; alternative to `CUA_HOST` |
| `CUA_COMPUTER_USE_MODEL` | auto | Set to `computer-use-preview` to force the OpenAI computer-use runner |
| `ELIZA_BENCH_MOCK` | unset | Set to `true` to enable the inline mock benchmark plugin |
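
A Python runner can mirror the port default when deriving the server URL. This is a minimal sketch: the `localhost` host and the helper name `bench_base_url` are assumptions, while the variable name and the `3939` default come from the table above.

```python
import os
from typing import Mapping

DEFAULT_BENCH_PORT = 3939  # default listed for ELIZA_BENCH_PORT


def bench_base_url(env: Mapping[str, str] = os.environ, host: str = "localhost") -> str:
    """Build the benchmark server base URL from ELIZA_BENCH_PORT, falling back to 3939."""
    # An unset or empty variable falls back to the documented default.
    port = int(env.get("ELIZA_BENCH_PORT") or DEFAULT_BENCH_PORT)
    return f"http://{host}:{port}"
```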
`context` is attached to the prompt context for each benchmark step.

`actions` and `params` are extracted from `responseContent` for runner-side evaluation.