Eliza Benchmark Server

HTTP bridge exposing the Eliza runtime to Python benchmark runners.

Architecture

```
Python Benchmark Runner
    |  (imports eliza-adapter)
eliza-adapter (Python client)
    |  (HTTP requests)
server.ts (this directory)
    |  (canonical message pipeline)
elizaOS AgentRuntime
```

This directory contains:

| File | Purpose |
| --- | --- |
| server.ts | HTTP server for benchmark traffic. Initializes AgentRuntime, handles benchmark sessions, and routes each message through runtime.messageService.handleMessage(...). |
| mock-plugin-base.ts | Tracked deterministic mock plugin used by benchmark unit tests and CI smoke checks. |
| mock-plugin.ts | Optional local override (gitignored) loaded first when ELIZA_BENCH_MOCK=true. |
| TESTING_PROTOCOL.md | Benchmark action/testing protocol (required checks + CUA-bench compatibility commands). |

The Python client side can live in a local adapter directory such as benchmarks/eliza-adapter/.

Start the server

```bash
# from the eliza package root
npm run benchmark:server

# or directly
node --import tsx src/benchmark/server.ts
```

The server prints ELIZA_BENCH_READY port=<port> when ready.
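A runner that launches the server itself can watch stdout for that readiness line before sending traffic. The following is a minimal sketch, not the adapter's actual implementation: the helper names parse_ready_port and wait_for_ready are hypothetical, and it assumes the port is printed as a plain decimal integer.

```python
import re
from typing import Iterable, Optional

# Assumption: the readiness line contains "ELIZA_BENCH_READY port=<digits>".
READY_PATTERN = re.compile(r"ELIZA_BENCH_READY port=(\d+)")


def parse_ready_port(line: str) -> Optional[int]:
    """Return the announced port if this line is the readiness line, else None."""
    match = READY_PATTERN.search(line)
    return int(match.group(1)) if match else None


def wait_for_ready(lines: Iterable[str]) -> int:
    """Consume output lines until the readiness line appears and return its port."""
    for line in lines:
        port = parse_ready_port(line)
        if port is not None:
            return port
    # The stream ended without the marker, so the process most likely exited early.
    raise RuntimeError("server output ended before ELIZA_BENCH_READY was printed")
```

In practice the lines argument could be the stdout of a subprocess.Popen started with stdout=subprocess.PIPE and text=True, so the runner blocks only until the server announces its port.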

Testing

```bash
# benchmark-focused unit tests
bunx vitest run src/benchmark/*.test.ts

# watch a live benchmark smoke run end-to-end
bun run benchmark:watch

# watch live CUA execution in LUME VM (requires CUA_HOST + model credentials)
CUA_HOST=localhost:8000 OPENAI_API_KEY=sk-... CUA_COMPUTER_USE_MODEL=computer-use-preview bun run benchmark:cua:watch

# see the full benchmark testing/checklist protocol
cat src/benchmark/TESTING_PROTOCOL.md
```

HTTP API

GET /api/benchmark/health

Returns readiness and runtime metadata.

```json
{ "status": "ready", "agent_name": "Kira", "plugins": 3 }
```
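A Python runner might poll this endpoint before starting a task. The sketch below uses only the standard library; fetch_health and is_ready are hypothetical helper names, and treating every status other than "ready" as not ready is an assumption, since only the "ready" value is documented here.

```python
import json
import urllib.request
from typing import Any, Dict


def is_ready(health: Dict[str, Any]) -> bool:
    """Interpret a health payload; anything other than status == "ready" counts as not ready."""
    return health.get("status") == "ready"


def fetch_health(base_url: str, timeout: float = 5.0) -> Dict[str, Any]:
    """GET /api/benchmark/health and decode the JSON body."""
    url = f"{base_url}/api/benchmark/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, is_ready(fetch_health("http://localhost:3939")) would check a server listening on the default port.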

POST /api/benchmark/reset

Starts a fresh benchmark session (new room/user context).

Request:

```json
{ "task_id": "webshop-42", "benchmark": "agentbench" }
```

Response:

```json
{ "status": "ok", "room_id": "<uuid>", "task_id": "webshop-42", "benchmark": "agentbench" }
```
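The reset call can be issued from Python with the standard library alone. This is an illustrative sketch rather than the adapter's real code: build_reset_request and room_id_from_reset are hypothetical names, and sending a Content-Type: application/json header is an assumption about what the server expects.

```python
import json
import urllib.request
from typing import Any, Dict


def build_reset_request(base_url: str, task_id: str, benchmark: str) -> urllib.request.Request:
    """Build the POST /api/benchmark/reset request with the documented body fields."""
    body = json.dumps({"task_id": task_id, "benchmark": benchmark}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/benchmark/reset",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed, not stated in this README
        method="POST",
    )


def room_id_from_reset(response: Dict[str, Any]) -> str:
    """Return the new room id, failing loudly if the reset did not report status "ok"."""
    if response.get("status") != "ok":
        raise RuntimeError(f"reset failed: {response!r}")
    return response["room_id"]
```

A runner would pass the request to urllib.request.urlopen, decode the JSON body, and keep the returned room id for the rest of the task.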

GET /api/benchmark/cua/status

Returns CUA service status when CUA benchmark mode is enabled (ELIZA_ENABLE_CUA=1).

POST /api/benchmark/cua/run

Runs a live CUA task in the configured LUME/cloud sandbox.

Request:

```json
{
  "goal": "Open ChatGPT and close extra tabs",
  "room_id": "optional-room-id",
  "auto_approve": true,
  "include_screenshots": false
}
```

GET /api/benchmark/cua/screenshot

Captures the current sandbox screenshot (base64-encoded PNG) via the CUA service.

POST /api/benchmark/message

Sends benchmark input through the canonical message pipeline.

Request:

```json
{
  "text": "Find a laptop under $500",
  "context": {
    "benchmark": "agentbench",
    "task_id": "webshop-42",
    "goal": "Buy a laptop under $500",
    "observation": { "page": "search results" },
    "action_space": ["search[query]", "click[id]", "buy[id]"]
  }
}
```

Response:

```json
{
  "text": "Searching for options under $500...",
  "thought": "I should issue a search action first",
  "actions": ["BENCHMARK_ACTION"],
  "params": { "command": "search[laptop under $500]" }
}
```
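On the runner side, each step amounts to building the request body and pulling the chosen command out of the response. The sketch below mirrors the JSON shapes shown above; the helper names are hypothetical, and parsing commands as name[argument] is an assumption based on the example action_space, not a documented grammar.

```python
import re
from typing import Any, Dict, List, Optional, Tuple


def build_message_payload(text: str, benchmark: str, task_id: str, goal: str,
                          observation: Dict[str, Any],
                          action_space: List[str]) -> Dict[str, Any]:
    """Assemble the POST /api/benchmark/message body in the documented shape."""
    return {
        "text": text,
        "context": {
            "benchmark": benchmark,
            "task_id": task_id,
            "goal": goal,
            "observation": observation,
            "action_space": action_space,
        },
    }


def extract_command(response: Dict[str, Any]) -> Optional[str]:
    """Return params.command when the agent chose BENCHMARK_ACTION, else None."""
    if "BENCHMARK_ACTION" not in (response.get("actions") or []):
        return None
    command = (response.get("params") or {}).get("command")
    return command if isinstance(command, str) else None


# Assumed command grammar: an action name followed by one bracketed argument.
_COMMAND = re.compile(r"(\w+)\[(.*)\]", re.DOTALL)


def split_command(command: str) -> Optional[Tuple[str, str]]:
    """Split "search[laptop]" into ("search", "laptop"); None if it does not match."""
    match = _COMMAND.fullmatch(command.strip())
    return (match.group(1), match.group(2)) if match else None
```

Returning None instead of raising lets the runner score a step as "no valid action" when the agent replies with text only.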

Configuration

| Environment variable | Default | Description |
| --- | --- | --- |
| ELIZA_BENCH_PORT | 3939 | Port to listen on |
| COMPUTER_USE_ENABLED | unset | Set to 1 to load the local computeruse plugin |
| ELIZA_ENABLE_CUA | unset | If set (or if CUA env is configured), loads @elizaos/plugin-cua |
| CUA_HOST | unset | Local CUA/LUME host (e.g. localhost:8000) |
| CUA_API_KEY + CUA_SANDBOX_NAME | unset | Cloud CUA mode, an alternative to CUA_HOST |
| CUA_COMPUTER_USE_MODEL | auto | Set to computer-use-preview to force the OpenAI computer-use runner |
| ELIZA_BENCH_MOCK | unset | Enables the inline mock benchmark plugin |
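A client can read the same variables to decide where to connect. This is a hedged sketch of client-side logic only: resolve_base_url and cua_mode are hypothetical helpers, the localhost default is an assumption, and preferring CUA_HOST when both local and cloud settings are present is a guess at precedence, not documented server behavior.

```python
import os
from typing import Mapping, Optional

DEFAULT_PORT = 3939  # documented default for ELIZA_BENCH_PORT


def resolve_base_url(env: Optional[Mapping[str, str]] = None, host: str = "localhost") -> str:
    """Build the server base URL from ELIZA_BENCH_PORT, falling back to the default port."""
    env = os.environ if env is None else env
    raw_port = env.get("ELIZA_BENCH_PORT") or str(DEFAULT_PORT)
    return f"http://{host}:{int(raw_port)}"


def cua_mode(env: Mapping[str, str]) -> str:
    """Classify the CUA configuration as "local", "cloud", or "disabled"."""
    if env.get("CUA_HOST"):
        return "local"  # assumed to win when both modes are configured
    if env.get("CUA_API_KEY") and env.get("CUA_SANDBOX_NAME"):
        return "cloud"
    return "disabled"
```

Passing the environment as a mapping keeps both helpers easy to unit test without mutating os.environ.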

Notes

  • context is attached to the prompt context for each benchmark step.
  • Session reset creates isolated room/user context so task runs do not leak history.
  • Responses include actions and params extracted from responseContent for runner-side evaluation.