Back to Kibana

Evals CLI Reference

x-pack/platform/packages/shared/kbn-evals/CLI.md

9.4.07.5 KB
Original Source

Evals CLI Reference

The node scripts/evals CLI orchestrates LLM evaluation workflows. All commands support --help for inline usage.

Workflow overview

init  -->  start  -->  [iterate: start again]  -->  stop
           ^  |
           |  +-- runs EDOT, Scout, EIS CCM, then Playwright
           |
       logs (tail background service output)

EDOT and Scout run as persistent background daemons. They survive between start runs so you can iterate on eval suites without waiting for ES/Kibana to restart each time.

Commands

init -- Set up connectors

Interactive wizard that discovers EIS models or validates existing connectors in kibana.dev.yml.

bash
node scripts/evals init
node scripts/evals init --skip-discovery
FlagDescription
--skip-discoverySkip EIS model discovery (reuse existing target/eis_models.json)

start -- Start stack and run a suite

The main command. Starts EDOT + Scout as background daemons, enables EIS CCM if needed, then runs a Playwright eval suite.

bash
node scripts/evals start
node scripts/evals start --suite agent-builder
node scripts/evals start --suite agent-builder --model eis-gpt-4.1 --judge eis-claude-4-5-sonnet
node scripts/evals start --suite agent-builder --model eis-gpt-4.1,eis-claude-4-sonnet
node scripts/evals start --suite agent-builder --grep "product documentation"
node scripts/evals start --suite agent-builder --skip-server
FlagAliasDescription
--suite <id>Suite to run (interactive prompt if omitted)
--config <path>Playwright config path (alternative to --suite)
--project <id>--modelConnector/model to evaluate (comma-separated for multiple)
--evaluation-connector-id <id>--judgeConnector used for LLM-as-a-judge evaluators
--profile <name>Load both dataset + export settings from config.<name>.json in the vault config dir
--datasets-profile <name>Load dataset settings from config.<name>.json (sets EVALUATIONS_KBN_URL/EVALUATIONS_KBN_API_KEY)
--export-profile <name>Load export settings from config.<name>.json (sets EVALUATIONS_ES_URL, TRACING_ES_URL, and TRACING_EXPORTERS)
--grep <pattern>Filter tests by name (passed to Playwright --grep)
--repetitions <n>Number of times to repeat each example
--skip-serverSkip EDOT/Scout/EIS startup (use existing services)
--dry-runPrint configuration and exit without running

When flags are omitted and stdin is a TTY, start prompts interactively for suite, judge, and model selection.

Traces are exported by EDOT to the export cluster (controlled by --export-profile / TRACING_ES_URL), and TRACING_ES_URL is set so trace-based evaluators query the right cluster.

Golden dataset + local export (profiles)

Use profiles to fetch datasets from the golden cluster while exporting results and traces to your local Elasticsearch/Kibana (default: http://localhost:9200 / http://localhost:5601).

Create the profiles:

bash
# golden cluster config (datasets)
node scripts/evals init config

# local export profile (results + traces to localhost:9200)
node scripts/evals init config --profile local

Run:

bash
node scripts/evals start --suite attack-discovery --export-profile local

stop -- Stop background services

bash
node scripts/evals stop
node scripts/evals stop --service scout
node scripts/evals stop --service edot
FlagDescription
--service <name>Stop only edot or scout (default: stop all)

logs -- Tail service logs

bash
node scripts/evals logs
node scripts/evals logs --service scout
node scripts/evals logs --service edot --from-start
FlagDescription
--service <name>Tail only edot or scout (default: both)
--from-startShow logs from the beginning (default: tail from current position)

scout -- Start Scout standalone

Convenience wrapper around node scripts/scout.js start-server with evals defaults (--arch stateful --domain classic --serverConfigSet evals_tracing). Extra flags are forwarded to Scout.

bash
node scripts/evals scout

Use this when you want to manage Scout separately from the start workflow.

run -- Run a suite (no service management)

Runs a Playwright eval suite without managing EDOT/Scout. Use this when you already have services running (via start, scout, or manually).

bash
node scripts/evals run --suite agent-builder --judge bedrock-claude
node scripts/evals run --suite obs-ai-assistant --model azure-gpt4o --repetitions 3
node scripts/evals run --suite agent-builder --grep "product documentation"
node scripts/evals run --suite streams --dry-run
FlagAliasDescription
--suite <id>Suite to run (interactive prompt if omitted)
--config <path>Playwright config path (alternative to --suite)
--project <id>--modelConnector/model to evaluate
--evaluation-connector-id <id>--judgeConnector for LLM-as-a-judge evaluators
--grep <pattern>Filter tests by name (passed to Playwright --grep)
--repetitions <n>Repeat each example N times
--executor <name>kibana (default) or phoenix
--profile <name>Load both dataset + export settings from config.<name>.json
--datasets-profile <name>Load dataset settings from config.<name>.json
--export-profile <name>Load export settings from config.<name>.json
--trace-es-url <url>Elasticsearch URL for trace queries
--trace-es-api-key <key>API key for trace ES
--evaluations-es-url <url>Elasticsearch URL for storing eval results
--evaluations-es-api-key <key>API key for evaluations ES
--phoenix-base-url <url>Phoenix API URL (when using --executor phoenix)
--phoenix-api-key <key>Phoenix API key
--dry-runPrint the Playwright command and exit

list -- List available suites

bash
node scripts/evals list
node scripts/evals list --refresh
node scripts/evals list --json
FlagDescription
--refreshRe-scan the repo for suite configs
--jsonOutput as JSON

doctor -- Check prerequisites

bash
node scripts/evals doctor
node scripts/evals doctor --fix
FlagDescription
--fixAttempt to auto-fix detected issues

compare -- Compare eval runs

bash
node scripts/evals compare <run-id-a> <run-id-b>

env -- List environment variables

bash
node scripts/evals env

ci-map -- Output CI label mapping

bash
node scripts/evals ci-map
node scripts/evals ci-map --json
FlagDescription
--jsonOutput as JSON

Tips

Fast iteration: Use --grep to run a single test within a large suite:

bash
node scripts/evals start --suite agent-builder --grep "product documentation" --model eis-gpt-4.1

Reuse services: After the first start, EDOT and Scout stay alive. Subsequent start runs detect them and skip to step 3 (EIS CCM) or step 4 (Playwright). This cuts iteration time significantly.

Multiple models: Pass comma-separated model IDs to --model:

bash
node scripts/evals start --suite agent-builder --model eis-gpt-4.1,eis-claude-4-sonnet

View traces locally: EDOT exports traces to your local ES (from kibana.dev.yml). Open your local Kibana at http://localhost:5601 to view APM traces and LLM spans.