# @kbn/evals-suite-observability-ai
Evaluation suite for Observability AI features, built on `@kbn/evals`.

This package is separate from `@kbn/evals-suite-obs-ai-assistant`, which covers the legacy Observability AI Assistant. The evaluations here cover features built on the Agent Builder platform.
The suite ID is `observability-ai` (tag: `evals:observability-ai`).

| Directory | Client | Description |
|---|---|---|
| `evals/ai_insights/` | `AiInsightClient` | Correctness of LLM-generated insight summaries for alerts and APM errors |
| `evals/observability_agent/` | `AgentBuilderClient` | Observability Agent chat behavior |
AI Insights evaluations replay observability data from GCS snapshot repositories:
- `obs-ai-datasets/otel-demo/payment-service-failures` — payment service invalid token errors
- `obs-ai-datasets/otel-demo/payment-unreachable` — payment service unreachable from checkout (via the `paymentUnreachable` feature flag in the OTel demo)

Set `GCS_CREDENTIALS` before starting Scout. This must contain the full JSON service account credential string (not a file path):
```bash
export GCS_CREDENTIALS='{"type":"service_account",...}'
```
When `GCS_CREDENTIALS` is set, the `evals_tracing` Scout config writes it to a temporary file and injects `gcs.client.default.credentials_file` so Elasticsearch can access GCS snapshots through the keystore flow.
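The keystore flow can be sketched roughly as follows. This is an illustration only, not the actual Scout implementation; the temp-file path is generated at runtime and the credential JSON is a placeholder:

```bash
# Sketch of what the evals_tracing config does with GCS_CREDENTIALS
# (illustrative only; the real logic lives in the Scout server config).
export GCS_CREDENTIALS='{"type":"service_account","project_id":"example"}'

# Write the raw JSON credential string to a temporary file...
CREDS_FILE="$(mktemp)"
printf '%s' "$GCS_CREDENTIALS" > "$CREDS_FILE"

# ...and point Elasticsearch at that file so the keystore can load it.
echo "gcs.client.default.credentials_file=$CREDS_FILE"
```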
To capture trace-based metrics (token usage, latency), configure tracing exporters and run the EDOT Collector.
**Important:** Use a separate Elasticsearch cluster for exporting traces. Evaluations clean up observability data between scenarios. If traces are stored in the same cluster, this cleanup will also remove traces that trace-based evaluators depend on.
Add the HTTP exporter to `kibana.dev.yml`:

```yaml
telemetry.tracing.exporters:
  - http:
      url: 'http://localhost:4318/v1/traces'
```
Optionally, include the Phoenix exporter for a trace UI:
```yaml
telemetry.tracing.exporters:
  - phoenix:
      base_url: 'https://<my-phoenix-host>'
      public_url: 'https://<my-phoenix-host>'
      project_name: '<my-name>'
      api_key: '<my-api-key>'
  - http:
      url: 'http://localhost:4318/v1/traces'
```
If you use `node scripts/evals start`, EDOT is started automatically. Run this step only when using the separate-terminal/manual workflow.
Start the EDOT (Elastic Distribution of OpenTelemetry) Gateway Collector to receive and store traces. Ensure Docker is running, then execute:
```bash
# Point EDOT at a separate trace cluster (recommended)
ELASTICSEARCH_HOST=https://<username>:<password>@<trace-cluster-url> node scripts/edot_collector.js
```
The EDOT Collector receives traces from Kibana via the HTTP exporter and stores them in the specified Elasticsearch cluster, where they can be queried to extract non-functional metrics like token usage and latency.
You can optionally use non-default ports with `--http-port <port>` or `--grpc-port <port>`. If you change ports, update the exporter URL in `kibana.dev.yml` accordingly.
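For instance, a non-default port setup might look like this. The port values are arbitrary example values; the only requirement is that the HTTP exporter URL in `kibana.dev.yml` matches:

```bash
# Illustrative port override; 4319/4320 are arbitrary example values.
ELASTICSEARCH_HOST=https://<username>:<password>@<trace-cluster-url> \
  node scripts/edot_collector.js --http-port 4319 --grpc-port 4320

# kibana.dev.yml must then point at the matching HTTP port:
#   telemetry.tracing.exporters:
#     - http:
#         url: 'http://localhost:4319/v1/traces'
```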
```bash
node scripts/evals init
```
If you choose EIS/Cloud Connected Mode, `init` prints an `export` command for `KIBANA_TESTING_AI_CONNECTORS`. Run that export in your shell before `start`/`run`.
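The value is the connector configuration JSON that `init` generates. The line below only shows the general shape; copy the exact command `init` prints rather than writing it by hand:

```bash
# Placeholder only: use the exact export command printed by `init`.
export KIBANA_TESTING_AI_CONNECTORS='<JSON printed by init>'
```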
Use `start` to launch the full stack (Scout server, EDOT Collector, and EIS CCM) and run the suite:
```bash
node scripts/evals start \
  --suite observability-ai \
  --judge <connector-id> \
  --project <connector-id>
```
After the stack is running, use `run` for faster iteration on the full suite or specific specs.
Run the full suite:
```bash
node scripts/evals run \
  --suite observability-ai \
  --judge <connector-id> \
  --project <connector-id>
```
Run a specific evaluation:
```bash
node scripts/evals run \
  --suite observability-ai \
  --judge <connector-id> \
  --project <connector-id> \
  evals/ai_insights/alert_insight.spec.ts
```
If you prefer to manage services manually, start Scout in one terminal and run evals in another.
Terminal 1:
```bash
node scripts/scout start-server --arch stateful --domain classic --serverConfigSet evals_tracing
```
Terminal 2:
```bash
node scripts/evals run \
  --suite observability-ai \
  --judge <connector-id> \
  --project <connector-id>
```
`node scripts/evals start` runs Scout and EDOT as background daemons. Stop both with:

```bash
node scripts/evals stop
```
Stop only one service:
```bash
node scripts/evals stop --service scout
node scripts/evals stop --service edot
```
| Flag | Description |
|---|---|
| `--suite` | Suite ID (`observability-ai`) |
| `--judge` | Connector ID for the LLM judge (required) |
| `--project` | Connector/model project to evaluate against |
| `--repetitions` | Number of times to repeat each example |
| `--trace-es-url` | Elasticsearch cluster for trace storage (e.g., `https://user:pass@trace-cluster:9200`) |
| `--dry-run` | Preview without executing |
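Putting several of these flags together, a run that repeats each example and stores traces in a separate cluster might look like this. The connector IDs and cluster URL are placeholders:

```bash
# Illustrative flag combination; substitute real connector IDs and URLs.
node scripts/evals run \
  --suite observability-ai \
  --judge <connector-id> \
  --project <connector-id> \
  --repetitions 3 \
  --trace-es-url https://user:pass@trace-cluster:9200
```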
| Metric | Description |
|---|---|
| Quantitative Correctness | Factuality, Relevance, and Sequence Accuracy scores from LLM-as-a-judge |
| Input Tokens | Number of input tokens consumed per evaluation (trace-based) |
| Output Tokens | Number of output tokens generated per evaluation (trace-based) |
| Cached Tokens | Number of cached tokens used per evaluation (trace-based) |
| Latency | Duration of the `ChatComplete` span (trace-based) |