# @kbn/evals-suite-significant-events
Evaluations for Significant Events. These suites assess the quality of LLM-based Knowledge Indicator (KI) feature extraction, KI query generation, KI feature exclusion, and KI feature duplication across failure scenarios. They support both qualitative metrics (LLM-as-a-judge plus deterministic code evaluators) and quantitative, trace-based metrics.

For general information about writing evaluation tests, configuration, and usage, see the main `@kbn/evals` documentation.
## Suites

| Suite | Spec | What it measures |
|---|---|---|
| KI feature extraction | `ki_feature_extraction/ki_feature_extraction.spec.ts` | Can the LLM identify entities, dependencies, and infrastructure from raw log samples? |
| KI query generation | `ki_query_generation/ki_query_generation.spec.ts` | Can the LLM produce valid, hit-producing ES\|QL rules for significant event detection? |
| KI feature exclusion | `ki_feature_exclusion/ki_feature_exclusion.spec.ts` | Does the LLM respect excluded features and avoid regenerating them in follow-up runs? |
| KI feature duplication | `ki_feature_duplication/ki_feature_duplication.spec.ts` | Are KIs stable and semantically unique across repeated extraction runs? |
## Snapshot data

Evaluations replay Elasticsearch snapshots from a GCS bucket (`significant-events-datasets`). The bucket is structured as:

```
significant-events-datasets/
  <run-id>/
    <dataset>/
      <scenario-snapshot>
```
Set `GCS_CREDENTIALS` before starting Scout so Elasticsearch can access the GCS repository:

```bash
export GCS_CREDENTIALS='{"type":"service_account",...}'
```
The default run ID is pinned in code (`SIGEVENTS_SNAPSHOT_RUN`). Override it at runtime:

```bash
SIGEVENTS_SNAPSHOT_RUN=2026-02-25 node scripts/evals run --suite significant-events --judge gemini-3-pro
```
## Tracing (optional)

To capture trace-based metrics (input, output, and cached tokens, plus latency), configure tracing exporters and run the EDOT Collector.
Add the following to `kibana.dev.yml`:

```yaml
elastic.apm.active: false
elastic.apm.contextPropagationOnly: false
telemetry.enabled: true
telemetry.tracing.enabled: true
telemetry.tracing.sample_rate: 1
telemetry.tracing.exporters:
  - http:
      url: 'http://localhost:4318/v1/traces'
```
Note: `elastic.apm.active: false` and `elastic.apm.contextPropagationOnly: false` are required because Elastic APM and OpenTelemetry tracing cannot run simultaneously. The Scout `evals_tracing` config set handles this automatically, but when configuring `kibana.dev.yml` directly you must set both.
Optionally include the Phoenix exporter for a trace UI:

```yaml
telemetry.tracing.exporters:
  - phoenix:
      base_url: 'https://<my-phoenix-host>'
      public_url: 'https://<my-phoenix-host>'
      project_name: '<my-name>'
      api_key: '<my-api-key>'
  - http:
      url: 'http://localhost:4318/v1/traces'
```
Ensure Docker is running, then start the EDOT Gateway Collector:

```bash
node scripts/edot_collector.js
```
Or point it at a separate trace cluster:

```bash
ELASTICSEARCH_HOST=https://<username>:<password>@<trace-cluster-url> node scripts/edot_collector.js
```
Without tracing infrastructure, the token and latency evaluators gracefully return `score: null`; all other evaluators are unaffected.
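For illustration, a trace-based evaluator can degrade along these lines. This is a minimal sketch; the `EvaluationResult` type and span fetcher are assumptions, not the actual `@kbn/evals` API:

```ts
// Hypothetical sketch of how a trace-based evaluator degrades gracefully
// when no tracing infrastructure is reachable. Types and the fetcher
// callback are illustrative only, not the actual @kbn/evals API.
interface EvaluationResult {
  score: number | null;
  explanation?: string;
}

async function latencyEvaluator(
  fetchChatCompleteSpans: () => Promise<Array<{ durationMs: number }>>
): Promise<EvaluationResult> {
  let spans: Array<{ durationMs: number }>;
  try {
    spans = await fetchChatCompleteSpans();
  } catch {
    // No trace cluster configured or reachable: skip instead of failing the run.
    return { score: null, explanation: 'tracing unavailable' };
  }
  if (spans.length === 0) {
    return { score: null, explanation: 'no ChatComplete spans found' };
  }
  const totalMs = spans.reduce((sum, s) => sum + s.durationMs, 0);
  return { score: totalMs / spans.length };
}
```

Returning `score: null` rather than `0` keeps a missing trace pipeline from dragging down aggregate scores.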
## Running the evaluations

Start the Scout server with the `evals_tracing` config set:

```bash
node scripts/scout.js start-server --arch stateful --domain classic --serverConfigSet evals_tracing
```
Note: Use Gemini 3 Pro as the evaluation judge to ensure consistent scoring across models. This keeps LLM-as-a-judge criteria evaluations comparable regardless of which model is being evaluated.
Run the full suite:

```bash
node scripts/evals run \
  --suite significant-events \
  --project <connector-id> \
  --judge <gemini-3-pro-connector-id>
```

Run a single dataset:

```bash
SIGEVENTS_DATASET=otel-demo node scripts/evals run \
  --suite significant-events \
  --project <connector-id> \
  --judge <gemini-3-pro-connector-id>
```

Run a single spec:

```bash
node scripts/evals run \
  --suite significant-events \
  --project <connector-id> \
  --judge <gemini-3-pro-connector-id> \
  ki_feature_extraction.spec.ts
```
## CLI flags

| Flag | Description |
|---|---|
| `--suite` | Suite ID to run (use `significant-events`) |
| `--project` | Connector/model project to evaluate against |
| `--judge` | Connector ID for the LLM judge (use Gemini 3 Pro for consistency) |
| `--repetitions` | Number of times to repeat each evaluation example (e.g. `3`) |
| `--trace-es-url` | URL of the Elasticsearch cluster where traces are stored (e.g. `https://user:pass@trace-cluster:9200`) |
| `--dry-run` | Preview the command without executing |
## Environment variables

| Variable | Description | Default |
|---|---|---|
| `SIGEVENTS_SNAPSHOT_RUN` | Run ID subfolder in GCS to replay snapshots from | `2026-02-25` |
| `SIGEVENTS_DATASET` | Dataset(s) to run (comma-separated, or `all`) | `all` |
| `KI_QUERY_GENERATION_KI_FEATURE_SOURCE` | KI feature source for KI query generation (`canonical`, `snapshot`, or `both`) | `both` |
| `GCS_CREDENTIALS` | GCS service account JSON for snapshot access | (none) |
| `TRACING_ES_URL` | Elasticsearch URL for trace queries (if traces are in a separate cluster) | Falls back to the test cluster |
| `TRACING_ES_API_KEY` | API key for the trace Elasticsearch cluster | (none) |
## Evaluators

### Code evaluators

| Evaluator | Suite | Description |
|---|---|---|
| `type_validation` | KI feature extraction | All KI types are valid (`entity`, `infrastructure`, `technology`, `dependency`, `schema`) |
| `evidence_grounding` | KI feature extraction | Evidence strings are grounded in input documents; `evidence_doc_ids` reference real docs |
| `ki_feature_count` | KI feature extraction | KI feature count falls within expected bounds |
| `confidence_bounds` | KI feature extraction | No KI exceeds the maximum confidence threshold |
| `type_assertions` | KI feature extraction | Required types are present; forbidden types are absent |
| `ki_query_generation_code_evaluator` | KI query generation | ES\|QL syntax validity and execution hit rate |
| `ki_feature_duplication` | KI feature duplication | Structural deduplication |
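As a rough illustration of what these deterministic checks look like, here is a minimal sketch in the spirit of `confidence_bounds`; the types, field names, and default threshold are assumptions, not the suite's actual implementation:

```ts
// Hypothetical sketch of a deterministic check in the spirit of the
// confidence_bounds evaluator. Types, names, and the default threshold
// are assumptions, not the suite's actual implementation.
type KiType = 'entity' | 'infrastructure' | 'technology' | 'dependency' | 'schema';

interface KiFeature {
  id: string;
  type: KiType;
  confidence: number; // expected to be in [0, 1]
}

function confidenceBounds(features: KiFeature[], maxConfidence = 0.95) {
  const offenders = features.filter((f) => f.confidence > maxConfidence);
  return {
    score: offenders.length === 0 ? 1 : 0,
    explanation:
      offenders.length === 0
        ? 'all KIs within confidence bounds'
        : `KIs above ${maxConfidence}: ${offenders.map((f) => f.id).join(', ')}`,
  };
}
```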
### LLM-as-a-judge evaluators

| Evaluator | Suite | Description |
|---|---|---|
| `scenario_criteria` | KI feature extraction, KI query generation | Scenario-specific criteria (e.g. "must identify payment service") |
| `llm_exclude_compliance` | KI feature exclusion | Excluded features are not regenerated in follow-up identification runs |
| `llm_semantic_uniqueness` | KI feature duplication | Semantic deduplication across KIs |
| `llm_id_consistency` | KI feature duplication | Same KI ID refers to the same concept across runs |
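Conceptually, an evaluator like `scenario_criteria` can be thought of as judging each dataset criterion and aggregating the verdicts. A minimal sketch, assuming a generic judge callback (not the actual implementation, which lives in `src/evaluators/`):

```ts
// Hypothetical sketch of criterion-based judging in the spirit of the
// scenario_criteria evaluator. The judge callback is illustrative only.
async function judgeCriteria(
  output: string,
  criteria: string[],
  judge: (prompt: string) => Promise<'PASS' | 'FAIL'>
): Promise<{ score: number }> {
  const verdicts = await Promise.all(
    criteria.map((criterion) =>
      judge(`Does the following output satisfy this criterion: "${criterion}"?\n\n${output}`)
    )
  );
  const passed = verdicts.filter((v) => v === 'PASS').length;
  // Fraction of scenario criteria the output satisfies.
  return { score: criteria.length === 0 ? 1 : passed / criteria.length };
}
```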
### Trace-based evaluators

| Evaluator | Description |
|---|---|
| Input Tokens | Total input tokens consumed per evaluation |
| Output Tokens | Total output tokens generated per evaluation |
| Cached Tokens | Total cached input tokens used per evaluation |
| Latency | Duration of the `ChatComplete` inference span |
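These metrics are read from the exported spans. As a sketch of the kind of aggregation involved, assuming attribute names from the OTel GenAI semantic conventions (the suite's actual trace queries may differ):

```ts
// Hypothetical sketch of aggregating token usage and latency from exported
// spans. Attribute names follow the OTel GenAI semantic conventions as an
// assumption; the suite's actual trace queries may differ.
interface InferenceSpan {
  name: string; // e.g. 'ChatComplete'
  durationMs: number;
  attributes: Record<string, number | undefined>;
}

function summarizeChatCompleteSpans(spans: InferenceSpan[]) {
  const chat = spans.filter((s) => s.name === 'ChatComplete');
  const sum = (attr: string) => chat.reduce((acc, s) => acc + (s.attributes[attr] ?? 0), 0);
  return {
    inputTokens: sum('gen_ai.usage.input_tokens'),
    outputTokens: sum('gen_ai.usage.output_tokens'),
    latencyMs: chat.reduce((acc, s) => acc + s.durationMs, 0),
  };
}
```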
## Adding a new dataset

Each dataset should have its own capture script under `scripts/` so that snapshot creation is fully reproducible. See `capture_otel_demo_snapshots.ts` as the reference implementation.

A capture script typically:

1. Resolves the ES/Kibana connection (via `getConnectionConfig`)
2. Registers the GCS snapshot repository (via `registerGcsRepository`)
3. Generates data into `logs*` (dataset-specific; e.g. deploy an app, run synthtrace, replay from an external source)
4. Extracts KI features (via the `significant_events_workflow` helpers)
5. Snapshots `logs*` and the extracted KIs to GCS (via `createSnapshot`)

The shared helpers in `scripts/lib/` handle GCS registration, snapshot creation, KI feature extraction orchestration, and the ES/Kibana connection; the capture script only needs to provide the data generation logic specific to its dataset. A skeleton is sketched below.
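A skeleton capture script might look roughly like the following; the helper names come from this README, but the import path and all signatures are assumptions, so check `scripts/lib/` for the real ones:

```ts
// Hypothetical skeleton of a capture script. The helpers named below exist
// in scripts/lib/ per this README, but the import path and every signature
// here are assumptions for illustration only.
import { getConnectionConfig, registerGcsRepository, createSnapshot } from './lib';

async function generateMyAppLogs(connection: unknown): Promise<void> {
  // Dataset-specific data generation: deploy an app, run synthtrace,
  // replay from an external source, etc.
}

async function main() {
  // 1. Resolve the ES/Kibana connection
  const connection = await getConnectionConfig();

  // 2. Register the GCS snapshot repository
  await registerGcsRepository(connection, { bucket: 'significant-events-datasets' });

  // 3. Generate dataset-specific data into logs*
  await generateMyAppLogs(connection);

  // 4-5. KI feature extraction runs via the significant_events_workflow
  // helpers, then logs* and the extracted KIs are snapshotted to GCS.
  await createSnapshot(connection, {
    runId: process.env.SIGEVENTS_SNAPSHOT_RUN ?? '2026-02-25',
    dataset: 'my-app',
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```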
Register the script entry point in `scripts/` (e.g. `scripts/capture_sigevents_my_app_snapshots.js`) so it can be run with:

```bash
node scripts/capture_sigevents_my_app_snapshots.js --connector-id <id> --run-id <run-id>
```
Then register the dataset:

1. Add a dataset definition in `src/datasets/` (e.g. `my_app.ts`, following the `otel_demo.ts` pattern; see the sketch below)
2. Register it in `src/datasets/index.ts`
3. Run it:

```bash
SIGEVENTS_DATASET=my-app node scripts/evals run --suite significant-events --judge <gemini-3-pro-connector-id>
```
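For orientation, a dataset definition might look roughly like this; every field name here is a guess, so follow `otel_demo.ts` for the actual shape:

```ts
// Hypothetical dataset definition; the real shape is whatever otel_demo.ts
// exports. Every field name below is an assumption for illustration.
export const myAppDataset = {
  id: 'my-app', // matches SIGEVENTS_DATASET=my-app
  description: 'Failure scenarios captured from my_app',
  scenarios: [
    {
      id: 'payment-outage',
      // Criteria consumed by the scenario_criteria judge evaluator
      criteria: ['must identify payment service'],
    },
  ],
};
```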
## Adding a new evaluation

1. Add a spec under `evals/` (e.g. `my_eval/my_eval.spec.ts`)
2. Add evaluators under `src/evaluators/` (e.g. `my_eval_evaluators.ts`): code evaluators for deterministic checks and LLM-as-a-judge evaluators for qualitative criteria. See the existing files for reference (e.g. `ki_feature_extraction/evaluators.ts`)
3. Where possible, reuse the shared `scenario_criteria` evaluator: it automatically scores LLM output against the dataset's criteria using an LLM judge, so you don't need to write custom LLM evaluators for each spec