# @kbn/evals-suite-obs-ai-assistant
Evaluation test suites for the Observability AI Assistant, built on top of `@kbn/evals`.

This package contains evaluation tests for the Observability AI Assistant, covering key features such as alerts, APM, ES|QL, knowledge base, connectors, and more.

For general information about writing evaluation tests, configuration, and usage, see the main `@kbn/evals` documentation.
## Running evaluations

Start the Scout server:

```bash
node scripts/scout.js start-server --arch stateful --domain classic
```
The Scout server Kibana instance is accessible at http://localhost:5620. This may be useful if you want to query evaluation results for further analysis.
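For a quick connectivity check before digging into the analysis queries (see Analyzing evaluation results below), a minimal curl sketch against the backing Elasticsearch cluster (default http://localhost:9220; the `elastic`/`changeme` credentials are an assumption for local setups and may differ in yours):

```bash
# Fetch one raw evaluation document from the data stream (if any runs exist yet).
# Credentials are an assumption; adjust to your Scout configuration.
curl -s -u elastic:changeme \
  "http://localhost:9220/.kibana-evaluations/_search?size=1&pretty"
```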
Run evaluations using the following base command:

```bash
EVALUATION_CONNECTOR_ID=llm-judge-connector-id \
node scripts/playwright test \
  --config x-pack/solutions/observability/packages/kbn-evals-suite-obs-ai-assistant/playwright.config.ts
```
**Environment Variables:**

- `EVALUATION_CONNECTOR_ID` (required): Connector ID for the LLM judge
- `EVALUATION_REPETITIONS`: Number of times to repeatedly evaluate each example (e.g., 3)
- `USE_QUALITATIVE_EVALUATORS`: Enable additional evaluators for Correctness (Factuality, Relevance, Sequence Accuracy) and Groundedness (Hallucination) (defaults to false)
- `SCENARIO_REPORTING`: Enable scenario-grouped reporting that aggregates datasets by scenario prefix (defaults to false)

**Playwright Options:**

- Pass a test file path to run a specific suite (e.g., `evals/alerts/alerts.spec.ts`)
- Pass `--project="my-connector"` to evaluate a specific model/connector

Example with all options:
```bash
# Running alerts scenarios
EVALUATION_REPETITIONS=3 \
USE_QUALITATIVE_EVALUATORS=true \
SCENARIO_REPORTING=true \
EVALUATION_CONNECTOR_ID=llm-judge-connector-id \
node scripts/playwright test \
  --config x-pack/solutions/observability/packages/kbn-evals-suite-obs-ai-assistant/playwright.config.ts \
  evals/alerts/alerts.spec.ts \
  --project="my-connector"
```
## Evaluation clients

The evaluation suite supports running evaluations against both the Obs AI Assistant API and the Agent Builder API. This allows engineers to evaluate the Elastic AI Agent early in the migration process.
Use the `EVALUATION_CLIENT` environment variable to specify which client to use:

- `obs_ai_assistant` (default): Uses the Obs AI Assistant API
- `agent_builder`: Uses the Agent Builder API

When using `agent_builder`, you can optionally specify which agent to invoke using the `AGENT_BUILDER_AGENT_ID` environment variable (defaults to the default agent), as shown below.
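For example (the agent id `my-custom-agent` is a placeholder; substitute your own, or omit the variable to use the default agent):

```bash
EVALUATION_CLIENT="agent_builder" \
AGENT_BUILDER_AGENT_ID="my-custom-agent" \
EVALUATION_CONNECTOR_ID=llm-judge-connector-id \
node scripts/playwright test \
  --config x-pack/solutions/observability/packages/kbn-evals-suite-obs-ai-assistant/playwright.config.ts
```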
### Running against a local Kibana instance

Run Agent Builder evaluations against a local Kibana instance using these steps:
1. Start Kibana without a base path:

   ```bash
   yarn start --no-base-path
   ```

   The APM synthrace client fixture requires Kibana to run without a base path.
2. Configure a local Scout server. Create `.scout/servers/local.json` with the following content:

   ```json
   {
     "serverless": false,
     "isCloud": false,
     "hosts": {
       "kibana": "http://localhost:5601/",
       "elasticsearch": "http://localhost:9200/"
     },
     "auth": {
       "username": "elastic",
       "password": "changeme"
     }
   }
   ```
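   To verify that the hosts and credentials above match your running local stack, a quick check (assuming the `elastic`/`changeme` credentials from the config; adjust if yours differ):

   ```bash
   # Should print Elasticsearch cluster info if the host and auth are correct.
   curl -s -u elastic:changeme http://localhost:9200/
   ```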
3. Run the evaluations:

   ```bash
   EVALUATION_REPETITIONS=1 \
   EVALUATION_CLIENT="agent_builder" \
   EVALUATION_CONNECTOR_ID="your-connector-id" \
   node scripts/playwright test \
     --config x-pack/solutions/observability/packages/kbn-evals-suite-obs-ai-assistant/playwright.config.ts \
     evals/esql/esql.spec.ts \
     --project="your-connector" \
     --debug
   ```
Note that behavior is not identical across the two clients (e.g., context tool and alerts tool behavior differs), so results may not be directly comparable.

## Reporting modes

The evaluation framework supports two reporting modes:

- **Per-dataset (default)**: Reports evaluator scores for each dataset individually.
- **Scenario-grouped** (`SCENARIO_REPORTING=true`): Aggregates datasets into scenario-level statistics. Useful for a high-level overview across different scenarios (e.g., alerts, esql, apm).

**Dataset Naming Convention**: For scenario-grouped reporting, datasets must be named `"scenario: dataset-name"` (e.g., `"alerts: critical"`, `"esql: simple queries"`). Any dataset not matching this pattern will be categorized under "Other". Always use this format when adding new test cases.
## Analyzing evaluation results

Evaluation results are stored in the `.kibana-evaluations` data stream on the Elasticsearch cluster your Scout configuration points to (defaulting to http://localhost:9220). For on-demand analysis, you can query this data stream using the Kibana instance at http://localhost:5620.

You must replace the `${run_id}` parameter in the queries below with an actual evaluation `run_id` to retrieve data.
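If you don't yet have a `run_id` at hand, a minimal ES|QL sketch to list the run ids present in the data stream (the `docs` column name is just illustrative):

```
POST /_query?format=txt
{
  "query": """
    FROM .kibana-evaluations
    | STATS docs = COUNT(*) BY run_id
    | LIMIT 20
  """
}
```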
Get evaluator scores per dataset (replicating in-terminal evaluation results):

```
POST /_query?format=txt
{
  "query": """
    FROM .kibana-evaluations
    | WHERE run_id == "${run_id}"
    | EVAL mean_dataset_score = MV_AVG(evaluator.scores)
    | STATS
        criteria_score = AVG(mean_dataset_score) WHERE evaluator.name == "Criteria",
        groundedness_score = AVG(mean_dataset_score) WHERE evaluator.name == "Groundedness",
        factuality_score = AVG(mean_dataset_score) WHERE evaluator.name == "Factuality",
        relevance_score = AVG(mean_dataset_score) WHERE evaluator.name == "Relevance",
        sequence_accuracy_score = AVG(mean_dataset_score) WHERE evaluator.name == "Sequence Accuracy"
        BY dataset.name
    | SORT dataset.name
    | LIMIT 100
  """
}
```
Get evaluator scores per scenario (replicating in-terminal results when `SCENARIO_REPORTING=true`):
```
POST /_query?format=txt
{
  "query": """
    FROM .kibana-evaluations
    | WHERE run_id == "${run_id}"
    | DISSECT dataset.name "%{scenario}: %{rest}"
    | EVAL mean_dataset_score = MV_AVG(evaluator.scores)
    | STATS
        criteria_score = AVG(mean_dataset_score) WHERE evaluator.name == "Criteria",
        groundedness_score = AVG(mean_dataset_score) WHERE evaluator.name == "Groundedness",
        factuality_score = AVG(mean_dataset_score) WHERE evaluator.name == "Factuality",
        relevance_score = AVG(mean_dataset_score) WHERE evaluator.name == "Relevance",
        sequence_accuracy_score = AVG(mean_dataset_score) WHERE evaluator.name == "Sequence Accuracy"
        BY scenario
    | SORT scenario
    | LIMIT 100
  """
}
```
View performance ratings for each scenario in the evaluation run based on performance matrix heuristics:

```
POST /_query?format=txt
{
  "query": """
    FROM .kibana-evaluations
    | WHERE run_id == "${run_id}"
    | DISSECT dataset.name "%{scenario}: %{rest}"
    | EVAL mean_dataset_score = MV_AVG(evaluator.scores)
    | STATS
        criteria_score = AVG(mean_dataset_score) WHERE evaluator.name == "Criteria"
        BY scenario
    | EVAL rating = CASE(
        criteria_score >= 0 AND criteria_score < 0.45, "Poor",
        criteria_score < 0.75, "Good",
        criteria_score < 0.84, "Great",
        criteria_score <= 1, "Excellent",
        "Invalid"
      )
    | SORT scenario
    | LIMIT 100
  """
}
```
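As encoded in the `CASE` expression above, the rating bands are: Poor for criteria scores in [0, 0.45), Good in [0.45, 0.75), Great in [0.75, 0.84), and Excellent in [0.84, 1].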
## AI Insights

For setup, prerequisites, and instructions on running AI Insights evaluations, see the AI Insights README.