# kbn-evals-suite-security-ai-rules

`x-pack/solutions/security/packages/kbn-evals-suite-security-ai-rules`
Playwright-based evaluation suite for testing the AI rule creation feature in Elastic Security Solution, built on @kbn/evals.

This package evaluates the quality of AI-generated detection rules against known examples from the elastic/detection-rules repository, measuring both the structural validity of the generated rules and their semantic fidelity to the reference rules.

The eval suite calls the synchronous Agent Builder API (`POST /api/agent_builder/converse`) and extracts tool results from the `security.create_detection_rule` tool call steps, matching the pattern used by the agent-builder eval suite.
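The extraction step can be sketched as follows. The `ConverseStep` shape and its field names (`tool_id`, `results`) are illustrative assumptions, not the actual Agent Builder response types (see `src/chat_client.ts` for the real client):

```typescript
// Hypothetical shape of one step in a converse response; the real Agent
// Builder types live in the Kibana source and may differ.
interface ConverseStep {
  type: string;
  tool_id?: string;
  results?: unknown[];
}

// Collect the results of every `security.create_detection_rule` tool call
// from the steps of a `POST /api/agent_builder/converse` response.
function extractRuleToolResults(steps: ConverseStep[]): unknown[] {
  return steps
    .filter((s) => s.type === 'tool_call' && s.tool_id === 'security.create_detection_rule')
    .flatMap((s) => s.results ?? []);
}
```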
```
kbn-evals-suite-security-ai-rules/
├── playwright.config.ts        # Playwright evals configuration
├── evals/
│   └── rule_generation.spec.ts # Evaluation scenarios (baseline + edge + negative cases)
├── datasets/
│   ├── sample_rules.ts         # 8 canonical reference detection rules with ES|QL translations
│   ├── standard_pairs.ts       # 18 standard prompt/rule pairs (Windows, Linux, Cloud, etc.)
│   ├── complex_pairs.ts        # 5 complex multi-domain pairs (containers, supply-chain)
│   ├── hard_cases.ts           # Edge-case prompts for robustness testing
│   └── negative_pairs.ts       # 5 prompts that should NOT produce a valid rule
└── src/
    ├── chat_client.ts          # Agent Builder API client (sync converse)
    ├── evaluate.ts             # Suite-specific eval fixture extensions
    ├── evaluate_dataset.ts     # Experiment runner + all evaluator definitions
    ├── helpers.ts              # Utility functions (MITRE extraction, syntax check, etc.)
    └── helpers.test.ts         # Unit tests for helpers
```
Prerequisites:

- Elasticsearch running locally:

  ```sh
  yarn es snapshot
  ```

- Kibana running with AI rule creation enabled: `kibana.dev.yml` has AI connectors configured and `aiRuleCreationEnabled: true` is set.
- AI connectors: configure one or more AI connectors in `config/kibana.dev.yml` or via the Kibana UI. The suite runs against all connectors discovered at runtime (including EIS models when available).
- GenAI Settings: navigate to Stack Management > AI > GenAI Settings (`app/management/ai/genAiSettings`) and select "AI agent (Beta)" in Chat Experience. This enables the Agent Builder API that the eval suite calls.
- Index patterns: the dataset prompts reference specific index patterns (e.g., `logs-endpoint.events.*`, `logs-aws.cloudtrail*`). If these indices do not exist in your Elasticsearch instance, the affected examples will be skipped (all evaluators return N/A). Check the task logs for "Could not discover a suitable index" warnings.
Run the suite with `node scripts/evals run`. Results are persisted to an Elasticsearch cluster and a summary table is printed at the end.

```sh
EVALUATIONS_ES_URL=<ES_URL> \
EVALUATIONS_ES_API_KEY=<API_KEY> \
EVALUATION_CONNECTOR_ID=gpt-4o \
node scripts/evals run --suite security-ai-rules
```

Replace `<ES_URL>` and `<API_KEY>` with the Elasticsearch endpoint and API key for the cluster where evaluation scores should be stored (this can be a remote/cloud cluster, not necessarily the local one Kibana is connected to).
| Variable | Description | Default |
|---|---|---|
| `EVALUATIONS_ES_URL` | Elasticsearch URL for storing results | `http://elastic:changeme@localhost:9220` |
| `EVALUATIONS_ES_API_KEY` | API key for the results Elasticsearch cluster (used instead of basic auth) | (none) |
| `EVALUATION_CONNECTOR_ID` | Connector ID for the task model | (required) |
| `EVALUATION_REPETITIONS` | Number of times to run each example | `1` |
| `SELECTED_EVALUATORS` | Comma-separated evaluator names to run | (all) |
When storing results in a local dev cluster with basic auth, set the URL with embedded credentials:

```sh
EVALUATIONS_ES_URL=http://elastic:changeme@localhost:9200 \
EVALUATION_CONNECTOR_ID=gpt-4o \
node scripts/evals run --suite security-ai-rules
```

To run only a subset of evaluators:

```sh
EVALUATIONS_ES_URL=<ES_URL> \
EVALUATIONS_ES_API_KEY=<API_KEY> \
EVALUATION_CONNECTOR_ID=gpt-4o \
SELECTED_EVALUATORS="Query Syntax Validity,Field Coverage,MITRE Accuracy" \
node scripts/evals run --suite security-ai-rules
```
The suite runs 12 evaluators (10 deterministic CODE evaluators, 1 LLM-as-judge evaluator, and 1 rejection evaluator). In the summary table, these are grouped into columns for readability.
Six deterministic evaluators check whether the generated rule is well-formed (all binary except Field Coverage):

- Query Syntax Validity: parses the query with the @elastic/esql parser. Also rejects bare `FROM *` queries, which are disallowed in alerting rules. Score: 1 (valid) or 0 (invalid).
- Rule type correctness: checks `type === 'esql'` and `language === 'esql'`. Score: 1 (correct) or 0 (wrong).
- Severity validity: the severity must be one of `low`, `medium`, `high`, `critical`. Score: 1 or 0.
- Interval format: the interval must be a valid duration (e.g., `5m`, `30s`, `1h`). Score: 1 or 0.
- Lookback coverage: the `from` field must be >= the interval to avoid lookback gaps. Score: 1 (no gap) or 0 (gap present).
- Field Coverage: measures the fraction of required rule fields present: `name`, `description`, `query`, `severity`, `tags`, `riskScore`. A score of 0.83 means 5 of 6 fields are present.
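The lookback-gap check, for instance, reduces to comparing two parsed durations. A minimal sketch, with illustrative helper names rather than the suite's actual implementation in `src/helpers.ts`:

```typescript
// Parse durations like "5m", "30s", "1h" into seconds.
function durationToSeconds(d: string): number {
  const m = /^(\d+)([smh])$/.exec(d);
  if (!m) throw new Error(`invalid duration: ${d}`);
  const mult = { s: 1, m: 60, h: 3600 }[m[2] as 's' | 'm' | 'h'];
  return Number(m[1]) * mult;
}

// A rule with `from: "now-9m"` and `interval: "5m"` has no lookback gap,
// because each 5m run looks back at least 5m.
function hasNoLookbackGap(from: string, interval: string): boolean {
  const f = /^now-(\d+[smh])$/.exec(from);
  if (!f) return false; // unrecognized `from` format counts as a gap
  return durationToSeconds(f[1]) >= durationToSeconds(interval);
}
```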
Three evaluators compare the generated rule against the expected reference. The ES|QL Functional Equivalence evaluator uses the built-in `createEsqlEquivalenceEvaluator` from @kbn/evals to assess whether the generated ES|QL query would produce the same detection results as the reference query, regardless of syntax differences. For non-ES|QL reference rules that have an `esqlQuery` translation, the evaluator compares against the translation. It returns N/A when no ES|QL ground truth is available.

The Rejection evaluator scores whether the model correctly refused to generate a rule for a negative case (a prompt where the available data source cannot support the requested detection). It returns N/A for positive cases. Score: 1 (correctly refused) or 0 (incorrectly generated a rule).
Two additional LLM-as-judge evaluators check semantic equivalence for the rule name and description fields. They are intentionally disabled in the default evaluator list because they add significant latency per example. Re-enable them in `src/evaluate_dataset.ts` when running thorough multi-model comparisons:

```ts
// In createEvaluateDataset, uncomment:
createRuleNameEvaluator(evaluators),
createRuleDescriptionEvaluator(evaluators),
```
All evaluators except Rejection are wrapped with `skipNegativeCases` (returns N/A for negative test examples). All evaluators are wrapped with `skipMissingIndexFailures` (returns N/A when the rule creation tool failed due to missing index patterns). The ES|QL equivalence evaluator additionally uses `skipNonEsqlReferences` to avoid meaningless comparisons when no ES|QL ground truth exists.
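The wrapper pattern can be sketched as follows, with simplified stand-ins for the suite's example and evaluator types (the real wrappers live in `src/evaluate_dataset.ts` and may differ in shape):

```typescript
// Simplified stand-ins: a null score means N/A.
interface Example {
  isNegative?: boolean;
  failedDueToMissingIndex?: boolean;
}
type Evaluator = (example: Example, output: unknown) => number | null;

// Short-circuit to N/A for negative test examples.
const skipNegativeCases =
  (inner: Evaluator): Evaluator =>
  (example, output) =>
    example.isNegative ? null : inner(example, output);

// Short-circuit to N/A when rule creation failed due to a missing index.
const skipMissingIndexFailures =
  (inner: Evaluator): Evaluator =>
  (example, output) =>
    example.failedDueToMissingIndex ? null : inner(example, output);
```

Because the wrappers compose, an evaluator can opt in to any combination, e.g. `skipNegativeCases(skipMissingIndexFailures(scoreFn))`.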
Results are automatically exported to Elasticsearch in the `.kibana-evaluations` datastream.

Navigate to Kibana > Dev Tools and paste the queries below. Replace `<run-id>` with the run ID printed in the eval logs (e.g. `a3f2c1b0d4e56789`).
All scores for a single run:

```
GET .kibana-evaluations/_search
{
  "query": {
    "term": { "run_id": "<run-id>" }
  },
  "sort": [{ "evaluator.name": "asc" }],
  "size": 200
}
```

Mean score per evaluator for a run:

```
GET .kibana-evaluations/_search
{
  "size": 0,
  "query": {
    "term": { "run_id": "<run-id>" }
  },
  "aggs": {
    "by_evaluator": {
      "terms": { "field": "evaluator.name" },
      "aggs": {
        "mean_score": { "avg": { "field": "evaluator.score" } }
      }
    }
  }
}
```

Compare mean scores across two runs:

```
GET .kibana-evaluations/_search
{
  "size": 0,
  "query": {
    "terms": { "run_id": ["<run-id-1>", "<run-id-2>"] }
  },
  "aggs": {
    "by_run": {
      "terms": { "field": "run_id" },
      "aggs": {
        "by_evaluator": {
          "terms": { "field": "evaluator.name" },
          "aggs": {
            "mean_score": { "avg": { "field": "evaluator.score" } }
          }
        }
      }
    }
  }
}
```

Recent results for a given model and dataset:

```
GET .kibana-evaluations/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "task.model.id": "gpt-4o" } },
        { "match": { "example.dataset.name": "security-ai-rules" } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 100
}
```
Each result is stored as a document like this:

```json
{
  "@timestamp": "2026-02-12T20:30:00.000Z",
  "run_id": "abc123def456",
  "task": {
    "model": {
      "id": "gpt-4o",
      "family": "openai",
      "provider": "azure"
    }
  },
  "example": {
    "dataset": {
      "name": "security-ai-rules: rule-generation-basic"
    }
  },
  "evaluator": {
    "name": "Query Syntax Validity",
    "score": 1,
    "label": null,
    "explanation": null
  }
}
```
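For post-processing results outside Kibana, the document shape can be modeled in TypeScript. The interface below is inferred from the example document, and the grouping helper is a hypothetical client-side analogue of the terms/avg aggregation shown earlier:

```typescript
// Shape inferred from the sample document; the real index mapping may
// contain additional fields.
interface EvaluationResult {
  '@timestamp': string;
  run_id: string;
  task: { model: { id: string; family: string; provider: string } };
  example: { dataset: { name: string } };
  evaluator: { name: string; score: number; label: string | null; explanation: string | null };
}

// Mean score per evaluator name, mirroring the terms/avg aggregation.
function meanScoreByEvaluator(results: EvaluationResult[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const r of results) {
    const bucket = (sums[r.evaluator.name] ??= { total: 0, count: 0 });
    bucket.total += r.evaluator.score;
    bucket.count += 1;
  }
  return Object.fromEntries(
    Object.entries(sums).map(([name, v]) => [name, v.total / v.count])
  );
}
```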
Problem: The specified connector ID is not configured.

Solution: verify the connector is defined in `config/kibana.dev.yml` (or via the Kibana UI) and that `EVALUATION_CONNECTOR_ID` matches its ID.

Problem: The AI rule creation APIs are not available.

Solution: enable the feature flag in `kibana.dev.yml`:

```yaml
xpack.securitySolution.enableExperimental:
  - aiRuleCreationEnabled
```

Also confirm the Agent Builder API is enabled (the suite calls `POST /api/agent_builder/converse`).

Problem: The rule creation tool cannot find matching data for the index pattern in the prompt.

This means the required index (e.g., `logs-azure.auditlogs*`) does not exist in the Elasticsearch instance. All evaluators for the affected example will return N/A.

Solution: either ingest data matching the index pattern or accept the skip; the run summary reports it as `[Summary] ... X/Y examples scored (Z skipped due to missing indices)`.

Problem: All evaluations score near 0.

Possible causes: the agent is not calling the `security.create_detection_rule` tool, or the tool calls are failing.

Solution: check the `chat_client.ts` diagnostics logged at warning level.

Problem: No results appear in the `.kibana-evaluations` datastream.

Solution: verify `EVALUATIONS_ES_URL` is set correctly.

The evaluation suite runs three datasets (baseline, edge, and negative cases; see `evals/rule_generation.spec.ts`). One entry (`suspicious-genai-descendant-activity`) has incomplete ground truth (an empty `query` and no `esqlQuery`) pending publication in the detection-rules repo; the ES|QL Functional Equivalence evaluator returns N/A for that entry. Domains covered include Windows, Linux, cloud, containers, and supply chain.
To expand the dataset, add entries to the appropriate file in `datasets/`:

```ts
export const sampleRules: ReferenceRule[] = [
  // ... existing rules
  {
    id: 'your-rule-id',
    name: 'Your New Rule',
    prompt: 'Describe the detection...\n\nAvailable data: logs-endpoint.events.*',
    description: 'Detects XYZ behavior',
    query: 'process where ...', // reference query (EQL or ES|QL)
    threat: [{ technique: 'T1234', tactic: 'TA0001' }],
    severity: 'high',
    tags: ['Domain: Endpoint', 'OS: Windows'],
    riskScore: 73,
    from: 'now-9m',
    category: 'execution',
    esqlQuery: 'FROM logs-endpoint.events.* ...', // optional: ES|QL translation for non-ES|QL rules
  },
];
```
Run the unit tests:

```sh
yarn test:jest x-pack/solutions/security/packages/kbn-evals-suite-security-ai-rules/src/helpers.test.ts
```

Type check and lint:

```sh
node scripts/type_check --project x-pack/solutions/security/packages/kbn-evals-suite-security-ai-rules/tsconfig.json
node scripts/eslint x-pack/solutions/security/packages/kbn-evals-suite-security-ai-rules
```
When adding new evaluators or modifying existing ones:

1. Define the evaluator in `src/evaluate_dataset.ts`, following the existing `createQuerySyntaxValidityEvaluator` pattern.
2. Wrap it with `skipNegativeCases` and `skipMissingIndexFailures` as appropriate.
3. Add unit tests for any new helpers in `src/helpers.test.ts`.
4. Verify score stability across repeated runs (`EVALUATION_REPETITIONS=3` or more).
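As a standalone illustration of the deterministic (CODE) scoring style, here is a hypothetical field-coverage-style check; the suite's actual evaluator factories wrap @kbn/evals fixtures and are defined in `src/evaluate_dataset.ts`:

```typescript
// Simplified generated-rule shape for illustration only.
interface GeneratedRule {
  name?: string;
  description?: string;
  query?: string;
  severity?: string;
  tags?: string[];
  riskScore?: number;
}

// The required fields listed in the Field Coverage description.
const REQUIRED_FIELDS = ['name', 'description', 'query', 'severity', 'tags', 'riskScore'] as const;

// Fraction of required fields present: 5 of 6 fields yields ~0.83.
function fieldCoverageScore(rule: GeneratedRule): number {
  const present = REQUIRED_FIELDS.filter((f) => rule[f] !== undefined && rule[f] !== null);
  return present.length / REQUIRED_FIELDS.length;
}
```

A pure function like this is trivially unit-testable in `src/helpers.test.ts` and stable across repetitions, which is the main appeal of CODE evaluators over LLM judges.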