x-pack/solutions/observability/plugins/slo/README.md
A Kibana plugin
See the Kibana contributing guide for instructions on setting up your development environment.
SLO API tests live under test/scout/api/tests/. Run them with Scout (stateful classic or serverless observability), for example:
node scripts/scout.js run-tests --arch stateful --domain classic --config x-pack/solutions/observability/plugins/slo/test/scout/api/playwright.config.ts
Or target files:
node scripts/scout.js run-tests --arch stateful --domain classic --testFiles x-pack/solutions/observability/plugins/slo/test/scout/api/tests/slo_create.spec.ts
Good Events: Events that meet your success criteria
Total Events: All events that occurred
SLI (Service Level Indicator): The calculated ratio
SLI = Good Events / Total Events
Target: Your objective (e.g., 99.9% = 0.999)
Error Budget: The allowed failure rate
Error Budget = 1 - Target
Example: 99.9% target = 0.1% error budget
Burn Rate: How fast you're consuming your error budget
Burn Rate = Error Rate / Error Budget
Rolling Time Window
Calendar Aligned Time Window
Occurrences Budgeting
SLI = Good Events / Total Events
Timeslices Budgeting
SLI = 1 - (Bad Slices / Total Slices in Window)
APM Transaction Error Rate (APM Availability)
event.outcome field
APM Transaction Duration (APM Latency)
Custom KQL Query
Custom Metric
Histogram Metric
Timeslice Metric
Synthetics
Service Level Objectives (SLOs) are measurable targets that define the expected level of service reliability. They answer the question: "What percentage of time should our service meet performance expectations?"
function computeSLI(good: number, total: number): number {
  if (total === 0) {
    return -1; // No data
  }
  return good / total;
}
Example:
Good: 9,900 requests
Total: 10,000 requests
SLI: 9,900 / 10,000 = 0.99 (99%)
function computeSLI(
  good: number, // Count of good slices observed
  total: number, // Count of total slices observed
  totalSlicesInRange: number // Total slices in window (e.g., 2,016 for 7d/5m)
): number {
  // Key insight: Missing slices are considered GOOD
  const badSlices = total - good;
  return 1 - badSlices / totalSlicesInRange;
}
Example:
Time window: 7 days
Slice window: 5 minutes
Total possible slices: 2,016
Observed slices: 1,800 (some data missing)
Good slices: 1,750
Bad slices: 50
SLI = 1 - (50 / 2,016) = 1 - 0.0248 = 0.9752 (97.52%)
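Why missing data is "good": only observed bad slices count against the SLI. The calculation assumes the service was healthy during periods with no data, so gaps in the source data do not consume error budget.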
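The slice count in the example above follows directly from the window and slice durations. The snippet below is a minimal sketch of that arithmetic, not the plugin's implementation; `totalSlicesInWindow` and `timeslicesSLI` are hypothetical helper names introduced for illustration.

```typescript
// Hypothetical helper: how many slices fit into a time window.
function totalSlicesInWindow(windowMinutes: number, sliceMinutes: number): number {
  if (sliceMinutes <= 0) {
    throw new Error('sliceMinutes must be positive');
  }
  return Math.floor(windowMinutes / sliceMinutes);
}

// Hypothetical helper mirroring the timeslices formula above:
// only observed bad slices reduce the SLI; missing slices do not.
function timeslicesSLI(goodSlices: number, observedSlices: number, totalSlices: number): number {
  const badSlices = observedSlices - goodSlices;
  return 1 - badSlices / totalSlices;
}

const totalSlices = totalSlicesInWindow(7 * 24 * 60, 5); // 7 days in 5-minute slices
const sli = timeslicesSLI(1750, 1800, totalSlices);

console.log(totalSlices); // 2016
console.log(sli.toFixed(4)); // "0.9752"
```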
function computeErrorBudget(target: number, sliValue: number) {
  const initialErrorBudget = 1 - target;
  const consumedErrorBudget = 1 - sliValue;
  const remainingErrorBudget = initialErrorBudget - consumedErrorBudget;
  return {
    initial: initialErrorBudget,
    consumed: consumedErrorBudget,
    remaining: remainingErrorBudget,
    remainingPercentage: (remainingErrorBudget / initialErrorBudget) * 100,
  };
}
Example:
Target: 99.9% (0.999)
SLI: 99.5% (0.995)
Initial error budget: 1 - 0.999 = 0.001 (0.1%)
Consumed: 1 - 0.995 = 0.005 (0.5%)
Remaining: 0.001 - 0.005 = -0.004 (-0.4%)
→ Over budget! ⚠️
With events:
Total events: 1,000,000
Target: 99.9%
Error budget: 0.1% = 1,000 allowed failures
Actual failures: 5,000
Budget remaining: 1,000 - 5,000 = -4,000
→ 400% over budget!
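The event-based view above can be sketched in code as well. This is an illustrative sketch only; `computeEventBudget` and its return shape are hypothetical and not part of the plugin.

```typescript
interface EventBudget {
  allowedFailures: number;
  remainingFailures: number;
  overBudgetPercentage: number; // 0 when still within budget
}

// Hypothetical helper: express the error budget as a number of events.
function computeEventBudget(totalEvents: number, target: number, actualFailures: number): EventBudget {
  // Round to absorb floating-point noise (1 - 0.999 is not exactly 0.001).
  const allowedFailures = Math.round(totalEvents * (1 - target));
  const remainingFailures = allowedFailures - actualFailures;
  const overBudgetPercentage =
    remainingFailures < 0 && allowedFailures > 0
      ? (-remainingFailures / allowedFailures) * 100
      : 0;
  return { allowedFailures, remainingFailures, overBudgetPercentage };
}

const budget = computeEventBudget(1000000, 0.999, 5000);
console.log(budget.allowedFailures); // 1000
console.log(budget.remainingFailures); // -4000
console.log(budget.overBudgetPercentage); // 400
```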
Burn Rate tells you how fast you're consuming your error budget.
function computeBurnRate(target: number, sliValue: number): number {
  if (sliValue >= 1) {
    return 0; // Perfect performance
  }
  const errorBudget = 1 - target;
  const errorRate = 1 - sliValue;
  return errorRate / errorBudget;
}
Example 1: Healthy Service
Target: 99.9% (0.999)
SLI: 99.95% (0.9995)
Error budget: 0.001 (0.1%)
Error rate: 0.0005 (0.05%)
Burn rate: 0.0005 / 0.001 = 0.5
→ Consuming budget at 50% of allowed rate ✅
Example 2: Service Under Stress
Target: 99.9% (0.999)
SLI: 99% (0.99)
Error budget: 0.001 (0.1%)
Error rate: 0.01 (1%)
Burn rate: 0.01 / 0.001 = 10
→ Consuming budget 10× faster than allowed! ⚠️
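Another way to read the burn rates in the two examples above is as time until the budget runs out. The sketch below assumes a full budget at the start of the window and a constant burn rate; the 30-day window and the helper name `hoursUntilBudgetExhausted` are assumptions made for illustration.

```typescript
// Hypothetical helper: at a constant burn rate, a full error budget
// lasts (window length / burn rate).
function hoursUntilBudgetExhausted(windowHours: number, burnRate: number): number {
  if (burnRate <= 0) {
    return Infinity; // Nothing is being consumed
  }
  return windowHours / burnRate;
}

const windowHours = 30 * 24; // Assumed 30-day window

console.log(hoursUntilBudgetExhausted(windowHours, 0.5)); // 1440 (budget outlasts the window)
console.log(hoursUntilBudgetExhausted(windowHours, 1)); // 720 (budget lasts exactly the window)
console.log(hoursUntilBudgetExhausted(windowHours, 10)); // 72 (budget gone in 3 days)
```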
Burn Rate Interpretation:
SLOs track burn rates over multiple time windows for smarter alerting:
{
  oneHourBurnRate: 14.4, // Last hour
  fiveMinuteBurnRate: 20.1, // Last 5 min
  oneDayBurnRate: 2.5 // Last day
}
Alert logic (simplified):
IF oneHourBurnRate > 14.4 AND fiveMinuteBurnRate > 14.4:
  → Fast burn detected (alert: critical)
IF oneDayBurnRate > 6 AND oneHourBurnRate > 3:
  → Medium burn detected (alert: high)
This prevents alert fatigue from short spikes while catching sustained issues.
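The simplified alert logic above could be written roughly as follows. This is a sketch of the idea using the thresholds quoted in this section, not the actual burn rate rule implementation; the `Severity` type and `classifyBurn` function are hypothetical.

```typescript
type Severity = 'critical' | 'high' | 'none';

interface BurnRates {
  fiveMinuteBurnRate: number;
  oneHourBurnRate: number;
  oneDayBurnRate: number;
}

// Hypothetical classifier: each severity requires both a long and a
// short window to exceed its threshold, so a brief spike that has
// already ended does not trigger an alert.
function classifyBurn(rates: BurnRates): Severity {
  if (rates.oneHourBurnRate > 14.4 && rates.fiveMinuteBurnRate > 14.4) {
    return 'critical'; // Fast burn
  }
  if (rates.oneDayBurnRate > 6 && rates.oneHourBurnRate > 3) {
    return 'high'; // Medium burn
  }
  return 'none';
}

console.log(classifyBurn({ fiveMinuteBurnRate: 20, oneHourBurnRate: 15, oneDayBurnRate: 1 })); // "critical"
console.log(classifyBurn({ fiveMinuteBurnRate: 2, oneHourBurnRate: 4, oneDayBurnRate: 7 })); // "high"
console.log(classifyBurn({ fiveMinuteBurnRate: 20.1, oneHourBurnRate: 14.4, oneDayBurnRate: 2.5 })); // "none"
```

Note that the comparisons are strict, so a one-hour burn rate of exactly 14.4 does not trip the critical condition.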
┌─────────────────────────────────────────────────────────────┐
│                 SLO DATA FLOW ARCHITECTURE                  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────┐
│   Source Data   │  (e.g., APM logs, metrics, custom indices)
│  (your-index)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐  Queries source data continuously (every 1m)
│     Rollup      │  Aggregates good/bad events into time buckets
│    Transform    │  Groups by SLO ID, timestamp, and optional groupBy fields
└────────┬────────┘
         │
         │ dest.pipeline: "slo-{sloId}-{revision}"
         ▼
┌─────────────────────────────────────────────────────────────┐
│  SLI INGEST PIPELINE (Rollup Pipeline)                      │
│  slo-{sloId}-{revision}                                     │
│                                                             │
│  1. Set _id, event.ingested, slo.id, slo.name, etc.         │
│  2. Route to correct monthly index                          │
│  3. Generate slo.instanceId                                 │
│  4. ──► CALL slo-{sloId}@custom (per-SLO custom pipeline)   │
│         │                                                   │
│         │ Optional: Add global custom pipeline:             │
│         └──► slo-rollup-global@custom                       │
│                                                             │
└────────┬────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────┐
│  Rollup Index   │  .slo-observability.sli-v3.3-YYYY.MM
│   (SLI data)    │  Contains: good/bad event counts per time bucket
└────────┬────────┘
         │
         │ Summary Transform reads from rollup index
         ▼
┌─────────────────┐  Aggregates all rollup buckets
│     Summary     │  Calculates: SLI value, error budget, burn rates
│    Transform    │  Runs continuously (every 1m)
└────────┬────────┘
         │
         │ dest.pipeline: "slo-summary-{sloId}-{revision}"
         ▼
┌─────────────────────────────────────────────────────────────┐
│  SUMMARY INGEST PIPELINE                                    │
│  slo-summary-{sloId}-{revision}                             │
│                                                             │
│  1. Set status (HEALTHY/DEGRADING/VIOLATED/NO_DATA)         │
│  2. Set all SLO metadata (name, tags, objective, etc.)      │
│  3. Calculate burn rate values                              │
│  4. ──► CALL slo-summary-{sloId}@custom (per-SLO custom)    │
│         │                                                   │
│         │ Optional: Add global custom pipeline:             │
│         └──► slo-summary-global@custom                      │
│                                                             │
└────────┬────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────┐
│  Summary Index  │  .slo-observability.summary-v3.3
│                 │  Contains: current SLO status, SLI values, burn rates
│                 │  (Displayed in UI)
└─────────────────┘
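The resource names in the diagram above follow a predictable pattern, which helps when looking up a specific SLO's pipelines and indices during debugging. The helper below is a hypothetical sketch that only assembles those names from the patterns shown; it assumes the monthly rollup index is chosen by the UTC month of the supplied date, which is an assumption rather than something this document confirms.

```typescript
interface SloResourceNames {
  rollupPipeline: string;
  rollupCustomPipeline: string;
  rollupIndex: string;
  summaryPipeline: string;
  summaryCustomPipeline: string;
  summaryIndex: string;
}

// Hypothetical helper: derive the names used in the data flow diagram.
function sloResourceNames(sloId: string, revision: number, date: Date): SloResourceNames {
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1; // getUTCMonth() is zero-based
  const paddedMonth = (month < 10 ? '0' : '') + month;
  return {
    rollupPipeline: `slo-${sloId}-${revision}`,
    rollupCustomPipeline: `slo-${sloId}@custom`,
    rollupIndex: `.slo-observability.sli-v3.3-${year}.${paddedMonth}`,
    summaryPipeline: `slo-summary-${sloId}-${revision}`,
    summaryCustomPipeline: `slo-summary-${sloId}@custom`,
    summaryIndex: '.slo-observability.summary-v3.3',
  };
}

const names = sloResourceNames('abc123', 2, new Date(Date.UTC(2024, 2, 15)));
console.log(names.rollupPipeline); // "slo-abc123-2"
console.log(names.rollupIndex); // ".slo-observability.sli-v3.3-2024.03"
console.log(names.summaryPipeline); // "slo-summary-abc123-2"
```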
Rollup Transform:
SLI Ingest Pipeline (slo-{sloId}-{revision}):
slo-{sloId}@custom)
Summary Transform:
Summary Ingest Pipeline (slo-summary-{sloId}-{revision}):
slo-summary-{sloId}@custom)
Causes:
Debug:
GET logs-*/_search
Causes:
Fix:
syncDelay setting
POST _transform/<transform-id>/_start
syncField with event.ingested (recommended): If your source data includes an event.ingested field, configure the transform to use it instead of the source timestamp field.
Causes:
Debug:
GET index/_mapping
Symptom: Transform slow or failing, many SLO instances
Fix:
service.name instead of host.id)
env: production), specific regions, or critical services

| Term | Definition |
|---|---|
| SLI | Service Level Indicator - The actual measurement of service performance |
| SLO | Service Level Objective - The target goal for the SLI |
| SLA | Service Level Agreement - Contractual obligation with consequences |
| Error Budget | The allowed amount of failures (1 - target) |
| Burn Rate | How fast error budget is being consumed relative to the target |
| Good Events | Events that meet success criteria |
| Total Events | All events measured |
| Timeslice | A time bucket that is evaluated as entirely good or bad |
| Occurrences | Budgeting method based on event counts |
| Rolling Window | Time window that moves with current time |
| Calendar Aligned | Time window aligned to calendar boundaries |
| Transform | Elasticsearch feature that aggregates source data |
| Grouping | Splitting one SLO into multiple instances by a field |