docs/ml-ai/ml-anomaly-detection/ml-accuracy.md
This document analyzes Netdata's machine learning approach to anomaly detection. The system employs an ensemble of k-means clustering models with a consensus-based decision mechanism, achieving a calculated false positive rate of 10^-36 per metric. The analysis examines the mathematical foundations, design trade-offs, and operational characteristics of the implementation.
Netdata's anomaly detection system operates on the following principles:
The system employs k-means clustering with k=2, effectively partitioning each metric's behavioral space into "normal" and "potentially anomalous" clusters. The choice of k=2 represents a fundamental design decision prioritizing simplicity and interpretability over nuanced classification.
Feature Engineering: Each data point is transformed into a 6-dimensional feature vector built from the differenced value and its recent temporal lags.
This feature space captures both instantaneous changes and temporal patterns while remaining computationally tractable.
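Below is a minimal Python sketch of this preprocessing and training stage. It assumes the feature vector is composed of a differenced value plus its five most recent lagged values (consistent with the differencing and temporal lags referenced in the detection table later in this document); the window sizes, the synthetic data, and the use of scikit-learn are illustrative only, not Netdata's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_feature_vectors(values, n_lags=5):
    """Build 6-dimensional feature vectors: the differenced value plus its
    five most recent lagged differences. Dimensions are illustrative."""
    diffs = np.diff(values)                      # capture instantaneous changes
    vectors = [diffs[i - n_lags:i + 1]           # current diff + 5 lagged diffs
               for i in range(n_lags, len(diffs))]
    return np.array(vectors)

# Train a k=2 model on a window of recent samples for one metric.
raw = np.random.normal(50, 5, size=3600)         # stand-in for collected samples
X = make_feature_vectors(raw)
model = KMeans(n_clusters=2, n_init=10).fit(X)   # two clusters: "normal" vs "potentially anomalous"
```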
The anomaly score calculation employs min-max normalization:
distance = ||x - μ||₂ where μ ∈ {c₁, c₂}, the nearest of the two cluster centers
score = 100 × (distance - min_distance) / (max_distance - min_distance)
Where min_distance and max_distance are determined during training. A score ≥ 99 indicates the point lies at or beyond the extremes observed during training.
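Continuing the hypothetical `model` and `X` from the previous sketch, the scoring step can be expressed as follows: the score is the min-max normalized distance from a new feature vector to the nearest of the two cluster centers, with the bounds recorded during training. This is a sketch of the formula above, not the production code.

```python
def train_score_bounds(model, X_train):
    """Record the smallest and largest distance-to-nearest-center seen in training."""
    d = np.linalg.norm(X_train[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
    nearest = d.min(axis=1)                      # distance to the nearest of the two centers
    return nearest.min(), nearest.max()          # min_distance, max_distance

def anomaly_score(model, x, min_distance, max_distance):
    """Min-max normalized distance to the nearest cluster center, on a 0-100 scale."""
    distance = np.linalg.norm(model.cluster_centers_ - x, axis=1).min()
    return 100.0 * (distance - min_distance) / (max_distance - min_distance)

min_d, max_d = train_score_bounds(model, X)
latest_score = anomaly_score(model, X[-1], min_d, max_d)
is_anomalous = latest_score >= 99                # at or beyond the extremes seen in training
```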
The false positive rate calculation assumes independence among models:
P(false positive) = P(all 18 models flag anomaly | no true anomaly)
= ∏ᵢ₌₁¹⁸ P(model i flags anomaly | no true anomaly)
= (0.01)¹⁸
= 10⁻³⁶
The independence assumption rests on the models being trained over offset time windows, each with its own normalization. While the models are designed for independence in this way, some degree of correlation may persist due to shared metric behavior across time, so the 10^-36 rate should be treated as a strong theoretical bound rather than an empirical guarantee.
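A minimal sketch of the consensus rule this calculation implies, assuming `models` is a list of `(model, min_distance, max_distance)` tuples trained on offset windows as in the earlier sketches; the structure is illustrative, not the actual implementation.

```python
def consensus_is_anomalous(models, x, threshold=99.0):
    """Flag a point only when every model in the ensemble calls it anomalous."""
    return all(anomaly_score(m, x, lo, hi) >= threshold for m, lo, hi in models)

# Under the independence assumption, each individual model is wrong about 1% of
# the time, so the probability that all 18 are wrong simultaneously is:
p_false_positive = 0.01 ** 18                    # = 1e-36
```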
Host-level anomaly detection employs a two-stage process:
anomaly_rate(t) = count(anomalous_metrics(t)) / total_metrics
host_anomaly = average(anomaly_rate(t - 5min, t)) ≥ threshold
For a typical 5,000-metric host with a 1% threshold:
P(false host anomaly) ≈ (5000 choose 50) × (10⁻³⁶)⁵⁰ ≈ 10⁻¹⁶⁸⁰
This probability is effectively zero for all practical purposes.
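A sketch of the two-stage host-level rule, under the assumption that per-metric anomaly flags are available as a boolean matrix (`metric_flags` is a hypothetical name, not a Netdata API); the 1% threshold and 5-minute window follow the figures used above.

```python
import numpy as np

def host_is_anomalous(metric_flags, threshold=0.01, window_seconds=300):
    """metric_flags: boolean array of shape (seconds, n_metrics); True where a
    metric was flagged anomalous at that second."""
    per_second_rate = metric_flags.mean(axis=1)   # anomaly_rate(t): share of metrics anomalous
    recent = per_second_rate[-window_seconds:]    # last 5 minutes
    return recent.mean() >= threshold             # host-level decision
```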
Computational Efficiency
Operational Simplicity
Statistical Robustness
Architectural Advantages
Root Cause Analysis Capabilities
Temporal Coverage Constraints
Algorithm Simplicity
Fixed Hyperparameters
Detection Boundaries
| Anomaly Type | Description | Detected? | Detection Mechanism |
|---|---|---|---|
| Point Anomalies | Sudden spikes or drops exceeding historical bounds | ✅ | Anomaly score ≥ 99 (at or beyond training extremes) |
| Contextual Anomalies | Normal values in abnormal sequences | ✅ | 6D feature space with temporal lags |
| Collective Anomalies | Concurrent anomalies across multiple metrics | ✅ | Correlation engine and Anomaly Advisor |
| Change Points | Sudden shifts to new normal levels | ✅ | Detects transition, adapts within 3-57h |
| Concept Drifts | Gradual drift to new states | ⚠️ | Only if drift occurs within 57 hours |
| Rate-of-Change Anomalies | Abnormal acceleration/deceleration | ✅ | Differenced values in feature vector |
| Short-term Patterns | Hourly/daily pattern violations | ✅ | Multiple models capture different cycles |
| Weekly Patterns | 5-day work week behaviors | ❌ | Exceeds 57-hour memory window |
| Gradual Degradation | Slow drift over 57+ hours | ❌ | Models adapt to degradation as normal |
| Known Scheduled Events | Black Friday, maintenance windows | ❌ | Would require training exclusion |
The current implementation effectively detects the following anomaly types:

- Point Anomalies (Strange Points)
- Contextual Anomalies (Strange Patterns)
- Collective Anomalies (Strange Multivariate Patterns)
- Change Points (Strange Steps)
- Concept Drifts (Strange Trends), partially detected
- Rate-of-Change Anomalies
The following anomaly types cannot be reliably detected with the current fixed-window approach:

- Long-term Seasonal Patterns
- Gradual Performance Degradation
- Rare but Regular Events
- Metric-Specific Patterns
- Known Anomalous Periods
Rationale: The choice of k=2 reflects a fundamental philosophy prioritizing operational reliability over detection sophistication; alternatives offering finer-grained classification were considered and set aside. Trade-off: reduced anomaly classification granularity in exchange for guaranteed stability.
Rationale: The training window and number of retained models balance memory usage, computational cost, and temporal coverage; the arithmetic behind the roughly 57-hour memory window is sketched below. Trade-off: limited long-term pattern recognition in exchange for predictable resource usage.
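The source of the 57-hour figure is not spelled out in this document. One reconstruction consistent with the 3-to-57-hour adaptation range quoted elsewhere on this page assumes 18 retained models per metric, each trained on roughly 6 hours of data and refreshed roughly every 3 hours; treat these numbers as illustrative assumptions rather than confirmed configuration.

```python
MODELS_PER_METRIC = 18     # models kept per metric (from the consensus mechanism)
TRAIN_EVERY_HOURS = 3      # assumed retraining interval
TRAIN_WINDOW_HOURS = 6     # assumed per-model training window

# The newest model reflects recent data, so adaptation begins within ~3 hours;
# the oldest retained model still "remembers" data from up to:
memory_hours = TRAIN_WINDOW_HOURS + (MODELS_PER_METRIC - 1) * TRAIN_EVERY_HOURS  # 6 + 17*3 = 57
```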
Rationale: An extreme conservative bias eliminates virtually all false positives; more sensitive alternative approaches were set aside for this reason. Trade-off: potential false negatives in exchange for near-certain true positive identification.
Rationale: The distribution-agnostic min-max approach works for any metric type, in contrast to alternatives that assume a particular statistical distribution. Trade-off: less statistical rigor in exchange for universal applicability.
Based on implementation analysis:
False Positive Analysis:
False Negative Analysis:
Analysis of the system in production environments reveals:
When evaluated against alternative approaches:
| Aspect | Netdata ML | Statistical (3σ) | Deep Learning | Commercial APM |
|---|---|---|---|---|
| False Positive Rate | 10^-36 | 0.3% | Variable | Typically 0.1-1% |
| Configuration Required | None | Minimal | Extensive | Moderate to High |
| Resource Overhead | 2-5% CPU | <1% CPU | 30-60% CPU | Unknown |
| Pattern Memory | 57 hours (configurable) | Unlimited | Model-dependent | Days to Weeks |
| Adaptation Speed | 3 hours (configurable) | Immediate | Retraining required | Hours to Days |
| Metric Coverage | ALL metrics | Selected metrics | Selected metrics | Selected metrics |
| ML Enablement | Automatic | Manual per metric | Manual training | Manual/Paid tier |
| Infrastructure-Level Outage Detection | Automatic | No | No | No |
| Correlation Discovery | Automatic | No | Limited | Manual/Limited |
Critical Distinctions:
Universal Coverage: Netdata applies ML anomaly detection to every single metric collected (typically 3,000-20,000 per server) without configuration or additional cost. Commercial APMs typically require manual selection of metrics for ML analysis, often limit the number of ML-enabled metrics, and may charge additional fees for ML capabilities.
Infrastructure-Level Intelligence: Netdata automatically calculates host-level anomaly rates, detecting when a server exhibits abnormal behavior across multiple metrics. This capability identifies infrastructure-wide issues that metric-by-metric approaches miss.
Automatic Correlation Discovery: During incidents, Netdata's correlation engine automatically identifies which metrics are anomalous together, revealing hidden relationships and cascading failures. Commercial solutions typically require manual investigation or pre-configured correlation rules.
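As a rough illustration of the idea (not Netdata's actual correlation engine), metrics can be ranked by how much of an incident window they spent flagged as anomalous, reusing the hypothetical `metric_flags` matrix from the host-level example above.

```python
def rank_metrics_by_anomaly_rate(metric_flags, metric_names, start, end, top=20):
    """Rank metrics by the share of the incident window [start, end) during which
    they were flagged anomalous, surfacing metrics that misbehaved together."""
    rates = metric_flags[start:end].mean(axis=0)   # per-metric anomaly rate in the window
    order = rates.argsort()[::-1][:top]            # most anomalous first
    return [(metric_names[i], float(rates[i])) for i in order]
```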
These fundamental differences mean Netdata can detect both obvious infrastructure failures and subtle, complex issues automatically, while other solutions may miss issues in non-monitored metrics or fail to identify systemic problems.
Netdata's ML implementation represents a deliberate optimization for operational reliability over detection sophistication. The mathematical foundation ensures extraordinarily low false positive rates at the cost of potentially missing subtle or long-term patterns.
The consensus mechanism's reduction of false positives to 10^-36 represents a significant achievement in practical anomaly detection, effectively eliminating random false insights while maintaining sensitivity to genuine infrastructure issues.
Netdata's ML is not a replacement for deep statistical analysis or business-intent monitoring. But it is, unequivocally, one of the most reliable, scalable, and maintenance-free anomaly detection engines for infrastructure and application metrics available today.
Running 20+ servers or a fleet of IoT/edge devices? This is your early warning system for unexpected behaviors.
Managing a complex microservice deployment with unpredictable patterns? Layer this in as the safety net that never sleeps.
Need to detect infrastructure problems without a team of data scientists? This gives you automated anomaly detection that actually works.
The system's strength lies in its ability to provide trustworthy anomaly detection and surface correlations and dependencies across components and applications, without configuration or tuning. The trade-offs — limited temporal memory, binary detection, and conservative thresholds — represent a careful balance between sensitivity and reliability, false positives and false negatives. These design choices ensure the system maintains its 10^-36 false positive rate while still catching meaningful infrastructure issues, working reliably out of the box without drowning you in false insights.
For environments requiring detection of weekly patterns or gradual degradation over months, you'll need supplementary approaches (we also plan to support this with additional configuration to define periodicity). But for detecting significant, unexpected behavioral changes in infrastructure metrics — the kind that actually break things — Netdata's ML delivers exceptional reliability with negligible overhead.
In short: Yes, you need it.
Not as your only monitoring tool — but as the one that makes all the others smarter.