Netdata provides a distributed, real-time health monitoring framework that evaluates conditions against your metrics and executes actions on state transitions. You can configure notifications as one of these actions.
Unlike traditional monitoring systems, Netdata evaluates alerts simultaneously at multiple levels: on the edge (Agents) and at aggregation points (Parents), with deduplication in Netdata Cloud. This allows your teams to implement different alerting strategies at different infrastructure levels.
Netdata alerts function as component-level watchdogs. You attach them to specific components/instances (network interfaces, database instances, web servers, containers, processes) where they evaluate metrics at configurable intervals.
To simplify your configuration, you can define alert templates once and apply them to all matching components. The system matches instances by host labels, instance labels, and names, allowing you to define the same alert multiple times with different matching criteria.
Each alert provides a name, value, unit, and status - making them easy to display in dashboards and send as meaningful notifications regardless of your infrastructure's complexity.
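A minimal sketch of such a template, matched by a host label (the label key/value, context, dimension names, and thresholds here are illustrative):

```text
   template: disk_usage
         on: disk.space
host labels: environment=production
       calc: $used * 100 / ($used + $avail)
      units: %
       warn: $this > 80
       crit: $this > 95
```

Every node carrying the `environment=production` label gets this alert on each of its `disk.space` instances, and each instance reports its own name, value, units, and status.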
Your alerts evaluate at the edge. Every Netdata Agent and Parent runs alerts on the metrics it processes and stores (enabled by default, but you can disable alerting at any level). When you stream metrics to a Parent, the Parent evaluates its own alerts on those metrics independently of the child's alerts. Each Agent maintains its own alert configuration and evaluates alerts autonomously. Metric streaming doesn't propagate alert configurations or transitions to Parents.
```
┌─────────┐     Metrics      ┌──────────┐     Metrics      ┌──────────┐
│  Child  │ ───────────────> │ Parent 1 │ ───────────────> │ Parent 2 │
│  Agent  │     of child     │  Agent   │    of child +    │  Agent   │
└────┬────┘                  └────┬─────┘     Parent 1     └────┬─────┘
     │                            │                             │
     │ Evaluates alerts on        │ Evaluates alerts on         │ Evaluates alerts on
     │ local metrics              │ child + local metrics       │ all streamed + local
     │                            │                             │
     ▼                            ▼                             ▼
  Alerts                       Alerts                        Alerts
```
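Because each level evaluates independently, a node that only forwards metrics can switch alerting off entirely. A sketch of `/etc/netdata/netdata.conf` using the `[health]` section's `enabled` switch:

```text
# disable all alert evaluation on this node;
# a Parent upstream still evaluates its own alerts on the streamed metrics
[health]
    enabled = no
```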
Your Netdata Agents treat notifications as actions triggered by alert status transitions. Agents can dispatch notifications or perform automation tasks like scaling services, restarting processes, or rotating logs. Actions are shell scripts or executable programs that receive all alert transition metadata from Netdata.
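As a sketch of such an action, assuming a hypothetical `handle_alert` helper and treating the argument names and order as illustrative (Netdata passes a longer, documented list of fields to the configured `exec` command):

```shell
# handle_alert NAME STATUS VALUE
# Illustrative dispatcher for an alert action script; the three
# arguments shown (and their order) are assumptions for this sketch.
handle_alert() {
    name="$1"    # alert name
    status="$2"  # new status: CLEAR, WARNING, CRITICAL, ...
    value="$3"   # current alert value

    if [ "$status" = "CRITICAL" ]; then
        # a real script might restart a service or scale out here
        echo "restart triggered by ${name} (value: ${value})"
    else
        echo "no action for ${name} (status: ${status})"
    fi
}

handle_alert "ram_usage" "CRITICAL" "95"
```

The same script runs on every status transition, so it should handle CLEAR transitions gracefully rather than assuming it is only called on problems.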
When you claim Agents to Netdata Cloud, they send their alert configurations and transitions to Cloud, which deduplicates them (merging multiple transitions from different Agents for the same host). Netdata Cloud triggers notifications centrally through its integrations (Slack, Microsoft Teams, Amazon SNS, PagerDuty, OpsGenie).
Netdata Cloud deduplicates these transitions so each event produces a single notification, while your Agents and Netdata Cloud trigger actions independently using their own configurations and integrations.

This design enables you to implement layered alerting strategies. For example:
Web Server (Child):
- Alert: system CPU > 80% triggers scale out
- Alert: process X memory > 90% restarts process X
DevOps Parent:
- Alert: Response time > 500ms across all web servers
- Alert: Error rate > 1% for any service
SRE Parent:
- Alert: Anomaly detection on traffic patterns
- Alert: Capacity planning thresholds
Netdata Cloud:
- Receives all alert transitions
- Deduplicates overlapping alerts
- Shows CRITICAL if any instance reports CRITICAL
- Provides unified view for incident response
Each level operates independently while Netdata Cloud provides a coherent, deduplicated view of your entire infrastructure's health (when all agents connect directly to Cloud).
You configure Netdata alerts in layers:

- **Stock alerts** ship in `/usr/lib/netdata/conf.d/health.d` and detect common issues. Don't edit these directly; updates will overwrite your changes.
- **Custom alerts** live in `/etc/netdata/health.d` and override or extend the stock definitions.

You can configure notifications for any infrastructure node at three levels:
| Level | What It Evaluates | Where Notifications Come From | Use Case | Documentation |
|---|---|---|---|---|
| Netdata Agent | Local Metrics | Netdata Agent | Edge automation | Agent integrations |
| Netdata Parent | Local and Children Metrics | Netdata Parent | Edge automation | Agent integrations |
| Netdata Cloud | Receives Transitions | Netdata Cloud | Webhooks, role/room-based | Cloud integrations |
:::note
When using Parents and Cloud with default settings, you may receive duplicate email notifications, because Agents send emails by default when an MTA exists on their systems. When using Cloud, disable email notifications on Agents and Parents by setting `SEND_EMAIL="NO"` in `/etc/netdata/health_alarm_notify.conf` (via `edit-config`).
:::
When you want notifications to come only from Netdata Cloud, disable notifications on your Agents and Parents (set `SEND_EMAIL="NO"` in `/etc/netdata/health_alarm_notify.conf`). This emulates traditional monitoring tools where you configure alerts centrally and dispatch notifications centrally.
When you want edge automation on children but centralized notifications, disable stock alerts on the children (set `enable stock health configuration` to `no` in the `[health]` section of `/etc/netdata/netdata.conf`), keep only your custom automation alerts there, and disable their email notifications (`SEND_EMAIL="NO"`). This enables edge automation on children while maintaining central alerting control and deduplicated Cloud notifications.
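The stock-alert switch looks like this in `/etc/netdata/netdata.conf` (the children then run only the custom alerts you place in their `health.d` directory):

```text
[health]
    enable stock health configuration = no
```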
To configure notifications in Netdata Cloud, go to **Space → Notifications**.

To configure Agent notifications:

1. Open the notification config:

   ```bash
   sudo ./edit-config health_alarm_notify.conf
   ```

2. Enable your method (example: email):

   ```text
   SEND_EMAIL="YES"
   DEFAULT_RECIPIENT_EMAIL="[email protected]"
   ```

3. Verify your system can send mail (sendmail or an SMTP relay).

4. Restart the Agent:

   ```bash
   sudo systemctl restart netdata
   ```
Netdata supports two alert types: **alarms**, which attach to one specific chart instance, and **templates**, which apply to every chart matching a context.
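A sketch contrasting the two (chart, context, dimension names, and thresholds are illustrative):

```text
# an alarm attaches to one specific chart instance
   alarm: ram_usage
      on: system.ram
    calc: $used * 100 / ($used + $free)
   units: %
    warn: $this > 80

# a template applies to every chart of a context
template: disk_usage
      on: disk.space
    calc: $used * 100 / ($used + $avail)
   units: %
    warn: $this > 80
```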
Your alerts produce more than threshold checks. Each generates a name, a calculated value, units, and a status.
This enables sophisticated alerts like:
- `out of disk space time: 450 seconds` - predicts when the disk fills at the current rate
- `3xx redirects: 12.5 percent` - calculates redirects as a percentage of total requests
- `response time vs yesterday: 150%` - compares current response time to a historical baseline

Your alerts exist in one of these states:
| State | Description | Trigger |
|---|---|---|
| CLEAR | Normal - conditions exist but not triggered | Warning and critical conditions evaluate to zero |
| WARNING | Warning threshold exceeded | Warning condition evaluates to non-zero |
| CRITICAL | Critical threshold exceeded | Critical condition evaluates to non-zero |
| UNDEFINED | Cannot evaluate | No conditions defined, or value is NaN/Inf |
| UNINITIALIZED | Never evaluated | Alert just created |
| REMOVED | Alert deleted | Child disconnected, agent exit, or health reload |
Alerts transition freely between states based on each evaluation's result: there is no fixed escalation order, so an alert can move directly from CLEAR to CRITICAL and back.
Your alerts perform complex calculations:
```
  lookup               calc             warn,crit          status
┌──────────┐       ┌──────────┐       ┌──────────┐      ┌───────────┐
│ Database │       │Expression│       │ Warning  │      │ Execute   │
│  Query   │──────>│Processor │──────>│ Critical │ ───> │ Action on │
│(optional)│ $this │(optional)│ $this │  Checks  │      │Transition │
└──────────┘       └──────────┘       └──────────┘      └───────────┘
```
Examples:
```text
# Simple threshold
calc: $used
# Result: $this = latest value of dimension 'used'

# Time-series lookup
lookup: average -1h of used
# Result: $this = average of 'used' over last hour

# Combined calculation
lookup: average -1h of used
  calc: $this * 100 / $total
# Result: $this = percentage of hourly average vs total

# Baseline comparison
lookup: average -1h of used
  calc: $this * 100 / $average_yesterday
# Result: $this = percentage vs yesterday's average
```
After calculating, check conditions:
```text
# Simple conditions
warn: $this > 80
crit: $this > 90

# Flapping prevention (hysteresis)
warn: ($status >= $WARNING) ? ($this > 50) : ($this > 80)
crit: ($status == $CRITICAL) ? ($this > 70) : ($this > 90)

# Complex conditions
warn: $this > 80 AND $rate > 10
crit: $this > 90 OR $failures > 5
```
Each condition evaluates to a number: non-zero raises that status, zero does not. The final status is the most severe condition that triggered: CRITICAL wins over WARNING, and if neither triggers the alert is CLEAR.
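A worked instance with illustrative thresholds:

```text
warn: $this > 80
crit: $this > 90
# $this = 70 → both zero       → CLEAR
# $this = 85 → warn non-zero   → WARNING
# $this = 95 → both non-zero   → CRITICAL (critical wins)
```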
Alert evaluation runs independently from data collection:
```
Data Collection                 Alert Evaluation
      │                               │
      ▼ every 1s                      ▼ configurable interval
  [Metrics] ────────────────> [Alert Engine]
                                      │
                                      ▼
                              Query metrics,
                              Calculate values,
                              Check conditions
```
Use `every` to set custom evaluation intervals.

Netdata prevents alert flapping through several mechanisms. The first is hysteresis: status-aware thresholds.
```text
warn: ($status < $WARNING) ? ($this > 80) : ($this > 50)
```
Triggers at 80, clears at 50, preventing flapping between 50-80.
Alerts transition immediately in dashboards but notifications use exponential backoff.
```text
lookup: average -10m of used
  warn: $this > 80
```
Requires 10 minutes of data before triggering.
Create dependent alerts:
```text
# Stage 1: Baseline
template: requests_average_yesterday
      on: web_log.requests
  lookup: average -1h at -1d
   every: 10s

# Stage 2: Current
template: requests_average_now
      on: web_log.requests
  lookup: average -1h
   every: 10s

# Stage 3: Compare
template: web_requests_vs_yesterday
      on: web_log.requests
    calc: $requests_average_now * 100 / $requests_average_yesterday
   units: %
    warn: $this > 150 || $this < 75
    crit: $this > 200 || $this < 50
```
Variables resolve in order (first match wins):
| Variable | Description | Value |
|---|---|---|
| `$this` | Current calculated value | Result from lookup/calc |
| `$after` | Query start timestamp | Unix timestamp |
| `$before` | Query end timestamp | Unix timestamp |
| `$now` | Current time | Unix timestamp |
| `$last_collected_t` | Last collection time | Unix timestamp |
| `$update_every` | Collection frequency | Seconds |
| `$status` | Current status code | -2 to 3 |
| `$REMOVED` | Status constant | -2 |
| `$UNINITIALIZED` | Status constant | -1 |
| `$UNDEFINED` | Status constant | 0 |
| `$CLEAR` | Status constant | 1 |
| `$WARNING` | Status constant | 2 |
| `$CRITICAL` | Status constant | 3 |
| Syntax | Description | Example |
|---|---|---|
| `$dimension_name` | Last normalized value | `$used` |
| `$dimension_name_raw` | Last raw collected value | `$used_raw` |
| `$dimension_name_last_collected_t` | Collection timestamp | `$used_last_collected_t` |
```text
template: disk_usage_percent
      on: disk.space
    calc: $used * 100 / ($used + $available)
   units: %
```
```text
calc: $used > $threshold                     # if the chart defines 'threshold'
warn: $connections > $max_connections * 0.8  # if the host defines 'max_connections'
```
```text
# Alert 1
template: cpu_baseline
    calc: $system + $user

# Alert 2
template: cpu_check
    calc: $system
    warn: $this > $cpu_baseline * 1.5
```
```text
template: disk_io_vs_iops
      on: disk.io
    calc: $reads / ${disk.iops.reads}
   units: bytes per operation
```
When alerts reference variables matching multiple instances, Netdata uses label similarity scoring:
Example: an alert on `disk.io` (labels: `device=sda`, `mount=/data`) references `${disk.iops.reads}`:

- `disk.iops` for `sda` (labels match) → score: 2
- `disk.iops` for `sdb` (no match) → score: 0

Result: Netdata uses `sda`'s value.

During lookups with missing data:

- if the query window contains no data, `$this` becomes NaN
- if a referenced dimension is missing, `$this` becomes NaN

This handles intermittent collection, dynamic dimensions, and partial outages.
Netdata determines evaluation frequency as follows.

With `lookup`, the interval defaults to the window duration:

```text
lookup: average -5m    # evaluates every 5 minutes
```

Without `lookup`, set it explicitly:

```text
every: 10s
 calc: $system + $user
```

To override the default, combine both:

```text
lookup: average -1m
 every: 10s    # check every 10s despite the 1m window
```
Lookups can use the `unaligned` option for efficiency.

The Netdata Assistant provides AI-powered troubleshooting when alerts trigger.
The Assistant window follows you through dashboards for easy reference while investigating.
Visit our Alerts Troubleshooting space for complex issues. Get help through GitHub or Discord. Share your solutions to help others.
Tune alerts for your environment by adjusting thresholds, writing custom conditions, silencing alerts, and using statistical functions.
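For example, a tuned copy of an alert might raise thresholds and silence its notifications entirely (`to: silent` suppresses the notification action; names and thresholds are illustrative):

```text
template: disk_usage_tuned
      on: disk.space
    calc: $used * 100 / ($used + $avail)
   units: %
    warn: $this > 90
    crit: $this > 98
      to: silent
```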