src/health/REFERENCE.md
:::tip
Don't want to write alert configs manually? Alerts Automation uses AI to suggest alerts, generate the configuration, and test it against historical data — so you can validate thresholds before deployment without learning the syntax. You can also use the visual Alerts Configuration Manager.
:::
This page covers manual alert configuration — editing config files directly. For a comparison of all configuration methods, see How alert configuration works in Netdata.
:::tip
What You'll Learn
In 5 minutes, you'll know how to edit existing alerts, create new ones, and reload your configuration without downtime.
:::
Step 1: Find Your Config Directory
Navigate to your Netdata config directory
Step 2: Edit an Alert
sudo ./edit-config health.d/cpu.conf
Step 3: Apply Changes
sudo netdatacli reload-health
:::note
Key Concept
You can highly configure Netdata's health watchdog with support for dynamic thresholds, hysteresis, alert templates, and more. You can customize any existing alerts based on your infrastructure's topology or specific monitoring needs or create entirely new entities.
You can use health alerts with any of Netdata's collectors (see the supported collector list) to monitor your systems, containers, and applications in real time.
While you can view active alerts on both the local dashboard and Netdata Cloud, you configure all health alerts per node via individual Netdata Agents. If you want to deploy a new alert across your infrastructure, you must configure each node with the same health configuration files.
:::
Next Steps: Jump to Common Tasks for specific workflows or continue reading for comprehensive guidance.
:::tip
What You'll Learn
Step-by-step workflows for the most frequent alert configuration tasks.
:::
Why This Matters: Default thresholds may not fit your specific environment or requirements.
Quick Example:
# Change CPU warning from 85% to 75%
warn: $this > (($status >= $WARNING) ? (60) : (75))
crit: $this > (($status == $CRITICAL) ? (75) : (85))
Step-by-Step:
sudo ./edit-config health.d/cpu.conf10min_cpu_usage)warn and crit linessudo netdatacli reload-health| Method | Use Case | Configuration File | How To |
|---|---|---|---|
| Disable all alerts | Testing/maintenance | netdata.conf | Set enabled = no in [health] section |
| Disable specific alerts | Remove noisy alerts | netdata.conf | Set enabled alarms = !alert_name * |
| Silence notifications | Keep monitoring, stop notifications | Alert config file | Change to: silent |
Real-World Example: Monitor RAM usage above 80%
alarm: ram_usage
on: system.ram
lookup: average -1m percentage of used
units: %
every: 1m
warn: $this > 80
crit: $this > 90
info: RAM usage monitoring
Next Steps: See How-To Guides for detailed explanations.
:::tip
What You'll Learn
Detailed instructions for configuring, managing, and troubleshooting health alerts.
:::
Why This Matters: You don't need to restart your Netdata Agent when making changes, preventing gaps in monitoring.
You don't need to restart your Netdata Agent when making changes to health configuration files, such as specific health entities. Instead, you can use netdatacli with the reload-health option to prevent gaps in metrics collection.
sudo netdatacli reload-health
Alternative Method:
If netdatacli doesn't work on your system, you can send a SIGUSR2 signal to the daemon, which reloads health configuration without restarting the entire process.
sudo killall -USR2 netdata
Configuration Locations:
Configuration Locations:
| Location | Purpose | Common Tasks | How to Edit |
|---|---|---|---|
netdata.conf [health] section | Global health settings | • Disable all monitoring (enabled = no) | |
| • Disable specific alerts | |||
| • Change check frequencies | Edit directly or use edit-config | ||
health.d/*.conf files | Individual alert definitions | • Modify thresholds | |
| • Change notification recipients | |||
• Silence alerts (to: silent) | Use edit-config health.d/filename.conf |
Navigate to your Netdata config directory and use edit-config to make changes to any of these files.
Edit Individual Alerts:
For example, to edit the cpu.conf health configuration file, run:
sudo ./edit-config health.d/cpu.conf
Understanding Alert Structure:
Each health configuration file contains one or more health entities, which always begin with alarm: or template:. Here's the first health entity in health.d/cpu.conf:
template: 10min_cpu_usage
on: system.cpu
class: Utilization
type: System
component: CPU
lookup: average -10m unaligned of user,system,softirq,irq,guest
units: %
every: 1m
warn: $this > (($status >= $WARNING) ? (75) : (85))
crit: $this > (($status == $CRITICAL) ? (85) : (95))
delay: down 15m multiplier 1.5 max 1h
summary: CPU utilization
info: Average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
to: sysadmin
To customize this alert to trigger warning and critical alerts at lower CPU utilization levels, you can change the warn and crit lines to values of your choosing. For example:
warn: $this > (($status >= $WARNING) ? (60) : (75))
crit: $this > (($status == $CRITICAL) ? (75) : (85))
Save the file and reload Netdata's health configuration to apply your changes.
Why This Matters: Different situations require different approaches to managing alerts - permanent removal, temporary silencing, or selective filtering.
You can disable alerts and notifications permanently via configuration changes, or temporarily via the health management API.
Use Case: System maintenance or testing
In the netdata.conf [health] section, set enabled to no, and restart your Agent.
Use Case: Remove known noisy or irrelevant alerts
In the netdata.conf [health] section, use pattern exclusion with enabled alarms = !oom_kill * to load all alerts except oom_kill.
You can also edit the file where the alert is defined, comment out its definition, and reload Netdata's health configuration.
Use Case: Keep monitoring active but stop notifications
You can stop receiving notifications for an individual alert by changing the to: line to silent in the alert's configuration file.
to: silent
:::tip
This action requires that you reload Netdata's health configuration.
:::
Use Case: Scheduled maintenance or dynamic control
| Scenario | Solution | Method |
|---|---|---|
| Disable alerts during backups | Use health management API | API calls without config changes |
| Suppress notifications temporarily | Keep checks running, silence notifications | API control of notification system |
:::tip
You can use the health management API to temporarily control alert behavior without changing configuration or restarting your Agent. The API allows you to:
:::
Why This Matters: While tuning existing alerts may work in some cases, you may need to write entirely new health entities based on how your systems, containers, and applications work.
Prerequisites: Read the Alert Configuration Reference for a complete listing of the format, syntax, and functionality of health entities.
Step-by-Step Process:
Step 1: Create the Configuration File
Navigate to your Netdata config directory, then use touch to create a new file in the health.d/ directory. Use edit-config to start editing the file.
As an example, let's create a ram-usage.conf file:
sudo touch health.d/ram-usage.conf
sudo ./edit-config health.d/ram-usage.conf
Step 2: Write Your Alert
Here's a health entity that triggers a warning alert when your node's RAM usage rises above 80%, and a critical alert above 90%:
alarm: ram_usage
on: system.ram
lookup: average -1m percentage of used
units: %
every: 1m
warn: $this > 80
crit: $this > 90
info: The percentage of RAM being used by the system.
Step 3: Understand Each Component
| Line | Purpose | This Example |
|---|---|---|
alarm | Entity name (alphanumeric, ., _ only) | ram_usage |
on | Chart to monitor | system.ram |
lookup | How to process metrics | Average last 1 minute, percentage of used dimension |
units | Display units | Percentages (%) |
every | Check frequency | Every 1 minute |
warn/crit | Trigger conditions | Warning > 80%, Critical > 90% |
info | Alert description | Appears in dashboard and notifications |
:::note
Understanding This Example
This health entity, named ram_usage, watches the system.ram chart. It looks up the last 1 minute of metrics from the used dimension and calculates the average of all those metrics in a percentage format, using % units. The entity performs this lookup every minute.
If the average RAM usage percentage over the last 1 minute is more than 80%, the entity triggers a warning alert. If the usage is more than 90%, the entity triggers a critical alert.
:::
Step 4: Activate Your Alert
When you finish writing this new health entity, reload Netdata's health configuration to see it live on your local dashboard or Netdata Cloud.
Next Steps: Explore Alert Examples for more complex scenarios, or dive into the Alert Configuration Reference for complete syntax details.
:::tip
What You'll Learn
Complete syntax reference for all alert configuration options. Use this section when you need specific technical details.
:::
| Type | Label | Purpose | Example Use Case |
|---|---|---|---|
| Alerts | alarm: | Attached to specific charts | Monitor specific server's CPU |
| Templates | template: | Apply to all charts of a context | Monitor all network interfaces |
Alerts are attached to specific charts and use the alarm label.
Templates define rules that apply to all charts of a specific context, and use the template label. Templates help you apply one entity to all disks, all network interfaces, all MySQL databases, and so on.
:::note
Precedence
Alarms are processed before templates. If you have alarm and template entities with the same name that both match the same chart, only the alarm will create an active alert for that chart.
For complete details on configuration loading order and precedence rules, see Alert Configuration Ordering.
:::
:::note
Configuration Requirements
alarm or template line must be the first line of any entityon line is always requiredevery line is required if not using lookuplookup, calc, warn, or crit:::
Special Syntax Rules:
* as a wildcard or prefix with ! for a negative match. Order is important! See our simple patterns docs for more examples\ are spliced together with the next line. The backslash is removed, and the following line is joined with the current one. No space is inserted, so you can split a line anywhere, even in the middle of a word. This is handy if your info line consists of several sentences| line | required | functionality |
|---|---|---|
alarm/template | yes | Name of the alert/template |
on | yes | The chart this alert should attach to |
class | no | The general alert classification |
type | no | What area of the system the alert monitors |
component | no | Specific component of the type of the alert |
lookup | yes | The database lookup to find and process metrics for the chart specified through on |
calc | yes (see above) | A calculation to apply to the value found via lookup or another variable |
every | no | The frequency of the alert |
green/red | no | Set the green and red thresholds of a chart |
warn/crit | yes (see above) | Expressions evaluating to true or false, and when true, will trigger the alert |
to | no | A list of roles to send notifications to |
exec | no | The script to execute when the alert changes status |
delay | no | Optional hysteresis settings to prevent floods of notifications |
repeat | no | The interval for sending notifications when an alert is in WARNING or CRITICAL mode |
options | no | Add an option to not clear alerts |
host labels | no | Restrict an alert or template to a list of matching labels present on a host |
chart labels | no | Restrict an alert or template to a list of matching labels present on a chart |
summary | no | A brief description of the alert |
info | no | A longer text field that provides more information about this alert |
alarm or templatePurpose: This line starts an alert or template based on the entity type you want to create.
Alert Syntax:
alarm: NAME
Template Syntax:
template: NAME
Naming Rules:
NAME can be any alphanumeric character. (period) and _ (underscore) symbols allowedchart name, dimension name, family name, or chart variable namesonPurpose: This line defines the chart this alert should attach to.
For Alerts:
on: CHART
The value CHART should be the unique ID or name of the chart you're interested in, as shown on the dashboard. In the image below, the unique ID is system.cpu.
For Templates:
on: CONTEXT
The value CONTEXT should be the context you want this template to attach to.
:::tip
Finding the Context
Need to find the context? Hover over the date on any given chart and look at the tooltip. In the image below, which shows a disk I/O chart, the tooltip reads: proc:/proc/diskstats, disk.io.
You're interested in what comes after the comma: disk.io. That's the name of the chart's context.
If you create a template using the disk.io context, it will apply an alert to every disk available on your system.
:::
classPurpose: This indicates the type of error (or general problem area) that the alert or template applies to.
Example Use: Latency can be used for alerts that trigger on latency issues on network interfaces, web servers, or database systems.
class: Latency
Available Classes:
| Class | Use Case |
|---|---|
| Errors | Error rate monitoring |
| Latency | Response time issues |
| Utilization | Resource usage monitoring |
| Workload | Load and throughput monitoring |
:::note
class will default to Unknown if the line is missing from the alert configuration.
:::
typePurpose: You can use type to indicate the broader area of the system that the alert applies to.
Example: Under the general Database type, you can group together alerts that operate on various database systems, like MySQL, CockroachDB, CouchDB, etc.
type: Database
Available Types:
| Type | Description |
|---|---|
| Ad Filtering | Services related to Ad Filtering (like pi-hole) |
| Certificates | Certificate monitoring related |
| Cgroups | Alerts for CPU and memory usage of control groups |
| Computing | Alerts for shared computing applications (e.g. boinc) |
| Containers | Container related alerts (e.g. docker instances) |
| Database | Database systems (e.g. MySQL, PostgreSQL, etc) |
| Data Sharing | Used to group together alerts for data sharing applications |
| DHCP | Alerts for DHCP related services |
| DNS | Alerts for DNS related services |
| Kubernetes | Alerts for kubernetes nodes monitoring |
| KV Storage | Key-Value pairs services alerts (e.g. memcached) |
| Linux | Services specific to Linux (e.g. systemd) |
| Messaging | Alerts for message passing services (e.g. vernemq) |
| Netdata | Internal Netdata components monitoring |
| Other | When an alert doesn't fit in other types |
| Power Supply | Alerts from power supply related services (e.g. apcupsd) |
| Search engine | Alerts for search services (e.g. elasticsearch) |
| Storage | Class for alerts dealing with storage services (storage devices typically live under System) |
| System | General system alerts (e.g. CPU, network, etc.) |
| Virtual Machine | Virtual Machine software |
| Web Proxy | Web proxy software (e.g. squid) |
| Web Server | Web server software (e.g. Apache, nginx, etc.) |
| Windows | Alerts for monitoring Windows services |
:::note
If an alert configuration is missing the type line, its value will default to Unknown.
:::
componentPurpose: You can use component to narrow down what the previous type value specifies for each alert or template.
Example: Continuing from the previous example, component might include MySQL, CockroachDB, MongoDB, all under the same Database type.
component: MySQL
:::note
As with the class and type lines, if component is missing from the configuration, its value will default to Unknown.
:::
lookupPurpose: This line makes a database lookup to find a value. The result of this lookup is available as $this.
Full Syntax:
lookup: METHOD(GROUPING OPTIONS) AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS]
Required Parameters:
| Parameter | Description | Example |
|---|---|---|
METHOD | Grouping method | average, min, max |
AFTER | How far back to look (negative number) | -1m, -1h, -1d |
Optional Parameters:
| Parameter | Purpose | Details |
|---|---|---|
GROUPING OPTIONS | Conditional processing | CONDITION VALUE where condition is !=, =, ==, <=, <, >, >= |
at BEFORE | End of lookup timeframe | Default is 0 (now) |
every DURATION | Update frequency | Supports s, m, h, d units |
OPTIONS | Processing modifiers | See options table below |
of DIMENSIONS | Which dimensions to include | Space-separated list, supports patterns |
Processing Options:
| Option | Effect |
|---|---|
percentage | Calculate percentage of selected dimensions over total |
absolute | Turn all sample values positive |
min | Return minimum of all dimensions after time-aggregation |
max | Return maximum of all dimensions after time-aggregation |
average | Return average of all dimensions after time-aggregation |
sum | Return sum of all dimensions (default) |
min2max | Return delta between min and max of dimensions |
unaligned | Prevent shifting query window to multiples of duration |
match-ids | Match dimensions by IDs (default) |
match-names | Match dimensions by names |
Example:
lookup: average -10m unaligned of user,system,softirq,irq,guest
This looks back 10 minutes, calculates the average of the specified CPU dimensions, without aligning to time boundaries.
The result of the lookup will be available as $this and $NAME in expressions. The timestamps of the timeframe evaluated by the database lookup are available as variables $after and $before (both are unix timestamps).
calcPurpose: You can design a calc to apply some calculation to the values or variables available to the entity.
Key Points:
$this variablelookuplookup if using other available variablescalc: EXPRESSION
When to Use:
lookup: Perform calculation after database retrievallookup: When using other available variableseveryPurpose: Sets the update frequency of this alert.
every: DURATION
Supported Units:
s for secondsm for minutesh for hoursd for daysExample: every: 30s checks the alert every 30 seconds.
green and redPurpose: Set the green and red thresholds of a chart for visualization.
green: NUMBER
red: NUMBER
Important Notes:
$green and $red in expressionswarn and critPurpose: Define the expressions that trigger warning or critical alerts.
warn: EXPRESSION
crit: EXPRESSION
Key Points:
$this, $green, $redExamples:
warn: $this > 80
crit: $this > 95
toPurpose: Specifies who receives notifications when the alert changes status.
to: ROLE1 ROLE2 ROLE3 ...
How It Works:
exec scriptalarm-notify.sh) treats this as a space-separated list of rolesexecPurpose: Script to execute when the alert status changes.
exec: SCRIPT
Default Behavior:
alarm-notify.shdelayPurpose: Provide optional hysteresis settings to prevent notification floods.
:::important
These settings don't affect the actual alert - only when the exec script is executed.
:::
Full Syntax:
delay: [[[up U] [down D] multiplier M] max X]
Parameters:
| Parameter | Purpose | Default |
|---|---|---|
up U | Delay for status increases (CLEAR→WARNING, WARNING→CRITICAL) | 0 |
down D | Delay for status decreases (CRITICAL→WARNING, WARNING→CLEAR) | 0 |
multiplier M | Multiplies U and D when alert changes state during delay | 1.0 |
max X | Maximum absolute notification delay | max(U×M, D×M) |
Example with Timeline:
delay: up 10s down 15m multiplier 2 max 1h
Starting at 00:00:00 with CLEAR status:
| Time | New Status | Delay Applied | Notification At | Reason |
|---|---|---|---|---|
| 00:00:01 | WARNING | up 10s | 00:00:11 | First state switch |
| 00:00:05 | CLEAR | down 15m x2 | 00:30:05 | Alert changed during delay, so multiplied |
| 00:00:06 | WARNING | up 10s x2 x2 | 00:00:26 | Multiplied twice |
repeatPurpose: Defines the interval between repeating notifications for alerts in CRITICAL or WARNING mode.
repeat: [off] [warning DURATION] [critical DURATION]
Options:
| Option | Effect |
|---|---|
off | Turns off repeating for this alert |
warning DURATION | Repeat interval for WARNING state (use 0s to disable) |
critical DURATION | Repeat interval for CRITICAL state (use 0s to disable) |
Why Use This: Overrides default repeat settings from netdata.conf health configuration.
optionsPurpose: Special alert behavior options.
options: no-clear-notification
Available Options:
no-clear-notification - Prevents clearing the alert notificationWhen to Use no-clear-notification:
Example Use Case: HTTP response time alert comparing recent average to historical average - as time passes, the recent slow responses become part of the historical data, making the alert appear "cleared" even though the underlying issue wasn't resolved.
host labelsPurpose: Restricts alerts to hosts with matching labels.
Prerequisites: See our host labels guide for setup instructions.
Example Configuration:
[host labels]
installed = 20191211
room = server
Usage in Alerts:
host labels: room = server
Pattern Support:
host labels: installed = 201* # Matches all hosts installed in 2010s
How It Works:
chart labelsPurpose: Filters alerts based on chart labels.
How to Find Chart Labels: Check http://localhost:19999/api/v1/charts?all
Example Use Case:
Each disk_space chart has a mount_point label. To exclude external disk alerts:
chart labels: mount_point=!/mnt/disk1 *
Multiple Label Logic:
chart labels: mount_point=/mnt/disk1 device=sda
This requires BOTH conditions to be true (AND logic).
:::important
chart labels require the underlying chart to exist. For example, a disk.space chart is only created when a mount point is present and collected. For example, if a CIFS mount fails to mount after a system reboot, no disk.space chart will exist for that mount point, and the alert will not activate:::
summaryPurpose: Brief title of the alert used in notifications and dashboard.
summary: Available Ram
Variable Support:
| Variable | Replaced With |
|---|---|
${family} | Family instance (e.g., eth0) |
${label:LABEL_NAME} | Chart label value |
Example with Variables:
summary: 1 minute received traffic overflow for ${label:device}
Renders as: 1 minute received traffic overflow for eth0
:::note
Variable names are case-sensitive.
:::
infoPurpose: Detailed description of the alert for notifications and UI elements.
info: Percentage of estimated amount of RAM available for userspace processes, without causing swapping
Variable Support:
| Variable | Replaced With |
|---|---|
${family} | Family instance (e.g., eth0) |
${label:LABEL_NAME} | Chart label value |
Examples with Variables:
Family Variable:
info: average inbound utilization for the network interface ${family} over the last minute
Renders as: average inbound utilization for the network interface eth0 over the last minute
Label Variable:
info: average ratio of HTTP responses with unexpected status over the last 5 minutes for the site ${label:target}
Renders as: average ratio of HTTP responses with unexpected status over the last 5 minutes for the site https://netdata.cloud/
Next Steps: Continue to Expressions and Variables to understand the calculation syntax, or jump to Alert Examples for practical implementations.
:::tip
What You'll Learn
How to write calculations and use variables in your alert definitions. Essential for creating custom logic and accessing chart data.
:::
Why This Matters: Netdata has an internal infix expression parser that allows you to create complex alert logic using mathematical operations, comparisons, and conditional statements.
Supported Operators:
| Type | Operators | Result |
|---|---|---|
| Arithmetic | +, -, *, / | Numeric values |
| Comparison | <, ==, <=, <>, !=, >, >= | 1 (true) or 0 (false) |
| Logical | &&, ` |
Special Functions:
abs() - Absolute value(condition) ? (true_expr) : (false_expr) - Conditional operatorSpecial Values:
| Value | Purpose | Example Use |
|---|---|---|
nan | Not a number (database lookup failed) | $this != nan |
inf | Infinite (division by zero) | $this != inf |
Why This Matters: The conditional operator (? :) can create "sticky" alert thresholds that prevent alert spam when values fluctuate around a threshold. This is called hysteresis.
Basic Pattern:
warn: $this > (($status >= $WARNING) ? (lower_threshold) : (higher_threshold))
Real Example - CPU Usage Alert:
warn: $this > (($status >= $WARNING) ? (75) : (85))
crit: $this > (($status == $CRITICAL) ? (85) : (95))
How This Works:
| Alert State | Triggers At | Clears At | Explanation |
|---|---|---|---|
| Warning | 85% CPU | 75% CPU | Creates 10% buffer - CPU must drop below 75% to clear warning |
| Critical | 95% CPU | 85% CPU | Creates 10% buffer - CPU must drop below 85% to return to warning |
Benefits:
Example Scenario: If CPU usage fluctuates between 80–90%, you'll receive just one initial warning, rather than constant notifications.
How to Find Available Variables:
You can find all variables for a given chart using: http://NODE:19999/api/v1/alarm_variables?chart=CHART_NAME
Example: Variables for the system.cpu chart
:::note
Chart vs Template Variables
Although the alarm_variables link shows variables for a particular chart, the same variables can also be used in templates for charts belonging to a given context. All charts of a given context are essentially identical, with the only difference being the family that identifies a particular hardware or software instance.
:::
What's Available:
$user, $system for CPU chart)Special Chart Variables:
| Variable | Contains |
|---|---|
$last_collected_t | Unix timestamp of last data collection |
$collected_total_raw | Sum of all dimensions (last collected values) |
$update_every | Update frequency of the chart |
$green, $red | Thresholds defined in alerts |
Dimension Value Types:
$dimension_raw - Last collected value$dimension_last_collected_t - Unix timestamp when dimension was last collectedWhat's Available: All dimensions of all charts, including all alerts, in fullname format.
Format: CHART.VARIABLE
CHART can be either chart ID or chart nameExamples:
$system.cpu.user - User CPU from system.cpu chart$disk.sda.reads - Read operations from sda disk chart| Variable | Contains | Usage |
|---|---|---|
$this | Current alert value | Result of calc line or current alert |
$status | Current alert status | Compare with status constants |
$now | Current unix timestamp | Time-based calculations |
Alert Status Constants:
| Constant | Numeric Value | Usage |
|---|---|---|
$REMOVED | -2 | Alert deleted (SIGUSR2 reload) |
$UNINITIALIZED | -1 | Alert not initialized |
$UNDEFINED | 0 | Calculation failed |
$CLEAR | 1 | Alert OK/not triggered |
$WARNING | 2 | Warning condition met |
$CRITICAL | 3 | Critical condition met |
Status Comparison Examples:
# Check if alert is in warning or higher
warn: $status >= $WARNING
# Check if alert is specifically critical
crit: $status == $CRITICAL
:::note
Status values increase with severity, so $status > $CLEAR will match both WARNING and CRITICAL states.
:::
Status Flow: UNINITIALIZED → UNDEFINED/CLEAR → WARNING → CRITICAL
When Status Changes:
REMOVED - Alert deleted during configuration reloadUNINITIALIZED - Alert created but not yet calculatedUNDEFINED - Database lookup failed, division by zero, etc.CLEAR - Alert conditions aren’t met (normal state)WARNING - Warning expression returned true/non-zeroCRITICAL - Critical expression returned true/non-zeroScript Execution: The external script (exec line) is called for ALL status changes.
Next Steps: Ready to see these concepts in action? Continue to Alert Examples for practical implementations.
:::tip
What You'll Learn
Real-world alert configurations that demonstrate different monitoring scenarios. Use these as templates for your own alerts.
:::
<details> <summary><strong>Example 1: Server Alive Check</strong></summary>Scenario: Monitor if the Apache server is collecting data properly.
Why This Matters: Detect when data collection stops, indicating potential server or network issues.
template: apache_last_collected_secs
on: apache.requests
calc: $now - $last_collected_t
every: 10s
warn: $this > ( 5 * $update_every)
crit: $this > (10 * $update_every)
How It Works:
| Component | Purpose | This Example |
|---|---|---|
template | Applies to all Apache servers | apache_last_collected_secs |
on | Chart context to monitor | apache.requests |
calc | Time since last data collection | $now - $last_collected_t |
every | Check frequency | Every 10 seconds |
warn | Warning threshold | 5 missed collection cycles |
crit | Critical threshold | 10 missed collection cycles |
Variables Used:
$now - Current timestamp$last_collected_t - Last data collection timestamp$update_every - Chart update frequency$this - Result of calculation (seconds since last collection)Scenario: Alert when any disk is running low on space.
Why This Matters: Prevent system failures due to full disks.
template: disk_full_percent
on: disk.space
calc: $used * 100 / ($avail + $used)
every: 1m
warn: $this > 80
crit: $this > 95
repeat: warning 120s critical 10s
How It Works:
| Component | Purpose | This Example |
|---|---|---|
template | Applies to all disks | disk_full_percent |
on | Chart context | disk.space |
calc | Calculate usage percentage | $used * 100 / ($avail + $used) |
warn/crit | Simple thresholds | 80% warning, 95% critical |
repeat | Notification frequency | Every 2min (warning), 10sec (critical) |
Variables Used:
$used - Used disk space dimension$avail - Available disk space dimensionScenario: Predict when disks will run out of space based on the current fill rate.
Why This Matters: Get warning before disk space becomes critical.
Step 1: Calculate Disk Fill Rate
template: disk_fill_rate
on: disk.space
lookup: max -1s at -30m unaligned of avail
calc: ($this - $avail) / (30 * 60)
every: 15s
Step 2: Predict Hours Until Full
template: disk_full_after_hours
on: disk.space
calc: $avail / $disk_fill_rate / 3600
every: 10s
warn: $this > 0 and $this < 48
crit: $this > 0 and $this < 24
How It Works:
| Step | Purpose | Calculation |
|---|---|---|
| 1 | Calculate fill rate | (space_30min_ago - current_space) / 1800_seconds |
| 2 | Predict time to full | current_available / fill_rate / 3600 |
Logic:
Scenario: Alert on any network packet drops.
Why This Matters: Packet drops indicate network issues that could affect performance.
template: 30min_packet_drops
on: net.drops
lookup: sum -30m unaligned absolute
every: 10s
crit: $this > 0
How It Works:
| Component | Purpose | This Example |
|---|---|---|
template | Applies to all network interfaces | 30min_packet_drops |
lookup | Sum drops over 30 minutes | sum -30m unaligned absolute |
crit | Any drops trigger critical | $this > 0 |
Key Points:
Scenario: Detect CPU usage anomalies using statistical analysis.
Why This Matters: Identify unusual patterns that fixed thresholds might miss.
alarm: cpu_user_mean
on: system.cpu
lookup: mean -60s of user
every: 10s
alarm: cpu_user_stddev
on: system.cpu
lookup: stddev -60s of user
every: 10s
alarm: cpu_user_zscore
on: system.cpu
lookup: mean -10s of user
calc: ($this - $cpu_user_mean) / $cpu_user_stddev
every: 10s
warn: $this < -2 or $this > 2
crit: $this < -3 or $this > 3
How It Works:
| Alert | Purpose | Calculation |
|---|---|---|
cpu_user_mean | Calculate average CPU usage | Mean over 60 seconds |
cpu_user_stddev | Calculate variability | Standard deviation over 60 seconds |
cpu_user_zscore | Detect anomalies | (current - mean) / stddev |
Z-Score Interpretation:
Scenario: Use Netdata's built-in ML for chart-level anomaly detection.
Why This Matters: Detect complex patterns across multiple metrics without manual threshold tuning.
template: ml_5min_cpu_chart
on: system.cpu
lookup: average -5m anomaly-bit of *
calc: $this
units: %
every: 30s
warn: $this > (($status >= $WARNING) ? (5) : (20))
crit: $this > (($status == $CRITICAL) ? (20) : (100))
info: rolling 5min anomaly rate for system.cpu chart
How It Works:
| Component | Purpose | This Example |
|---|---|---|
lookup | Average anomaly rate across CPU dimensions | 5-minute rolling window |
| Hysteresis | Prevent alert flapping | Warning: 20%→5%, Critical: 100%→20% |
anomaly-bit | ML-generated anomaly indicators | 0 (normal) or 1 (anomalous) |
Scenario: Monitor overall system health using ML across all metrics.
Why This Matters: Get a holistic view of system anomalies beyond individual charts.
template: ml_5min_node
on: anomaly_detection.anomaly_rate
lookup: average -5m of anomaly_rate
calc: $this
units: %
every: 30s
warn: $this > (($status >= $WARNING) ? (5) : (20))
crit: $this > (($status == $CRITICAL) ? (20) : (100))
info: rolling 5min anomaly rate for all ML enabled dims
Key Differences from Chart-Level:
anomaly_detection.anomaly_rate chartanomaly_rate dimensionNext Steps: Having trouble with your alerts? Continue to Troubleshooting for debugging techniques.
:::tip
What You'll Learn
How to debug alert issues, understand why alerts aren't working, and get detailed information about alert processing.
:::
Finding Chart Names: You can find chart information in two places:
| Method | URL | Contains |
|---|---|---|
| Configuration | http://NODE:19999/netdata.conf | All chart details |
| API | http://NODE:19999/api/v1/charts | JSON chart data |
Replace NODE with your server's IP address or hostname.
Why This Matters: Understand how Netdata interprets your expressions and what values are being calculated.
Check Alert Processing:
Visit http://NODE:19999/api/v1/alarms?all to see:
Expression Evaluation Flow: Netdata adds parentheses to show how it evaluates your expressions:
Your Expression:
warn: $this > 80 and $status >= $WARNING
Netdata's Interpretation:
warn: (($this > 80) and ($status >= $WARNING))
:::note
How to Use These Decision Trees
Follow the flowcharts below to systematically diagnose and resolve alert issues. Each path leads to specific solutions with step-by-step instructions.
:::
flowchart TD
A("Alert Not Working") --> B("Check Configuration")
A --> C("Check Data Source")
A --> D("Check Thresholds")
B --> B1("Syntax errors
Reload config
Review logs")
C --> C1("Chart exists
Data collection
Variable names")
D --> D1("Adjust values
Review logic
Test expressions")
%% Style definitions
classDef alert fill:#ffeb3b,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef neutral fill:#f9f9f9,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef complete fill:#4caf50,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef database fill:#2196F3,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
%% Apply styles
class A alert
class B,C,D database
class B1,C1,D1 complete
flowchart TD
A("Alert Always Triggering") --> B("Check Values")
A --> C("Check Logic")
A --> D("Check Variables")
B --> B1("Current metrics
Threshold comparison
Units verification")
C --> C1("Operator direction
Expression syntax
Status conditions")
D --> D1("Variable names
Dimension availability
API verification")
%% Style definitions
classDef alert fill:#ffeb3b,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef neutral fill:#f9f9f9,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef complete fill:#4caf50,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef database fill:#2196F3,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
%% Apply styles
class A alert
class B,C,D database
class B1,C1,D1 complete
flowchart TD
A("Alert Flapping") --> B("Add Hysteresis")
A --> C("Smooth Data")
A --> D("Adjust Timing")
B --> B1("Conditional thresholds
Different up/down values
Status-based logic")
C --> C1("Increase lookup time
Use averages
Reduce noise")
D --> D1("Increase intervals
Add delays
Reduce frequency")
%% Style definitions
classDef alert fill:#ffeb3b,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef neutral fill:#f9f9f9,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef complete fill:#4caf50,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
classDef database fill:#2196F3,stroke:#000000,stroke-width:3px,color:#000000,font-size:18px
%% Apply styles
class A alert
class B,C,D database
class B1,C1,D1 complete
Possible Causes:
| Problem | Check This | Solution |
|---|---|---|
| Wrong chart name | on: line matches actual chart | Use chart ID from dashboard |
| Incorrect dimensions | Dimension names in lookup | Check available dimensions |
| Missing data | Chart has recent data | Verify data collection |
| Expression errors | Variables resolve correctly | Use /api/v1/alarm_variables |
Possible Causes:
| Problem | Check This | Solution |
|---|---|---|
| Wrong threshold direction | > vs < in expressions | Review logic |
| Units mismatch | Comparing percentages to absolute values | Check calculation units |
| Variable name errors | $this vs $chart.dimension | Verify variable names |
Possible Causes:
| Problem | Solution |
|---|---|
| Values near threshold | Implement hysteresis |
| Noisy data | Increase lookup time window |
| Too frequent checks | Increase every: interval |
Hysteresis Example:
# Instead of simple threshold
warn: $this > 80
# Use hysteresis
warn: $this > (($status >= $WARNING) ? (75) : (80))
Debug Steps:
Check Available Variables:
http://NODE:19999/api/v1/alarm_variables?chart=CHART_NAME
Verify Chart Context:
alarm: use chart nametemplate: use chart contextCheck Variable Syntax:
$dimension_name$chart_name.dimension_name$this, $now, $statusSafe Testing Process:
Create Test File:
sudo touch health.d/test-alert.conf
sudo ./edit-config health.d/test-alert.conf
Write Simple Alert:
alarm: test_ram
on: system.ram
lookup: average -1m percentage of used
every: 10s
war: $this > 50 # Low threshold for testing
info: Test alert - safe to ignore
Reload and Monitor:
sudo netdatacli reload-health
# Watch dashboard for test alert appearance
Remove When Done:
sudo rm health.d/test-alert.conf
sudo netdatacli reload-health
Alert Impact on System:
| Factor | Impact | Optimization |
|---|---|---|
Check frequency (every:) | CPU usage | Use appropriate intervals |
| Lookup timeframe | Memory/CPU | Don't use excessively long periods |
| Number of alerts | Overall performance | Disable unused alerts |
| Complex expressions | CPU per check | Simplify where possible |
Recommended Frequencies:
| Alert Type | Suggested Frequency | Reason |
|---|---|---|
| Critical system metrics | 10-30s | Quick response needed |
| Resource usage | 1-5m | Trends matter more than instant values |
| Predictive alerts | 15m-1h | Based on longer-term patterns |
Information to Provide:
When seeking help, include:
Alert Configuration:
# Your complete alert definition
Chart Information:
http://your-server:19999/api/v1/alarm_variables?chart=chart_name
Current Status:
http://your-server:19999/api/v1/alarms?all
Community Resources:
Next Steps: You now have comprehensive knowledge of Netdata health configuration. Start with the Quick Start Guide for immediate needs or dive into Common Tasks for specific workflows.