docs/content/v2024.1/yugabyte-platform/alerts-monitoring/alert-policy-templates.md
Alert policies use the following templates to define how an alert is triggered. Each template pairs an alert message with the Prometheus expression that triggers it.
Last attempt to send alert notifications to channel '{{ $labels.source_name }}' has failed. You need to try sending a test alert to obtain details.
last_over_time(ybp_alert_manager_channel_status{customer_uuid="$uuid"}[1d]) < 1
Last attempt to send alert notifications for customer 'customer name' failed. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_alert_manager_status{customer_uuid="$uuid"}[1d]) < 1
Last alert query for customer 'customer name' failed. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_alert_query_status[1d]) < 1
Last alert rules synchronization for customer 'customer name' has failed. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_alert_config_writer_status[1d]) < 1
Failed to delete $value backups for customer 'customer name' in the last GC run. Check logs for more details.
last_over_time(ybp_delete_backup_failure{customer_uuid = "__customerUuid__"}[1d]) {{ query_condition }} 0
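Several expressions on this page contain placeholders such as `__customerUuid__`, `{{ query_condition }}`, and `{{ query_threshold }}`, which are replaced with concrete values when a policy is configured. The following Python sketch shows one way such a substitution could work. The `render_template` helper and the sample UUID are hypothetical and are not part of YugabyteDB Anywhere.

```python
def render_template(template: str, values: dict[str, str]) -> str:
    """Replace each placeholder in the template with its configured value."""
    rendered = template
    for placeholder, value in values.items():
        rendered = rendered.replace(placeholder, value)
    return rendered


# Template copied from the backup deletion alert above.
TEMPLATE = (
    'last_over_time(ybp_delete_backup_failure'
    '{customer_uuid = "__customerUuid__"}[1d]) {{ query_condition }} 0'
)

# Hypothetical values chosen only for illustration.
expression = render_template(
    TEMPLATE,
    {
        "__customerUuid__": "11111111-2222-3333-4444-555555555555",
        "{{ query_condition }}": ">",
    },
)
print(expression)
```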
Last backup task for universe '$universe_name' failed. You need to check the backup task result for details.
last_over_time(ybp_create_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Last attempt to run a scheduled backup for universe '$universe_name' failed due to other backup or universe operation in progress.
last_over_time(ybp_schedule_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Last snapshot task for universe '$universe_name' failed. To retry, check PITR configuration task result for more details.
min(ybp_pitr_config_status{universe_uuid = "__universeUuid__"}) {{ query_condition }} 1
Database compaction rejections detected for universe '$universe_name'.
sum by (node_prefix) (increase(majority_sst_files_rejections{node_prefix="$node_prefix"}[10m])) > 0
Core files detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_tserver_core_files{universe_uuid="$uuid"} > 0
TServer detected $value drive failure for universe '$universe_name'.
count by (universe_uuid) (drive_fault{universe_uuid="__universeUuid__", export_type="tserver_export"}) {{ query_condition }} {{ query_threshold }}
Error logs detected for universe '$universe_name' on $value Master/TServer instance(s).
sum by (universe_uuid) ((ybp_health_check_node_master_error_logs{universe_uuid="__universeUuid__"} < bool 1) * ignoring (saved_name) (ybp_health_check_node_master_fatal_logs{universe_uuid="__universeUuid__"} == bool 1)) + sum by (universe_uuid) ((ybp_health_check_node_tserver_error_logs{universe_uuid="__universeUuid__"} < bool 1) * ignoring (saved_name) (ybp_health_check_node_tserver_fatal_logs{universe_uuid="__universeUuid__"} == bool 1)) {{ query_condition }} {{ query_threshold }}
Fatal logs have been detected for universe '$universe_name' on $value Master or T-Server instances.
sum by (universe_uuid) (ybp_health_check_node_master_fatal_logs{universe_uuid="$uuid"} < bool 1) + sum by (universe_uuid) (ybp_health_check_node_tserver_fatal_logs{universe_uuid="$uuid"} < bool 1) > 0
$value database Master or T-Server instances are down for more than 15 minutes for universe '$universe_name'.
count by (node_prefix) (
label_replace(
max_over_time(up{export_type=~"master_export|tserver_export",node_prefix="$node_prefix"}[15m]),
"exported_instance",
"$1",
"instance",
"(.*)"
)
<
1
and on (node_prefix, export_type, exported_instance)
(min_over_time(ybp_universe_node_function{node_prefix="$node_prefix"}[15m]) == 1)
)
>
0
Universe '$universe_name' Master or T-Server has restarted $value times during the last 30 minutes.
max by (node_prefix) (
changes(yb_node_boot_time{node_prefix="$node_prefix"}[30m])
and on (node_prefix)
(max_over_time(ybp_universe_update_in_progress{node_prefix="$node_prefix"}[31m]) == 0)
)
>
0
Database queues overflow has been detected for universe '$universe_name'.
sum by (node_prefix) (increase(rpcs_queue_overflow{node_prefix="$node_prefix"}[10m]))
+
sum by (node_prefix) (increase(rpcs_timed_out_in_queue{node_prefix="$node_prefix"}[10m]))
>
1
Database memory rejections have been detected for universe '$universe_name'.
sum by (node_prefix) (increase(leader_memory_pressure_rejections{node_prefix="$node_prefix"}[10m]))
+
sum by (node_prefix) (
increase(follower_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])
)
+
sum by (node_prefix) (
increase(operation_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])
)
>
0
Version mismatch has been detected for universe '$universe_name' for $value Master or T-Server instances.
ybp_health_check_tserver_version_mismatch{universe_uuid="$uuid"}
+
ybp_health_check_master_version_mismatch{universe_uuid="$uuid"}
>
0
Test YSQL write/read operation failed on $value nodes for universe '$universe_name'.
count by (node_prefix) (yb_node_ysql_write_read{node_prefix="$node_prefix"} < 1)
DocDB cache miss percentage is high for universe '$universe_name'. The current value is $value %.
avg by (universe_uuid) (
sum by (exported_instance, universe_uuid) (
rate(rocksdb_block_cache_miss{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m])
)
/
(
sum by (exported_instance, universe_uuid) (
rate(rocksdb_block_cache_miss{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m])
)
+
sum by (exported_instance, universe_uuid) (
rate(rocksdb_block_cache_hit{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m])
)
)
)
*
100
{{ query_condition }} {{ query_threshold }}
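The cache miss expression above computes, for each T-Server instance, the ratio miss / (miss + hit), averages that ratio across instances, and multiplies by 100. A minimal Python sketch of the same arithmetic follows. The instance names and rates are hypothetical sample values.

```python
def cache_miss_percentage(per_instance: dict[str, tuple[float, float]]) -> float:
    """Average, across instances, of miss / (miss + hit), scaled to percent."""
    ratios = [miss / (miss + hit) for miss, hit in per_instance.values()]
    return sum(ratios) / len(ratios) * 100


# Hypothetical per-instance (miss rate, hit rate) pairs, per second.
rates = {"tserver-1": (10.0, 90.0), "tserver-2": (30.0, 70.0)}
miss_pct = cache_miss_percentage(rates)
print(f"{miss_pct:.1f}%")
```

Because the ratio is taken per instance before averaging, a single instance with a cold cache raises the result even when its request volume is small.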
Average node CPU usage for universe '$universe_name' is more than 90% on $value nodes.
count by (node_prefix) (
(
100
-
(
avg by (node_prefix, instance) (
avg_over_time(
irate(node_cpu_seconds_total{job="node",mode="idle",node_prefix="$node_prefix"}[1m])[30m:]
)
)
*
100
)
)
>
90
)
$value database nodes are down for more than 15 minutes for universe '$universe_name'.
count by (node_prefix) (
max_over_time(up{export_type="node_export",node_prefix="$node_prefix"}[15m]) < 1
)
>
0
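The condition `max_over_time(up{...}[15m]) < 1` holds only for a node whose `up` metric was never 1 during the 15-minute window, so a node that was reachable even briefly is not counted. A minimal Python sketch of that logic, using hypothetical node names and sample values:

```python
def count_down_nodes(up_samples: dict[str, list[int]]) -> int:
    """Count nodes whose maximum `up` sample over the window is below 1."""
    return sum(1 for samples in up_samples.values() if max(samples) < 1)


# Hypothetical scrape results for one 15-minute window (1 = up, 0 = down).
window = {
    "node-1": [1, 1, 1, 1],  # healthy
    "node-2": [0, 0, 1, 0],  # briefly reachable, so not counted
    "node-3": [0, 0, 0, 0],  # down for the whole window
}
down = count_down_nodes(window)
print(down)
```

The alert fires when the resulting count is greater than 0.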
Node data disk usage for universe '$universe_name' is above $threshold% on $value node(s).
count by (universe_uuid) (count by (universe_uuid, node_name) (100 - (sum without (saved_name) (node_filesystem_free_bytes{mountpoint=~"__mountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) / sum without (saved_name) (node_filesystem_size_bytes{mountpoint=~"__mountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) * 100) {{ query_condition }} {{ query_threshold }}))
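The data disk expression above derives a used percentage as 100 - free / size * 100 and then counts the nodes for which the comparison holds. The Python sketch below reproduces that arithmetic under two assumptions: `{{ query_condition }}` is `>`, and the free and total byte counts are already known per node. The node names and sizes are hypothetical.

```python
def nodes_over_threshold(
    disks: dict[str, tuple[float, float]], threshold: float
) -> int:
    """Count nodes where 100 - free / size * 100 exceeds the threshold."""
    count = 0
    for free_bytes, size_bytes in disks.values():
        used_pct = 100 - free_bytes / size_bytes * 100
        if used_pct > threshold:
            count += 1
    return count


GIB = 1024**3
# Hypothetical (free bytes, total bytes) per node.
disks = {
    "node-1": (20 * GIB, 100 * GIB),  # 80% used
    "node-2": (50 * GIB, 100 * GIB),  # 50% used
}
over = nodes_over_threshold(disks, threshold=70)
print(over)
```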
Node file descriptors usage for universe '$universe_name' is above 70% on $value nodes.
count by (universe_uuid) (ybp_health_check_used_fd_pct{universe_uuid="$uuid"} > 70)
More than one out of memory (OOM) kill has been detected for universe '$universe_name' on $value nodes.
count by (node_prefix) (yb_node_oom_kills_10min{node_prefix="$node_prefix"} > 1) > 0
Universe '$universe_name' database node has restarted $value times during the last 30 minutes.
max by (node_prefix) (changes(node_boot_time{node_prefix="$node_prefix"}[30m])) > 0
Node system disk usage for universe '$universe_name' is above $threshold% on $value node(s).
count by (universe_uuid) (count by (universe_uuid, node_name) (100 - (sum without (saved_name) (node_filesystem_free_bytes{mountpoint=~"__systemMountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) / sum without (saved_name) (node_filesystem_size_bytes{mountpoint=~"__systemMountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) * 100) {{ query_condition }} {{ query_threshold }}))
The tablet leader is missing for more than 5 minutes for $value tablets in universe '$universe_name'.
max by (node_prefix) (
count by (node_prefix, exported_instance) (
max_over_time(yb_node_leaderless_tablet{node_prefix="$node_prefix"}[5m])
)
>
0
)
Master leader is missing for universe '$universe_name'.
max by (node_prefix) (yb_node_is_master_leader{node_prefix="$node_prefix"}) < 1
Master is missing from Raft group or has follower lag higher than $threshold seconds for universe '$universe_name'.
(min_over_time((ybp_universe_replication_factor{universe_uuid='{{ $labels.universe_uuid }}'} - on(universe_uuid) count by(universe_uuid) (count by (universe_uuid, exported_instance) (follower_lag_ms{export_type="master_export", universe_uuid='{{ $labels.universe_uuid }}'})))[{{ query_threshold }}s:]) > 0 or (max by(universe_uuid) (follower_lag_ms{export_type="master_export", universe_uuid='{{ $labels.universe_uuid }}'}) {{ query_condition }} ({{ query_threshold }} * 1000)))
$value tablets remain under-replicated for more than 5 minutes in universe '$universe_name'.
max by (node_prefix) (
count by (node_prefix, exported_instance) (
max_over_time(yb_node_underreplicated_tablet{node_prefix="$node_prefix"}[5m])
)
>
0
)
Average read latency of tablet server for universe '$universe_name' is above $threshold ms. The current value is $value milliseconds.
(
avg by (universe_uuid) (
rate(
rpc_latency_sum{export_type="tserver_export",server_type="yb_tserver",service_method="Read",service_type="TabletServerService",universe_uuid="__universeUuid__"}[5m]
)
)
)
/
(
avg by (universe_uuid) (
rate(
rpc_latency_count{export_type="tserver_export",server_type="yb_tserver",service_method="Read",service_type="TabletServerService",universe_uuid="__universeUuid__"}[5m]
)
)
*
1000
)
{{ query_condition }} {{ query_threshold }}
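The read latency expression above divides the per-second rate of `rpc_latency_sum` by the per-second rate of `rpc_latency_count` multiplied by 1000. The Python sketch below works through that arithmetic. It assumes the latency sum is recorded in microseconds, which is inferred from the division by 1000 and the millisecond wording of the alert message rather than stated on this page. The sample rates are hypothetical.

```python
def average_latency_ms(latency_sum_rate: float, call_count_rate: float) -> float:
    """Divide accumulated latency per second by (calls per second * 1000)."""
    return latency_sum_rate / (call_count_rate * 1000)


# Hypothetical rates: 2,500,000 microseconds of latency accumulated per second
# across 500 Read calls per second.
latency_ms = average_latency_ms(2_500_000, 500)
print(latency_ms)
```

The write latency template uses the same formula with `service_method="Write"`.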
Average write latency of tablet server for universe '$universe_name' is above $threshold ms. The current value is $value milliseconds.
(
avg by (universe_uuid) (
rate(
rpc_latency_sum{export_type="tserver_export",server_type="yb_tserver",service_method="Write",service_type="TabletServerService",universe_uuid="__universeUuid__"}[5m]
)
)
)
/
(
avg by (universe_uuid) (
rate(
rpc_latency_count{export_type="tserver_export",server_type="yb_tserver",service_method="Write",service_type="TabletServerService",universe_uuid="__universeUuid__"}[5m]
)
)
*
1000
)
{{ query_condition }} {{ query_threshold }}
Maximum clock skew for universe '$universe_name' is more than 500 milliseconds. The current value is $value milliseconds.
max by (node_prefix) (max_over_time(hybrid_clock_skew{node_prefix="$node_prefix"}[10m])) / 1000
>
500
Failed to perform health check for universe '$universe_name'. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_health_check_status{universe_uuid="$uuid"}[1d]) < 1
Failed to issue health check notification for universe '$universe_name'. You need to check Health notification settings and YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_health_check_notification_status{universe_uuid="$uuid"}[1d]) < 1
$value nodes have inactive cronjob for universe '$universe_name'.
ybp_universe_inactive_cron_nodes{universe_uuid = "$uuid"} > 0
Average memory usage for universe '$universe_name' nodes is above $threshold%. Maximum value is $value.
max by (universe_uuid) ((avg_over_time(node_memory_MemTotal_bytes{universe_uuid="__universeUuid__"}[10m])
- ignoring (saved_name) (avg_over_time(node_memory_Buffers_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_Cached_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_MemFree_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_Slab_bytes{universe_uuid="__universeUuid__"}[10m])))
/ ignoring (saved_name) (avg_over_time(node_memory_MemTotal_bytes{universe_uuid="__universeUuid__"}[10m])))
* 100 {{ query_condition }} {{ query_threshold }}
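The memory expression above treats used memory as total memory minus buffers, page cache, free memory, and slab, and expresses it as a percentage of total memory. A minimal Python sketch of that calculation for a single node, with hypothetical byte counts:

```python
def memory_used_pct(
    total: int, buffers: int, cached: int, free: int, slab: int
) -> float:
    """(total - buffers - cached - free - slab) / total, as a percentage."""
    return (total - buffers - cached - free - slab) / total * 100


GIB = 1024**3
# Hypothetical node: 16 GiB total, 1 GiB buffers, 3 GiB cached,
# 4 GiB free, and no slab usage.
used_pct = memory_used_pct(16 * GIB, 1 * GIB, 3 * GIB, 4 * GIB, 0)
print(used_pct)
```

The template then takes the maximum of this value across the nodes of the universe before comparing it with the threshold.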
Failed to collect metrics for universe '$universe_name'. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_universe_metric_collection_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
Average replication lag for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
max by (universe_uuid) (avg_over_time(async_replication_committed_lag_micros{universe_uuid="__universeUuid__"}[10m]) or avg_over_time(async_replication_sent_lag_micros{universe_uuid="__universeUuid__"}[10m])) / 1000 {{ query_condition }} {{ query_threshold }}
More recent OS version is recommended for this universe. Consider running VM image upgrade for the nodes to incorporate security patches and address vulnerabilities.
ybp_universe_os_update_required{universe_uuid="__universeUuid__"} {{ query_condition }} {{ query_threshold }}
Increase in remote bootstraps detected for universe '$universe_name'.
sum by (universe_uuid) (
increase(
rpc_latency_count{export_type="tserver_export",server_type="yb_consensus",service_method="StartRemoteBootstrap",service_type="ConsensusService",universe_uuid="__universeUuid__"}[5m]
)
)
{{ query_condition }} {{ query_threshold }}
Reactor delay for universe '$universe_name' is above $threshold ms. The current value is $value milliseconds.
max by (universe_uuid) (
avg by (universe_uuid, saved_name) (
label_replace(
rate(rpc_incoming_queue_time_sum{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m]),
"saved_name",
"rpc_incoming_queue_time_count",
"saved_name",
"(.*)"
)
)
/
(
avg by (universe_uuid, saved_name) (
rate(
rpc_incoming_queue_time_count{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m]
)
)
*
1000
)
{{ query_condition }} {{ query_threshold }}
or
(
avg by (universe_uuid, saved_name) (
label_replace(
rate(
handler_latency_outbound_call_queue_time_sum{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m]
),
"saved_name",
"handler_latency_outbound_call_queue_time_count",
"saved_name",
"(.*)"
)
)
)
/
(
avg by (universe_uuid, saved_name) (
rate(
handler_latency_outbound_call_queue_time_count{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m]
)
)
*
1000
)
{{ query_condition }} {{ query_threshold }}
)
RPC queue size is high for universe '$universe_name'.
max by (universe_uuid) (
min_over_time(
{export_type="tserver_export",saved_name=~"rpcs_in_queue_.*",universe_uuid="__universeUuid__"}[5m]
)
{{ query_condition }} {{ query_threshold }}
)
WAL cache size is high for nodes '$node_name' in universe '$universe_name'. The current value is $value MB for one of the nodes.
max by (universe_uuid) (
(
sum by (universe_uuid, node_name) (
log_cache_size{export_type="tserver_export",universe_uuid="__universeUuid__"}
)
)
/
1024
)
{{ query_condition }} {{ query_threshold }}
Client to node certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_c2n_cert_validity_days{universe_uuid="$uuid"} < 30)
Client to node CA certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_c2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
Encryption at rest configuration for universe '$universe_name' expires in $value days.
ybp_universe_encryption_key_expiry_days{universe_uuid="$uuid"} < 3
Node to node certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_n2n_cert_validity_days{universe_uuid="$uuid"} < 30)
Node to node CA certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_n2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
Invalid permissions of private access key file for universe '$universe_name'. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_universe_private_access_key_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
SSH key for universe '$universe_name' will expire in $value days.
ybp_universe_ssh_key_expiry_day{universe_uuid="__universeUuid__"} {{ query_condition }} {{ query_threshold }}
Last SSH key rotation task for universe '$universe_name' failed. To retry, check SSH key rotation task result.
last_over_time(ybp_ssh_key_rotation_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
YSQLSH connection failure detected for universe '$universe_name' on $value TServer instance(s).
count by (universe_uuid) (yb_node_ysql_connect{universe_uuid="__universeUuid__"} < 1) {{ query_condition }} {{ query_threshold }}
New YSQL tables are added to the source universe '$universe_name' in the database with an existing xCluster configuration, but not added to the xCluster replication.
((count by (namespace_name, universe_uuid)(count by(namespace_name, table_id, universe_uuid)(rocksdb_current_version_sst_files_size{universe_uuid="__universeUuid__",table_type="PGSQL_TABLE_TYPE"}))) - count by(namespace_name, universe_uuid)(count by(namespace_name, universe_uuid, table_id)(async_replication_sent_lag_micros{universe_uuid="__universeUuid__",table_type="PGSQL_TABLE_TYPE"}))) {{ query_condition }} {{ query_threshold }}
Number of YSQL connections for universe '$universe_name' is above $threshold. Current value is $value.
max by (universe_uuid) (max_over_time(yb_node_ysql_connections_count{universe_uuid="__universeUuid__"}[5m])) {{ query_condition }} {{ query_threshold }}
Average YSQL operations latency for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
(sum by (universe_uuid, service_method)(rate(rpc_latency_sum{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver",service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m])) / sum by (universe_uuid, service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m]))) {{ query_condition }} {{ query_threshold }}
YSQL P99 latency for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
max by (universe_uuid) (rpc_latency{universe_uuid="__universeUuid__",server_type="yb_ysqlserver",service_type="SQLProcessor", service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|OtherStmts|Transactions",quantile="p99"}) {{ query_condition }} {{ query_threshold }}
Maximum throughput for YSQL operations for universe '$universe_name' is above $threshold. Current value is $value.
sum by (service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m])) {{ query_condition }} {{ query_threshold }}
CQLSH connection failure has been detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_cqlsh_connectivity_error{universe_uuid="$uuid"} > 0
Number of YCQL connections for universe '$universe_name' is above $threshold. Current value is $value.
max by (universe_uuid) (max_over_time(rpc_connections_alive{universe_uuid="__universeUuid__",export_type="cql_export"}[5m])) {{ query_condition }} {{ query_threshold }}
Average YCQL operations latency for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
(sum by (service_method)(rate(rpc_latency_sum{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m])) / sum by (service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m]))) {{ query_condition }} {{ query_threshold }}
YCQL P99 latency for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
max by (universe_uuid)(rpc_latency{universe_uuid="__universeUuid__",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|OtherStmts|Transaction",quantile="p99"}) {{ query_condition }} {{ query_threshold }}
Maximum throughput for YCQL operations for universe '$universe_name' is above $threshold. Current value is $value.
sum by (universe_uuid, service_method) (rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m])) {{ query_condition }} {{ query_threshold }}