docs/content/v2.20/yugabyte-platform/alerts-monitoring/alert-policy-templates.md
Alert policies use the following templates to define when an alert is triggered. Each template is built on a Prometheus expression.
Last attempt to send alert notifications to channel '{{ $labels.source_name }}' has failed. You need to try sending a test alert to obtain details.
last_over_time(ybp_alert_manager_channel_status{customer_uuid = "$uuid"}[1d]) < 1
Last attempt to send alert notifications for customer 'customer name' failed. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_alert_manager_status{customer_uuid = "$uuid"}[1d]) < 1
Last alert query for customer 'customer name' failed. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_alert_query_status[1d]) < 1
Last alert rules synchronization for customer 'customer name' has failed. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_alert_config_writer_status[1d]) < 1
Failed to delete $value backups for customer 'customer name' in last GC run. Check logs for more details.
last_over_time(ybp_delete_backup_failure{customer_uuid = "__customerUuid__"}[1d]) {{ query_condition }} 0
Last backup task for universe '$universe_name' failed. You need to check the backup task result for details.
last_over_time(ybp_create_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Last attempt to run a scheduled backup for universe '$universe_name' failed due to other backup or universe operation in progress.
last_over_time(ybp_schedule_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Last snapshot task for universe '$universe_name' failed. To retry, check PITR configuration task result for more details.
min(ybp_pitr_config_status{universe_uuid = "__universeUuid__"}) {{ query_condition }} 1
Database compaction rejections detected for universe '$universe_name'.
sum by (node_prefix) (increase(majority_sst_files_rejections{node_prefix="$node_prefix"}[10m])) > 0
Core files detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_tserver_core_files{universe_uuid="$uuid"} > 0
TServer detected $value drive failure for universe '$universe_name'.
count by (universe_uuid) (drive_fault{universe_uuid="__universeUuid__", export_type="tserver_export"}) {{ query_condition }} {{ query_threshold }}
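Several expressions on this page contain the placeholders `__universeUuid__`, `{{ query_condition }}`, and `{{ query_threshold }}`. The following Python sketch shows one way such placeholders could be filled in to produce a concrete Prometheus query. The helper name `render_expression`, the plain string replacement, and the example UUID are assumptions for illustration; this is not a description of how YugabyteDB Anywhere performs the substitution internally.

```python
def render_expression(template: str, universe_uuid: str,
                      condition: str, threshold: float) -> str:
    """Fill the template placeholders with concrete values (hypothetical helper)."""
    replacements = {
        "__universeUuid__": universe_uuid,
        "{{ query_condition }}": condition,
        "{{ query_threshold }}": f"{threshold:g}",
    }
    rendered = template
    for placeholder, value in replacements.items():
        rendered = rendered.replace(placeholder, value)
    return rendered


template = (
    'count by (universe_uuid) (drive_fault{universe_uuid="__universeUuid__", '
    'export_type="tserver_export"}) {{ query_condition }} {{ query_threshold }}'
)

# The UUID below is a made-up example value.
query = render_expression(template, "00000000-0000-0000-0000-000000000000", ">", 0)
print(query)
```

With a condition of `>` and a threshold of `0`, the rendered query matches as soon as at least one T-Server reports a drive fault.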
Error logs detected for universe '$universe_name' on $value Master/TServer instance(s).
sum by (universe_uuid) ((ybp_health_check_node_master_error_logs{universe_uuid="__universeUuid__"} < bool 1) * ignoring (saved_name) (ybp_health_check_node_master_fatal_logs{universe_uuid="__universeUuid__"} == bool 1)) + sum by (universe_uuid) ((ybp_health_check_node_tserver_error_logs{universe_uuid="__universeUuid__"} < bool 1) * ignoring (saved_name) (ybp_health_check_node_tserver_fatal_logs{universe_uuid="__universeUuid__"} == bool 1)) {{ query_condition }} {{ query_threshold }}
Fatal logs have been detected for universe '$universe_name' on $value Master or T-Server instances.
sum by (universe_uuid) (ybp_health_check_node_master_fatal_logs{universe_uuid="$uuid"} < bool 1) + sum by (universe_uuid) (ybp_health_check_node_tserver_fatal_logs{universe_uuid="$uuid"} < bool 1) > 0
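The `< bool 1` comparison in the expression above turns each sample into 1 when it is below 1 and into 0 otherwise, so the two sums count the affected Master and T-Server instances. The Python sketch below mimics that counting. It assumes, for illustration only, that a health check metric value of 1 means the check passed and a lower value means fatal logs were found; the sample values and the helper name `count_failed_checks` are made up.

```python
def count_failed_checks(check_values: list[float]) -> int:
    """Mimic `sum(metric < bool 1)`: every sample below 1 contributes 1."""
    return sum(1 if value < 1 else 0 for value in check_values)


# Made-up per-node samples (assumption: 1 = check passed, 0 = fatal logs found).
master_checks = [1, 0, 1]       # one Master instance affected
tserver_checks = [1, 1, 0, 0]   # two T-Server instances affected

total = count_failed_checks(master_checks) + count_failed_checks(tserver_checks)
alert_fires = total > 0
print(total, alert_fires)
```

The resulting total is what the alert message reports as `$value`.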
$value database Master or T-Server instances are down for more than 15 minutes for universe '$universe_name'.
count by (node_prefix) (label_replace(max_over_time(up{export_type=~"master_export|tserver_export",node_prefix="$node_prefix"}[15m]), "exported_instance", "$1", "instance", "(.*)") < 1 and on (node_prefix, export_type, exported_instance) (min_over_time(ybp_universe_node_function{node_prefix="$node_prefix"}[15m]) == 1)) > 0
Universe '$universe_name' Master or T-Server has restarted $value times during last 30 minutes.
max by (node_prefix) (changes(yb_node_boot_time{node_prefix="$node_prefix"}[30m]) and on (node_prefix) (max_over_time(ybp_universe_update_in_progress{node_prefix="$node_prefix"}[31m]) == 0)) > 0
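The restart alert relies on the PromQL `changes()` function, which counts how many times the value of a series changed within the lookup window. A rough Python equivalent is sketched below; `count_changes` is a hypothetical helper and the boot timestamps are made up.

```python
def count_changes(samples: list[float]) -> int:
    """Count how often consecutive samples differ, similar to PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


# Made-up boot-time samples scraped over 30 minutes: the process restarted twice.
boot_times = [1700000000, 1700000000, 1700000600, 1700000600, 1700001200]
restarts = count_changes(boot_times)
print(restarts)
```

The template additionally requires that `ybp_universe_update_in_progress` stayed at 0 for the preceding 31 minutes, so restarts that happen during a universe update do not match.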
Database queues overflow has been detected for universe '$universe_name'.
sum by (node_prefix) (increase(rpcs_queue_overflow{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(rpcs_timed_out_in_queue{node_prefix="$node_prefix"}[10m])) > 1
Database memory rejections have been detected for universe '$universe_name'.
sum by (node_prefix) (increase(leader_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(follower_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(operation_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) > 0
Redis connection failure has been detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_redis_connectivity_error{universe_uuid="$uuid"} > 0
Version mismatch has been detected for universe '$universe_name' for $value Master or T-Server instances.
ybp_health_check_tserver_version_mismatch{universe_uuid="$uuid"} + ybp_health_check_master_version_mismatch{universe_uuid="$uuid"} > 0
Test YSQL write/read operation failed on $value nodes for universe '$universe_name'.
count by (node_prefix) (yb_node_ysql_write_read{node_prefix="$node_prefix"} < 1)
Average node CPU usage for universe '$universe_name' is more than 90% on $value nodes.
count by(node_prefix) ((100 - (avg by (node_prefix, instance) (avg_over_time(irate(node_cpu_seconds_total{job="node",mode="idle", node_prefix="$node_prefix"}[1m])[30m:])) * 100)) > 90)
$value database nodes are down for more than 15 minutes for universe '$universe_name'.
count by (node_prefix) (max_over_time(up{export_type="node_export",node_prefix="$node_prefix"}[15m]) < 1) > 0
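The expression above treats a node as down only when the maximum of `up` over the 15-minute window is below 1, that is, when no scrape in the window succeeded. A minimal Python sketch of that check follows; the helper name and the sample values are illustrative assumptions.

```python
def down_for_entire_window(up_samples: list[int]) -> bool:
    """True if the window has samples and none of them reports a successful scrape."""
    # An empty window yields no result in PromQL, so it does not count as down here.
    return len(up_samples) > 0 and max(up_samples) < 1


# Made-up `up` samples per node over one 15-minute window.
windows = {
    "node-1": [0, 0, 0, 0],   # never reachable: counted as down
    "node-2": [0, 1, 0, 0],   # one successful scrape: not counted
    "node-3": [1, 1, 1, 1],   # healthy
}
down_nodes = [name for name, samples in windows.items()
              if down_for_entire_window(samples)]
print(down_nodes)
```

A single successful scrape inside the window is enough to keep a node out of the count, which filters out short interruptions.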
Node data disk usage for universe '$universe_name' is above $threshold% on $value node(s).
count by (universe_uuid) (count by (universe_uuid, node_name) (100 - (sum without (saved_name) (node_filesystem_free_bytes{mountpoint=~"__mountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) / sum without (saved_name) (node_filesystem_size_bytes{mountpoint=~"__mountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) * 100) {{ query_condition }} {{ query_threshold }}))
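The disk usage expression computes the used percentage as 100 minus the free-to-total ratio, then counts the nodes whose percentage satisfies the condition. The Python sketch below reproduces that arithmetic with made-up node sizes. It assumes the condition is `>` and the threshold is 70; both are example values, and the helper names are hypothetical.

```python
def disk_usage_pct(free_bytes: float, size_bytes: float) -> float:
    """Used space as a percentage: 100 - free / size * 100."""
    return 100 - free_bytes / size_bytes * 100


def count_nodes_over(nodes: dict[str, tuple[float, float]],
                     threshold_pct: float) -> int:
    """Count nodes whose disk usage is above the threshold."""
    return sum(1 for free, size in nodes.values()
               if disk_usage_pct(free, size) > threshold_pct)


GIB = 2**30
# Made-up (free_bytes, size_bytes) pairs per node.
nodes = {
    "node-1": (25 * GIB, 100 * GIB),   # 75% used
    "node-2": (60 * GIB, 100 * GIB),   # 40% used
    "node-3": (10 * GIB, 100 * GIB),   # 90% used
}
nodes_over = count_nodes_over(nodes, threshold_pct=70)
print(nodes_over)
```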
Node file descriptors usage for universe '$universe_name' is above 70% on $value nodes.
count by (universe_uuid) (ybp_health_check_used_fd_pct{universe_uuid="$uuid"} > 70)
More than one out-of-memory (OOM) kill has been detected for universe '$universe_name' on $value nodes.
count by (node_prefix) (yb_node_oom_kills_10min{node_prefix="$node_prefix"} > 1) > 0
Universe '$universe_name' database node has restarted $value times during last 30 minutes.
max by (node_prefix) (changes(node_boot_time{node_prefix="$node_prefix"}[30m])) > 0
Node system disk usage for universe '$universe_name' is above $threshold% on $value node(s).
count by (universe_uuid) (count by (universe_uuid, node_name) (100 - (sum without (saved_name) (node_filesystem_free_bytes{mountpoint=~"__systemMountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) / sum without (saved_name) (node_filesystem_size_bytes{mountpoint=~"__systemMountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) * 100) {{ query_condition }} {{ query_threshold }}))
The tablet leader is missing for more than 5 minutes for $value tablets in universe '$universe_name'.
max by (node_prefix) (count by (node_prefix, exported_instance) (max_over_time(yb_node_leaderless_tablet{node_prefix="$node_prefix"}[5m])) > 0)
Master leader is missing for universe '$universe_name'.
max by (node_prefix) (yb_node_is_master_leader{node_prefix="$node_prefix"}) < 1
Master is missing from RAFT group or has follower lag higher than $threshold seconds for universe '$universe_name'.
(min_over_time((ybp_universe_replication_factor{universe_uuid='{{ $labels.universe_uuid }}'} - on(universe_uuid) count by(universe_uuid) (count by (universe_uuid, exported_instance) (follower_lag_ms{export_type="master_export", universe_uuid='{{ $labels.universe_uuid }}'})))[{{query_threshold }}s:]) > 0 or (max by(universe_uuid) (follower_lag_ms{export_type="master_export", universe_uuid='{{ $labels.universe_uuid }}'}) {{ query_condition }} ({{ query_threshold }} * 1000)))
$value tablets remain under-replicated for more than 5 minutes in universe '$universe_name'.
max by (node_prefix) (count by (node_prefix, exported_instance) (max_over_time(yb_node_underreplicated_tablet{node_prefix="$node_prefix"}[5m])) > 0)
Maximum clock skew for universe '$universe_name' is more than 500 milliseconds. The current value is $value milliseconds.
max by (node_prefix) (max_over_time(hybrid_clock_skew{node_prefix="$node_prefix"}[10m])) / 1000 > 500
Failed to perform health check for universe '$universe_name'. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_health_check_status{universe_uuid = "$uuid"}[1d]) < 1
Failed to issue health check notification for universe '$universe_name'. You need to check Health notification settings and YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_health_check_notification_status{universe_uuid = "$uuid"}[1d]) < 1
$value nodes have inactive cronjob for universe '$universe_name'.
ybp_universe_inactive_cron_nodes{universe_uuid = "$uuid"} > 0
Average memory usage for universe '$universe_name' nodes is above $threshold%. Maximum value is $value.
max by (universe_uuid) ((avg_over_time(node_memory_MemTotal_bytes{universe_uuid="__universeUuid__"}[10m])
- ignoring (saved_name) (avg_over_time(node_memory_Buffers_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_Cached_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_MemFree_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_Slab_bytes{universe_uuid="__universeUuid__"}[10m])))
/ ignoring (saved_name) (avg_over_time(node_memory_MemTotal_bytes{universe_uuid="__universeUuid__"}[10m])))
* 100 {{ query_condition }} {{ query_threshold }}
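The memory expression subtracts buffers, page cache, free memory, and slab from total memory, then divides the remainder by total memory to get a percentage. The Python sketch below walks through the same arithmetic for a single node. The figures are made up, and `memory_usage_pct` is a hypothetical helper.

```python
def memory_usage_pct(total: float, buffers: float, cached: float,
                     free: float, slab: float) -> float:
    """(total - buffers - cached - free - slab) / total * 100."""
    used = total - buffers - cached - free - slab
    return used / total * 100


GIB = 2**30
# Made-up 10-minute averages for one node.
usage = memory_usage_pct(
    total=16 * GIB,
    buffers=1 * GIB,
    cached=4 * GIB,
    free=2 * GIB,
    slab=1 * GIB,
)
print(usage)
```

In the template, the outer `max by (universe_uuid)` keeps the highest per-node percentage, which is the value compared against the threshold.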
Failed to collect metrics for universe '$universe_name'. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_universe_metric_collection_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
Number of YEDIS connections for universe '$universe_name' is above $threshold. Current value is $value.
max by (universe_uuid) (max_over_time(rpc_connections_alive{universe_uuid="__universeUuid__",export_type="cql_export"}[5m])) {{ query_condition }} {{ query_threshold }}
Average replication lag for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
max by (universe_uuid) (avg_over_time(async_replication_committed_lag_micros{universe_uuid="__universeUuid__"}[10m]) or avg_over_time(async_replication_sent_lag_micros{universe_uuid="__universeUuid__"}[10m])) / 1000 {{ query_condition }} {{ query_threshold }}
More recent OS version is recommended for this universe. Consider running VM image upgrade for the nodes to incorporate security patches and address vulnerabilities.
ybp_universe_os_update_required{universe_uuid="__universeUuid__"} {{ query_condition }} {{ query_threshold }}
Client to node certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_c2n_cert_validity_days{universe_uuid="$uuid"} < 30)
Client to node CA certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_c2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
Encryption at rest configuration for universe '$universe_name' expires in $value days.
ybp_universe_encryption_key_expiry_days{universe_uuid="$uuid"} < 3
Node to node certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_n2n_cert_validity_days{universe_uuid="$uuid"} < 30)
Node to node CA certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_n2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
Invalid permissions of private access key file for universe '$universe_name'. You need to check YugabyteDB Anywhere logs for details or contact {{% support-platform %}}.
last_over_time(ybp_universe_private_access_key_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
SSH key for universe '$universe_name' will expire in $value days.
ybp_universe_ssh_key_expiry_day{universe_uuid="__universeUuid__"} {{ query_condition }} {{ query_threshold }}
Last SSH key rotation task for universe '$universe_name' failed. To retry, check SSH key rotation task result.
last_over_time(ybp_ssh_key_rotation_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
YSQLSH connection failure detected for universe '$universe_name' on $value TServer instance(s).
count by (universe_uuid) (yb_node_ysql_connect{universe_uuid="__universeUuid__"} < 1) {{ query_condition }} {{ query_threshold }}
New YSQL tables are added to the source universe '$universe_name' in the database with an existing xCluster configuration, but not added to the xCluster replication.
((count by (namespace_name, universe_uuid)(count by(namespace_name, table_id, universe_uuid)(rocksdb_current_version_sst_files_size{universe_uuid="__universeUuid__",table_type="PGSQL_TABLE_TYPE"}))) - count by(namespace_name, universe_uuid)(count by(namespace_name, universe_uuid, table_id)(async_replication_sent_lag_micros{universe_uuid="__universeUuid__",table_type="PGSQL_TABLE_TYPE"}))) {{ query_condition }} {{ query_threshold }}
Number of YSQL connections for universe '$universe_name' is above $threshold. Current value is $value.
max by (universe_uuid) (max_over_time(yb_node_ysql_connections_count{universe_uuid="__universeUuid__"}[5m])) {{ query_condition }} {{ query_threshold }}
Average YSQL operations latency for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
(sum by (universe_uuid, service_method)(rate(rpc_latency_sum{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver",service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m])) / sum by (universe_uuid, service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m]))) {{ query_condition }} {{ query_threshold }}
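The average latency expression divides the rate of `rpc_latency_sum` by the rate of `rpc_latency_count`. Because both rates cover the same window, the elapsed time cancels out and the result is the latency accumulated in the window divided by the number of operations in the window. The Python sketch below shows this with made-up counter readings. It ignores counter resets, which `rate()` handles in Prometheus, and `average_latency` is a hypothetical helper.

```python
import math


def average_latency(sum_start: float, sum_end: float,
                    count_start: float, count_end: float) -> float:
    """Average latency per operation between two counter readings."""
    delta_count = count_end - count_start
    if delta_count == 0:
        # No operations in the window: the PromQL division 0 / 0 yields NaN.
        return math.nan
    return (sum_end - sum_start) / delta_count


# Made-up readings taken at the start and end of a 5-minute window.
avg = average_latency(sum_start=1000, sum_end=4000, count_start=10, count_end=40)
print(avg)
```

The result is expressed in whatever unit `rpc_latency_sum` is recorded in.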
YSQL P99 latency for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
max by (universe_uuid) (rpc_latency{universe_uuid="__universeUuid__",server_type="yb_ysqlserver",service_type="SQLProcessor", service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|OtherStmts|Transactions",quantile="p99"}) {{ query_condition }} {{ query_threshold }}
Maximum throughput for YSQL operations for universe '$universe_name' is above $threshold. Current value is $value.
sum by (service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m])) {{ query_condition }} {{ query_threshold }}
CQLSH connection failure has been detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_cqlsh_connectivity_error{universe_uuid="$uuid"} > 0
Number of YCQL connections for universe '$universe_name' is above $threshold. Current value is $value.
max by (universe_uuid) (max_over_time(rpc_connections_alive{universe_uuid="__universeUuid__",export_type="cql_export"}[5m])) {{ query_condition }} {{ query_threshold }}
Average YCQL operations latency for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
(sum by (service_method)(rate(rpc_latency_sum{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m])) / sum by (service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m]))) {{ query_condition }} {{ query_threshold }}
YCQL P99 latency for universe '$universe_name' is above $threshold milliseconds. Current value is $value milliseconds.
max by (universe_uuid)(rpc_latency{universe_uuid="__universeUuid__",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|OtherStmts|Transaction",quantile="p99"}) {{ query_condition }} {{ query_threshold }}
Maximum throughput for YCQL operations for universe '$universe_name' is above $threshold. Current value is $value.
sum by (universe_uuid, service_method) (rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m])) {{ query_condition }} {{ query_threshold }}