Use alerts to notify you and your team members when cluster or database resource usage exceeds predefined limits, or when potential billing issues arise.
You configure alerts and notifications on the Alerts page.
To monitor clusters in real time, use the performance metrics on the cluster Overview and Performance tabs.
Enable alerts using the Configurations tab on the Alerts page.
<!-- To view alert details, select the alert to display the **Alert Policy Settings** sheet. -->
To change an alert, click the toggle in the Status column.
To test that you can receive email from the Yugabyte server, click **Send Test Email**.
Open notifications are listed on the Notifications tab on the Alerts page.
When the condition that caused the alert resolves, the notification is dismissed automatically.
YugabyteDB monitors the health of your clusters based on cluster alert conditions and displays the health as Healthy, Needs Attention, or Unhealthy. The following table describes the criteria used to determine cluster health.
| Status | Alert | Level |
|---|---|---|
| Healthy | No alerts | |
| Healthy | Disk throughput | Warning |
| Healthy | Disk IOPS | Warning |
| Healthy | Fewer than 34% of nodes down | Info |
| Needs Attention | Tablet peers | Warning or Severe |
| Needs Attention | Node free storage | Warning or Severe |
| Needs Attention | More than 34% of nodes down | Warning |
| Needs Attention | Memory Utilization | Warning or Severe |
| Needs Attention | YSQL Connections | Warning or Severe |
| Needs Attention | CPU Utilization | Warning or Severe |
| Unhealthy | More than 66% of nodes down | Severe |
| Unhealthy | CMK unavailable | Warning |
To see the alert conditions that caused the current health condition, click the cluster health icon.
If multiple alerts are triggered, the most severe alert determines the health that is reported.
Cluster health is updated every three minutes.
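The "most severe alert wins" rollup can be sketched as a worst-case reduction over active alerts. The mapping below is a partial, illustrative extract of the health table, not the service's actual implementation:

```python
# Sketch of the health rollup: each active alert implies a health status
# (per the table above), and the worst implied status is reported.
# The alert-to-health mapping here is a partial, illustrative extract.
HEALTH_ORDER = ["Healthy", "Needs Attention", "Unhealthy"]

ALERT_HEALTH = {
    ("Disk throughput", "Warning"): "Healthy",
    ("Nodes down", "Info"): "Healthy",
    ("Nodes down", "Warning"): "Needs Attention",
    ("CPU Utilization", "Warning"): "Needs Attention",
    ("CPU Utilization", "Severe"): "Needs Attention",
    ("Nodes down", "Severe"): "Unhealthy",
}

def cluster_health(active_alerts):
    """active_alerts: iterable of (alert, level) pairs."""
    implied = [ALERT_HEALTH.get(a, "Healthy") for a in active_alerts]
    return max(implied, key=HEALTH_ORDER.index, default="Healthy")
```

For example, a Warning-level disk throughput alert alone leaves the cluster Healthy, but combined with a Severe nodes-down alert the reported health is Unhealthy.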
Alerts can trigger for issues with a particular cluster, or for billing issues.
Alerts have two severity levels: Warning and Severe. The nodes down alert has a third level, Info, which is reported in cluster health but does not trigger a notification.
When you receive a cluster alert, the first step is to review the chart for the metric over time (if available) to evaluate trends and monitor your progress.
| Alert | Metric |
|---|---|
| Tablet Peers | Tablets |
| Disk Throughput | Disk IOPS |
| Disk IOPS | Disk IOPS |
| Node Free Storage | Disk Usage |
| Nodes Down | N/A (go to the cluster Nodes tab to see which nodes are down) |
| Memory Use | Memory Usage |
| Disaster Recovery | Safe Time Lag and Replication Lag |
| Cluster Queues Overflow | RPC Queue Size |
| Compaction Overload | Compaction |
| YSQL Connections | YSQL Operations/Sec |
| CMK Unavailable | N/A |
| CPU Utilization | CPU Usage |
You can view metrics on the cluster Performance tab. Refer to Performance metrics.
{{< note title="Note" >}}
If you get frequent cluster alerts on a Sandbox cluster, you may have reached the performance limits for Sandbox clusters. Consider upgrading to a Dedicated cluster.
{{< /note >}}
YugabyteDB Aeon automatically enforces tablet peer limits for clusters.
If a cluster exceeds the limit, YugabyteDB Aeon sends a notification as follows:
If the number of tablet peers in the cluster approaches the limit for the cluster, you should add more RAM. You can do this by scaling the cluster horizontally by adding nodes, or vertically by adding vCPUs.
You can also drop tables, though there may be a lag before the associated tablets are cleared.
For Sandbox clusters, if you reach the tablet peer limit, you cannot create any more tables, and table splitting is stopped.
For information on scaling clusters, refer to Scale and configure clusters.
For information on tablet limits in YugabyteDB, refer to Memory and tablet limits.
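As a back-of-the-envelope check, the number of tablet peers is the tablet count multiplied by the replication factor, because every tablet is replicated across nodes. The sketch below shows the arithmetic; the numbers are illustrative, and the actual per-cluster limit depends on cluster RAM:

```python
# Each tablet is replicated across nodes, and every replica is a
# "tablet peer", so total peers = tablets x replication factor.
# Peer limits depend on cluster RAM; numbers here are illustrative.

def tablet_peers(num_tablets: int, replication_factor: int = 3) -> int:
    return num_tablets * replication_factor

def peers_per_node(num_tablets: int, replication_factor: int = 3,
                   num_nodes: int = 3) -> float:
    # Peers are spread across the nodes, so each node hosts roughly
    # this many.
    return tablet_peers(num_tablets, replication_factor) / num_nodes

# 500 tablets at replication factor 3 -> 1,500 peers cluster-wide,
# roughly 500 per node on a 3-node cluster.
```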
Currently, this alert is only available for Sandbox clusters running v2024.1 or later.
YugabyteDB Aeon sends a notification when the disk throughput on any node in the cluster exceeds the threshold, as follows:
If disk throughput consistently exceeds 80% for any node, consider taking action to remove any bottlenecks on performance. For clusters deployed on AWS, increase the maximum disk IOPS per node. Throughput is automatically scaled together with IOPS. For clusters deployed on GCP or Azure, add more vCPUs or increase the disk size.
For information on scaling clusters, refer to Scale and configure clusters.
YugabyteDB Aeon sends a notification when the disk input/output (I/O) operations per second (IOPS) on any node in the cluster (AWS only) exceeds the threshold, as follows:
For large datasets or clusters with high concurrent transactions, higher IOPS is recommended.
If disk IOPS consistently exceeds 80% for any node, consider taking action to remove bottlenecks on performance. For clusters deployed on AWS, increase the maximum disk IOPS per node. For clusters deployed on GCP or Azure, add more vCPUs or increase the disk size.
For information on scaling clusters, refer to Scale and configure clusters.
YugabyteDB Aeon sends a notification when the free storage on any node in the cluster falls below the threshold, as follows:
Once free disk space on any node falls below 25% of capacity, look at your database for unused tables that you can truncate or drop.
Consider increasing the disk space per node. By default, you are entitled to 50GB per vCPU. Adding vCPUs also automatically increases disk space.
For information on scaling clusters, refer to Scale and configure clusters.
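The entitlement and threshold above reduce to simple arithmetic. The sketch below assumes the 50GB-per-vCPU entitlement applies uniformly per node:

```python
# Disk entitlement and the 25% free-space guidance above, as arithmetic.
# Assumes the 50 GB-per-vCPU entitlement applies uniformly per node.
GB_PER_VCPU = 50

def disk_entitlement_gb(vcpus_per_node: int) -> int:
    return vcpus_per_node * GB_PER_VCPU

def needs_attention(capacity_gb: float, used_gb: float) -> bool:
    # Act once free space falls below 25% of capacity.
    return (capacity_gb - used_gb) / capacity_gb < 0.25

# A 4-vCPU node is entitled to 200 GB; at 160 GB used (20% free),
# the free-storage guidance above applies.
```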
YugabyteDB Aeon sends a notification when the number of primary or read replica nodes that are down in a cluster exceeds the threshold, as follows:
Go to the cluster Nodes tab to see which nodes are down.
If fewer than 34% of primary or read replica nodes in a multi-node (that is, highly available) cluster are down, the cluster remains healthy and can continue to serve requests, and the status is reported in the cluster health.
If more than 66% of primary or read replica nodes in a multi-node (that is, highly available) cluster are down, the cluster is considered unhealthy and the downed nodes should be replaced as soon as possible.
When cluster nodes go down, Yugabyte is notified automatically and will restore the nodes as quickly as possible.
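The nodes-down thresholds can be expressed as a simple fraction check. This is a sketch; the service's exact boundary handling may differ:

```python
# Nodes-down severity per the thresholds above: fewer than 34% of nodes
# down is Info, more than 34% is Warning, more than 66% is Severe.
# Boundary handling here is illustrative.

def nodes_down_level(nodes_down: int, total_nodes: int) -> str:
    if nodes_down == 0:
        return "OK"
    fraction = nodes_down / total_nodes
    if fraction > 0.66:
        return "Severe"
    if fraction > 0.34:
        return "Warning"
    return "Info"

# On a 3-node cluster: 1 node down (~33%) is Info and the cluster can
# still serve requests; 2 nodes down (~67%) is Severe.
```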
YugabyteDB Aeon sends a notification when memory use in the cluster exceeds the threshold, as follows:
If your cluster experiences frequent spikes in memory use, consider optimizing your workload.
Unoptimized queries can lead to memory alerts. Use the Slow Queries and Live Queries views to identify potentially problematic queries, then use the EXPLAIN statement to see the query execution plan and identify optimizations. Consider adding one or more indexes to improve query performance. For more information, refer to Analyzing Queries with EXPLAIN.
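The inspect-the-plan workflow looks like the following. For a self-contained example this uses SQLite (bundled with Python); YSQL's EXPLAIN syntax is PostgreSQL-compatible and its output differs, but the loop is the same: check the plan, add an index, re-check.

```python
# Illustration of using EXPLAIN to spot a missing index, with SQLite
# standing in for YSQL so the example runs anywhere without a cluster.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan detail in the last column.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM orders WHERE customer = 'acme'")
# 'before' reports a full table scan of orders.

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
after = plan("SELECT * FROM orders WHERE customer = 'acme'")
# 'after' reports a search using idx_orders_customer instead.
```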
If memory use is continuously higher than 80%, your workload may also exceed the capacity of your cluster. If the issue isn't a single query that consumes a lot of memory on a single tablet, consider scaling your cluster vertically by adding vCPUs to increase capacity per node, or horizontally by adding nodes to reduce the load per node. Refer to Scale and configure clusters.
High memory use could also indicate a problem and may require debugging by {{% support-cloud %}}.
If you have set up disaster recovery (DR) for a cluster, YugabyteDB Aeon sends a notification when the safe time or replication lag exceeds the threshold, as follows:
If you receive DR alerts, navigate to the primary cluster Disaster Recovery tab and check the status. See Disaster Recovery alerts for more information.
YugabyteDB Aeon sends the following database overload alert:
This alert triggers if any one of the following conditions occurs:
Cluster queue overflow and compaction overload are indicators of incoming traffic and system load. If the server backends are overloaded, requests pile up in the queues; when a queue is full, the system responds with backpressure errors, which can degrade performance.
If your cluster generates this alert, you may need to rate limit your queries. Look at optimizing the application generating the requests in the following ways:
If your cluster generates this alert but isn't under a very large workload, contact {{% support-cloud %}}.
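One common way to rate limit retries on the client side is exponential backoff with jitter, which smooths out bursts instead of piling more work onto already-full queues. The sketch below is generic; the exception type stands in for whatever error your driver surfaces on backpressure:

```python
# Generic retry-with-backoff sketch for handling backpressure errors.
# BackpressureError is a stand-in for your driver's actual error type.
import random
import time

class BackpressureError(Exception):
    """Stand-in for the error raised when server queues are full."""

def with_backoff(op, retries=5, base_delay=0.1):
    for attempt in range(retries):
        try:
            return op()
        except BackpressureError:
            if attempt == retries - 1:
                raise
            # Full jitter: sleep between 0 and base * 2^attempt seconds.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```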
YugabyteDB Aeon clusters support 15 simultaneous connections per vCPU. If built-in Connection Pooling is enabled, then clusters support 10 client connections per server connection.
YugabyteDB Aeon sends a notification when the number of YSQL connections on any node in the cluster exceeds the threshold, as follows:
If Connection Pooling is enabled, the thresholds are as follows:
If your cluster experiences frequent spikes in connections, consider optimizing your application's connection code.
If connections are opened but never closed, your application will eventually exceed the connection limit.
You may need to implement some form of connection pooling, such as enabling built-in Connection Pooling.
If the number of connections is continuously higher than 60%, your workload may also exceed the capacity of your cluster. Be sure to size your cluster with enough spare capacity to remain fault tolerant during maintenance events and outages. For example, during an outage or a rolling restart for maintenance, a 3-node cluster loses a third of its capacity; the remaining nodes must be able to handle the traffic from the absent node.
If the number of client connections consistently exceeds the limit, you may benefit from an increase to the limit, depending on your application usage pattern; contact {{% support-cloud %}}.
To add connection capacity, scale your cluster by adding vCPUs or nodes. Refer to Scale and configure clusters.
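The capacity guidance above reduces to simple arithmetic: 15 simultaneous YSQL connections per vCPU, with enough headroom that peak traffic still fits when a node is absent. The numbers below are illustrative:

```python
# Connection-capacity math per the guidance above: 15 simultaneous
# YSQL connections per vCPU, with headroom for one absent node.
# Assumes the per-vCPU limit applies uniformly; numbers are illustrative.
CONNECTIONS_PER_VCPU = 15

def connection_limit(num_nodes: int, vcpus_per_node: int) -> int:
    return num_nodes * vcpus_per_node * CONNECTIONS_PER_VCPU

def fits_with_node_down(num_nodes, vcpus_per_node, peak_connections):
    # During an outage or rolling restart, the cluster temporarily
    # runs on n-1 nodes.
    return peak_connections <= connection_limit(num_nodes - 1, vcpus_per_node)

# A 3-node, 4-vCPU cluster supports 180 connections, but only 120
# with one node absent.
```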
YugabyteDB Aeon sends a notification when the customer managed key (CMK) used to encrypt a cluster is unreachable.
If your CMK is unreachable, check the following:
Refer to Encryption at rest.
YugabyteDB Aeon sends a notification when CPU use on any node in the cluster exceeds the threshold, as follows:
If your cluster experiences frequent spikes in CPU use, consider optimizing your workload.
Unoptimized queries can lead to CPU alerts. Use the Slow Queries and Live Queries views to identify potentially problematic queries, then use the EXPLAIN statement to see the query execution plan and identify optimizations. Consider adding one or more indexes to improve query performance. For more information, refer to Analyzing Queries with EXPLAIN.
High CPU use could also indicate a problem and may require debugging by {{% support-cloud %}}.
If CPU use is continuously higher than 80%, your workload may also exceed the capacity of your cluster. Consider scaling your cluster vertically by adding vCPUs to increase capacity per node, or horizontally by adding nodes to reduce the load per node. Refer to Scale and configure clusters.
If disaster recovery is configured, YugabyteDB Aeon can alert you when the safe time or replication lag to the replica cluster exceeds a threshold.
| Alert | Metric |
|---|---|
| Safe time lag | Time |
| Replication lag | Time |
YugabyteDB Aeon sends a notification when lag exceeds the threshold, as follows:
For more information, refer to Monitor replication.
Billing alerts trigger for the following events:
| Alert | Response |
|---|---|
| Invoice payment failure | Check with your card issuer to ensure that your card is still valid and has not exceeded its limit. Alternatively, you can add another credit card to your billing profile. |
| Credit usage exceeds the 50%, 75%, and 90% limit | Add a credit card to your billing profile, or contact {{% support-cloud %}}. |
| Credit card expiring within 3 months | Add a new credit card. |
For information on adding credit cards and managing your billing profile, refer to Manage your billing profile and payment method.