When running a Sui Validator node or Full node, you may want to configure alerting based on some or all of the following metrics.

The following sections list the alert settings. Customize them before use:

- Replace `$network` with your actual network label (for example, `mainnet`, `testnet`, and so on).
- `host` and `container` are stripped to keep the expressions infrastructure-agnostic.

These alerts should receive the most immediate attention from you or your team.
| Key | Value |
|---|---|
| Name | Safe Mode during Reconfiguration |
| Summary | Epoch failed to advance; chain entered safe mode |
| Duration | 5m |
is_safe_mode{network="$network"} > 0.5 or absent(is_safe_mode{network="$network"})
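The expression above still contains the `$network` placeholder. If your alerting tool does not substitute variables for you, plain string templating is one way to fill it in. The following is a minimal Python sketch; the `render_expr` helper and the `SAFE_MODE_EXPR` name are illustrative assumptions and not part of any Sui tooling.

```python
from string import Template

# Alert expression copied from above; `$network` is the only placeholder.
SAFE_MODE_EXPR = Template(
    'is_safe_mode{network="$network"} > 0.5 '
    'or absent(is_safe_mode{network="$network"})'
)


def render_expr(template: Template, network: str) -> str:
    """Fill in the $network placeholder with a concrete network label.

    Hypothetical helper for illustration only.
    """
    return template.substitute(network=network)


print(render_expr(SAFE_MODE_EXPR, "mainnet"))
# is_safe_mode{network="mainnet"} > 0.5 or absent(is_safe_mode{network="mainnet"})
```

`Template.substitute` raises `KeyError` if a placeholder is left unfilled, which helps catch an expression that would otherwise match no series.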
| Key | Value |
|---|---|
| Name | Consensus Proposals Failure |
| Summary | Less than 80% of stake is proposing consensus blocks |
| Duration | 5m |
sum(
sum by (host) (current_voting_right{network="$network"})
and
sum by (host) (rate(consensus_proposed_blocks{network="$network"}[5m])) > 0
) < 8000
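The `sum(... and ...) < 8000` pattern above, which several of the following alerts reuse, is a stake-weighted health check: it adds up the voting rights of only those hosts that pass the per-host condition, then compares the total to a threshold. The Python sketch below mirrors that logic on plain dictionaries. It assumes each host reports its own `current_voting_right` and that voting rights total 10,000, which is what pairing "80%" with `8000` implies. The host names and numbers are made up.

```python
def proposing_stake(voting_right: dict, proposal_rate: dict) -> float:
    """Sum the voting rights of hosts whose proposal rate is above zero.

    Mirrors `sum(voting_right and proposal_rate > 0)`: a host counts only
    if it appears on both sides and its rate passes the `> 0` filter.
    """
    return sum(
        stake
        for host, stake in voting_right.items()
        if proposal_rate.get(host, 0.0) > 0
    )


def should_alert(voting_right: dict, proposal_rate: dict, threshold: float = 8000) -> bool:
    """True when the healthy stake drops below the threshold."""
    return proposing_stake(voting_right, proposal_rate) < threshold


# Hypothetical three-validator network with 10,000 total voting rights.
voting_right = {"val-a": 4000, "val-b": 3500, "val-c": 2500}
proposal_rate = {"val-a": 1.2, "val-b": 0.0, "val-c": 0.8}  # blocks per second

print(proposing_stake(voting_right, proposal_rate))  # 6500
print(should_alert(voting_right, proposal_rate))     # True
```

A host that stops reporting the rate metric altogether is treated the same as a host with a zero rate, because `and` keeps only series present on both sides.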
| Key | Value |
|---|---|
| Name | Checkpoint Execution Rate Is Low |
| Summary | Less than 80% of stake is executing checkpoints quickly enough |
| Duration | 5m |
sum(
sum by (host) (current_voting_right{network="$network"})
and
sum by (host) (rate(last_executed_checkpoint{network="$network"}[5m])) > 2
) < 8000
| Key | Value |
|---|---|
| Name | Certificate execution latencies are high |
| Summary | Less than 80% of stake is handling shared-object tx certs with low enough latency |
| Duration | 5m |
sum(
sum by (host) (current_voting_right{network="$network"})
and
histogram_quantile(0.95, sum by (le, host) (
rate(validator_service_handle_certificate_consensus_latency_bucket{network="$network"}[5m])
)) < 3
) < 8000
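In the expression above, a host counts as healthy when its estimated 95th-percentile latency is under 3 (seconds, assuming the histogram is recorded in seconds). `histogram_quantile` estimates that percentile by interpolating inside cumulative buckets. The Python sketch below is a simplified illustration of that interpolation, not Prometheus's actual implementation. It assumes positive bucket bounds, a final `+Inf` bucket, and `0 < q <= 1`. The bucket values are invented.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: linear interpolation inside the bucket that
    contains the requested rank. The last bucket must be +Inf.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical per-host buckets: (le, cumulative observation count).
host_buckets = [(0.5, 70), (1.0, 90), (2.5, 98), (5.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, host_buckets)
print(p95 < 3)  # True: this host would count toward the healthy stake
```

Because the estimate is interpolated, its precision depends on how finely the buckets are spaced around the threshold.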
| Key | Value |
|---|---|
| Name | RandomnessDkgFailure |
| Summary | Random beacon DKG has failed on one or more hosts |
| Duration | 5m |
epoch_random_beacon_dkg_failed{network="$network"} > 0 or absent(is_safe_mode{network="$network"})
| Key | Value |
|---|---|
| Name | Mysten validators are not upgraded |
| Summary | Validators are behind on protocol version |
| Duration | 1h |
min(sui_configured_max_protocol_version{network="$network", host=~"Mysten-.*"})
< quantile(0.34, sui_configured_max_protocol_version{network="$network"})
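The comparison above fires when the lowest `sui_configured_max_protocol_version` among hosts matching `Mysten-.*` is below the 0.34 quantile of that metric across all hosts, meaning roughly two thirds of the reporting hosts already support a newer version. The Python sketch below reproduces the check with a linear-interpolated quantile, which is how PromQL's `quantile` aggregation is understood to work. Host names and version numbers are hypothetical.

```python
import math


def interpolated_quantile(q: float, values: list) -> float:
    """Quantile with linear interpolation between neighboring sorted values."""
    if not values:
        return math.nan
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


def mysten_behind(versions: dict) -> bool:
    """True when the oldest Mysten host trails the 0.34 quantile of all hosts."""
    mysten = [v for host, v in versions.items() if host.startswith("Mysten-")]
    if not mysten:
        return False  # no matching series, so the comparison yields nothing
    return min(mysten) < interpolated_quantile(0.34, list(versions.values()))


# Hypothetical fleet: one Mysten host has not been upgraded yet.
versions = {
    "Mysten-1": 70, "Mysten-2": 71,
    "val-a": 71, "val-b": 71, "val-c": 71, "val-d": 71,
}
print(mysten_behind(versions))  # True
```

The one-hour duration on this alert gives a rolling upgrade time to finish before the condition counts as a problem.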
All alerts are important, but the following alerts and warnings can be addressed within a normal node maintenance workflow.
| Key | Value |
|---|---|
| Name | Consensus sequencing p99 latencies are high |
| Summary | Less than 80% of stake is sequencing tx certs with acceptable latency |
| Duration | 1m |
sum(
sum by (host) (current_voting_right{network="$network"})
and
histogram_quantile(0.95, sum by (le, host) (
rate(sequencing_certificate_latency_bucket{network="$network", position="0", tx_type=~"shared_certificate|owned_certificate|soft_bundle"}[2m])
)) < 2
) < 5000
| Key | Value |
|---|---|
| Name | System Invariant Violations |
| Summary | A system invariant violation was reported |
| Duration | 1m |
max(system_invariant_violations{network="$network"}) > 0