docs/en/administration/management/monitoring/metrics.md
This topic introduces some important general metrics of StarRocks.
For dedicated metrics for materialized views and shared-data clusters, refer to the corresponding sections.
For more information on how to build a monitoring service for your StarRocks cluster, see Monitor and Alert.
- Safe Mode status: Valid values are `0` (disabled) and `1` (enabled). When Safe Mode is enabled, the cluster no longer accepts any loading requests.
- Slow lock held time: A summary metric controlled by the `slow_lock_threshold_ms` configuration parameter. It tracks the maximum lock held time among all lock owners when a slow lock event is detected. The metric includes quantile values (0.75, 0.95, 0.98, 0.99, 0.999), `_sum`, and `_count` outputs. Note: This metric may not accurately reflect the exact lock held time under high contention, because the metric is updated once the wait time exceeds the threshold, but the held time may continue to increase until the owner completes its operation and releases the lock. However, this metric can still be updated even when a deadlock occurs.
- Slow lock wait time: A summary metric controlled by the `slow_lock_threshold_ms` configuration parameter. It accurately tracks how long threads wait to acquire locks during lock contention. The metric includes quantile values (0.75, 0.95, 0.98, 0.99, 0.999), `_sum`, and `_count` outputs, and provides precise wait time measurements. Note: This metric cannot be updated when a deadlock occurs, so it cannot be used to detect deadlock situations.
- Connector scan metrics: Labeled with `file_format` and `scan_type`.
- System metrics: CPU metrics are collected from `/proc/stat`; the file descriptor limit is obtained from the `ulimit` command; for disk state, `1` indicates that the disk is in use and `0` indicates that it is not; network metrics are collected from `/proc/net/snmp`.

Unit: Count
Description: The total number of Routine Load jobs in different states. For example:
```
starrocks_fe_routine_load_jobs{state="NEED_SCHEDULE"} 0
starrocks_fe_routine_load_jobs{state="RUNNING"} 1
starrocks_fe_routine_load_jobs{state="PAUSED"} 0
starrocks_fe_routine_load_jobs{state="STOPPED"} 0
starrocks_fe_routine_load_jobs{state="CANCELLED"} 1
starrocks_fe_routine_load_jobs{state="UNSTABLE"} 0
```
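As a sketch of how such per-state series can be consumed, the snippet below parses the example output above from the Prometheus text format; the `jobs_by_state` helper is hypothetical, not part of StarRocks.

```python
import re

# Sample lines as exposed on the FE metrics endpoint (taken from the example above).
METRICS_TEXT = """\
starrocks_fe_routine_load_jobs{state="NEED_SCHEDULE"} 0
starrocks_fe_routine_load_jobs{state="RUNNING"} 1
starrocks_fe_routine_load_jobs{state="PAUSED"} 0
starrocks_fe_routine_load_jobs{state="STOPPED"} 0
starrocks_fe_routine_load_jobs{state="CANCELLED"} 1
starrocks_fe_routine_load_jobs{state="UNSTABLE"} 0
"""

LINE_RE = re.compile(r'^starrocks_fe_routine_load_jobs\{state="(\w+)"\} (\d+)$')

def jobs_by_state(text: str) -> dict:
    """Parse the Prometheus text format into a {state: count} mapping."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            result[m.group(1)] = int(m.group(2))
    return result

counts = jobs_by_state(METRICS_TEXT)
print(counts["RUNNING"])     # 1
print(sum(counts.values()))  # 2 jobs in total across all states
```

Summing the values gives the total number of Routine Load jobs regardless of state, which is often more useful for dashboards than any single state series.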
The offset lag metric is reported only when the FE configuration item `enable_routine_load_lag_metrics` is set to `true` and the offset lag is greater than or equal to the FE configuration item `min_routine_load_lag_for_metrics`. By default, `enable_routine_load_lag_metrics` is `false`, and `min_routine_load_lag_for_metrics` is `10000`.

The lag time metric is reported only when `enable_routine_load_lag_time_metrics` is set to `true`. By default, `enable_routine_load_lag_time_metrics` is `false`.

Another metric indicates whether the `publish-version-daemon` loop runs on this FE node.

The following metrics are summary-type metrics that provide latency distributions for different phases of a transaction. These metrics are reported exclusively by the Leader FE node.
Each metric includes the following outputs:
- `quantile` label, which can have values of 0.75, 0.95, 0.98, 0.99, and 0.999.
- `<metric_name>_sum`: The total cumulative time spent in this phase, for example, `starrocks_fe_txn_total_latency_ms_sum`.
- `<metric_name>_count`: The total number of transactions recorded for this phase, for example, `starrocks_fe_txn_total_latency_ms_count`.

All transaction metrics share the following labels:
- `type`: Categorizes transactions by their load job source type (for example, `all`, `stream_load`, `routine_load`). This allows monitoring both overall transaction performance and the performance of specific load types. The reported groups can be configured via the FE parameter `txn_latency_metric_report_groups`.
- `is_leader`: Indicates whether the reporting FE node is the Leader. Only the Leader FE (`is_leader="true"`) reports actual metric values. Followers have `is_leader="false"` and report no data.

The per-phase metrics measure the following intervals:

- Total latency: From the `prepare` time to the `finish` time. This metric represents the full end-to-end duration of a transaction.
- Write latency: The write phase of a transaction, from `prepare` time to `commit` time. This metric isolates the performance of the data writing and preparation stage before the transaction is ready to be published.
- Publish latency: The publish phase, from `commit` time to `finish` time. This is the duration it takes for a committed transaction to become visible to queries. It is the sum of the schedule, execute, can_finish, and ack sub-phases.
- Schedule latency: From `commit` time to when the publish task is picked up. This metric reflects scheduling delays or queueing time in the publish pipeline.
- Execute latency: The execution of the publish task, from when the task is picked up to when it finishes. This metric represents the actual time spent making the transaction's changes visible.
- Can-finish latency: From publish task completion to the moment `canTxnFinish()` first returns `true`, measured from publish version finish time to ready-to-finish time.
- Ack latency: From ready-to-finish time to the final finish time, when the transaction is marked as `VISIBLE`. This metric includes final acknowledgment steps after the transaction is ready to finish.

Merge Commit latency metrics expose percentile series such as `merge_commit_request_latency_99` and `merge_commit_request_latency_90`, reported in microseconds. The end-to-end latency obeys:
merge_commit_request = merge_commit_pending + merge_commit_wait_plan + merge_commit_append_pipe + merge_commit_wait_finish
Note: Before v3.4.11, v3.5.12, and v4.0.4, these latency metrics were reported in nanoseconds.
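The decomposition above can be checked numerically. The snippet below uses illustrative sample values (not real measurements) for the four sub-phase latencies of a single merge commit request, and shows the unit conversion needed when reading these series from versions that still report nanoseconds.

```python
# Illustrative sub-phase latencies in microseconds for one merge commit
# request; the four component names come from the identity above.
phases_us = {
    "merge_commit_pending": 120,
    "merge_commit_wait_plan": 450,
    "merge_commit_append_pipe": 80,
    "merge_commit_wait_finish": 1350,
}

# End-to-end latency is the sum of the four sub-phases.
merge_commit_request_us = sum(phases_us.values())
print(merge_commit_request_us)  # 2000

def to_microseconds(value: float, reported_in_nanoseconds: bool) -> float:
    """Normalize a latency sample: versions before v3.4.11, v3.5.12, and
    v4.0.4 report these series in nanoseconds instead of microseconds."""
    return value / 1_000 if reported_in_nanoseconds else value

print(to_microseconds(2_000_000, reported_in_nanoseconds=True))  # 2000.0
```

Normalizing units this way keeps dashboards consistent across a mixed-version fleet during an upgrade.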
Time travel query metrics carry the label `time_travel_type` (`branch`, `tag`, `snapshot`, or `timestamp`) for the categorized series. `snapshot` means `FOR VERSION AS OF <snapshot_id>`, `branch` and `tag` mean `FOR VERSION AS OF <reference_name>`, and `timestamp` means `FOR TIMESTAMP AS OF ...`.

The Iceberg DELETE task metrics are labeled with `status` (`success` or `failed`), `reason` (`none`, `timeout`, `oom`, `access_denied`, `unknown`), and `delete_type` (`position` or `metadata`):

- Task count: The total number of DELETE tasks that target Iceberg tables. The metric is incremented by 1 after each task ends, regardless of success or failure. `delete_type` distinguishes between two delete methods: `position` (generates position delete files) and `metadata` (metadata-level delete).
- Task duration: The cumulative duration of DELETE tasks in milliseconds. The duration of each task is added after it ends.
- Bytes: For metadata delete, this represents the size of deleted data files. For position delete, this represents the size of position delete files created.
- Rows: For metadata delete, this represents the number of rows in deleted data files. For position delete, this represents the number of position deletes created.

The Iceberg compaction (`rewrite_data_files`) task metrics are all labeled with `compaction_type` (`manual` or `auto`).

The Iceberg write task metrics are labeled with `status` (`success` or `failed`), `reason` (`none`, `timeout`, `oom`, `access_denied`, `unknown`), and `write_type` (`insert`, `overwrite`, or `ctas`):

- Task count: The total number of INSERT, INSERT OVERWRITE, or CTAS tasks that target Iceberg tables. The metric is incremented by 1 after each task ends, regardless of success or failure. `write_type` distinguishes between the operation types.
- Task duration: The cumulative duration of write tasks (INSERT, INSERT OVERWRITE, CTAS) in milliseconds. The duration of each task is added after it ends.
- Bytes: The total size of data files written to the Iceberg table.
- Rows: The number of rows written to the Iceberg table.
- Files: The count of data files written to the Iceberg table.

The Hive write task metrics are labeled with `status` (`success` or `failed`), `reason` (`none`, `timeout`, `oom`, `access_denied`, `unknown`), and `write_type` (`insert` or `overwrite`):

- Task count: The total number of INSERT or INSERT OVERWRITE tasks that target Hive tables. The metric is incremented by 1 after each task ends, regardless of success or failure. `write_type` distinguishes between the operation types.
- Task duration: The cumulative duration of write tasks (INSERT, INSERT OVERWRITE) in milliseconds. The duration of each task is added after it ends.
- Bytes: The total size of data files written to the Hive table.
- Rows: The number of rows written to the Hive table.
- Files: The count of data files written to the Hive table.

DataCache metrics provide visibility into cache capacity, usage, and hit rate for data caching.
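As an example of how a hit rate is typically derived from cache counters, here is a small sketch; the `cache_hit_rate` function and the byte counts are hypothetical illustrations, not StarRocks metric names.

```python
def cache_hit_rate(hit_bytes: int, miss_bytes: int) -> float:
    """Hit rate as the fraction of bytes served from the cache.

    Returns 0.0 when the cache has seen no traffic, avoiding a
    division-by-zero on an idle cluster.
    """
    total = hit_bytes + miss_bytes
    return hit_bytes / total if total else 0.0

# Example: 3 GiB read from the cache, 1 GiB fetched from remote storage.
print(cache_hit_rate(3 << 30, 1 << 30))  # 0.75
```

In practice the two inputs would come from monotonically increasing counters, so dashboards usually apply this ratio to the rate of change over a window rather than to the raw totals.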
The following metrics are exposed on the BE Prometheus endpoint (`/metrics`):
Tablet reshard metrics provide visibility into tablet split and merge operations in shared-data mode with range distribution tables.
The job status metric carries the labels `job=tablet_reshard`, `type=SPLIT_TABLET|MERGE_TABLET`, and `state=PENDING|PREPARING|RUNNING|CLEANING|FINISHED|ABORTING|ABORTED`. The per-operation metrics carry the label `type=split|merge`.

The following metrics are exposed on the BE `/vars` HTTP endpoint with the `tablet_reshard_` prefix.
- Counted per `publish_resharding_tablet` call.
- Time of the `split_tablet` computation in microseconds.
- Time of the `merge_tablet` computation in microseconds.
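A quick way to inspect these counters is to filter a `/vars` dump by the `tablet_reshard_` prefix. The sketch below assumes a `name : value` line format for the dump; the specific metric names and values in the sample text are illustrative, not confirmed StarRocks names.

```python
# Hypothetical /vars output; only the tablet_reshard_ prefix is taken from
# the document, the metric names and values below are illustrative.
VARS_TEXT = """\
tablet_reshard_split_tablet_us : 1234
tablet_reshard_merge_tablet_us : 567
process_cpu_usage : 12
"""

def reshard_vars(text: str) -> dict:
    """Keep only the tablet_reshard_-prefixed counters from a /vars dump."""
    out = {}
    for line in text.splitlines():
        name, _, value = line.partition(" : ")
        if name.startswith("tablet_reshard_"):
            out[name.strip()] = int(value)
    return out

print(sorted(reshard_vars(VARS_TEXT)))
```

Filtering on the documented prefix keeps the helper valid even if individual metric names differ between releases.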