grafana/dashboards/metrics/standalone/dashboard.md
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Uptime | time() - process_start_time_seconds | stat | Time elapsed since GreptimeDB started. | prometheus | s | __auto |
| Version | SELECT pkg_version FROM information_schema.build_info | stat | GreptimeDB version. | mysql | -- | -- |
| Total Ingestion Rate | sum(rate(greptime_table_operator_ingest_rows[$__rate_interval])) | stat | Total ingestion rate. | prometheus | rowsps | __auto |
| Total Query Rate | sum(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_http_promql_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_http_sql_elapsed_count[$__rate_interval])) + sum(rate(greptime_frontend_grpc_handle_query_elapsed_count[$__rate_interval])) | stat | Total query API call rate across the MySQL, PostgreSQL, HTTP PromQL, HTTP SQL, and gRPC query APIs. | prometheus | reqps | queries |
| User-facing Error Rate | sum(rate(greptime_servers_error[$__rate_interval])) | stat | Server protocol errors returned by frontends. Sustained non-zero values indicate user-visible failures. | prometheus | eps | errors |
| Recent Restarts | sum(changes(process_start_time_seconds[$__range])) | stat | Process restarts over the selected time range across GreptimeDB roles. | prometheus | short | restarts |
| Deployment | SELECT count(*) as datanode FROM information_schema.cluster_info WHERE peer_type = 'DATANODE';<br/>SELECT count(*) as frontend FROM information_schema.cluster_info WHERE peer_type = 'FRONTEND';<br/>SELECT count(*) as metasrv FROM information_schema.cluster_info WHERE peer_type = 'METASRV';<br/>SELECT count(*) as flownode FROM information_schema.cluster_info WHERE peer_type = 'FLOWNODE'; | stat | The deployment topology of GreptimeDB. | mysql | -- | -- |
| Database Resources | SELECT COUNT(*) as databases FROM information_schema.schemata WHERE schema_name NOT IN ('greptime_private', 'information_schema')<br/>SELECT COUNT(*) as tables FROM information_schema.tables WHERE table_schema != 'information_schema'<br/>SELECT COUNT(region_id) as regions FROM information_schema.region_peers<br/>SELECT COUNT(*) as flows FROM information_schema.flows | stat | The number of key resources in GreptimeDB. | mysql | -- | -- |
| Total Storage Size | select SUM(disk_size) from information_schema.region_statistics; | stat | Total size of data files. | mysql | decbytes | -- |
| Total Rows | select SUM(region_rows) from information_schema.region_statistics; | stat | Total number of data rows in the cluster, calculated as the sum of rows across all regions. | mysql | sishort | -- |
| Data Size | SELECT SUM(memtable_size) * 0.42825 as WAL FROM information_schema.region_statistics;<br/>SELECT SUM(index_size) as index FROM information_schema.region_statistics;<br/>SELECT SUM(manifest_size) as manifest FROM information_schema.region_statistics; | stat | The data size of WAL, index, and manifest in GreptimeDB. | mysql | decbytes | -- |
| Total Ingestion Rate Trend | sum(rate(greptime_table_operator_ingest_rows[$__rate_interval])) | timeseries | Total ingestion throughput trend across frontends. Protocol breakdown is in the Ingestion row. | prometheus | rowsps | ingestion |
| Total Query Rate Trend | sum(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_http_promql_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_http_sql_elapsed_count[$__rate_interval])) + sum(rate(greptime_frontend_grpc_handle_query_elapsed_count[$__rate_interval])) | timeseries | Total query API call rate trend across frontend protocols. Protocol breakdown is in the Queries row. | prometheus | reqps | queries |
| HTTP Request P99 and Avg | histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_requests_elapsed_bucket{path!~"/health\|/metrics"}[$__rate_interval])))<br/>sum(rate(greptime_servers_http_requests_elapsed_sum{path!~"/health\|/metrics"}[$__rate_interval])) / sum(rate(greptime_servers_http_requests_elapsed_count{path!~"/health\|/metrics"}[$__rate_interval])) | timeseries | Tail and average latency for HTTP requests served by frontends. Excludes health and metrics endpoints. | prometheus | s | http-p99 |
| gRPC Request P99 and Avg | histogram_quantile(0.99, sum by (le) (rate(greptime_servers_grpc_requests_elapsed_bucket[$__rate_interval])))<br/>sum(rate(greptime_servers_grpc_requests_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_grpc_requests_elapsed_count[$__rate_interval])) | timeseries | Tail and average latency for gRPC requests served by frontends. | prometheus | s | grpc-p99 |

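
The P99 panels in this table rely on `histogram_quantile` over `le` buckets. As a rough illustration of what that function estimates, the following Python sketch interpolates linearly inside the bucket that contains the requested rank. It is a simplified model and not Prometheus source code: the Python function only borrows the PromQL name, and the bucket bounds and counts in `sample` are made up for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` mirrors the `le` series of one histogram: counts are cumulative
    and the last bucket is expected to have an upper bound of +Inf.
    """
    if not buckets:
        return math.nan
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty bucket):
                # fall back to the last finite upper bound.
                return prev_bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: (le, cumulative request count).
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.99, sample))
```

Because the estimate is interpolated, its precision depends on how closely the bucket bounds bracket the real latencies.
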
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Total Ingestion Rate | sum(rate(greptime_table_operator_ingest_rows[$__rate_interval])) | timeseries | Total ingestion rate.<br/>Here we listed 3 primary protocols: | prometheus | rowsps | ingestion |
| Ingestion Rate by Protocol | sum(rate(greptime_table_operator_ingest_rows[$__rate_interval]))<br/>sum(rate(greptime_servers_prometheus_remote_write_samples[$__rate_interval]))<br/>sum(rate(greptime_servers_http_logs_ingestion_counter[$__rate_interval]))<br/>sum(rate(greptime_servers_loki_logs_ingestion_counter[$__rate_interval]))<br/>sum(rate(greptime_servers_elasticsearch_logs_docs_count[$__rate_interval]))<br/>sum(rate(greptime_frontend_otlp_metrics_rows[$__rate_interval]))<br/>sum(rate(greptime_frontend_otlp_logs_rows[$__rate_interval]))<br/>sum(rate(greptime_frontend_otlp_traces_rows[$__rate_interval])) | timeseries | Rows, samples, or documents ingested by primary observability and table-ingestion protocols. | prometheus | rowsps | table-operator |
| Ingestion Latency by Protocol | histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_prometheus_write_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_logs_ingestion_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_loki_logs_ingestion_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_elasticsearch_logs_ingestion_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_otlp_metrics_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_otlp_logs_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_otlp_traces_elapsed_bucket[$__rate_interval])))<br/>sum(rate(greptime_servers_http_prometheus_write_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_prometheus_write_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_http_logs_ingestion_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_logs_ingestion_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_loki_logs_ingestion_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_loki_logs_ingestion_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_elasticsearch_logs_ingestion_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_elasticsearch_logs_ingestion_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_http_otlp_metrics_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_otlp_metrics_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_http_otlp_logs_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_otlp_logs_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_http_otlp_traces_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_otlp_traces_elapsed_count[$__rate_interval])) | timeseries | p99 and average HTTP ingestion latency for Prometheus remote write, logs, Loki, Elasticsearch, and OTLP endpoints. | prometheus | s | prometheus-write |
| Bulk Insert Message Rows and Size | sum(rate(greptime_table_operator_bulk_insert_message_rows_sum[$__rate_interval]))<br/>sum(rate(greptime_table_operator_bulk_insert_message_size_sum[$__rate_interval])) | timeseries | Bulk-insert message row and byte rates. Spikes here can explain frontend bulk-insert latency. | prometheus | rowsps | rows |
| Prom Store Flush Pipeline | sum(rate(greptime_prom_store_flush_total[$__rate_interval]))<br/>sum(rate(greptime_prom_store_flush_rows_sum[$__rate_interval]))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_prom_store_flush_elapsed_bucket[$__rate_interval]))) | timeseries | Remote-write pending-row flush operations, flushed rows, and p99 flush latency. | prometheus | short | flush-ops |
| OTLP Trace Failures | sum(rate(greptime_frontend_otlp_traces_failure_count[$__rate_interval])) | timeseries | OTLP trace ingestion failures reported by frontends. | prometheus | eps | trace-failures |

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Protocol Error Rates | sum by (protocol) (rate(greptime_servers_error[$__rate_interval]))<br/>sum by (code) (rate(greptime_servers_auth_failure_count[$__rate_interval]))<br/>sum by (path, method, code) (rate(greptime_servers_http_requests_elapsed_count{path!~"/health\|/metrics",code!~"2.."}[$__rate_interval]))<br/>sum by (path, code) (rate(greptime_servers_grpc_requests_elapsed_count{code!~"0\|OK"}[$__rate_interval])) | timeseries | User-facing and protocol-level error rates. Use labels to identify whether failures are server, auth, HTTP, or gRPC related. | prometheus | eps | server-{{protocol}} |
| Frontend and Query Rejections | sum(rate(greptime_servers_request_memory_rejected_total[$__rate_interval]))<br/>sum(rate(greptime_query_memory_pool_rejected_total[$__rate_interval])) | timeseries | Request and query memory rejections. Non-zero values indicate requests are being rejected before or during execution. | prometheus | rps | request-memory |
| Datanode Write Failures | sum by (instance, pod) (rate(greptime_datanode_region_request_fail_count[$__rate_interval]))<br/>sum by (instance, pod) (rate(greptime_datanode_region_failed_insert_count[$__rate_interval])) | timeseries | Region request failures and failed inserts on datanodes. These indicate backend write-path errors after routing. | prometheus | eps | region-request-[{{instance}}]-[{{pod}}] |
| Buffered Ingestion Loss | sum(rate(greptime_pending_rows_flush_failures[$__rate_interval]))<br/>sum(rate(greptime_pending_rows_flush_dropped_rows[$__rate_interval])) | timeseries | Pending-row flush failures and dropped rows. Sustained non-zero dropped rows are a data-loss signal. | prometheus | eps | flush-failures |
| Mito Backpressure and Failures | sum(rate(greptime_mito_write_reject_total[$__rate_interval]))<br/>sum(rate(greptime_mito_write_stall_total[$__rate_interval]))<br/>sum(rate(greptime_mito_flush_failure_total[$__rate_interval]))<br/>sum(rate(greptime_mito_compaction_failure_total[$__rate_interval])) | timeseries | Storage-engine write rejects, write stalls, flush failures, and compaction failures on datanodes. | prometheus | eps | write-reject |
| Scan and Compaction Memory Rejects | sum(rate(greptime_mito_scan_requests_rejected_total[$__rate_interval]))<br/>sum(rate(greptime_mito_scan_memory_exhausted_total[$__rate_interval]))<br/>sum(rate(greptime_mito_compaction_memory_rejected_total[$__rate_interval])) | timeseries | Datanode scan and compaction memory rejection/exhaustion counters. | prometheus | rps | scan-rejected |
| OpenDAL Errors | sum by (scheme, operation, error) (rate(opendal_operation_errors_total{error!="NotFound"}[$__rate_interval])) | timeseries | Object-store errors by scheme, operation, and error, excluding NotFound noise. | prometheus | eps | {{scheme}}-{{operation}}-{{error}} |
| Metasrv Failures | sum(rate(greptime_meta_region_migration_fail[$__rate_interval]))<br/>sum(rate(greptime_meta_reconciliation_procedure_error[$__rate_interval])) | timeseries | Region migration and reconciliation failures in metasrv. | prometheus | eps | migration-fail |
| Flow and Trigger Failures | sum by (code) (rate(greptime_flow_errors[$__rate_interval]))<br/>sum(rate(greptime_trigger_evaluate_failure_count[$__rate_interval]))<br/>sum(rate(greptime_trigger_send_alert_failure_count[$__rate_interval]))<br/>sum(rate(greptime_trigger_save_alert_record_failure_count[$__rate_interval])) | timeseries | Derived-data and alerting pipeline failures. | prometheus | eps | flow-{{code}} |
| Mito GC Failures | sum(rate(greptime_mito_gc_errors_total[$__rate_interval]))<br/>sum(rate(greptime_mito_gc_orphaned_index_files[$__rate_interval]))<br/>sum(rate(greptime_mito_gc_skipped_unparsable_files[$__rate_interval])) | timeseries | Mito garbage-collection errors and skipped/orphaned files on datanodes. | prometheus | short | gc-errors |

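
Every panel in this table applies `rate()` to a counter. The sketch below shows the idea in simplified form, assuming the samples are already restricted to one window and ignoring the extrapolation Prometheus applies at window boundaries. A drop in the counter value is treated as a process restart, so the value seen after the reset counts as new increase. The helper names `counter_increase` and `simple_rate` are hypothetical.

```python
def counter_increase(samples):
    """Sum the increase of a counter across samples, tolerating resets."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # Counter reset: the counter restarted from zero, so everything
            # observed after the reset is new increase.
            increase += cur
        else:
            increase += cur - prev
    return increase


def simple_rate(samples, window_seconds):
    """Per-second rate over the window, without boundary extrapolation."""
    return counter_increase(samples) / window_seconds


# Hypothetical error counter that restarts between the 2nd and 3rd sample.
print(simple_rate([5, 8, 2, 4], 60))
```

This is why a restart does not show up as a negative error rate on these panels.
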
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Runtime Threads | sum by (instance, pod) (greptime_runtime_threads_alive)<br/>sum by (instance, pod) (greptime_runtime_threads_idle) | timeseries | Runtime thread pool size and idle threads by instance. Low idle threads during latency spikes can indicate executor saturation. | prometheus | short | alive-[{{instance}}]-[{{pod}}] |
| Request Memory Utilization | sum by (instance, pod) (greptime_servers_request_memory_in_use_bytes) / sum by (instance, pod) (greptime_servers_request_memory_limit_bytes) | timeseries | Frontend request memory usage divided by configured request memory limit. | prometheus | percentunit | [{{instance}}]-[{{pod}}] |
| Query Memory Usage | sum by (instance, pod) (greptime_query_memory_pool_usage_bytes) | timeseries | Query memory pool usage. Use this with query memory rejection panels to diagnose query saturation. | prometheus | bytes | [{{instance}}]-[{{pod}}] |
| Scan and Compaction Memory | sum by (instance, pod) (greptime_mito_scan_memory_usage_bytes)<br/>sum by (instance, pod) (greptime_mito_compaction_memory_in_use_bytes)<br/>sum by (instance, pod) (greptime_mito_compaction_memory_limit_bytes) | timeseries | Datanode scan memory usage and compaction memory utilization. | prometheus | bytes | scan-[{{instance}}]-[{{pod}}] |
| Write Buffer and Active Stalling | sum by (instance, pod) (greptime_mito_write_buffer_bytes)<br/>sum by (instance, pod) (greptime_mito_write_stalling_count) | timeseries | Mito write buffer bytes and active write-stalling gauges. Growth here indicates write-path backpressure. | prometheus | bytes | buffer-[{{instance}}]-[{{pod}}] |
| Prom Store Backlog | sum by (instance, pod) (greptime_prom_store_pending_rows)<br/>sum by (instance, pod) (greptime_prom_store_pending_batches)<br/>sum by (instance, pod) (greptime_prom_store_pending_workers) | timeseries | Prometheus remote-write pending rows, batches, and workers. Rising pending rows indicate remote-write buffering backlog. | prometheus | short | rows-[{{instance}}]-[{{pod}}] |
| Inflight Flush and Compaction | sum by (instance, pod) (greptime_mito_inflight_flush_count)<br/>sum by (instance, pod) (greptime_mito_inflight_compaction_count) | timeseries | Current in-flight flush and compaction tasks on datanodes. | prometheus | short | flush-[{{instance}}]-[{{pod}}] |

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Frontend CPU Usage per Instance | sum(rate(process_cpu_seconds_total[$__rate_interval]) * 1000) by (instance, pod)<br/>max(greptime_cpu_limit_in_millicores) | timeseries | Current CPU usage by instance. | prometheus | none | [{{ instance }}]-[{{ pod }}]-cpu |
| Datanode CPU Usage per Instance | sum(rate(process_cpu_seconds_total[$__rate_interval]) * 1000) by (instance, pod)<br/>max(greptime_cpu_limit_in_millicores) | timeseries | Current CPU usage by instance. | prometheus | none | [{{ instance }}]-[{{ pod }}] |
| Metasrv CPU Usage per Instance | sum(rate(process_cpu_seconds_total[$__rate_interval]) * 1000) by (instance, pod)<br/>max(greptime_cpu_limit_in_millicores) | timeseries | Current CPU usage by instance. | prometheus | none | [{{ instance }}]-[{{ pod }}] |
| Frontend Memory per Instance | sum(process_resident_memory_bytes) by (instance, pod)<br/>max(greptime_memory_limit_in_bytes) | timeseries | Current memory usage by instance. | prometheus | bytes | [{{ instance }}]-[{{ pod }}] |
| Datanode Memory per Instance | sum(process_resident_memory_bytes) by (instance, pod)<br/>max(greptime_memory_limit_in_bytes) | timeseries | Current memory usage by instance. | prometheus | bytes | [{{instance}}]-[{{ pod }}] |
| Metasrv Memory per Instance | sum(process_resident_memory_bytes) by (instance, pod)<br/>max(greptime_memory_limit_in_bytes) | timeseries | Current memory usage by instance. | prometheus | bytes | [{{ instance }}]-[{{ pod }}]-resident |
| Flownode CPU Usage per Instance | sum(rate(process_cpu_seconds_total[$__rate_interval]) * 1000) by (instance, pod)<br/>max(greptime_cpu_limit_in_millicores) | timeseries | Current CPU usage by instance. | prometheus | none | [{{ instance }}]-[{{ pod }}] |
| Flownode Memory per Instance | sum(process_resident_memory_bytes) by (instance, pod)<br/>max(greptime_memory_limit_in_bytes) | timeseries | Current memory usage by instance. | prometheus | bytes | [{{ instance }}]-[{{ pod }}] |

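
The CPU panels above multiply `rate(process_cpu_seconds_total[...])` by 1000 so that usage is expressed in millicores, the same unit as `greptime_cpu_limit_in_millicores`. The following sketch shows that conversion under simplifying assumptions: two raw counter samples, no counter reset between them, and hypothetical values. `cpu_millicores` is an illustrative helper, not part of GreptimeDB or Grafana.

```python
def cpu_millicores(start_cpu_seconds, end_cpu_seconds, window_seconds):
    """Convert a CPU-seconds counter delta into average millicores."""
    # CPU seconds consumed per wall-clock second equals cores in use.
    cores = (end_cpu_seconds - start_cpu_seconds) / window_seconds
    return cores * 1000


# Hypothetical: 30 CPU-seconds consumed over a 60-second window.
usage = cpu_millicores(120.0, 150.0, 60.0)
limit = 2000  # hypothetical limit, as greptime_cpu_limit_in_millicores would report
print(f"{usage:.0f}m of {limit}m ({usage / limit:.0%})")
```

Plotting the limit next to the usage series makes it easy to see how close each instance is to being throttled.
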
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Query Rate by Protocol | sum(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_http_promql_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_http_sql_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_frontend_grpc_handle_query_elapsed_count[$__rate_interval])) | timeseries | Query API call rates by protocol, collected from frontends. | prometheus | reqps | mysql |
| Query Latency by Protocol | histogram_quantile(0.95, sum by (le) (rate(greptime_servers_mysql_query_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_mysql_query_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.95, sum by (le) (rate(greptime_servers_postgres_query_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_postgres_query_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.95, sum by (le) (rate(greptime_servers_http_promql_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_promql_elapsed_bucket[$__rate_interval])))<br/>sum(rate(greptime_servers_mysql_query_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_postgres_query_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_servers_http_promql_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_promql_elapsed_count[$__rate_interval]))<br/>sum(rate(greptime_frontend_grpc_handle_query_elapsed_sum[$__rate_interval])) / sum(rate(greptime_frontend_grpc_handle_query_elapsed_count[$__rate_interval])) | timeseries | p95, p99, and average query latency by main frontend protocol. | prometheus | s | mysql-p95 |
| Query Stage Latency | histogram_quantile(0.95, sum by (le, stage) (rate(greptime_query_stage_elapsed_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by (le, stage) (rate(greptime_query_stage_elapsed_bucket[$__rate_interval]))) | timeseries | p95 and p99 latency by query stage. Use stage labels to identify planning, scan, or merge bottlenecks. | prometheus | s | p95-{{stage}} |
| Merge Scan Fan-out and Errors | sum by (instance, pod) (greptime_query_merge_scan_regions)<br/>sum by (instance, pod) (rate(greptime_query_merge_scan_errors_total[$__rate_interval])) | timeseries | Merge-scan region fan-out and errors. High fan-out can explain slow distributed table scans. | prometheus | short | regions-[{{instance}}]-[{{pod}}] |
| Pushdown Fallback Errors | sum(rate(greptime_push_down_fallback_errors_total[$__rate_interval])) | timeseries | Failed query pushdown fallback attempts. Non-zero values can indicate optimization paths that increase scan work. | prometheus | eps | pushdown-fallback-errors |
| PromQL Series Count | sum by (instance, pod) (greptime_promql_series_count) | timeseries | Series count touched by PromQL queries. Correlate this with PromQL latency to identify cardinality-driven slowness. | prometheus | short | [{{instance}}]-[{{pod}}] |
| Connections and Prepared Statements | sum by (instance, pod) (greptime_servers_mysql_connection_count)<br/>sum by (instance, pod) (greptime_servers_postgres_connection_count)<br/>sum by (instance, pod) (rate(greptime_servers_mysql_prepared_count[$__rate_interval]))<br/>sum by (instance, pod) (rate(greptime_servers_postgres_prepared_count[$__rate_interval])) | timeseries | MySQL/PostgreSQL connection and prepared-statement counts. Spikes can indicate client storms or leaks. | prometheus | short | mysql-connections-[{{instance}}]-[{{pod}}] |

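
The average-latency queries in this table divide the rate of a histogram's `_sum` series by the rate of its `_count` series. Both rates cover the same window, so the window length cancels and the result is the mean duration per request in that window. A minimal sketch with hypothetical sample values (the helper name `average_latency` is illustrative only, and counter resets are ignored):

```python
import math


def average_latency(sum_start, sum_end, count_start, count_end):
    """Mean seconds per request between two scrapes of _sum and _count."""
    requests = count_end - count_start
    if requests == 0:
        # No requests in the window: the PromQL division yields NaN as well.
        return math.nan
    return (sum_end - sum_start) / requests


# Hypothetical: 30 queries took 6 seconds in total during the window.
print(average_latency(10.0, 16.0, 100, 130))
```

An average can hide a slow tail, which is why the same panel also plots p95 and p99 series.
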
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| HTTP QPS per Instance | sum by(instance, pod, path, method, code) (rate(greptime_servers_http_requests_elapsed_count{path!~"/health\|/metrics"}[$__rate_interval])) | timeseries | HTTP QPS per Instance. | prometheus | reqps | [{{instance}}]-[{{pod}}]-[{{path}}]-[{{method}}]-[{{code}}] |
| HTTP P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, path, method, code) (rate(greptime_servers_http_requests_elapsed_bucket{path!~"/health\|/metrics"}[$__rate_interval])))<br/>sum by(instance, pod, path, method, code) (rate(greptime_servers_http_requests_elapsed_sum{path!~"/health\|/metrics"}[$__rate_interval])) / sum by(instance, pod, path, method, code) (rate(greptime_servers_http_requests_elapsed_count{path!~"/health\|/metrics"}[$__rate_interval])) | timeseries | HTTP P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{path}}]-[{{method}}]-[{{code}}]-p99 |
| gRPC QPS per Instance | sum by(instance, pod, path, code) (rate(greptime_servers_grpc_requests_elapsed_count[$__rate_interval])) | timeseries | gRPC QPS per Instance. | prometheus | reqps | [{{instance}}]-[{{pod}}]-[{{path}}]-[{{code}}] |
| gRPC P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, path, code) (rate(greptime_servers_grpc_requests_elapsed_bucket[$__rate_interval])))<br/>sum by(instance, pod, path, code) (rate(greptime_servers_grpc_requests_elapsed_sum[$__rate_interval])) / sum by(instance, pod, path, code) (rate(greptime_servers_grpc_requests_elapsed_count[$__rate_interval])) | timeseries | gRPC P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{path}}]-[{{method}}]-[{{code}}]-p99 |
| MySQL QPS per Instance | sum by(pod, instance)(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval])) | timeseries | MySQL QPS per Instance. | prometheus | reqps | [{{instance}}]-[{{pod}}] |
| MySQL P99 and Avg per Instance | histogram_quantile(0.99, sum by(pod, instance, le) (rate(greptime_servers_mysql_query_elapsed_bucket[$__rate_interval])))<br/>sum by(pod, instance) (rate(greptime_servers_mysql_query_elapsed_sum[$__rate_interval])) / sum by(pod, instance) (rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval])) | timeseries | MySQL P99 and average per Instance. | prometheus | s | [{{ instance }}]-[{{ pod }}]-p99 |
| PostgreSQL QPS per Instance | sum by(pod, instance)(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval])) | timeseries | PostgreSQL QPS per Instance. | prometheus | reqps | [{{instance}}]-[{{pod}}] |
| PostgreSQL P99 and Avg per Instance | histogram_quantile(0.99, sum by(pod,instance,le) (rate(greptime_servers_postgres_query_elapsed_bucket[$__rate_interval])))<br/>sum by(pod, instance) (rate(greptime_servers_postgres_query_elapsed_sum[$__rate_interval])) / sum by(pod, instance) (rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval])) | timeseries | PostgreSQL P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-p99 |

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Region Call QPS per Instance | sum by(instance, pod, request_type) (rate(greptime_grpc_region_request_count[$__rate_interval])) | timeseries | Region Call QPS per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}]-[{{request_type}}] |
| Region Call P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, request_type) (rate(greptime_grpc_region_request_bucket[$__rate_interval])))<br/>sum by(instance, pod, request_type) (rate(greptime_grpc_region_request_sum[$__rate_interval])) / sum by(instance, pod, request_type) (rate(greptime_grpc_region_request_count[$__rate_interval])) | timeseries | Region Call P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{request_type}}] |
| Frontend Handle Bulk Insert Elapsed Time | sum by(instance, pod, stage) (rate(greptime_table_operator_handle_bulk_insert_sum[$__rate_interval]))/sum by(instance, pod, stage) (rate(greptime_table_operator_handle_bulk_insert_count[$__rate_interval]))<br/>histogram_quantile(0.99, sum by(instance, pod, stage, le) (rate(greptime_table_operator_handle_bulk_insert_bucket[$__rate_interval]))) | timeseries | Per-stage time for frontend to handle bulk insert requests | prometheus | s | [{{instance}}]-[{{pod}}]-[{{stage}}]-AVG |

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Region Request Failures and Failed Inserts | sum by (instance, pod) (rate(greptime_datanode_region_request_fail_count[$__rate_interval]))<br/>sum by (instance, pod) (rate(greptime_datanode_region_failed_insert_count[$__rate_interval])) | timeseries | Datanode region request failures and failed inserts by instance. | prometheus | eps | request-fail-[{{instance}}]-[{{pod}}] |
| Write Rejects and Stalls | sum by (instance, pod) (rate(greptime_mito_write_reject_total[$__rate_interval]))<br/>sum by (instance, pod) (rate(greptime_mito_write_stall_total[$__rate_interval]))<br/>sum by (instance, pod) (greptime_mito_write_stalling_count) | timeseries | Mito write rejects, write stall events, and active write stalling by datanode. | prometheus | short | reject-[{{instance}}]-[{{pod}}] |
| Flush and Compaction Failures | sum by (instance, pod) (rate(greptime_mito_flush_failure_total[$__rate_interval]))<br/>sum by (instance, pod) (rate(greptime_mito_compaction_failure_total[$__rate_interval])) | timeseries | Mito flush and compaction failure rates by datanode. | prometheus | eps | flush-[{{instance}}]-[{{pod}}] |
| Mito GC Health | sum(rate(greptime_mito_gc_runs_total[$__rate_interval]))<br/>sum(rate(greptime_mito_gc_errors_total[$__rate_interval]))<br/>sum(rate(greptime_mito_gc_files_deleted_total[$__rate_interval]))<br/>sum(rate(greptime_mito_gc_orphaned_index_files[$__rate_interval]))<br/>sum(rate(greptime_mito_gc_skipped_unparsable_files[$__rate_interval])) | timeseries | Mito garbage-collection runs, errors, deleted files, orphaned index files, and skipped unparsable files. | prometheus | short | runs |
| Mito GC Duration | histogram_quantile(0.99, sum by (le, stage) (rate(greptime_mito_gc_duration_seconds_bucket[$__rate_interval])))<br/>sum by (stage) (rate(greptime_mito_gc_duration_seconds_sum[$__rate_interval])) / sum by (stage) (rate(greptime_mito_gc_duration_seconds_count[$__rate_interval])) | timeseries | P99 and average Mito garbage-collection duration by stage. | prometheus | s | {{stage}}-p99 |

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Request OPS per Instance | sum by(instance, pod, type) (rate(greptime_mito_handle_request_elapsed_count[$__rate_interval])) | timeseries | Request OPS per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}]-[{{type}}] |
| Request P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, type) (rate(greptime_mito_handle_request_elapsed_bucket[$__rate_interval])))<br/>sum by(instance, pod, type) (rate(greptime_mito_handle_request_elapsed_sum[$__rate_interval])) / sum by(instance, pod, type) (rate(greptime_mito_handle_request_elapsed_count[$__rate_interval])) | timeseries | Request P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{type}}] |
| Request Wait P99 and Avg per Worker | histogram_quantile(0.95, sum by(instance, pod, worker, le) (rate(greptime_mito_request_wait_time_bucket[$__rate_interval])))<br/>histogram_quantile(0.99, sum by(instance, pod, worker, le) (rate(greptime_mito_request_wait_time_bucket[$__rate_interval])))<br/>sum by(instance, pod, worker) (rate(greptime_mito_request_wait_time_sum[$__rate_interval])) / sum by(instance, pod, worker) (rate(greptime_mito_request_wait_time_count[$__rate_interval])) | timeseries | Time Mito requests spend waiting before region worker handling. Use this with request service latency to distinguish queueing from execution time. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{worker}}]-p95 |
| Write Buffer per Instance | greptime_mito_write_buffer_bytes | timeseries | Write Buffer per Instance. | prometheus | decbytes | [{{instance}}]-[{{pod}}] |
| Write Rows per Instance | sum by (instance, pod) (rate(greptime_mito_write_rows_total[$__rate_interval])) | timeseries | Rows written per second, by instance. | prometheus | rowsps | [{{instance}}]-[{{pod}}] |
| Read Stage OPS per Instance | sum by(instance, pod) (rate(greptime_mito_read_stage_elapsed_count{stage="total"}[$__rate_interval])) | timeseries | Read Stage OPS per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}] |
| Read Stage P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, stage) (rate(greptime_mito_read_stage_elapsed_bucket[$__rate_interval])))<br/>sum by(instance, pod, stage) (rate(greptime_mito_read_stage_elapsed_sum[$__rate_interval])) / sum by(instance, pod, stage) (rate(greptime_mito_read_stage_elapsed_count[$__rate_interval])) | timeseries | Read Stage P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{stage}}] |
| Write Stage P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, stage) (rate(greptime_mito_write_stage_elapsed_bucket[$__rate_interval])))<br/>sum by(instance, pod, stage) (rate(greptime_mito_write_stage_elapsed_sum[$__rate_interval])) / sum by(instance, pod, stage) (rate(greptime_mito_write_stage_elapsed_count[$__rate_interval])) | timeseries | Write Stage P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{stage}}] |
| Cached Bytes per Instance | greptime_mito_cache_bytes | timeseries | Cached Bytes per Instance. | prometheus | decbytes | [{{instance}}]-[{{pod}}]-[{{type}}] |
| Region Worker Handle Bulk Insert Requests | histogram_quantile(0.95, sum by(le,instance, stage, pod) (rate(greptime_region_worker_handle_write_bucket[$__rate_interval]))) | |||||
sum by(instance, stage, pod) (rate(greptime_region_worker_handle_write_sum[$__rate_interval]))/sum by(instance, stage, pod) (rate(greptime_region_worker_handle_write_count[$__rate_interval])) | timeseries | Per-stage elapsed time for region worker to handle bulk insert region requests. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{stage}}]-P95 | |
| Active Series and Field Builders Count | sum by(instance, pod) (greptime_mito_memtable_active_series_count) | |||||
sum by(instance, pod) (greptime_mito_memtable_field_builder_count) | timeseries | Active series and field-builder counts per memtable by instance. | prometheus | none | [{{instance}}]-[{{pod}}]-series | |
| Region Worker Convert Requests | histogram_quantile(0.95, sum by(le, instance, stage, pod) (rate(greptime_datanode_convert_region_request_bucket[$__rate_interval]))) | |||||
sum by(instance, stage, pod) (rate(greptime_datanode_convert_region_request_sum[$__rate_interval]))/sum by(instance, stage, pod) (rate(greptime_datanode_convert_region_request_count[$__rate_interval])) | timeseries | Per-stage elapsed time for region worker to decode requests. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{stage}}]-P95 | |
| Cache Miss | sum by (instance,pod, type) (rate(greptime_mito_cache_miss[$__rate_interval])) | timeseries | Local cache miss rate of each datanode, by cache type. | prometheus | -- | [{{instance}}]-[{{pod}}]-[{{type}}] |
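
The "P99 and Avg" panels in this table derive the average by dividing the rate of a histogram's `_sum` series by the rate of its `_count` series. The following is a minimal Python sketch of that arithmetic, assuming two hypothetical scrapes of the same counters; the `avg_latency` helper and the sample values are illustrative and are not part of GreptimeDB or Prometheus.

```python
def avg_latency(sum_prev, sum_now, count_prev, count_now):
    """Mean observation between two scrapes of a histogram's _sum and _count.

    Mirrors rate(x_sum[w]) / rate(x_count[w]): the window length cancels out,
    so only the counter deltas matter. Assumes no counter reset in the window.
    """
    delta_count = count_now - count_prev
    if delta_count == 0:
        return None  # no observations in the window; PromQL yields NaN here
    return (sum_now - sum_prev) / delta_count


# Hypothetical samples: 50 more operations took 2.5 more seconds in total.
print(avg_latency(10.0, 12.5, 400, 450))  # 0.05
```

Because the result is a mean, a few very slow operations can be hidden by many fast ones, which is why these panels plot the average next to a high quantile.
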
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Flush OPS per Instance | sum by(instance, pod, reason) (rate(greptime_mito_flush_requests_total[$__rate_interval])) | timeseries | Flush OPS per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}]-[{{reason}}] |
| Flush Elapsed Time | histogram_quantile(0.95, sum by (instance, pod, le, type) (rate(greptime_mito_flush_elapsed_bucket[$__rate_interval]))) | |||||
histogram_quantile(0.99, sum by (instance, pod, le, type) (rate(greptime_mito_flush_elapsed_bucket[$__rate_interval]))) | ||||||
sum by (instance, pod, type) (rate(greptime_mito_flush_elapsed_sum[$__rate_interval])) / sum by (instance, pod, type) (rate(greptime_mito_flush_elapsed_count[$__rate_interval])) | timeseries | Mito flush p95, p99, and average elapsed time by datanode and flush type. Use this to identify slow flush jobs. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{type}}]-p95 | |
| Flush Throughput | sum by (instance, pod) (rate(greptime_mito_flush_bytes_total[$__rate_interval])) | |||||
sum by (instance, pod) (rate(greptime_mito_flush_file_total[$__rate_interval])) | timeseries | Mito flushed bytes and flushed file rates. Use this with flush elapsed time to distinguish slow jobs from large jobs. | prometheus | Bps | [{{instance}}]-[{{pod}}]-bytes | |
| Inflight Flush | greptime_mito_inflight_flush_count | timeseries | Ongoing flush task count | prometheus | none | [{{instance}}]-[{{pod}}] |
| Compaction OPS per Instance | sum by(instance, pod) (rate(greptime_mito_compaction_total_elapsed_count[$__rate_interval])) | timeseries | Compaction OPS per Instance. | prometheus | ops | [{{ instance }}]-[{{pod}}] |
| Inflight Compaction | greptime_mito_inflight_compaction_count | timeseries | Ongoing compaction task count | prometheus | none | [{{instance}}]-[{{pod}}] |
| Compaction P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le) (rate(greptime_mito_compaction_total_elapsed_bucket[$__rate_interval]))) | |||||
sum by(instance, pod) (rate(greptime_mito_compaction_total_elapsed_sum[$__rate_interval])) / sum by(instance, pod) (rate(greptime_mito_compaction_total_elapsed_count[$__rate_interval])) | timeseries | Compaction P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-p99 | |
| Compaction Elapsed Time per Instance by Stage | histogram_quantile(0.99, sum by(instance, pod, le, stage) (rate(greptime_mito_compaction_stage_elapsed_bucket[$__rate_interval]))) | |||||
sum by(instance, pod, stage) (rate(greptime_mito_compaction_stage_elapsed_sum[$__rate_interval]))/sum by(instance, pod, stage) (rate(greptime_mito_compaction_stage_elapsed_count[$__rate_interval])) | timeseries | Compaction latency by stage | prometheus | s | [{{instance}}]-[{{pod}}]-[{{stage}}]-p99 | |
| Compaction Input/Output Bytes | sum by(instance, pod) (rate(greptime_mito_compaction_input_bytes[$__rate_interval])) | |||||
sum by(instance, pod) (rate(greptime_mito_compaction_output_bytes[$__rate_interval])) | timeseries | Compaction input and output bytes by datanode. Use this to correlate compaction latency with rewritten data volume. | prometheus | Bps | [{{instance}}]-[{{pod}}]-input |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Index Apply Elapsed Time | histogram_quantile(0.95, sum by (le, type) (rate(greptime_index_apply_elapsed_bucket[$__rate_interval]))) | |||||
histogram_quantile(0.99, sum by (le, type) (rate(greptime_index_apply_elapsed_bucket[$__rate_interval]))) | timeseries | Index apply p95 and p99 elapsed time by index type. Slow apply can increase read latency for indexed predicates. | prometheus | s | {{type}}-p95 | |
| Index Create Elapsed Time | histogram_quantile(0.95, sum by (le, stage, type) (rate(greptime_index_create_elapsed_bucket[$__rate_interval]))) | |||||
histogram_quantile(0.99, sum by (le, stage, type) (rate(greptime_index_create_elapsed_bucket[$__rate_interval]))) | timeseries | Index create p95 and p99 elapsed time by stage and index type. Slow stages can explain flush or compaction delays. | prometheus | s | {{type}}-{{stage}}-p95 | |
| Index Create Rows and Bytes | sum by (type) (rate(greptime_index_create_rows_total[$__rate_interval])) | |||||
sum by (type) (rate(greptime_index_create_bytes_total[$__rate_interval])) | timeseries | Rows and bytes produced by index creation by index type. Spikes here can explain storage write pressure. | prometheus | rowsps | {{type}}-rows | |
| Index Memory Usage | greptime_index_apply_memory_usage | |||||
sum by (type) (greptime_index_create_memory_usage) | timeseries | Memory used while applying and creating indexes. Growth here can explain memory pressure during indexed flush or compaction work. | prometheus | bytes | apply | |
| Index IO Bytes | sum by (type, file_type) (rate(greptime_index_io_bytes_total[$__rate_interval])) | timeseries | Index read and write byte rates by operation and file type for puffin and intermediate files. | prometheus | Bps | {{type}}-{{file_type}} |
| Index IO Operations | sum by (type, file_type) (rate(greptime_index_io_op_total[$__rate_interval])) | timeseries | Index IO operation rates by operation and file type, including read, write, seek, and flush operations. | prometheus | ops | {{type}}-{{file_type}} |
| Index Cache | sum by (type) (rate(greptime_mito_cache_hit{type=~"index.*|vector_index|index_result"}[$__rate_interval])) | |||||
sum by (type) (rate(greptime_mito_cache_miss{type=~"index.*|vector_index|index_result"}[$__rate_interval])) | ||||||
sum by (type, cause) (rate(greptime_mito_cache_eviction{type=~"index.*|vector_index|index_result"}[$__rate_interval])) | timeseries | Index-related cache hits, misses, and evictions from Mito caches. | prometheus | ops | hit-{{type}} |
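
The p95 and p99 panels in this table use `histogram_quantile` over `_bucket` series. The sketch below is a simplified Python version of the linear interpolation that Prometheus documents for classic histograms. It assumes `0 < q <= 1`, non-negative bucket bounds, and a final `+Inf` bucket; the bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 < q <= 1, non-negative bounds, and that the
    last bucket is +Inf (so its count is the total number of observations).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # fall back to the highest finite bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count


# Hypothetical elapsed-time buckets in seconds.
example = [(0.1, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, example))  # approximately 0.75
```

The estimate can never be more precise than the bucket layout: in the example, any p99 between 0.5 s and 1.0 s is reported by interpolation within that single bucket.
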
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Inactive and Lease-expired Regions | sum(greptime_meta_inactive_regions) | |||||
sum(greptime_lease_expired_region) | timeseries | Inactive regions and expired region leases. Non-zero values indicate metasrv or routing health issues. | prometheus | short | inactive-regions | |
| Heartbeat Health | sum(rate(greptime_meta_heartbeat_rate[$__rate_interval])) | |||||
sum(greptime_meta_heartbeat_connection_num) | ||||||
sum(rate(greptime_frontend_heartbeat_send_count[$__rate_interval])) | ||||||
sum(rate(greptime_frontend_heartbeat_recv_count[$__rate_interval])) | ||||||
sum(rate(greptime_datanode_heartbeat_send_count[$__rate_interval])) | ||||||
sum(rate(greptime_datanode_heartbeat_recv_count[$__rate_interval])) | timeseries | Metasrv heartbeat receive rate, heartbeat connections, and frontend/datanode heartbeat send/receive counters. | prometheus | short | meta-recv-rate | |
| Region migration datanode | greptime_meta_region_migration_stat{datanode_type="src"} | |||||
greptime_meta_region_migration_stat{datanode_type="desc"} | status-history | Counter of region migration by source and destination | prometheus | -- | from-datanode-{{datanode_id}} | |
| Region migration error | rate(greptime_meta_region_migration_error[$__rate_interval]) | timeseries | Counter of region migration error | prometheus | none | {{pod}}-{{state}}-{{error_type}} |
| Datanode load | greptime_datanode_load | timeseries | Gauge of load information of each datanode, collected via heartbeat between datanode and metasrv. This information is for metasrv to schedule workloads. | prometheus | binBps | Datanode-{{datanode_id}}-writeload |
| Rate of SQL Executions (RDS) | rate(greptime_meta_rds_pg_sql_execute_elapsed_ms_count[$__rate_interval]) | timeseries | Displays the rate of SQL executions processed by the Meta service using the RDS backend. | prometheus | none | {{pod}} {{op}} {{type}} {{result}} |
| SQL Execution Latency (RDS) | histogram_quantile(0.90, sum by(pod, op, type, result, le) (rate(greptime_meta_rds_pg_sql_execute_elapsed_ms_bucket[$__rate_interval]))) | |||||
sum by(pod, op, type, result) (rate(greptime_meta_rds_pg_sql_execute_elapsed_ms_sum[$__rate_interval])) / sum by(pod, op, type, result) (rate(greptime_meta_rds_pg_sql_execute_elapsed_ms_count[$__rate_interval])) | timeseries | Measures the response time of SQL executions via the RDS backend. | prometheus | ms | {{pod}} {{op}} {{type}} {{result}} p90 | |
| Handler Execution Latency | histogram_quantile(0.90, sum by(pod, le, name) (rate(greptime_meta_handler_execute_bucket[$__rate_interval]))) | |||||
sum by(pod, name) (rate(greptime_meta_handler_execute_sum[$__rate_interval])) / sum by(pod, name) (rate(greptime_meta_handler_execute_count[$__rate_interval])) | timeseries | Shows latency of Meta handlers by pod and handler name, useful for monitoring handler performance and detecting latency spikes. | prometheus | s | {{pod}} {{name}} p90 | |
| Heartbeat Packet Size | histogram_quantile(0.9, sum by(pod, le) (rate(greptime_meta_heartbeat_stat_memory_size_bucket[$__rate_interval]))) | timeseries | Shows p90 heartbeat message sizes, helping track network usage and identify anomalies in heartbeat payload. | prometheus | bytes | {{pod}} |
| Meta Heartbeat Receive Rate | rate(greptime_meta_heartbeat_rate[$__rate_interval]) | timeseries | Rate of heartbeats received by metasrv from datanodes and frontends. | prometheus | s | {{pod}} |
| Meta KV Ops Latency | histogram_quantile(0.99, sum by(pod, le, op, target) (rate(greptime_meta_kv_request_elapsed_bucket[$__rate_interval]))) | |||||
sum by(pod, op, target) (rate(greptime_meta_kv_request_elapsed_sum[$__rate_interval])) / sum by(pod, op, target) (rate(greptime_meta_kv_request_elapsed_count[$__rate_interval])) | timeseries | p99 and average latency of metasrv key-value store operations by op and target. | prometheus | s | {{pod}}-{{op}} p99 | |
| Rate of meta KV Ops | rate(greptime_meta_kv_request_elapsed_count[$__rate_interval]) | timeseries | Rate of metasrv key-value store operations by op. | prometheus | none | {{pod}}-{{op}} p99 |
| DDL Latency | histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_create_tables_bucket[$__rate_interval]))) | |||||
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_create_table_bucket[$__rate_interval]))) | ||||||
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_create_view_bucket[$__rate_interval]))) | ||||||
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_create_flow_bucket[$__rate_interval]))) | ||||||
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_drop_table_bucket[$__rate_interval]))) | ||||||
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_alter_table_bucket[$__rate_interval]))) | timeseries | p90 latency of metasrv DDL procedures (create/alter/drop table, create view/flow) by step. | prometheus | s | CreateLogicalTables-{{step}} p90 | |
| Reconciliation stats | rate(greptime_meta_reconciliation_stats[$__rate_interval]) | timeseries | Reconciliation stats | prometheus | ops | {{pod}}-{{table_type}}-{{type}} |
| Reconciliation steps | histogram_quantile(0.9, sum by(le, procedure_name, step) (rate(greptime_meta_reconciliation_procedure_bucket[$__rate_interval]))) | timeseries | p90 elapsed time of reconciliation procedure steps. | prometheus | s | {{procedure_name}}-{{step}}-P90 |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Hotspot Regions | WITH table_stats AS (SELECT table_id, COUNT(*) AS region_count, SUM(disk_size) AS total_disk_size, SUM(region_rows) AS total_region_rows FROM information_schema.region_statistics WHERE region_role = 'Leader' GROUP BY table_id HAVING COUNT(*) > 1) SELECT t.table_schema, t.table_name, r.region_id, t.table_id, r.region_number, p.partition_description, ROUND(r.disk_size * 100.0 / NULLIF(ts.total_disk_size, 0), 2) AS disk_size_share_percent, r.disk_size, ROUND(r.region_rows * 100.0 / NULLIF(ts.total_region_rows, 0), 2) AS region_rows_share_percent, r.region_rows FROM information_schema.region_statistics r JOIN table_stats ts ON r.table_id = ts.table_id JOIN information_schema.tables t ON r.table_id = t.table_id LEFT JOIN information_schema.partitions p ON p.table_schema = t.table_schema AND p.table_name = t.table_name AND p.greptime_partition_id = r.region_id WHERE r.region_role = 'Leader' ORDER BY region_rows_share_percent DESC LIMIT 100; | table | | mysql | -- | -- |
| Datanode Load(Write) | greptime_datanode_history_load | timeseries | Write load of each datanode over time. | prometheus | binBps | datanode-{{datanode_id}}({{instance}}) |
| Datanode Load(Write) Distribution | greptime_datanode_history_load | piechart | Distribution of write load across datanodes. | prometheus | binBps | datanode-{{datanode_id}}({{instance}}) |
| Datanode Data Distribution | WITH leader_regions AS (SELECT CONCAT('datanode-', p.peer_id, ' (', p.peer_addr, ')') AS datanode, r.disk_size FROM information_schema.region_statistics r JOIN information_schema.region_peers p ON r.region_id = p.region_id WHERE r.region_role = 'Leader' AND p.is_leader = 'Yes') SELECT datanode, COUNT(*) AS leader_region_count, SUM(disk_size) AS data_size FROM leader_regions GROUP BY datanode ORDER BY data_size DESC; | piechart | Distribution of leader regions and data size across datanodes. | mysql | bytes | -- |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Region Balancer Actions | sum by (result) (changes(greptime_region_balancer_actions_total[$__rate_interval])) | timeseries | Region balancer action count | prometheus | short | {{result}} |
| Region Balancer Gate Stops | sum by (gate, reason) (changes(greptime_region_balancer_gate_stop_total[$__rate_interval])) | timeseries | Region balancer gate stop count by gate and reason | prometheus | short | {{gate}} / {{reason}} |
| Region Balancer Datanodes | sum by (state) (greptime_region_balancer_datanodes) | stat | Region balancer datanode count by state | prometheus | short | {{state}} |
| Region Balancer Regions | sum by (state) (greptime_region_balancer_regions) | stat | Region balancer region count by state | prometheus | short | {{state}} |
| Region Balancer Datanode Stability | sum by (state) (greptime_region_balancer_datanode_stability) | stat | Region balancer datanode stability statistics by state | prometheus | binBps | {{state}} |
| Auto Repartition Actions | sum by (result) (changes(greptime_auto_repartition_actions_total[$__rate_interval])) | timeseries | Auto repartition action count by result | prometheus | short | {{result}} |
| Auto Repartition Gate Stops | sum by (gate, reason) (changes(greptime_auto_repartition_gate_stop_total[$__rate_interval])) | timeseries | Auto repartition gate stop count by gate and reason | prometheus | short | {{gate}} / {{reason}} |
| Auto Repartition Sampling P99 | histogram_quantile(0.99, sum by (le, stage) (rate(greptime_auto_repartition_sampling_elapsed_bucket[$__rate_interval]))) | timeseries | Auto repartition sampling elapsed time by stage | prometheus | s | {{stage}} |
| Auto Repartition Executor P99 | histogram_quantile(0.99, sum by (le, stage) (rate(greptime_auto_repartition_executor_elapsed_bucket[$__rate_interval]))) | timeseries | Auto repartition executor elapsed time by stage | prometheus | s | {{stage}} |
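
The action and gate-stop panels in this table apply `changes()` to `_total` counters. In PromQL, `changes()` counts how many times a series' value differs from the previous sample inside the window; it does not measure by how much the counter grew. A simplified Python sketch of that behavior follows (sample values are hypothetical, and NaN and staleness handling are ignored):

```python
def changes(samples):
    """Count value transitions between consecutive samples in a window."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Hypothetical scrapes of a *_total counter within one window.
window = [3, 3, 5, 5, 6]
print(changes(window))        # 2 transitions (3 -> 5 and 5 -> 6)
print(window[-1] - window[0])  # 3, the raw counter growth over the window
```

The two numbers differ whenever a counter advances by more than one between scrapes, so these panels are best read as "how often something happened" rather than an exact action count.
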
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| QPS per Instance | sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count[$__rate_interval])) | timeseries | QPS per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}] |
| Read QPS per Instance | sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation=~"read|Reader::read"}[$__rate_interval])) | timeseries | Read QPS per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}] |
| Read P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{operation=~"read|Reader::read"}[$__rate_interval]))) | |||||
sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_sum{operation=~"read|Reader::read"}[$__rate_interval])) / sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation=~"read|Reader::read"}[$__rate_interval])) | timeseries | Read P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}] | |
| Write QPS per Instance | sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation=~"write|Writer::write|Writer::close"}[$__rate_interval])) | timeseries | Write QPS per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}] |
| Write P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{operation =~ "Writer::write|Writer::close|write"}[$__rate_interval]))) | |||||
sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_sum{operation=~"write|Writer::write|Writer::close"}[$__rate_interval])) / sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation=~"write|Writer::write|Writer::close"}[$__rate_interval])) | timeseries | Write P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}] | |
| List QPS per Instance | sum by(instance, pod, scheme) (rate(opendal_operation_duration_seconds_count{operation="list"}[$__rate_interval])) | timeseries | List QPS per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}]-[{{scheme}}] |
| List P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, scheme) (rate(opendal_operation_duration_seconds_bucket{operation="list"}[$__rate_interval]))) | |||||
sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_sum{operation="list"}[$__rate_interval])) / sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation="list"}[$__rate_interval])) | timeseries | List P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{scheme}}] | |
| Other Requests per Instance | sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"}[$__rate_interval])) | timeseries | Other Requests per Instance. | prometheus | ops | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}] |
| Other Request P99 and Avg per Instance | histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"}[$__rate_interval]))) | |||||
sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_sum{operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"}[$__rate_interval])) / sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"}[$__rate_interval])) | timeseries | Other Request P99 and average per Instance. | prometheus | s | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}] | |
| Opendal traffic | sum by(instance, pod, scheme, operation) (rate(opendal_operation_bytes_sum[$__rate_interval])) | timeseries | Total traffic in bytes by instance and operation. | prometheus | decbytes | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}] |
| OpenDAL errors per Instance | sum by(instance, pod, scheme, operation, error) (rate(opendal_operation_errors_total{error!="NotFound"}[$__rate_interval])) | timeseries | OpenDAL error rate per Instance, excluding NotFound errors. | prometheus | -- | [{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]-[{{error}}] |
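
The read, write, and "Other" panels in this table split operations with `=~` and `!~` matchers. PromQL regex matchers are anchored at both ends, which is why `read` and `Reader::read` are listed as separate alternatives. A small Python sketch that emulates this with `re.fullmatch` follows; the operation names `delete` and `presign_read` are hypothetical examples, not a list of real OpenDAL operations.

```python
import re

# Pattern copied from the "Other Requests per Instance" query.
OTHER_EXCLUDED = "read|Reader::read|write|Writer::write|Writer::close|list|stat"


def matches(pattern, value):
    """Emulate a PromQL =~ matcher: the whole label value must match."""
    return re.fullmatch(pattern, value) is not None


def is_other(operation):
    """Emulate the !~ matcher used by the "Other" panels."""
    return not matches(OTHER_EXCLUDED, operation)


for op in ["read", "Reader::read", "delete", "presign_read"]:
    print(op, is_other(op))
```

With anchoring, `presign_read` is classified as "other" even though it contains `read`; an unanchored search would have excluded it.
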
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| WAL write size | histogram_quantile(0.95, sum by(le,instance, pod) (rate(raft_engine_write_size_bucket[$__rate_interval]))) | |||||
histogram_quantile(0.99, sum by(le,instance,pod) (rate(raft_engine_write_size_bucket[$__rate_interval]))) | ||||||
sum by (instance, pod)(rate(raft_engine_write_size_sum[$__rate_interval])) | timeseries | Write-ahead log write size in bytes. This chart shows the p95 and p99 write sizes and the total WAL write rate, per instance. | prometheus | bytes | [{{instance}}]-[{{pod}}]-req-size-p95 | |
| WAL sync duration seconds | histogram_quantile(0.99, sum by(le, type, node, instance, pod) (rate(raft_engine_sync_log_duration_seconds_bucket[$__rate_interval]))) | timeseries | Raft engine (local disk) log store sync latency, p99 | prometheus | s | [{{instance}}]-[{{pod}}]-p99 |
| Log Store op duration seconds | histogram_quantile(0.99, sum by(le,logstore,optype,instance, pod) (rate(greptime_logstore_op_elapsed_bucket[$__rate_interval]))) | timeseries | Write-ahead log operations latency at p99 | prometheus | s | [{{instance}}]-[{{pod}}]-[{{logstore}}]-[{{optype}}]-p99 |
| Triggered region flush total | meta_triggered_region_flush_total | timeseries | Triggered region flush total | prometheus | none | {{pod}}-{{topic_name}} |
| Triggered region checkpoint total | meta_triggered_region_checkpoint_total | timeseries | Triggered region checkpoint total | prometheus | none | {{pod}}-{{topic_name}} |
| Topic estimated replay size | meta_topic_estimated_replay_size | timeseries | Topic estimated max replay size | prometheus | bytes | {{pod}}-{{topic_name}} |
| Kafka logstore's bytes traffic | rate(greptime_logstore_kafka_client_bytes_total[$__rate_interval]) | timeseries | Kafka logstore's bytes traffic | prometheus | bytes | {{pod}}-{{logstore}} |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Flow Ingest / Output Rate | sum by(instance, pod, direction) (rate(greptime_flow_processed_rows[$__rate_interval])) | timeseries | Flow Ingest / Output Rate. | prometheus | -- | [{{pod}}]-[{{instance}}]-[{{direction}}] |
| Flow Ingest Latency | histogram_quantile(0.95, sum(rate(greptime_flow_insert_elapsed_bucket[$__rate_interval])) by (le, instance, pod)) | |||||
histogram_quantile(0.99, sum(rate(greptime_flow_insert_elapsed_bucket[$__rate_interval])) by (le, instance, pod)) | ||||||
sum by(instance, pod) (rate(greptime_flow_insert_elapsed_sum[$__rate_interval])) / sum by(instance, pod) (rate(greptime_flow_insert_elapsed_count[$__rate_interval])) | timeseries | Flow Ingest Latency. | prometheus | -- | [{{instance}}]-[{{pod}}]-p95 | |
| Flow Operation Latency | histogram_quantile(0.95, sum(rate(greptime_flow_processing_time_bucket[$__rate_interval])) by (le,instance,pod,type)) | |||||
histogram_quantile(0.99, sum(rate(greptime_flow_processing_time_bucket[$__rate_interval])) by (le,instance,pod,type)) | timeseries | Flow Operation Latency. | prometheus | -- | [{{instance}}]-[{{pod}}]-[{{type}}]-p95 | |
| Flow Buffer Size per Instance | greptime_flow_input_buf_size | timeseries | Flow Buffer Size per Instance. | prometheus | -- | [{{instance}}]-[{{pod}}] |
| Flow Processing Error per Instance | sum by(instance,pod,code) (rate(greptime_flow_errors[$__rate_interval])) | timeseries | Flow Processing Error per Instance. | prometheus | -- | [{{instance}}]-[{{pod}}]-[{{code}}] |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
|---|---|---|---|---|---|---|
| Trigger Count | greptime_trigger_count | timeseries | Total number of triggers currently defined. | prometheus | -- | __auto |
| Trigger Eval Elapsed | histogram_quantile(0.99, sum by (le) (rate(greptime_trigger_evaluate_elapsed_bucket[$__rate_interval]))) | |||||
histogram_quantile(0.75, sum by (le) (rate(greptime_trigger_evaluate_elapsed_bucket[$__rate_interval]))) | ||||||
sum(rate(greptime_trigger_evaluate_elapsed_sum[$__rate_interval])) / sum(rate(greptime_trigger_evaluate_elapsed_count[$__rate_interval])) | timeseries | Elapsed time for trigger evaluation, including query execution and condition evaluation. | prometheus | s | p99 | |
| Trigger Eval Failure Rate | rate(greptime_trigger_evaluate_failure_count[$__rate_interval]) | timeseries | Rate of failed trigger evaluations. | prometheus | none | __auto |
| Send Alert Elapsed | histogram_quantile(0.99, sum by (le) (rate(greptime_trigger_send_alert_elapsed_bucket[$__rate_interval]))) | |||||
histogram_quantile(0.75, sum by (le) (rate(greptime_trigger_send_alert_elapsed_bucket[$__rate_interval]))) | ||||||
sum(rate(greptime_trigger_send_alert_elapsed_sum[$__rate_interval])) / sum(rate(greptime_trigger_send_alert_elapsed_count[$__rate_interval])) | timeseries | Elapsed time to send trigger alerts to notification channels. | prometheus | s | p99 | |
| Send Alert Failure Rate | rate(greptime_trigger_send_alert_failure_count[$__rate_interval]) | timeseries | Rate of failures when sending trigger alerts. | prometheus | none | __auto |
| Save Alert Elapsed | histogram_quantile(0.99, sum by (le) (rate(greptime_trigger_save_alert_record_elapsed_bucket[$__rate_interval]))) | |||||
histogram_quantile(0.75, sum by (le) (rate(greptime_trigger_save_alert_record_elapsed_bucket[$__rate_interval]))) | ||||||
sum(rate(greptime_trigger_save_alert_record_elapsed_sum[$__rate_interval])) / sum(rate(greptime_trigger_save_alert_record_elapsed_count[$__rate_interval])) | timeseries | Elapsed time to persist trigger alert records. | prometheus | s | p99 | |
| Save Alert Failure Rate | rate(greptime_trigger_save_alert_record_failure_count[$__rate_interval]) | timeseries | Rate of failures when persisting trigger alert records. | prometheus | none | __auto |