grafana/dashboards/metrics/standalone/dashboard.md


Overview

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Uptime | `time() - process_start_time_seconds` | `stat` | Time elapsed since GreptimeDB started. | `prometheus` | `s` | `__auto` |
| Version | `SELECT pkg_version FROM information_schema.build_info` | `stat` | GreptimeDB version. | `mysql` | -- | -- |
| Total Ingestion Rate | `sum(rate(greptime_table_operator_ingest_rows[$__rate_interval]))` | `stat` | Total ingestion rate. | `prometheus` | `rowsps` | `__auto` |
| Total Query Rate | `sum(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_http_promql_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_http_sql_elapsed_count[$__rate_interval])) + sum(rate(greptime_frontend_grpc_handle_query_elapsed_count[$__rate_interval]))` | `stat` | Total query API call rate across the MySQL, PostgreSQL, HTTP PromQL, HTTP SQL, and gRPC query interfaces. | `prometheus` | `reqps` | `queries` |
| User-facing Error Rate | `sum(rate(greptime_servers_error[$__rate_interval]))` | `stat` | Server protocol errors returned by frontends. Sustained non-zero values indicate user-visible failures. | `prometheus` | `eps` | `errors` |
| Recent Restarts | `sum(changes(process_start_time_seconds[$__range]))` | `stat` | Process restarts over the selected time range across GreptimeDB roles. | `prometheus` | `short` | `restarts` |
| Deployment | `SELECT count(*) as datanode FROM information_schema.cluster_info WHERE peer_type = 'DATANODE';`<br/>`SELECT count(*) as frontend FROM information_schema.cluster_info WHERE peer_type = 'FRONTEND';`<br/>`SELECT count(*) as metasrv FROM information_schema.cluster_info WHERE peer_type = 'METASRV';`<br/>`SELECT count(*) as flownode FROM information_schema.cluster_info WHERE peer_type = 'FLOWNODE';` | `stat` | The deployment topology of GreptimeDB. | `mysql` | -- | -- |
| Database Resources | `SELECT COUNT(*) as databases FROM information_schema.schemata WHERE schema_name NOT IN ('greptime_private', 'information_schema')`<br/>`SELECT COUNT(*) as tables FROM information_schema.tables WHERE table_schema != 'information_schema'`<br/>`SELECT COUNT(region_id) as regions FROM information_schema.region_peers`<br/>`SELECT COUNT(*) as flows FROM information_schema.flows` | `stat` | Number of key resources in GreptimeDB. | `mysql` | -- | -- |
| Total Storage Size | `select SUM(disk_size) from information_schema.region_statistics;` | `stat` | Total size of data files. | `mysql` | `decbytes` | -- |
| Total Rows | `select SUM(region_rows) from information_schema.region_statistics;` | `stat` | Total number of data rows in the cluster, calculated as the sum of rows from each region. | `mysql` | `sishort` | -- |
| Data Size | `SELECT SUM(memtable_size) * 0.42825 as WAL FROM information_schema.region_statistics;`<br/>`SELECT SUM(index_size) as index FROM information_schema.region_statistics;`<br/>`SELECT SUM(manifest_size) as manifest FROM information_schema.region_statistics;` | `stat` | Size of the WAL, index, and manifest data in GreptimeDB. The WAL figure is derived from the memtable size. | `mysql` | `decbytes` | -- |
| Total Ingestion Rate Trend | `sum(rate(greptime_table_operator_ingest_rows[$__rate_interval]))` | `timeseries` | Total ingestion throughput trend across frontends. Protocol breakdown is in the Ingestion row. | `prometheus` | `rowsps` | `ingestion` |
| Total Query Rate Trend | `sum(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_http_promql_elapsed_count[$__rate_interval])) + sum(rate(greptime_servers_http_sql_elapsed_count[$__rate_interval])) + sum(rate(greptime_frontend_grpc_handle_query_elapsed_count[$__rate_interval]))` | `timeseries` | Total query API call rate trend across frontend protocols. Protocol breakdown is in the Queries row. | `prometheus` | `reqps` | `queries` |
| HTTP Request P99 and Avg | `histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_requests_elapsed_bucket{path!~"/health\|/metrics"}[$__rate_interval])))`<br/>`sum(rate(greptime_servers_http_requests_elapsed_sum{path!~"/health\|/metrics"}[$__rate_interval])) / sum(rate(greptime_servers_http_requests_elapsed_count{path!~"/health\|/metrics"}[$__rate_interval]))` | `timeseries` | Tail and average latency for HTTP requests served by frontends. Excludes health and metrics endpoints. | `prometheus` | `s` | `http-p99` |
| gRPC Request P99 and Avg | `histogram_quantile(0.99, sum by (le) (rate(greptime_servers_grpc_requests_elapsed_bucket[$__rate_interval])))`<br/>`sum(rate(greptime_servers_grpc_requests_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_grpc_requests_elapsed_count[$__rate_interval]))` | `timeseries` | Tail and average latency for gRPC requests served by frontends. | `prometheus` | `s` | `grpc-p99` |
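
The Total Query Rate panels add five `sum(rate(...))` terms. In PromQL, arithmetic between two instant vectors only yields a result where both sides have a matching element, so a protocol metric that has no samples at all can leave the whole expression empty. The sketch below mimics that behaviour in Python; the function names are hypothetical and `None` stands in for an empty vector. A commonly used workaround is wrapping each term as `(... or vector(0))`, which the second function imitates.

```python
from typing import Optional, Sequence


def total_query_rate(per_protocol: Sequence[Optional[float]]) -> Optional[float]:
    """Mimic `sum(a) + sum(b) + ...` from the Total Query Rate panel.

    None represents an empty instant vector (the metric has no samples).
    One empty operand makes the whole addition empty.
    """
    total = 0.0
    for value in per_protocol:
        if value is None:
            return None  # no matching element, so no result
        total += value
    return total


def total_query_rate_with_default(per_protocol: Sequence[Optional[float]]) -> float:
    """Mimic `(sum(a) or vector(0)) + ...`: an empty operand counts as 0."""
    return sum(0.0 if value is None else value for value in per_protocol)


# Hypothetical per-protocol rates: MySQL, PostgreSQL, PromQL, HTTP SQL, gRPC.
rates = [12.0, None, 3.5, 0.5, 4.0]
print(total_query_rate(rates))
print(total_query_rate_with_default(rates))
```

If the stat panel shows "No data" while individual protocol panels in the Queries row show traffic, a missing operand is one possible cause worth checking.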

Ingestion

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Total Ingestion Rate | `sum(rate(greptime_table_operator_ingest_rows[$__rate_interval]))` | `timeseries` | Total ingestion rate. The three primary protocols are Prometheus remote write, Greptime's gRPC API (when using the ingest SDK), and the log ingestion HTTP API. | `prometheus` | `rowsps` | `ingestion` |
| Ingestion Rate by Protocol | `sum(rate(greptime_table_operator_ingest_rows[$__rate_interval]))`<br/>`sum(rate(greptime_servers_prometheus_remote_write_samples[$__rate_interval]))`<br/>`sum(rate(greptime_servers_http_logs_ingestion_counter[$__rate_interval]))`<br/>`sum(rate(greptime_servers_loki_logs_ingestion_counter[$__rate_interval]))`<br/>`sum(rate(greptime_servers_elasticsearch_logs_docs_count[$__rate_interval]))`<br/>`sum(rate(greptime_frontend_otlp_metrics_rows[$__rate_interval]))`<br/>`sum(rate(greptime_frontend_otlp_logs_rows[$__rate_interval]))`<br/>`sum(rate(greptime_frontend_otlp_traces_rows[$__rate_interval]))` | `timeseries` | Rows, samples, or documents ingested by primary observability and table-ingestion protocols. | `prometheus` | `rowsps` | `table-operator` |
| Ingestion Latency by Protocol | `histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_prometheus_write_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_logs_ingestion_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_loki_logs_ingestion_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_elasticsearch_logs_ingestion_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_otlp_metrics_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_otlp_logs_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_otlp_traces_elapsed_bucket[$__rate_interval])))`<br/>`sum(rate(greptime_servers_http_prometheus_write_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_prometheus_write_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_http_logs_ingestion_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_logs_ingestion_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_loki_logs_ingestion_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_loki_logs_ingestion_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_elasticsearch_logs_ingestion_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_elasticsearch_logs_ingestion_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_http_otlp_metrics_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_otlp_metrics_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_http_otlp_logs_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_otlp_logs_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_http_otlp_traces_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_otlp_traces_elapsed_count[$__rate_interval]))` | `timeseries` | p99 and average HTTP ingestion latency for Prometheus remote write, logs, Loki, Elasticsearch, and OTLP endpoints. | `prometheus` | `s` | `prometheus-write` |
| Bulk Insert Message Rows and Size | `sum(rate(greptime_table_operator_bulk_insert_message_rows_sum[$__rate_interval]))`<br/>`sum(rate(greptime_table_operator_bulk_insert_message_size_sum[$__rate_interval]))` | `timeseries` | Bulk-insert message row and byte rates. Spikes here can explain frontend bulk-insert latency. | `prometheus` | `rowsps` | `rows` |
| Prom Store Flush Pipeline | `sum(rate(greptime_prom_store_flush_total[$__rate_interval]))`<br/>`sum(rate(greptime_prom_store_flush_rows_sum[$__rate_interval]))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_prom_store_flush_elapsed_bucket[$__rate_interval])))` | `timeseries` | Remote-write pending-row flush operations, flushed rows, and p99 flush latency. | `prometheus` | `short` | `flush-ops` |
| OTLP Trace Failures | `sum(rate(greptime_frontend_otlp_traces_failure_count[$__rate_interval]))` | `timeseries` | OTLP trace ingestion failures reported by frontends. | `prometheus` | `eps` | `trace-failures` |
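
The average-latency queries in this row all divide the rate of a histogram's `_sum` series by the rate of its `_count` series. Because both rates cover the same window, the window length cancels and the result is simply the time spent divided by the requests observed. A minimal sketch of that arithmetic, using a hypothetical function name and made-up counter readings, and ignoring the counter-reset handling and extrapolation that `rate()` performs:

```python
from typing import Optional


def average_latency(sum_start: float, sum_end: float,
                    count_start: float, count_end: float) -> Optional[float]:
    """Average seconds per request between two scrapes of a histogram.

    Equivalent in spirit to rate(x_sum[w]) / rate(x_count[w]); returns None
    when no requests were observed, where PromQL would produce NaN.
    """
    requests = count_end - count_start
    if requests <= 0:
        return None
    return (sum_end - sum_start) / requests


# Hypothetical readings: 6 seconds of total latency spread over 30 requests.
print(average_latency(10.0, 16.0, 100, 130))
```

An average hides outliers, which is why each average is paired with a p99 series in the same panel.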

Health

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Protocol Error Rates | `sum by (protocol) (rate(greptime_servers_error[$__rate_interval]))`<br/>`sum by (code) (rate(greptime_servers_auth_failure_count[$__rate_interval]))`<br/>`sum by (path, method, code) (rate(greptime_servers_http_requests_elapsed_count{path!~"/health\|/metrics",code!~"2.."}[$__rate_interval]))`<br/>`sum by (path, code) (rate(greptime_servers_grpc_requests_elapsed_count{code!~"0\|OK"}[$__rate_interval]))` | `timeseries` | User-facing and protocol-level error rates. Use labels to identify whether failures are server, auth, HTTP, or gRPC related. | `prometheus` | `eps` | `server-{{protocol}}` |
| Frontend and Query Rejections | `sum(rate(greptime_servers_request_memory_rejected_total[$__rate_interval]))`<br/>`sum(rate(greptime_query_memory_pool_rejected_total[$__rate_interval]))` | `timeseries` | Request and query memory rejections. Non-zero values indicate requests are being rejected before or during execution. | `prometheus` | `rps` | `request-memory` |
| Datanode Write Failures | `sum by (instance, pod) (rate(greptime_datanode_region_request_fail_count[$__rate_interval]))`<br/>`sum by (instance, pod) (rate(greptime_datanode_region_failed_insert_count[$__rate_interval]))` | `timeseries` | Region request failures and failed inserts on datanodes. These indicate backend write-path errors after routing. | `prometheus` | `eps` | `region-request-[{{instance}}]-[{{pod}}]` |
| Buffered Ingestion Loss | `sum(rate(greptime_pending_rows_flush_failures[$__rate_interval]))`<br/>`sum(rate(greptime_pending_rows_flush_dropped_rows[$__rate_interval]))` | `timeseries` | Pending-row flush failures and dropped rows. Sustained non-zero dropped rows are a data-loss signal. | `prometheus` | `eps` | `flush-failures` |
| Mito Backpressure and Failures | `sum(rate(greptime_mito_write_reject_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_write_stall_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_flush_failure_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_compaction_failure_total[$__rate_interval]))` | `timeseries` | Storage-engine write rejects, write stalls, flush failures, and compaction failures on datanodes. | `prometheus` | `eps` | `write-reject` |
| Scan and Compaction Memory Rejects | `sum(rate(greptime_mito_scan_requests_rejected_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_scan_memory_exhausted_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_compaction_memory_rejected_total[$__rate_interval]))` | `timeseries` | Datanode scan and compaction memory rejection/exhaustion counters. | `prometheus` | `rps` | `scan-rejected` |
| OpenDAL Errors | `sum by (scheme, operation, error) (rate(opendal_operation_errors_total{error!="NotFound"}[$__rate_interval]))` | `timeseries` | Object-store errors by scheme, operation, and error, excluding NotFound noise. | `prometheus` | `eps` | `{{scheme}}-{{operation}}-{{error}}` |
| Metasrv Failures | `sum(rate(greptime_meta_region_migration_fail[$__rate_interval]))`<br/>`sum(rate(greptime_meta_reconciliation_procedure_error[$__rate_interval]))` | `timeseries` | Region migration and reconciliation failures in metasrv. | `prometheus` | `eps` | `migration-fail` |
| Flow and Trigger Failures | `sum by (code) (rate(greptime_flow_errors[$__rate_interval]))`<br/>`sum(rate(greptime_trigger_evaluate_failure_count[$__rate_interval]))`<br/>`sum(rate(greptime_trigger_send_alert_failure_count[$__rate_interval]))`<br/>`sum(rate(greptime_trigger_save_alert_record_failure_count[$__rate_interval]))` | `timeseries` | Derived-data and alerting pipeline failures. | `prometheus` | `eps` | `flow-{{code}}` |
| Mito GC Failures | `sum(rate(greptime_mito_gc_errors_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_gc_orphaned_index_files[$__rate_interval]))`<br/>`sum(rate(greptime_mito_gc_skipped_unparsable_files[$__rate_interval]))` | `timeseries` | Mito garbage-collection errors and skipped/orphaned files on datanodes. | `prometheus` | `short` | `gc-errors` |
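
Several panels filter series with negative regex matchers such as `path!~"/health|/metrics"`, `code!~"2.."`, and `code!~"0|OK"`. Prometheus regex matchers are fully anchored, so the pattern must match the entire label value. The sketch below illustrates the effect with Python's `re.fullmatch`; the helper name is hypothetical, and Python's regex dialect differs from the RE2 syntax Prometheus uses, although these simple patterns behave the same in both.

```python
import re


def kept_by_not_regex(label_value: str, pattern: str) -> bool:
    """Return True when a `label!~"pattern"` matcher would keep the series.

    The matcher drops a series only if the pattern matches the whole value.
    """
    return re.fullmatch(pattern, label_value) is None


# Paths: only the exact values /health and /metrics are excluded.
for path in ["/health", "/metrics", "/v1/sql", "/health/ready"]:
    print(path, kept_by_not_regex(path, "/health|/metrics"))

# HTTP codes: any three-character code starting with 2 is excluded.
for code in ["200", "204", "404", "500"]:
    print(code, kept_by_not_regex(code, "2.."))
```

One consequence is that a hypothetical path such as `/health/ready` would still be counted, because the anchored pattern does not match it.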

Capacity

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Runtime Threads | `sum by (instance, pod) (greptime_runtime_threads_alive)`<br/>`sum by (instance, pod) (greptime_runtime_threads_idle)` | `timeseries` | Runtime thread pool size and idle threads by instance. Low idle threads during latency spikes can indicate executor saturation. | `prometheus` | `short` | `alive-[{{instance}}]-[{{pod}}]` |
| Request Memory Utilization | `sum by (instance, pod) (greptime_servers_request_memory_in_use_bytes) / sum by (instance, pod) (greptime_servers_request_memory_limit_bytes)` | `timeseries` | Frontend request memory usage divided by configured request memory limit. | `prometheus` | `percentunit` | `[{{instance}}]-[{{pod}}]` |
| Query Memory Usage | `sum by (instance, pod) (greptime_query_memory_pool_usage_bytes)` | `timeseries` | Query memory pool usage. Use this with query memory rejection panels to diagnose query saturation. | `prometheus` | `bytes` | `[{{instance}}]-[{{pod}}]` |
| Scan and Compaction Memory | `sum by (instance, pod) (greptime_mito_scan_memory_usage_bytes)`<br/>`sum by (instance, pod) (greptime_mito_compaction_memory_in_use_bytes)`<br/>`sum by (instance, pod) (greptime_mito_compaction_memory_limit_bytes)` | `timeseries` | Datanode scan memory usage and compaction memory utilization. | `prometheus` | `bytes` | `scan-[{{instance}}]-[{{pod}}]` |
| Write Buffer and Active Stalling | `sum by (instance, pod) (greptime_mito_write_buffer_bytes)`<br/>`sum by (instance, pod) (greptime_mito_write_stalling_count)` | `timeseries` | Mito write buffer bytes and active write-stalling gauges. Growth here indicates write-path backpressure. | `prometheus` | `bytes` | `buffer-[{{instance}}]-[{{pod}}]` |
| Prom Store Backlog | `sum by (instance, pod) (greptime_prom_store_pending_rows)`<br/>`sum by (instance, pod) (greptime_prom_store_pending_batches)`<br/>`sum by (instance, pod) (greptime_prom_store_pending_workers)` | `timeseries` | Prometheus remote-write pending rows, batches, and workers. Rising pending rows indicate remote-write buffering backlog. | `prometheus` | `short` | `rows-[{{instance}}]-[{{pod}}]` |
| Inflight Flush and Compaction | `sum by (instance, pod) (greptime_mito_inflight_flush_count)`<br/>`sum by (instance, pod) (greptime_mito_inflight_compaction_count)` | `timeseries` | Current in-flight flush and compaction tasks on datanodes. | `prometheus` | `short` | `flush-[{{instance}}]-[{{pod}}]` |
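
The Request Memory Utilization panel divides two vectors that are both grouped by `(instance, pod)`. PromQL pairs elements whose label sets are identical and drops any element without a partner. The sketch below reproduces that one-to-one matching with plain dictionaries; the function name and the sample values are hypothetical.

```python
import math
from typing import Dict, Tuple

Labels = Tuple[str, str]  # (instance, pod)


def utilization(in_use: Dict[Labels, float],
                limit: Dict[Labels, float]) -> Dict[Labels, float]:
    """Divide in-use bytes by limit bytes for each matching (instance, pod)."""
    result: Dict[Labels, float] = {}
    for labels, used in in_use.items():
        cap = limit.get(labels)
        if cap is None:
            continue  # no series with the same labels on the right-hand side
        if cap == 0:
            # Floating-point division by zero, as PromQL would report it.
            result[labels] = math.inf if used > 0 else math.nan
        else:
            result[labels] = used / cap
    return result


in_use = {("10.0.0.1:4000", "frontend-0"): 50.0, ("10.0.0.2:4000", "frontend-1"): 10.0}
limit = {("10.0.0.1:4000", "frontend-0"): 200.0}
print(utilization(in_use, limit))
```

An instance that does not export a limit series therefore disappears from the panel instead of showing zero.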

Resources

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores)` | `timeseries` | Current CPU usage by instance. | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores)` | `timeseries` | Current CPU usage by instance. | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores)` | `timeseries` | Current CPU usage by instance. | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes)` | `timeseries` | Current memory usage by instance. | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes)` | `timeseries` | Current memory usage by instance. | `prometheus` | `bytes` | `[{{instance}}]-[{{ pod }}]` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes)` | `timeseries` | Current memory usage by instance. | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores)` | `timeseries` | Current CPU usage by instance. | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes)` | `timeseries` | Current memory usage by instance. | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
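
The CPU panels multiply `rate(process_cpu_seconds_total[...])` by 1000, which converts CPU seconds consumed per second into millicores so the series can be read against `greptime_cpu_limit_in_millicores`. A minimal sketch of the conversion, with hypothetical function names and sample readings, and without the counter-reset handling and extrapolation that `rate()` applies:

```python
def cpu_millicores(cpu_seconds_start: float, cpu_seconds_end: float,
                   window_seconds: float) -> float:
    """Convert a CPU-seconds counter delta into average millicores used."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    cores = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    return cores * 1000


def fraction_of_limit(used_millicores: float, limit_millicores: float) -> float:
    """Share of the configured CPU limit that is in use."""
    return used_millicores / limit_millicores


# Hypothetical readings: 30 CPU seconds consumed during a 60-second window.
used = cpu_millicores(100.0, 130.0, 60.0)
print(used, fraction_of_limit(used, 2000.0))
```

Plotting usage and limit in the same unit makes it easy to see how close an instance runs to its limit without a separate ratio panel.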

Queries

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Query Rate by Protocol | `sum(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_http_promql_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_http_sql_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_frontend_grpc_handle_query_elapsed_count[$__rate_interval]))` | `timeseries` | Query API call rates by protocol, collected from frontends. | `prometheus` | `reqps` | `mysql` |
| Query Latency by Protocol | `histogram_quantile(0.95, sum by (le) (rate(greptime_servers_mysql_query_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_mysql_query_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.95, sum by (le) (rate(greptime_servers_postgres_query_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_postgres_query_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.95, sum by (le) (rate(greptime_servers_http_promql_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le) (rate(greptime_servers_http_promql_elapsed_bucket[$__rate_interval])))`<br/>`sum(rate(greptime_servers_mysql_query_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_postgres_query_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_servers_http_promql_elapsed_sum[$__rate_interval])) / sum(rate(greptime_servers_http_promql_elapsed_count[$__rate_interval]))`<br/>`sum(rate(greptime_frontend_grpc_handle_query_elapsed_sum[$__rate_interval])) / sum(rate(greptime_frontend_grpc_handle_query_elapsed_count[$__rate_interval]))` | `timeseries` | p95, p99, and average query latency by main frontend protocol. | `prometheus` | `s` | `mysql-p95` |
| Query Stage Latency | `histogram_quantile(0.95, sum by (le, stage) (rate(greptime_query_stage_elapsed_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by (le, stage) (rate(greptime_query_stage_elapsed_bucket[$__rate_interval])))` | `timeseries` | p95 and p99 latency by query stage. Use stage labels to identify planning, scan, or merge bottlenecks. | `prometheus` | `s` | `p95-{{stage}}` |
| Merge Scan Fan-out and Errors | `sum by (instance, pod) (greptime_query_merge_scan_regions)`<br/>`sum by (instance, pod) (rate(greptime_query_merge_scan_errors_total[$__rate_interval]))` | `timeseries` | Merge-scan region fan-out and errors. High fan-out can explain slow distributed table scans. | `prometheus` | `short` | `regions-[{{instance}}]-[{{pod}}]` |
| Pushdown Fallback Errors | `sum(rate(greptime_push_down_fallback_errors_total[$__rate_interval]))` | `timeseries` | Failed query pushdown fallback attempts. Non-zero values can indicate optimization paths that increase scan work. | `prometheus` | `eps` | `pushdown-fallback-errors` |
| PromQL Series Count | `sum by (instance, pod) (greptime_promql_series_count)` | `timeseries` | Series count touched by PromQL queries. Correlate this with PromQL latency to identify cardinality-driven slowness. | `prometheus` | `short` | `[{{instance}}]-[{{pod}}]` |
| Connections and Prepared Statements | `sum by (instance, pod) (greptime_servers_mysql_connection_count)`<br/>`sum by (instance, pod) (greptime_servers_postgres_connection_count)`<br/>`sum by (instance, pod) (rate(greptime_servers_mysql_prepared_count[$__rate_interval]))`<br/>`sum by (instance, pod) (rate(greptime_servers_postgres_prepared_count[$__rate_interval]))` | `timeseries` | MySQL/PostgreSQL connection and prepared-statement counts. Spikes can indicate client storms or leaks. | `prometheus` | `short` | `mysql-connections-[{{instance}}]-[{{pod}}]` |
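
The p95 and p99 series in this row come from `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified re-implementation for intuition only, not the Prometheus source: it assumes positive bucket bounds, a final `+Inf` bucket, and a quantile in `(0, 1]`. The bucket values are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - previous) / (count - previous)


# Hypothetical per-second bucket rates for a latency histogram (bounds in seconds).
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.95, buckets))
```

Because of the interpolation, the reported percentile can never be more precise than the bucket boundaries, so a p99 that sits exactly on a boundary often means the true value lies somewhere in the next bucket.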

Frontend Requests

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| HTTP QPS per Instance | `sum by(instance, pod, path, method, code) (rate(greptime_servers_http_requests_elapsed_count{path!~"/health\|/metrics"}[$__rate_interval]))` | `timeseries` | HTTP QPS per Instance. | `prometheus` | `reqps` | `[{{instance}}]-[{{pod}}]-[{{path}}]-[{{method}}]-[{{code}}]` |
| HTTP P99 and Avg per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, path, method, code) (rate(greptime_servers_http_requests_elapsed_bucket{path!~"/health\|/metrics"}[$__rate_interval])))`<br/>`sum by(instance, pod, path, method, code) (rate(greptime_servers_http_requests_elapsed_sum{path!~"/health\|/metrics"}[$__rate_interval])) / sum by(instance, pod, path, method, code) (rate(greptime_servers_http_requests_elapsed_count{path!~"/health\|/metrics"}[$__rate_interval]))` | `timeseries` | HTTP P99 and average per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{path}}]-[{{method}}]-[{{code}}]-p99` |
| gRPC QPS per Instance | `sum by(instance, pod, path, code) (rate(greptime_servers_grpc_requests_elapsed_count[$__rate_interval]))` | `timeseries` | gRPC QPS per Instance. | `prometheus` | `reqps` | `[{{instance}}]-[{{pod}}]-[{{path}}]-[{{code}}]` |
| gRPC P99 and Avg per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, path, code) (rate(greptime_servers_grpc_requests_elapsed_bucket[$__rate_interval])))`<br/>`sum by(instance, pod, path, code) (rate(greptime_servers_grpc_requests_elapsed_sum[$__rate_interval])) / sum by(instance, pod, path, code) (rate(greptime_servers_grpc_requests_elapsed_count[$__rate_interval]))` | `timeseries` | gRPC P99 and average per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{path}}]-[{{method}}]-[{{code}}]-p99` |
| MySQL QPS per Instance | `sum by(pod, instance)(rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval]))` | `timeseries` | MySQL QPS per Instance. | `prometheus` | `reqps` | `[{{instance}}]-[{{pod}}]` |
| MySQL P99 and Avg per Instance | `histogram_quantile(0.99, sum by(pod, instance, le) (rate(greptime_servers_mysql_query_elapsed_bucket[$__rate_interval])))`<br/>`sum by(pod, instance) (rate(greptime_servers_mysql_query_elapsed_sum[$__rate_interval])) / sum by(pod, instance) (rate(greptime_servers_mysql_query_elapsed_count[$__rate_interval]))` | `timeseries` | MySQL P99 and average per Instance. | `prometheus` | `s` | `[{{ instance }}]-[{{ pod }}]-p99` |
| PostgreSQL QPS per Instance | `sum by(pod, instance)(rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval]))` | `timeseries` | PostgreSQL QPS per Instance. | `prometheus` | `reqps` | `[{{instance}}]-[{{pod}}]` |
| PostgreSQL P99 and Avg per Instance | `histogram_quantile(0.99, sum by(pod,instance,le) (rate(greptime_servers_postgres_query_elapsed_bucket[$__rate_interval])))`<br/>`sum by(pod, instance) (rate(greptime_servers_postgres_query_elapsed_sum[$__rate_interval])) / sum by(pod, instance) (rate(greptime_servers_postgres_query_elapsed_count[$__rate_interval]))` | `timeseries` | PostgreSQL P99 and average per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-p99` |

Frontend to Datanode

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Region Call QPS per Instance | `sum by(instance, pod, request_type) (rate(greptime_grpc_region_request_count[$__rate_interval]))` | `timeseries` | Region Call QPS per Instance. | `prometheus` | `ops` | `[{{instance}}]-[{{pod}}]-[{{request_type}}]` |
| Region Call P99 and Avg per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, request_type) (rate(greptime_grpc_region_request_bucket[$__rate_interval])))`<br/>`sum by(instance, pod, request_type) (rate(greptime_grpc_region_request_sum[$__rate_interval])) / sum by(instance, pod, request_type) (rate(greptime_grpc_region_request_count[$__rate_interval]))` | `timeseries` | Region Call P99 and average per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{request_type}}]` |
| Frontend Handle Bulk Insert Elapsed Time | `sum by(instance, pod, stage) (rate(greptime_table_operator_handle_bulk_insert_sum[$__rate_interval]))/sum by(instance, pod, stage) (rate(greptime_table_operator_handle_bulk_insert_count[$__rate_interval]))`<br/>`histogram_quantile(0.99, sum by(instance, pod, stage, le) (rate(greptime_table_operator_handle_bulk_insert_bucket[$__rate_interval])))` | `timeseries` | Per-stage time for frontend to handle bulk insert requests. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]-AVG` |

Datanode

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Region Request Failures and Failed Inserts | `sum by (instance, pod) (rate(greptime_datanode_region_request_fail_count[$__rate_interval]))`<br/>`sum by (instance, pod) (rate(greptime_datanode_region_failed_insert_count[$__rate_interval]))` | `timeseries` | Datanode region request failures and failed inserts by instance. | `prometheus` | `eps` | `request-fail-[{{instance}}]-[{{pod}}]` |
| Write Rejects and Stalls | `sum by (instance, pod) (rate(greptime_mito_write_reject_total[$__rate_interval]))`<br/>`sum by (instance, pod) (rate(greptime_mito_write_stall_total[$__rate_interval]))`<br/>`sum by (instance, pod) (greptime_mito_write_stalling_count)` | `timeseries` | Mito write rejects, write stall events, and active write stalling by datanode. | `prometheus` | `short` | `reject-[{{instance}}]-[{{pod}}]` |
| Flush and Compaction Failures | `sum by (instance, pod) (rate(greptime_mito_flush_failure_total[$__rate_interval]))`<br/>`sum by (instance, pod) (rate(greptime_mito_compaction_failure_total[$__rate_interval]))` | `timeseries` | Mito flush and compaction failure rates by datanode. | `prometheus` | `eps` | `flush-[{{instance}}]-[{{pod}}]` |
| Mito GC Health | `sum(rate(greptime_mito_gc_runs_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_gc_errors_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_gc_files_deleted_total[$__rate_interval]))`<br/>`sum(rate(greptime_mito_gc_orphaned_index_files[$__rate_interval]))`<br/>`sum(rate(greptime_mito_gc_skipped_unparsable_files[$__rate_interval]))` | `timeseries` | Mito garbage-collection runs, errors, deleted files, orphaned index files, and skipped unparsable files. | `prometheus` | `short` | `runs` |
| Mito GC Duration | `histogram_quantile(0.99, sum by (le, stage) (rate(greptime_mito_gc_duration_seconds_bucket[$__rate_interval])))`<br/>`sum by (stage) (rate(greptime_mito_gc_duration_seconds_sum[$__rate_interval])) / sum by (stage) (rate(greptime_mito_gc_duration_seconds_count[$__rate_interval]))` | `timeseries` | P99 and average Mito garbage-collection duration by stage. | `prometheus` | `s` | `{{stage}}-p99` |

Storage

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Request OPS per Instance | `sum by(instance, pod, type) (rate(greptime_mito_handle_request_elapsed_count[$__rate_interval]))` | `timeseries` | Request OPS per Instance. | `prometheus` | `ops` | `[{{instance}}]-[{{pod}}]-[{{type}}]` |
| Request P99 and Avg per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, type) (rate(greptime_mito_handle_request_elapsed_bucket[$__rate_interval])))`<br/>`sum by(instance, pod, type) (rate(greptime_mito_handle_request_elapsed_sum[$__rate_interval])) / sum by(instance, pod, type) (rate(greptime_mito_handle_request_elapsed_count[$__rate_interval]))` | `timeseries` | Request P99 and average per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{type}}]` |
| Request Wait P99 and Avg per Worker | `histogram_quantile(0.95, sum by(instance, pod, worker, le) (rate(greptime_mito_request_wait_time_bucket[$__rate_interval])))`<br/>`histogram_quantile(0.99, sum by(instance, pod, worker, le) (rate(greptime_mito_request_wait_time_bucket[$__rate_interval])))`<br/>`sum by(instance, pod, worker) (rate(greptime_mito_request_wait_time_sum[$__rate_interval])) / sum by(instance, pod, worker) (rate(greptime_mito_request_wait_time_count[$__rate_interval]))` | `timeseries` | Time Mito requests spend waiting before region worker handling. Use this with request service latency to distinguish queueing from execution time. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{worker}}]-p95` |
| Write Buffer per Instance | `greptime_mito_write_buffer_bytes` | `timeseries` | Write Buffer per Instance. | `prometheus` | `decbytes` | `[{{instance}}]-[{{pod}}]` |
| Write Rows per Instance | `sum by (instance, pod) (rate(greptime_mito_write_rows_total[$__rate_interval]))` | `timeseries` | Ingestion rate measured in rows. | `prometheus` | `rowsps` | `[{{instance}}]-[{{pod}}]` |
| Read Stage OPS per Instance | `sum by(instance, pod) (rate(greptime_mito_read_stage_elapsed_count{stage="total"}[$__rate_interval]))` | `timeseries` | Read Stage OPS per Instance. | `prometheus` | `ops` | `[{{instance}}]-[{{pod}}]` |
| Read Stage P99 and Avg per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, stage) (rate(greptime_mito_read_stage_elapsed_bucket[$__rate_interval])))`<br/>`sum by(instance, pod, stage) (rate(greptime_mito_read_stage_elapsed_sum[$__rate_interval])) / sum by(instance, pod, stage) (rate(greptime_mito_read_stage_elapsed_count[$__rate_interval]))` | `timeseries` | Read Stage P99 and average per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]` |
| Write Stage P99 and Avg per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, stage) (rate(greptime_mito_write_stage_elapsed_bucket[$__rate_interval])))`<br/>`sum by(instance, pod, stage) (rate(greptime_mito_write_stage_elapsed_sum[$__rate_interval])) / sum by(instance, pod, stage) (rate(greptime_mito_write_stage_elapsed_count[$__rate_interval]))` | `timeseries` | Write Stage P99 and average per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]` |
| Cached Bytes per Instance | `greptime_mito_cache_bytes` | `timeseries` | Cached Bytes per Instance. | `prometheus` | `decbytes` | `[{{instance}}]-[{{pod}}]-[{{type}}]` |
| Region Worker Handle Bulk Insert Requests | `histogram_quantile(0.95, sum by(le,instance, stage, pod) (rate(greptime_region_worker_handle_write_bucket[$__rate_interval])))`<br/>`sum by(instance, stage, pod) (rate(greptime_region_worker_handle_write_sum[$__rate_interval]))/sum by(instance, stage, pod) (rate(greptime_region_worker_handle_write_count[$__rate_interval]))` | `timeseries` | Per-stage elapsed time for region worker to handle bulk insert region requests. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]-P95` |
| Active Series and Field Builders Count | `sum by(instance, pod) (greptime_mito_memtable_active_series_count)`<br/>`sum by(instance, pod) (greptime_mito_memtable_field_builder_count)` | `timeseries` | Active series and field-builder counts per memtable by instance. | `prometheus` | `none` | `[{{instance}}]-[{{pod}}]-series` |
| Region Worker Convert Requests | `histogram_quantile(0.95, sum by(le, instance, stage, pod) (rate(greptime_datanode_convert_region_request_bucket[$__rate_interval])))`<br/>`sum by(le,instance, stage, pod) (rate(greptime_datanode_convert_region_request_sum[$__rate_interval]))/sum by(le,instance, stage, pod) (rate(greptime_datanode_convert_region_request_count[$__rate_interval]))` | `timeseries` | Per-stage elapsed time for region worker to decode requests. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]-P95` |
| Cache Miss | `sum by (instance,pod, type) (rate(greptime_mito_cache_miss[$__rate_interval]))` | `timeseries` | Local cache miss rate on the datanode. | `prometheus` | -- | `[{{instance}}]-[{{pod}}]-[{{type}}]` |

Flush and Compaction

TitleQueryTypeDescriptionDatasourceUnitLegend Format
Flush OPS per Instancesum by(instance, pod, reason) (rate(greptime_mito_flush_requests_total[$__rate_interval]))timeseriesFlush QPS per Instance.prometheusops[{{instance}}]-[{{pod}}]-[{{reason}}]
Flush Elapsed Timehistogram_quantile(0.95, sum by (instance, pod, le, type) (rate(greptime_mito_flush_elapsed_bucket[$__rate_interval])))
histogram_quantile(0.99, sum by (instance, pod, le, type) (rate(greptime_mito_flush_elapsed_bucket[$__rate_interval])))
sum by (instance, pod, type) (rate(greptime_mito_flush_elapsed_sum[$__rate_interval])) / sum by (instance, pod, type) (rate(greptime_mito_flush_elapsed_count[$__rate_interval]))timeseriesMito flush p95 and p99 elapsed time by datanode and flush type. Use this to identify slow flush jobs.prometheuss[{{instance}}]-[{{pod}}]-[{{type}}]-p95
Flush Throughputsum by (instance, pod) (rate(greptime_mito_flush_bytes_total[$__rate_interval]))
sum by (instance, pod) (rate(greptime_mito_flush_file_total[$__rate_interval]))timeseriesMito flushed bytes and flushed file rates. Use this with flush elapsed time to distinguish slow jobs from large jobs.prometheusBps[{{instance}}]-[{{pod}}]-bytes
Inflight Flushgreptime_mito_inflight_flush_counttimeseriesOngoing flush task countprometheusnone[{{instance}}]-[{{pod}}]
Compaction OPS per Instancesum by(instance, pod) (rate(greptime_mito_compaction_total_elapsed_count[$__rate_interval]))timeseriesCompaction OPS per Instance.prometheusops[{{ instance }}]-[{{pod}}]
Inflight Compactiongreptime_mito_inflight_compaction_counttimeseriesOngoing compaction task countprometheusnone[{{instance}}]-[{{pod}}]
Compaction P99 and Avg per Instancehistogram_quantile(0.99, sum by(instance, pod, le) (rate(greptime_mito_compaction_total_elapsed_bucket[$__rate_interval])))
sum by(instance, pod) (rate(greptime_mito_compaction_total_elapsed_sum[$__rate_interval])) / sum by(instance, pod) (rate(greptime_mito_compaction_total_elapsed_count[$__rate_interval]))timeseriesCompaction P99 and average per Instance.prometheuss[{{instance}}]-[{{pod}}]-p99
Compaction Elapsed Time per Instance by Stagehistogram_quantile(0.99, sum by(instance, pod, le, stage) (rate(greptime_mito_compaction_stage_elapsed_bucket[$__rate_interval])))
sum by(instance, pod, stage) (rate(greptime_mito_compaction_stage_elapsed_sum[$__rate_interval]))/sum by(instance, pod, stage) (rate(greptime_mito_compaction_stage_elapsed_count[$__rate_interval]))timeseriesCompaction latency by stageprometheuss[{{instance}}]-[{{pod}}]-[{{stage}}]-p99
Compaction Input/Output Bytessum by(instance, pod) (rate(greptime_mito_compaction_input_bytes[$__rate_interval]))
sum by(instance, pod) (rate(greptime_mito_compaction_output_bytes[$__rate_interval]))timeseriesCompaction input and output bytes by datanode. Use this to correlate compaction latency with rewritten data volume.prometheusBps[{{instance}}]-[{{pod}}]-input
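
Several panels above pair a `histogram_quantile(...)` query over `_bucket` series with an average computed from `_sum` and `_count`. As a rough illustration of what the quantile query estimates, the following Python sketch applies linear interpolation across cumulative buckets. It is a simplified approximation of how classic Prometheus histograms are evaluated, not the exact server implementation; the function name and the sample bucket values are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes the lowest bucket starts at 0 and that the
    last bucket is the +Inf bucket holding the total count.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite upper bound instead of interpolating.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return bound
            # Linear interpolation inside the matching bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical flush durations in seconds: (le, cumulative count).
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
```

Because the estimate is interpolated within a bucket, its precision depends on the bucket boundaries, which is one reason to read the p95/p99 lines alongside the average rather than in isolation.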

Index

TitleQueryTypeDescriptionDatasourceUnitLegend Format
Index Apply Elapsed Timehistogram_quantile(0.95, sum by (le, type) (rate(greptime_index_apply_elapsed_bucket[$__rate_interval])))
histogram_quantile(0.99, sum by (le, type) (rate(greptime_index_apply_elapsed_bucket[$__rate_interval])))timeseriesIndex apply p95 and p99 elapsed time by index type. Slow apply can increase read latency for indexed predicates.prometheuss{{type}}-p95
Index Create Elapsed Timehistogram_quantile(0.95, sum by (le, stage, type) (rate(greptime_index_create_elapsed_bucket[$__rate_interval])))
histogram_quantile(0.99, sum by (le, stage, type) (rate(greptime_index_create_elapsed_bucket[$__rate_interval])))timeseriesIndex create p95 and p99 elapsed time by stage and index type. Slow stages can explain flush or compaction delays.prometheuss{{type}}-{{stage}}-p95
Index Create Rows and Bytessum by (type) (rate(greptime_index_create_rows_total[$__rate_interval]))
sum by (type) (rate(greptime_index_create_bytes_total[$__rate_interval]))timeseriesRows and bytes produced by index creation by index type. Spikes here can explain storage write pressure.prometheusrowsps{{type}}-rows
Index Memory Usagegreptime_index_apply_memory_usage
sum by (type) (greptime_index_create_memory_usage)timeseriesMemory used while applying and creating indexes. Growth here can explain memory pressure during indexed flush or compaction work.prometheusbytesapply
Index IO Bytessum by (type, file_type) (rate(greptime_index_io_bytes_total[$__rate_interval]))timeseriesIndex read and write byte rates by operation and file type for puffin and intermediate files.prometheusBps{{type}}-{{file_type}}
Index IO Operationssum by (type, file_type) (rate(greptime_index_io_op_total[$__rate_interval]))timeseriesIndex IO operation rates by operation and file type, including read, write, seek, and flush operations.prometheusops{{type}}-{{file_type}}
Index Cachesum by (type) (rate(greptime_mito_cache_hit{type=~"index.*|vector_index|index_result"}[$__rate_interval]))
sum by (type) (rate(greptime_mito_cache_miss{type=~"index.*|vector_index|index_result"}[$__rate_interval]))
sum by (type, cause) (rate(greptime_mito_cache_eviction{type=~"index.*|vector_index|index_result"}[$__rate_interval]))timeseriesIndex-related cache hits, misses, and evictions from Mito caches.prometheusopshit-{{type}}

Metasrv

TitleQueryTypeDescriptionDatasourceUnitLegend Format
Inactive and Lease-expired Regionssum(greptime_meta_inactive_regions)
sum(greptime_lease_expired_region)timeseriesInactive regions and expired region leases. Non-zero values indicate metasrv or routing health issues.prometheusshortinactive-regions
Heartbeat Healthsum(rate(greptime_meta_heartbeat_rate[$__rate_interval]))
sum(greptime_meta_heartbeat_connection_num)
sum(rate(greptime_frontend_heartbeat_send_count[$__rate_interval]))
sum(rate(greptime_frontend_heartbeat_recv_count[$__rate_interval]))
sum(rate(greptime_datanode_heartbeat_send_count[$__rate_interval]))
sum(rate(greptime_datanode_heartbeat_recv_count[$__rate_interval]))timeseriesMetasrv heartbeat receive rate, heartbeat connections, and frontend/datanode heartbeat send/receive counters.prometheusshortmeta-recv-rate
Region migration datanodegreptime_meta_region_migration_stat{datanode_type="src"}
greptime_meta_region_migration_stat{datanode_type="desc"}status-historyCounter of region migration by source and destinationprometheus--from-datanode-{{datanode_id}}
Region migration errorrate(greptime_meta_region_migration_error[$__rate_interval])timeseriesRate of region migration errorsprometheusnone{{pod}}-{{state}}-{{error_type}}
Datanode loadgreptime_datanode_loadtimeseriesGauge of load information of each datanode, collected via heartbeat between datanode and metasrv. This information is for metasrv to schedule workloads.prometheusbinBpsDatanode-{{datanode_id}}-writeload
Rate of SQL Executions (RDS)rate(greptime_meta_rds_pg_sql_execute_elapsed_ms_count[$__rate_interval])timeseriesDisplays the rate of SQL executions processed by the Meta service using the RDS backend.prometheusnone{{pod}} {{op}} {{type}} {{result}}
SQL Execution Latency (RDS)histogram_quantile(0.90, sum by(pod, op, type, result, le) (rate(greptime_meta_rds_pg_sql_execute_elapsed_ms_bucket[$__rate_interval])))
sum by(pod, op, type, result) (rate(greptime_meta_rds_pg_sql_execute_elapsed_ms_sum[$__rate_interval])) / sum by(pod, op, type, result) (rate(greptime_meta_rds_pg_sql_execute_elapsed_ms_count[$__rate_interval]))timeseriesMeasures the response time of SQL executions via the RDS backend.prometheusms{{pod}} {{op}} {{type}} {{result}} p90
Handler Execution Latencyhistogram_quantile(0.90, sum by(pod, le, name) (rate(greptime_meta_handler_execute_bucket[$__rate_interval])))
sum by(pod, name) (rate(greptime_meta_handler_execute_sum[$__rate_interval])) / sum by(pod, name) (rate(greptime_meta_handler_execute_count[$__rate_interval]))timeseriesShows latency of Meta handlers by pod and handler name, useful for monitoring handler performance and detecting latency spikes.prometheuss{{pod}} {{name}} p90
Heartbeat Packet Sizehistogram_quantile(0.9, sum by(pod, le) (rate(greptime_meta_heartbeat_stat_memory_size_bucket[$__rate_interval])))timeseriesShows p90 heartbeat message sizes, helping track network usage and identify anomalies in heartbeat payload.prometheusbytes{{pod}}
Meta Heartbeat Receive Raterate(greptime_meta_heartbeat_rate[$__rate_interval])timeseriesRate of heartbeats received by metasrv from datanodes and frontends.prometheuss{{pod}}
Meta KV Ops Latencyhistogram_quantile(0.99, sum by(pod, le, op, target) (rate(greptime_meta_kv_request_elapsed_bucket[$__rate_interval])))
sum by(pod, op, target) (rate(greptime_meta_kv_request_elapsed_sum[$__rate_interval])) / sum by(pod, op, target) (rate(greptime_meta_kv_request_elapsed_count[$__rate_interval]))timeseriesp99 and average latency of metasrv key-value store operations by op and target.prometheuss{{pod}}-{{op}} p99
Rate of meta KV Opsrate(greptime_meta_kv_request_elapsed_count[$__rate_interval])timeseriesRate of metasrv key-value store operations by op.prometheusnone{{pod}}-{{op}} p99
DDL Latencyhistogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_create_tables_bucket[$__rate_interval])))
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_create_table_bucket[$__rate_interval])))
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_create_view_bucket[$__rate_interval])))
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_create_flow_bucket[$__rate_interval])))
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_drop_table_bucket[$__rate_interval])))
histogram_quantile(0.9, sum by(le, pod, step) (rate(greptime_meta_procedure_alter_table_bucket[$__rate_interval])))timeseriesp90 latency of metasrv DDL procedures (create/alter/drop table, create view/flow) by step.prometheussCreateLogicalTables-{{step}} p90
Reconciliation statsrate(greptime_meta_reconciliation_stats[$__rate_interval])timeseriesReconciliation statsprometheusops{{pod}}-{{table_type}}-{{type}}
Reconciliation stepshistogram_quantile(0.9, sum by(le, procedure_name, step) (rate(greptime_meta_reconciliation_procedure_bucket[$__rate_interval])))timeseriesp90 elapsed time of reconciliation procedure stepsprometheuss{{procedure_name}}-{{step}}-P90

Hotspot

| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Hotspot Regions | `WITH table_stats AS ( SELECT table_id, COUNT(*) AS region_count, SUM(disk_size) AS total_disk_size, SUM(region_rows) as total_region_rows FROM information_schema.region_statistics WHERE region_role = 'Leader' GROUP BY table_id HAVING COUNT(*) > 1 ) SELECT t.table_schema, t.table_name, r.region_id, t.table_id, r.region_number, p.partition_description, ROUND( r.disk_size * 100.0 / NULLIF(ts.total_disk_size, 0), 2 ) AS disk_size_share_percent, r.disk_size, ROUND( r.region_rows * 100.0 / NULLIF(ts.total_region_rows, 0), 2 ) AS region_rows_share_percent, r.region_rows FROM information_schema.region_statistics r JOIN table_stats ts ON r.table_id = ts.table_id JOIN information_schema.tables t ON r.table_id = t.table_id LEFT JOIN information_schema.partitions p ON p.table_schema = t.table_schema AND p.table_name = t.table_name AND p.greptime_partition_id = r.region_id WHERE r.region_role = 'Leader' ORDER BY region_rows_share_percent DESC LIMIT 100;` | `table` | | `mysql` | -- | -- |
| Datanode Load(Write) | `greptime_datanode_history_load` | `timeseries` | Write load of each datanode over time. | `prometheus` | `binBps` | `datanode-{{datanode_id}}({{instance}})` |
| Datanode Load(Write) Distribution | `greptime_datanode_history_load` | `piechart` | Distribution of write load across datanodes. | `prometheus` | `binBps` | `datanode-{{datanode_id}}({{instance}})` |
| Datanode Data Distribution | `WITH leader_regions AS ( SELECT CONCAT( 'datanode-', p.peer_id, ' (', p.peer_addr, ')' ) AS datanode, r.disk_size FROM information_schema.region_statistics r JOIN information_schema.region_peers p ON r.region_id = p.region_id WHERE r.region_role = 'Leader' AND p.is_leader = 'Yes' ) SELECT datanode, COUNT(*) AS leader_region_count, SUM(disk_size) AS data_size FROM leader_regions GROUP BY datanode ORDER BY data_size DESC;` | `piechart` | Distribution of leader regions and data size across datanodes. | `mysql` | `bytes` | -- |
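
The Hotspot Regions query ranks each leader region by its share of the table's rows and disk size, and uses `NULLIF(..., 0)` so that an empty table yields NULL instead of a division error. The Python sketch below mirrors that arithmetic on in-memory data, assuming the input already contains only leader regions. The function names and sample values are hypothetical, and placing NULL shares last in the ordering is an assumption about the SQL engine's sort behavior.

```python
def share_percent(value, total):
    """Return value's share of total as a percentage rounded to 2 places,
    or None when total is zero (the role NULLIF plays in the SQL)."""
    if total == 0:
        return None
    return round(value * 100.0 / total, 2)


def hotspot_rows(regions, limit=100):
    """regions: dicts with table_id, region_id, region_rows, disk_size."""
    # Per-table totals, like the table_stats CTE.
    totals = {}
    for r in regions:
        t = totals.setdefault(r["table_id"], {"rows": 0, "disk": 0, "count": 0})
        t["rows"] += r["region_rows"]
        t["disk"] += r["disk_size"]
        t["count"] += 1

    out = []
    for r in regions:
        t = totals[r["table_id"]]
        if t["count"] <= 1:
            continue  # HAVING COUNT(*) > 1: skip single-region tables
        out.append({
            "region_id": r["region_id"],
            "disk_size_share_percent": share_percent(r["disk_size"], t["disk"]),
            "region_rows_share_percent": share_percent(r["region_rows"], t["rows"]),
        })

    # ORDER BY region_rows_share_percent DESC, with None (NULL) shares last.
    out.sort(
        key=lambda row: (
            row["region_rows_share_percent"] is not None,
            row["region_rows_share_percent"] or 0.0,
        ),
        reverse=True,
    )
    return out[:limit]


# Hypothetical sample: table 1 is skewed, table 2 has a single region.
regions = [
    {"table_id": 1, "region_id": 10, "region_rows": 900, "disk_size": 300},
    {"table_id": 1, "region_id": 11, "region_rows": 100, "disk_size": 100},
    {"table_id": 2, "region_id": 20, "region_rows": 500, "disk_size": 50},
]
rows = hotspot_rows(regions)
```

A region holding a far larger share than `100 / region_count` percent of its table is a candidate hotspot; single-region tables are excluded because their share is always 100%.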

Autopilot

TitleQueryTypeDescriptionDatasourceUnitLegend Format
Region Balancer Actionssum by (result) (changes(greptime_region_balancer_actions_total[$__rate_interval]))timeseriesRegion balancer action countprometheusshort{{result}}
Region Balancer Gate Stopssum by (gate, reason) (changes(greptime_region_balancer_gate_stop_total[$__rate_interval]))timeseriesRegion balancer gate stop count by gate and reasonprometheusshort{{gate}} / {{reason}}
Region Balancer Datanodessum by (state) (greptime_region_balancer_datanodes)statRegion balancer datanode count by stateprometheusshort{{state}}
Region Balancer Regionssum by (state) (greptime_region_balancer_regions)statRegion balancer region count by stateprometheusshort{{state}}
Region Balancer Datanode Stabilitysum by (state) (greptime_region_balancer_datanode_stability)statRegion balancer datanode stability statistics by stateprometheusbinBps{{state}}
Auto Repartition Actionssum by (result) (changes(greptime_auto_repartition_actions_total[$__rate_interval]))timeseriesAuto repartition action count by resultprometheusshort{{result}}
Auto Repartition Gate Stopssum by (gate, reason) (changes(greptime_auto_repartition_gate_stop_total[$__rate_interval]))timeseriesAuto repartition gate stop count by gate and reasonprometheusshort{{gate}} / {{reason}}
Auto Repartition Sampling P99histogram_quantile(0.99, sum by (le, stage) (rate(greptime_auto_repartition_sampling_elapsed_bucket[$__rate_interval])))timeseriesAuto repartition sampling elapsed time by stageprometheuss{{stage}}
Auto Repartition Executor P99histogram_quantile(0.99, sum by (le, stage) (rate(greptime_auto_repartition_executor_elapsed_bucket[$__rate_interval])))timeseriesAuto repartition executor elapsed time by stageprometheuss{{stage}}
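
The action and gate-stop panels above use `changes(...)` rather than `rate(...)` or `increase(...)`. In PromQL, `changes()` returns how many times a series' value changed within the range, not by how much it grew. The Python sketch below illustrates that distinction on a hypothetical list of scraped counter samples.

```python
def changes(samples):
    """Count how many times consecutive samples differ within a window."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def increase(samples):
    """Naive counter growth over the window (no reset handling or
    extrapolation, unlike the real PromQL function)."""
    return samples[-1] - samples[0] if samples else 0


# Hypothetical scrapes of a counter: it moved twice, growing by 4 in total.
window = [3, 3, 4, 4, 7]
n_changes = changes(window)
growth = increase(window)
```

With these samples the counter changed twice while growing by four, so a `changes()` panel reads as "number of scrape intervals in which at least one action happened" rather than an exact action count.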

Object Store

TitleQueryTypeDescriptionDatasourceUnitLegend Format
QPS per Instancesum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count[$__rate_interval]))timeseriesQPS per Instance.prometheusops[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]
Read QPS per Instancesum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation=~"read|Reader::read"}[$__rate_interval]))timeseriesRead QPS per Instance.prometheusops[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]
Read P99 and Avg per Instancehistogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{operation=~"read|Reader::read"}[$__rate_interval])))
sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_sum{operation=~"read|Reader::read"}[$__rate_interval])) / sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation=~"read|Reader::read"}[$__rate_interval]))timeseriesRead P99 and average per Instance.prometheuss[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]
Write QPS per Instancesum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation=~"write|Writer::write|Writer::close"}[$__rate_interval]))timeseriesWrite QPS per Instance.prometheusops[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]
Write P99 and Avg per Instancehistogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{operation =~ "Writer::write|Writer::close|write"}[$__rate_interval])))
sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_sum{operation=~"write|Writer::write|Writer::close"}[$__rate_interval])) / sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation=~"write|Writer::write|Writer::close"}[$__rate_interval]))timeseriesWrite P99 and average per Instance.prometheuss[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]
List QPS per Instancesum by(instance, pod, scheme) (rate(opendal_operation_duration_seconds_count{operation="list"}[$__rate_interval]))timeseriesList QPS per Instance.prometheusops[{{instance}}]-[{{pod}}]-[{{scheme}}]
List P99 and Avg per Instancehistogram_quantile(0.99, sum by(instance, pod, le, scheme) (rate(opendal_operation_duration_seconds_bucket{operation="list"}[$__rate_interval])))
sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_sum{operation="list"}[$__rate_interval])) / sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation="list"}[$__rate_interval]))timeseriesList P99 and average per Instance.prometheuss[{{instance}}]-[{{pod}}]-[{{scheme}}]
Other Requests per Instancesum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"}[$__rate_interval]))timeseriesOther Requests per Instance.prometheusops[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]
Other Request P99 and Avg per Instancehistogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"}[$__rate_interval])))
sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_sum{operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"}[$__rate_interval])) / sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"}[$__rate_interval]))timeseriesOther Request P99 and average per Instance.prometheuss[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]
Opendal trafficsum by(instance, pod, scheme, operation) (rate(opendal_operation_bytes_sum[$__rate_interval]))timeseriesTotal traffic in bytes by instance and operationprometheusdecbytes[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]
OpenDAL errors per Instancesum by(instance, pod, scheme, operation, error) (rate(opendal_operation_errors_total{error!="NotFound"}[$__rate_interval]))timeseriesOpenDAL error counts per Instance.prometheus--[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]-[{{error}}]
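
The "Other Requests" panels select operations with a negative regex matcher, `operation!~"read|Reader::read|write|Writer::write|Writer::close|list|stat"`. PromQL regex matchers are fully anchored, so only exact operation names are excluded. The Python sketch below reproduces that classification with `re.fullmatch`; the sample operation names passed to it are hypothetical.

```python
import re

# Same alternation as the dashboard's operation!~"..." matcher.
EXCLUDED = re.compile(
    r"read|Reader::read|write|Writer::write|Writer::close|list|stat"
)


def is_other(operation):
    """True when the operation would appear in the 'Other Requests' panels.

    fullmatch mimics PromQL's anchored regex semantics: the whole label
    value must match one alternative to be excluded.
    """
    return EXCLUDED.fullmatch(operation) is None


# Hypothetical operation label values.
results = {op: is_other(op) for op in ["read", "stat", "delete", "readdir"]}
```

Note that `stat` is excluded from the "Other" panels even though this section has no dedicated `stat` panel, so `stat` calls are only visible in the unfiltered "QPS per Instance" panel.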

WAL

TitleQueryTypeDescriptionDatasourceUnitLegend Format
WAL write sizehistogram_quantile(0.95, sum by(le,instance, pod) (rate(raft_engine_write_size_bucket[$__rate_interval])))
histogram_quantile(0.99, sum by(le,instance,pod) (rate(raft_engine_write_size_bucket[$__rate_interval])))
sum by (instance, pod)(rate(raft_engine_write_size_sum[$__rate_interval]))timeseriesWrite-ahead log write size in bytes. This chart includes p95 and p99 write sizes by instance and the total WAL write rate.prometheusbytes[{{instance}}]-[{{pod}}]-req-size-p95
WAL sync duration secondshistogram_quantile(0.99, sum by(le, type, node, instance, pod) (rate(raft_engine_sync_log_duration_seconds_bucket[$__rate_interval])))timeseriesRaft engine (local disk) log store sync latency, p99prometheuss[{{instance}}]-[{{pod}}]-p99
Log Store op duration secondshistogram_quantile(0.99, sum by(le,logstore,optype,instance, pod) (rate(greptime_logstore_op_elapsed_bucket[$__rate_interval])))timeseriesWrite-ahead log operations latency at p99prometheuss[{{instance}}]-[{{pod}}]-[{{logstore}}]-[{{optype}}]-p99
Triggered region flush totalmeta_triggered_region_flush_totaltimeseriesTriggered region flush totalprometheusnone{{pod}}-{{topic_name}}
Triggered region checkpoint totalmeta_triggered_region_checkpoint_totaltimeseriesTriggered region checkpoint totalprometheusnone{{pod}}-{{topic_name}}
Topic estimated replay sizemeta_topic_estimated_replay_sizetimeseriesTopic estimated max replay sizeprometheusbytes{{pod}}-{{topic_name}}
Kafka logstore's bytes trafficrate(greptime_logstore_kafka_client_bytes_total[$__rate_interval])timeseriesKafka logstore's bytes trafficprometheusbytes{{pod}}-{{logstore}}

Flownode

TitleQueryTypeDescriptionDatasourceUnitLegend Format
Flow Ingest / Output Ratesum by(instance, pod, direction) (rate(greptime_flow_processed_rows[$__rate_interval]))timeseriesFlow Ingest / Output Rate.prometheus--[{{pod}}]-[{{instance}}]-[{{direction}}]
Flow Ingest Latencyhistogram_quantile(0.95, sum(rate(greptime_flow_insert_elapsed_bucket[$__rate_interval])) by (le, instance, pod))
histogram_quantile(0.99, sum(rate(greptime_flow_insert_elapsed_bucket[$__rate_interval])) by (le, instance, pod))
sum by(instance, pod) (rate(greptime_flow_insert_elapsed_sum[$__rate_interval])) / sum by(instance, pod) (rate(greptime_flow_insert_elapsed_count[$__rate_interval]))timeseriesFlow Ingest Latency.prometheus--[{{instance}}]-[{{pod}}]-p95
Flow Operation Latencyhistogram_quantile(0.95, sum(rate(greptime_flow_processing_time_bucket[$__rate_interval])) by (le,instance,pod,type))
histogram_quantile(0.99, sum(rate(greptime_flow_processing_time_bucket[$__rate_interval])) by (le,instance,pod,type))timeseriesFlow Operation Latency.prometheus--[{{instance}}]-[{{pod}}]-[{{type}}]-p95
Flow Buffer Size per Instancegreptime_flow_input_buf_sizetimeseriesFlow Buffer Size per Instance.prometheus--[{{instance}}]-[{{pod}}]
Flow Processing Error per Instancesum by(instance,pod,code) (rate(greptime_flow_errors[$__rate_interval]))timeseriesFlow Processing Error per Instance.prometheus--[{{instance}}]-[{{pod}}]-[{{code}}]

Trigger

TitleQueryTypeDescriptionDatasourceUnitLegend Format
Trigger Countgreptime_trigger_counttimeseriesTotal number of triggers currently defined.prometheus--__auto
Trigger Eval Elapsedhistogram_quantile(0.99, sum by (le) (rate(greptime_trigger_evaluate_elapsed_bucket[$__rate_interval])))
histogram_quantile(0.75, sum by (le) (rate(greptime_trigger_evaluate_elapsed_bucket[$__rate_interval])))
sum(rate(greptime_trigger_evaluate_elapsed_sum[$__rate_interval])) / sum(rate(greptime_trigger_evaluate_elapsed_count[$__rate_interval]))timeseriesElapsed time for trigger evaluation, including query execution and condition evaluation.prometheussp99
Trigger Eval Failure Raterate(greptime_trigger_evaluate_failure_count[$__rate_interval])timeseriesRate of failed trigger evaluations.prometheusnone__auto
Send Alert Elapsedhistogram_quantile(0.99, sum by (le) (rate(greptime_trigger_send_alert_elapsed_bucket[$__rate_interval])))
histogram_quantile(0.75, sum by (le) (rate(greptime_trigger_send_alert_elapsed_bucket[$__rate_interval])))
sum(rate(greptime_trigger_send_alert_elapsed_sum[$__rate_interval])) / sum(rate(greptime_trigger_send_alert_elapsed_count[$__rate_interval]))timeseriesElapsed time to send trigger alerts to notification channels.prometheussp99
Send Alert Failure Raterate(greptime_trigger_send_alert_failure_count[$__rate_interval])timeseriesRate of failures when sending trigger alerts.prometheusnone__auto
Save Alert Elapsedhistogram_quantile(0.99, sum by (le) (rate(greptime_trigger_save_alert_record_elapsed_bucket[$__rate_interval])))
histogram_quantile(0.75, sum by (le) (rate(greptime_trigger_save_alert_record_elapsed_bucket[$__rate_interval])))
sum(rate(greptime_trigger_save_alert_record_elapsed_sum[$__rate_interval])) / sum(rate(greptime_trigger_save_alert_record_elapsed_count[$__rate_interval]))timeseriesElapsed time to persist trigger alert records.prometheussp99
Save Alert Failure Raterate(greptime_trigger_save_alert_record_failure_count[$__rate_interval])timeseriesRate of failures when persisting trigger alert records.prometheusnone__auto
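
Each "Elapsed" panel above plots an average alongside its quantiles, computed as `rate(..._sum) / rate(..._count)`. The Python sketch below shows that arithmetic on two scrapes of a hypothetical histogram's `_sum` and `_count` counters. It ignores counter resets and the extrapolation that the real `rate()` performs, and the function names are illustrative only.

```python
def rate(first, last, seconds):
    """Per-second growth of a counter between two samples (simplified)."""
    return (last - first) / seconds


def average_latency(sum_first, sum_last, count_first, count_last, seconds):
    """Mean observation value over the window, or None if nothing was observed.

    PromQL would produce NaN for 0 / 0; None stands in for that here.
    """
    count_rate = rate(count_first, count_last, seconds)
    if count_rate == 0:
        return None
    return rate(sum_first, sum_last, seconds) / count_rate


# Hypothetical scrapes 60 s apart: 30 evaluations took 6 s in total.
avg = average_latency(10.0, 16.0, 100, 130, 60)
```

Since the window length cancels out, the result is simply the total elapsed time divided by the number of observations in the window, which is why the average can sit well below p99 when a few slow evaluations dominate the tail.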