docs/sources/operations/meta-monitoring/metrics.md
Loki exposes many metrics, and each component behaves differently under load. This page focuses on the highest-signal metrics for detecting negative trends early.
{{< admonition type="note" >}} The example queries on this page are PromQL. Run them against the Prometheus-compatible data source where your Loki metrics are stored (for example, Prometheus, Mimir, or Grafana Cloud Metrics). {{< /admonition >}}
For setup and prebuilt dashboards and alerts, refer to:
Watch request failures first. A sustained increase in 5xx responses is usually the earliest sign of user-visible impact.
Key metric:
- loki_request_duration_seconds_count (counter with labels including status_code, job, and route)

Example query:
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
/
sum(rate(loki_request_duration_seconds_count[2m])) by (cluster, namespace, job, route)
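As a sketch, the same ratio can be compared against a threshold so that only the breaching series are returned. The value 10 assumes the 10% threshold quoted for LokiRequestErrors on this page. The 15-minute hold belongs to the alert rule, not to the expression.

```promql
# Return only the routes whose 5xx ratio is above 10% (assumed threshold).
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
  /
sum(rate(loki_request_duration_seconds_count[2m])) by (cluster, namespace, job, route)
  > 10
```

Routes with no 5xx responses in the window return no series, so an empty result is the healthy state.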
Abnormal behavior:
LokiRequestErrors fires when this ratio is greater than 10% for 15 minutes.

Latency degradation can appear before hard failures. Track p99 for read and write routes.
Key metric:
- loki_request_duration_seconds_bucket (histogram buckets)

Example query:
histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[1m])) by (le, cluster, namespace, job, route))
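To look at the write and read paths separately, the same quantile can be filtered by route. The route values below are assumptions based on common Loki route names and may differ in your deployment. Check the values of the route label in your own metrics first.

```promql
# p99 latency for the push (write) route only. The route value is an assumption.
histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route="loki_api_v1_push"}[1m])) by (le, cluster, namespace, job, route))

# p99 latency for range queries (read). The route value is an assumption.
histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route="loki_api_v1_query_range"}[1m])) by (le, cluster, namespace, job, route))
```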
Abnormal behavior:
LokiRequestLatency fires when p99 exceeds 1 second for 15 minutes.

Panics are high-severity faults and should stay at zero.
Key metric:
- loki_panic_total

Example query:
sum(increase(loki_panic_total[10m])) by (cluster, namespace, job)
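Because the expected value is zero, a minimal check is to compare the same expression against zero. This is a sketch of the condition, not necessarily the exact rule that LokiRequestPanics uses.

```promql
# Return only the jobs that recorded at least one panic in the last 10 minutes.
sum(increase(loki_panic_total[10m])) by (cluster, namespace, job) > 0
```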
Abnormal behavior:
LokiRequestPanics treats this as critical.

Discarded samples indicate data that Loki rejected or dropped. This is one of the most important ingestion-quality signals.
Key metric:
- loki_discarded_samples_total

Example query:
topk(10, sum by (tenant, reason) (rate(loki_discarded_samples_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval])))
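Two variations can help with triage: totals per reason across all tenants, and a drill-down into a single reason. The reason value rate_limited is an assumption used for illustration. Substitute a value that appears in your own data.

```promql
# Discard rate per reason, summed over all tenants.
sum by (reason) (rate(loki_discarded_samples_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))

# Tenants affected by one specific reason. The reason value is an assumption.
sum by (tenant) (rate(loki_discarded_samples_total{cluster="$cluster", namespace="$namespace", reason="rate_limited"}[$__rate_interval]))
```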
Abnormal behavior:
reason values (for example, tenant limits or stream limits).

Compaction issues can silently degrade read performance and retention behavior over time.
{{< admonition type="note" >}}
The compaction and retention metrics use a loki_boltdb_shipper_ prefix for historical reasons. The compactor emits these metrics regardless of which index type you use, including TSDB.
{{< /admonition >}}
Key metrics:
- loki_boltdb_shipper_compactor_running
- loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds
- loki_boltdb_shipper_compact_tables_operation_total
- loki_boltdb_shipper_compact_tables_operation_duration_seconds

Example queries:
sum(loki_boltdb_shipper_compactor_running) by (cluster, namespace)
time() - (loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds > 0)
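Both queries can be turned into threshold checks. The sketch below assumes that a single compactor is expected per cluster and namespace, and uses an illustrative three-hour limit for the time since the last successful run. Tune both to your deployment.

```promql
# More than one compactor reports as running (assumes one is expected).
sum(loki_boltdb_shipper_compactor_running) by (cluster, namespace) > 1

# No successful compaction in the last 3 hours (illustrative threshold).
min(time() - (loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds > 0)) by (cluster, namespace) > 3 * 60 * 60
```

The `> 0` filter drops compactors that have not yet recorded a successful run, which would otherwise report a very large age.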
Abnormal behavior:
Ingester pressure often appears as memory growth, poor chunk utilization, or flush backlog.
Key metrics:
- loki_ingester_memory_streams
- loki_ingester_memory_chunks
- loki_ingester_flush_queue_length
- loki_ingester_chunk_utilization
- loki_ingester_chunks_flushed_total

Example queries:
sum(loki_ingester_memory_streams{cluster="$cluster", namespace="$namespace"})
sum(loki_ingester_flush_queue_length{cluster="$cluster", namespace="$namespace"})
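The remaining metrics in the list can be queried in a similar way. The sketch below assumes that loki_ingester_chunks_flushed_total carries a reason label and that loki_ingester_chunk_utilization is exposed as a histogram with a _bucket series. Verify both against your Loki version.

```promql
# Flush rate broken down by flush reason (assumes a reason label).
sum by (reason) (rate(loki_ingester_chunks_flushed_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))

# Median chunk utilization at flush time (assumes a histogram with _bucket series).
histogram_quantile(0.5, sum by (le) (rate(loki_ingester_chunk_utilization_bucket{cluster="$cluster", namespace="$namespace"}[$__rate_interval])))
```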
Abnormal behavior:
Throughput changes help identify upstream sender issues, sudden traffic shifts, or ingestion bottlenecks.
Key metrics:
- loki_distributor_bytes_received_total
- loki_distributor_lines_received_total

Example queries:
sum(rate(loki_distributor_bytes_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
sum(rate(loki_distributor_lines_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
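Dividing the two rates gives the average line size, which can help distinguish a change in volume from a change in payload shape. The per-tenant breakdown assumes that the metric carries a tenant label.

```promql
# Average bytes per received line.
sum(rate(loki_distributor_bytes_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
  /
sum(rate(loki_distributor_lines_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))

# Top 10 tenants by received bytes per second (assumes a tenant label).
topk(10, sum by (tenant) (rate(loki_distributor_bytes_received_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval])))
```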
Abnormal behavior:
Object store latency and failures directly impact query and retention workflows.
Key metrics:
- loki_objstore_bucket_operations_total
- loki_objstore_bucket_operation_failures_total
- loki_objstore_bucket_operation_duration_seconds

Example queries:
sum by (operation) (rate(loki_objstore_bucket_operation_failures_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
histogram_quantile(0.99, sum(rate(loki_objstore_bucket_operation_duration_seconds_bucket{cluster="$cluster", namespace="$namespace"}[$__rate_interval])) by (le, operation))
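A raw failure rate is hard to judge without the request volume. As a sketch, the failure counter can be divided by the total operations counter to get a percentage per operation.

```promql
# Percentage of object store operations that failed, per operation.
100 * sum by (operation) (rate(loki_objstore_bucket_operation_failures_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
  /
sum by (operation) (rate(loki_objstore_bucket_operations_total{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```

Operations with no recorded failure series in the window return no result.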
Abnormal behavior:
get, get_range, or upload.

Resource pressure can explain or predict service degradation before alert thresholds are crossed.
Common signals to track:
Abnormal behavior:
If you run Loki Canary, use it as an end-to-end correctness signal, not only a performance signal.
Key metrics:
- loki_canary_missing_entries_total
- loki_canary_spot_check_missing_entries_total
- loki_canary_response_latency_seconds_bucket

Example query:
sum(increase(loki_canary_missing_entries_total{cluster=~"$cluster", namespace=~"$namespace"}[$__range]))
/
sum(increase(loki_canary_entries_total{cluster=~"$cluster", namespace=~"$namespace"}[$__range]))
* 100
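The other two canary metrics can be queried in the same style. This is a sketch that reuses the dashboard variables from the query above. The p99 quantile is an illustrative choice.

```promql
# Entries missed by spot checks over the selected dashboard range.
sum(increase(loki_canary_spot_check_missing_entries_total{cluster=~"$cluster", namespace=~"$namespace"}[$__range]))

# p99 of the response latency observed by the canary.
histogram_quantile(0.99, sum by (le) (rate(loki_canary_response_latency_seconds_bucket{cluster=~"$cluster", namespace=~"$namespace"}[$__rate_interval])))
```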
Abnormal behavior:
Internal logs provide fast context when metrics indicate degradation.
Key metric:
- loki_internal_log_messages_total

Use this metric with component logs to correlate where failures begin.
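A minimal sketch of such a correlation is the rate of error-level messages per job. It assumes that the metric carries a level label with an error value. Confirm the label name and values in your environment.

```promql
# Error-level log messages per second, per job (assumes a level label).
sum by (job) (rate(loki_internal_log_messages_total{cluster="$cluster", namespace="$namespace", level="error"}[$__rate_interval]))
```

A job whose error rate rises first is a reasonable place to start reading logs.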
Retention and sweeper lag can cause storage growth and delayed data lifecycle actions.
Key metrics:
- loki_compactor_apply_retention_last_successful_run_timestamp_seconds
- loki_boltdb_shipper_retention_sweeper_marker_file_processing_current_time
- loki_boltdb_shipper_retention_sweeper_chunk_deleted_duration_seconds_count

Example query:
time() - (loki_boltdb_shipper_retention_sweeper_marker_file_processing_current_time{cluster="$cluster", namespace="$namespace"} > 0)
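The same age calculation can be applied to the last successful retention run, and the deletion counter can be read as a rate. Both are sketches built from the metric names listed in this section.

```promql
# Seconds since retention was last applied successfully.
time() - (loki_compactor_apply_retention_last_successful_run_timestamp_seconds{cluster="$cluster", namespace="$namespace"} > 0)

# Chunk deletions processed per second by the sweeper.
sum(rate(loki_boltdb_shipper_retention_sweeper_chunk_deleted_duration_seconds_count{cluster="$cluster", namespace="$namespace"}[$__rate_interval]))
```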
Abnormal behavior:
metrics.go, during incident response.