# Standard Operational Procedures

The following page describes Standard Operational Procedures for alerts provided and managed by the Loki Operator for any LokiStack instance.
## Loki Request Errors

### Impact

A service(s) is unable to perform its duties for a number of requests, resulting in potential loss of data.

### Summary

A service(s) is failing to process at least 10% of all incoming requests.

### Severity

Critical

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

### Steps

- Check that the services that interact with storage (`ingester`, `querier`, `index-gateway`, `compactor`) can communicate with backend storage
- Inspect the WAL metrics `loki_ingester_wal_disk_full_failures_total` and `loki_ingester_wal_corruptions_total`

## LokiStack Write Request Errors

### Impact

The LokiStack Gateway component is unable to perform its duties for a number of write requests, resulting in potential loss of data.
### Summary

The LokiStack Gateway is failing to process at least 10% of all incoming write requests.

### Severity

Critical

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

### Steps

- Check that the `distributor`, `ingester`, and `index-gateway` components are ready and available
- Check that the services that interact with storage (`ingester`, `querier`, `index-gateway`, `compactor`) can communicate with backend storage
- Inspect the WAL metrics `loki_ingester_wal_disk_full_failures_total` and `loki_ingester_wal_corruptions_total`

## LokiStack Read Request Errors

### Impact

The LokiStack Gateway component is unable to perform its duties for a number of query requests, resulting in a potential disruption.
### Summary

The LokiStack Gateway is failing to process at least 10% of all incoming query requests.

### Severity

Critical

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

### Steps

- Check that the `query-frontend`, `querier`, `ingester`, and `index-gateway` components are ready and available
- Check that the services that interact with storage (`ingester`, `querier`, `index-gateway`, `compactor`) can communicate with backend storage
- Inspect the WAL metrics `loki_ingester_wal_disk_full_failures_total` and `loki_ingester_wal_corruptions_total`

## Loki Request Panics

### Impact

A service(s) is unavailable, resulting in potential loss of data.
### Summary

A service(s) has crashed.

### Severity

Critical

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

## Loki Request Latency

### Impact

A service(s) is affected by slow request responses.
### Summary

A service(s) is slower than expected at processing data.

### Severity

Critical

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

### Steps

- Inspect the `cortex_query_scheduler_inflight_requests` metric

## Loki Tenant Rate Limit

### Impact

A tenant is being rate limited, resulting in potential loss of data.
### Summary

A service(s) is rate limiting at least 10% of all incoming requests.

### Severity

Warning

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

### Steps

- Use the `loki_discarded_samples_total{namespace="<namespace>"}` metric to identify the affected tenant and the reason for the rate limiting
- For query limits, `MaxEntriesLimitPerQuery`, `MaxChunksPerQuery`, or `MaxQuerySeries` can be changed to raise the limit
- For ingestion limits, match the reason to the corresponding limit keys:

| Reason | Corresponding Ingestion Limit Keys |
|---|---|
| `rate_limited` | `ingestionRate`, `ingestionBurstSize` |
| `stream_limit` | `maxGlobalStreamsPerTenant` |
| `label_name_too_long` | `maxLabelNameLength` |
| `label_value_too_long` | `maxLabelValueLength` |
| `line_too_long` | `maxLineSize` |
| `max_label_names_per_series` | `maxLabelNamesPerSeries` |
| `per_stream_rate_limit` | `perStreamRateLimit`, `perStreamRateLimitBurst` |
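Where raising a limit is appropriate, the ingestion keys above map into the LokiStack custom resource under `spec.limits.global.ingestion`, and the query limits under `spec.limits.global.queries`. A minimal sketch; the resource name and all numeric values below are illustrative assumptions, not recommendations:

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: lokistack-sample        # example name
  namespace: openshift-logging
spec:
  limits:
    global:
      ingestion:
        ingestionRate: 8              # MB/s; pairs with the rate_limited reason
        ingestionBurstSize: 16        # MB
        maxGlobalStreamsPerTenant: 25000
        maxLineSize: 256000           # bytes; pairs with line_too_long
        perStreamRateLimit: 5         # MB/s
        perStreamRateLimitBurst: 15   # MB
      queries:
        maxEntriesLimitPerQuery: 10000
        maxChunksPerQuery: 2000000
        maxQuerySeries: 1000
```

Per-tenant overrides can be expressed in the same shape under `spec.limits.tenants`, which keeps a noisy tenant from forcing a global limit increase.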
## Loki Storage Slow Write

### Impact

The cluster is unable to push logs to backend storage in a timely manner.

### Summary

The cluster is unable to push logs to backend storage in a timely manner.

### Severity

Warning

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

## Loki Storage Slow Read

### Impact

The cluster is unable to retrieve logs from backend storage in a timely manner.
### Summary

The cluster is unable to retrieve logs from backend storage in a timely manner.

### Severity

Warning

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

## Loki Write Path High Load

### Impact

The write path is under high pressure and requires a storage flush.
### Summary

The write path is flushing the storage in response to back-pressure.

### Severity

Warning

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

## Loki Read Path High Load

### Impact

The read path is under high load.
### Summary

The query queue is currently under high load.

### Severity

Warning

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

## Loki Discarded Samples Warning

### Impact

Loki is discarding samples (log entries) because they fail validation. This alert only fires for errors that are not retryable, which means that the discarded samples are lost.
### Summary

Loki can reject log entries (samples) during submission when they fail validation. This happens on a per-stream basis, so only the specific samples or streams failing validation are lost.

The possible validation errors are documented in the Loki documentation. This alert only fires for the validation errors that are not retryable, which means that discarded samples are permanently lost.

The alerting can only show the affected Loki tenant. Since Loki 3.1.0, more detailed information about the affected streams is provided in an error message emitted by the distributor component. This information can be used to pinpoint the application sending the offending logs.

For some of the validations there are configuration parameters in the LokiStack's limits structure that can be tuned if the messages should be accepted. Usually it is recommended to fix the issue either in the emitting application (if possible) or by changing the collector configuration to correct non-compliant messages before sending them to Loki.
### Severity

Warning

### Access Required

- `openshift-logging` (LokiStack)

## Loki Ingester Flush Failures

### Impact

Loki ingesters are unable to flush chunks to backend storage at a critical rate (>20% failure rate), resulting in potential data loss and Write-Ahead Log (WAL) disk pressure.

### Summary

One or more Loki ingesters are failing to flush at least 20% of their chunks to backend storage over a 5-minute period. This indicates issues with storage connectivity, authentication, or storage capacity that require immediate intervention.

### Severity

Critical

### Access Required

- `openshift-logging` (LokiStack)
- `openshift-operators-redhat` (Loki Operator)

### Steps

Immediate Actions:

- Check the ingester logs: `kubectl logs -n <namespace> <ingester-pod>`
- Check the LokiStack status: `kubectl -n <namespace> describe lokistack <lokistack-name>`
- Check the WAL metrics:
  - `sum(loki_ingester_wal_bytes_in_use) by (pod, namespace)`: current WAL disk usage
  - `sum(rate(loki_ingester_wal_disk_full_failures_total[5m])) by (pod, namespace)`: rate of failures due to a full WAL disk

Root Cause Analysis:

Resolution Steps:
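The immediate checks above can be run from a terminal; a sketch, assuming a LokiStack named `lokistack-sample` in the `openshift-logging` namespace (both names are placeholders, substitute your own):

```shell
NS=openshift-logging   # namespace hosting the LokiStack (assumption)

# Ingester logs: look for chunk flush errors (storage connectivity, auth, capacity)
kubectl -n "$NS" logs lokistack-sample-ingester-0 | grep -iE 'flush|storage' | tail -n 50

# LokiStack status conditions reported by the operator
kubectl -n "$NS" describe lokistack lokistack-sample

# WAL pressure queries (run as PromQL in the cluster's metrics UI):
#   sum(loki_ingester_wal_bytes_in_use) by (pod, namespace)
#   sum(rate(loki_ingester_wal_disk_full_failures_total[5m])) by (pod, namespace)
```

A sustained non-zero disk-full failure rate alongside growing `wal_bytes_in_use` points at storage-side problems rather than a transient spike.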
## LokiStack Schema Upgrades Required

### Impact

The LokiStack warns on a newer object storage schema being available for configuration.

### Summary

The schema configuration does not contain the most recent schema version and needs an update.

### Severity

Warning

### Access Required

- `openshift-logging` (LokiStack)

## LokiStack Components Not Ready

### Impact

One or more LokiStack components are not ready, which can disrupt ingestion or querying and lead to degraded service.
### Summary

The LokiStack reports that some components have not reached the Ready state. This might be related to Kubernetes resources (Pods/Deployments), configuration, or external dependencies.

### Severity

Warning

### Access Required

- `openshift-logging` (LokiStack)

### Steps

- Check the LokiStack status with `kubectl -n <namespace> describe lokistack <name>` and identify the components that have not reached the `Ready` state
- Check the operator logs: `kubectl -n <operator-namespace> logs deploy/loki-operator-controller-manager`
- Inspect the workloads of the affected components (`distributor`, `ingester`, `querier`, `query-frontend`, `index-gateway`, `compactor`, `gateway`):
  - `kubectl -n <namespace> get pods`
  - `kubectl -n <namespace> describe pod <pod>`
  - `kubectl -n <namespace> get events --sort-by=.lastTimestamp`
- Verify that the referenced Secrets and ConfigMaps exist and have correct keys
- If a Pod is running but not `Ready`, inspect its logs:
  `kubectl -n <namespace> logs <pod>`
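The triage steps above can be combined into a quick sketch; the namespaces and the `lokistack-sample` name are assumptions to be replaced with your own values:

```shell
NS=openshift-logging              # LokiStack namespace (assumption)
OPNS=openshift-operators-redhat   # operator namespace (assumption)

# Which components have not reached Ready?
kubectl -n "$NS" describe lokistack lokistack-sample

# Operator logs often name the missing Secret/ConfigMap or invalid configuration
kubectl -n "$OPNS" logs deploy/loki-operator-controller-manager --tail=100

# Drill into the affected workloads and recent cluster events
kubectl -n "$NS" get pods
kubectl -n "$NS" get events --sort-by=.lastTimestamp | tail -n 20
```

Working top-down like this (stack status, then operator, then individual Pods) usually localizes the failing component before any Pod-level log spelunking is needed.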