docs/sources/datasources/prometheus/troubleshooting/index.md
This document provides troubleshooting information for common errors you may encounter when using the Prometheus data source in Grafana.
The following errors occur when Grafana cannot establish or maintain a connection to Prometheus.
Error message: "There was an error returned querying the Prometheus API"
Cause: Grafana cannot establish a network connection to the Prometheus server.
Solution:
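- Verify that the data source URL includes the protocol (http:// or https://).
- Verify that the port is correct (Prometheus listens on 9090 by default).
- If Grafana and Prometheus don't run on the same host or in the same container, use a reachable hostname or IP address instead of localhost.

Error message: "context deadline exceeded" or "request timeout"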
Cause: The connection to Prometheus timed out before receiving a response.
Solution:
Error message: "Failed to parse data source URL"
Cause: The URL entered in the data source configuration is not valid.
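As a rough pre-check before saving the data source, the URL can be validated with Python's standard library. This is a minimal sketch: the function name `url_problems` and the wording of the messages are hypothetical, and the checks only approximate what Grafana itself validates.

```python
from urllib.parse import urlsplit


def url_problems(url):
    """Return a list of human-readable problems found in a data source URL."""
    problems = []
    parts = urlsplit(url.strip())
    # Without an explicit http/https scheme the rest of the URL cannot be
    # interpreted reliably, so stop after reporting it.
    if parts.scheme not in ("http", "https"):
        problems.append("URL must start with http:// or https://")
        return problems
    if not parts.hostname:
        problems.append("URL is missing a hostname")
    try:
        parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        problems.append("URL has an invalid port")
    return problems
```

An empty list means the URL is at least structurally valid; it doesn't prove that Prometheus is reachable at that address.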
Solution:
- Enter a complete, valid URL (for example, http://localhost:9090 or https://prometheus.example.com:9090).
- Make sure the URL includes the protocol (http:// or https://).

Symptom: You've successfully tested the data source connection, but no metric data appears in Explore or Metrics Drilldown.
Cause: The wrong data source is selected, or the data source name doesn't match expectations.
Solution:
- Select the correct data source. If you use remote_write to send metrics to Grafana Cloud, the data source name follows the convention grafanacloud-<stackname>-prom.
- Run a simple query such as up in the Explore view to confirm that data is arriving.
- Verify that the remote_write endpoint URL and credentials are correct.

Error messages: "host unreachable", "EOF", "network unreachable", "connection reset by peer", "dial tcp: lookup ... no such host"
Symptom: Prometheus queries fail intermittently or consistently when using Private data source connect (PDC) to reach a Prometheus instance behind a private network. The data source test may pass occasionally but queries fail under load.
Cause: PDC tunnels traffic through an SSH connection between Grafana Cloud and your PDC agent. Connectivity failures are most commonly caused by DNS resolution issues, network configuration on the customer side, or the PDC agent's default connection limits being too low for the query volume.
Solutions:
Verify DNS resolution from the PDC agent host. The PDC agent must be able to resolve the Prometheus hostname from the machine it runs on. Run nslookup or dig for the Prometheus URL from the agent host to confirm.
Check network connectivity from the agent. Ensure the PDC agent can reach the Prometheus endpoint directly (for example, curl http://prometheus-host:9090/-/healthy from the agent machine).
Increase parallel SSH connections. The PDC agent defaults to 1 parallel SSH connection, which can bottleneck under load from multiple alert evaluations or dashboard queries. Increase this by setting the --ssh-connections flag (or PDC_SSH_CONNECTIONS environment variable) to a higher value (for example, 4 or 8):

    pdc-agent --ssh-connections=4

Check firewall rules. Ensure the PDC agent's outbound SSH connection to Grafana Cloud isn't being interrupted by firewall rules, NAT gateways, or idle connection timeouts.
Verify the PDC agent is running and healthy. Check agent logs for connection errors or restarts. The agent must maintain a persistent connection to Grafana Cloud.
Check for idle timeout issues. If the connection drops after periods of inactivity, configure TCP keepalives on the host or add a keepalive setting to the PDC agent configuration.
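The DNS and reachability checks above can also be scripted and run from the PDC agent host. The following is a minimal sketch using only the Python standard library; the helper names (`split_host_port`, `can_resolve`, `is_healthy`) are hypothetical, and `http://prometheus-host:9090` is the placeholder URL used earlier in this section.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def split_host_port(prometheus_url):
    """Extract (hostname, port) from a URL, falling back to the scheme's default port."""
    parts = urlsplit(prometheus_url)
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname or "", parts.port or default_port


def can_resolve(host):
    """Return True if the local resolver can resolve the hostname."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def is_healthy(prometheus_url, timeout_s=5.0):
    """Return True if the /-/healthy endpoint answers with HTTP 200."""
    url = prometheus_url.rstrip("/") + "/-/healthy"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

For example, calling `can_resolve(split_host_port("http://prometheus-host:9090")[0])` followed by `is_healthy("http://prometheus-host:9090")` on the agent machine separates a DNS failure from an HTTP-level failure.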
{{< admonition type="note" >}} PDC connectivity issues are almost always caused by networking on the customer side (DNS, firewall rules, routing), not by Grafana Cloud. The data source test passing doesn't guarantee sustained connectivity under load — it only verifies a single query succeeds. {{< /admonition >}}
For general PDC setup and configuration, refer to Private data source connect (PDC) and Configure PDC.
Error message: "x509: certificate signed by unknown authority" or "certificate verify failed"
Cause: Grafana cannot verify the TLS certificate presented by Prometheus.
Solution:
Error message: "TLS: handshake failure" or "connection reset"
Cause: The TLS handshake between Grafana and Prometheus failed.
Solution:
The following errors occur when there are issues with authentication credentials or permissions.
Error message: "401 Unauthorized" or "Authorization failed"
Cause: The authentication credentials are invalid or missing.
Solution:
Error messages: "ACCESS_TOKEN_EXPIRED", "401 Unauthorized" in alerting but not in Explore
Symptom: Queries in Explore and dashboards work correctly, but alert rule evaluations fail intermittently with 401 errors. This is most common with Google Managed Prometheus (GMP) and Azure-managed Prometheus endpoints using OAuth/OIDC authentication.
Cause: The Grafana alerting backend and the interactive query path (Explore, dashboards) handle credential refreshes differently. The alerting evaluator can use a cached OAuth token beyond its expiry window due to a token staleness check issue in the Prometheus data source. This causes alerting to fail with expired credentials while interactive queries succeed because they trigger a fresh token exchange.
Solutions:
For Google Managed Prometheus (GMP):
For Azure Managed Prometheus:
General steps:
{{< admonition type="note" >}} This token caching behavior is a known issue that has received code fixes in recent Grafana releases. If you're experiencing this on an older Grafana version, upgrading may resolve it. Check the Grafana changelog for relevant fixes. {{< /admonition >}}
Symptom: You've enabled teamHttpHeadersMimir and configured Team LBAC rules, but users can still see all metrics regardless of their team assignments.
Cause: Label-Based Access Control (LBAC) for the Prometheus data source only works when the backend is Grafana Cloud Metrics (Mimir) or Grafana Enterprise Metrics (GEM). It doesn't work with Google Managed Prometheus, self-hosted Prometheus, Thanos, or other Prometheus-compatible endpoints. The LBAC enforcement relies on Mimir-specific HTTP headers (X-Scope-OrgID and team-scoped label matchers) that other backends ignore.
Solution:
Symptom: The Azure AD or SigV4 authentication options don't appear in the authentication drop-down when configuring the Prometheus data source.
Cause: These authentication methods require server-side feature flags that aren't enabled by default, particularly on Grafana Cloud.
Solution:
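- For Azure AD authentication, set azure_auth_enabled = true under [auth] in the Grafana configuration.
- For SigV4 authentication, set sigv4_auth_enabled = true under [auth] in the Grafana configuration.

Error message: "403 Forbidden" or "Access denied"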
Cause: The authenticated user does not have permission to access the requested resource.
Solution:
The following errors occur when there are issues with PromQL syntax or query execution.
Error message: "parse error: unexpected character" or "bad_data: 1:X: parse error"
Cause: The PromQL query contains invalid syntax.
Alternative cause: A proxy between Grafana and Prometheus requires authentication. When proxy authentication fails, the proxy redirects the request to an HTML authentication page. Grafana cannot parse the HTML response, which results in a parse error. This appears to be a query issue but is actually a proxy authentication issue.
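To tell the two causes apart, it can help to look at the raw response body rather than the parsed error. The sketch below is illustrative only: `classify_response` is a hypothetical helper, and the heuristics (content type and leading markup) are assumptions about what a typical proxy login page looks like.

```python
import json


def classify_response(body, content_type=""):
    """Return 'json', 'html', or 'unknown' for a raw HTTP response body."""
    # A proxy login page is usually served as text/html.
    if "text/html" in content_type.lower():
        return "html"
    stripped = body.lstrip().lower()
    if stripped.startswith("<!doctype html") or stripped.startswith("<html"):
        return "html"
    # The Prometheus HTTP API answers with JSON, including for query errors.
    try:
        json.loads(body)
        return "json"
    except ValueError:
        return "unknown"
```

If the body classifies as `html`, the parse error most likely comes from the proxy's authentication page and not from the PromQL expression.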
Solution:
Symptom: The query returns no data and the visualization is empty.
Cause: The specified metric does not exist in Prometheus, or there is no data for the selected time range.
Solution:
- Check which metric names exist by querying the Prometheus API at /api/v1/label/__name__/values.

Error message: "query timed out in expression evaluation" or "query processing would load too many samples"
Cause: The query took longer than the configured timeout limit or would return too many samples.
Solution:
- Use functions such as sum(), avg(), or rate() to reduce the number of time series.
- Increase the query.timeout or query.max-samples settings in Prometheus if you have admin access.

Error message: "exceeded maximum resolution of 11,000 points per timeseries" or "maximum number of series limit exceeded"
Cause: The query is returning more time series or data points than the configured limits allow.
Solution:
Error messages: "max-estimated-memory-consumption-per-query limit exceeded", "query requires too much memory", "max samples limit reached"
Symptom: Queries against high-cardinality metrics (thousands of unique label combinations) over long time ranges (days or weeks) fail with memory or sample limit errors. Short time ranges may work but expanding the range causes the failure.
Cause: Prometheus and Mimir enforce per-query memory and sample limits to protect the system from resource exhaustion. High-cardinality metrics (for example, metrics with a pod, request_id, or user_id label) multiplied by long time ranges produce result sets that exceed these limits.
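A back-of-the-envelope calculation shows how quickly cardinality multiplied by time range grows. This is a rough sketch, not how Prometheus or Mimir actually account for samples: `estimated_samples` is a hypothetical helper, and the 50,000,000 figure is assumed to be the Prometheus default for query.max-samples, so verify it against your own server configuration.

```python
# Assumed Prometheus default for --query.max-samples; check your own setup.
ASSUMED_MAX_SAMPLES = 50_000_000


def estimated_samples(series_count, range_seconds, scrape_interval_seconds):
    """Roughly estimate the raw samples a query touches across all matched series."""
    samples_per_series = range_seconds // scrape_interval_seconds
    return series_count * samples_per_series


# 5,000 series over 7 days at a 15-second scrape interval.
week = estimated_samples(5_000, 7 * 24 * 3600, 15)
# The same series over 1 hour.
hour = estimated_samples(5_000, 3600, 15)
exceeds_limit = week > ASSUMED_MAX_SAMPLES
```

Under these assumptions the one-hour query touches about 1.2 million samples, while the seven-day query touches about 201.6 million, which is why short ranges can succeed where long ranges fail.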
Solutions:
Reduce the query scope:
- Add label filters to narrow the set of series the query selects (for example, by namespace or job).

Use recording rules to pre-aggregate:
Create recording rules that pre-compute the aggregation you need. For example, if you commonly query sum(rate(http_requests_total[5m])) by (service), create a recording rule for it and query the pre-aggregated metric instead.

    groups:
      - name: aggregations
        rules:
          - record: service:http_requests_total:rate5m
            expr: sum(rate(http_requests_total[5m])) by (service)

Use Adaptive Metrics (Grafana Cloud): If you're on Grafana Cloud, Adaptive Metrics automatically identifies and aggregates high-cardinality metrics that aren't being queried at full resolution, reducing storage and query costs.
Restructure dashboards for high-cardinality data:
Increase limits (self-managed only): If you have admin access to Prometheus or Mimir, you can increase limits in the server configuration:
- Prometheus: --query.max-samples flag
- Mimir: -querier.max-samples, -querier.max-estimated-memory-consumption-per-query

{{< admonition type="note" >}} Increasing limits allows larger queries to succeed but also increases the risk of resource exhaustion. Prefer reducing query scope or using recording rules over raising limits. {{< /admonition >}}
Error message: "unknown function" or "parse error: unexpected aggregation"
Cause: The query uses an invalid or unsupported PromQL function.
Solution:
rate() or increase() returning unexpected values

Symptom: increase() returns fractional values on integer counters, rate() shows an ever-increasing value instead of a steady per-second rate, or counter resets cause large spikes in visualizations.
Possible causes and solutions:
| Cause | Solution |
|---|---|
| increase() returns fractional values | Expected behavior: Prometheus extrapolates the increase to cover the full range window, so results on integer counters can be fractional. Use ceil() or floor() if you need integers. |
| rate() grows over time | Multiple instances write to the same series without unique labels. Ensure each target has unique instance/pod labels and aggregate with sum by. |
| Counter reset spikes after pod restarts | Use $__rate_interval or a longer range vector to smooth spikes. Investigate frequent restarts as the root cause. |
| Values differ between edit mode and dashboard | Panel width affects $__interval which affects rate() window calculations. Set a Min step on the query. |
For detailed explanations of these behaviors, refer to Expected PromQL behaviors.
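The fractional increase() results in the table above follow from extrapolation. The sketch below is a deliberately simplified model: `simplified_increase` is a hypothetical function that assumes no counter resets and always extrapolates to the window boundaries, whereas Prometheus itself also handles resets and limits how far it extrapolates.

```python
def simplified_increase(samples, window_start, window_end):
    """Scale the observed counter delta up to the full window length.

    samples: list of (timestamp_seconds, counter_value), sorted by time,
    with no counter resets.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    raw_delta = v_last - v_first
    sampled_span = t_last - t_first
    window = window_end - window_start
    return raw_delta * window / sampled_span


# Four scrapes, 15 seconds apart, inside a 60-second window.
samples = [(5, 10), (20, 12), (35, 13), (50, 15)]
result = simplified_increase(samples, window_start=0, window_end=60)
```

The counter rose by exactly 5 over the 45 seconds that were sampled, but scaling that to the 60-second window gives roughly 6.67, which is not an integer.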
Symptom: Queries that aggregate by label names containing dots (for example, container.name) return incorrect or incomplete results.
Cause: Prior to Grafana 13, there was a bug where labels with dots in their names were not handled correctly during aggregation operations like sum by or avg by.
Solution:
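- Upgrade to Grafana 13 or later, where aggregations on dotted label names are handled correctly (for example, sum by (container.name) (metric_name)).

The following errors occur when the data source is not configured correctly.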
Error message: Unexpected behavior when querying metrics or labels
Cause: The Prometheus type setting does not match your actual Prometheus-compatible database.
Solution:
Symptom: Data appears sparse, or rate() queries return no data or incomplete results.
Cause: The Scrape interval setting in Grafana does not match the actual scrape interval in Prometheus. This especially affects rate() queries, which require at least two data points within the specified time window. For example, if your actual scrape interval is 5 minutes but Grafana uses the default (15 seconds for OSS, 1 minute for Grafana Cloud), a query like rate(http_requests_total[1m]) returns no data because a 1-minute window never contains the two data points that rate() needs.
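The two-data-point requirement can be checked with simple arithmetic. This is a minimal sketch that ignores scrape jitter and exact boundary handling; the function names are hypothetical.

```python
def max_samples_in_window(window_seconds, scrape_interval_seconds):
    """Best case: a sample lands exactly on the edge of the window."""
    return window_seconds // scrape_interval_seconds + 1


def rate_can_return_data(window_seconds, scrape_interval_seconds):
    """rate() needs at least two samples, so check the best case."""
    return max_samples_in_window(window_seconds, scrape_interval_seconds) >= 2


def rate_always_has_data(window_seconds, scrape_interval_seconds):
    """Worst case: require two samples regardless of how scrapes align."""
    return window_seconds // scrape_interval_seconds >= 2
```

With a 300-second scrape interval, `rate_can_return_data(60, 300)` is false because a 1-minute window holds at most one sample, while a 20-minute window (`rate_always_has_data(1200, 300)`) always holds enough.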
Solution:
- Set the Scrape interval in the data source configuration to match the Prometheus scrape_interval setting.
- Use $__rate_interval instead of hard-coded time windows in rate() queries. This variable automatically adjusts based on your scrape interval.

$__rate_interval returns no data or incorrect values

Symptom: Queries using $__rate_interval return no data, return different values in edit mode versus the dashboard, or produce unexpected gaps.
Cause: $__rate_interval is calculated as max($__interval + scrape_interval, 4 * scrape_interval). If any input to this formula is incorrect, the resulting window is wrong — either too small (no data) or inconsistent across contexts.
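The formula is easy to evaluate by hand when debugging. The following sketch restates it in Python; `rate_interval` is a hypothetical helper name, and both arguments are in seconds.

```python
def rate_interval(interval_seconds, scrape_interval_seconds):
    """Evaluate max($__interval + scrape_interval, 4 * scrape_interval)."""
    return max(
        interval_seconds + scrape_interval_seconds,
        4 * scrape_interval_seconds,
    )


# Scrape interval configured as 15s, $__interval of 15s.
configured_15s = rate_interval(15, 15)
# Scrape interval configured as 60s, same $__interval.
configured_60s = rate_interval(15, 60)
# A wide time range with a large $__interval dominates the result.
wide_range = rate_interval(300, 15)
```

With the scrape interval set to 15 seconds the window is 60 seconds; with it set to 60 seconds the window grows to 240 seconds. If Prometheus actually scrapes every 60 seconds but the data source still says 15 seconds, the 60-second window is too narrow to reliably contain two samples.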
Common causes and solutions:
| Cause | Solution |
|---|---|
| Data source scrape interval left at default 15s while actual Prometheus scrape interval is longer (for example, 60s) | Set the Scrape interval under Interval behavior in the data source configuration to match your Prometheus scrape_interval. |
| Query works in edit mode but shows gaps on the dashboard | Panel size affects $__interval. Smaller panels produce larger intervals. Set a Min step on the query to enforce a consistent floor. |
| LBAC-enabled data source doesn't inherit scrape interval | Set the Min step explicitly on each query panel rather than relying on data source inheritance. |
| Using $__rate_interval in recording rules or alerting | Use a fixed interval (for example, [5m]) instead of $__rate_interval in contexts without a panel/dashboard. |
To debug the current value:
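- Inspect the query request that Grafana sends (for example, with the Query inspector). The resolved value of $__rate_interval is visible in the request.

For detailed documentation on how $__rate_interval works and how to configure it, refer to Use $__rate_interval.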
The following issues affect query speed and data freshness.
Symptom: Queries take a long time to execute, dashboards are slow to load, or the loading spinner persists.
Cause: Queries scan too much data, the Prometheus server is overloaded, or the network connection is slow.
Solution:
Symptom: The visualization doesn't show the most recent data, even after refreshing.
Cause: Scrape timing, clock drift, or dashboard refresh settings.
Solution:
- For rate() and similar functions, the most recent partial scrape interval won't have enough data points; this is expected.

Symptom: Exemplar data doesn't appear on graphs even though you expect it.
Cause: Exemplars require specific configuration in both the data source and the query editor.
Solution:
The following issues occur when using Prometheus as a data source for annotations.
Symptom: You've configured a Prometheus annotation query, but no annotations appear on your dashboard.
Possible causes and solutions:
| Cause | Solution |
|---|---|
| Query returns no data in the current time range | Verify the query returns results in Explore for the dashboard's time range. |
| Query returns continuous data (too many annotations) | Every returned data point creates an annotation. If the query returns hundreds of points, annotations may render but are too dense to see. Increase the Min step or refine your query to only return data at event moments. |
| Wrong data source selected | Verify the correct Prometheus data source is selected in the annotation configuration. |
| Annotation is disabled | Check that the annotation toggle is enabled (eye icon) in the dashboard's annotation settings. |
| Time range mismatch | Expand the dashboard time range to include the events you expect to see. |
{{< admonition type="note" >}}
Prometheus annotations create a marker for every data point returned by the query. There's no automatic filtering of zero values. If you only want annotations at specific moments, your PromQL expression must return data only at those times (for example, using > 0, changes() > 0, or the ALERTS metric).
{{< /admonition >}}
For more information on configuring annotations, refer to Prometheus annotations.
The following issues occur when using Prometheus with Grafana Alerting.
Error messages: sse.dependencyError, sse.dataQueryError, "context deadline exceeded", "i/o timeout"
Symptom: Alert rules intermittently fire due to execution errors rather than genuine threshold breaches. On-call teams receive false positive notifications. Alert state history shows error states caused by transient backend issues (network blips, HTTP 502/500 responses, timeouts) rather than actual metric conditions being met.
Cause: By default, when an alert rule encounters an execution error or timeout, Grafana sets the alert state to Alerting — which fires the alert. Transient connectivity issues between Grafana and Prometheus (i/o timeouts, deadline exceeded, brief outages) trigger this behavior even though the underlying metric hasn't crossed its threshold.
Solution:
This ensures the alert retains its previous state during transient errors and only fires when a successful evaluation confirms the threshold is breached.
If errors are frequent, also investigate:
For configuration details, refer to Configure alert state for execution errors.
Symptom: An alert rule using a Prometheus query shows evaluation errors or remains in a "No Data" state.
Possible causes and solutions:
| Cause | Solution |
|---|---|
| Template variables in query | Alert queries don't support template variables. Replace variables with hard-coded values. |
| Query timeout | Simplify the query or increase the evaluation timeout. Use recording rules for complex expressions. |
| Data source unreachable | Verify the Prometheus data source connection is working (test it in the data source settings). |
| No data in range | Ensure the metric has recent data. Check that Prometheus is actively scraping the target. |
Symptom: Prometheus alerting rules don't appear in the Grafana Alerting UI.
Solution:
- Verify that the Prometheus rules API is reachable from Grafana (/api/v1/rules).

For more information on alerting with Prometheus, refer to Prometheus alerting.
If you continue to experience issues after following this troubleshooting guide: