You can use Grafana Alerting with Prometheus to create alerts based on your time-series data. This allows you to monitor metrics, detect anomalies, and receive notifications when specific conditions are met.
For general information about Grafana Alerting, refer to Grafana Alerting.
Before creating alerts with Prometheus, ensure you have:
Prometheus supports two alerting workflows in Grafana:
| Type | Description |
|---|---|
| Grafana-managed alert rules | Alert rules defined and evaluated within Grafana, using Prometheus as the query data source. You create and manage these entirely in the Grafana Alerting UI. |
| Data source-managed rules | Alerting rules defined in Prometheus itself (in rule files referenced from prometheus.yml). When Manage alerts via Alerting UI is enabled in the data source configuration, Grafana displays these existing rules in the Alerting UI. For Prometheus (unlike Mimir), this view is read-only. |
To create a Grafana-managed alert rule using Prometheus:
For detailed instructions, refer to Create a Grafana-managed alert rule.
When Manage alerts via Alerting UI is enabled in the Prometheus data source configuration, Grafana fetches and displays alerting rules defined in Prometheus. These appear in the Alerting UI alongside Grafana-managed rules but are marked as data source-managed.
For Prometheus data sources, this view is read-only. To modify these rules, update your Prometheus rule files directly.
{{< admonition type="note" >}} For Mimir and Cortex data sources, the Alerting UI supports both viewing and creating data source-managed rules. Prometheus only supports viewing. {{< /admonition >}}
Alert rules are organized into evaluation groups. Each group has an evaluation interval that determines how frequently the rules in that group are evaluated. For example, an evaluation interval of 1m means the alert query runs every 60 seconds.
The pending period determines how long a condition must be continuously true before the alert fires. For example, with a 1-minute evaluation interval and a 5-minute pending period, the condition must be true for 5 consecutive evaluations before firing.
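The relationship between the evaluation interval and the pending period can be sketched as a small simulation. This is a simplified model that follows the counting used in the paragraph above, not Grafana's actual scheduler, and the real behavior may differ by one evaluation depending on timing. The function names `evaluations_required` and `simulate` are hypothetical.

```python
import math


def evaluations_required(interval_s: int, pending_s: int) -> int:
    """Consecutive breaching evaluations needed before firing (simplified model)."""
    if interval_s <= 0:
        raise ValueError("evaluation interval must be positive")
    # A pending period of 0 means the rule fires on the first breach.
    return max(1, math.ceil(pending_s / interval_s))


def simulate(results, interval_s, pending_s):
    """Map a sequence of per-evaluation condition results to alert states."""
    required = evaluations_required(interval_s, pending_s)
    streak = 0  # number of consecutive evaluations where the condition was true
    states = []
    for breached in results:
        streak = streak + 1 if breached else 0
        if streak >= required:
            states.append("Firing")
        elif streak > 0:
            states.append("Pending")
        else:
            states.append("Normal")
    return states


# 1-minute interval, 5-minute pending period.
print(simulate([True, True, False, True, True, True, True, True], 60, 300))
```

In this model, a single evaluation where the condition is false resets the streak, which is why a flapping metric can stay in Pending without ever firing.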
Choose evaluation intervals based on your use case:
The following examples show common alerting scenarios with Prometheus. Each example shows the PromQL query and how to configure the alert condition.
Monitor CPU usage and alert when it exceeds 90%:
Query A:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100)
Condition: Set the threshold to alert when the last value of A is above 90.
This approach separates the metric query from the threshold, making it easier to adjust the threshold later without editing the PromQL.
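As a rough illustration of why this separation helps, the sketch below models the two stages independently: reducing each series to its last value, then comparing it to a threshold. It is an assumption-laden model of the reduce and threshold steps, not Grafana code; `last_value`, `breaching_instances`, and the sample data are hypothetical.

```python
def last_value(samples):
    """Reduce a series to its most recent sample, or None if it is empty."""
    return samples[-1] if samples else None


def breaching_instances(series_by_instance, threshold=90.0):
    """Return the instances whose last value is above the threshold."""
    breaching = []
    for instance, samples in series_by_instance.items():
        value = last_value(samples)
        if value is not None and value > threshold:
            breaching.append(instance)
    return sorted(breaching)


# Hypothetical CPU usage percentages per instance, oldest sample first.
cpu_usage = {
    "node-a": [40.0, 95.5],
    "node-b": [97.0, 60.0],
    "node-c": [],
}
print(breaching_instances(cpu_usage))
```

Changing the threshold only changes the argument passed to `breaching_instances`. The query that produces the samples stays the same.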
Monitor memory usage across nodes:
Query A:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Condition: Alert when the last value of A exceeds 85.
Monitor HTTP error rates per service:
Query A:
sum(rate(http_requests_total{status=~"5.."}[$__rate_interval])) by (job)
/
sum(rate(http_requests_total[$__rate_interval])) by (job)
* 100
Condition: Alert when the last value of A exceeds 5 (meaning error rate above 5%).
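The arithmetic behind this query can be sketched in Python. The sketch assumes standard PromQL vector matching, where a job appears in the result only if it has a series on both sides of the division, and floating-point division, where 0 / 0 yields NaN. The function name and the sample rates are hypothetical.

```python
import math


def error_rate_percent(errors_per_s, total_per_s):
    """Per-job error percentage, mirroring errors / total * 100."""
    rates = {}
    for job, errors in errors_per_s.items():
        if job not in total_per_s:
            continue  # no matching series on the right-hand side, so no result
        total = total_per_s[job]
        if total == 0:
            # Floating-point division by zero: NaN for 0/0, +Inf otherwise.
            rates[job] = math.nan if errors == 0 else math.inf
        else:
            rates[job] = errors / total * 100
    return rates


# Hypothetical per-second rates. "batch" has no 5xx series at all.
errors = {"api": 1.0, "web": 0.0}
totals = {"api": 8.0, "web": 0.0, "batch": 10.0}
print(error_rate_percent(errors, totals))
```

A job that has never returned a 5xx response has no numerator series, so it is missing from the result rather than reported as 0%.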
Monitor whether Prometheus scrape targets are reachable:
Query A:
up{job="myservice"}
Condition: Alert when the last value of A is below 1.
Use absent() to detect when a metric stops being scraped entirely — for example, when a service crashes and no longer reports metrics:
Query A:
absent(up{job="myservice"})
Condition: Alert when the last value of A equals 1 (the absent() function returns 1 when the metric is missing, and nothing when the metric exists).
For detecting staleness over a time window (metric exists but hasn't reported recently):
absent_over_time(up{job="myservice"}[5m])
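The difference between a target that reports `up == 0` and a metric that is missing entirely can be sketched as follows. This is a toy model of the `absent()` semantics described above. Label handling is omitted, and the function names are hypothetical.

```python
def absent(instant_vector):
    """Mimic absent(): one sample with value 1 if the input is empty, else nothing."""
    return [1.0] if not instant_vector else []


def should_alert(instant_vector):
    """Alert when the last value of absent(...) equals 1."""
    result = absent(instant_vector)
    return bool(result) and result[-1] == 1.0


print(should_alert([]))     # the metric is missing entirely
print(should_alert([1.0]))  # the target is up and reporting
print(should_alert([0.0]))  # the target is down, but the series still exists
```

The last case does not trigger the `absent()` rule, so it needs the separate `up` below 1 condition shown earlier.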
Use multiple queries and expressions to alert only when multiple conditions are true simultaneously. This reduces noise by avoiding alerts during low-traffic periods.
Query A — P95 latency:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[$__rate_interval])) by (le))
Query B — Request rate:
sum(rate(http_requests_total{job="api"}[$__rate_interval]))
Expression C — Math (both conditions must be true):
$A > 2 && $B > 100
Condition: Alert when C is non-zero. The expression evaluates to 1 only when both latency exceeds 2 seconds AND request rate exceeds 100 req/s; otherwise it evaluates to 0.
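The combined condition can be sketched in Python, assuming that comparison and logical operators in the Math expression yield 1 for true and 0 for false. The function name `math_condition` and the sample values are hypothetical.

```python
def math_condition(p95_latency_s, requests_per_s):
    """Model of `$A > 2 && $B > 100`: 1 when both comparisons hold, else 0."""
    latency_high = p95_latency_s > 2
    traffic_high = requests_per_s > 100
    return 1 if (latency_high and traffic_high) else 0


print(math_condition(2.5, 150))  # slow under real load
print(math_condition(2.5, 20))   # slow, but too little traffic to matter
print(math_condition(0.4, 500))  # busy, but fast
```

Only the first case should notify. The second case is the low-traffic noise that this pattern is meant to filter out.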
Grafana-managed recording rules pre-compute expensive PromQL expressions on a schedule and write the results back to a Prometheus-compatible data source as a new metric. This lets you query the pre-aggregated metric instead of re-evaluating the expensive expression on every dashboard load or alert evaluation.
Enable the data source as a recording rules target: In the Prometheus data source configuration, verify that Allow as recording rules target is toggled on (it's on by default). This allows Grafana to write recording rule results back to this instance.
Verify write access: The Prometheus-compatible backend must support remote write. Grafana Cloud Metrics (Mimir) and self-hosted Mimir support this natively. Standard Prometheus requires the --web.enable-remote-write-receiver flag (Prometheus 2.33+).
Create a recording rule:
Navigate to Alerting > Alert rules.
Click New alert rule.
Select Recording rule as the rule type (under the Grafana-managed section).
Enter the PromQL expression you want to pre-compute:
sum(rate(http_requests_total[$__rate_interval])) by (service)
Enter a metric name for the result (for example, service:http_requests:rate5m). Follow the Prometheus recording rule naming convention: level:metric:operations.
Select the Target data source — the Prometheus instance where results will be written.
Set the evaluation interval (for example, every 1 minute).
Click Save rule.
Query the recorded metric: After the first evaluation, the new metric is available for dashboards and alerts:
service:http_requests:rate5m{service="api"}
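To check a metric name against the level:metric:operations convention before saving a rule, a small helper like the following could be used. This is only a sketch: the helper `follows_naming_convention` is hypothetical, and it checks the three-part shape of the name, not whether each part is meaningful.

```python
import re

# Each colon-separated part must itself be a valid identifier without colons.
_PART = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def follows_naming_convention(name):
    """Return True if the name has exactly three non-empty parts: level:metric:operations."""
    parts = name.split(":")
    return len(parts) == 3 and all(_PART.match(part) for part in parts)


print(follows_naming_convention("service:http_requests:rate5m"))
print(follows_naming_convention("http_requests_rate5m"))
```

The first call matches the convention. The second does not, because the name contains no colons.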
If your Prometheus instance is behind Private data source connect (PDC), Grafana can write recording rule results through the PDC tunnel. No additional configuration is needed — PDC supports both reads and writes.
Standard Prometheus requires the --web.enable-remote-write-receiver flag.
Use a fixed range (for example, [5m]) rather than $__rate_interval.
For more information, refer to Create Grafana-managed recording rules.
When using Prometheus with Grafana Alerting, be aware of the following limitations.
Alert queries don't support template variables. Grafana evaluates alert rules on the backend without dashboard context, so variables like $instance or $job aren't resolved.
If your dashboard query uses template variables, create a separate query for alerting with hard-coded values or use label matchers directly.
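When copying a dashboard query into an alert rule, a quick check for leftover template variables can help. The sketch below is a hypothetical lint helper, not part of Grafana. It assumes that names starting with `__` (such as `$__rate_interval`) are built-ins rather than dashboard variables, and it recognizes only the `$name` and `${name}` syntaxes.

```python
import re

# Matches $name and ${name}, capturing the variable name.
_VARIABLE = re.compile(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)")


def dashboard_variables(query):
    """Return the dashboard template variables referenced in a query, sorted."""
    names = _VARIABLE.findall(query)
    # Assumption: a leading double underscore marks a built-in such as $__rate_interval.
    return sorted({name for name in names if not name.startswith("__")})


print(dashboard_variables('up{instance="$instance", job="${job}"}'))
print(dashboard_variables('rate(http_requests_total{job="api"}[$__rate_interval])'))
```

An empty result suggests the query has no dashboard variables left to replace with hard-coded values or label matchers.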
Complex queries with many nested functions or large result sets may time out or fail to evaluate. Simplify queries for alerting by:
When using OAuth-authenticated Prometheus endpoints (Google Managed Prometheus, Azure Managed Prometheus), queries may succeed in Explore and dashboards but fail intermittently during alert evaluation. This happens because the alerting backend handles token refresh differently from the interactive query path.
If you're using GCP, consider the datasource-syncer pattern — a sidecar process that refreshes OAuth tokens and updates the data source credentials on a schedule shorter than the token lifetime.
For detailed troubleshooting steps, refer to OAuth token expiration errors.
Grafana can display Prometheus alerting rules but can't create or modify them through the UI. To manage Prometheus-native alerting rules, edit your Prometheus rule files directly and reload the configuration.
By default, when a Grafana-managed alert rule encounters an execution error or timeout (such as a network blip, i/o timeout, or a transient 502 from Prometheus), the rule enters an Error state — which fires the alert. This can cause false alarms and spam on-call teams when the underlying issue is a brief connectivity interruption rather than a genuine threshold breach.
To prevent false positives from transient errors, configure the Alert state if execution error or timeout setting on each alert rule:
{{< admonition type="note" >}} If your alert rules frequently enter an error state, investigate the root cause (network stability, Prometheus resource limits, query timeout settings) rather than relying solely on this setting to suppress notifications. {{< /admonition >}}
Common transient errors that trigger this behavior include:
sse.dependencyError or sse.dataQueryError in alert state history
For more details on troubleshooting these errors, refer to Troubleshoot Prometheus data source issues.
Follow these best practices when creating Prometheus alerts:
Use $__rate_interval: When using rate() or increase() in alert queries, use $__rate_interval to ensure the range window is always large enough relative to the scrape interval. Grafana resolves this variable based on the evaluation interval and scrape interval configuration.
Use absent() for availability monitoring: Detect when metrics stop being reported, which often indicates a crashed or unresponsive service.
If you encounter errors when creating or evaluating alert rules, refer to Troubleshoot Prometheus data source issues.