Prometheus alerting

You can use Grafana Alerting with Prometheus to create alerts based on your time-series data. This allows you to monitor metrics, detect anomalies, and receive notifications when specific conditions are met.

For general information about Grafana Alerting, refer to Grafana Alerting.

Before you begin

Before creating alerts with Prometheus, ensure you have:

A Prometheus data source configured in Grafana
Appropriate permissions to create alert rules
Understanding of the PromQL metrics you want to monitor

Alert rule types

Prometheus supports two alerting workflows in Grafana:

| Type | Description |
| --- | --- |
| Grafana-managed alert rules | Alert rules defined and evaluated within Grafana, using Prometheus as the query data source. You create and manage these entirely in the Grafana Alerting UI. |
| Data source-managed rules | Alerting rules defined in Prometheus itself, in the rule files that `prometheus.yml` loads through its `rule_files` setting. When Manage alerts via Alerting UI is enabled in the data source configuration, Grafana displays these existing rules in the Alerting UI. For Prometheus (unlike Mimir), this view is read-only. |

Create a Grafana-managed alert rule

To create a Grafana-managed alert rule using Prometheus:

Navigate to Alerting > Alert rules.
Click New alert rule.
Enter a name for the alert rule.
Select your Prometheus data source.
Write a PromQL query in the query editor.
Configure the alert condition (for example, when the last value is above a threshold).
Set the evaluation interval and pending period.
Configure notifications and labels.
Click Save rule.

For detailed instructions, refer to Create a Grafana-managed alert rule.

View data source-managed rules

When Manage alerts via Alerting UI is enabled in the Prometheus data source configuration, Grafana fetches and displays alerting rules defined in Prometheus. These appear in the Alerting UI alongside Grafana-managed rules but are marked as data source-managed.

For Prometheus data sources, this view is read-only. To modify these rules, update your Prometheus rule files directly.

{{< admonition type="note" >}} For Mimir and Cortex data sources, the Alerting UI supports both viewing and creating data source-managed rules. Prometheus only supports viewing. {{< /admonition >}}

Evaluation groups and intervals

Alert rules are organized into evaluation groups. Each group has an evaluation interval that determines how frequently the rules in that group are evaluated. For example, an evaluation interval of 1m means the alert query runs every 60 seconds.

The pending period determines how long a condition must be continuously true before the alert fires. For example, with a 1-minute evaluation interval and a 5-minute pending period, the condition must be true on every evaluation for at least 5 minutes before the alert fires. A single evaluation where the condition is false returns the rule to Normal and restarts the clock.
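The interaction between evaluation interval and pending period can be sketched as a small simulation. This is a simplified model for illustration, not Grafana's implementation. It assumes evaluations start at t=0 and that the pending clock starts at the first evaluation where the condition is true.

```python
from typing import List, Optional, Tuple


def simulate_pending(results: List[bool], interval_s: int, pending_s: int) -> List[Tuple[int, str]]:
    """Simplified model of pending-period handling (not Grafana's actual code).

    results[i] is whether the alert condition was true at evaluation i;
    evaluations run every interval_s seconds starting at t=0.
    """
    timeline: List[Tuple[int, str]] = []
    active_since: Optional[int] = None
    for i, condition_true in enumerate(results):
        now = i * interval_s
        if not condition_true:
            active_since = None  # any false evaluation resets the rule
            state = "Normal"
        else:
            if active_since is None:
                active_since = now  # first true evaluation starts the clock
            # Fire once the condition has held for at least the pending period.
            state = "Firing" if now - active_since >= pending_s else "Pending"
        timeline.append((now, state))
    return timeline


# 1-minute evaluation interval, 5-minute pending period, condition always true.
for t, state in simulate_pending([True] * 7, interval_s=60, pending_s=300):
    print(f"t={t:>3}s {state}")
```

In this model the rule stays Pending from t=0 through t=240s and first fires at t=300s. Any false evaluation in between sends it back to Normal.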

Choose evaluation intervals based on your use case:

15s–30s — Critical infrastructure alerts where fast detection matters.
1m — Standard monitoring alerts (recommended default).
5m — Non-urgent or noisy metrics where you want to reduce evaluation load.
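To put the load trade-off in numbers, the following arithmetic counts how many times a single rule runs its query per day at each interval above. It ignores query cost, which varies with the PromQL expression.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def evaluations_per_day(interval_s: int) -> int:
    """Number of times one rule is evaluated per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_s


for label, seconds in [("15s", 15), ("30s", 30), ("1m", 60), ("5m", 300)]:
    print(f"{label:>3}: {evaluations_per_day(seconds):>5} evaluations/day per rule")
```

A 15-second interval issues 5,760 queries per rule per day. A 5-minute interval issues 288.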

Example alert queries

The following examples show common alerting scenarios with Prometheus. Each example shows the PromQL query and how to configure the alert condition.

Alert on high CPU usage

Monitor CPU usage and alert when it exceeds 90%:

Query A:

```promql
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100)
```

Condition: Set the threshold to alert when the last value of A is above 90.

This approach separates the metric query from the threshold, making it easier to adjust the threshold later without editing the PromQL.
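The following sketch mirrors the two separate steps in plain Python. The first function reproduces the CPU formula from per-core idle rates, and the second applies the threshold. The sample idle rates are hypothetical. The threshold step assumes each series has already been reduced to its last value, as the condition describes.

```python
from typing import Dict, List


def cpu_busy_percent(idle_rates_by_instance: Dict[str, List[float]]) -> Dict[str, float]:
    """Mirror of 100 - (avg by (instance) (rate(idle)) * 100).

    Each list holds per-CPU idle rates (idle seconds per second, 0..1) for one
    instance, as returned by rate(node_cpu_seconds_total{mode="idle"}[...]).
    """
    return {
        instance: 100 - (sum(rates) / len(rates)) * 100
        for instance, rates in idle_rates_by_instance.items()
    }


def above_threshold(last_values: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """Threshold condition: a series fires when its last value is above the threshold."""
    return {series: value > threshold for series, value in last_values.items()}


# Hypothetical sample data: two 4-core instances.
busy = cpu_busy_percent({
    "node-a:9100": [0.05, 0.08, 0.02, 0.05],  # about 95% busy
    "node-b:9100": [0.60, 0.70, 0.50, 0.60],  # about 40% busy
})
print(above_threshold(busy, 90))
```

Changing the alert from 90% to 80% touches only the `threshold` argument, not the query function.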

Alert on high memory usage

Monitor memory usage across nodes:

Query A:

```promql
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```

Condition: Alert when the last value of A exceeds 85.
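As a worked example of the memory formula, the following sketch applies the same arithmetic to one node. The 16 GiB node with 2 GiB available is hypothetical sample data.

```python
def memory_used_percent(mem_available_bytes: float, mem_total_bytes: float) -> float:
    """Mirror of (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100."""
    if mem_total_bytes <= 0:
        raise ValueError("MemTotal must be positive")
    return (1 - mem_available_bytes / mem_total_bytes) * 100


GIB = 1024 ** 3
used = memory_used_percent(mem_available_bytes=2 * GIB, mem_total_bytes=16 * GIB)
print(f"{used:.1f}% used, fires: {used > 85}")
```

With 2 of 16 GiB available, usage is 87.5%, which is above the 85 threshold.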

Alert on high error rate

Monitor HTTP error rates per service:

Query A:

```promql
sum(rate(http_requests_total{status=~"5.."}[$__rate_interval])) by (job)
  /
sum(rate(http_requests_total[$__rate_interval])) by (job)
  * 100
```

Condition: Alert when the last value of A exceeds 5 (meaning error rate above 5%).
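The error-rate query divides one per-job sum by another. The following sketch reproduces that arithmetic from hypothetical counter readings. It simplifies `rate()` to (end - start) / window, ignoring counter resets and PromQL's extrapolation. Like the PromQL division, it returns no result for a job that has no 5xx series at all.

```python
import re
from typing import Dict, Tuple

# Hypothetical counter readings: (job, status) -> (value at window start, value at window end)
Counters = Dict[Tuple[str, str], Tuple[float, float]]


def error_rate_percent(counters: Counters, window_s: float) -> Dict[str, float]:
    """Approximate sum(rate(5xx)) by (job) / sum(rate(all)) by (job) * 100.

    Simplification: rate() is modeled as (end - start) / window_s, ignoring
    counter resets and extrapolation at the window edges.
    """
    errors: Dict[str, float] = {}
    totals: Dict[str, float] = {}
    for (job, status), (start, end) in counters.items():
        per_second = (end - start) / window_s
        totals[job] = totals.get(job, 0.0) + per_second
        if re.fullmatch(r"5..", status):  # PromQL regex matchers are fully anchored
            errors[job] = errors.get(job, 0.0) + per_second
    # As with PromQL vector matching, a job with no 5xx series has no result.
    return {job: errors[job] / totals[job] * 100 for job in errors if totals[job] > 0}


counters = {
    ("api", "200"): (10_000, 10_900),
    ("api", "503"): (100, 200),
    ("web", "200"): (5_000, 5_600),
}
rates = error_rate_percent(counters, window_s=300)
print({job: round(pct, 1) for job, pct in rates.items()})
```

Here the `api` job served 1,000 requests in the window, 100 of them 5xx, so its error rate is 10% and it would fire at the 5% threshold.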

Alert on target down

Monitor whether Prometheus scrape targets are reachable:

Query A:

```promql
up{job="myservice"}
```

Condition: Alert when the last value of A is below 1.
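Grafana creates one alert instance per series that a query returns, so this rule can fire separately for each target of the job. The following sketch models that per-series evaluation over hypothetical last values of `up`.

```python
from typing import Dict, List


def firing_instances(up_by_instance: Dict[str, float]) -> List[str]:
    """Condition 'last value of A is below 1', evaluated once per series.

    up is 1 when the last scrape succeeded and 0 when it failed.
    """
    return sorted(instance for instance, value in up_by_instance.items() if value < 1)


# Hypothetical last values of up{job="myservice"}, keyed by instance label.
print(firing_instances({"10.0.0.1:8080": 1, "10.0.0.2:8080": 0, "10.0.0.3:8080": 1}))
```

If a target vanishes from service discovery, its `up` series disappears instead of dropping to 0. This check then has nothing to evaluate for that target, which is the gap `absent()` covers.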

Alert when a metric disappears

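Use absent() to detect when a metric stops being scraped entirely, for example when a service's targets disappear from service discovery and no longer report metrics. Prometheus keeps reporting up with a value of 0 for a target it knows about but can't scrape, so absent(up{job="myservice"}) fires only when no target for the job exists at all. Use the target-down example above for targets that are still configured but unreachable: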

Query A:

```promql
absent(up{job="myservice"})
```

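Condition: Alert when the last value of A equals 1. The absent() function returns a single series with value 1 when no matching series exists, and returns no data when the metric is present. Because no data is the healthy case here, set Alert state if no data or all values are null to OK so that a healthy service doesn't leave the rule in the No Data state.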
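The following sketch models the behaviour of `absent()` described above. It is simplified to equality matchers only and is not Prometheus code. When nothing matches, it returns one sample with value 1, labelled with the equality matchers except the metric name. When at least one series matches, it returns an empty result. The series data is hypothetical.

```python
from typing import Dict, List


def absent(series: List[Dict[str, str]], matchers: Dict[str, str]) -> List[Dict[str, object]]:
    """Simplified model of PromQL absent() for equality matchers only."""
    matched = any(all(s.get(k) == v for k, v in matchers.items()) for s in series)
    if matched:
        return []  # metric present: no data, which is the healthy case
    # Metric missing: one sample with value 1, carrying the equality-matcher labels.
    labels = {k: v for k, v in matchers.items() if k != "__name__"}
    return [{"labels": labels, "value": 1}]


selector = {"__name__": "up", "job": "myservice"}
scraped = [{"__name__": "up", "job": "myservice", "instance": "10.0.0.1:8080"}]
print(absent(scraped, selector))  # target still known to Prometheus
print(absent([], selector))       # no target for the job at all
```

The empty result in the first call is why a rule built on `absent()` sees no data while the service is healthy.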

To tolerate short gaps in scraping, use absent_over_time(), which returns 1 only when the metric has no samples anywhere in the range window (here, the last 5 minutes):

```promql
absent_over_time(up{job="myservice"}[5m])
```
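The following sketch models `absent_over_time()` for a single series, using hypothetical scrape timestamps. It returns `[1]` only when no sample falls inside the window ending at the evaluation time. Boundary handling is simplified.

```python
from typing import List


def absent_over_time(sample_timestamps: List[float], now: float, window_s: float) -> List[int]:
    """Simplified model of absent_over_time(metric[window]) for one series.

    Returns [1] when no sample falls inside the window ending at `now`,
    and [] (no data) otherwise.
    """
    has_sample = any(now - window_s < t <= now for t in sample_timestamps)
    return [] if has_sample else [1]


# Hypothetical scrapes every 15 s that stop after t=630 s.
scrapes = list(range(0, 631, 15))
print(absent_over_time(scrapes, now=900, window_s=300))   # last sample (630 s) is inside the window
print(absent_over_time(scrapes, now=1000, window_s=300))  # no sample since 700 s
```

A gap shorter than the 5-minute window leaves at least one sample in range, so the expression stays empty and the rule doesn't fire.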

Multi-condition alert (high latency AND high traffic)

Use multiple queries and expressions to alert only when multiple conditions are true simultaneously. This reduces noise by avoiding alerts during low-traffic periods.

Query A — P95 latency:

```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[$__rate_interval])) by (le))
```

Query B — Request rate:

```promql
sum(rate(http_requests_total{job="api"}[$__rate_interval]))
```

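Expression C (Math, both conditions must be true):

```
$A > 2 && $B > 100
```

Condition: Set C as the alert condition. Configure A and B as instant queries (or reduce them with Reduce expressions) so that each returns a single number. Grafana Math comparison and logical operators return 1 when true and 0 when false, so C is 1, and the rule fires, only when P95 latency exceeds 2 seconds and the request rate exceeds 100 req/s.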
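The following sketch models how the Math expression combines the two queries. It assumes A and B have already been reduced to single numbers and that comparisons and `&&` yield 1 or 0. The function name and default thresholds are illustrative.

```python
def latency_and_traffic(a_p95_seconds: float, b_requests_per_second: float,
                        latency_threshold_s: float = 2.0, rate_threshold: float = 100.0) -> int:
    """Model of the Math expression $A > 2 && $B > 100 on two reduced numbers.

    Returns 1 when both comparisons are true and 0 otherwise; a non-zero
    condition value means the rule is firing.
    """
    return int(a_p95_seconds > latency_threshold_s and b_requests_per_second > rate_threshold)


print(latency_and_traffic(2.5, 150))  # slow and busy
print(latency_and_traffic(2.5, 40))   # slow but quiet: suppressed
print(latency_and_traffic(1.2, 500))  # busy but fast
```

Only the first case fires. The slow-but-quiet case is exactly the low-traffic noise this pattern suppresses.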

Recording rules

Grafana-managed recording rules pre-compute expensive PromQL expressions on a schedule and write the results back to a Prometheus-compatible data source as a new metric. This lets you query the pre-aggregated metric instead of re-evaluating the expensive expression on every dashboard load or alert evaluation.

When to use recording rules

Dashboard panels that query the same expensive expression repeatedly — pre-compute it once and query the result.
Alert rules on complex expressions — simplify the alert query by alerting on the pre-aggregated metric.
High-cardinality aggregations — reduce thousands of series to a handful of pre-computed series.

Set up Grafana-managed recording rules for Prometheus

Enable the data source as a recording rules target: In the Prometheus data source configuration, verify that Allow as recording rules target is toggled on (it's on by default). This allows Grafana to write recording rule results back to this instance.
Verify write access: The Prometheus-compatible backend must support remote write. Grafana Cloud Metrics (Mimir) and self-hosted Mimir support this natively. Standard Prometheus requires the --web.enable-remote-write-receiver flag (Prometheus 2.33+).
Create a recording rule:
1. Navigate to Alerting > Alert rules.
2. Click New alert rule.
3. Select Recording rule as the rule type (under the Grafana-managed section).
4. Enter the PromQL expression you want to pre-compute. Use a fixed range window rather than $__rate_interval, because recording rules are evaluated on a fixed schedule rather than over a dashboard time range:
```promql
sum(rate(http_requests_total[5m])) by (service)
```
5. Enter a metric name for the result (for example, service:http_requests:rate5m). Follow the Prometheus recording rule naming convention: level:metric:operations.
6. Select the Target data source, the Prometheus instance where results will be written.
7. Set the evaluation interval (for example, every 1 minute).
8. Click Save rule.
Query the recorded metric: After the first evaluation, the new metric is available for dashboards and alerts:
```promql
service:http_requests:rate5m{service="api"}
```
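To catch naming mistakes before saving a rule, you could check the result name against the level:metric:operations convention. The following helper is hypothetical and not part of Grafana or Prometheus. It checks the Prometheus metric-name character set, in which colons are allowed and conventionally reserved for recording rules, then splits the name into its three parts.

```python
import re
from typing import NamedTuple

# Prometheus metric names: letters, digits, underscores, and colons; no leading digit.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


class RecordingRuleName(NamedTuple):
    level: str
    metric: str
    operations: str


def parse_recording_rule_name(name: str) -> RecordingRuleName:
    """Hypothetical helper: validate and split a level:metric:operations name."""
    if not METRIC_NAME_RE.fullmatch(name):
        raise ValueError(f"{name!r} is not a valid Prometheus metric name")
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"{name!r} does not follow level:metric:operations")
    return RecordingRuleName(*parts)


print(parse_recording_rule_name("service:http_requests:rate5m"))
```

For `service:http_requests:rate5m`, the level is the aggregation label (`service`), the metric is `http_requests`, and the operation is a 5-minute rate.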

Recording rules with PDC

If your Prometheus instance is behind Private data source connect (PDC), Grafana can write recording rule results through the PDC tunnel. No additional configuration is needed — PDC supports both reads and writes.

Recording rule limitations

Recording rules require the target data source to support remote write. Standard Prometheus needs the --web.enable-remote-write-receiver flag.
Thanos does not support recording rules as a write target (refer to the Prometheus type comparison).
Recording rule evaluation uses the configured evaluation interval, not dashboard time ranges. Use a fixed range vector (for example, [5m]) rather than $__rate_interval.

For more information, refer to Create Grafana-managed recording rules.

Limitations

When using Prometheus with Grafana Alerting, be aware of the following limitations.

Template variables not supported

Alert queries don't support template variables. Grafana evaluates alert rules on the backend without dashboard context, so variables like $instance or $job aren't resolved.

If your dashboard query uses template variables, create a separate query for alerting with hard-coded values or use label matchers directly.

Query complexity

Complex queries with many nested functions or large result sets may time out or fail to evaluate. Simplify queries for alerting by:

Reducing the time range used in range vectors
Using appropriate aggregation to limit the number of returned series
Adding label filters to narrow the data scanned
Using recording rules to pre-compute expensive expressions

OAuth token handling differs between Explore and Alerting

When using OAuth-authenticated Prometheus endpoints (Google Managed Prometheus, Azure Managed Prometheus), queries may succeed in Explore and dashboards but fail intermittently during alert evaluation. This happens because the alerting backend handles token refresh differently from the interactive query path.

If you're using GCP, consider the datasource-syncer pattern — a sidecar process that refreshes OAuth tokens and updates the data source credentials on a schedule shorter than the token lifetime.

For detailed troubleshooting steps, refer to OAuth token expiration errors.

Data source-managed rules are read-only

Grafana can display Prometheus alerting rules but can't create or modify them through the UI. To manage Prometheus-native alerting rules, edit your Prometheus rule files directly and reload the configuration.

Configure alert state for execution errors

By default, when a Grafana-managed alert rule encounters an execution error or timeout (such as a network blip, an i/o timeout, or a transient 502 from Prometheus), the rule enters an Error state, which fires an alert. This can cause false alarms and page on-call teams when the underlying issue is a brief connectivity interruption rather than a genuine threshold breach.

To prevent false positives from transient errors, configure the Alert state if execution error or timeout setting on each alert rule:

Open the alert rule for editing.
In the alert conditions section, locate Alert state if execution error or timeout.
Change the value from the default to one of:
- Keep Last State: The alert retains its previous state (firing or normal) until a successful evaluation occurs. This is the recommended setting for most Prometheus alert rules.
- OK: The alert is set to normal during the error, preventing it from firing.
Click Save rule.
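The effect of each option on a failed evaluation can be summarized as a small state function. This is a simplified model for illustration, not Grafana's state manager. It covers only the transition caused by one failed evaluation and ignores Pending and No Data handling.

```python
from enum import Enum


class ExecErrState(Enum):
    """Options for 'Alert state if execution error or timeout' (simplified)."""
    ERROR = "Error"
    ALERTING = "Alerting"
    KEEP_LAST_STATE = "Keep Last State"
    OK = "OK"


def state_after_error(previous: str, setting: ExecErrState) -> str:
    """State a rule moves to when one evaluation fails, under each setting."""
    if setting is ExecErrState.KEEP_LAST_STATE:
        return previous  # stay Normal or Alerting until a successful evaluation
    if setting is ExecErrState.OK:
        return "Normal"  # treat the failed evaluation as healthy
    if setting is ExecErrState.ALERTING:
        return "Alerting"
    return "Error"  # the rule reports an error state, which fires an alert


for option in ExecErrState:
    print(f"{option.value:>15}: Normal -> {state_after_error('Normal', option)}")
```

Under Keep Last State, a rule that was Normal stays Normal through a transient 502, so no notification goes out.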

{{< admonition type="note" >}} If your alert rules frequently enter an error state, investigate the root cause (network stability, Prometheus resource limits, query timeout settings) rather than relying solely on this setting to suppress notifications. {{< /admonition >}}

Common transient errors that trigger this behavior include:

sse.dependencyError or sse.dataQueryError in alert state history
"context deadline exceeded" or "i/o timeout" messages
HTTP 502 or 500 responses from the Prometheus server
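If you review many state-history entries, a small triage helper can flag the signatures listed above. This helper is hypothetical, and its patterns cover only the messages and status codes named in this section. Treat a match as a hint to investigate connectivity, not as proof the error was transient.

```python
import re
from typing import Optional

# Signatures from the list above; extend for your environment.
TRANSIENT_PATTERNS = [
    re.compile(r"sse\.(dependencyError|dataQueryError)"),
    re.compile(r"context deadline exceeded"),
    re.compile(r"i/o timeout"),
]
TRANSIENT_STATUS_CODES = {500, 502}


def looks_transient(message: str, status_code: Optional[int] = None) -> bool:
    """Hypothetical helper: does an error look like a transient backend failure?"""
    if status_code in TRANSIENT_STATUS_CODES:
        return True
    return any(pattern.search(message) for pattern in TRANSIENT_PATTERNS)


print(looks_transient("[sse.dataQueryError] failed to execute query: context deadline exceeded"))
print(looks_transient("bad_data: 1:5: parse error: unexpected identifier", status_code=400))
```

A PromQL parse error returns False because it fails on every evaluation. Fix it in the query rather than masking it with error-state settings.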

For more details on troubleshooting these errors, refer to Troubleshoot Prometheus data source issues.

Best practices

Follow these best practices when creating Prometheus alerts:

Configure error state handling: Set Alert state if execution error or timeout to Keep Last State to prevent transient backend errors from triggering false alarms.
Use $__rate_interval: When using rate() or increase() in alert queries, use $__rate_interval so the range window is always at least four times the scrape interval. Grafana calculates it from the query interval and the Scrape interval configured in the data source.
Add label filters: Include specific label matchers to focus on relevant data and improve query performance.
Set realistic pending periods: Use the pending period to avoid alerting on brief spikes. For example, set a 5-minute pending period so the condition must persist before firing.
Test queries first: Verify your query returns expected results in Explore before creating an alert.
Use meaningful names: Give alert rules descriptive names that indicate what they monitor and the severity.
Pre-aggregate with recording rules: For complex or frequently evaluated expressions, create recording rules and alert on the pre-aggregated metric.
Use absent() for availability monitoring: Detect when metrics stop being reported, which often indicates a crashed or unresponsive service.
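The `$__rate_interval` recommendation in the best-practices list follows Grafana's documented definition, max($__interval + scrape interval, 4 × scrape interval). The scrape interval is the data source's Scrape interval setting, or the query's Min step if one is set. The following arithmetic evaluates that formula for two example inputs.

```python
def rate_interval_s(interval_s: float, scrape_interval_s: float) -> float:
    """$__rate_interval = max($__interval + scrape interval, 4 * scrape interval)."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


print(rate_interval_s(interval_s=15, scrape_interval_s=15))   # 4 x scrape interval wins
print(rate_interval_s(interval_s=120, scrape_interval_s=15))  # interval + scrape interval wins
```

With a 15s scrape interval, the window is never shorter than 60s. That leaves rate() enough samples even if a scrape is missed.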

If you encounter errors when creating or evaluating alert rules, refer to Troubleshoot Prometheus data source issues.