docs/proposals/ruler-ha-new.md
Rulers in Cortex currently run with a replication factor of 1, meaning each RuleGroup is assigned to exactly one ruler. This lack of redundancy creates the following risks:
This proposal attempts to mitigate the above risks by enabling a ruler replication factor greater than 1, which allows multiple rulers to evaluate the same rule group.
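Cortex distributes rule groups across rulers with a hash ring. The Go sketch below illustrates what a replication factor above 1 means for that assignment: one owner becomes an ordered set of distinct rulers. It is a simplification under stated assumptions. It uses one token per ruler, FNV-1a hashing, and the hypothetical helpers `buildRing` and `replicaSet`. It is not the Cortex ring implementation, which uses many tokens per instance.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// token is one position on a simplified hash ring, owned by a single ruler.
type token struct {
	value uint32
	ruler string
}

// hashKey maps a string to a ring position using 32-bit FNV-1a.
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error
	return h.Sum32()
}

// buildRing gives each ruler one token derived from its name and sorts the
// tokens by position. Real rings usually assign many tokens per instance.
func buildRing(rulers []string) []token {
	ring := make([]token, 0, len(rulers))
	for _, r := range rulers {
		ring = append(ring, token{value: hashKey(r), ruler: r})
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].value < ring[j].value })
	return ring
}

// replicaSet walks the ring clockwise from the rule group's hash and returns
// up to rf distinct rulers. With rf == 1 this is a single owner.
func replicaSet(ring []token, groupKey string, rf int) []string {
	if len(ring) == 0 || rf < 1 {
		return nil
	}
	h := hashKey(groupKey)
	start := sort.Search(len(ring), func(i int) bool { return ring[i].value >= h })
	seen := make(map[string]bool)
	var out []string
	for i := 0; i < len(ring) && len(out) < rf; i++ {
		t := ring[(start+i)%len(ring)] // wrap around past the last token
		if !seen[t.ruler] {
			seen[t.ruler] = true
			out = append(out, t.ruler)
		}
	}
	return out
}

func main() {
	ring := buildRing([]string{"ruler-1", "ruler-2", "ruler-3"})
	for _, rf := range []int{1, 2, 3} {
		fmt.Printf("rf=%d -> %v\n", rf, replicaSet(ring, "tenant-a/namespace-1/group-1", rf))
	}
}
```

The walk is deterministic, so the set for `rf=2` always begins with the `rf=1` owner. In this sketch, raising the factor therefore adds rulers without moving the existing owner.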
`ReplicationFactor` in the ruler is currently hardcoded to 1. Making it a configurable parameter is the first step toward enabling HA in the ruler, and the parameter will default to 1. To enable Ruler HA for rule group evaluation, a new flag will be created.
A replication factor greater than 1 will result in the following:
With this redundancy, the maximum duration of missed evaluations will be limited to the rule group sync interval, which reduces the impact of an unavailable primary ruler.
No Prometheus change is required for this proposal.
An interim solution is provided in PR #5773. It will be modified so that the replicas return both active and passive rule groups, while the API handler continues to de-duplicate the results. The difference is that, with Ruler HA, replicas that evaluated a rule group can potentially return that group's proper state.
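The de-duplication step can be sketched in Go as follows. Responses from several replicas are merged into one entry per namespace and group name. An entry from a replica that evaluated the group is preferred over a passive copy. The type `RuleGroupState`, its `Active` and `Health` fields, and the function `dedupe` are hypothetical. They are not the types used in #5773 or the Cortex API.

```go
package main

import "fmt"

// RuleGroupState is a hypothetical, trimmed-down view of one rule group as
// returned by a single ruler replica.
type RuleGroupState struct {
	Namespace string
	Name      string
	Active    bool   // true if this replica evaluated the group
	Health    string // example state field, e.g. "ok" or "unknown"
}

// dedupe keeps one entry per namespace/name pair, in first-seen order. When a
// passive copy was seen first, a later active copy replaces it.
func dedupe(responses [][]RuleGroupState) []RuleGroupState {
	type key struct{ ns, name string }
	index := make(map[key]int)
	var out []RuleGroupState
	for _, resp := range responses {
		for _, g := range resp {
			k := key{g.Namespace, g.Name}
			i, seen := index[k]
			switch {
			case !seen:
				index[k] = len(out)
				out = append(out, g)
			case g.Active && !out[i].Active:
				out[i] = g // prefer the replica that actually evaluated it
			}
		}
	}
	return out
}

func main() {
	passive := []RuleGroupState{{Namespace: "ns", Name: "g1", Health: "unknown"}}
	active := []RuleGroupState{{Namespace: "ns", Name: "g1", Active: true, Health: "ok"}}
	fmt.Printf("%+v\n", dedupe([][]RuleGroupState{passive, active}))
}
```

Preferring the active copy only changes which duplicate is kept. Groups returned by a single replica pass through unchanged.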
PRs: