docs/proposals/crossplane-provider.md
This proposal outlines the development of a Crossplane provider for Cortex that enables declarative management of a Cortex tenant's recording rules, alerting rules, and Alertmanager configurations through Kubernetes Custom Resources.
Currently, a Cortex tenant's recording rules, alerting rules, and Alertmanager configurations are managed through a variety of methods:
This approach leads to several challenges:
This proposal introduces a Crossplane provider that enables declarative management of Cortex configurations through Kubernetes Custom Resources. The provider implements three core resources that address the key aspects of Cortex management:
The provider leverages Crossplane's composition and configuration capabilities to provide:
The provider follows Crossplane's standard architecture pattern with three main components:
┌─────────────────┐    ┌─────────────────┐    ┌──────────────────┐
│  TenantConfig   │    │    RuleGroup    │    │AlertmanagerConfig│
│                 │    │                 │    │                  │
│ - Connection    │    │ - Rules Mgmt    │    │ - AM Config      │
│ - Auth Details  │    │ - PromQL        │    │ - Templates      │
│ - TLS Config    │    │ - Evaluation    │    │ - Routing        │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬────────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                       ┌─────────────────────┐
                       │ Crossplane Provider │
                       │                     │
                       │ - Authentication    │
                       │ - HTTP Client       │
                       │ - Reconciliation    │
                       │ - Status Reporting  │
                       └─────────┬───────────┘
                                 │
                       ┌─────────────────────┐
                       │   Cortex Cluster    │
                       │                     │
                       │ - Ruler API         │
                       │ - Alertmanager API  │
                       │ - Multi-tenant      │
                       └─────────────────────┘
The provider implements controllers for each Custom Resource Definition (CRD) that handle:
The TenantConfig CRD manages connection details and authentication for a specific Cortex tenant:
apiVersion: config.cortexmetrics.io/v1alpha1
kind: TenantConfig
metadata:
  name: production-tenant
  namespace: tenant
spec:
  forProvider:
    tenantId: "production"
    endpoint: "https://cortex.company.com"
    authMethod: "bearer" # bearer, basic, or none
    bearerTokenSecretRef:
      name: cortex-token
      key: token
    tlsConfig:
      insecureSkipVerify: false
      caBundleSecretRef:
        name: cortex-ca-bundle
        key: ca-bundle.crt
    additionalHeaders:
      "X-Custom-Header": "custom-value"
  providerConfigRef:
    name: cortex-config
status:
  atProvider:
    tenantId: "production"
    connectionStatus: "Connected"
    lastConnectionTime: "2025-01-01T10:30:00Z"
    authMethod: "bearer"
  conditions:
    - type: Ready
      status: "True"
    - type: Synced
      status: "True"
Key Features:
The RuleGroup CRD manages Prometheus alerting and recording rules within a Cortex namespace:
apiVersion: config.cortexmetrics.io/v1alpha1
kind: RuleGroup
metadata:
  name: cpu-monitoring
  namespace: tenant
spec:
  forProvider:
    tenantConfigRef:
      name: production-tenant
    namespace: "monitoring"
    groupName: "cpu-alerts"
    interval: "30s"
    rules:
      # Alerting rule - fires when node CPU usage exceeds 80%
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: "5m"
        labels:
          severity: "warning"
          team: "platform"
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% (above 80% threshold for 5 minutes)"
      # Recording rule - pre-calculates node CPU usage percentage for dashboards
      - record: instance_cpu_usage_percent
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
        labels:
          job: "node-exporter"
status:
  atProvider:
    namespace: "monitoring"
    groupName: "cpu-alerts"
    ruleCount: 2
    lastUpdated: "2025-01-01T10:35:00Z"
    status: "Active"
Key Features:
- Supports both alerting rules (using the alert: field) and recording rules (using the record: field)
- Each rule carries either an alert: or a record: field
- Alerting rules may include annotations and for duration fields

The AlertmanagerConfig CRD manages Alertmanager configuration including routing rules, receivers, and notification templates:
apiVersion: config.cortexmetrics.io/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: production-alerting
  namespace: tenant
spec:
  forProvider:
    tenantConfigRef:
      name: production-tenant
    alertmanagerConfig: |
      global:
        resolve_timeout: 5m
        smtp_smarthost: 'smtp.company.com:587'
      route:
        group_by: ['alertname', 'cluster']
        group_wait: 10s
        group_interval: 10s
        repeat_interval: 1h
        receiver: 'platform-team'
        routes:
          - match:
              severity: critical
            receiver: 'critical-alerts'
      receivers:
        - name: 'platform-team'
          email_configs:
            - to: '[email protected]'
              # Alertmanager sets the subject via headers and the body via html/text
              headers:
                Subject: 'Alert: {{ .GroupLabels.alertname }}'
              html: '{{ template "alert.html" . }}'
        - name: 'critical-alerts'
          slack_configs:
            - api_url: 'https://hooks.slack.com/services/...'
              channel: '#critical-alerts'
              title: 'Critical Alert: {{ .GroupLabels.alertname }}'
    templateFiles:
      alert.html: |
        <h2>Alert Details</h2>
        {{ range .Alerts }}
        <p><strong>{{ .Annotations.summary }}</strong></p>
        <p>{{ .Annotations.description }}</p>
        {{ end }}
status:
  atProvider:
    configurationStatus: "Applied"
    lastUpdated: "2025-01-01T10:40:00Z"
    configHash: "abc123def456"
Key Features:
The provider implements three controllers following Crossplane's managed resource pattern:
TenantConfig Controller:
RuleGroup Controller:
- Alerting rules: alert name, for duration, and annotations
- Recording rules: record name for pre-computed metrics

AlertmanagerConfig Controller:
The provider manages resources through a consistent pattern:
Reconciliation Loop: Each controller runs a reconciliation loop that:
External Resource Identification: Resources are identified using:
Dependency Management: Resources reference each other using Crossplane's reference resolution:
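
Crossplane's managed resource reconciler drives each pass through an observe step followed by create, update, or delete. The following is a minimal sketch of that decision, written in Python for brevity (the provider itself would presumably be written in Go). The names Observation, observe, and next_action are illustrative and are not part of Crossplane or of this provider; comparing whole dictionaries for equality is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Result of comparing desired state with what the external API returned."""
    exists: bool
    up_to_date: bool


def observe(desired: dict, external: Optional[dict]) -> Observation:
    """Compare the desired spec with the external resource (None if absent)."""
    if external is None:
        return Observation(exists=False, up_to_date=False)
    return Observation(exists=True, up_to_date=(desired == external))


def next_action(desired: dict, external: Optional[dict], deleting: bool = False) -> str:
    """Pick the step one reconcile pass would take: create, update, delete, or none."""
    obs = observe(desired, external)
    if deleting:
        # A resource marked for deletion is removed only if it still exists.
        return "delete" if obs.exists else "none"
    if not obs.exists:
        return "create"
    return "none" if obs.up_to_date else "update"
```

For example, next_action(spec, None) yields "create", while a spec that differs from the external state yields "update".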
The provider handles Cortex configuration through several mechanisms:
Authentication and Security:
API Client Management:
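
As one illustration of the authentication and client concerns listed here, the sketch below shows how TenantConfig fields might translate into HTTP request headers. It assumes the provider sends the tenant ID in Cortex's X-Scope-OrgID header. The function name build_headers and the username/password parameters for basic auth are hypothetical, since the proposal only shows bearerTokenSecretRef.

```python
import base64
from typing import Dict, Optional


def build_headers(tenant_id: str,
                  auth_method: str,
                  token: Optional[str] = None,
                  username: Optional[str] = None,
                  password: Optional[str] = None,
                  additional: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build request headers from TenantConfig-style fields (hypothetical helper)."""
    # Start from additionalHeaders so the tenant and auth headers always win.
    headers = dict(additional or {})
    headers["X-Scope-OrgID"] = tenant_id  # assumption: tenant ID travels in this header

    if auth_method == "bearer":
        if not token:
            raise ValueError("bearer auth requires a token")
        headers["Authorization"] = f"Bearer {token}"
    elif auth_method == "basic":
        if username is None or password is None:
            raise ValueError("basic auth requires username and password")
        creds = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {creds}"
    elif auth_method != "none":
        raise ValueError(f"unsupported authMethod: {auth_method!r}")
    return headers
```

In a real controller the token would be read from the referenced Secret rather than passed in directly.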
Configuration Validation:
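
A rough sketch of the kind of rule validation this could involve, assuming the provider checks rules client-side before calling the Ruler API. It encodes the Prometheus rule-format constraints that a rule sets exactly one of alert or record, and that for and annotations are valid only on alerting rules. The function name validate_rule and the error strings are hypothetical.

```python
from typing import List


def validate_rule(rule: dict) -> List[str]:
    """Return a list of validation errors for one rule (empty list means valid)."""
    errors: List[str] = []
    has_alert = bool(rule.get("alert"))
    has_record = bool(rule.get("record"))

    # Exactly one of alert/record must be present.
    if has_alert == has_record:
        errors.append("rule must set exactly one of 'alert' or 'record'")

    if not str(rule.get("expr", "")).strip():
        errors.append("rule must set a non-empty 'expr'")

    # 'for' and 'annotations' only make sense on alerting rules.
    if has_record and not has_alert:
        for field in ("for", "annotations"):
            if field in rule:
                errors.append(f"'{field}' is only valid on alerting rules")
    return errors
```

This does not parse the PromQL in expr; a Go implementation could delegate that to the Prometheus parser.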
Hash-Based Drift Detection:
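
One way hash-based drift detection could work is to hash a canonical serialization of the desired configuration and compare it with the hash recorded at the last successful apply (for example, a status field such as configHash). The sketch below assumes SHA-256 over key-sorted JSON; the proposal does not specify the hash function or the serialization, and the names config_hash and has_drifted are hypothetical.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config deterministically: sorted keys, compact separators."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(desired: dict, last_applied_hash: str) -> bool:
    """True when the desired config no longer matches the recorded hash."""
    return config_hash(desired) != last_applied_hash
```

Sorting keys before hashing means two specs that differ only in field order produce the same hash and do not trigger a spurious update.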
The provider includes comprehensive observability features:
Metrics:
Logging: The provider will use a debug-first logging approach with structured logging:
- Debug level (--debug flag): Operational details
Status Reporting:
Note: Crossplane Runtime v2 is not backwards compatible with Crossplane v1.x releases. This provider requires Crossplane v2.0.0 or later.
The provider supports multiple Cortex versions through its flexible API client:
The provider requires Cortex to be configured with a ruler storage backend that supports the full Ruler API. The following backends are supported:
| Backend | Support Status | Notes |
|---|---|---|
| S3 | ✅ Fully Supported | Recommended for production use. Includes S3-compatible storage like MinIO |
| GCS | ✅ Fully Supported | Google Cloud Storage |
| Azure | ✅ Fully Supported | Azure Blob Storage |
| Swift | ✅ Fully Supported | OpenStack Swift |
| Local | ❌ NOT SUPPORTED | Read-only backend with no write API support |
Important: Cortex's ruler_storage.backend: local is a read-only storage backend that does not support the write operations (SetRuleGroup, DeleteRuleGroup) required by this provider.
The provider falls back between Ruler API calls to stay compatible across backends:

- Uses the GetRuleGroup API for optimal performance on supported backends
- Falls back to ListRules if GetRuleGroup returns "unsupported" errors

This ensures compatibility with different Cortex configurations while maintaining optimal performance where possible.
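
The GetRuleGroup-then-ListRules fallback can be sketched as follows. This is an illustration under stated assumptions, not the provider's implementation: UnsupportedError, FakeRulerClient, and fetch_rule_group are hypothetical names, and how an "unsupported" error is actually detected is not specified in this proposal.

```python
from typing import Dict, List, Optional


class UnsupportedError(Exception):
    """Hypothetical error for a backend that cannot serve GetRuleGroup."""


class FakeRulerClient:
    """In-memory stand-in for a Ruler API client (illustrative only)."""

    def __init__(self, groups: Dict[str, List[dict]], supports_get: bool = True):
        self.groups = groups
        self.supports_get = supports_get
        self.list_calls = 0

    def get_rule_group(self, namespace: str, name: str) -> Optional[dict]:
        if not self.supports_get:
            raise UnsupportedError("GetRuleGroup unsupported by this backend")
        return next((g for g in self.groups.get(namespace, []) if g["name"] == name), None)

    def list_rules(self, namespace: str) -> List[dict]:
        self.list_calls += 1
        return list(self.groups.get(namespace, []))


def fetch_rule_group(client, namespace: str, name: str) -> Optional[dict]:
    """Try the direct lookup first; fall back to listing and filtering."""
    try:
        return client.get_rule_group(namespace, name)
    except UnsupportedError:
        for group in client.list_rules(namespace):
            if group["name"] == name:
                return group
        return None
```

The listing path is only taken when the direct call fails, so backends that support GetRuleGroup never pay for the extra list request.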
Approach: Using Helm charts to deploy and manage Cortex configurations
Limitations:
Conclusion: Helm charts alone are insufficient for dynamic configuration management and lack the operational benefits of Kubernetes-native resources.
Approach: Building a traditional Kubernetes operator without Crossplane
Comparison:
Conclusion: While operators provide control, Crossplane offers a mature platform with proven patterns and extensive ecosystem benefits.
Approach: Using Terraform with custom providers for Cortex management
Comparison:
Conclusion: While Terraform is powerful for infrastructure management, Crossplane provides better integration with Kubernetes-native workflows and superior multi-tenancy support for application-level configuration management.