# RFC-0011 OpenTelemetry Tracing
Status: provisional

<!-- Status represents the current state of the RFC. Must be one of `provisional`, `implementable`, `implemented`, `deferred`, `rejected`, `withdrawn`, or `replaced`. -->

Creation date: 2025-04-24

Last update: 2025-08-13
The aim is to collect traces via OpenTelemetry (OTel) across all Flux-related objects, such as HelmReleases and Kustomizations, among others. These traces may be sent to a tracing provider, where they can be stored and visualized; Flux takes no responsibility for storing or visualizing them and remains completely stateless. To keep this seamless for the user, the implementation will be part of the already existing Alert API type: the Alert's EventSources discriminate the events belonging to specific sources, which are looked up and sent out to the configured Provider. This facilitates the observability and monitoring of Flux-related objects.
This RFC was born out of a need for end-to-end visibility into Flux's multi-controller GitOps workflow. Flux was once one monolithic controller; it has since split into several specialized controllers (source-controller, kustomize-controller, helm-controller, notification-controller, etc.), which makes tracing the path of a single "source change → applied resource → notification" much harder. Additionally, users should not have to build and maintain extra tools or sidecars to get this visibility.
The goal is to correlate any potential source (GitRepository, OCIRepository, HelmChart, or Bucket) with all of its downstream actions, so that each change shows up as a single trace with multiple spans underneath, one per downstream action.
On top of this, custom UIs can be built that surface trace timelines alongside Git commits or Docker image tags, so operators can answer "what exactly happened when I tagged v1.2.3?" in a single pane of glass.
Alerts should expose this feature out of the box: users link the EventSources to trace with the Provider the traces will be sent to. The implementation will extend the notification-controller with OpenTelemetry tracing capabilities by leveraging the existing Alert API object model and adding a new Provider type called `otel`. This approach maintains Flux's declarative configuration paradigm while adding powerful distributed tracing functionality.
The notification-controller is the natural home for this feature, as it already has visibility into events across the Flux ecosystem. EventSources define which Flux resources to trace (GitRepositories, Kustomizations, HelmReleases, etc.), and the Provider specifies where to send the trace data (any OpenTelemetry-compatible backend). This allows users to declaratively configure tracing using familiar Flux patterns, without code changes to their applications or additional sidecar deployments. The notification-controller will handle the collection, correlation, and forwarding of spans to the configured tracing backend.
Example Configuration:
```yaml
# Configure the alert
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: webapp-tracing
  namespace: default
spec:
  providerRef:
    name: otel-collector
  eventSources:
    - kind: GitRepository # Source controller resources
      name: webapp-source
    - kind: Kustomization # Kustomize controller resources
      name: webapp-backend
    - kind: Kustomization # Kustomize controller resources
      name: webapp-frontend
  eventMetadata:
    env: staging
    cluster: cluster-1
    region: us-east-2
---
# Define a tracing provider
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: otel-collector
  namespace: default
spec:
  type: otel
  address: http://otel-collector.observability.svc.cluster.local:4318/v1/traces # OTel Collector endpoint
  secretRef:
    name: otel-collector-secret # Optional: auth + additional headers
  certSecretRef:
    name: mtls-certs # Optional: enable mTLS auth
  proxySecretRef:
    name: otel-collector-proxy # Optional: proxy configuration
---
# OTel Collector secret
apiVersion: v1
kind: Secret
metadata:
  name: otel-collector-secret
  namespace: default
stringData:
  # Header data prevails over the auth fields (username/password or token).
  # Either username/password or token must be used; if username is set,
  # basic auth is assumed, otherwise the token is sent as a bearer token.
  username: "<otel-collector-username>"
  password: "<otel-collector-password>"
  token: "<otel-collector-api-token>"
  headers: |
    X-Forwarded-Proto: https
---
# TLS certificates and keys
apiVersion: v1
kind: Secret
metadata:
  name: mtls-certs
  namespace: default
type: kubernetes.io/tls # or Opaque
stringData:
  # All fields are required to enable mTLS
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    <client certificate>
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    <client private key>
    -----END PRIVATE KEY-----
  # Only ca.crt is needed in the CA-only case
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    <certificate authority certificate>
    -----END CERTIFICATE-----
---
# Proxy configuration
apiVersion: v1
kind: Secret
metadata:
  name: otel-collector-proxy
  namespace: default
stringData:
  address: "http://<otel-collector-proxy-url>(:<otel-collector-proxy-port>)"
  username: "<otel-collector-proxy-username>"
  password: "<otel-collector-proxy-password>"
```
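The comments on `otel-collector-secret` imply a precedence order when building the outgoing request: explicit header data wins over the auth fields, and a set username selects basic auth over a bearer token. A minimal Go sketch of that resolution logic follows; the function name and signature are hypothetical, not part of the proposed API:

```go
package main

import (
	"encoding/base64"
	"fmt"

	"gopkg.in/yaml.v3"
)

// buildAuthHeaders resolves the headers sent to the collector: a set
// username selects basic auth, otherwise a token is sent as a bearer
// token, and explicit header data prevails over both.
func buildAuthHeaders(username, password, token, rawHeaders string) (map[string]string, error) {
	h := map[string]string{}
	switch {
	case username != "":
		cred := base64.StdEncoding.EncodeToString([]byte(username + ":" + password))
		h["Authorization"] = "Basic " + cred
	case token != "":
		h["Authorization"] = "Bearer " + token
	}
	if rawHeaders != "" {
		extra := map[string]string{}
		if err := yaml.Unmarshal([]byte(rawHeaders), &extra); err != nil {
			return nil, err
		}
		for k, v := range extra { // header data wins over the auth fields
			h[k] = v
		}
	}
	return h, nil
}

func main() {
	h, _ := buildAuthHeaders("user", "pass", "", "X-Forwarded-Proto: https")
	fmt.Println(h) // map[Authorization:Basic dXNlcjpwYXNz X-Forwarded-Proto:https]
}
```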
Based on this configuration, the notification-controller will:

- collect the events emitted by the listed EventSources,
- convert each event into an OpenTelemetry span,
- correlate related spans under a single trace, and
- forward them to the endpoint configured in the `otel` Provider, as sketched below.
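As a rough illustration of the export step, here is a minimal Go sketch using the OpenTelemetry SDK's OTLP/HTTP exporter. The endpoint matches the example Provider above; the span name and the mapping of `eventMetadata` to span attributes are assumptions about how events might be translated, not the controller's actual code:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP/HTTP exporter pointed at the collector from the example Provider
	// (the exporter's default URL path is /v1/traces).
	exp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("otel-collector.observability.svc.cluster.local:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)

	// One span per Flux event; the Alert's eventMetadata becomes attributes.
	tracer := tp.Tracer("notification-controller")
	_, span := tracer.Start(ctx, "Kustomization/webapp-backend")
	span.SetAttributes(
		attribute.String("env", "staging"),
		attribute.String("cluster", "cluster-1"),
		attribute.String("region", "us-east-2"),
	)
	span.End()
}
```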
A key challenge in distributed tracing is establishing a reliable correlation mechanism that works across multiple controllers in a stateless, potentially unreliable environment. Our solution addresses this with a robust span identification strategy.
The Trace ID is generated using a deterministic approach that combines:

- the UID of the `Alert` object, and
- the revision of the source that triggered the event.
These values are concatenated and passed through a configurable checksum algorithm (SHA-256 by default). This approach ensures the ID is reproducible: any controller can recompute the same Trace ID from the same inputs, with no shared state or coordination required.
Example:

```text
# Input values
Alert UID:       "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
Source Revision: "sha256:2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"

# Concatenated value (<Alert-UID>:<source-revision>)
"a1b2c3d4-e5f6-7890-abcd-ef1234567890:sha256:2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"

# Apply SHA-256 (default algorithm)
Trace ID: "f7846f55cf23e14eebeab5b4e1550cad5b509e3348fbc4efa3a1413d393cb650"
```
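A minimal Go sketch of this derivation, assuming the concatenation format above. One caveat: W3C/OTel trace IDs are 128-bit while SHA-256 produces 256 bits, so a real implementation would likely truncate the digest to its first 16 bytes (the example above shows the full digest):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// traceIDFor deterministically derives a trace ID from an Alert UID and a
// source revision. SHA-256 is the default checksum algorithm; truncating
// the digest to the 16 bytes a W3C trace ID holds is an assumption here.
func traceIDFor(alertUID, sourceRevision string) string {
	sum := sha256.Sum256([]byte(alertUID + ":" + sourceRevision))
	return hex.EncodeToString(sum[:16])
}

func main() {
	// Same inputs always yield the same ID, on any controller, at any time.
	fmt.Println(traceIDFor(
		"a1b2c3d4-e5f6-7890-abcd-ef1234567890",
		"sha256:2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae",
	))
}
```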
When events occur in the system, each one is turned into a span carrying the deterministically derived Trace ID, so spans emitted by different controllers at different times are grouped under the same trace without any coordination.
The design accounts for the distributed nature of Flux controllers and the delays and downtimes a distributed system always implies: because the Trace ID can be recomputed from the Alert UID and the source revision at any time, late, retried, or out-of-order events still join the correct trace.
This design ensures trace continuity even in challenging distributed environments while maintaining Flux's core principles of statelessness and resilience.
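The RFC does not spell out how the deterministic Trace ID would be fed into the tracing SDK. One plausible wiring, sketched below, uses the Go SDK's `IDGenerator` extension point: the Alert UID and source revision travel on the context, and the generator derives the trace ID from them. All helper names are hypothetical:

```go
package main

import (
	"context"
	"crypto/rand"
	"crypto/sha256"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

type seedKey struct{}

// WithTraceSeed puts the Alert UID and source revision on the context so
// the ID generator below can derive a deterministic trace ID from them.
// Hypothetical helper, not part of the proposed API.
func WithTraceSeed(ctx context.Context, alertUID, revision string) context.Context {
	return context.WithValue(ctx, seedKey{}, alertUID+":"+revision)
}

// deterministicIDGenerator derives trace IDs from the context seed
// (SHA-256 truncated to 128 bits) and keeps span IDs random, so late or
// out-of-order events recompute the same trace ID and join the same trace.
type deterministicIDGenerator struct{}

var _ sdktrace.IDGenerator = deterministicIDGenerator{}

func (g deterministicIDGenerator) NewIDs(ctx context.Context) (trace.TraceID, trace.SpanID) {
	var tid trace.TraceID
	if seed, ok := ctx.Value(seedKey{}).(string); ok {
		sum := sha256.Sum256([]byte(seed))
		copy(tid[:], sum[:16])
	} else {
		rand.Read(tid[:]) // no seed: fall back to a random trace ID
	}
	return tid, g.NewSpanID(ctx, tid)
}

func (g deterministicIDGenerator) NewSpanID(_ context.Context, _ trace.TraceID) trace.SpanID {
	var sid trace.SpanID
	rand.Read(sid[:])
	return sid
}

func main() {
	// Wire the generator into the tracer provider used by the controller.
	_ = sdktrace.NewTracerProvider(sdktrace.WithIDGenerator(deterministicIDGenerator{}))
}
```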