# Metrics Migration to OpenTelemetry

> ⚠️ **Breaking Changes - Action Required**
>
> Related PR: #9043

Tekton Pipelines has migrated from OpenCensus (deprecated) to OpenTelemetry (the CNCF standard). This is a **breaking change** requiring updates to dashboards, alerts, and configuration.
## Summary of Changes

### Infrastructure Metrics (HIGH IMPACT)

- `tekton_pipelines_controller_workqueue_*` → `kn_workqueue_*`
- `tekton_pipelines_controller_client_*` → standard HTTP/K8s metrics
- `tekton_pipelines_controller_go_*` → `go_*`

### Core Tekton Metrics (LOW IMPACT)

- `tekton_pipelines_controller_pipelinerun_*` and `tekton_pipelines_controller_taskrun_*` are unchanged by default
- Optional `reason` label available (disabled by default, see Section 2.2)

### Configuration (MEDIUM IMPACT)

- `metrics.backend-destination` → `metrics-protocol`
- `metrics.backend-destination: prometheus` → `metrics-protocol: prometheus`

### Removed Metrics (LOW IMPACT)

- `reconcile_count` and `reconcile_latency` → use `kn_workqueue_*` instead

## 1. Infrastructure Metrics

### 1.1 Workqueue Metrics

| Old Name (OpenCensus) | New Name (OpenTelemetry) | Type | Change |
|---|---|---|---|
| `tekton_pipelines_controller_workqueue_adds_total` | `kn_workqueue_adds_total` | Counter | Prefix change |
| `tekton_pipelines_controller_workqueue_depth` | `kn_workqueue_depth` | Gauge | Prefix change |
| `tekton_pipelines_controller_workqueue_queue_latency_seconds` | `kn_workqueue_queue_duration_seconds` | Histogram | Renamed: latency → duration |
| `tekton_pipelines_controller_workqueue_work_duration_seconds` | `kn_workqueue_process_duration_seconds` | Histogram | Renamed: work → process |
| `tekton_pipelines_controller_workqueue_retries_total` | `kn_workqueue_retries_total` | Counter | Prefix change |
| `tekton_pipelines_controller_workqueue_unfinished_work_seconds` | `kn_workqueue_unfinished_work_seconds` | Gauge | Prefix change |
### 1.2 Kubernetes Client Metrics

| Old Name (OpenCensus) | New Name (OpenTelemetry) | Type | Change |
|---|---|---|---|
| `tekton_pipelines_controller_client_latency` | `http_client_request_duration_seconds` | Histogram | Standard HTTP metric |
| `tekton_pipelines_controller_client_results` | `kn_k8s_client_http_response_status_code_total` | Counter | Detailed status tracking |
### 1.3 Go Runtime Metrics

All Go runtime metrics are renamed: `tekton_pipelines_controller_go_*` → `go_*`.

Examples:

- `tekton_pipelines_controller_go_goroutines` → `go_goroutines`
- `tekton_pipelines_controller_go_memstats_alloc_bytes` → `go_memstats_alloc_bytes`
- `tekton_pipelines_controller_go_threads` → `go_threads`

## 2. Core Tekton Metrics

### 2.1 Unchanged Metrics

Core metrics retain their names and labels. By default, there are no changes to PipelineRun and TaskRun metrics.
| Metric Name | Labels | Change |
|---|---|---|
| `tekton_pipelines_controller_pipelinerun_duration_seconds` | `pipeline`, `status` | ✅ No change |
| `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds` | `pipeline`, `pipelinerun`, `status`, `task`, `taskrun` | ✅ No change |
| `tekton_pipelines_controller_taskrun_duration_seconds` | `status`, `task`, `taskrun` | ✅ No change |
| `tekton_pipelines_controller_pipelinerun_total` | `status` | ✅ No change |
| `tekton_pipelines_controller_taskrun_total` | `status` | ✅ No change |
These metrics maintain full backward compatibility with the OpenCensus implementation when using default configuration.
### 2.2 `reason` Label (Disabled by Default)

An opt-in feature adds a `reason` label to duration metrics for more granular failure analysis.

To enable (not recommended for high-volume clusters):

```yaml
data:
  metrics.count.enable-reason: "true"
```
When enabled, duration metrics gain the `reason` label:

| Metric Name | Default Labels | With `enable-reason: true` |
|---|---|---|
| `tekton_pipelines_controller_pipelinerun_duration_seconds` | `pipeline`, `status` | `pipeline`, `status`, `reason` |
| `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds` | `pipeline`, `pipelinerun`, `status`, `task`, `taskrun` | All previous + `reason` |
| `tekton_pipelines_controller_taskrun_duration_seconds` | `status`, `task`, `taskrun` | All previous + `reason` |
Reason values: `Succeeded`, `Failed`, `Completed`, `Cancelled`, `PipelineRunCancelled`, `TimedOut`, `StoppedRunFinally`, `CancelledRunFinally`

⚠️ **Cardinality Impact**: Enabling this increases the number of time series by 3-5x. Use `sum by(le)` aggregation in queries to mitigate.

Note: Total counters (`*_total`) never include the `reason` label, even when enabled.
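The actual impact on your cluster can be measured by counting exported series per metric family in a `/metrics` dump, once with the label disabled and once enabled. A minimal sketch (the dump path and sample data are illustrative):

```shell
# Count exported time series per metric family in a Prometheus text-format dump.
series_per_metric() {
  grep -v '^#' "$1" \
    | sed -E 's/^([a-zA-Z_:][a-zA-Z0-9_:]*).*/\1/' \
    | sort | uniq -c | sort -rn
}

# Self-check against a tiny synthetic dump; in practice, feed it the output of
#   curl -s http://localhost:9090/metrics
printf '%s\n' \
  'tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{status="failed",reason="Failed",le="10"} 2' \
  'tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{status="failed",reason="TimedOut",le="10"} 1' \
  'go_goroutines 40' > /tmp/metrics-dump.txt
series_per_metric /tmp/metrics-dump.txt
```

Comparing the per-metric counts before and after enabling the feature shows the multiplier for your own workload mix.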
### 2.3 Implementation Details

| Feature | Implementation |
|---|---|
| Pod latency metric | Remains a Gauge (metric type preserved) |
| Duration histogram buckets | Custom explicit buckets: [10s, 30s, 1m, 5m, 15m, 30m, 1h, 1.5h, 3h, 6h, +Inf] |
| Metric cardinality | Unchanged with default configuration |
## 3. Removed Metrics

| Removed Metric | Replacement | Migration |
|---|---|---|
| `tekton_pipelines_controller_reconcile_count` | `kn_workqueue_adds_total` | Monitor reconciliation activity |
| `tekton_pipelines_controller_reconcile_latency` | `kn_workqueue_process_duration_seconds` | Monitor reconciliation duration |
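These renames, together with the workqueue and Go runtime renames above, can be applied mechanically to saved PromQL. A sed-based sketch (not an official tool; review the output before importing it anywhere):

```shell
# Rewrite pre-migration metric names to their OpenTelemetry equivalents.
# Reads files given as arguments, or stdin if none.
# Specific renames run first so the generic prefix rules don't clobber them.
migrate_metric_names() {
  sed -E \
    -e 's/tekton_pipelines_controller_workqueue_queue_latency_seconds/kn_workqueue_queue_duration_seconds/g' \
    -e 's/tekton_pipelines_controller_workqueue_work_duration_seconds/kn_workqueue_process_duration_seconds/g' \
    -e 's/tekton_pipelines_controller_workqueue_/kn_workqueue_/g' \
    -e 's/tekton_pipelines_controller_go_/go_/g' \
    "$@"
}

echo 'rate(tekton_pipelines_controller_workqueue_adds_total[5m])' | migrate_metric_names
# → rate(kn_workqueue_adds_total[5m])
```

The client metrics (`tekton_pipelines_controller_client_*`) are deliberately left out: they map to differently shaped standard metrics, not a simple prefix swap, so those queries need manual attention.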
## 4. Configuration Changes

Before (OpenCensus):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  metrics.backend-destination: prometheus  # DEPRECATED - no longer supported
```
After (OpenTelemetry):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  # ===== Metrics Configuration =====
  metrics-protocol: prometheus      # Required: prometheus, grpc, http/protobuf, none
  metrics-endpoint: ""              # Optional: OTLP endpoint (for grpc/http)
  metrics-export-interval: "30s"    # Optional: Export frequency

  # ===== Tracing Configuration (NEW) =====
  tracing-protocol: none            # Optional: grpc, http/protobuf, none, stdout
  tracing-endpoint: ""              # Optional: OTLP tracing endpoint
  tracing-sampling-rate: "1.0"      # Optional: 0.0-1.0 (1.0 = 100%)

  # ===== Runtime Configuration =====
  runtime-profiling: disabled       # Optional: enabled, disabled
  runtime-export-interval: "15s"    # Optional: Runtime metrics export interval

  # ===== Tekton-Specific Settings =====
  metrics.taskrun.level: "task"                   # task, namespace, cluster
  metrics.taskrun.duration-type: "histogram"      # histogram, gauge
  metrics.pipelinerun.level: "pipeline"           # pipeline, namespace, cluster
  metrics.pipelinerun.duration-type: "histogram"  # histogram, gauge
  metrics.count.enable-reason: "false"            # Add reason label to duration metrics (see Section 2.2)
```
**Key change**: `metrics.backend-destination` → `metrics-protocol`
| Protocol | Use Case | Endpoint Required | Format |
|---|---|---|---|
| prometheus | Prometheus scraping (default) | No | Prometheus exposition |
| grpc | OTLP over gRPC | Yes | OpenTelemetry Protocol |
| http / http/protobuf | OTLP over HTTP | Yes | OpenTelemetry Protocol |
| none | Disable metrics | No | N/A |
**Example 1: Prometheus (Default)**

```yaml
data:
  metrics-protocol: prometheus
```

**Example 2: OTLP gRPC to OpenTelemetry Collector**

```yaml
data:
  metrics-protocol: grpc
  metrics-endpoint: "otel-collector.observability.svc.cluster.local:4317"
  metrics-export-interval: "30s"
```

**Example 3: OTLP HTTP with Tracing**

```yaml
data:
  metrics-protocol: http/protobuf
  metrics-endpoint: "http://otel-collector.observability.svc.cluster.local:4318/v1/metrics"
  tracing-protocol: grpc
  tracing-endpoint: "otel-collector.observability.svc.cluster.local:4317"
  tracing-sampling-rate: "0.1"  # 10% sampling
```
## 5. Migration Steps

All configuration changes below are made in the `config-observability` ConfigMap.

### Step 1: Update Configuration

```shell
kubectl edit configmap config-observability -n tekton-pipelines

# Change: metrics.backend-destination: prometheus
# To:     metrics-protocol: prometheus
```
### Step 2: Upgrade Tekton Pipelines

```shell
# Apply the new version containing the OTel migration
kubectl apply -f https://infra.tekton.dev/tekton-releases/pipeline/latest/release.yaml

# Wait for rollout
kubectl rollout status deployment/tekton-pipelines-controller -n tekton-pipelines
```
### Step 3: Verify Metrics Endpoint

```shell
# Port-forward to controller
kubectl port-forward -n tekton-pipelines deployment/tekton-pipelines-controller 9090:9090

# Check metrics (in another terminal)
curl http://localhost:9090/metrics | grep -E "(kn_workqueue|tekton_pipelines_controller)"
```
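The curl check can be turned into a small pass/fail script: after a successful migration the `kn_workqueue_*` names should be present and the old OpenCensus-prefixed workqueue names absent. A sketch (file paths and sample data are illustrative):

```shell
# Pass/fail check on a /metrics dump: new workqueue metrics present,
# old OpenCensus-prefixed workqueue metrics gone.
check_migration() {
  if ! grep -q '^kn_workqueue_' "$1"; then
    echo "FAIL: kn_workqueue_* metrics missing"; return 1
  fi
  if grep -q '^tekton_pipelines_controller_workqueue_' "$1"; then
    echo "FAIL: old workqueue metrics still exported"; return 1
  fi
  echo "OK"
}

# Quick self-check with a synthetic dump; in practice use:
#   curl -s http://localhost:9090/metrics > /tmp/metrics.txt
printf 'kn_workqueue_depth{name="pipelinerun"} 0\n' > /tmp/metrics.txt
check_migration /tmp/metrics.txt
# → OK
```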
### Step 4: Update Dashboards

| Old Query | New Query |
|---|---|
| `rate(tekton_pipelines_controller_workqueue_adds_total[5m])` | `rate(kn_workqueue_adds_total[5m])` |
| `tekton_pipelines_controller_pipelinerun_duration_seconds` | ✅ No change needed |
| `tekton_pipelines_controller_go_goroutines` | `go_goroutines` |

Note: Core Tekton metrics (PipelineRun/TaskRun) queries remain unchanged with default configuration.
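Before editing dashboards by hand, it helps to inventory which exported dashboard files still reference renamed metrics. A grep-based sketch (`/tmp/dashboards` stands in for your real dashboard export directory):

```shell
# List files that still reference pre-migration metric prefixes.
find_stale_dashboards() {
  grep -rlE 'tekton_pipelines_controller_(workqueue|client|go)_' "$1"
}

# Self-check against a synthetic export; point it at your actual
# dashboard export directory instead.
mkdir -p /tmp/dashboards
echo '"expr": "rate(tekton_pipelines_controller_workqueue_adds_total[5m])"' > /tmp/dashboards/queue.json
echo '"expr": "rate(kn_workqueue_adds_total[5m])"' > /tmp/dashboards/migrated.json
find_stale_dashboards /tmp/dashboards
# → /tmp/dashboards/queue.json
```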
### Step 5: Update Alerts

Before:

```yaml
- alert: HighWorkqueueDepth
  expr: tekton_pipelines_controller_workqueue_depth > 100
```

After:

```yaml
- alert: HighWorkqueueDepth
  expr: kn_workqueue_depth > 100
```
### Step 6: Verify and Monitor

**Success-rate query** — Before:

```promql
sum(rate(tekton_pipelines_controller_pipelinerun_total{status="success"}[5m]))
/
sum(rate(tekton_pipelines_controller_pipelinerun_total[5m]))
```

After: ✅ No change needed (the `reason` label is never added to `*_total` counters):

```promql
sum(rate(tekton_pipelines_controller_pipelinerun_total{status="success"}[5m]))
/
sum(rate(tekton_pipelines_controller_pipelinerun_total[5m]))
```

**p95 duration query** — Before:

```promql
histogram_quantile(0.95,
  rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
)
```

After: ✅ No change needed (with default configuration):

```promql
histogram_quantile(0.95,
  rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
)
```
Note: If you enable `metrics.count.enable-reason: "true"`, you'll need to aggregate by `le`:

```promql
histogram_quantile(0.95,
  sum by(le) (
    rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
  )
)

# OR keep reason for granular failure analysis:
histogram_quantile(0.95,
  sum by(le, reason) (
    rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
  )
)
```
**Go runtime query** — Before:

```promql
tekton_pipelines_controller_go_memstats_alloc_bytes
```

After: ✅ Simple prefix change:

```promql
go_memstats_alloc_bytes
```

**Workqueue query** — Before:

```promql
rate(tekton_pipelines_controller_workqueue_adds_total{name="pipelinerun"}[5m])
```

After: ✅ Prefix change only:

```promql
rate(kn_workqueue_adds_total{name="pipelinerun"}[5m])
```
## 6. Troubleshooting

### Metrics Not Appearing

```shell
# Check metrics endpoint
kubectl port-forward -n tekton-pipelines deployment/tekton-pipelines-controller 9090:9090

# Verify new metrics exist
curl http://localhost:9090/metrics | grep kn_workqueue
curl http://localhost:9090/metrics | grep tekton_pipelines_controller
```

Check Prometheus targets: the `tekton-pipelines-controller` target should be `UP`.

Verify the ServiceMonitor:

```shell
kubectl get servicemonitor -n tekton-pipelines -o yaml
```
### High Cardinality Warnings

Problem: Prometheus warns about high series cardinality after enabling `metrics.count.enable-reason`.

Root cause: The `reason` label increases the number of time series by 3-5x.

Solution 1: Disable the feature (recommended):

```yaml
data:
  metrics.count.enable-reason: "false"
```

Solution 2: Aggregate away the `reason` label in queries:

```promql
# Instead of:
histogram_quantile(0.95, rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m]))

# Use:
histogram_quantile(0.95, sum by(le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])))
```

Note: This issue only occurs if you explicitly enabled `metrics.count.enable-reason: "true"`.
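If dashboards run the aggregated query frequently, a Prometheus recording rule avoids paying the `sum by(le)` cost at render time. A sketch; the group and rule names below are suggestions, not shipped defaults:

```yaml
groups:
  - name: tekton-pipelinerun-duration
    rules:
      # Pre-aggregate buckets across the reason label so that
      # histogram_quantile stays cheap even with enable-reason: "true".
      - record: tekton:pipelinerun_duration_seconds_bucket:rate5m
        expr: sum by(le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m]))
```

Dashboards can then query `histogram_quantile(0.95, tekton:pipelinerun_duration_seconds_bucket:rate5m)` directly.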
### Configuration Changes Not Taking Effect

```shell
# Restart controller to pick up ConfigMap changes
kubectl rollout restart deployment/tekton-pipelines-controller -n tekton-pipelines
kubectl rollout status deployment/tekton-pipelines-controller -n tekton-pipelines

# Check logs
kubectl logs -n tekton-pipelines deployment/tekton-pipelines-controller | grep -i "observability"
```

### OTLP Export Failures

Check controller logs:

```shell
kubectl logs -n tekton-pipelines deployment/tekton-pipelines-controller | grep -i "otel\|export\|metric"
```

Test endpoint connectivity:

```shell
kubectl exec -n tekton-pipelines deployment/tekton-pipelines-controller -- \
  nc -zv otel-collector.observability.svc.cluster.local 4317
```

### Enable Debug Logging

```shell
kubectl edit configmap config-logging -n tekton-pipelines
# Add: loglevel.controller: "debug"

kubectl rollout restart deployment/tekton-pipelines-controller -n tekton-pipelines
```
## 7. FAQ

**Q: Do I need to upgrade immediately?**
A: This migration is included starting from a future Tekton Pipelines release. Plan to migrate when upgrading to that version. Test in staging first.

**Q: Will old metrics continue to work during a transition period?**
A: No. This is a hard cutover: only OpenTelemetry metrics are available after the upgrade.

**Q: What if I don't update my configuration?**
A: If you use Prometheus, the default behavior is preserved, but dashboards and alerts will break due to the metric name changes.

**Q: Can I use both OpenCensus and OpenTelemetry?**
A: No. The controller only emits OpenTelemetry metrics after the upgrade.

**Q: How do I test without affecting production?**
A: Deploy to a test/staging environment first. Verify all metrics, dashboards, and alerts work before upgrading production.

**Q: Where can I get help?**
A: File an issue at https://github.com/tektoncd/pipeline/issues or ask in Slack (#tekton channel).
## 8. Quick Reference

| Category | Old Prefix | New Prefix | Action |
|---|---|---|---|
| Workqueue | `tekton_pipelines_controller_workqueue_*` | `kn_workqueue_*` | Update all queries |
| K8s Client | `tekton_pipelines_controller_client_*` | `http_client_*` or `kn_k8s_client_*` | Update all queries |
| Go Runtime | `tekton_pipelines_controller_go_*` | `go_*` | Update all queries |
- If you enable `metrics.count.enable-reason`, add `sum by(le)` aggregation to duration queries
- Update configuration: `metrics.backend-destination` → `metrics-protocol`

For questions or clarifications, please refer to PR #9043 or contact the Tekton team.