<!--
---
linkTitle: "OpenTelemetry Metrics Migration"
weight: 305
---
-->

# Metrics Migration: OpenCensus to OpenTelemetry

**Breaking Changes - Action Required**

Related PR: #9043


## Executive Summary

Tekton Pipelines has migrated from OpenCensus (deprecated) to OpenTelemetry (CNCF standard). This is a BREAKING CHANGE requiring updates to dashboards, alerts, and configuration.

## What Changed

### Infrastructure Metrics (HIGH IMPACT)

- Workqueue: `tekton_pipelines_controller_workqueue_*` → `kn_workqueue_*`
- K8s Client: `tekton_pipelines_controller_client_*` → Standard HTTP/K8s metrics
- Go Runtime: `tekton_pipelines_controller_go_*` → `go_*`

### Core Tekton Metrics (LOW IMPACT)

- Names preserved: `tekton_pipelines_controller_pipelinerun_*` and `tekton_pipelines_controller_taskrun_*`
- Labels unchanged by default
- Optional `reason` label available (disabled by default, see Section 2.2)

### Configuration (MEDIUM IMPACT)

- `metrics.backend-destination` → `metrics-protocol`
- New OTLP export options (gRPC, HTTP/protobuf)

### Removed Metrics (LOW IMPACT)

- `reconcile_count` and `reconcile_latency` → use `kn_workqueue_*` instead

## Action Required

1. **Update ConfigMap**: `metrics.backend-destination: prometheus` → `metrics-protocol: prometheus`
2. **Update Dashboards**: Replace infrastructure metric names (see tables below)
3. **Update Alerts**: Use new metric names
4. **Test First**: Validate in staging before production

## 1. Infrastructure Metrics - Breaking Changes

### 1.1 Workqueue Metrics

| Old Name (OpenCensus) | New Name (OpenTelemetry) | Type | Change |
|---|---|---|---|
| `tekton_pipelines_controller_workqueue_adds_total` | `kn_workqueue_adds_total` | Counter | Prefix change |
| `tekton_pipelines_controller_workqueue_depth` | `kn_workqueue_depth` | Gauge | Prefix change |
| `tekton_pipelines_controller_workqueue_queue_latency_seconds` | `kn_workqueue_queue_duration_seconds` | Histogram | Renamed `latency` → `duration` |
| `tekton_pipelines_controller_workqueue_work_duration_seconds` | `kn_workqueue_process_duration_seconds` | Histogram | Renamed `work` → `process` |
| `tekton_pipelines_controller_workqueue_retries_total` | `kn_workqueue_retries_total` | Counter | Prefix change |
| `tekton_pipelines_controller_workqueue_unfinished_work_seconds` | `kn_workqueue_unfinished_work_seconds` | Gauge | Prefix change |
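
The rename is mechanical for dashboards tracking queue health. As an illustrative sketch (relying only on the standard Prometheus `_bucket` suffix for histogram series), a p99 queue-wait query against the new names looks like this:

```promql
# p99 time items spend waiting in the workqueue, across all queues
histogram_quantile(0.99,
  sum by(le) (rate(kn_workqueue_queue_duration_seconds_bucket[5m]))
)
```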

### 1.2 Kubernetes Client Metrics

| Old Name (OpenCensus) | New Name (OpenTelemetry) | Type | Change |
|---|---|---|---|
| `tekton_pipelines_controller_client_latency` | `http_client_request_duration_seconds` | Histogram | Standard HTTP metric |
| `tekton_pipelines_controller_client_results` | `kn_k8s_client_http_response_status_code_total` | Counter | Detailed status tracking |
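
A sketch of a client error-ratio query against the new counter. The exact label key carrying the HTTP status code is not guaranteed here; treat `http_response_status_code` below as an assumption and confirm it against your controller's `/metrics` output first:

```promql
# Fraction of K8s API responses that are 5xx (status-code label name assumed)
sum(rate(kn_k8s_client_http_response_status_code_total{http_response_status_code=~"5.."}[5m]))
/
sum(rate(kn_k8s_client_http_response_status_code_total[5m]))
```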

### 1.3 Go Runtime Metrics

All Go runtime metrics renamed: `tekton_pipelines_controller_go_*` → `go_*`

Examples:

- `tekton_pipelines_controller_go_goroutines` → `go_goroutines`
- `tekton_pipelines_controller_go_memstats_alloc_bytes` → `go_memstats_alloc_bytes`
- `tekton_pipelines_controller_go_threads` → `go_threads`

## 2. Core Tekton Metrics - No Breaking Changes

Core metrics retain their names and labels. By default, there are no changes to PipelineRun and TaskRun metrics.

| Metric Name | Labels | Change |
|---|---|---|
| `tekton_pipelines_controller_pipelinerun_duration_seconds` | `pipeline`, `status` | ✅ No change |
| `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds` | `pipeline`, `pipelinerun`, `status`, `task`, `taskrun` | ✅ No change |
| `tekton_pipelines_controller_taskrun_duration_seconds` | `status`, `task`, `taskrun` | ✅ No change |
| `tekton_pipelines_controller_pipelinerun_total` | `status` | ✅ No change |
| `tekton_pipelines_controller_taskrun_total` | `status` | ✅ No change |

### 2.1 Backward Compatibility

These metrics maintain full backward compatibility with the OpenCensus implementation when using default configuration.

### 2.2 Optional Feature: `reason` Label (Disabled by Default)

An opt-in feature adds a `reason` label to duration metrics for more granular failure analysis.

To enable (not recommended for high-volume clusters):

```yaml
data:
  metrics.count.enable-reason: "true"
```

When enabled, duration metrics gain the `reason` label:

| Metric Name | Default Labels | With `enable-reason: true` |
|---|---|---|
| `tekton_pipelines_controller_pipelinerun_duration_seconds` | `pipeline`, `status` | `pipeline`, `status`, `reason` |
| `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds` | `pipeline`, `pipelinerun`, `status`, `task`, `taskrun` | All previous + `reason` |
| `tekton_pipelines_controller_taskrun_duration_seconds` | `status`, `task`, `taskrun` | All previous + `reason` |

**Reason values**: `Succeeded`, `Failed`, `Completed`, `Cancelled`, `PipelineRunCancelled`, `TimedOut`, `StoppedRunFinally`, `CancelledRunFinally`

⚠️ **Cardinality Impact**: Enabling this increases time series by 3-5x. Use `sum by(le)` aggregation in queries to mitigate.

**Note**: Total counters (`*_total`) never include the `reason` label, even when enabled.
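
If you do enable the label, it pairs naturally with the duration histograms' `_count` series for failure triage. A sketch, assuming `status="failed"` matches the status values in your deployment:

```promql
# Failed PipelineRuns per second, broken down by failure reason
sum by(reason) (
  rate(tekton_pipelines_controller_pipelinerun_duration_seconds_count{status="failed"}[5m])
)
```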

### 2.3 Preserved Behavior

| Feature | Implementation |
|---|---|
| Pod latency metric | Remains a Gauge (metric type preserved) |
| Duration histogram buckets | Custom explicit buckets: `[10s, 30s, 1m, 5m, 15m, 30m, 1h, 1.5h, 3h, 6h, +Inf]` |
| Metric cardinality | Unchanged with default configuration |

## 3. Removed Metrics

| Removed Metric | Replacement | Migration |
|---|---|---|
| `tekton_pipelines_controller_reconcile_count` | `kn_workqueue_adds_total` | Monitor reconciliation activity |
| `tekton_pipelines_controller_reconcile_latency` | `kn_workqueue_process_duration_seconds` | Monitor reconciliation duration |
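
The substitutions are query-for-query. A sketch of both replacements (the `_bucket` suffix follows the standard Prometheus histogram convention):

```promql
# Reconciliation activity, formerly rate(tekton_pipelines_controller_reconcile_count[5m])
rate(kn_workqueue_adds_total[5m])

# Reconciliation duration (p95), formerly based on tekton_pipelines_controller_reconcile_latency
histogram_quantile(0.95,
  sum by(le) (rate(kn_workqueue_process_duration_seconds_bucket[5m]))
)
```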

## 4. Configuration Changes

### 4.1 Before (OpenCensus) - REMOVED

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  metrics.backend-destination: prometheus  # DEPRECATED - no longer supported
```

### 4.2 After (OpenTelemetry) - Current

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: tekton-pipelines
data:
  # ===== Metrics Configuration =====
  metrics-protocol: prometheus                    # Required: prometheus, grpc, http/protobuf, none
  metrics-endpoint: ""                            # Optional: OTLP endpoint (for grpc/http)
  metrics-export-interval: "30s"                  # Optional: Export frequency

  # ===== Tracing Configuration (NEW) =====
  tracing-protocol: none                          # Optional: grpc, http/protobuf, none, stdout
  tracing-endpoint: ""                            # Optional: OTLP tracing endpoint
  tracing-sampling-rate: "1.0"                    # Optional: 0.0-1.0 (1.0 = 100%)

  # ===== Runtime Configuration =====
  runtime-profiling: disabled                     # Optional: enabled, disabled
  runtime-export-interval: "15s"                  # Optional: Runtime metrics export interval

  # ===== Tekton-Specific Settings =====
  metrics.taskrun.level: "task"                   # task, namespace, cluster
  metrics.taskrun.duration-type: "histogram"      # histogram, gauge
  metrics.pipelinerun.level: "pipeline"           # pipeline, namespace, cluster
  metrics.pipelinerun.duration-type: "histogram"  # histogram, gauge
  metrics.count.enable-reason: "false"            # Add reason label to duration metrics (see Section 2.2)
```

**Key Change**: `metrics.backend-destination` → `metrics-protocol`

### 4.3 Supported Protocols

| Protocol | Use Case | Endpoint Required | Format |
|---|---|---|---|
| `prometheus` | Prometheus scraping (default) | No | Prometheus exposition |
| `grpc` | OTLP over gRPC | Yes | OpenTelemetry Protocol |
| `http` / `http/protobuf` | OTLP over HTTP | Yes | OpenTelemetry Protocol |
| `none` | Disable metrics | No | N/A |

### 4.4 Configuration Examples

#### Example 1: Prometheus (Default)

```yaml
data:
  metrics-protocol: prometheus
```

#### Example 2: OTLP gRPC to OpenTelemetry Collector

```yaml
data:
  metrics-protocol: grpc
  metrics-endpoint: "otel-collector.observability.svc.cluster.local:4317"
  metrics-export-interval: "30s"
```
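
On the receiving side, the Collector needs an OTLP gRPC receiver listening on port 4317. A minimal matching Collector configuration might look like the sketch below; it assumes a distribution that bundles the `prometheus` exporter (e.g. contrib), and the ports and pipeline layout are placeholders to adapt:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # Prometheus scrapes the Collector here

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```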

#### Example 3: OTLP HTTP with Tracing

```yaml
data:
  metrics-protocol: http/protobuf
  metrics-endpoint: "http://otel-collector.observability.svc.cluster.local:4318/v1/metrics"

  tracing-protocol: grpc
  tracing-endpoint: "otel-collector.observability.svc.cluster.local:4317"
  tracing-sampling-rate: "0.1"  # 10% sampling
```

## 5. Migration Steps

### 5.1 Pre-Migration Checklist

- Inventory all dashboards using Tekton metrics
- Inventory all alerts using Tekton metrics
- Back up the `config-observability` ConfigMap (see the command below)
- Plan a maintenance window
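
For the backup item, a plain export of the ConfigMap is enough to restore the old settings if a rollback is needed:

```bash
# Save the current ConfigMap before making any changes
kubectl get configmap config-observability -n tekton-pipelines -o yaml \
  > config-observability.backup.yaml
```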

### 5.2 Step-by-Step Process

#### Step 1: Update Configuration

```bash
kubectl edit configmap config-observability -n tekton-pipelines
# Change: metrics.backend-destination: prometheus
# To:     metrics-protocol: prometheus
```
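
If you are scripting the migration rather than editing interactively, the same change can be applied as a merge patch:

```bash
# Set the new key
kubectl patch configmap config-observability -n tekton-pipelines \
  --type merge -p '{"data":{"metrics-protocol":"prometheus"}}'

# Optionally drop the deprecated key (null deletes it in a JSON merge patch)
kubectl patch configmap config-observability -n tekton-pipelines \
  --type merge -p '{"data":{"metrics.backend-destination":null}}'
```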

#### Step 2: Upgrade Tekton Pipelines

```bash
# Apply the new version containing the OTel migration
kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/latest/release.yaml

# Wait for rollout
kubectl rollout status deployment/tekton-pipelines-controller -n tekton-pipelines
```

#### Step 3: Verify Metrics Endpoint

```bash
# Port-forward to controller
kubectl port-forward -n tekton-pipelines deployment/tekton-pipelines-controller 9090:9090

# Check metrics (in another terminal)
curl http://localhost:9090/metrics | grep -E "(kn_workqueue|tekton_pipelines_controller)"
```

#### Step 4: Update Dashboards

| Old Query | New Query |
|---|---|
| `rate(tekton_pipelines_controller_workqueue_adds_total[5m])` | `rate(kn_workqueue_adds_total[5m])` |
| `tekton_pipelines_controller_pipelinerun_duration_seconds` | No change needed |
| `tekton_pipelines_controller_go_goroutines` | `go_goroutines` |

**Note**: Core Tekton metrics (PipelineRun/TaskRun) queries remain unchanged with default configuration.
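
To find panels that still reference the removed prefixes, a simple grep over exported dashboard definitions works. The `dashboards/` path below is a placeholder for wherever your Grafana JSON lives:

```bash
# List dashboard files still using pre-migration metric names
grep -rlE 'tekton_pipelines_controller_(workqueue|client|go|reconcile)_' dashboards/
```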

#### Step 5: Update Alerts

**Before:**

```yaml
- alert: HighWorkqueueDepth
  expr: tekton_pipelines_controller_workqueue_depth > 100
```

**After:**

```yaml
- alert: HighWorkqueueDepth
  expr: kn_workqueue_depth > 100
```

#### Step 6: Verify and Monitor

- Confirm metrics appear in Prometheus (see the presence check below)
- Check that dashboards display correctly
- Verify alerts fire as expected
- Monitor the controller logs for errors
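
For the Prometheus check, a quick presence query is usually enough; each of these should return a non-empty result after the upgrade:

```promql
# Run each expression separately in the Prometheus UI
count(kn_workqueue_depth)
count(tekton_pipelines_controller_pipelinerun_total)
```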

## 6. Query Examples

### Example 1: PipelineRun Success Rate

**Before:**

```promql
sum(rate(tekton_pipelines_controller_pipelinerun_total{status="success"}[5m]))
/
sum(rate(tekton_pipelines_controller_pipelinerun_total[5m]))
```

**After:** ✅ No change needed (the `reason` label is not added to `*_total`)

```promql
sum(rate(tekton_pipelines_controller_pipelinerun_total{status="success"}[5m]))
/
sum(rate(tekton_pipelines_controller_pipelinerun_total[5m]))
```

### Example 2: P95 PipelineRun Duration

**Before:**

```promql
histogram_quantile(0.95,
  rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
)
```

**After:** ✅ No change needed (with the default configuration)

```promql
histogram_quantile(0.95,
  rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
)
```

**Note**: If you enable `metrics.count.enable-reason: "true"`, you'll need to aggregate by `le`:

```promql
histogram_quantile(0.95,
  sum by(le) (
    rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
  )
)

# OR keep reason for granular failure analysis:
histogram_quantile(0.95,
  sum by(le, reason) (
    rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
  )
)
```

### Example 3: Controller Memory

**Before:**

```promql
tekton_pipelines_controller_go_memstats_alloc_bytes
```

**After:** ✅ Simple prefix change

```promql
go_memstats_alloc_bytes
```

### Example 4: Workqueue Processing Rate

**Before:**

```promql
rate(tekton_pipelines_controller_workqueue_adds_total{name="pipelinerun"}[5m])
```

**After:** ✅ Prefix change only

```promql
rate(kn_workqueue_adds_total{name="pipelinerun"}[5m])
```

## 7. Troubleshooting

### 7.1 Verifying Metrics Export

```bash
# Check metrics endpoint
kubectl port-forward -n tekton-pipelines deployment/tekton-pipelines-controller 9090:9090

# Verify new metrics exist
curl http://localhost:9090/metrics | grep kn_workqueue
curl http://localhost:9090/metrics | grep tekton_pipelines_controller
```

### 7.2 Prometheus Not Scraping

**Check Prometheus targets:**

1. Navigate to the Prometheus UI → Status → Targets
2. Look for `tekton-pipelines-controller`
3. Verify the State is **UP**

**Verify the ServiceMonitor:**

```bash
kubectl get servicemonitor -n tekton-pipelines -o yaml
```
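
If no ServiceMonitor exists, you may need to create one. A sketch, assuming the Prometheus Operator is installed; the selector labels and port name below are placeholders that must match your controller Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-controller
  namespace: tekton-pipelines
spec:
  selector:
    matchLabels:
      app: tekton-pipelines-controller   # must match your Service's labels
  endpoints:
    - port: http-metrics                 # must match your Service's metrics port name
      interval: 30s
```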

### 7.3 High Cardinality Warnings

**Problem**: Prometheus warns about high series cardinality after enabling `metrics.count.enable-reason`.

**Root Cause**: The `reason` label increases time series by 3-5x.

**Solution 1**: Disable the feature (recommended):

```yaml
data:
  metrics.count.enable-reason: "false"
```

**Solution 2**: Aggregate away the `reason` label in queries:

```promql
# Instead of:
histogram_quantile(0.95, rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m]))

# Use:
histogram_quantile(0.95, sum by(le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])))
```

**Note**: This issue only occurs if you explicitly enabled `metrics.count.enable-reason: "true"`.

### 7.4 Configuration Not Taking Effect

```bash
# Restart controller to pick up ConfigMap changes
kubectl rollout restart deployment/tekton-pipelines-controller -n tekton-pipelines
kubectl rollout status deployment/tekton-pipelines-controller -n tekton-pipelines

# Check logs
kubectl logs -n tekton-pipelines deployment/tekton-pipelines-controller | grep -i "observability"
```

### 7.5 OTLP Export Failures

**Check controller logs:**

```bash
kubectl logs -n tekton-pipelines deployment/tekton-pipelines-controller | grep -i "otel\|export\|metric"
```

**Test endpoint connectivity:**

```bash
kubectl exec -n tekton-pipelines deployment/tekton-pipelines-controller -- \
  nc -zv otel-collector.observability.svc.cluster.local 4317
```

### 7.6 Enable Debug Logging

```bash
kubectl edit configmap config-logging -n tekton-pipelines
# Add: loglevel.controller: "debug"

kubectl rollout restart deployment/tekton-pipelines-controller -n tekton-pipelines
```

## 8. Frequently Asked Questions

**Q: Do I need to upgrade immediately?**

A: The migration ships in an upcoming Tekton Pipelines release. Plan to migrate when you upgrade to that version, and test in staging first.

**Q: Will old metrics continue to work during transition?**

A: No. This is a hard cutover. Only OpenTelemetry metrics will be available after upgrade.

**Q: What if I don't update my configuration?**

A: If using Prometheus, the default behavior is preserved, but dashboards/alerts will break due to metric name changes.

**Q: Can I use both OpenCensus and OpenTelemetry?**

A: No. The controller only emits OpenTelemetry metrics after upgrade.

**Q: How do I test without affecting production?**

A: Deploy in a test/staging environment first. Verify all metrics, dashboards, and alerts work before upgrading production.

**Q: Where can I get help?**

A: File an issue at https://github.com/tektoncd/pipeline/issues or ask in Slack (#tekton channel).


## 9. Quick Reference Checklist

### Infrastructure Metrics Update

| Category | Old Prefix | New Prefix | Action |
|---|---|---|---|
| Workqueue | `tekton_pipelines_controller_workqueue_*` | `kn_workqueue_*` | Update all queries |
| K8s Client | `tekton_pipelines_controller_client_*` | `http_client_*` or `kn_k8s_client_*` | Update all queries |
| Go Runtime | `tekton_pipelines_controller_go_*` | `go_*` | Update all queries |

### Core Metrics Update

- **No changes needed** - core metrics are backward compatible
- If you enable `metrics.count.enable-reason`, add `sum by(le)` aggregation to duration queries

### Configuration Update

- Change `metrics.backend-destination` → `metrics-protocol`
- Add an OTLP endpoint if using the gRPC/HTTP protocols
- Configure tracing if desired (new capability)

### Dashboard Update

- Replace workqueue metric names
- Replace Go runtime metric names
- Replace K8s client metric names
- No changes needed for core Tekton metrics (PipelineRun/TaskRun)
- Test all panels

### Alert Update

- Update workqueue metrics in alert rules
- Update Go runtime metrics in alert rules
- Update K8s client metrics if used
- Verify that thresholds are still appropriate
- Test that alerts fire correctly

## Additional Resources


For questions or clarifications, please refer to PR #9043 or contact the Tekton team.