.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _metrics:

********************
Monitoring & Metrics
********************

Cilium and Hubble can both be configured to serve `Prometheus
<https://prometheus.io>`_ metrics. Prometheus is a pluggable metrics collection
and storage system and can act as a data source for `Grafana
<https://grafana.com/>`_, a metrics visualization frontend. Unlike push-based
metrics collectors such as statsd, Prometheus pulls metrics from each source.

Cilium and Hubble metrics can be enabled independently of each other.

Cilium metrics provide insights into the state of Cilium itself, namely
of the ``cilium-agent``, ``cilium-envoy``, and ``cilium-operator`` processes.
To run Cilium with Prometheus metrics enabled, deploy it with the
``prometheus.enabled=true`` Helm value set.

Cilium metrics are exported under the ``cilium_`` Prometheus namespace. Envoy
metrics are exported under the ``envoy_`` Prometheus namespace, of which the
Cilium-defined metrics are exported under the ``envoy_cilium_`` namespace.
When running and collecting in Kubernetes they will be tagged with a pod name
and namespace.

You can enable metrics for ``cilium-agent`` (including Envoy) with the Helm value
``prometheus.enabled=true``. ``cilium-operator`` metrics are enabled by default;
to disable them, set the Helm value ``operator.prometheus.enabled=false``.

.. cilium-helm-install::
   :namespace: kube-system
   :set: prometheus.enabled=true
         operator.prometheus.enabled=true

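
The same settings can also be kept in a Helm values file instead of being
passed on the command line. The following is a minimal sketch, assuming the
usual Helm mapping of dotted keys to nested YAML; the file name
``values-metrics.yaml`` is hypothetical.

.. code-block:: yaml

    # values-metrics.yaml (hypothetical file name)
    prometheus:
      enabled: true        # cilium-agent (including Envoy) metrics
    operator:
      prometheus:
        enabled: true      # cilium-operator metrics (already the default)

Such a file would typically be passed to Helm with ``-f values-metrics.yaml``.
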
Prometheus Port Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ports can be configured via ``prometheus.port``,
``envoy.prometheus.port``, or ``operator.prometheus.port`` respectively.
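
As an illustration, the three port values could be set together in a values
file. This is a sketch only: ``9962`` and ``9964`` match the annotations
shown in this section, while ``9963`` for the operator is an assumption and
should be checked against the chart defaults.

.. code-block:: yaml

    prometheus:
      enabled: true
      port: 9962           # cilium-agent metrics port
    envoy:
      prometheus:
        port: 9964         # Envoy metrics port
    operator:
      prometheus:
        port: 9963         # assumed value, verify against your chart version
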

When metrics are enabled and ServiceMonitor is not enabled
(``hubble.metrics.serviceMonitor.enabled: false``), all Cilium components will
have the following annotations. These annotations can be used to signal to
Prometheus whether to scrape metrics.

If ServiceMonitor is enabled (``hubble.metrics.serviceMonitor.enabled: true``),
these annotations are omitted and Prometheus discovers metrics via the
ServiceMonitor resource.

.. code-block:: yaml

    prometheus.io/scrape: true
    prometheus.io/port: 9962

To collect Envoy metrics the Cilium chart will create a Kubernetes headless
service named ``cilium-agent`` with the ``prometheus.io/scrape:'true'`` annotation set:

.. code-block:: yaml

    prometheus.io/scrape: true
    prometheus.io/port: 9964

This additional headless service is needed because each component can carry
only one Prometheus scrape and port annotation.

Prometheus will pick up the Cilium and Envoy metrics automatically if the following
option is set in the ``scrape_configs`` section:

.. code-block:: yaml

    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: ${1}:${2}
          target_label: __address__

Prometheus Operator ServiceMonitor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can automatically create a
`Prometheus Operator <https://github.com/prometheus-operator/prometheus-operator>`__
``ServiceMonitor`` by setting ``prometheus.serviceMonitor.enabled=true``, or
``envoy.prometheus.serviceMonitor.enabled=true``, or
``operator.prometheus.serviceMonitor.enabled=true`` respectively.
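
Expressed as a values file, enabling all three ``ServiceMonitor`` resources
might look like the sketch below. It assumes the Prometheus Operator CRDs are
already installed in the cluster; otherwise the ``ServiceMonitor`` objects
cannot be created.

.. code-block:: yaml

    prometheus:
      enabled: true
      serviceMonitor:
        enabled: true      # ServiceMonitor for cilium-agent
    envoy:
      prometheus:
        serviceMonitor:
          enabled: true    # ServiceMonitor for Envoy
    operator:
      prometheus:
        enabled: true
        serviceMonitor:
          enabled: true    # ServiceMonitor for cilium-operator
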
.. _hubble_metrics:

While Cilium metrics allow you to monitor the state of Cilium itself,
Hubble metrics allow you to monitor the network behavior of your
Cilium-managed Kubernetes pods with respect to connectivity and security.

Some of the metrics can also be configured with additional options.
See the :ref:`Hubble exported metrics<hubble_exported_metrics>`
section for the full list of available metrics and their options.

Hubble metrics can be configured with either a static or a dynamic exporter.
The dynamic metrics exporter allows you to change defined metrics as needed
without requiring an agent restart.

To deploy Cilium with the Hubble metrics static exporter enabled, you need to enable
Hubble with ``hubble.enabled=true`` and provide a set of Hubble metrics you want to
enable via ``hubble.metrics.enabled``.

.. cilium-helm-install::
   :namespace: kube-system
   :set: prometheus.enabled=true
         operator.prometheus.enabled=true
         hubble.enabled=true
         hubble.metrics.enableOpenMetrics=true
         hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\,source_namespace\,source_workload\,destination_ip\,destination_namespace\,destination_workload\,traffic_direction}"

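
The ``--set`` syntax above needs the commas inside ``labelsContext`` to be
escaped. In a values file the same metric list can be written as a plain YAML
list without escaping. The following is a sketch, assuming the standard Helm
mapping of ``{a,b}`` to a list:

.. code-block:: yaml

    prometheus:
      enabled: true
    operator:
      prometheus:
        enabled: true
    hubble:
      enabled: true
      metrics:
        enableOpenMetrics: true
        enabled:
          - dns
          - drop
          - tcp
          - flow
          - port-distribution
          - icmp
          # One list entry; options follow the metric name after the colon.
          - "httpV2:exemplars=true;labelsContext=source_ip,source_namespace,source_workload,destination_ip,destination_namespace,destination_workload,traffic_direction"
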
To deploy Cilium with Hubble dynamic metrics enabled, you need to enable Hubble
with ``hubble.enabled=true`` and ``hubble.metrics.dynamic.enabled=true``.

In this example, a ConfigMap with a set of metrics is applied before
enabling the exporter, but the desired set of metrics (together with the
ConfigMap) can also be created during installation.
See the :ref:`helm_reference` (keys with ``hubble.metrics.dynamic.*``).

.. code-block:: yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cilium-dynamic-metrics-config
      namespace: kube-system
    data:
      dynamic-metrics.yaml: |
        metrics:
          - name: dns
          - contextOptions:
              - name: sourceContext
                values:
                  - workload-name
                  - reserved-identity
              - name: destinationContext
                values:
                  - workload-name
                  - reserved-identity
            name: flow
          - name: drop
          - name: tcp
          - contextOptions:
              - name: sourceContext
                values:
                  - workload-name
                  - reserved-identity
            name: icmp
          - contextOptions:
              - name: exemplars
                values:
                  - true
              - name: labelsContext
                values:
                  - source_ip
                  - source_namespace
                  - source_workload
                  - destination_ip
                  - destination_namespace
                  - destination_workload
                  - traffic_direction
              - name: sourceContext
                values:
                  - workload-name
                  - reserved-identity
              - name: destinationContext
                values:
                  - workload-name
                  - reserved-identity
            name: httpV2
          - contextOptions:
              - name: sourceContext
                values:
                  - app
                  - workload-name
                  - pod
                  - reserved-identity
              - name: destinationContext
                values:
                  - app
                  - workload-name
                  - pod
                  - dns
                  - reserved-identity
              - name: labelsContext
                values:
                  - source_namespace
                  - destination_namespace
            excludeFilters:
              - destination_pod:
                  - default/
            name: policy

Deploy the :term:`ConfigMap`:

.. code-block:: shell-session

    kubectl apply -f dynamic-metrics.yaml

.. cilium-helm-install::
   :namespace: kube-system
   :set: prometheus.enabled=true
         operator.prometheus.enabled=true
         hubble.enabled=true
         hubble.metrics.enableOpenMetrics=true
         hubble.metrics.enabled=[]
         hubble.metrics.dynamic.enabled=true
         hubble.metrics.dynamic.config.configMapName=cilium-dynamic-metrics-config
         hubble.metrics.dynamic.config.createConfigMap=false

Prometheus Port Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The port of the Hubble metrics can be configured with the
``hubble.metrics.port`` Helm value.
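
For example, a values file could pin the Hubble metrics port explicitly.
This is a sketch; ``9965`` is the port shown in the ``hubble-metrics``
annotation in this section, and the single ``dns`` metric is only there to
make the exporter non-empty.

.. code-block:: yaml

    hubble:
      enabled: true
      metrics:
        enabled:
          - dns            # any non-empty metric list enables the exporter
        port: 9965         # port the Hubble metrics endpoint listens on
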

For details on enabling Hubble metrics with TLS see the
:ref:`hubble_configure_metrics_tls` section of the documentation.

.. Note::

    L7 metrics, such as HTTP, are only emitted for pods that enable
    :ref:`Layer 7 Protocol Visibility <proxy_visibility>`.

When deployed with a non-empty ``hubble.metrics.enabled`` Helm value, the
Cilium chart will create a Kubernetes headless service named ``hubble-metrics``
with the ``prometheus.io/scrape:'true'`` annotation set:

.. code-block:: yaml

    prometheus.io/scrape: true
    prometheus.io/port: 9965

Set the following options in the ``scrape_configs`` section of Prometheus to
have it scrape all Hubble metrics from the endpoints automatically:

.. code-block:: yaml

    scrape_configs:
      - job_name: 'kubernetes-endpoints'
        scrape_interval: 30s
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: (.+)(?::\d+);(\d+)
            replacement: $1:$2

Prometheus Operator ServiceMonitor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can automatically create a
`Prometheus Operator <https://github.com/prometheus-operator/prometheus-operator>`__
``ServiceMonitor`` by setting ``hubble.metrics.serviceMonitor.enabled=true``.
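
A values-file sketch of the same setting follows. It assumes the Prometheus
Operator CRDs are installed; the two metrics listed are only placeholders
for whichever Hubble metrics you enable.

.. code-block:: yaml

    hubble:
      enabled: true
      metrics:
        enabled:
          - dns
          - drop
        serviceMonitor:
          enabled: true    # create a ServiceMonitor for the Hubble metrics service
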
.. _hubble_open_metrics:

Additionally, you can opt in to `OpenMetrics <https://openmetrics.io>`_ by
setting ``hubble.metrics.enableOpenMetrics=true``.
Enabling OpenMetrics configures the Hubble metrics endpoint to support exporting
metrics in OpenMetrics format when explicitly requested by clients.

Using OpenMetrics supports additional functionality such as Exemplars, which
enable associating metrics with traces by embedding trace IDs into the
exported metrics.

Prometheus needs to be configured to take advantage of OpenMetrics and will
only scrape exemplars when the `exemplars storage feature is enabled
<https://prometheus.io/docs/prometheus/latest/feature_flags/#exemplars-storage>`_.

OpenMetrics imposes a few additional requirements on metrics names and labels,
so this functionality is currently opt-in, though we believe all of the Hubble
metrics conform to the OpenMetrics requirements.
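
On the Prometheus side, exemplar storage is switched on with the
``--enable-feature=exemplar-storage`` command-line flag. The fragment below
sketches where that flag could go in a container spec. The container name,
image, and config path are assumptions about a hypothetical deployment, not
something the Cilium chart creates.

.. code-block:: yaml

    # Fragment of a Prometheus pod spec (hypothetical deployment).
    containers:
      - name: prometheus
        image: prom/prometheus
        args:
          - --config.file=/etc/prometheus/prometheus.yml
          # Feature flag that makes Prometheus store scraped exemplars.
          - --enable-feature=exemplar-storage
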
.. _clustermesh_apiserver_metrics:

Cluster Mesh API Server metrics provide insights into the state of the
``clustermesh-apiserver`` process, the ``kvstoremesh`` process (if enabled),
and the sidecar etcd instance.
Cluster Mesh API Server metrics are exported under the ``cilium_clustermesh_apiserver_``
Prometheus namespace. KVStoreMesh metrics are exported under the ``cilium_kvstoremesh_``
Prometheus namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.

You can enable the metrics for different Cluster Mesh API Server components by setting the following values:

- ``clustermesh.apiserver.metrics.enabled=true``
- ``clustermesh.apiserver.metrics.kvstoremesh.enabled=true``
- ``clustermesh.apiserver.metrics.etcd.enabled=true``

.. cilium-helm-install::
   :namespace: kube-system
   :set: clustermesh.useAPIServer=true
         clustermesh.apiserver.metrics.enabled=true
         clustermesh.apiserver.metrics.kvstoremesh.enabled=true
         clustermesh.apiserver.metrics.etcd.enabled=true

You can configure the ports with ``clustermesh.apiserver.metrics.port``,
``clustermesh.apiserver.metrics.kvstoremesh.port`` and
``clustermesh.apiserver.metrics.etcd.port`` respectively.
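
Put together in a values file, the enable flags and ports could look like the
sketch below. The port numbers are illustrative assumptions; check the chart
defaults before relying on them.

.. code-block:: yaml

    clustermesh:
      useAPIServer: true
      apiserver:
        metrics:
          enabled: true    # clustermesh-apiserver metrics
          port: 9962       # illustrative port
          kvstoremesh:
            enabled: true  # KVStoreMesh metrics
            port: 9964     # illustrative port
          etcd:
            enabled: true  # sidecar etcd metrics
            port: 9963     # illustrative port
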

You can automatically create a
`Prometheus Operator <https://github.com/prometheus-operator/prometheus-operator>`_
``ServiceMonitor`` by setting ``clustermesh.apiserver.metrics.serviceMonitor.enabled=true``.

If you don't have an existing Prometheus and Grafana stack running, you can deploy a stack with:

.. parsed-literal::

    kubectl apply -f \ |SCM_WEB|\/examples/kubernetes/addons/prometheus/monitoring-example.yaml

It will run Prometheus and Grafana in the ``cilium-monitoring`` namespace. If
you have enabled either Cilium or Hubble metrics, they will automatically
be scraped by Prometheus. You can then expose Grafana to access it via your browser.

.. code-block:: shell-session

    kubectl -n cilium-monitoring port-forward service/grafana --address 0.0.0.0 --address :: 3000:3000

Open your browser and access http://localhost:3000/

Configuration
^^^^^^^^^^^^^

To expose any metrics, invoke ``cilium-agent`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9962``) will bind the server to all available
interfaces (there is usually only one in a container).

To customize metrics, use the ``+``/``-`` prefix to enable or disable specific metrics.
For large clusters, consider disabling high-cardinality metrics like
``cilium_node_connectivity_status`` and ``cilium_node_connectivity_latency_seconds``.

.. tabs::

   .. group-tab:: Helm

      Use the ``prometheus.metrics`` value:

      .. parsed-literal::

         helm install cilium cilium/cilium |CHART_VERSION| \\
           --namespace kube-system \\
           --set prometheus.enabled=true \\
           --set prometheus.metrics="{-cilium_node_connectivity_status,-cilium_node_connectivity_latency_seconds}"

   .. group-tab:: CLI

      Use the ``--metrics`` flag:

      .. code-block:: shell-session

         cilium-agent --prometheus-serve-addr=:9962 \
           --metrics="-cilium_node_connectivity_status -cilium_node_connectivity_latency_seconds"

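
The Helm variant can also be expressed in a values file, which avoids shell
quoting of the ``{...}`` list. This is a sketch that assumes
``prometheus.metrics`` accepts a YAML list of prefixed metric names, as the
``--set`` form above suggests:

.. code-block:: yaml

    prometheus:
      enabled: true
      metrics:
        # A leading "-" disables the metric; quoting keeps the YAML unambiguous.
        - "-cilium_node_connectivity_status"
        - "-cilium_node_connectivity_latency_seconds"
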
Feature Metrics
~~~~~~~~~~~~~~~

Cilium Feature Metrics are exported under the ``cilium_feature`` Prometheus
namespace.
The following tables categorize feature metrics into four groups:
- **Advanced Connectivity and Load Balancing** (:ref:`cilium-feature-adv-connect-and-lb`)
This category includes features related to advanced networking and load
balancing capabilities, such as Bandwidth Manager, BGP, Envoy Proxy, and
Cluster Mesh.
- **Control Plane** (:ref:`cilium-feature-controlplane`)
These metrics track control plane configurations, including identity
allocation modes and IP address management (IPAM).
- **Datapath** (:ref:`cilium-feature-datapath`)
Metrics in this group monitor datapath configurations, such as Internet
protocol modes, chaining modes, and network modes.
- **Network Policies** (:ref:`cilium-feature-network-policies`)
This group encompasses metrics related to policy enforcement, including
Cilium Network Policies, Host Firewall, DNS policies, and Mutual Auth.
For example, to check if the Bandwidth Manager is enabled on a Cilium agent,
observe the metric ``cilium_feature_adv_connect_and_lb_bandwidth_manager_enabled``.
All metrics follow the format ``cilium_feature`` + group name + metric name.
A value of ``0`` indicates that the feature is disabled, while ``1`` indicates it
is enabled.
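
Because these metrics are plain ``0``/``1`` gauges, they are easy to use in
Prometheus rules. The following rule file is a sketch only: the group name,
alert name, ``for`` duration, and severity label are hypothetical choices,
while the metric name is the one from the example above.

.. code-block:: yaml

    groups:
      - name: cilium-feature-checks          # hypothetical group name
        rules:
          - alert: CiliumBandwidthManagerDisabled
            # Fires for every agent that reports the feature as disabled (0).
            expr: cilium_feature_adv_connect_and_lb_bandwidth_manager_enabled == 0
            for: 10m
            labels:
              severity: info
            annotations:
              summary: "Bandwidth Manager is reported as disabled on a Cilium agent"
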
.. note::
For metrics of type "counter", the agent has processed the associated object
(e.g., a network policy) but might not be actively enforcing it. These
metrics serve to observe if the object has been received and processed, but
not necessarily enforced by the agent.
.. include:: feature-metrics-agent.txt
Exported Metrics
^^^^^^^^^^^^^^^^
Endpoint
~~~~~~~~
============================================ ================================================== ========== ========================================================
Name Labels Default Description
============================================ ================================================== ========== ========================================================
``endpoint`` Enabled Number of endpoints managed by this agent
``endpoint_restoration_endpoints`` ``phase``, ``outcome`` Enabled Number of restored endpoints labeled by phase and outcome
``endpoint_restoration_duration_seconds`` ``phase`` Enabled Duration of restoration phases in seconds
``endpoint_regenerations_total`` ``outcome`` Enabled Count of all endpoint regenerations that have completed
``endpoint_regeneration_time_stats_seconds`` ``scope`` Enabled Endpoint regeneration time stats
``endpoint_state`` ``state`` Enabled Count of all endpoints
============================================ ================================================== ========== ========================================================
Services
~~~~~~~~
========================================== ================================================== ========== ========================================================
Name Labels Default Description
========================================== ================================================== ========== ========================================================
``services_events_total`` Enabled Number of services events labeled by action type
``service_implementation_delay`` ``action`` Enabled Duration in seconds to propagate the data plane programming of a service, its network and endpoints from the time the service or the service pod was changed excluding the event queue latency
========================================== ================================================== ========== ========================================================
Cluster health
~~~~~~~~~~~~~~
========================================== ================================================== ========== ========================================================
Name Labels Default Description
========================================== ================================================== ========== ========================================================
``unreachable_nodes`` Enabled Number of nodes that cannot be reached
``unreachable_health_endpoints`` Enabled Number of health endpoints that cannot be reached
========================================== ================================================== ========== ========================================================
Node Connectivity
~~~~~~~~~~~~~~~~~

============================================ ======================================== ========== ========================================================
Name                                         Labels                                   Default    Description
============================================ ======================================== ========== ========================================================
``node_health_connectivity_status``          ``type``, ``status``                     Enabled    Number of endpoints with last observed status of both ICMP and HTTP connectivity between the current Cilium agent and other Cilium nodes
``node_health_connectivity_latency_seconds`` ``type``, ``address_type``, ``protocol`` Enabled    Histogram of the last observed latency between the current Cilium agent and other Cilium nodes in seconds
============================================ ======================================== ========== ========================================================

Clustermesh
~~~~~~~~~~~

================================================ ================== ========== =================================================================
Name Labels Default Description
================================================ ================== ========== =================================================================
``clustermesh_remote_cluster_services`` ``target_cluster`` Enabled The total number of services per remote cluster
``clustermesh_remote_cluster_endpoints`` ``target_cluster`` Enabled The total number of endpoints per remote cluster
``clustermesh_remote_cluster_nodes`` ``target_cluster`` Enabled The total number of nodes per remote cluster
``clustermesh_remote_clusters`` Enabled The total number of remote clusters meshed with the local cluster
``clustermesh_remote_cluster_failures`` ``target_cluster`` Enabled The total number of failures related to the remote cluster
``clustermesh_remote_cluster_last_failure_ts`` ``target_cluster`` Enabled The timestamp of the last failure of the remote cluster
``clustermesh_remote_cluster_readiness_status`` ``target_cluster`` Enabled The readiness status of the remote cluster
``clustermesh_remote_cluster_cache_revocations`` ``target_cluster`` Enabled The total number of cache revocations related to the remote cluster
================================================ ================== ========== =================================================================
Datapath
~~~~~~~~
============================================= ================================================== ========== ========================================================
Name Labels Default Description
============================================= ================================================== ========== ========================================================
``datapath_conntrack_dump_resets_total`` ``area``, ``name``, ``family`` Enabled Number of conntrack dump resets. Happens when a BPF entry gets removed while dumping the map is in progress.
``datapath_conntrack_gc_runs_total`` ``status`` Enabled Number of times that the conntrack garbage collector process was run
``datapath_conntrack_gc_key_fallbacks_total`` Enabled The number of alive and deleted conntrack entries at the end of a garbage collector run labeled by datapath family
``datapath_conntrack_gc_entries`` ``family`` Enabled The number of alive and deleted conntrack entries at the end of a garbage collector run
``datapath_conntrack_gc_duration_seconds`` ``status`` Enabled Duration in seconds of the garbage collector process
============================================= ================================================== ========== ========================================================
IPsec
~~~~~
============================================= ================================================== ========== ===========================================================
Name Labels Default Description
============================================= ================================================== ========== ===========================================================
``ipsec_xfrm_error`` ``error``, ``type`` Enabled Total number of xfrm errors
``ipsec_keys`` Enabled Number of keys in use
``ipsec_xfrm_states`` ``direction`` Enabled Number of XFRM states
``ipsec_xfrm_policies`` ``direction`` Enabled Number of XFRM policies
============================================= ================================================== ========== ===========================================================
eBPF
~~~~
========================================== ===================================================================== ========== ========================================================
Name Labels Default Description
========================================== ===================================================================== ========== ========================================================
``bpf_syscall_duration_seconds`` ``operation``, ``outcome`` Disabled Duration of eBPF system call performed
``bpf_map_ops_total`` ``map_name``, ``operation``, ``outcome`` Enabled Number of eBPF map operations performed.
``bpf_map_pressure`` ``map_name`` Enabled Map pressure is defined as a ratio of the required map size compared to its configured size. Values < 1.0 indicate the map's utilization, while values >= 1.0 indicate that the map is full. Policy map pressure metrics are emitted only when map utilization exceeds the threshold set by ``policyMapPressureMetricsThreshold`` helm value, which defaults to 0.1 (10% full).
``bpf_map_capacity`` ``map_group`` Enabled Maximum size of eBPF maps by group of maps (type of map that have the same max capacity size). Map types with size of 65536 are not emitted, missing map types can be assumed to be 65536.
``bpf_maps_virtual_memory_max_bytes`` Enabled Max memory used by eBPF maps installed in the system
``bpf_progs_virtual_memory_max_bytes`` Enabled Max memory used by eBPF programs installed in the system
``bpf_ratelimit_dropped_total`` ``usage`` Enabled Total drops resulting from BPF ratelimiter, tagged by source of drop
========================================== ===================================================================== ========== ========================================================

Both ``bpf_maps_virtual_memory_max_bytes`` and ``bpf_progs_virtual_memory_max_bytes``
currently report the system-wide memory usage of eBPF, including memory that
is not directly managed by Cilium. This might change in the future so that
they only report the eBPF memory usage directly managed by Cilium.
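
As an illustration of how ``bpf_map_pressure`` can be consumed, the rule
below alerts before a map is full. It is a sketch under assumptions: the
metric is referenced with the ``cilium_`` namespace prefix described earlier,
and the ``0.9`` threshold, ``for`` duration, and names are hypothetical.

.. code-block:: yaml

    groups:
      - name: cilium-bpf                     # hypothetical group name
        rules:
          - alert: CiliumBPFMapPressureHigh
            # Values >= 1.0 mean the map is full; 0.9 leaves some headroom.
            expr: cilium_bpf_map_pressure >= 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "eBPF map {{ $labels.map_name }} is above 90% utilization"
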

Drops/Forwards (L3/L4)
~~~~~~~~~~~~~~~~~~~~~~

============================== ============================== ========== ========================================================
Name                           Labels                         Default    Description
============================== ============================== ========== ========================================================
``drop_count_total``           ``reason``, ``direction``      Enabled    Total dropped packets
``drop_bytes_total``           ``reason``, ``direction``      Enabled    Total dropped bytes
``forward_count_total``        ``direction``                  Enabled    Total forwarded packets
``forward_bytes_total``        ``direction``                  Enabled    Total forwarded bytes
``mtu_error_message_total``    ``direction``                  Enabled    Total number of ICMP fragmentation-needed or ICMPv6 packet-too-big messages processed
``fragmented_count_total``     ``direction``                  Enabled    Total number of fragmented packets processed
============================== ============================== ========== ========================================================
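
A common way to use the drop counters is to precompute a per-reason drop
rate. The recording rule below is a sketch: the group and rule names are
hypothetical, and the metric is assumed to carry the ``cilium_`` namespace
prefix described earlier.

.. code-block:: yaml

    groups:
      - name: cilium-datapath                # hypothetical group name
        rules:
          # Packets dropped per second over 5 minutes, by reason and direction.
          - record: cilium:drop_count:rate5m
            expr: sum by (reason, direction) (rate(cilium_drop_count_total[5m]))
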

Policy
~~~~~~

========================================== ================================================== ========== ========================================================
Name Labels Default Description
========================================== ================================================== ========== ========================================================
``policy`` Enabled Number of policies currently loaded
``policy_max_revision`` Enabled Highest policy revision number in the agent
``policy_change_total`` Enabled Number of policy changes by outcome
``policy_endpoint_enforcement_status`` Enabled Number of endpoints labeled by policy enforcement status
``policy_implementation_delay`` ``source`` Enabled Time in seconds between a policy change and it being fully deployed into the datapath, labeled by the policy's source
``policy_selector_match_count_max`` ``class`` Enabled The maximum number of identities selected by a network policy selector
``policy_incremental_update_duration`` ``scope`` Enabled The time taken for newly learned identities to be added to the policy system, including BPF policy maps and L7 proxies.
========================================== ================================================== ========== ========================================================

Policy L7 (HTTP/Kafka/FQDN)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

======================================== ======================================== ========== ========================================================
Name                                     Labels                                   Default    Description
======================================== ======================================== ========== ========================================================
``proxy_redirects``                      ``protocol``                             Enabled    Number of redirects installed for endpoints
``proxy_upstream_reply_seconds``         ``error``, ``protocol_l7``, ``scope``    Enabled    Seconds waited for upstream server to reply to a request
``proxy_datapath_update_timeout_total``                                           Disabled   Number of total datapath update timeouts due to FQDN IP updates
``policy_l7_total``                      ``rule``, ``proxy_type``                 Enabled    Number of total L7 requests/responses
======================================== ======================================== ========== ========================================================

Identity
~~~~~~~~

======================================== ================================================== ========== ========================================================
Name Labels Default Description
======================================== ================================================== ========== ========================================================
``identity`` ``type`` Enabled Number of identities currently allocated
``identity_label_sources`` ``source`` Enabled Number of identities which contain at least one label from the given label source
``identity_gc_entries`` ``identity_type`` Enabled Number of alive and deleted identities at the end of a garbage collector run
``identity_gc_runs`` ``outcome``, ``identity_type`` Enabled Number of times identity garbage collector has run
``identity_gc_latency`` ``outcome``, ``identity_type`` Enabled Duration of the last successful identity GC run
``ipcache_errors_total`` ``type``, ``error`` Enabled Number of errors interacting with the ipcache
``ipcache_events_total`` ``type`` Enabled Number of events interacting with the ipcache
``identity_cache_timer_duration`` ``name`` Enabled Seconds required to execute periodic policy processes. ``name="id-alloc-update-policy-maps"`` is the time taken to apply incremental updates to the BPF policy maps.
``identity_cache_timer_trigger_latency`` ``name`` Enabled Seconds spent waiting for a previous process to finish before starting the next round. ``name="id-alloc-update-policy-maps"`` is the time waiting before applying incremental updates to the BPF policy maps.
``identity_cache_timer_trigger_folds`` ``name`` Enabled Number of timer triggers that were coalesced in to one execution. ``name="id-alloc-update-policy-maps"`` applies the incremental updates to the BPF policy maps.
======================================== ================================================== ========== ========================================================

Events external to Cilium
~~~~~~~~~~~~~~~~~~~~~~~~~

============================== ==================== ========== ========================================================
Name                           Labels               Default    Description
============================== ==================== ========== ========================================================
``event_ts``                   ``source``           Enabled    Last timestamp when Cilium received an event from a control plane source, per resource and per action
``k8s_event_lag_seconds``      ``source``           Disabled   Lag for Kubernetes events - computed value between receiving a CNI ADD event from kubelet and a Pod event received from kube-api-server
============================== ==================== ========== ========================================================

Controllers
~~~~~~~~~~~

======================================== ================================================== ========== ========================================================
Name Labels Default Description
======================================== ================================================== ========== ========================================================
``controllers_runs_total`` ``status`` Enabled Number of times that a controller process was run
``controllers_runs_duration_seconds`` ``status`` Enabled Duration in seconds of the controller process
``controllers_group_runs_total`` ``status``, ``group_name`` Enabled Number of times that a controller process was run, labeled by controller group name
``controllers_failing`` Enabled Number of failing controllers
======================================== ================================================== ========== ========================================================

The ``controllers_group_runs_total`` metric reports the success and failure
count of each controller within the system, labeled by controller group name
and completion status. Due to the large number of controllers, this metric is
enabled on a per-controller basis. This is configured using an allow-list
which is passed as the ``controller-group-metrics`` configuration flag,
or the ``prometheus.controllerGroupMetrics`` Helm value. The current
recommended default set of group names can be found in the values file of
the Cilium Helm chart. The special names "all" and "none" are supported.
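
For example, the allow-list could be set in a values file as sketched below.
This assumes ``prometheus.controllerGroupMetrics`` takes a YAML list; ``all``
is one of the special names mentioned above and can be replaced by specific
group names from the chart's values file.

.. code-block:: yaml

    prometheus:
      enabled: true
      controllerGroupMetrics:
        - all              # special name: report every controller group
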
SubProcess
~~~~~~~~~~
======================================== ================================================== ========== ========================================================
Name Labels Default Description
======================================== ================================================== ========== ========================================================
``subprocess_start_total`` ``subsystem`` Enabled Number of times that Cilium has started a subprocess
======================================== ================================================== ========== ========================================================
Kubernetes
~~~~~~~~~~
=========================================== ================================================== ========== ========================================================
Name Labels Default Description
=========================================== ================================================== ========== ========================================================
``kubernetes_events_received_total`` ``scope``, ``action``, ``validity``, ``equal`` Enabled Number of Kubernetes events received
``kubernetes_events_total`` ``scope``, ``action``, ``outcome`` Enabled Number of Kubernetes events processed
``k8s_cnp_status_completion_seconds`` ``attempts``, ``outcome`` Enabled Duration in seconds in how long it took to complete a CNP status update
``k8s_terminating_endpoints_events_total`` Enabled Number of terminating endpoint events received from Kubernetes
=========================================== ================================================== ========== ========================================================
Kubernetes REST Client
~~~~~~~~~~~~~~~~~~~~~~

============================================ ===================================== ======= ===========================================================
Name                                         Labels                                Default Description
============================================ ===================================== ======= ===========================================================
``k8s_client_api_latency_time_seconds``      ``path``, ``method``                  Enabled Duration of processed API calls labeled by path and method
``k8s_client_rate_limiter_duration_seconds``                                       Enabled Kubernetes client rate limiter latency in seconds
``k8s_client_api_calls_total``               ``host``, ``method``, ``return_code`` Enabled Number of API calls made to kube-apiserver labeled by host, method and return code
============================================ ===================================== ======= ===========================================================
Kubernetes workqueue
==================================================== ============================================= ========== ===========================================================
Name Labels Default Description
==================================================== ============================================= ========== ===========================================================
``k8s_workqueue_depth`` ``name`` Enabled Current depth of workqueue
``k8s_workqueue_adds_total`` ``name`` Enabled Total number of adds handled by workqueue
``k8s_workqueue_queue_duration_seconds`` ``name`` Enabled Duration in seconds an item stays in workqueue prior to request
``k8s_workqueue_work_duration_seconds`` ``name`` Enabled Duration in seconds to process an item from workqueue
``k8s_workqueue_unfinished_work_seconds`` ``name`` Enabled Duration in seconds of work in progress that hasn't been observed by work_duration. Large values indicate stuck threads. You can deduce the number of stuck threads by observing the rate at which this value increases.
``k8s_workqueue_longest_running_processor_seconds`` ``name`` Enabled Duration in seconds of the longest running processor for workqueue
``k8s_workqueue_retries_total`` ``name`` Enabled Total number of retries handled by workqueue
==================================================== ============================================= ========== ===========================================================
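
The description of ``k8s_workqueue_unfinished_work_seconds`` suggests estimating the number of stuck threads from how fast the value grows. A minimal sketch of that arithmetic follows. It assumes each stuck thread adds about one second of unfinished work per second of wall-clock time; the function name and sample values are hypothetical.

```python
def estimate_stuck_threads(
    earlier: tuple[float, float], later: tuple[float, float]
) -> float:
    """Estimate stuck threads from two (timestamp_seconds, value) samples.

    The estimate is the slope of the metric between the two samples.
    """
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


# The value grew by 120 seconds over a 60 second window.
print(estimate_stuck_threads((0.0, 120.0), (60.0, 240.0)))  # 2.0
```

A slope close to zero means no work item has been in progress for long. A slope near an integer N suggests roughly N workers that never finish.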
IPAM
~~~~
======================================== ============================================ ========== ========================================================
Name Labels Default Description
======================================== ============================================ ========== ========================================================
``ipam_capacity`` ``family`` Enabled Total number of IPs in the IPAM pool labeled by family
``ipam_events_total`` Enabled Number of IPAM events received labeled by action and datapath family type
``ip_addresses`` ``family`` Enabled Number of allocated IP addresses
======================================== ============================================ ========== ========================================================
KVstore
~~~~~~~
======================================== ============================================ ========== ========================================================
Name Labels Default Description
======================================== ============================================ ========== ========================================================
``kvstore_operations_duration_seconds`` ``action``, ``kind``, ``outcome``, ``scope`` Enabled Duration of kvstore operation
``kvstore_events_queue_seconds`` ``action``, ``scope`` Enabled Seconds waited before a received event was queued
``kvstore_quorum_errors_total`` ``error`` Enabled Number of quorum errors
``kvstore_sync_errors_total`` ``scope``, ``source_cluster`` Enabled Number of times synchronization to the kvstore failed
``kvstore_sync_queue_size`` ``scope``, ``source_cluster`` Enabled Number of elements queued for synchronization in the kvstore
``kvstore_initial_sync_completed`` ``scope``, ``source_cluster``, ``action`` Enabled Whether the initial synchronization from/to the kvstore has completed
======================================== ============================================ ========== ========================================================
Agent
~~~~~
================================ ================================ ========== ========================================================
Name Labels Default Description
================================ ================================ ========== ========================================================
``agent_bootstrap_seconds`` ``scope``, ``outcome`` Enabled Deprecated, will be removed in Cilium 1.20 - use ``cilium_hive_jobs_oneshot_last_run_duration_seconds`` of respective job instead. Duration of various bootstrap phases
``api_process_time_seconds`` Enabled Processing time of all the API calls made to the cilium-agent, labeled by API method, API path and returned HTTP code.
================================ ================================ ========== ========================================================
FQDN
~~~~
================================== ================================ ============ ========================================================
Name Labels Default Description
================================== ================================ ============ ========================================================
``fqdn_gc_deletions_total`` Enabled Number of FQDNs that have been cleaned on FQDN garbage collector job
``fqdn_active_names`` ``endpoint`` Disabled Number of domains inside the DNS cache that have not expired (by TTL), per endpoint
``fqdn_active_ips`` ``endpoint`` Disabled Number of IPs inside the DNS cache associated with a domain that has not expired (by TTL), per endpoint
``fqdn_alive_zombie_connections`` ``endpoint`` Disabled Number of IPs associated with domains that have expired (by TTL) yet still associated with an active connection (aka zombie), per endpoint
``fqdn_selectors`` Enabled Number of registered ToFQDN selectors
================================== ================================ ============ ========================================================
Jobs
~~~~
=================================================== ================================ ============ ========================================================
Name Labels Default Description
=================================================== ================================ ============ ========================================================
``hive_jobs_runs_total`` ``module``, ``job_name`` Enabled Total number of jobs runs
``hive_jobs_runs_failed`` ``module``, ``job_name`` Enabled Number of jobs runs that returned an error
``hive_jobs_oneshot_last_run_duration_seconds`` ``module``, ``job_name`` Enabled Duration of last one shot job run
``hive_jobs_observer_last_run_duration_seconds`` ``module``, ``job_name`` Enabled Duration of last observer job run
``hive_jobs_observer_run_duration_seconds`` ``module``, ``job_name`` Enabled Histogram of observer job run duration
``hive_jobs_timer_last_run_duration_seconds`` ``module``, ``job_name`` Enabled Duration of last timer job run
``hive_jobs_timer_run_duration_seconds`` ``module``, ``job_name`` Enabled Histogram of timer job run duration
=================================================== ================================ ============ ========================================================
CIDRGroups
~~~~~~~~~~
============================================ ====== ======== ========================================================
Name                                         Labels Default  Description
============================================ ====== ======== ========================================================
``cidrgroups_referenced``                           Enabled  Number of CNPs and CCNPs referencing at least one CiliumCIDRGroup. CNPs with empty or non-existing CIDRGroupRefs are not considered
``cidrgroup_translation_time_stats_seconds``        Disabled CIDRGroup translation time stats
============================================ ====== ======== ========================================================
.. _metrics_api_rate_limiting:
API Rate Limiting
~~~~~~~~~~~~~~~~~
============================================== ========================================== ========== ========================================================
Name Labels Default Description
============================================== ========================================== ========== ========================================================
``api_limiter_adjustment_factor`` ``api_call`` Enabled Most recent adjustment factor for automatic adjustment
``api_limiter_processed_requests_total`` ``api_call``, ``outcome``, ``return_code`` Enabled Total number of API requests processed
``api_limiter_processing_duration_seconds`` ``api_call``, ``value`` Enabled Mean and estimated processing duration in seconds
``api_limiter_rate_limit`` ``api_call``, ``value`` Enabled Current rate limiting configuration (limit and burst)
``api_limiter_requests_in_flight`` ``api_call``, ``value`` Enabled Current and maximum allowed number of requests in flight
``api_limiter_wait_duration_seconds`` ``api_call``, ``value`` Enabled Mean, min, and max wait duration
``api_limiter_wait_history_duration_seconds`` ``api_call`` Disabled Histogram of wait duration per API call processed
============================================== ========================================== ========== ========================================================
.. _metrics_bgp_control_plane:
BGP Control Plane
~~~~~~~~~~~~~~~~~
================================== =============================================================== ======== ===================================================================
Name Labels Default Description
================================== =============================================================== ======== ===================================================================
``session_state`` ``vrouter``, ``neighbor``, ``neighbor_asn`` Enabled Current state of the BGP session with the peer, Up = 1 or Down = 0
``advertised_routes`` ``vrouter``, ``neighbor``, ``neighbor_asn``, ``afi``, ``safi`` Enabled Number of routes advertised to the peer
``received_routes`` ``vrouter``, ``neighbor``, ``neighbor_asn``, ``afi``, ``safi`` Enabled Number of routes received from the peer
``reconcile_errors_total`` ``vrouter`` Enabled Number of reconciliation runs that returned an error
``reconcile_run_duration_seconds`` ``vrouter`` Enabled Histogram of reconciliation run duration
================================== =============================================================== ======== ===================================================================
All metrics are enabled only when the BGP Control Plane is enabled.
cilium-operator
---------------
Configuration
^^^^^^^^^^^^^
``cilium-operator`` can be configured to serve metrics by running with the
option ``--enable-metrics``. By default, the operator exposes metrics on
port 9963. The port can be changed with the option
``--operator-prometheus-serve-addr``.
Feature Metrics
~~~~~~~~~~~~~~~
Cilium Operator Feature Metrics are exported under the
``cilium_operator_feature`` Prometheus namespace.
The tables below group feature metrics into the following categories:

- **Advanced Connectivity and Load Balancing** (:ref:`cilium-operator-feature-adv-connect-and-lb`)

  This category includes features related to advanced networking and load
  balancing capabilities, such as Gateway API, Ingress Controller, LB IPAM,
  Node IPAM and L7 Aware Traffic Management.

For example, to check if the Gateway API is enabled on a Cilium operator,
observe the metric ``cilium_operator_feature_adv_connect_and_lb_gateway_api_enabled``.

All metrics follow the format ``cilium_operator_feature`` + group name + metric name.
A value of ``0`` indicates that the feature is disabled, while ``1`` indicates it
is enabled.
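
The naming format can be sketched as a small helper. This is an illustration only: the function names are hypothetical, and the group name ``adv_connect_and_lb`` and the underscore separator are inferred from the example metric above.

```python
PREFIX = "cilium_operator_feature"


def feature_metric_name(group: str, metric: str) -> str:
    """Join prefix, group name and metric name with underscores."""
    return "_".join([PREFIX, group, metric])


def feature_enabled(value: float) -> bool:
    """Interpret a feature gauge: 1 means enabled, 0 means disabled."""
    return value == 1


print(feature_metric_name("adv_connect_and_lb", "gateway_api_enabled"))
# cilium_operator_feature_adv_connect_and_lb_gateway_api_enabled
```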
.. note::

   For metrics of type "counter," the operator has processed the associated object
   (e.g., a network policy) but might not be actively enforcing it. These
   metrics show whether the object has been received and processed, not
   whether it is enforced by the operator.
.. include:: feature-metrics-operator.txt
Exported Metrics
^^^^^^^^^^^^^^^^
All metrics are exported under the ``cilium_operator_`` Prometheus namespace.
.. _metrics_bgp_control_plane_operator:
BGP Control Plane Operator
~~~~~~~~~~~~~~~~~~~~~~~~~~

================================== ==================================== ======= =========================================================
Name                               Labels                               Default Description
================================== ==================================== ======= =========================================================
``reconcile_errors_total``         ``resource_kind``, ``resource_name`` Enabled Number of errors returned per BGP resource reconciliation
``reconcile_run_duration_seconds``                                      Enabled Histogram of reconciliation run duration
================================== ==================================== ======= =========================================================
All metrics are enabled only when the BGP Control Plane is enabled.
.. _ipam_metrics:
IPAM
~~~~

.. note::

   All IPAM metrics are ``Enabled`` only when using the AWS, Alibabacloud or Azure IPAM plugins.
======================================== ================================================================= ========== ========================================================
Name Labels Default Description
======================================== ================================================================= ========== ========================================================
``ipam_ips`` ``type`` Enabled Number of IPs allocated
``ipam_ip_allocation_ops`` ``subnet_id`` Enabled Number of IP allocation operations.
``ipam_ip_release_ops`` ``subnet_id`` Enabled Number of IP release operations.
``ipam_interface_creation_ops`` ``subnet_id`` Enabled Number of interfaces creation operations.
``ipam_release_duration_seconds`` ``type``, ``status``, ``subnet_id`` Enabled Release ip or interface latency in seconds
``ipam_allocation_duration_seconds`` ``type``, ``status``, ``subnet_id`` Enabled Allocation ip or interface latency in seconds
``ipam_available_interfaces`` Enabled Number of interfaces with addresses available
``ipam_nodes`` ``category`` Enabled Number of nodes by category { total | in-deficit | at-capacity }
``ipam_resync_total`` Enabled Number of synchronization operations with external IPAM API
``ipam_api_duration_seconds`` ``operation``, ``response_code`` Enabled Duration of interactions with external IPAM API.
``ipam_api_rate_limit_duration_seconds`` ``operation`` Enabled Duration of rate limiting while accessing external IPAM API
``ipam_available_ips`` ``target_node`` Enabled Number of available IPs on a node (taking into account plugin specific NIC/Address limits).
``ipam_used_ips`` ``target_node`` Enabled Number of currently used IPs on a node.
``ipam_needed_ips`` ``target_node`` Enabled Number of IPs needed to satisfy allocation on a node.
======================================== ================================================================= ========== ========================================================
LB-IPAM
~~~~~~~

=============================== ======== ======= ==================================================
Name                            Labels   Default Description
=============================== ======== ======= ==================================================
``lbipam_conflicting_pools``             Enabled Number of conflicting pools
``lbipam_ips_available``        ``pool`` Enabled Number of available IPs per pool
``lbipam_ips_used``             ``pool`` Enabled Number of used IPs per pool
``lbipam_services_matching``             Enabled Number of matching services
``lbipam_services_unsatisfied``          Enabled Number of services which did not get requested IPs
=============================== ======== ======= ==================================================
Controllers
~~~~~~~~~~~
======================================== ================================================== ========== ========================================================
Name Labels Default Description
======================================== ================================================== ========== ========================================================
``controllers_group_runs_total`` ``status``, ``group_name`` Enabled Number of times that a controller process was run, labeled by controller group name
======================================== ================================================== ========== ========================================================
The ``controllers_group_runs_total`` metric reports the success and failure
count of each controller within the system, labeled by controller group name
and completion status. Because of the large number of controllers, this metric
is enabled per controller group. The groups are selected with an allow-list
passed as the ``controller-group-metrics`` configuration flag or the
``prometheus.controllerGroupMetrics`` Helm value. The current recommended
default set of group names can be found in the values file of the Cilium Helm
chart. The special names "all" and "none" are supported.
.. _ces_metrics:
CiliumEndpointSlices (CES)
~~~~~~~~~~~~~~~~~~~~~~~~~~

================================= ============================ ========================================================
Name                              Labels                       Description
================================= ============================ ========================================================
``number_of_ceps_per_ces``                                     The number of CEPs batched in a CES
``number_of_cep_changes_per_ces`` ``opcode``, ``failure_type`` The number of changed CEPs in each CES update
``ces_sync_total``                ``outcome``                  The number of completed CES syncs by outcome
``ces_queueing_delay_seconds``                                 CiliumEndpointSlice queueing delay in seconds
================================= ============================ ========================================================
Note that the CES controller has multiple internal queues for handling CES updates.
Detailed metrics emitted by these queues can be found in the
:ref:`Internal WorkQueues <internal_workqueues_metrics>` section below.
Unmanaged Pods
~~~~~~~~~~~~~~
============================================ ======= ========== ====================================================================
Name Labels Default Description
============================================ ======= ========== ====================================================================
``unmanaged_pods`` Enabled The total number of pods observed to be unmanaged by Cilium operator
============================================ ======= ========== ====================================================================
"Double Write" Identity Allocation Mode
When the ":ref:`Double Write <double_write_migration>`" identity allocation mode is
enabled, the following metrics are available:
======================================= ====== ======= ============================================================
Name                                    Labels Default Description
======================================= ====== ======= ============================================================
``doublewrite_crd_identities``                 Enabled The total number of CRD identities
``doublewrite_kvstore_identities``             Enabled The total number of identities in the KVStore
``doublewrite_crd_only_identities``            Enabled The number of CRD identities not present in the KVStore
``doublewrite_kvstore_only_identities``        Enabled The number of identities in the KVStore not present as a CRD
======================================= ====== ======= ============================================================
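
The relationship between the four gauges can be read as set arithmetic over the two identity stores. The sketch below illustrates that reading. It is not the operator's code, and the function name and sample identity IDs are hypothetical.

```python
def doublewrite_gauges(crd_ids: set[int], kvstore_ids: set[int]) -> dict[str, int]:
    """Compute the four gauge values from the identity IDs in each store."""
    return {
        "doublewrite_crd_identities": len(crd_ids),
        "doublewrite_kvstore_identities": len(kvstore_ids),
        # Present as a CRD but missing from the KVStore.
        "doublewrite_crd_only_identities": len(crd_ids - kvstore_ids),
        # Present in the KVStore but missing as a CRD.
        "doublewrite_kvstore_only_identities": len(kvstore_ids - crd_ids),
    }


print(doublewrite_gauges({1, 2, 3}, {2, 3, 4, 5}))
```

Under this reading, both "only" gauges at zero means the two stores hold the same set of identities.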
.. _identity_management_metrics:
Identity Management Mode
=========================================== =========================== =====================================================================================
Name Labels Description
=========================================== =========================== =====================================================================================
``cid_controller_work_queue_event_count`` ``resource``, ``outcome`` Counts processed events by CID controller work queues
``cid_controller_work_queue_latency`` ``resource``, ``phase`` Duration of CID controller work queues enqueuing and processing latencies in seconds
=========================================== =========================== =====================================================================================
.. _internal_workqueues_metrics:
Internal WorkQueues
~~~~~~~~~~~~~~~~~~~~
The Operator uses internal queues to manage the processing of various tasks.
Currently, only the Cilium Node Synchronizer queues and the Cilium EndpointSlice
Controller queues report the metrics listed below.
==================================================== ============================================= ========== ===========================================================
Name Labels Default Description
==================================================== ============================================= ========== ===========================================================
``workqueue_depth`` ``queue_name`` Enabled Current depth of workqueue
``workqueue_adds_total`` ``queue_name`` Enabled Total number of adds handled by workqueue
``workqueue_queue_duration_seconds`` ``queue_name`` Enabled Duration in seconds an item stays in workqueue prior to request
``workqueue_work_duration_seconds`` ``queue_name`` Enabled Duration in seconds to process an item from workqueue
``workqueue_unfinished_work_seconds`` ``queue_name`` Enabled Duration in seconds of work in progress that hasn't been observed by work_duration. Large values indicate stuck threads. You can deduce the number of stuck threads by observing the rate at which this value increases.
``workqueue_longest_running_processor_seconds`` ``queue_name`` Enabled Duration in seconds of the longest running processor for workqueue
``workqueue_retries_total`` ``queue_name`` Enabled Total number of retries handled by workqueue
==================================================== ============================================= ========== ===========================================================
MCS-API
~~~~~~~
========================================= ======================================================================== ========== =========================================================
Name Labels Default Description
========================================= ======================================================================== ========== =========================================================
``mcsapi_serviceexport_info`` ``serviceexport``, ``namespace`` Enabled Information about ServiceExport in the local cluster
``mcsapi_serviceexport_status_condition`` ``serviceexport``, ``namespace``, ``condition``, ``status``, ``reason`` Enabled Status Condition of ServiceExport in the local cluster
``mcsapi_serviceimport_info`` ``serviceimport``, ``namespace`` Enabled Information about ServiceImport in the local cluster
``mcsapi_serviceimport_status_condition`` ``serviceimport``, ``namespace``, ``condition``, ``status``, ``reason`` Enabled Status Condition of ServiceImport in the local cluster
``mcsapi_serviceimport_status_clusters`` ``serviceimport``, ``namespace`` Enabled The number of clusters currently backing a ServiceImport
========================================= ======================================================================== ========== =========================================================
Clustermesh
~~~~~~~~~~~
================================================= ================== ========== ====================================================================
Name Labels Default Description
================================================= ================== ========== ====================================================================
``clustermesh_remote_clusters`` Enabled The total number of remote clusters meshed with the local cluster
``clustermesh_remote_cluster_failures`` ``target_cluster`` Enabled The total number of failures related to the remote cluster
``clustermesh_remote_cluster_last_failure_ts`` ``target_cluster`` Enabled The timestamp of the last failure of the remote cluster
``clustermesh_remote_cluster_readiness_status`` ``target_cluster`` Enabled The readiness status of the remote cluster
``clustermesh_remote_cluster_cache_revocations`` ``target_cluster`` Enabled The total number of cache revocations related to the remote cluster
``clustermesh_remote_cluster_services`` ``target_cluster`` Enabled The total number of services per remote cluster
``clustermesh_remote_cluster_service_exports`` ``target_cluster`` Enabled The total number of MCS-API service exports per remote cluster
================================================= ================== ========== ====================================================================
Hubble
------
Configuration
^^^^^^^^^^^^^
Hubble metrics are served by a Hubble instance running inside ``cilium-agent``.
The command-line options to configure them are ``--enable-hubble``,
``--hubble-metrics-server``, and ``--hubble-metrics``.
``--hubble-metrics-server`` takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9965``) will bind the server to all available
interfaces. ``--hubble-metrics`` takes a space-separated list of metrics.
It's also possible to configure Hubble metrics to listen with TLS and
optionally use mTLS for authentication. For details see :ref:`hubble_configure_metrics_tls`.
Some metrics can take additional semicolon-separated options per metric, e.g.
``--hubble-metrics="dns:query;ignoreAAAA http:destinationContext=workload-name"``
will enable the ``dns`` metric with the ``query`` and ``ignoreAAAA`` options,
and the ``http`` metric with the ``destinationContext=workload-name`` option.
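
The option syntax can be made concrete with a small parser. This is a sketch of the format as described above, not Hubble's own parsing code, and the function name ``parse_hubble_metrics`` is hypothetical.

```python
def parse_hubble_metrics(value: str) -> dict[str, dict[str, str]]:
    """Parse a --hubble-metrics style string into {metric: {option: value}}.

    Metrics are separated by spaces, a metric is separated from its options
    by ":", and options are separated by ";". Options without "=" map to "".
    """
    parsed: dict[str, dict[str, str]] = {}
    for entry in value.split():
        name, _, raw_options = entry.partition(":")
        options: dict[str, str] = {}
        for option in filter(None, raw_options.split(";")):
            key, _, val = option.partition("=")
            options[key] = val
        parsed[name] = options
    return parsed


config = parse_hubble_metrics(
    "dns:query;ignoreAAAA http:destinationContext=workload-name"
)
print(config["dns"])   # {'query': '', 'ignoreAAAA': ''}
print(config["http"])  # {'destinationContext': 'workload-name'}
```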
.. _hubble_context_options:
Context Options
^^^^^^^^^^^^^^^
Hubble metrics support configuration via context options.
Supported context options for all metrics:
- ``sourceContext`` - Configures the ``source`` label on metrics for both egress and ingress traffic.
- ``sourceEgressContext`` - Configures the ``source`` label on metrics for egress traffic (takes precedence over ``sourceContext``).
- ``sourceIngressContext`` - Configures the ``source`` label on metrics for ingress traffic (takes precedence over ``sourceContext``).
- ``destinationContext`` - Configures the ``destination`` label on metrics for both egress and ingress traffic.
- ``destinationEgressContext`` - Configures the ``destination`` label on metrics for egress traffic (takes precedence over ``destinationContext``).
- ``destinationIngressContext`` - Configures the ``destination`` label on metrics for ingress traffic (takes precedence over ``destinationContext``).
- ``labelsContext`` - Configures a list of labels to be enabled on metrics.
There are also some context options that are specific to certain metrics.
See the documentation for the individual metrics to see what options are available for each.
See below for details on each of the different context options.
Most Hubble metrics can be configured to add the source and/or destination
context as a label using the ``sourceContext`` and ``destinationContext``
options. The possible values are:
===================== ===================================================================================
Option Value Description
===================== ===================================================================================
``identity`` All Cilium security identity labels
``namespace`` Kubernetes namespace name
``pod`` Kubernetes pod name and namespace name in the form of ``namespace/pod``.
``pod-name`` Kubernetes pod name.
``dns`` All known DNS names of the source or destination (comma-separated)
``ip`` The IPv4 or IPv6 address
``reserved-identity`` Reserved identity label.
``workload`` Kubernetes pod's workload name and namespace in the form of ``namespace/workload-name``.
``workload-name`` Kubernetes pod's workload name (workloads are: Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift), etc).
``app`` Kubernetes pod's app name, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
===================== ===================================================================================
When specifying the source and/or destination context, multiple contexts can be
specified by separating them via the ``|`` symbol.
When multiple are specified, then the first non-empty value is added to the
metric as a label. For example, a metric configuration of
``flow:destinationContext=dns|ip`` will first try to use the DNS name of the
target for the label. If no DNS name is known for the target, it will fall back
and use the IP address of the target instead.
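
The fallback rule can be sketched as follows. The function name and the ``known_values`` mapping are hypothetical stand-ins for whatever Hubble knows about a flow endpoint, and the sample address is illustrative.

```python
def resolve_context_label(context_spec: str, known_values: dict[str, str]) -> str:
    """Return the first non-empty value among the "|"-separated contexts."""
    for context in context_spec.split("|"):
        value = known_values.get(context, "")
        if value:
            return value
    return ""


# No DNS name is known, so the IP address is used instead.
print(resolve_context_label("dns|ip", {"dns": "", "ip": "10.0.0.7"}))
# A DNS name is known, so it wins because it is listed first.
print(resolve_context_label("dns|ip", {"dns": "example.com", "ip": "10.0.0.7"}))
```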
.. note::

   There are 3 cases in which the identity label list contains multiple reserved labels:

   1. ``reserved:kube-apiserver`` and ``reserved:host``
   2. ``reserved:kube-apiserver`` and ``reserved:remote-node``
   3. ``reserved:kube-apiserver`` and ``reserved:world``

   In all 3 cases, the ``reserved-identity`` context returns ``reserved:kube-apiserver``.
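
The rule for multiple reserved labels can be sketched as below. Only the preference for ``reserved:kube-apiserver`` comes from the note above. The function name is hypothetical, and returning the first reserved label in the remaining cases is an assumption of this sketch.

```python
def reserved_identity_context(identity_labels: list[str]) -> str:
    """Pick the label value for the reserved-identity context.

    reserved:kube-apiserver wins whenever it is present. Falling back to
    the first reserved label otherwise is an assumption, not documented.
    """
    reserved = [label for label in identity_labels if label.startswith("reserved:")]
    if "reserved:kube-apiserver" in reserved:
        return "reserved:kube-apiserver"
    return reserved[0] if reserved else ""


print(reserved_identity_context(["reserved:host", "reserved:kube-apiserver"]))
```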
Hubble metrics can also be configured with a ``labelsContext``, which provides a list of labels
to add to the metric. Unlike ``sourceContext`` and ``destinationContext``, which put one of several
possible values into a single metric label, ``labelsContext`` adds each listed item as its own label.
============================== ===============================================================================
Option Value Description
============================== ===============================================================================
``source_ip`` The source IP of the flow.
``source_namespace`` The namespace of the pod if the flow source is from a Kubernetes pod.
``source_pod`` The pod name if the flow source is from a Kubernetes pod.
``source_workload`` The name of the source pod's workload (Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift)).
``source_workload_kind`` The kind of the source pod's workload, for example, Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift).
``source_app`` The app name of the source pod, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
``destination_ip`` The destination IP of the flow.
``destination_namespace`` The namespace of the pod if the flow destination is from a Kubernetes pod.
``destination_pod`` The pod name if the flow destination is from a Kubernetes pod.
``destination_workload`` The name of the destination pod's workload (Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift)).
``destination_workload_kind`` The kind of the destination pod's workload, for example, Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift).
``destination_app`` The app name of the destination pod, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
``traffic_direction`` Identifies the traffic direction of the flow. Possible values are ``ingress``, ``egress`` and ``unknown``.
============================== ===============================================================================
When specifying the flow context, multiple values can be specified by separating them via the ``,`` symbol.
All labels listed are included in the metric, even if empty. For example, a metric configuration of
``http:labelsContext=source_namespace,source_pod`` will add the ``source_namespace`` and ``source_pod``
labels to all Hubble HTTP metrics.
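As a rough illustration of how such a configuration string breaks down, the following Python sketch splits it into the metric name and its options. The ``parse_metric_config`` helper is hypothetical and not part of Hubble. It assumes that options follow the metric name after a ``:`` and that several options are separated by ``;``.

.. code-block:: python

    def parse_metric_config(config: str) -> tuple[str, dict[str, list[str]]]:
        """Split "name:key=v1,v2;flag" into the metric name and its options."""
        name, _, raw_options = config.partition(":")
        options: dict[str, list[str]] = {}
        for item in filter(None, raw_options.split(";")):
            key, _, value = item.partition("=")
            # Multiple values are separated by ","; a bare flag has no values.
            options[key] = value.split(",") if value else []
        return name, options


    name, options = parse_metric_config("http:labelsContext=source_namespace,source_pod")
    print(name, options["labelsContext"])

Each entry in the resulting ``labelsContext`` list corresponds to one label added to the metric.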
.. note::
To limit metrics cardinality, Hubble removes data series bound to a specific pod one minute after the pod is deleted.
A metric is considered to be bound to a specific pod when at least one of the following conditions is met:
* ``sourceContext`` is set to ``pod`` and metric series has ``source`` label matching ``<pod_namespace>/<pod_name>``
* ``destinationContext`` is set to ``pod`` and metric series has ``destination`` label matching ``<pod_namespace>/<pod_name>``
* ``labelsContext`` contains both ``source_namespace`` and ``source_pod`` and metric series labels match namespace and name of deleted pod
* ``labelsContext`` contains both ``destination_namespace`` and ``destination_pod`` and metric series labels match namespace and name of deleted pod
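The four conditions can be read as a single predicate over the labels of a series. The Python sketch below is only an illustration: ``is_bound_to_pod`` is a hypothetical helper, and it assumes that the ``source`` and ``destination`` labels hold ``<pod_namespace>/<pod_name>`` only when the corresponding context is set to ``pod``.

.. code-block:: python

    def is_bound_to_pod(series: dict[str, str], namespace: str, pod: str) -> bool:
        """Check a series' labels against the four conditions for a deleted pod."""
        pod_ref = f"{namespace}/{pod}"
        return (
            # sourceContext or destinationContext set to pod
            series.get("source") == pod_ref
            or series.get("destination") == pod_ref
            # labelsContext containing both the namespace and the pod label
            or (series.get("source_namespace") == namespace
                and series.get("source_pod") == pod)
            or (series.get("destination_namespace") == namespace
                and series.get("destination_pod") == pod)
        )


    print(is_bound_to_pod({"source": "default/web-0"}, "default", "web-0"))

A series that carries only ``source_pod`` without ``source_namespace`` does not match, because the pod name alone is ambiguous across namespaces.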
.. _hubble_exported_metrics:
Exported Metrics
^^^^^^^^^^^^^^^^
Hubble metrics are exported under the ``hubble_`` Prometheus namespace.
lost events
~~~~~~~~~~~
This metric, unlike other ones, is not directly tied to network flows. It's enabled if any of the other metrics is enabled.
================================ ======================================== ========== ==================================================
Name Labels Default Description
================================ ======================================== ========== ==================================================
``lost_events_total`` ``source`` Enabled Number of lost events
================================ ======================================== ========== ==================================================
Labels
""""""
- ``source`` identifies the source of lost events, one of:
- ``perf_event_ring_buffer``
- ``observer_events_queue``
- ``hubble_ring_buffer``
``dns``
~~~~~~~
================================ ======================================== ========== ===================================
Name Labels Default Description
================================ ======================================== ========== ===================================
``dns_queries_total`` ``rcode``, ``qtypes``, ``ips_returned`` Disabled Number of DNS queries observed
``dns_responses_total`` ``rcode``, ``qtypes``, ``ips_returned`` Disabled Number of DNS responses observed
``dns_response_types_total`` ``type``, ``qtypes`` Disabled Number of DNS response types
================================ ======================================== ========== ===================================
Options
"""""""
============== ============= ====================================================================================
Option Key Option Value Description
============== ============= ====================================================================================
``query`` N/A Include the query as label "query"
``ignoreAAAA`` N/A Ignore any AAAA requests/responses
============== ============= ====================================================================================
This metric supports :ref:`Context Options<hubble_context_options>`.
``drop``
~~~~~~~~
================================ ======================================== ========== ===================================
Name Labels Default Description
================================ ======================================== ========== ===================================
``drop_total`` ``reason``, ``protocol`` Disabled Number of drops
================================ ======================================== ========== ===================================
Options
"""""""
This metric supports :ref:`Context Options<hubble_context_options>`.
``flow``
~~~~~~~~
================================ ======================================== ========== ===================================
Name Labels Default Description
================================ ======================================== ========== ===================================
``flows_processed_total`` ``type``, ``subtype``, ``verdict`` Disabled Total number of flows processed
================================ ======================================== ========== ===================================
Options
"""""""
This metric supports :ref:`Context Options<hubble_context_options>`.
``flows-to-world``
~~~~~~~~~~~~~~~~~~
This metric counts all non-reply flows containing the ``reserved:world`` label in their
destination identity. By default, dropped flows are counted if and only if the drop reason
is ``Policy denied``. Set the ``any-drop`` option to count all dropped flows.
================================ ======================================== ========== ============================================
Name Labels Default Description
================================ ======================================== ========== ============================================
``flows_to_world_total`` ``protocol``, ``verdict`` Disabled Total number of flows to ``reserved:world``.
================================ ======================================== ========== ============================================
Options
"""""""
============== ============= ======================================================
Option Key Option Value Description
============== ============= ======================================================
``any-drop`` N/A Count any dropped flows regardless of the drop reason.
``port`` N/A Include the destination port as label ``port``.
``syn-only`` N/A Only count non-reply SYNs for TCP flows.
============== ============= ======================================================
This metric supports :ref:`Context Options<hubble_context_options>`.
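The counting rules for ``flows_to_world_total`` can be sketched as a small predicate. This is one reading of the description above, not Hubble's implementation: the flow is modeled as a plain dictionary, and the key names and the ``DROPPED`` verdict string are assumptions made for the example.

.. code-block:: python

    def counted_as_flow_to_world(flow: dict, any_drop: bool = False,
                                 syn_only: bool = False) -> bool:
        """Apply the counting rules described for flows_to_world_total."""
        # Only non-reply flows whose destination carries reserved:world qualify.
        if flow.get("is_reply") or "reserved:world" not in flow.get("destination_labels", ()):
            return False
        if flow.get("verdict") == "DROPPED" and not any_drop:
            # Without any-drop, only policy drops are counted.
            if flow.get("drop_reason") != "Policy denied":
                return False
        # With syn-only, TCP flows are counted only when the SYN flag is set.
        if syn_only and flow.get("protocol") == "TCP" and not flow.get("syn", False):
            return False
        return True


    flow = {"is_reply": False, "destination_labels": ["reserved:world"],
            "verdict": "DROPPED", "drop_reason": "Policy denied", "protocol": "TCP"}
    print(counted_as_flow_to_world(flow))

In this sketch ``syn-only`` leaves non-TCP flows unaffected, which is an assumption rather than something the option description states.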
``http``
~~~~~~~~
Deprecated, use ``httpV2`` instead.
These metrics cannot be enabled at the same time as ``httpV2``.
================================= ======================================= ========== ==============================================
Name Labels Default Description
================================= ======================================= ========== ==============================================
``http_requests_total`` ``method``, ``protocol``, ``reporter`` Disabled Count of HTTP requests
``http_responses_total`` ``method``, ``status``, ``reporter`` Disabled Count of HTTP responses
``http_request_duration_seconds`` ``method``, ``reporter`` Disabled Histogram of HTTP request duration in seconds
================================= ======================================= ========== ==============================================
Labels
""""""
- ``method`` is the HTTP method of the request/response.
- ``protocol`` is the HTTP protocol of the request (for example, ``HTTP/1.1`` or ``HTTP/2``).
- ``status`` is the HTTP status code of the response.
- ``reporter`` identifies the origin of the request/response. It is set to ``client`` if it originated from the client, ``server`` if it originated from the server, or ``unknown`` if its origin is unknown.
Options
"""""""
This metric supports :ref:`Context Options<hubble_context_options>`.
``httpV2``
~~~~~~~~~~
``httpV2`` is an updated version of the existing ``http`` metrics.
These metrics cannot be enabled at the same time as ``http``.
The main difference is that ``http_requests_total`` and
``http_responses_total`` have been consolidated, and use the response flow
data.
Additionally, the source and destination labels of the
``http_request_duration_seconds`` metric are now from the perspective of the
request. In the ``http`` metrics, the source and destination were swapped
because the metric uses the response flow data, where they are reversed.
``httpV2`` accounts for this.
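To make the change of perspective concrete, the following Python sketch derives request-oriented labels from a response flow. The ``request_perspective`` helper and the dictionary layout are hypothetical and only illustrate the swap. They do not reflect how Hubble represents flows internally.

.. code-block:: python

    def request_perspective(response_flow: dict[str, str]) -> dict[str, str]:
        """Swap the endpoints of a response flow so labels describe the request."""
        return {
            "source": response_flow["destination"],
            "destination": response_flow["source"],
        }


    # A response travels from the server back to the client.
    response = {"source": "default/server", "destination": "default/client"}
    print(request_perspective(response))

With this swap, the ``source`` label identifies the client that issued the request, even though the measurement is taken from the response flow.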
================================= =================================================== ========== ==============================================
Name Labels Default Description
================================= =================================================== ========== ==============================================
``http_requests_total`` ``method``, ``protocol``, ``status``, ``reporter`` Disabled Count of HTTP requests
``http_request_duration_seconds`` ``method``, ``reporter`` Disabled Histogram of HTTP request duration in seconds
================================= =================================================== ========== ==============================================
Labels
""""""
- ``method`` is the HTTP method of the request/response.
- ``protocol`` is the HTTP protocol of the request (for example, ``HTTP/1.1`` or ``HTTP/2``).
- ``status`` is the HTTP status code of the response.
- ``reporter`` identifies the origin of the request/response. It is set to ``client`` if it originated from the client, ``server`` if it originated from the server, or ``unknown`` if its origin is unknown.
Options
"""""""
============== ============== =============================================================================================================
Option Key Option Value Description
============== ============== =============================================================================================================
``exemplars`` ``true`` Include extracted trace IDs in HTTP metrics. Requires :ref:`OpenMetrics to be enabled<hubble_open_metrics>`.
============== ============== =============================================================================================================
This metric supports :ref:`Context Options<hubble_context_options>`.
``icmp``
~~~~~~~~
================================ ======================================== ========== ===================================
Name Labels Default Description
================================ ======================================== ========== ===================================
``icmp_total`` ``family``, ``type`` Disabled Number of ICMP messages
================================ ======================================== ========== ===================================
Options
"""""""
This metric supports :ref:`Context Options<hubble_context_options>`.
``kafka``
~~~~~~~~~
=================================== ===================================================== ========== ==============================================
Name Labels Default Description
=================================== ===================================================== ========== ==============================================
``kafka_requests_total`` ``topic``, ``api_key``, ``error_code``, ``reporter`` Disabled Count of Kafka requests by topic
``kafka_request_duration_seconds`` ``topic``, ``api_key``, ``reporter`` Disabled Histogram of Kafka request duration by topic
=================================== ===================================================== ========== ==============================================
Options
"""""""
This metric supports :ref:`Context Options<hubble_context_options>`.
``port-distribution``
~~~~~~~~~~~~~~~~~~~~~
================================ ======================================== ========== ==================================================
Name Labels Default Description
================================ ======================================== ========== ==================================================
``port_distribution_total`` ``protocol``, ``port`` Disabled Number of packets distributed by destination port
================================ ======================================== ========== ==================================================
Options
"""""""
This metric supports :ref:`Context Options<hubble_context_options>`.
``tcp``
~~~~~~~
================================ ======================================== ========== ==================================================
Name Labels Default Description
================================ ======================================== ========== ==================================================
``tcp_flags_total`` ``flag``, ``family`` Disabled TCP flag occurrences
================================ ======================================== ========== ==================================================
Options
"""""""
This metric supports :ref:`Context Options<hubble_context_options>`.
``dynamic_exporter_exporters_total``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a dynamic Hubble exporter metric.
==================================== ======================================== ========== ==================================================
Name Labels Default Description
==================================== ======================================== ========== ==================================================
``dynamic_exporter_exporters_total`` ``status`` Enabled Number of configured hubble exporters
==================================== ======================================== ========== ==================================================
Labels
""""""
- ``status`` identifies status of exporters, can be one of:
  - ``active``
  - ``inactive``
``dynamic_exporter_up``
~~~~~~~~~~~~~~~~~~~~~~~
This is a dynamic Hubble exporter metric.
==================================== ======================================== ========== ==================================================
Name Labels Default Description
==================================== ======================================== ========== ==================================================
``dynamic_exporter_up`` ``name`` Enabled Status of exporter (1 - active, 0 - inactive)
==================================== ======================================== ========== ==================================================
Labels
""""""
- ``name`` identifies exporter name
``dynamic_exporter_reconfigurations_total``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a dynamic Hubble exporter metric.
=========================================== ======================================== ========== ==================================================
Name Labels Default Description
=========================================== ======================================== ========== ==================================================
``dynamic_exporter_reconfigurations_total`` ``op`` Enabled Number of dynamic exporters reconfigurations
=========================================== ======================================== ========== ==================================================
Labels
""""""
- ``op`` identifies reconfiguration operation type, can be one of:
  - ``add``
  - ``update``
  - ``remove``
``dynamic_exporter_config_hash``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a dynamic Hubble exporter metric.
==================================== ======================================== ========== ==================================================
Name Labels Default Description
==================================== ======================================== ========== ==================================================
``dynamic_exporter_config_hash`` Enabled Hash of last applied config
==================================== ======================================== ========== ==================================================
``dynamic_exporter_config_last_applied``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a dynamic Hubble exporter metric.
======================================== ======================================== ========== ==================================================
Name Labels Default Description
======================================== ======================================== ========== ==================================================
``dynamic_exporter_config_last_applied`` Enabled Timestamp of last applied config
======================================== ======================================== ========== ==================================================
.. _clustermesh_apiserver_metrics_reference:
clustermesh-apiserver
---------------------
Configuration
^^^^^^^^^^^^^
To expose any metrics, invoke ``clustermesh-apiserver`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9962``) binds the server to all available
interfaces (there is usually only one interface in a container).
Exported Metrics
^^^^^^^^^^^^^^^^
All metrics are exported under the ``cilium_clustermesh_apiserver_``
Prometheus namespace.
Bootstrap
~~~~~~~~~
======================================== ========================================================
Name Description
======================================== ========================================================
``bootstrap_seconds`` Duration in seconds to complete bootstrap
======================================== ========================================================
KVstore
~~~~~~~
======================================== ============================================ ========================================================
Name Labels Description
======================================== ============================================ ========================================================
``kvstore_operations_duration_seconds`` ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
``kvstore_events_queue_seconds`` ``action``, ``scope`` Seconds waited before a received event was queued
``kvstore_quorum_errors_total`` ``error`` Number of quorum errors
``kvstore_sync_errors_total`` ``scope``, ``source_cluster`` Number of times synchronization to the kvstore failed
``kvstore_sync_queue_size`` ``scope``, ``source_cluster`` Number of elements queued for synchronization in the kvstore
``kvstore_initial_sync_completed`` ``scope``, ``source_cluster``, ``action`` Whether the initial synchronization from/to the kvstore has completed
======================================== ============================================ ========================================================
API Rate Limiting
~~~~~~~~~~~~~~~~~
============================================== ========================================== ========================================================
Name Labels Description
============================================== ========================================== ========================================================
``api_limiter_processed_requests_total`` ``api_call``, ``outcome``, ``return_code`` Total number of API requests processed
``api_limiter_processing_duration_seconds`` ``api_call``, ``value`` Mean and estimated processing duration in seconds
``api_limiter_rate_limit`` ``api_call``, ``value`` Current rate limiting configuration (limit and burst)
``api_limiter_requests_in_flight`` ``api_call``, ``value`` Current and maximum allowed number of requests in flight
``api_limiter_wait_duration_seconds`` ``api_call``, ``value`` Mean, min, and max wait duration
============================================== ========================================== ========================================================
Controllers
~~~~~~~~~~~
======================================== ================================================== ========== ========================================================
Name Labels Default Description
======================================== ================================================== ========== ========================================================
``controllers_group_runs_total`` ``status``, ``group_name`` Enabled Number of times that a controller process was run, labeled by controller group name
======================================== ================================================== ========== ========================================================
The ``controllers_group_runs_total`` metric reports the success
and failure count of each controller within the system, labeled by
controller group name and completion status. Enabling this metric is
on a per-controller basis. This is configured using an allow-list which
is passed as the ``controller-group-metrics`` configuration flag.
The current default set for ``clustermesh-apiserver`` found in the
Cilium Helm chart is the special name "all", which enables the metric
for all controller groups. The special name "none" is also supported.
.. _kvstoremesh_metrics_reference:
kvstoremesh
-----------
Configuration
^^^^^^^^^^^^^
To expose any metrics, invoke ``kvstoremesh`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9964``) binds the server to all available
interfaces (there is usually only one interface in a container).
Exported Metrics
^^^^^^^^^^^^^^^^
All metrics are exported under the ``cilium_kvstoremesh_`` Prometheus namespace.
Bootstrap
~~~~~~~~~
======================================== ========================================================
Name Description
======================================== ========================================================
``bootstrap_seconds`` Duration in seconds to complete bootstrap
======================================== ========================================================
KVStoremesh
~~~~~~~~~~~
================================= ======== ==========================
Name Labels Description
================================= ======== ==========================
``leader_election_master_status`` ``name`` The leader election status
================================= ======== ==========================
Clustermesh
~~~~~~~~~~~
Note that these metrics are not prefixed by ``clustermesh_``.
=============================================== ================== ====================================================================
Name Labels Description
=============================================== ================== ====================================================================
``remote_clusters`` The total number of remote clusters meshed with the local cluster
``remote_cluster_failures`` ``target_cluster`` The total number of failures related to the remote cluster
``remote_cluster_last_failure_ts`` ``target_cluster`` The timestamp of the last failure of the remote cluster
``remote_cluster_readiness_status`` ``target_cluster`` The readiness status of the remote cluster
``remote_cluster_cache_revocations`` ``target_cluster`` The total number of cache revocations related to the remote cluster
=============================================== ================== ====================================================================
KVstore
~~~~~~~
======================================== ============================================ ========================================================
Name Labels Description
======================================== ============================================ ========================================================
``kvstore_operations_duration_seconds`` ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
``kvstore_events_queue_seconds`` ``action``, ``scope`` Seconds waited before a received event was queued
``kvstore_quorum_errors_total`` ``error`` Number of quorum errors
``kvstore_sync_errors_total`` ``scope``, ``source_cluster`` Number of times synchronization to the kvstore failed
``kvstore_sync_queue_size`` ``scope``, ``source_cluster`` Number of elements queued for synchronization in the kvstore
``kvstore_initial_sync_completed`` ``scope``, ``source_cluster``, ``action`` Whether the initial synchronization from/to the kvstore has completed
======================================== ============================================ ========================================================
API Rate Limiting
~~~~~~~~~~~~~~~~~
============================================== ========================================== ========================================================
Name Labels Description
============================================== ========================================== ========================================================
``api_limiter_processed_requests_total`` ``api_call``, ``outcome``, ``return_code`` Total number of API requests processed
``api_limiter_processing_duration_seconds`` ``api_call``, ``value`` Mean and estimated processing duration in seconds
``api_limiter_rate_limit`` ``api_call``, ``value`` Current rate limiting configuration (limit and burst)
``api_limiter_requests_in_flight`` ``api_call``, ``value`` Current and maximum allowed number of requests in flight
``api_limiter_wait_duration_seconds`` ``api_call``, ``value`` Mean, min, and max wait duration
============================================== ========================================== ========================================================
Controllers
~~~~~~~~~~~
======================================== ================================================== ========== ========================================================
Name Labels Default Description
======================================== ================================================== ========== ========================================================
``controllers_group_runs_total`` ``status``, ``group_name`` Enabled Number of times that a controller process was run, labeled by controller group name
======================================== ================================================== ========== ========================================================
The ``controllers_group_runs_total`` metric reports the success
and failure count of each controller within the system, labeled by
controller group name and completion status. Enabling this metric is
on a per-controller basis. This is configured using an allow-list
which is passed as the ``controller-group-metrics`` configuration
flag. The current default set for ``kvstoremesh`` found in the
Cilium Helm chart is the special name "all", which enables the metric
for all controller groups. The special name "none" is also supported.
NAT
~~~
.. _nat_metrics:
======================================== ================================================== ========== ========================================================
Name Labels Default Description
======================================== ================================================== ========== ========================================================
``nat_endpoint_max_connection`` ``family`` Enabled Saturation of the most saturated distinct NAT mapped connection, in terms of egress-IP and remote endpoint address.
======================================== ================================================== ========== ========================================================
These metrics are for monitoring Cilium's NAT mapping functionality. NAT is used by features such as Egress Gateway and BPF masquerading.
The NAT map holds mappings for masqueraded connections. Connections held in the NAT table that are masqueraded with the
same egress-IP and go to the same remote endpoint IP and port each require a unique source port for the mapping.
This means that any Node masquerading connections to a distinct external endpoint is limited by the number of available ephemeral source ports.
Given a Node forwarding one or more such egress-IP and remote endpoint tuples, the ``nat_endpoint_max_connection`` metric reports the saturation of the most saturated tuple, as a percentage of the possible source ports.
This metric is especially useful when using the egress gateway feature where it's possible to overload a Node if many connections are all going to the same endpoint.
This metric should normally be fairly low.
A high number here may indicate that a Node is reaching its limit for connections to one or more external endpoints.
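The idea behind the metric can be sketched in a few lines of Python. This is not how Cilium computes the value. The ``max_endpoint_saturation`` helper is hypothetical, the connection tuples are made up, and the number of usable source ports is passed in as a parameter because the actual range depends on the node's configuration.

.. code-block:: python

    from collections import Counter


    def max_endpoint_saturation(connections, available_ports: int) -> float:
        """Return the highest share of source ports used by any single
        (egress IP, remote IP, remote port) tuple, as a value between 0 and 1."""
        per_tuple = Counter(connections)
        if not per_tuple:
            return 0.0
        # Each connection in a tuple consumes one unique source port.
        return max(per_tuple.values()) / available_ports


    # Hypothetical NAT entries: many connections to one endpoint, few to another.
    connections = [("10.0.0.1", "1.1.1.1", 443)] * 50 + [("10.0.0.1", "8.8.8.8", 53)] * 10
    print(max_endpoint_saturation(connections, available_ports=100))

Only the busiest tuple determines the result, so a single heavily used external endpoint can drive the value up while overall NAT usage stays modest.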
Local Redirect Policy (control plane)
.. _local_redirect_policy_metrics:
============================================= ======================================== ========== ==========================================================================================================================================
Name Labels Default Description
============================================= ======================================== ========== ==========================================================================================================================================
``controller_duration_seconds`` Enabled Histogram of processing times for local redirect policies
============================================= ======================================== ========== ==========================================================================================================================================