Cluster Autoscaler Monitoring

Introduction

Currently the options for monitoring Cluster Autoscaler are limited to logs, the status configmap and events. While useful for debugging, none of these options is particularly practical for monitoring Cluster Autoscaler behaviour over time. This document describes a set of metrics (in Prometheus format) that will be added to Cluster Autoscaler 0.6 to allow better monitoring of its operations.

Current state

Cluster Autoscaler already has a metrics endpoint providing some basic metrics. These include default process metrics (number of goroutines, GC duration, CPU and memory details, etc.) as well as some custom metrics related to the time taken by various parts of the Cluster Autoscaler main loop. Existing metrics will be renamed and refactored, and new metrics will be added.

Metrics

Cluster Autoscaler needs to monitor some aspects of cluster state as part of its normal operations. It can therefore provide metrics describing its own operations as well as some metrics related to the general state of the cluster.

All the metrics are prefixed with cluster_autoscaler_.
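
As a minimal illustration of this naming convention, the sketch below expands the short names used in the tables of this document into the names a scraper would see. The helper `full_metric_name` is hypothetical and exists only for this example; it is not part of Cluster Autoscaler.

```python
PREFIX = "cluster_autoscaler_"


def full_metric_name(short_name: str) -> str:
    """Return the exported metric name for a short name used in this document.

    Hypothetical helper: it only prepends the documented prefix and leaves
    names that already carry the prefix untouched.
    """
    if short_name.startswith(PREFIX):
        return short_name
    return PREFIX + short_name


print(full_metric_name("nodes_count"))  # cluster_autoscaler_nodes_count
```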

Cluster state

| Metric name | Metric type | Labels | Description |
|---|---|---|---|
| cluster_safe_to_autoscale | Gauge | | Whether or not cluster is healthy enough for autoscaling. 1 if it is, 0 otherwise. |
| nodes_count | Gauge | state=<node-state> | Number of nodes in cluster. |
| unschedulable_pods_count | Gauge | | Number of unschedulable ("Pending") pods in the cluster. |
| node_groups_count | Gauge | node_group_type=<node-group-type> | Number of node groups managed by CA. |
| max_nodes_count | Gauge | | Maximum number of nodes in all node groups. |
| cluster_cpu_current_cores | Gauge | | Current number of cores in the cluster, minus deleting nodes. |
| cpu_limits_cores | Gauge | direction=<minimum or maximum> | Minimum and maximum number of cores in the cluster. |
| cluster_memory_current_bytes | Gauge | | Current number of bytes of memory in the cluster, minus deleting nodes. |
| memory_limits_bytes | Gauge | direction=<minimum or maximum> | Minimum and maximum number of bytes of memory in cluster. |

  • cluster_safe_to_autoscale indicates whether the cluster is healthy enough for autoscaling. CA stops all operations if a significant number of nodes are unready (by default 33% as of CA 0.5.4).
  • nodes_count records the total number of nodes, labeled by node state. Possible states are ready, unready, notStarted.
  • node_groups_count records the number of currently managed node groups. It's useful when using dynamic configuration or Node Autoprovisioning. Types of node group are autoscaled (managed by CA but not created by NAP) and autoprovisioned (created by NAP and managed by CA).
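
To show how nodes_count might be consumed, the sketch below parses a few lines in the Prometheus text format and computes the share of unready nodes, which can then be compared with the 33% default mentioned above. The sample values are invented for illustration. Counting notStarted nodes in the denominator is an assumption of this example; this document does not specify how CA itself computes the ratio.

```python
import re

# Invented sample of scraped text; real values come from the metrics endpoint.
SAMPLE = '''\
cluster_autoscaler_nodes_count{state="ready"} 8
cluster_autoscaler_nodes_count{state="unready"} 3
cluster_autoscaler_nodes_count{state="notStarted"} 1
'''

LINE_RE = re.compile(
    r'^cluster_autoscaler_nodes_count\{state="([^"]+)"\}\s+(\S+)$'
)


def nodes_by_state(text):
    """Map each node state label to its gauge value."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] = float(match.group(2))
    return counts


def unready_fraction(counts):
    """Fraction of nodes in the unready state (assumption: all states count)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts.get("unready", 0.0) / total


fraction = unready_fraction(nodes_by_state(SAMPLE))
print(fraction)          # 0.25
print(fraction >= 0.33)  # False: below the documented default threshold
```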

Cluster Autoscaler execution

These metrics are refactored from currently existing metrics and track the execution of various parts of the Cluster Autoscaler loop.

| Metric name | Metric type | Labels | Description |
|---|---|---|---|
| last_activity | Gauge | activity=<autoscaler-activity> | Last time a certain part of CA logic executed. |
| function_duration_seconds | Histogram | function=<autoscaler-function> | Time taken by various parts of CA main loop. |

  • last_activity records the last time a certain part of Cluster Autoscaler logic executed, represented as a Unix timestamp. autoscaler-activity values are:

    • main - main loop iteration started.
    • autoscaling - current state of the cluster has been updated, started autoscaling logic.
    • scaleUp - autoscaler will check if scale up is necessary.
    • scaleDown - autoscaler will try to scale down some nodes.

    Not all of the above have to happen in every loop. For example, if CA adds a new node, it will completely skip the scale-down logic in that loop.

  • function_duration_seconds summarizes the time taken by different functions executed in the main CA goroutine. It uses the following set of values for autoscaler-function:

    • main - duration of the whole iteration of main loop.
    • updateClusterState - time used by CA to get node status from API server and update internal data structures.
    • scaleUp - time used to check if new nodes are necessary and add them.
    • findUnneeded - time required to find nodes that are candidates for removal.
    • scaleDown - time required to verify unneeded nodes are really unnecessary and remove them.
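
A Prometheus histogram exposes a cumulative `_sum` and `_count` series alongside its buckets, so the average duration of a function between two scrapes can be derived from those two series. The sketch below shows that arithmetic; the function name `mean_duration` and the sample numbers are assumptions made for this example.

```python
def mean_duration(sum_before, count_before, sum_after, count_after):
    """Average observed duration (seconds) between two scrapes.

    Takes the _sum and _count values of one histogram series, e.g.
    function_duration_seconds with function="scaleUp", at two points in time.
    Returns None when no new observations were recorded in between.
    """
    observations = count_after - count_before
    if observations <= 0:
        return None
    return (sum_after - sum_before) / observations


# Invented values: 10 new observations that took 3 seconds in total.
print(mean_duration(12.0, 40, 15.0, 50))
```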

New labels may be added to both last_activity and function_duration_seconds if we add more features or additional logic to Cluster Autoscaler.
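
One possible use of last_activity is a staleness check: compare each timestamp with the current time and flag activities that have not run recently. The sketch below is only an illustration. The function name, the 600-second threshold and the sample timestamps are assumptions, and the code iterates over whatever activity labels it is given so that labels added later are handled too.

```python
import time


def stale_activities(last_activity, now=None, max_age_seconds=600.0):
    """Return {activity: age_in_seconds} for activities older than the limit.

    last_activity maps an activity label (main, autoscaling, scaleUp,
    scaleDown, or any label added later) to a Unix timestamp.
    The 600 second default is an arbitrary value chosen for this example.
    """
    if now is None:
        now = time.time()
    return {
        activity: now - timestamp
        for activity, timestamp in last_activity.items()
        if now - timestamp > max_age_seconds
    }


# Invented timestamps for illustration.
samples = {"main": 1000.0, "autoscaling": 1000.0, "scaleUp": 990.0, "scaleDown": 100.0}
print(stale_activities(samples, now=1010.0))  # {'scaleDown': 910.0}
```

Because not every activity has to run in every loop, an old scaleDown timestamp on its own is not necessarily a problem, whereas an old main timestamp suggests the loop itself has stopped iterating.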

Cluster Autoscaler operations

These metrics describe the internal state of Cluster Autoscaler and the actions it takes.

| Metric name | Metric type | Labels | Description |
|---|---|---|---|
| errors_total | Counter | type=<error-type> | The number of CA loops failed due to an error. |
| scaled_up_nodes_total | Counter | | Number of nodes added by CA. |
| scaled_down_nodes_total | Counter | reason=<scale-down-reason> | Number of nodes removed by CA. |
| scaled_up_gpu_nodes_total | Counter | gpu_name=<gpu-name> | Number of GPU-enabled nodes added by CA. |
| scaled_down_gpu_nodes_total | Counter | reason=<scale-down-reason>, gpu_name=<gpu-name> | Number of GPU-enabled nodes removed by CA. |
| failed_scale_ups_total | Counter | reason=<failure-reason> | Number of times scale-up operation has failed. |
| evicted_pods_total | Counter | | Number of pods evicted by CA. |
| unneeded_nodes_count | Gauge | | Number of nodes currently considered unneeded by CA. |
| old_unregistered_nodes_removed_count | Counter | | Number of unregistered nodes removed by CA. |
| skipped_scale_events_count | Counter | direction=<scaling-direction>, reason=<skipped-scale-reason> | Number of times scaling has been skipped due to a resource limit being reached, or similar event. |

  • The errors_total counter increases every time the main CA loop encounters an error.
    • A growing errors_total count signifies an internal error in CA or a problem with the underlying infrastructure that prevents normal CA operation. Example errors include:
      • failed to get list of nodes or pods from API server,
      • failed to retrieve node group size from cloud provider,
      • failed to update node group size,
      • error in CA simulations,
      • error updating internal data structures.
    • Not every condition that causes CA to skip part of the main loop is an error in this context. For example, having pending pods after reaching the maximum cluster size causes the ScaleUp function to return an error, so CA skips the rest of the main loop. This is expected behaviour and should not be counted as an error.
    • Possible error types are:
      • cloudProviderError - failed to get node group info, set node group size, etc.
      • apiCallError - error related to a call to the k8s API server (e.g. get nodes, get pods).
      • internalError - error in any other part of CA logic.
  • scaled_up_nodes_total counts the number of nodes successfully added by CA. In this context a node is considered successfully added once the node group size has been updated (without waiting for the actual VM to spin up, run a kubelet, etc.).
  • failed_scale_ups_total counts the number of unsuccessful scale-up operations performed by CA. This includes both errors returned by the cloud provider and new nodes failing to boot up and register within the timeout. It does not include reaching the maximum cluster size (as CA doesn't attempt a scale-up at all in that case).
  • scaled_down_nodes_total counts the number of nodes removed by CA. Possible scale down reasons are empty, underutilized, unready.
  • scaled_up_gpu_nodes_total counts the number of GPU-enabled nodes successfully added by CA, similar to scaled_up_nodes_total. Additionally gpu_name specifies name of the GPU (e.g. nvidia-tesla-k80).
  • scaled_down_gpu_nodes_total counts the number of GPU-enabled nodes removed by CA. The scale-down reasons are the same as for scaled_down_nodes_total, and gpu_name has the same meaning as in scaled_up_gpu_nodes_total.
  • skipped_scale_events_count counts the number of times that the autoscaler has declined to scale a node group because a resource limit was reached or because of a similar internal event. Scale direction can be either up or down, and the reason explains why the scaling was skipped (e.g. CPULimitReached, MemoryLimitReached). This differs from failed scaling events in that the autoscaler chooses not to perform a scaling action.
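
Since these counters only ever grow while CA is running, what matters for monitoring is how much they increase between scrapes. The sketch below computes the increase per label value (for example per error type of errors_total) and treats a drop in value as a counter reset after a restart, which is similar in spirit to how Prometheus handles resets. The function names and sample values are assumptions made for this example.

```python
def counter_increase(previous, current):
    """Increase of a monotonic counter between two scrapes.

    If the value went down, the process is assumed to have restarted and
    begun counting from zero again, so the whole current value is new.
    """
    if current < previous:
        return current
    return current - previous


def increases_by_label(previous, current):
    """Per-label increase; labels missing from the earlier scrape start at 0."""
    return {
        label: counter_increase(previous.get(label, 0.0), value)
        for label, value in current.items()
    }


# Invented errors_total samples keyed by the "type" label.
earlier = {"cloudProviderError": 4, "apiCallError": 10}
later = {"cloudProviderError": 6, "apiCallError": 2, "internalError": 1}
print(increases_by_label(earlier, later))
# {'cloudProviderError': 2, 'apiCallError': 2, 'internalError': 1}
```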

Node Autoprovisioning operations

These metrics describe operations and state related to the Node Autoprovisioning feature.

| Metric name | Metric type | Labels | Description |
|---|---|---|---|
| nap_enabled | Gauge | | Whether or not Node Autoprovisioning is enabled. 1 if it is, 0 otherwise. |
| created_node_groups_total | Counter | | Number of node groups created by Node Autoprovisioning. |
| deleted_node_groups_total | Counter | | Number of node groups deleted by Node Autoprovisioning. |