docs/node_resource_handling.md
An aspect of Kubernetes clusters that is often overlooked is the resources that non-pod components require to run, such as:

- operating system components, e.g. sshd, udev
- kubernetes system daemons, e.g. kubelet, container runtime (e.g. containerd), node problem detector, journald

As you manage your cluster, it's important that you are cognisant of these components, because if your critical non-pod components don't have enough resources, you might end up with a very unstable cluster.
Each node in a cluster has resources available to it and pods scheduled to run on the node may or may not have resource requests or limits set on them. Kubernetes schedules pods on nodes that have resources that satisfy the pod's specified requirements. Broadly, pods are bin-packed onto the nodes in a best effort attempt to utilize as much of the resources available with as few nodes as possible.
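As an illustration, requests and limits are declared per container in the pod spec; the pod name, image and values below are hypothetical. The scheduler places the pod based on its requests, while limits cap how far it can burst:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # hypothetical name
spec:
  containers:
    - name: app
      image: nginx:1.25    # example image
      resources:
        requests:          # what the scheduler bin-packs against
          cpu: 250m
          memory: 256Mi
        limits:            # the ceiling the container may burst up to
          cpu: 500m
          memory: 512Mi
```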
```
       Node Capacity
---------------------------
|      kube-reserved      |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|       allocatable       |
|  (available for pods)   |
|                         |
|                         |
---------------------------
```
Node resources can be categorised into 4 (as shown above):

- kube-reserved – reserves resources for Kubernetes system daemons.
- system-reserved – reserves resources for operating system components.
- eviction-threshold – specifies limits that trigger pod evictions when node resources drop below the specified value.
- allocatable – the remaining node resources available for scheduling of pods once kube-reserved, system-reserved and eviction-threshold resources have been accounted for.

For example, with a 30.5 GB, 4 vCPU machine with only an eviction threshold set
as --eviction-hard=memory.available<100Mi we'd get the following Capacity
and Allocatable resources:
```
$ kubectl describe node/ip-xx-xx-xx-xxx.internal
...
Capacity:
  cpu:     4
  memory:  31402412Ki
...
Allocatable:
  cpu:     4
  memory:  31300012Ki
...
```
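The gap between Capacity and Allocatable here is exactly the eviction threshold: 100Mi is 102400Ki. A quick check confirms the arithmetic:

```shell
# Allocatable = Capacity - eviction-threshold
# (no kube-reserved or system-reserved set on this node)
capacity_ki=31402412
eviction_ki=$((100 * 1024))    # 100Mi expressed in Ki
allocatable_ki=$((capacity_ki - eviction_ki))
echo "${allocatable_ki}Ki"     # -> 31300012Ki, matching the output above
```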
The scheduler ensures that, for each resource type, the sum of the scheduled pods' resource requests does not exceed the node's allocatable resources. But suppose a couple of applications deployed in your cluster constantly use far more resources than their resource requests specify (bursting above requests but staying below limits under load). You end up with a node whose pods are collectively attempting to use more resources than are actually available on the node!
This is particularly an issue with non-compressible resources like memory. For
example, in the aforementioned case, with an eviction threshold of only
memory.available<100Mi and no kube-reserved or system-reserved
reservations set, it is possible for a node to OOM before the kubelet is
able to reclaim memory (it may not observe memory pressure right away,
since it polls cAdvisor for memory usage stats at a regular interval).
All the while, keep in mind that without kube-reserved or system-reserved
reservations set (the default in many clusters, e.g. GKE, kOps), the
scheduler doesn't account for the resources that non-pod components require to
function properly, because Capacity and Allocatable are then more or
less equal.
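To make the scheduler account for these daemons, reservations can be passed to the kubelet via the --kube-reserved, --system-reserved and --eviction-hard flags. The values below are illustrative, not recommendations:

```shell
# Illustrative kubelet reservations (values are examples only):
#   kubelet --kube-reserved=cpu=100m,memory=1Gi \
#           --system-reserved=cpu=100m,memory=500Mi \
#           --eviction-hard=memory.available<100Mi

# Resulting allocatable memory on the 31402412Ki node above:
# Allocatable = Capacity - kube-reserved - system-reserved - eviction-hard
allocatable_ki=$((31402412 - 1048576 - 512000 - 102400))
echo "${allocatable_ki}Ki"   # -> 29739436Ki
```

With these reservations in place, Allocatable shrinks accordingly, so the scheduler leaves headroom for the kubelet, the container runtime and the operating system daemons.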
It's difficult to give a one-size-fits-all answer to node resource allocation.
The behaviour of your cluster depends on the resource requirements of the apps
running on it, the pod density and the cluster size. But there's a
node performance dashboard that exposes CPU and memory usage profiles
of kubelet and the Docker engine at multiple levels of pod density, which may
serve as a guide for what values would be appropriate for your cluster.
But it seems fitting to recommend the following: explicitly set
kube-reserved and system-reserved for your cluster.

Further Reading: