design/ceph/ceph-managed-disruptionbudgets.md
OSDs do not fit a single PodDisruptionBudget pattern. Ceph's ability to tolerate pod disruptions in one failure domain depends on the overall health of the cluster. Even if an upgrade agent were to drain only one node at a time, Ceph would have to wait until there were no undersized PGs before moving on to the next node.
The failure domain for the PDB is determined by the lowest failure domain enforced by any Ceph RADOS pool in the cluster. For example, consider a cluster with two pools: `bigrbdpool`, associated with a CRUSH rule that enforces a `rack` failure domain, and `.mgr`, associated with a CRUSH rule that enforces a `host` failure domain. In this case, the PDB failure domain will be `host`.
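As a concrete illustration, a minimal sketch of how a pool like `bigrbdpool` (a hypothetical pool name from the example above) could pin its CRUSH failure domain to `rack`:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: bigrbdpool          # hypothetical pool from the example above
  namespace: rook-ceph
spec:
  failureDomain: rack       # CRUSH rule spreads replicas across racks
  replicated:
    size: 3
```

Since the built-in `.mgr` pool defaults to a `host` failure domain, the operator would pick `host` as the PDB failure domain for this cluster.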
- Default PDB:
  - A single PDB named `rook-ceph-osd` with `maxUnavailable=1` allows one OSD to go down on any failure domain.
  - If the PGs are still `active+clean` while an OSD is down, the downed OSDs are excluded from the PDB through label match expressions.
- Blocking PDBs:
  - One PDB per failure domain, named `rook-ceph-osd-<failureDomainType>-<FailureDomainName>`. For example: `rook-ceph-osd-zone-zone-a`.
  - `maxUnavailable` is set to 0 to prevent any OSD pod in that failure domain from being drained.

We begin by creating the default PodDisruptionBudget for all the OSDs. Once the user drains a node and an OSD goes down, we determine the failure domain of the draining OSD (using the OSD deployment labels). We then create blocking PodDisruptionBudgets (`maxUnavailable=0`) for all other failure domains and delete the main PodDisruptionBudget. This blocks OSDs from going down in multiple failure domains simultaneously. A sketch of both PDB shapes follows below.
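As a sketch of the two shapes (the namespace and the blocking PDB's topology label key are assumptions; the operator's actual label keys may differ):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-osd                  # default PDB covering all OSDs
  namespace: rook-ceph
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: rook-ceph-osd
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-osd-zone-zone-a      # blocking PDB for one failure domain
  namespace: rook-ceph
spec:
  maxUnavailable: 0                    # no OSD in zone-a may be evicted
  selector:
    matchLabels:
      app: rook-ceph-osd
      topology-location-zone: zone-a   # assumed label key on OSD pods
```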
Once the drained OSDs are back up and all the PGs are `active+clean`, that is, once the cluster has healed, the default PodDisruptionBudget is added back and the blocking ones are deleted.
Detecting drains is not easy because they are a client-side operation. The client cordons a node and then continuously attempts to evict all pods from the node until it succeeds. If a node on which an OSD is supposed to run is unschedulable, the operator considers that node to be draining.
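Concretely, cordoning is just a field flip on the Node object, so a minimal sketch of what the operator looks for (the node name is hypothetical):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-a            # hypothetical node from the scenario below
spec:
  unschedulable: true     # set by `kubectl cordon`, the first step of a drain
```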
Example scenario, with a `zone` failure domain:

1. The default PDB, `rook-ceph-osd` with `maxUnavailable=1`, is created for all OSDs.
2. The user drains `Node a` for maintenance.
3. The operator notices `osd.0` on `Node a` in `zone x` is down:
   - It creates the blocking PDBs `rook-ceph-osd-zone-zone-y` and `rook-ceph-osd-zone-zone-z`.
   - It deletes the default PDB, so only OSDs in `zone x` are allowed to be drained.
4. `Node a` is back, all of its OSDs are running, and all PGs are `active+clean`:
   - The operator re-creates the default PDB `rook-ceph-osd` (`maxUnavailable=1`).
   - It deletes the blocking PDBs `rook-ceph-osd-zone-zone-y` and `rook-ceph-osd-zone-zone-z`.

An example of an operator that will attempt rolling upgrades of nodes is the Machine Config Operator in OpenShift. Judging by SIG Cluster Lifecycle, Kubernetes deployments based on the Cluster API are a common pattern. This approach will also help mitigate manual drains from accidentally disrupting storage.
Additionally, the operator manages the `noout` flag during drains:

- The operator adds the `noout` flag to the failure domain of the drained node.
- The `noout` flag is removed after the `OSDMaintenanceTimeout` has elapsed. `OSDMaintenanceTimeout` defaults to 30 minutes but can be configured from the CephCluster CR (a configuration sketch follows below).
- `noout` is not added if an OSD is down but there is no drained node.
- If the PGs are `active+clean` despite a down OSD, the Rook operator will exclude the down OSDs from the main PDB through label match expressions, like so (assume OSDs 1, 3, and 5 are down):

```yaml
maxUnavailable: 1
selector:
  matchExpressions:
  - key: app
    operator: In
    values: ["rook-ceph-osd"]
  - key: osd
    operator: NotIn
    values: ["1", "3", "5"]
```
Since there is no strict failure domain requirement for the other Ceph daemon pods (mon, mgr, MDS, RGW, RBD mirror), and they are not logically grouped, a static PDB suffices.
A single PodDisruptionBudget is created and owned by each respective controller, and it is updated only when changes in the CRDs alter the number of pods.
For example: In a deployment with three Monitors, we can have a PDB with the same labelSelector as the deployment and `maxUnavailable=1`.
When the Monitor count is increased to 5 (as is prudent in production), we can replace it with a PDB that has `maxUnavailable` set to 2, as sketched below.
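A minimal sketch of the three-mon case (the PDB name, namespace, and label key are assumptions based on Rook's naming conventions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-mon-pdb
  namespace: rook-ceph
spec:
  maxUnavailable: 1           # bump to 2 when running five mons
  selector:
    matchLabels:
      app: rook-ceph-mon      # assumed label on monitor pods
```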