design/two-node-fencing.md
Design a highly available two-node architecture for the Rook storage cluster.
Target distributed and edge environments where the hardware footprint must be strictly minimized without sacrificing High Availability (HA) or data integrity.
Maintain Ceph Monitor quorum across only two physical nodes by introducing a floating monitor backed by a mon store that is replicated across both nodes.
Really? You can run a cluster with just two nodes?
While a two-node solution is not as reliable as a three-node cluster, in edge clusters where it is cost-prohibitive to have a third node, the two-node solution can be considered "good enough".
Fencing is managed by a cluster resource manager (e.g., Pacemaker), which isolates an unresponsive node through its Baseboard Management Controller (BMC). Once the unresponsive node is safely fenced (powered off or rebooted), the surviving node can continue operating the cluster without the risk of split-brain or resource corruption. When the fencing operation completes, the API server comes back online on the remaining node, and the floating mon can be rescheduled there. With the floating mon and the surviving pinned mon both running on the online node, Ceph restores mon quorum and IO resumes.
As a reference, this design is based on the OpenShift Two-Node Fencing solution, though it is expected to work independently of OpenShift if Pacemaker and the other components are configured properly.
Node Fencing Enabled: The cluster must have node-level fencing configured and active.
Synchronous Block Replication: The two nodes must have access to the same filesystem to back the floating monitor's data store. For example, one block device presented to both nodes (e.g. via JBOD or DRBD) with a filesystem on top.
Ceph requires a strict majority of monitors to maintain quorum (e.g. with three mons, two must be online). In a two-node setup, placing two monitors on one node creates a single point of failure. To solve this, we will use a "floating" third monitor.
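The majority requirement can be made concrete with a short calculation. For a cluster of $n$ monitors, the quorum size $q(n)$ and the number of tolerated monitor failures $f(n)$ are:

$$
q(n) = \left\lfloor \frac{n}{2} \right\rfloor + 1, \qquad f(n) = n - q(n)
$$

$$
q(2) = 2,\; f(2) = 0 \qquad\qquad q(3) = 2,\; f(3) = 1
$$

Two monitors therefore tolerate no failures, while three tolerate one. With three monitors on two nodes, one node always hosts two of them, and losing that node leaves $1 < q(3)$ monitors until a second monitor is started on the surviving node, which is the role of the floating monitor.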
Pinned Mons: Two mons are deployed normally, one on each of the two nodes, with strict node affinity. These mons are backed by the CephCluster dataDirHostPath, as in a regular Rook cluster.
Floating Mon: The third monitor is designed to "float" and is permitted to be scheduled on either of the two nodes. The backing store for the floating mon must be mirrored between the two nodes, to allow it to come online on either of the two nodes (though it is not allowed to come online on both nodes simultaneously).
Networking: Rook's host networking cannot be enabled. The floating monitor must use a K8s service clusterIP to retain a constant IP address regardless of which node it is scheduled on.
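As an illustration of the networking requirement, a Service for the floating mon might look like the sketch below. Rook normally creates a Service per mon itself, so this is shown only to make the stable-address idea concrete. The Service name and selector labels are assumptions modeled on Rook's usual mon naming, not part of this design. Ports 3300 and 6789 are the standard Ceph mon msgr2 and msgr1 ports.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-mon-c        # assumed name, following the mon id "c"
  namespace: rook-ceph
spec:
  type: ClusterIP              # stable virtual IP, independent of the hosting node
  selector:                    # assumed labels; must match the floating mon pod
    app: rook-ceph-mon
    ceph_daemon_id: c
  ports:
    - name: tcp-msgr2
      port: 3300
      targetPort: 3300
    - name: tcp-msgr1
      port: 6789
      targetPort: 6789
```

Because peers and clients reach the mon through the clusterIP, rescheduling the pod to the other node does not change the address they use.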
To ensure the floating mon retains its state when moving between nodes, its data directory is backed by a host path that is synchronously mirrored between the two nodes. The configuration of the mirroring is done independently from the Rook operator. Rook allows the floating mon spec to be customized to accommodate the mirroring implementation.
The current proposal uses DRBD to mirror the floating mon data directory. Another mirroring solution could also be used with the proposed design.
A mon deployment is normally generated by Rook in code. The floating mon will avoid this code path; instead, the mon is specified by a YAML template with the following properties:
The CephCluster CRD is updated with a new floatingMon section:
apiVersion: ceph.rook.io/v1
kind: CephCluster
...
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
    floatingMon:
      name: c
      configmapName: rook-floating-mon-config
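The floatingMon settings include:
name: The name of the floating mon (c in the example above).
configmapName: The name of the ConfigMap that holds the variables for the floating mon template.
The custom variables required for the DRBD template include: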
DRBD_RESOURCE_NAME: The specific DRBD resource name used for the underlying configuration.
DRBD_DEVICE_NAME: The explicit DRBD device path (e.g., /dev/drbd0) to be mounted.
DRBD_UTILS_IMAGE: The image used for init or sidecar containers for DRBD configuration.
Several built-in variables will be replaced automatically in the template:
NAMESPACE: The K8s namespace where the cluster is running.
CLUSTER_IP: The clusterIP of the mon's K8s service.
CEPH_IMAGE: The Ceph container image used to start the mon (already defined in spec.cephVersion.image).
A ConfigMap must be defined with key-value pairs of the variables for the floating mon. When Rook reconciles the floating mon, these variables will be replaced in the mon template. A ConfigMap is used so that the variables can be generated for the cluster independently of the CephCluster CR. For example, a script may configure DRBD on the device and then generate the ConfigMap directly. Rook will provide an example script.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-floating-mon-config
  namespace: rook-ceph
data:
  DRBD_RESOURCE_NAME: "drbd-mon-data"
  DRBD_DEVICE_NAME: "/dev/drbd0"
  DRBD_UTILS_IMAGE: "<image>"
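To show how these variables might be consumed, the fragment below sketches part of a floating mon template. The `${VAR}` placeholder syntax, the container names, the labels, and the overall shape of the template are assumptions made for illustration; the design does not fix them. The fragment is deliberately incomplete: a real mon needs further arguments and volume mounts.

```yaml
# Hypothetical template fragment; the ${...} placeholder syntax is assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-mon-c
  namespace: ${NAMESPACE}              # built-in variable
spec:
  replicas: 1                          # the floating mon must never run on both nodes
  strategy:
    type: Recreate                     # stop the old pod before starting a new one
  selector:
    matchLabels:
      app: rook-ceph-mon
      ceph_daemon_id: c
  template:
    metadata:
      labels:
        app: rook-ceph-mon
        ceph_daemon_id: c
    spec:
      initContainers:
        - name: drbd-promote           # hypothetical init container
          image: ${DRBD_UTILS_IMAGE}   # from the ConfigMap
          command: ["drbdadm", "primary", "${DRBD_RESOURCE_NAME}"]
          securityContext:
            privileged: true
        # Mounting ${DRBD_DEVICE_NAME} as the mon data directory is omitted here.
      containers:
        - name: mon
          image: ${CEPH_IMAGE}         # built-in variable
          args:
            - --id=c
            - --public-addr=${CLUSTER_IP}
```

Keeping the DRBD-specific steps in the template, rather than in operator code, is what would let a different mirroring solution be substituted by swapping the template and its ConfigMap.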
In a Two-Node Fencing (TNF) cluster, consider the following recommendations:
Several Rook features are not supported and should be disabled:
spec.disruptionManagement.managePodBudgets: false
spec.healthCheck.daemonHealth.mon.disabled: true
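Expressed in the CephCluster CR, the two settings above nest as follows. Only the relevant fields are shown, and the metadata values are placeholders.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph          # assumed cluster name
  namespace: rook-ceph
spec:
  disruptionManagement:
    managePodBudgets: false   # Rook does not manage PodDisruptionBudgets
  healthCheck:
    daemonHealth:
      mon:
        disabled: true        # Rook's mon health check is turned off
```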