design/ceph/cephx-keyring-rotation.md
Feature request: https://github.com/rook/rook/issues/15904
From an end-user UX view, we can divide keys/keyrings into 2 categories:
Keys for Ceph daemons that are purely used internally to the Ceph cluster.
Rook can rotate these keys automatically and transparently without risk of disrupting any connections to the Ceph cluster beyond Rook's control.
e.g., Admin key(s), mon, mgr, osd, mds, rgw, rbdmirror's internal (non-peer) key
Any key whose rotation may reasonably require user action beyond Rook API controls.
Because these keys affect non-daemon connections, Rook must make sure the user can initiate rotation during a maintenance window. It is imperative that the user can determine when rotation is finished, so they know when to update the non-daemon application with the updated key.
e.g., CephClient keys, RBD/CephFS mirroring peer tokens, CSI keys
Because different types of key rotations may call for different administrator/user workflows, it should be possible to initiate these rotations independently. For example, a user may have an automated workflow to update CephClient A's consumer, but CephClient B's consumer may be a manual process. It should be possible to rotate the keys of CephClient A and CephClient B separately.
CSI keys are a special case that best fits into the non-daemon case. Rotation of CSI keys will affect Ceph-CSI's RBD and CephFS volume mounts associated with application pod PVCs. This rotation may require administrators to reboot or drain/undrain all nodes with Ceph-CSI PVCs.
When CSI keys are updated, users must drain/undrain (or reboot) all nodes to transition PVC mounts from the old key to the new key.
Users with many nodes may not be able to update them all within the (2 * auth_service_ticket_ttl)
window. Some users with very large k8s clusters might only drain/reboot a portion of their nodes
during a maintenance window over a period of several days. (Service ticket TTL could be extended,
but extending it to several days would also have Ceph security implications that are best avoided
if possible.)
In cases where the rotated key can't be picked up quickly, the alternative is for Rook to create a new ceph client (user) with the new key without deleting the old user/key. Then, Rook updates the CSI secrets to have the new user+key. This would begin the window in which users can drain/reboot nodes. Any new PVC mounts would use the new key, and old mounts would continue operating happily since the old user+key are not yet destroyed.
After the user is done with node updates, the user should have some way of indicating to Rook that the old key is no longer needed, telling Rook it is safe to delete the old key. Autodetection might be possible, but let's leave that for a future exercise.
For clarity, this design could be called "overlapping" rotation, and the design elsewhere in this doc "in-place" rotation.
Overlapping rotation will be implemented for only CSI keys.
Overlapping rotation will not be implemented for the CephClient resource because each CephClient represents a single client, which is a single client ID and its corresponding key. For users or systems that propagate CephClient credentials, overlapping rotation can be accomplished by creating a new CephClient, propagating the new client credentials as needed, then deleting the old CephClient after completion.
In the future, overlapping rotation would be beneficial for the RBD/CephFS mirror peering keys. Currently, the user name for RBD peering is hardcoded, making overlapping rotation impossible.
keyRotationPolicy: Disabled | WithCephVersionUpgrade | KeyGeneration
keyGeneration: <int> # used with KeyGeneration
keepPriorKeyCount: <int> # only present to configure overlapping rotation
keyRotationPolicy (string): select a key rotation policy:
- Disabled (default when unset): don't rotate keys after initial creation
- WithCephVersionUpgrade: rotate keys when the Ceph version updates
- KeyGeneration: rotate when the keyGeneration input is greater than the current key generation

keyGeneration (int): when keyRotationPolicy==KeyGeneration, set this to the desired key
generation value. Ignored for other rotation policies. If this is set to greater than the
current key generation, relevant keys will be rotated, and the generation value will be updated
to this new value (generation values are not necessarily incremental, though that is the
intended use case).
If this is set to less than or equal to the current key generation, keys are not rotated.
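As an illustration only, the keyGeneration comparison described above can be sketched in Python. The function name and the convention of returning None when nothing is rotated are assumptions made for this sketch, not part of Rook's API.

```python
from typing import Optional


def key_generation_rotation_target(
    policy: str, spec_generation: int, status_generation: int
) -> Optional[int]:
    """Return the generation to record after rotating, or None if no
    keyGeneration-triggered rotation is indicated.

    Hypothetical helper: names and return convention are illustrative.
    """
    # keyGeneration is ignored for every policy other than KeyGeneration.
    if policy != "KeyGeneration":
        return None
    # Rotate only when the requested generation is strictly greater than the
    # current one. The recorded generation becomes the requested value, which
    # does not have to be current + 1.
    if spec_generation > status_generation:
        return spec_generation
    return None
```

For example, a spec generation of 5 against a status generation of 2 yields 5, while a spec generation equal to or lower than the status generation yields no rotation.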
API for overlapping rotation:
keepPriorKeyCount (int): only available for components that use overlapping key rotation. This
tells Rook how many prior keys to keep active. Generally, this would be set to 1 to allow for
a migration period for applications. If desired, set this to 0 to delete prior keys after
migration.

Alternative key rotation policy designs that were rejected:
Rook could use a string like once as an input, but Rook would have to record when once was first
observed so that it doesn't repeat rotations with every reconcile.
Rook could alternatively use a string like always and expect the user to unset always after they
see that rotation is complete, but this requires the user to monitor the rotation status and change
config in a clunky way that might risk manual errors.
Rook could use the Ceph version as the "older-than" selection, but that would only allow CSI key rotations when the Ceph version changes. A user already at the latest Ceph version wouldn't have the option to rotate CSI keys on demand.
Rook could use a DATETIME string to initiate rotation for keys minted/rotated before a certain
time, but it is hard to track rotation time well when many keys are involved. Generation allows for
a clearer representation of actual state and provides a simple interface.
Ceph developers have suggested rotation on every Ceph version update. This corresponds to a rotation each time the administrator updates the CephCluster's Ceph image. This also aligns with when Rook knows Ceph daemons are going to be restarted and when Rook can reasonably assume the administrator has a maintenance window. Rotation at this frequency will have no additional impact on cluster connectivity or performance, so it will be the suggested option, and one-time rotations will be allowed as well.
For ease of configuration, the option for rotating daemon keys will be present on only the CephCluster CR. Any child CRs (e.g., CephFilesystem) dependent on the CephCluster will inherit the daemon key rotation config from the corresponding CephCluster. This allows administrators to enable key rotation selectively for specific CephClusters while also keeping the UX simple.
spec:
  # ...
  security:
    cephx:
      daemon:
        keyRotationPolicy: Disabled | WithCephVersionUpgrade | KeyGeneration
        keyGeneration: <int> # used with KeyGeneration
      csi: {} # discussed more later
      # (room for future spec.security.cephx options)
    # (room for future spec.security options)
  # ...
List of Ceph daemons which have daemon keys that can be rotated automatically:
The Ceph admin key Rook uses to run Ceph commands will also be rotated automatically. Care should be taken to ensure admin key rotation doesn't block rotation of other keys.
Rook considered using an operator-level global config option
ROOK_ROTATE_DAEMON_CEPHX_KEYS_OLDER_THAN, but this does not allow controllers to get reconciliation
events when the config is modified. Associating the config with CephCluster allows controllers to
reconcile as needed if/when the user modifies the configuration(s).
Ceph-CSI deployments (if managed by Rook) are managed in the operator namespace, but keys are created on a per-CephCluster basis. Thus, a CephCluster configuration option (like above) is most appropriate. CSI key rotations require manual administrator action to reboot or drain/undrain nodes to remount PVCs with the new key, so Rook will avoid automatic rotation and only implement one-time rotation options. To allow for an arbitrarily-long maintenance window for admins to perform node actions, CSI will use overlapping rotation.
spec:
  # ...
  security:
    cephx:
      daemon: {} # discussed above
      csi:
        keyRotationPolicy: Disabled | KeyGeneration
        keyGeneration: <int> # used with KeyGeneration
        keepPriorKeyCount: 1
      # (room for future spec.security.cephx options)
    # (room for future spec.security options)
  # ...
Note on keyRotationPolicy for CSI. WithCephVersionUpgrade will not be supported for CSI keys
unless we can validate that the keys can safely be rotated without the risk of affecting existing
PVC mount connectivity. Rook will return an error if this value is given.
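A minimal sketch of the validation implied by this note follows. The function name and error messages are hypothetical; only the accepted and rejected policy values come from this document.

```python
CSI_ALLOWED_POLICIES = ("Disabled", "KeyGeneration")


def validate_csi_key_rotation_policy(policy: str) -> str:
    """Return the effective CSI policy or raise ValueError.

    Hypothetical helper for illustration; Rook's real validation may differ.
    """
    # An unset policy behaves like Disabled.
    effective = policy or "Disabled"
    if effective == "WithCephVersionUpgrade":
        # Rejected for CSI: rotating without a user-driven node drain could
        # affect existing PVC mount connectivity.
        raise ValueError("WithCephVersionUpgrade is not supported for CSI keys")
    if effective not in CSI_ALLOWED_POLICIES:
        raise ValueError(f"unknown keyRotationPolicy: {effective!r}")
    return effective
```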
Ceph daemons that have keys used by non-Rook-controlled clients are also associated with Custom Resources (CRs).
Full list of Rook CRs with non-daemon keys:
peer)peer)
Each CR will provide a key rotation mechanism as part of the primary API spec.
For users to determine when Rook has successfully rotated keys, two pieces of information must be reported:
Different Rook CRs will need to report status slightly differently (more below). The reused status fields will be as follows:
keyGeneration: <int>
keyCephVersion: "20.2.0" # e.g.
keyGeneration (int): the CephX key generation for the most recently (successfully) reconciled
resources. This status field is always updated, even when keyRotationPolicy is not
KeyGeneration. When keys are first created, the generation is 1. Generation 0 indicates
that initial reconciliation (including key creation) has not finished, or keys existed prior to
the implementation of the key rotation feature.
keyCephVersion (string): the Ceph version that minted the currently-in-use keys.
This must be the same string format as reported by CephCluster.status.version.version to allow
them to be compared by users to determine when rotation is complete. E.g., 20.2.0.
An empty string indicates that the version is unknown, as expected in brownfield deployments.
For keys rotated WithCephVersionUpgrade, the status...keyCephVersion can be compared to the Ceph
version known to be in the image being deployed. When the status version equals the image version,
rotation is complete.
For keys rotated using keyGeneration, rotation is complete when
status...keyGeneration >= spec...keyGeneration.
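The two completion checks above could be automated by a user roughly as follows. This is a sketch under stated assumptions: the CephxStatus class and rotation_complete function are hypothetical names, and treating a disabled policy as "nothing to wait for" is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class CephxStatus:
    key_generation: int = 0     # 0: not reconciled yet, or key predates the feature
    key_ceph_version: str = ""  # "": version unknown (e.g. brownfield deployments)


def rotation_complete(
    policy: str, spec_generation: int, status: CephxStatus, image_ceph_version: str
) -> bool:
    """Report whether the requested rotation is reflected in the status."""
    if policy == "KeyGeneration":
        return status.key_generation >= spec_generation
    if policy == "WithCephVersionUpgrade":
        # An empty version means "unknown", which never counts as complete.
        return (
            status.key_ceph_version != ""
            and status.key_ceph_version == image_ceph_version
        )
    if policy in ("", "Disabled"):
        # Assumption: with rotation disabled there is nothing to wait for.
        return True
    raise ValueError(f"unknown keyRotationPolicy: {policy!r}")
```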
These statuses will be filled on all Rook resources when CephX keys are first created, even when key rotation is not enabled. This will ensure that users can always know the Ceph version and generation of minted keys -- or, by absence, show that the info is unknown.
For keys rotated via the overlapping mechanism, this status is also added:
priorKeyCount (int): the number of prior keys currently kept active.

It is best if Rook is able to ensure old generation(s) of keys in Ceph's auth system are tracked
accurately. This is especially important if a bug occurs and Rook loses track of how many keys it
has generated for a component via resource statuses. This design doc recommends appending the
current keyGeneration to the Ceph auth client name so that Rook can list keys and ensure that only
the current generation of keys and the desired number of previous generations exist.
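One way to picture the cleanup this naming scheme enables is sketched below. The `<base>.<generation>` suffix format (for example `client.csi-rbd-node.3`) and the function name are assumptions made for illustration; the document only recommends appending the generation and does not fix a format.

```python
import re
from typing import List


def prior_keys_to_delete(
    auth_names: List[str],
    base_name: str,
    current_generation: int,
    keep_prior_key_count: int,
) -> List[str]:
    """Pick the auth entries of older generations that exceed the keep count.

    Assumes (hypothetically) that names look like "<base_name>.<generation>".
    """
    if keep_prior_key_count < 0:
        raise ValueError("keep_prior_key_count must be >= 0")
    pattern = re.compile(re.escape(base_name) + r"\.(\d+)")
    prior = []
    for name in auth_names:
        match = pattern.fullmatch(name)
        if match is None:
            continue  # unrelated client, leave it alone
        generation = int(match.group(1))
        if generation < current_generation:
            prior.append((generation, name))
    # Generations are not necessarily consecutive, so rank by value and keep
    # the newest prior generations.
    prior.sort(reverse=True)
    return [name for _, name in prior[keep_prior_key_count:]]
```

Because the decision is derived from the names listed in Ceph rather than from a counter in a resource status, it still works if the status is lost or wrong.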
In Rook, key rotation will be automated on a per-reconcile-controller basis. Wherever keys are currently being created/deleted when Rook creates/deletes Ceph daemons, rotation will occur nearby. This will ensure that key rotations can result in immediate daemon restart, allowing for appropriate detection of key rotation errors before a cluster or child resource might be brought offline due to unexpected errors.
The key rotation workflow will fit the following high-level process, adapted as necessary for each Ceph daemon to ensure key rotation is applied:
- Rook determines the Ceph version (cephImageVersion) that will be deployed (already part of all reconciles).
- When a key is first created, keyGeneration is taken to be 1 and keyCephVersion is taken to be cephImageVersion.
- For KeyGeneration single-rotation configs, Rook checks the appropriate status...keyGeneration. If spec...keyGeneration is greater, the key is rotated.
- For WithCephVersionUpgrade configs, Rook checks the appropriate status...keyCephVersion. If cephImageVersion is greater, the key is rotated.
- After rotation, status...keyGeneration is taken to be spec...keyGeneration and status...keyCephVersion is taken to be cephImageVersion.
- Daemon keys use spec.security.cephx.daemon to determine if the reconcile should rotate.

With separated daemon key info tracking, the status will look like so:
spec:
  security:
    cephx:
      daemon: {}
      csi: {}
      rbdMirrorPeer: {}
status:
  # ...
  cephx:
    admin:
      keyGeneration: 3 # e.g.
      keyCephVersion: "20.2.2" # e.g.
    mon:
      keyGeneration: 3 # e.g.
      keyCephVersion: "20.2.2" # e.g.
    mgr:
      keyGeneration: 3 # e.g.
      keyCephVersion: "20.2.2" # e.g.
    osd:
      keyGeneration: 3 # e.g.
      keyCephVersion: "20.2.2" # e.g.
    rbdMirrorPeer: # cluster-level RBD mirror peer key (client.rbd-mirror-peer)
      keyGeneration: 3 # e.g.
      keyCephVersion: "20.2.2" # e.g.
    crashCollector:
      keyGeneration: 3 # e.g.
      keyCephVersion: "20.2.2" # e.g.
    exporter:
      keyGeneration: 3 # e.g.
      keyCephVersion: "20.2.2" # e.g.
    csi: {} # discussed more below, but CSI keys also need to be in this status
Keeping track of each daemon type separately helps Rook ensure it won't re-rotate keys from earlier in the reconcile if the reconcile needs to restart later in the process. For example, there is no need to re-rotate admin/mon/mgr keys in the event the reconcile needs to restart in the middle of OSD updates. Because OSD updates can take quite a while, this case is likely to occur across Rook's large user base.
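The benefit of per-daemon tracking can be shown with a small sketch. The function name, the plain dict standing in for the CephCluster status, and the daemon order passed in are all hypothetical; the point is only that a daemon whose status already records the requested generation is skipped when the reconcile restarts.

```python
from typing import Callable, Dict, List


def rotate_daemon_keys(
    spec_generation: int,
    status: Dict[str, int],
    rotate: Callable[[str], None],
    order: List[str],
) -> None:
    """Rotate each daemon type's key at most once per requested generation.

    `status` maps daemon type -> recorded keyGeneration and is updated right
    after each successful rotation, so progress survives an interrupted run.
    """
    for daemon in order:
        if status.get(daemon, 0) >= spec_generation:
            continue  # rotated by an earlier (possibly interrupted) reconcile
        rotate(daemon)  # may raise; this daemon's status then stays unchanged
        status[daemon] = spec_generation
```

If `rotate` raises while handling `osd`, the entries for `admin`, `mon`, and `mgr` already hold the new generation, so the next call rotates only `osd`.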
CSI keys are updated as part of a CephCluster reconcile.
For a CephCluster reconcile:
Ceph-CSI daemon keys (provisioners and node plugins) use CephCluster.spec.security.cephx.csi
(not ...daemon) to determine if the reconcile should rotate.
Assuming the rotation is indicated, rotate keys.
Update the CephCluster status with the CSI key info
kind: CephCluster
# ...
status:
  # ...
  cephx:
    # (status from CephCluster daemon rotations)
    csi:
      keyGeneration: 2 # e.g.
      keyCephVersion: "20.2.1" # e.g.
      priorKeyCount: 1 # e.g.
When mirroring is enabled on a CephBlockPool, Rook creates an RBD mirror peering
token for the pool. The key housed within the token is hardcoded in Ceph to use the
client.rbd-mirror-peer user and key. Therefore, the singular RBD mirror key is rotated at the
CephCluster level (above), but Rook should still update CephBlockPool statuses to identify when the
peering token has been updated to use the latest key.
For token updates, the CephBlockPool controller should reconcile when the parent CephCluster's
status.cephx.rbdMirrorPeer is updated. When the CephBlockPool controller reconciles and creates
its bootstrap token, it should copy CephCluster.status.cephx.rbdMirrorPeer to its own
status.cephx.peerToken, which will indicate that the token has been updated with the latest key
after a CephCluster key rotation event.
kind: CephBlockPool
# ...
status:
  # ...
  cephx:
    peerToken:
      keyGeneration: 2
      keyCephVersion: "20.2.2"
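The status copy described above might look like the following sketch, which uses plain dictionaries in place of the CephCluster and CephBlockPool status objects. The function name and the boolean return value are assumptions for illustration.

```python
def sync_peer_token_status(cluster_status: dict, pool_status: dict) -> bool:
    """Copy status.cephx.rbdMirrorPeer from the cluster to the pool's
    status.cephx.peerToken. Returns True if the pool status changed.

    Hypothetical helper; intended to run after the bootstrap token is
    (re)created with the current key.
    """
    source = cluster_status.get("cephx", {}).get("rbdMirrorPeer", {})
    target = pool_status.setdefault("cephx", {})
    changed = target.get("peerToken") != source
    target["peerToken"] = dict(source)  # copy so later cluster edits don't leak
    return changed
```

A user can then compare the pool's `peerToken` status with the cluster's `rbdMirrorPeer` status to see whether the token still carries an older key.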
Design for this should mirror rbdMirrorPeer/peerToken above, tailored for CephFS mirror design.
CephClient stands out due to its simplicity. It will have one and exactly one key which can serve any purpose (neither daemon key nor peer token). The interfaces are simplified to reflect this.
kind: CephClient
# ...
spec:
  # ...
  security:
    cephx:
      keyRotationPolicy: Disabled | WithCephVersionUpgrade | KeyGeneration
      keyGeneration: <int> # used with KeyGeneration
    # (room for future spec.security options)
  # ...
status:
  cephx:
    keyGeneration: 1 # e.g.
    keyCephVersion: "20.2.0" # e.g.
There are a number of keys created by Ceph automatically that are unused. The Ceph team has indicated that they will eventually modify Ceph to stop creating these keys. Before that change is made, it is safe for Rook to delete them. Since there is no need to rotate unused keys, deletion is best.
Ceph Tentacle (v20) provides a new ceph auth rotate command merged here:
https://github.com/ceph/ceph/pull/58121. Rook will rely on this command to rotate keys.
A few CephX technical details that are important for understanding CephX key rotation development are summarized below. More details here: https://docs.ceph.com/en/latest/dev/cephx/
CephX keys minted by Rook are only used by Ceph for initial daemon connection. Internally, all Ceph connections use "service" keys with laddered expiration times.
By default, Ceph service keys allow at least 2 hours from key rotation until the client must be updated with the new key. There are 3 service keys with TTLs 1 hour, 2 hours, and 3 hours from the time when service keys were last refreshed. Assuming the first TTL expiration is imminent, it is still at least 2 hours until the 3-hour TTL expiration.
Ceph does not allow keyrings to contain multiple keys for a given client/daemon. When a key is rotated, the old key is removed and new key replaces it with no ability to have a 2-valid-keys transition period.
Ceph's auth_service_ticket_ttl and auth_mon_ticket_ttl config options allow users to
shorten/lengthen that time as desired/needed. Note that this isn't immediate: it will take an
unknown amount of time for internal service keys to update.
Ceph config debug_auth=30 provides maximum CephX debug logs, as a development aid.
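The laddered service key timing described in the list above can be checked with a little arithmetic. The model below is a simplification based only on this document's description (three service keys expiring at 1x, 2x, and 3x the TTL after the last refresh); it is not Ceph's implementation, and the function name is hypothetical.

```python
def client_update_window(ticket_ttl_seconds: int, elapsed_since_refresh: int) -> int:
    """Seconds left until the last of the three laddered service keys expires.

    Simplified model: keys expire at 1x, 2x and 3x the TTL after the last
    refresh, and a refresh happens no later than the first expiration.
    """
    if not 0 <= elapsed_since_refresh <= ticket_ttl_seconds:
        raise ValueError("elapsed time must be between 0 and the TTL")
    return 3 * ticket_ttl_seconds - elapsed_since_refresh
```

With the default 1-hour TTL, the window ranges from 3 hours (keys just refreshed) down to 2 hours (first expiration imminent), which matches the "at least 2 hours" figure and the 2 * auth_service_ticket_ttl window used elsewhere in this document.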
Rotation of the Ceph admin key is a risky process. Rook must ensure that admin key rotation cannot brick a Rook cluster. As such, the process is designed in full here.
Today, the rook-ceph-mon secret contains the authoritative client.admin keyring
(the "primary admin keyring").
While client.admin may be able to rotate its own key, the process of rotating the key and updating
the secret could not be made atomic. If the reconcile or operator were to fail between rotation and
secret update, there could be no way to recover the cluster.
To ensure CephCluster reconciliation can be recovered in the event of failures, Rook will create a
new, temporary admin user client.admin-rotator whose sole purpose is to rotate the primary admin
keyring. It will be created to rotate the primary admin keyring and removed after rotation.
While client.admin-rotator exists, its keyring must also be stored. To avoid the complexity and risk of
adding a temporary field to the rook-ceph-mon secret for storing the admin rotator keyring, a new
rook-ceph-admin-rotator-keyring secret will be used.
First, we establish a rotation procedure. Because this procedure is risky, the process takes extra caution to verify that updates to the secret are properly stored. There is no room for error.
- client.admin: run ceph auth get-or-create client.admin-rotator w/ admin permissions
- rook-ceph-admin-keyring secret with the client.admin-rotator keyring
- client.admin-rotator: run ceph auth ls to ensure it has permissions
- client.admin-rotator: run ceph auth rotate client.admin
- client.admin: run ceph auth ls to ensure it has permissions
- client.admin keyring
- rook-ceph-mon secret with new client.admin keyring
- client.admin: run ceph auth rm client.admin-rotator
- rook-ceph-admin-keyring secret
- CephCluster.status.cephx.admin with updated CephX status

Rotation of the admin key should happen after mons are updated. This is important if Ceph is being
upgraded and the 'current' Ceph version doesn't support the auth rotate command, but the 'new'
Ceph version does support it. (This consideration also exists for rotating the mon. key)
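To illustrate why the temporary rotator user makes the procedure recoverable, the following sketch simulates the rotation against an in-memory stand-in for Ceph's auth database and the two secrets. Everything here is hypothetical scaffolding: the class, the step names, and the reading of the steps as create, store, rotate, update, remove, and delete are assumptions, and the `ceph auth ls` verification steps are omitted for brevity.

```python
import secrets

ADMIN = "client.admin"
ROTATOR = "client.admin-rotator"


def new_key() -> str:
    return secrets.token_hex(16)  # placeholder, not a real CephX key format


class FakeCluster:
    def __init__(self) -> None:
        key = new_key()
        self.auth = {ADMIN: key}        # stands in for Ceph's auth database
        self.mon_secret = {ADMIN: key}  # stands in for the rook-ceph-mon secret
        self.rotator_secret = {}        # stands in for the rotator keyring secret

    def has_working_credential(self) -> bool:
        """True if at least one stored keyring still matches Ceph."""
        admin_ok = self.auth.get(ADMIN) == self.mon_secret.get(ADMIN)
        rotator_key = self.rotator_secret.get(ROTATOR)
        rotator_ok = rotator_key is not None and self.auth.get(ROTATOR) == rotator_key
        return admin_ok or rotator_ok


def rotate_admin_key(cluster: FakeCluster, crash_after: int = 0) -> None:
    """Run the steps in order; stop after step `crash_after` to simulate an
    operator crash (0 means run to completion)."""
    memory = {}  # operator memory, lost when the operator crashes

    def create_rotator():
        memory[ROTATOR] = cluster.auth[ROTATOR] = new_key()

    def store_rotator():
        cluster.rotator_secret[ROTATOR] = memory[ROTATOR]

    def rotate_admin():
        memory[ADMIN] = cluster.auth[ADMIN] = new_key()

    def store_admin():
        cluster.mon_secret[ADMIN] = memory[ADMIN]

    def remove_rotator():
        del cluster.auth[ROTATOR]

    def delete_rotator_secret():
        cluster.rotator_secret.clear()

    steps = [create_rotator, store_rotator, rotate_admin,
             store_admin, remove_rotator, delete_rotator_secret]
    for number, step in enumerate(steps, start=1):
        step()
        if number == crash_after:
            return
```

In this model, whichever step the operator stops after, either the stored admin keyring or the stored rotator keyring still authenticates, so a later reconcile has a credential to recover with. If client.admin rotated its own key instead, a crash between the rotation and the secret update would leave no valid stored keyring.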
- CephCluster.spec.security.cephx.daemon to determine if rotation is indicated

If rotation fails or the operator restarts after the ROTATE step, the CephCluster reconcile will
begin again. In the new reconcile, the secret may have the wrong info in data.keyring. In that
case, the CephCluster reconcile would be unable to take any admin actions, effectively bricking the
reconcile. Therefore, the CephCluster reconciler must be able to recover from any interrupted admin
key rotation and must do so before any other Ceph admin actions.
- client.admin from the secret data.keyring (same as today)
- data.rotatorKeyring is present, prior rotation failed somewhere - recovery needed
- client.admin: run ceph auth ls
- client.admin-rotator is not present in output, final cleanup failed
- auth ls failed), rotation failed somewhere between ROTATE and CLEANUP
- client.admin-rotator from the secret data.rotatorKeyring

Kubernetes documentation explains how to encrypt data at rest in the cluster and how keys are rotated in this document: https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/#rotating-a-decryption-key
In that document, users are in charge of generating keys. Because Rook is in charge of generating CephX keys, the k8s design does not translate to this feature.
IBM Credential Rotator Operator (https://github.com/IBM/credential-rotator-operator) automatically
rotates keys for an application and restarts the application pod(s) afterwards. The input that
initiates rotation is not easy to understand, but the status shows PreviousResourceKeyID. Since
Ceph keys don't have an ID associated with rotation, this seems similar to Rook tracking
"key version" metadata on its own for key rotation.