specifications/trusted_computing_base/key_manager/README.md
This specification outlines the design and execution of the Key Manager (KM): the primary service responsible for managing and rotating cryptographic keys used by validator nodes and validator full nodes in the Diem payment network.
Validator nodes (VNs) and validator full nodes (VFNs) generate and store cryptographic keys in order to securely participate in the Diem network. For example, each VN in Diem maintains a secret and unique cryptographic key that can be used to sign votes on quorum certificates; this allows VNs to achieve consensus on transactions within Diem. Likewise, all VNs and VFNs in the network maintain cryptographic keys to authenticate and encrypt communication between nodes, thus preventing adversaries from interfering with network communication in order to perform attacks.
To maintain the integrity and confidentiality of cryptographic keys in the Diem network, the KM is responsible for automatically rotating these keys over time. This provides freshness for the keys used by VNs and VFNs, mitigating long-range attacks, in which keys are compromised and gathered over long periods of time.
At present, the KM plays a role in rotating the following asymmetric cryptographic keys in Diem:
- Ed25519PrivateKey: used by VNs to sign votes on quorum certificates to achieve consensus.

Note: from this point forward, whenever we refer to a cryptographic key by name (e.g., the consensus key), we are referencing the private exponent of the key (e.g., the Ed25519PrivateKey). References to the public exponent of a cryptographic key will always be clarified.
At a high-level, the KM is designed to be a stand-alone service that drives the rotation of cryptographic keys across the relevant services and components that make up a VN and VFN. Each VN/VFN instance in the Diem network will deploy its own KM to manage the keys on that instance and perform rotations as required. The diagram below shows a broad overview of the KM operating within the context of a VN. The rest of this section is dedicated to explaining this architecture and presenting how it operates.
Key Manager Architecture: First, there are several important components that must be highlighted within the architecture diagram above; these are displayed in color:
- Validator Config: a resource held under each validator operator's account. For example, in the case of the consensus key, each VN will register its consensus key in the validator config of that node, on-chain. The consensus key for that VN can then be seen by all other VNs in the network, and thus used to authenticate the consensus votes signed by that key. To protect the validator config of each VN/VFN, each VN/VFN is assumed to have a unique and private identity key, called the operator key, that has permission to update the validator config for that VN. The operator key is initialized and held in the SS for each VN and VFN. More details on validator configs and the operator key are presented below.

Given the KM architecture presented above, we now describe the high-level flow of a VN consensus key rotation as performed by the KM. The diagram below augments the architecture diagram with numbered steps showing the interactions between the KM and the various components in the architecture. We explain each step in detail below:
At any point during a consensus key rotation, it is possible for any component within the key rotation pipeline to fail. For example: (i) the KM itself may crash during a key rotation; (ii) the SS may be offline for some period of time; (iii) the JSON RPC endpoint may fail to forward signed rotation transactions to the blockchain; or (iv) a component that maintains an exported key copy may fail (e.g., LSR). As such, it is the responsibility of the KM to provide fault tolerance against key rotation failures. For this, the KM assumes that any and all failures are transient, i.e., that the failure will occur for only a finite amount of time, after which the failed component will resume operation and the KM can address any inconsistencies. We make several important observations regarding consensus key rotation failures:
Failure Recovery Protocol: To achieve fault tolerance against component failures (including the KM itself), the KM should periodically check the state of the consensus key across components in the system, by performing the following steps:
LSR Failure Recovery Steps: As already outlined in this section, should LSR fail and restart at any point, it will need to read the currently registered consensus public key on the blockchain for this VN and retrieve the corresponding private key from the SS. Given that the SS maintains N=4 versions of the consensus key at any time, this should be successful in the majority of cases. If the SS fails to contain the corresponding private key for the consensus key version registered on-chain, LSR should simply wait until a reconfiguration event occurs which announces the new consensus key, and retrieve the corresponding private key from the SS before resuming execution.
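The LSR recovery decision above can be sketched as a small lookup, shown below. This is an illustrative stand-in, not the real LSR code: a key version is identified by its public key (modeled here as a `u64`), and `ss_versions` plays the role of the key versions retained by the SS.

```rust
// Sketch of the LSR recovery decision described above (illustrative types only).
fn recover_consensus_key(on_chain_version: u64, ss_versions: &[u64]) -> Option<u64> {
    // LSR looks up the private key matching the consensus public key registered
    // on-chain. If this returns None, LSR must wait for the next reconfiguration
    // event (announcing a new consensus key) before resuming execution.
    ss_versions.iter().copied().find(|&v| v == on_chain_version)
}

fn main() {
    let ss_versions = [1, 2, 3, 4]; // the SS retains N=4 key versions

    // The on-chain key is found among the retained versions: LSR can resume.
    assert_eq!(recover_consensus_key(3, &ss_versions), Some(3));

    // A version missing from the SS forces LSR to wait for a reconfiguration.
    assert_eq!(recover_consensus_key(9, &ss_versions), None);
}
```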
In this section, we discuss in further detail the external components relied upon by the KM. These are: the secure storage (SS), the validator configs, and the JSON RPC endpoint. For each of these components, we present abstractions (e.g., component interfaces) in order to reason about the correctness of the KM protocols. Assuming each of these external components is implemented correctly (i.e., according to the interfaces they expose), we argue for the security and correctness of the KM.
The SS offers a secure persistent backend for generating, storing and managing security-critical data. While this includes a wide variety of data types (e.g., arbitrary text, json blobs, waypoints etc.), the KM only requires cryptographic key support from the SS. As such, we focus on this support here. For further information about the SS, see the specification here.
At a high-level, the KM requires several important functionalities from the SS. We list these functionalities below, and for each functionality, explain why it is required and where it is used.
To support all of the required functionalities listed above for the KM, the SS provides a complete cryptographic key API named CryptoStorage. The snippet below shows a simplified CryptoStorage API. To view the full API, see the SS specification here.
```rust
/// CryptoStorage offers a secure storage engine for generating, using and managing cryptographic
/// keys securely.
pub trait CryptoStorage: Send + Sync {
    /// Securely generates a new named Ed25519 key pair and returns the corresponding public key.
    fn create_key(&mut self, name: &str) -> Result<Ed25519PublicKey, Error>;

    /// Returns the private key for a given Ed25519 key pair, as identified by the 'name'.
    fn export_private_key(&self, name: &str) -> Result<Ed25519PrivateKey, Error>;

    /// Returns the private key for a given Ed25519 key pair version, as identified by the
    /// 'name' and 'version'.
    fn export_private_key_for_version(
        &self,
        name: &str,
        version: Ed25519PublicKey,
    ) -> Result<Ed25519PrivateKey, Error>;

    /// Rotates an Ed25519 key pair by generating a new Ed25519 key pair, and updating the
    /// 'name' to reference the freshly generated key.
    fn rotate_key(&mut self, name: &str) -> Result<Ed25519PublicKey, Error>;

    /// Signs the given message using the private key associated with the given 'name'.
    fn sign_message(&mut self, name: &str, message: &HashValue) -> Result<Ed25519Signature, Error>;

    /// Signs the given message using the private key associated with the given 'name'
    /// and 'version'.
    fn sign_message_using_version(
        &mut self,
        name: &str,
        version: Ed25519PublicKey,
        message: &HashValue,
    ) -> Result<Ed25519Signature, Error>;
}
```
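To illustrate how the KM and LSR drive this API, the sketch below implements a toy in-memory backend against a simplified version of the trait. Everything here is illustrative: `ToyStorage` is not part of the SS, and the real Ed25519 types are collapsed into plain counters so the example stays self-contained.

```rust
use std::collections::HashMap;

// Toy stand-ins for the real crypto types; a counter plays the role of a key pair.
type PublicKey = u64;
type PrivateKey = u64;

// Simplified mirror of the CryptoStorage trait (crypto types faked, signing omitted).
trait CryptoStorage {
    fn create_key(&mut self, name: &str) -> Result<PublicKey, String>;
    fn rotate_key(&mut self, name: &str) -> Result<PublicKey, String>;
    fn export_private_key_for_version(
        &self,
        name: &str,
        version: PublicKey,
    ) -> Result<PrivateKey, String>;
}

// In-memory backend that keeps every generated version of each named key,
// mimicking the SS property that multiple key versions remain available.
struct ToyStorage {
    keys: HashMap<String, Vec<u64>>,
    next: u64,
}

impl CryptoStorage for ToyStorage {
    fn create_key(&mut self, name: &str) -> Result<PublicKey, String> {
        let key = self.next;
        self.next += 1;
        self.keys.insert(name.to_string(), vec![key]);
        Ok(key)
    }

    fn rotate_key(&mut self, name: &str) -> Result<PublicKey, String> {
        let key = self.next;
        self.next += 1;
        self.keys
            .get_mut(name)
            .ok_or_else(|| format!("unknown key: {}", name))?
            .push(key); // retain the old version alongside the new one
        Ok(key)
    }

    // The version is identified by the public key, mirroring the real API.
    fn export_private_key_for_version(
        &self,
        name: &str,
        version: PublicKey,
    ) -> Result<PrivateKey, String> {
        self.keys
            .get(name)
            .and_then(|versions| versions.iter().find(|&&k| k == version))
            .copied()
            .ok_or_else(|| format!("unknown key version: {}/{}", name, version))
    }
}

fn main() {
    let mut ss = ToyStorage { keys: HashMap::new(), next: 0 };
    let v0 = ss.create_key("consensus_key").unwrap();
    let v1 = ss.rotate_key("consensus_key").unwrap();

    // Both versions remain exportable after a rotation, as LSR recovery requires.
    assert_eq!(ss.export_private_key_for_version("consensus_key", v0).unwrap(), v0);
    assert_eq!(ss.export_private_key_for_version("consensus_key", v1).unwrap(), v1);
}
```

The important property demonstrated is versioning: after `rotate_key(..)`, the previous key version is still retrievable by its public key, which is what allows LSR to recover when the on-chain key lags behind the latest rotation.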
As can be seen from the snippet above, the CryptoStorage API offered by the SS provides all of the required functionalities for the KM to operate correctly. More specifically:
- Key generation and rotation are supported through the create_key(..) and rotate_key(..) API calls. As such, the SS can be initialized when the VN is started by calling create_key(..) for the operator_key and consensus_key. Moreover, each time the KM needs to generate a new key version for the consensus key, rotate_key(consensus_key, ..) can be called.
- Message signing is supported through the sign_message(..) and sign_message_using_version(..) API calls. As such, the KM can request the SS sign the rotation transaction by calling sign_message(operator_key, rotation_transaction).
- Key versioning is supported through the rotate_key(..), export_private_key_for_version(..) and sign_message_using_version(..) API calls. The version of each key is specified using the public key, and multiple versions of each key (i.e., N>1) are maintained by the SS.
- Key export is supported through the export_private_key(..) and export_private_key_for_version(..) API calls. As such, LSR can request a local copy of the consensus key using either of these API calls (depending on the version required).

As discussed in the overview section above, the validator configs offer a public key infrastructure (PKI) that maps the identity of each VN to the corresponding cryptographic keys (e.g., the consensus key). While this mapping could be done using another PKI system, in Diem, we run this PKI directly on the blockchain. This provides a single source of truth that is both tamper and censorship resistant.
To allow dynamic updates to the consensus key, the ValidatorConfig decouples the operator key of each VN from the consensus key, allowing VNs to retain a single operator key over time despite performing multiple rotations. To achieve this, each VN in Diem is responsible for publishing and maintaining their ValidatorConfig. This configuration is associated with each VN using the VN's operator account, and the only way in which to update the ValidatorConfig is to sign a Diem transaction using the operator key associated with that account. As a result, only the operator of each VN can modify this configuration. This makes the ValidatorConfig an ideal location in which to publish the consensus key of each VN.
The snippet below shows the on-chain ValidatorConfig of each VN:
```rust
pub struct ValidatorConfig {
    pub consensus_public_key: Ed25519PublicKey,
    /// This is a bcs serialized Vec<EncNetworkAddress>
    pub validator_network_addresses: Vec<u8>,
    /// This is a bcs serialized Vec<NetworkAddress>
    pub fullnode_network_addresses: Vec<u8>,
}
```
As can be seen in the snippet above, ValidatorConfig contains a field named consensus_public_key of type Ed25519PublicKey. This field contains the currently published consensus key of the VN, thus allowing other VNs in the network to identify the consensus key of this VN.
Epoch-specific ValidatorConfig Snapshots: For security and performance reasons, the ValidatorConfig of each VN is copied into an external move module called ValidatorInfo whenever the VN is selected to participate in a consensus round (i.e., epoch). This snapshot (or copy) is taken at the beginning of each epoch for the next consensus round. ValidatorInfo contains the identity information of each VN participating in consensus. As a result, on each epoch change in Diem (i.e., reconfiguration), the VNs participating in that epoch will have their consensus public key frozen for the duration of that epoch, and all other VNs in the network will expect that VN to use the published consensus key in ValidatorInfo. This means that consensus key rotations will only take effect on the next reconfiguration (i.e., when a new snapshot of ValidatorConfig is copied into ValidatorInfo). The snippet below shows ValidatorInfo:
```rust
/// After executing a special transaction that indicates a change to the next epoch, consensus
/// and networking get the new list of validators, their keys, and their voting power. Consensus
/// has a public key to validate signed messages and networking will have public identity
/// keys for creating secure channels of communication between validators. The validators and
/// their public keys and voting power may or may not change between epochs.
pub struct ValidatorInfo {
    // The validator's account address. AccountAddresses are initially derived from the account
    // auth pubkey; however, the auth key can be rotated, so one should not rely on this
    // initial property.
    account_address: AccountAddress,
    // Voting power of this validator
    consensus_voting_power: u64,
    // Validator config
    config: ValidatorConfig,
    // The time of the last reconfiguration invoked by this validator,
    // in microseconds
    last_config_update_time: u64,
}
```
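The snapshot semantics described above can be demonstrated with a minimal sketch. The structs below are illustrative mirrors of the on-chain types (field types simplified to keep the example self-contained), and `start_epoch` stands in for the reconfiguration logic that copies ValidatorConfig into ValidatorInfo.

```rust
// Illustrative mirrors of the on-chain structs; field types simplified.
#[derive(Clone, Debug)]
struct ValidatorConfig {
    consensus_public_key: u64, // stand-in for Ed25519PublicKey
}

#[derive(Debug)]
struct ValidatorInfo {
    config: ValidatorConfig, // snapshot taken at the epoch boundary
}

// At each reconfiguration, the current ValidatorConfig is copied into ValidatorInfo.
fn start_epoch(config: &ValidatorConfig) -> ValidatorInfo {
    ValidatorInfo { config: config.clone() }
}

fn main() {
    let mut config = ValidatorConfig { consensus_public_key: 1 };
    let info = start_epoch(&config);

    // A rotation mid-epoch updates the live config, but the frozen snapshot
    // (and hence the key other VNs expect) is unchanged for this epoch...
    config.consensus_public_key = 2;
    assert_eq!(info.config.consensus_public_key, 1);

    // ...and the rotation only takes effect at the next reconfiguration.
    let next_info = start_epoch(&config);
    assert_eq!(next_info.config.consensus_public_key, 2);
}
```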
As can be seen in the snippet above, every ValidatorInfo contains a copy of the most up-to-date ValidatorConfig published on the blockchain before the epoch change. This contains the identity information of each VN during the current epoch, including the consensus key published by that VN.
As discussed in the overview section above, the KM must be able to communicate with the Diem blockchain in order to read and update the consensus key of each VN registered on-chain. To achieve this, the KM uses the JSON RPC API offered by each VN endpoint. To execute all steps in the KM rotation and failure recovery protocols, the KM requires the API to provide the following list of functionalities:
- Read the ValidatorConfig published on-chain for each specific VN. This is needed for failure recovery, for example, when the KM needs to determine if the current consensus key registered on-chain matches the consensus key held in the SS.
- Read the ValidatorInfo constructed at the beginning of each epoch. This is needed for failure recovery of LSR. For example, if LSR crashes and recovers, it will need to read the ValidatorInfo constructed for the VN in this epoch in order to determine which version of the consensus key each vote should be signed with.

To support all of the functionalities listed above, the JSON RPC endpoint provides the following API. For brevity, we only list the API calls used by the KM.
```rust
/// Returns the associated AccountStateWithProof for a specific account. This method returns the
/// AccountStateWithProof at the height currently synced to by the server. To ensure the
/// correct AccountStateWithProof is returned, the caller should verify the account state proof.
pub fn get_account_state_with_proof(&mut self, account: AccountAddress) -> Result<AccountStateWithProof, Error>;

/// Submits a signed transaction to the Diem blockchain via the JSON RPC API.
pub fn submit(signed_transaction: SignedTransaction) -> Result<(), Error>;
```
As can be seen from the snippet above, the JSON RPC API supports two calls, get_account_state_with_proof(..) and submit(..). These are used by the KM to achieve the required functionality in the following manner:
- The get_account_state_with_proof(..) call returns an associated AccountStateWithProof for a specified AccountAddress. This call can be used by the KM to: (a) retrieve the ValidatorConfig of a specific VN, by passing in the address of the VN to query (i.e., get_account_state_with_proof(vn_address)); and (b) retrieve the ValidatorInfo of a VN for the current or next epoch. To achieve this, the KM can pass in the validator_set_address, an account address specified by Diem to hold specific information about the current VN set (i.e., get_account_state_with_proof(validator_set_address)). For further information about what the KM should do with the associated AccountProof returned for each call, see the security considerations section below.
- The submit(..) call provides the ability to submit a signed transaction to the blockchain. As such, the KM can call this method with the signed rotation transaction to perform a rotation (i.e., submit(signed_rotation_transaction)).

As outlined in the overview section above, the KM is a stand-alone service that operates autonomously in a controlled execution loop. In this section, we present the entry point into the KM and the controlled execution loop. Where appropriate, we present relevant data structures.
The KM operates with a single point of entry: a main(..) execution function. To initialize the KM and invoke execution using this function, the KM requires specific configuration information. The code snippet below shows the information required by the KM on startup (i.e., the KeyManagerConfig):
```rust
pub struct KeyManagerConfig {
    /// Key Manager execution specific constants
    pub rotation_period_secs: u64,
    pub sleep_period_secs: u64,
    pub txn_expiration_secs: u64,

    /// External component dependencies
    pub json_rpc_endpoint: String,
    pub chain_id: ChainId,
    pub secure_backend: SecureBackend,
}
```
As can be seen in the code snippet above, the KM requires three execution specific constants at startup:
- rotation_period_secs: First, the KM requires knowing the period of time between each consensus key rotation (in seconds). This specifies how frequently the KM will perform a rotation. For example, if this is set to 3600 seconds, the KM will rotate the consensus key every hour.
- sleep_period_secs: Second, as the KM is designed to run autonomously in a controlled execution loop, the KM requires knowing how long to sleep between executions (in seconds). This prevents the KM from busy waiting when there is no work to be done and reduces execution load on the machine.
- txn_expiration_secs: Finally, the KM needs to know how long each rotation transaction it creates should be valid for (i.e., the transaction expiration time of each rotation transaction). This prevents transactions from being valid at all points in the future, creating the potential for security vulnerabilities (e.g., replay attacks). If a rotation transaction has been submitted to the blockchain (using the submit() JSON RPC API call) but has not ultimately been written to the blockchain within this time, the KM will need to construct a new transaction.

Moreover, the KM requires configuration information about how to communicate with the external components it relies on. These are:
- json_rpc_endpoint and chain_id: First, the KM requires knowing about the JSON RPC endpoint that it should talk to (i.e., when reading and writing transactions to the blockchain). json_rpc_endpoint follows a url format, for example, https://123.123.123.123:8080.
- secure_backend: Second, the KM requires knowing about the SS with which it can communicate. This includes the connection credentials to use, the url location of the SS, any supported API versions, etc.
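Putting the constants together, the sketch below constructs a simplified mirror of KeyManagerConfig. The concrete values are illustrative only (rotate hourly, poll every minute, expire rotation transactions after five minutes), and the ChainId and SecureBackend types are replaced by strings so the example is self-contained.

```rust
// Simplified, illustrative mirror of KeyManagerConfig. The real struct uses
// ChainId and SecureBackend types; strings stand in for them here.
struct KeyManagerConfig {
    rotation_period_secs: u64,
    sleep_period_secs: u64,
    txn_expiration_secs: u64,
    json_rpc_endpoint: String,
    chain_id: String,
    secure_backend: String,
}

fn example_config() -> KeyManagerConfig {
    KeyManagerConfig {
        rotation_period_secs: 3600, // rotate the consensus key every hour
        sleep_period_secs: 60,      // wake up once a minute between executions
        txn_expiration_secs: 300,   // rotation transactions expire after 5 minutes
        json_rpc_endpoint: "https://123.123.123.123:8080".to_string(),
        chain_id: "TESTING".to_string(),   // illustrative chain identifier
        secure_backend: "vault".to_string(), // illustrative SS backend name
    }
}

fn main() {
    let config = example_config();

    // Sanity checks one might apply at startup: the KM should wake up more
    // often than it rotates, and rotation transactions should expire quickly
    // relative to the rotation period.
    assert!(config.sleep_period_secs < config.rotation_period_secs);
    assert!(config.txn_expiration_secs < config.rotation_period_secs);
}
```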
Execution Loop: Once the KM has been correctly initialized using a valid KeyManagerConfig, the KM follows a single execution loop that obeys the rotation and failure recovery protocols outlined in the overview section above.
At a high-level, this means that the KM follows these simple steps in an infinite loop:
1. Check that the consensus key held in the SS matches the consensus key registered in the on-chain ValidatorConfig. If not, follow the failure recovery protocol above.
2. Otherwise, sleep until the next rotation is due (i.e., rotation_period_secs), perform a consensus key rotation via rotation_period_secs and then return to step 1.

In this section, we discuss several interesting security considerations that affect the safety and liveness of the KM and/or each VN.
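A single iteration of this loop can be sketched as a pure decision function, shown below. The `KeyState` enum and `execute_once` helper are illustrative, not taken from the real KM implementation; they simply encode the branch structure of the loop.

```rust
// Sketch of one iteration of the KM execution loop (components mocked out).
// `KeyState` and `execute_once` are illustrative names, not the real KM API.
#[derive(PartialEq)]
enum KeyState {
    Consistent,   // SS and on-chain ValidatorConfig agree
    Inconsistent, // the failure recovery protocol is needed
}

fn execute_once(
    state: KeyState,
    secs_since_rotation: u64,
    rotation_period_secs: u64,
) -> &'static str {
    if state == KeyState::Inconsistent {
        "recover" // follow the failure recovery protocol
    } else if secs_since_rotation >= rotation_period_secs {
        "rotate" // the rotation period has elapsed: rotate the consensus key
    } else {
        "sleep" // nothing to do: sleep and loop again
    }
}

fn main() {
    // An inconsistency always triggers recovery, regardless of timing.
    assert_eq!(execute_once(KeyState::Inconsistent, 0, 3600), "recover");

    // With consistent state, rotate only once the period has elapsed.
    assert_eq!(execute_once(KeyState::Consistent, 4000, 3600), "rotate");
    assert_eq!(execute_once(KeyState::Consistent, 10, 3600), "sleep");
}
```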
Proving Consensus Key Ownership: As discussed above, each VN will announce its consensus public key on-chain. This occurs by essentially updating the mapping between each operator_key and consensus_key for the VN. However, to ensure that this occurs securely, it is essential that whenever such an update occurs on-chain, ownership of the key being published is proven. Otherwise, spoofing attacks could occur. For example, consider the case where a malicious or Byzantine VN publishes a transaction on-chain that updates their consensus key to be the same consensus key owned by another VN (i.e., the malicious VN spoofs a consensus key owned by another VN). In this case, there will be two operator keys mapping to the same consensus key. This may allow the malicious VN to benefit financially by appearing to participate in a consensus round, despite not actually doing any work. To avoid this type of attack (and other, less obvious attacks), the validator config on-chain should require a signature from the consensus key when a validator config update occurs.
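The proof-of-possession requirement can be illustrated with a toy check, sketched below. Real Ed25519 signatures are replaced by a trivial scheme (and the key pair is collapsed into a single number), so this shows only the shape of the check, not a secure implementation.

```rust
// Toy signature scheme standing in for Ed25519: sig = key + msg (NOT secure).
fn sign(private_key: u64, msg: u64) -> u64 {
    private_key.wrapping_add(msg)
}

fn verify(public_key: u64, msg: u64, sig: u64) -> bool {
    sig == public_key.wrapping_add(msg)
}

// A ValidatorConfig update is accepted only if it carries a valid signature
// from the consensus key being published (a proof of possession).
fn accept_config_update(new_consensus_key: u64, update_msg: u64, pop_sig: u64) -> bool {
    verify(new_consensus_key, update_msg, pop_sig)
}

fn main() {
    let (victim_key, msg) = (42, 7);

    // The legitimate owner can produce the proof-of-possession signature...
    assert!(accept_config_update(victim_key, msg, sign(victim_key, msg)));

    // ...but a spoofing VN that merely copies the public key cannot, so the
    // update registering someone else's consensus key is rejected.
    assert!(!accept_config_update(victim_key, msg, 0));
}
```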
Avoiding Exported Key Copies: At present, each VN may contain components that manage exported key copies locally, for example, LSR, which stores a local copy of the consensus key in memory in order to sign votes. While this may be beneficial for performance reasons (i.e., LSR doesn't need to contact the SS to perform a signature on each vote), it does lead to concerns about key compromise: if LSR is less secure than the SS, an adversary will simply target the key copy in LSR to compromise the key. Thus nullifying any defenses put into place to protect the consensus key in the SS. To avoid this, the use of exported key copies is heavily discouraged, and any components that maintain exported key copies should have significant justifications for doing so. Moreover, such components should be strictly audited for security vulnerabilities, as they will likely become the target of adversaries when deployed.
Enforcing Periodic Key Rotation: Key rotation is an attractive means of protecting against various types of attacks on each VN. However, without strictly enforcing key rotation within the network (i.e., making it mandatory), it is possible that VNs will not perform key rotation frequently enough to see any security benefits (e.g., due to being lazy or wanting to avoid the financial or operational costs of performing a rotation). To avoid this, it is essential that VNs that do not rotate their keys frequently enough are disincentivized and/or penalized for their actions. One way to achieve this is to exclude lazy VNs from the set of possible VNs that may participate in the Diem consensus algorithm. Such exclusions may then be lifted once the VNs meet the key rotation requirements. This will help to protect the Diem network against old or compromised keys.
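One way the exclusion mechanism above could work is sketched below. This is a hypothetical eligibility filter, not part of Diem: validators whose last rotation is older than a configured maximum age are dropped from the candidate consensus set until they rotate again.

```rust
// Hypothetical eligibility filter: exclude any validator whose consensus key
// has not been rotated within `max_age_secs`. Field names are illustrative.
struct Validator {
    name: &'static str,
    secs_since_last_rotation: u64,
}

fn eligible<'a>(validators: &'a [Validator], max_age_secs: u64) -> Vec<&'a str> {
    validators
        .iter()
        .filter(|v| v.secs_since_last_rotation <= max_age_secs)
        .map(|v| v.name)
        .collect()
}

fn main() {
    let validators = [
        Validator { name: "vn_a", secs_since_last_rotation: 1_000 },
        Validator { name: "vn_b", secs_since_last_rotation: 90_000 },
    ];

    // With a 24-hour (86,400 second) rotation requirement, the lazy vn_b is
    // excluded from consensus until it rotates its key again.
    assert_eq!(eligible(&validators, 86_400), vec!["vn_a"]);
}
```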
Verifying vs. Non-verifying the JSON RPC Endpoint: As discussed above, in order for the KM to perform key rotations, it must be able to read and write transactions to the blockchain. The KM does this internally by interfacing with a JSON RPC endpoint. However, one notable challenge arises when performing such interaction: how does the KM know that the information supplied by the endpoint is correct and up-to-date? If an attacker can compromise the JSON RPC endpoint, it may feed the KM stale information or incorrect blockchain state (e.g., transactions). As a result, this could cause a liveness violation for the VN (e.g., the KM could fail to see that the VN does not have the correct consensus key registered on the blockchain, and thus the VN will be unable to participate in consensus). To defend against this, it is critical that the KM verifies all information returned via the JSON RPC endpoint (e.g., the KM should verify proofs of account state and ensure that the height of the blockchain is monotonically increasing). This will help to reduce the attack surface against the KM. Note, however, that while the KM can verify the blockchain is growing sequentially, it is unable to verify that the information being presented to it is fresh (i.e., that the responses returned by the JSON RPC endpoint hold the most up-to-date information). As such, without querying and aggregating the responses of multiple different JSON RPC endpoints, there is always some level of implicit trust between the KM and the endpoint it relies on.
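The monotonicity check mentioned above can be sketched as follows. The `EndpointVerifier` name is illustrative; proof verification is omitted, and as the text notes, this check detects rollbacks but cannot prove a response is fully up-to-date.

```rust
// Sketch of one freshness defence the KM can apply to JSON RPC responses:
// reject any response whose blockchain version is lower than one already seen.
// (`EndpointVerifier` is an illustrative name, not part of the real KM.)
struct EndpointVerifier {
    highest_seen_version: u64,
}

impl EndpointVerifier {
    fn new() -> Self {
        EndpointVerifier { highest_seen_version: 0 }
    }

    fn check_response_version(&mut self, version: u64) -> Result<(), String> {
        if version < self.highest_seen_version {
            // The endpoint has presented state older than previously observed:
            // either it is stale, compromised, or serving a forked history.
            return Err(format!(
                "stale response: version {} < {}",
                version, self.highest_seen_version
            ));
        }
        self.highest_seen_version = version;
        Ok(())
    }
}

fn main() {
    let mut verifier = EndpointVerifier::new();

    // Monotonically increasing versions are accepted...
    assert!(verifier.check_response_version(100).is_ok());
    assert!(verifier.check_response_version(150).is_ok());

    // ...but a response behind the last observed version is rejected.
    assert!(verifier.check_response_version(120).is_err());
}
```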
Minimizing the size of the KM and the corresponding TCB: At present, the KM communicates with the JSON RPC endpoint via a JSON RPC client that verifies API responses. To achieve this, the KM implementation handles and parses both HTTP and JSON responses internally. From a security perspective, however, this is less than ideal. The KM forms part of the trusted computing base (TCB) of each VN (read more about the TCB here). As a result, any and all code placed into the KM must be free from software bugs and vulnerabilities. It is critical, therefore, to reduce the amount of code placed into the KM, as this in turn reduces the attack surface of the KM and thus the TCB. We therefore argue that it is more appropriate for the KM to delegate communication with the JSON RPC endpoint to another external component, and only handle the structured API responses and proofs directly. This will prevent the KM from having to perform complex and unnecessary operations locally (e.g., parsing JSON, serializing/deserializing objects). This reduces the size of the TCB and makes the implementation easier to reason about and verify.