rfd/0025-hsm.md
Support HSM devices for CA private key storage.
In simple terms, HSMs are hardware devices that store private keys and expose only cryptographic operations using them (like sign/verify and encrypt/decrypt). HSMs provide stronger protections for storing private keys compared to disks or remote databases. Private keys stored in the HSM can be non-exportable, meaning they will never leave the device. HSMs are also required by some regulation regimes, like FIPS and PCI.
First, some background on HSMs and Teleport CAs.
It's important to understand that HSMs are a product category (like "storage devices" or "monitors") and not a specific standard (like NVMe, SATA or HDMI). There are commonalities between HSM devices, but also many vendor-specific variations, extensions and features.
That said, most HSM devices support the following minimum set of operations that Teleport needs:
Some devices support other interesting operations, but they are not universal:
By far, the most ubiquitous interface to HSM devices is PKCS#11. It is often
despised because of complexity and usage difficulty. PKCS#11 works by loading a
module (.so library) provided by the vendor, which has a known C API. To
cover the largest number of devices, we have to support PKCS#11.
Cloud HSM (and KMS) products are also available from the biggest cloud providers, usually over a custom API. AWS CloudHSM is an exception - it mounts a real HSM over PKCS#11. Since cloud HSM could become a popular request, we should design with the assumption of multiple HSM interfaces. But we will not implement their support initially.
Each HSM vendor typically also provides a custom SDK or library for their specific product. We should not support these in favor of the standard PKCS#11 interface.
yubihsm-connector is an interesting HTTP-to-USB proxy for the device. It allows a single HSM to be shared by multiple clients. It doesn't seem to do authentication though and only works for YubiHSM 2.
Despite popular belief, HSMs don't guarantee that private keys can't be stolen. Using key wrapping and special attributes on private keys, it's possible to copy keys created on one device to a different one or extract them in plaintext. Yubico even provides CLI tools for this, specific to YubiHSM.
This may be used to distribute a single CA private key to multiple HSMs over the network. However, it significantly weakens the security properties of using an HSM in the first place.
Each Teleport cluster has multiple CAs:
Private keys in each CA are stored as a list to enable CA rotation. For example, when rotating a Host CA, we shuffle the key fields of the existing CA object instead of creating a brand new backend object. The expectation of only 1 CA of each type is deeply integrated in Teleport.
Each Auth server may have an HSM device available, configured in
teleport.yaml. Auth servers create and manage unique CA keys in their local
HSMs, distinct from the keys on other Auth servers. Each Auth server encodes
its local HSM information in the private key field of the CA backend object
(instead of the PEM-encoded raw private key).
CA storage schema has to support multiple active private keys. Auth servers decide on which key to use based on their local configuration.
CA rotation also needs to update, to support multiple active key pairs and multiple trusted key pairs.
Implementing key creation in the Auth server (instead of asking users to manually create keys) gives us the flexibility to change key algorithms in the future. Support for multiple private keys sets us up for CA backup/restore/migration and full federation in the future.
There are several PKCS#11 libraries available, we'll use crypto11. This library allows both key creation and usage, and is a nice usable abstraction over the raw PKCS#11 protocol.
We could use miekg/pkcs11
directly, which represents the PKCS#11 API as closely as possible. But we'd end
up with a lot of verbose code, essentially reimplementing most of crypto11
(which is based on miekg/pkcs11 anyway).
Current CA rotation logic operates on one CA type at a time and performs a phased rotation to give clients and servers a chance to fetch updated CA certs list and reissue their credentials with the new CA.
The rotation mechanism switches new and old CA keys to be "signing" key and "trusted" keys in different phases:
CertificateAuthorityV2.Spec.TLSKeyPairs[0]) and is used to sign any new
certificatesThis index-based approach will break down with multiple active CA keys (in different HSMs on different Auth servers). Instead, we'll update the storage schema to support multiple "signing" and "trusted" keys for each CA.
Clients will trust all "signing" + "trusted" CA certs for validation. Auth servers will use their corresponding HSM key for signing (see below).
The current backend schema stores CA keys as lists:
message CertAuthoritySpecV2 {
// Note: some fields are omitted below.
// SSH keys. CheckingKeys are public keys, SigningKeys are private keys.
repeated bytes CheckingKeys = 3 [ (gogoproto.jsontag) = "checking_keys,omitempty" ];
repeated bytes SigningKeys = 4 [ (gogoproto.jsontag) = "signing_keys,omitempty" ];
// TLS keys and certs.
repeated TLSKeyPair TLSKeyPairs = 7 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "tls_key_pairs,omitempty" ];
// JWT keys.
repeated JWTKeyPair JWTKeyPairs = 10 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "jwt_key_pairs,omitempty" ];
}
This looks like it supports multiple private keys, but in practice only the
first item of each list is used for signing. Since we can't add a bool signing flag to SSH keys (because they are byte slices), there isn't an
incremental way to mark multiple signing keys.
I propose revamping the schema entirely and doing a migration to move existing data:
message CertAuthoritySpecV2 {
// Note: some fields are omitted below.
// Deprecated fields for backwards-compatibility.
// Delete in teleport v8.
repeated bytes CheckingKeys = 3 [ (gogoproto.jsontag) = "checking_keys,omitempty" ];
repeated bytes SigningKeys = 4 [ (gogoproto.jsontag) = "signing_keys,omitempty" ];
repeated TLSKeyPair TLSKeyPairs = 7 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "tls_key_pairs,omitempty" ];
repeated JWTKeyPair JWTKeyPairs = 10 [ (gogoproto.nullable) = false, (gogoproto.jsontag) = "jwt_key_pairs,omitempty" ];
// New fields in teleport v7.
CAKeySet ActiveKeys = 11;
// Note: this field is called AdditionalTrustedKeys instead of just
// TrustedKeys to make it clear that these are not the only keys used to
// validate certificates.
// Callers must validate using ActiveKeys + AdditionalTrustedKeys.
CAKeySet AdditionalTrustedKeys = 12;
}
message CAKeySet {
repeated SSHKeyPair SSH = 1;
repeated TLSKeyPair TLS = 2;
repeated JWTKeyPair JWT = 3;
}
message SSHKeyPair {
bytes PublicKey = 1;
bytes PrivateKey = 2;
PrivateKeyType PrivateKeyType = 3;
}
// TLSKeyPair already exists.
message TLSKeyPair {
bytes Cert = 1;
bytes Key = 2;
// New field.
PrivateKeyType KeyType = 3;
}
// JWTKeyPair already exists.
message JWTKeyPair {
bytes PublicKey = 1;
bytes PrivateKey = 2;
// New field.
PrivateKeyType PrivateKeyType = 3;
}
enum PrivateKeyType {
RAW = 0;
PKCS11 = 1;
}
Clients will use a concatenation of both CAKeySets. Auth servers will use
ActiveKeys to sign new certificates. CA rotation can now swap ActiveKeys
and AdditionalTrustedKeys instead of relying on indexes.
When PrivateKeyType is set to PKCS11, the PrivateKey/Key fields contain
encoded information about the PKCS#11 key, as a JSON object prefixed with
pkcs11::
pkcs11:{"host_id": "host-uuid", "key_id": "key-uuid"}
Where host-uuid is the UUID of the Auth server that created the key and
key_id is the UUID of the key set in the CKA_ID attribute on the HSM.
Auth servers, when signing, will use:
ActiveKeys PKCS#11 key that belongs to
their host (matched via host_id field)ActiveKeysAuth servers need PKCS#11 configuration, passed in via teleport.yaml:
auth_service:
ca_key_params:
pkcs11:
# Path to PKCS#11 module from the HSM vendor.
module_path: "/path/to/pkcs11.so"
# CKA_LABEL of the hSM, set during HSM initialization.
label: "myhsm"
# PIN for connecting to the HSM, set during HSM initialization.
# This can also be a path to a file containing the PIN.
pin: "password"
If pkcs11 field is not set, Auth server will start with plaintext keys stored
in the backend. If Auth server fails to connect to the token at startup with
these parameters, the process will exit.
The fields from PKCS#11 configuration will also be added to the auth_server
API resource as read-only.
Migration of existing clusters to use HSMs will involve a CA rotation. CA rotation is needed to propagate the CA certificates associated with the new private key to all users and hosts. Without the rotation, all certificates issued by this Auth server will not be trusted by any client.
On startup, Auth server will fetch existing CAs from the backend and:
ActiveKeys
ActiveKeys, create a key in HSM and register it in AdditionalTrustedKeys
tctl status until a CA rotation is
completed./var/lib/teleport/host_uuid to explicitly change the UUID and
dis-associate the old HSM key from this instance; from there the Auth server
behaves as above when registering a new keyWhen rotation is initiated through a different Auth server, Auth servers with
HSMs will add a new PKCS#11 key reference to CA AdditionalTrustedKeys in the
init phase of rotation. Admins should not perform a rotation with
--grace_period=0 (or any very short value, less than 1 minute) so that all auth
servers have a chance to add their keys. A message will be logged when this is
completed.
/var/lib/teleport/host_uuid to explicitly change
the UUID and dis-associate the old HSM key from this instance; from there
the Auth server behaves as if it is brand new.Using HSMs is inherently more fragile than using plaintext keys. Auth servers become much less dynamic and are tied to a specific machine with an HSM device. CA rotation becomes more complex, with new failure modes.
This design attempts to automate as much HSM management as possible, but still relies on human operators to:
teleport.yaml on Auth servers to ensure correct PKCS#11
configurationtctl status to ensure the new HSM keys are in useHSM support should support any PKCS#11-compatible device, but will be tested with: