docs/RFCS/20170319_certificate_rotation.md
This RFC proposes changes to certificate and key management.
The main goals of this RFC are:
Out of scope for this RFC:
Anytime certificates and keys are used, we must allow for rotation, either due to security concerns or standard expiration.
This RFC is concerned with rotation due to expiration, meaning we do not consider certificate revocation.
Our current certificate use allows for a single CA certificate and a single server certificate/private key pair, with node restarts being required to update any of them.
We wish to be able to push new certificates to nodes and use them without restart or connection termination.
CA and node certificates need radically different lifetimes to allow new node certificates to be rolled out rapidly without waiting for CA propagation to all clients and nodes.
We propose the following defaults:
This may not always be appropriate. See "unresolved questions".
To provide enough warning of potential problems, each node should record and export the start/end validity timestamp for each type of certificate:
- node

If two certificates of the same type are present (eg: two CA certificates), report the timestamps for the certificate with the latest expiration date.
With such metrics, we can now alert on expiring certificates, either with a fixed lifetime remaining or when a fraction of the lifetime has expired.
It may be preferable to monitor certificate chains rather than individual certificates (see Known Issues below) and report validity dates for the latest valid chain.
The location of certificates and keys can be specified using the --certs-dir command-line
flag or the corresponding COCKROACH_CERTS_DIR environment variable.
The flag value is a relative directory, with a default value of ~/.cockroach-certs.
We avoid using the cockroach-data directory for a few reasons:
All files within the directory are examined, but sub-directories are not traversed.
We propose the following naming scheme:
<prefix>[.<middle>].<extension>
<prefix> determines the role:
- ca: certificate authority certificate/key.
- node: node combined client/server certificate/key.
- client: client certificate/key.

<middle> is required for client certificates, where <middle> is the name
of the client as specified in the certificate Common Name (eg: client.marc.crt).
For other certificate types, this may be used to differentiate between multiple versions of a similar
certificate/key. See "unresolved questions".
<extension> determines the type of file:
- crt files are certificates
- key files are keys

The only check is for the key to be read-write by the current user only (maximum permissions of -rwx------).
We need to provide admins with a way to disable permissions checks, due both to incompatible
certificate deployment methods, and incompatible filesystems/architectures. An environment variable
COCKROACH_SKIP_CERTS_PERMISSION_CHECKS with a stern warning should be sufficient.
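A minimal sketch of how the naming scheme and the key permission check might be applied to a directory entry. The type `certFile` and the functions `parseCertFileName` and `keyPermissionsOK` are hypothetical names used for illustration; the environment-variable override is not modeled here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// certFile describes a file name of the form <prefix>[.<middle>].<extension>.
type certFile struct {
	Role   string // "ca", "node" or "client"
	Middle string // client name for client files, optional otherwise
	IsKey  bool   // true for .key files, false for .crt files
}

// parseCertFileName splits a file name according to the proposed scheme and
// rejects names that do not match it.
func parseCertFileName(name string) (certFile, error) {
	parts := strings.Split(name, ".")
	if len(parts) < 2 {
		return certFile{}, fmt.Errorf("%q: expected <prefix>[.<middle>].<extension>", name)
	}
	prefix, ext := parts[0], parts[len(parts)-1]
	middle := strings.Join(parts[1:len(parts)-1], ".")

	f := certFile{Role: prefix, Middle: middle}
	switch prefix {
	case "ca", "node", "client":
	default:
		return certFile{}, fmt.Errorf("%q: unknown prefix %q", name, prefix)
	}
	switch ext {
	case "crt":
	case "key":
		f.IsKey = true
	default:
		return certFile{}, fmt.Errorf("%q: unknown extension %q", name, ext)
	}
	if f.Role == "client" && middle == "" {
		return certFile{}, fmt.Errorf("%q: client files require a <middle> component", name)
	}
	return f, nil
}

// keyPermissionsOK reports whether mode grants nothing to group or others,
// that is, whether it is at most -rwx------.
func keyPermissionsOK(mode os.FileMode) bool {
	return mode.Perm()&0077 == 0
}

func main() {
	for _, name := range []string{"ca.crt", "node.key", "client.marc.crt", "client.crt"} {
		f, err := parseCertFileName(name)
		fmt.Println(name, f, err)
	}
	fmt.Println(keyPermissionsOK(0600), keyPermissionsOK(0644))
}
```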
Initial CA creation involves creating the certificate and private key.
- ca.crt: the CA certificate, valid 5 years. Provided to all nodes and clients.
- ca.key: the CA private key. Must be kept safe and not distributed to nodes and clients.

CA renewal involves creating a new certificate using either the same or a new private key. All valid CA certificates need to be kept, as well as their corresponding keys:

- ca.crt (optionally removing expired certificates along the way).

The updated ca.crt file must be communicated to all nodes and clients.
The ca.key must still be kept safe and not distributed to nodes and clients.
When signing node/client certificates, if multiple CA certificates are found inside ca.crt, the
certificate matching the private key will be used. If multiple such certificates exist, the one with
the latest expiration date will be used.
The trusted machine holding the CA private key generates node certificates and private keys, then pushes them to the appropriate nodes. Keys are not kept by the trusted machine once deployed on the nodes.
Generated node/client certificates have a shorter default lifetime than CA certificates (see "Certificate Expiration" section). Furthermore, their expiration date cannot exceed the CA certificate expiration.
Upon renewal, certificates and keys are fully re-generated, with no attempt to re-use the private node/client key. Filenames for node/client certificates and keys can remain the same as before, or be new files.
Running nodes can be told to re-read the certificate directory by issuing a SIGHUP to the process.
Since we cannot control when nodes may be restarted, it is important to keep the reload process identical to the initial load.
Node certificates must be checked for validity. Specifically:
- Not Before < time.Now() < Not After

The last condition is an attempt to avoid loading a certificate that may not be verifiable by other nodes or clients. If we do not have the right CA, chances are someone else does not either.
We may need to set a timer for certificates that have not reached their Not Before date, otherwise
we would need to trigger a second refresh.
A good description of online key rotation in Go can be found in "Hitless TLS Certificate Rotation in Go".
Adding or swapping certificates can be done in multiple ways:
- tls.Config object

The tls.Config object is specified at connection time and cannot be modified afterwards, as it
is not safe for concurrent use.
A node needs to maintain two tls.Config objects, one for server-side connections, one for client-side connections. A new config can be constructed upon reload, then reused for all subsequent connections.
The server-side tls.Config object can be specified for each client connection by implementing
the tls.Config.GetConfigForClient callback. This should return the most recent tls.Config object.
Root CAs for server and client certificate verification are in tls.Config.RootCAs and tls.Config.ClientCAs respectively. We should add all detected CA certificates to both pools.
The node and client certificates are set in tls.Config.Certificates.
If more than one node certificate is present, the one matching the requested ServerName is presented.
We should set only one certificate in tls.Config.Certificates.
We will need to modify all commands that use certs to make use of the new directory structure.
We will also need:
- cert create-ca to use an existing key.
- cert list to list all CA certs and node/client cert/key pairs.

We want at least a barebones listing of all certs on a given node, including validity dates, certificate chain (corresponding CA for a node cert), and valid hosts.
Soon-to-expire certificates (or chains) must be reported prominently on the admin UI and available through external metrics.
Separate .crt and .key files are expected by libpq, but other libraries/languages may have different ways of specifying/packaging them.
We need to:
- cockroach cert commands.

Some additional methods to trigger a reload can later be introduced:
We have no way of knowing which certificate authority a client has, so we cannot monitor for clients not yet aware of a new CA certificate.
We could examine client certificates and report soon-to-expire ones. This will not help with CA knowledge, but would provide better visibility into user authentication issues.
tls.Config fields

We need to verify that the proposed CA and node cert rotation mechanisms work, especially through grpc.
Since everything uses tls.Config, implementing tls.Config.GetConfigForClient to rotate
the config on the server should be sufficient.
However, we need to ensure that all client-side connections are able to use the new config when initiating a connection.
Renegotiation may cause new certificates to be presented. We need to make sure this will not cause issues.
The tls.Config comments also mention this happening in TLS 1.3:
> GetClientCertificate may be called multiple times for the same
> connection if renegotiation occurs or if TLS 1.3 is in use.
The Go lib/pq will use all certificates found in ca.crt, but this may not be the
case for other libraries.
It may be safer to keep a single CA certificate per file.
Instead of the --certs-dir flag being a single directory, we could allow specification of
multiple directories. This would be useful to separate CA certificates from other certs.
Is checking for -rwx------ on keys sufficient? A more stringent check would be similar
to what the ssh client does (strict directory/file permissions).
Should we allow multiple versions of the same certificate? eg: multiple files matching ca.*.crt or node.*.crt? If so, how do we handle them?
We need to pick some defaults for certificate lifetimes.
The proposed ones may be inappropriate for most users: security-minded admins or those with a good certificate-rotation process in place may want much shorter periods while casual users may want to never be bothered by certificates.
Simply because both a CA certificate and a node certificate are valid does not mean they are both correct. Consider the following scenario:
In this case, as long as the old CA is still valid, we can verify other node certificates. However, as soon as the old CA expires, we will be unable to verify node certificates due to the key change.
We could improve monitoring by analyzing the lifetime data for each certificate chain on each node. This would notice that the old chain expires when the old CA expires, and the new chain contains only a CA certificate, no node certificate.
If we record/export chain information by CA cert serial number (or public key), we can ensure that all nodes have certificate chains rooted at the same CA.
When generating new node or client certs, we may automatically detect multiple valid CA certs in the certificate directory. If two certs have the same key pair, we can pick the one expiring the latest. If the keys differ, we want to throw an error and either ask the user to remove one, or add an additional flag to force the cert (this partially defeats the point of automatic file detection).
Dropping specific cert flags means that the postgres URL will be built automatically from the requested username.
For example, running the following command: cockroach sql -u foo --certs-dir=~/.cockroach-certs will
generate the URL: postgresql://foo@localhost:26257/?sslcert=foo.client.crt&sslkey=foo.client.key&...
If delegation is allowed (user root can act as user foo), the command must be run with the
fully-specified postgres URL postgresql://foo@localhost:26257/?sslcert=root.client.crt&sslkey=root.client.key&...
Delegation remains doable, but with an extra hoop to jump through.
With default values for certificate and key locations, secure mode is now less explicit, relying on the detection of certificates in the default directory. This could be misleading to users.
We have three major options to specify certificate and key files:
This is the current method, all files are specified by their own flags.
Drawbacks:
- ca.crt file requires finding the certificate matching the key. This can be partially avoided by always putting the newest certificate first.

Advantages:

- /etc/..., node certs in user directory).

Command-line flags for filename globs (either per pair, or per file). Optionally, allow specification of a certs directory, with globs matching files within the directory.
Drawbacks:
- node.old.crt and node.new.crt)?
- *.crt glob matches CA/node/client certs), do we just fail?
- ca.crt files.

Advantages:
Automatically determine file types (key vs cert) and cert usage (CA vs node vs client) by analyzing
all files in the certs directory.
Certificate usage can be determined by looking at IsCA or ExtKeyUsage (extended key usage). The keys can be matched
to certificates by comparing algorithms and public keys.
Drawbacks:
Advantages: