docs/RFCS/20171220_encryption_at_rest.md
This is an Enterprise feature.
We propose to add support for encryption at rest on cockroach nodes, with encryption being done at the rocksdb layer for each file.
We provide CTR-mode AES encryption for all files written through rocksdb.
Keys are split into user-provided store keys and dynamically-generated data keys. Store keys are used to encrypt the data keys. Data keys are used to encrypt the actual data. Store keys can be rotated at the user's discretion. Data keys can be rotated automatically on a regular schedule, relying on rocksdb churn to re-encrypt data.
Plaintext files go through the regular rocksdb interface to the filesystem. Encrypted files go through an intermediate layer responsible for all encryption tasks.
Data can be transitioned from plaintext to encrypted and back with status being reported continuously.
Encryption is desired for security reasons (prevent access from other users on the same machine, prevent data leak through drive theft/disposal) as well as regulatory reasons (GDPR, HIPAA, PCI DSS).
Encryption at rest is necessary when other methods of encryption are either not desirable, or not sufficient (eg: filesystem-level encryption cannot be used if DBAs do not have access to filesystem encryption utilities).
The following are not in scope but should not be hindered by implementation of this RFC:
The following are unrelated to encryption-at-rest as currently proposed:
Caveat: this is not a thorough security analysis of the proposed solution, let alone its implementation.
This section should be expanded and studied carefully before this RFC is approved.
The goal of this feature is to block two attack vectors:
An attacker can gain access to the disk after it has been removed from the system (eg: node decommission). At-rest encryption should make all data on the disk useless if the following are true:
Unprivileged users (eg: non-root) should not be able to extract cockroach data even if they have access to the raw rocksdb files. This will still not guard against:
Some of the assumptions here can be verified by runtime checks, but others must be satisfied by the user (see Configuration recommendations).
We assume attackers do not have privileged access on a running system. Specifically:
A big assumption in this document is that attackers do not have write access to the raw files while we are operating: we trust the integrity of the store and data key files as well as all data written on disk.
This includes the case of an attacker removing a disk, modifying it, and re-inserting it into the cluster.
A potential future improvement is to use authenticated encryption to verify the integrity of files on disk. This would add complexity and cost to filesystem-level operations in rocksdb as we would need to read entire files to compute authentication tags.
However, integrity checking can be cheaply used on the data keys file.
We need to generate random values for a few things:
Crypto++ provides OS_GenerateRandomBlock
which can operate in blocking (using /dev/random) or non-blocking (using /dev/urandom) mode.
We would prefer to use better entropy for data keys, but /dev/random is notoriously slow especially
when just starting rocksdb with very little disk/network utilization.
Generating data keys (other than the first one, or when changing encryption ciphers) can be done
in the background so we may be able to use the higher entropy /dev/random.
nonces may be safe to keep generating using the lower-entropy /dev/urandom.
More research must be done into the use of /dev/random in multi-user environments. For example, is it possible
for an attacker to consume /dev/random for long enough that key generation is effectively disabled?
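For illustration only (the RFC's code path is C++ using Crypto++'s `OS_GenerateRandomBlock`), the equivalent non-blocking OS CSPRNG read in Go looks like this; `newRandomKey` is a hypothetical helper, not cockroach code:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newRandomKey pulls key material from the OS CSPRNG without blocking,
// analogous to the non-blocking mode of OS_GenerateRandomBlock.
// This is a sketch, not the proposed implementation.
func newRandomKey(numBytes int) ([]byte, error) {
	key := make([]byte, numBytes)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	return key, nil
}

func main() {
	key, err := newRandomKey(16) // 128-bit key
	if err != nil {
		panic(err)
	}
	fmt.Println(len(key))
}
```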
An important consideration in AES-CTR is making sure we never reuse the same IV for a given key.
The IV has a size of AES::BlockSize, or 128 bits. It is made of two parts: a random 96-bit nonce in the high bits and a 32-bit block counter in the low bits.
This imposes two limits:
* maximum file size: 2^32 128-bit blocks == 64GiB per file
* nonce reuse: the probability of a collision among 2^32 randomly-generated 96-bit nonces is roughly 2^-32

These limits should be sufficient for our needs.
Given a reasonably safe hashing algorithm, exposing the hash of the store keys should not be an issue.
Indeed, finding collisions in sha256 is not currently easier than cracking aes128. Should better collision
methods be found, the hash would still not reveal the key itself.
We need to provide safety for the keys while held in memory. At the C++ level, we can control two aspects:
* mlock (man mlock(2)) on memory holding keys, preventing paging out to disk
* madvise with MADV_DONTDUMP (see man madvise(2) on Linux) to exclude pages from core dumps

There is no equivalent in Go, so the current approach is to avoid loading keys in Go. This can become problematic if we want to reuse the keys to encrypt log files written in Go. No good answer presents itself.
Terminology used in this RFC:
Encryption-at-rest is an optional feature that can be enabled on a per-store basis.
In order to enable encryption on a given store, the user needs two things:
Enabling encryption increases the store version, making downgrade to a binary before encryption impossible.
We identify a few configuration requirements for users to safely use encryption at rest.
TODO: this will need to be fleshed out when writing the docs.
The store key is a symmetric key provided by the user. It has the following properties:
Store keys are stored in raw format in files (one file per key).
eg: to generate a 128-bit key: openssl rand 16 > store.key
Specifying store keys is done through the --enterprise-encryption flag. There are two key fields in this flag:
* key: path to the active store key, or plain for plaintext (default)
* old_key: path to the previous store key, or plain for plaintext (default)

When a new key is specified, we must tell cockroach what the previous active key was through old_key.
Data keys are automatically generated by cockroach. They are stored in the data directory and encrypted with the active store key. Data keys are used to encrypt the actual files inside the data directory.
This two-level approach allows easy rotation of store keys and provides safer encryption of large amounts of data. To rotate the store key, all we need to do is re-encrypt the file containing the data keys, leaving the bulk of the data as is.
Data keys are generated and rotated by cockroach. There are two parameters controlling how data keys behave:
Data keys use AES in CTR mode with the same key size as the store key.

The need for encryption entails a few recommended changes in production configuration:
We add a new flag for CCL binaries. It must be specified for each store we wish to encrypt:

```
--enterprise-encryption=path=<path to store>,key=<path to key file>,old_key=<path to old key>,rotation_period=<duration>
```
The individual fields are:
* path: the path to the data directory of the corresponding store. This must match the path specified in --store
* key: the path to the current encryption key, or plain to use plaintext. default: plain
* old_key: the path to the previous encryption key. Only needed if data was already encrypted
* rotation_period: how often data keys should be rotated. default: 1 week

The flag can be specified multiple times, once for each store.
The encryption flags can specify different encryption states for different stores (eg: one encrypted one plain, different rotation periods).
Turning on encryption for a new store or a store currently in plaintext involves the following:

```shell
# Ensure your key file exists and has valid key data (correct size).
# For example, to generate a key for AES-128:
$ openssl rand 16 > /path/to/cockroach.key

# Specify the enterprise-encryption flag:
$ cockroach start <regular options> \
  --store=/mnt/data \
  --enterprise-encryption=path=/mnt/data,key=/path/to/cockroach.key
```
The node will generate a 128-bit data key, encrypt the list of data keys with the store key, and use AES-128 encryption for all new files.
Examine the logs or node debug pages to see that encryption is now enabled and see its progress.
Given the previous configuration, we can generate a new store key. We must pass the previous key.

```shell
# Create a new 128-bit key.
$ openssl rand 16 > /path/to/cockroach.new.key

# Tell cockroach about the new key, and pass the old key (/path/to/cockroach.key).
$ cockroach start <regular options> \
  --store=/mnt/data \
  --enterprise-encryption=path=/mnt/data,key=/path/to/cockroach.new.key,old_key=/path/to/cockroach.key
```
Examine the logs or node debug pages to see that the new key is now in use. It is now safe to delete the old key file.
We can switch an encrypted store back to plaintext. This is done by using the special value plain in the
key field of the encryption flag. We need to specify the previous encryption key.

```shell
# Instead of a key file, use "plain" as the argument.
# Pass the old key to allow decrypting existing data.
$ cockroach start <regular options> \
  --store=/mnt/data \
  --enterprise-encryption=path=/mnt/data,key=plain,old_key=/path/to/cockroach.new.key
```
Examine the logs or node debug pages to see that the store encryption status is now plaintext. It is now safe to delete the old key file.
Examine logs and debug pages to see progress of data encryption. This may take some time.
The biggest impact of this change on contributors is the fact that all data on a given store must be encrypted.
There are three main categories:
We introduce a new store version to mark switching to stores supporting encryption.
Stores currently use versionBeta20160331. If no encryption flags are specified, we remain at this
version until a "reasonable" time (one or two minor stable releases) has passed.
Specifying the --enterprise-encryption flag increases the version to versionSwitchingEnv. Downgrading to
binaries that do not support this version is not possible.
Rocksdb performs filesystem-level operations through an Env.
This layer can be used to provide different behavior for a number of reasons. For example:
We leverage the Env layer to implement the following behavior:
* stores at versionBeta20160331 continue to use the default Env
* stores at versionSwitchingEnv use the switching env, which hands plaintext files to a default Env and encrypted files to an EncryptedEnv

```
versionBeta20160331:  DefaultEnv
versionSwitchingEnv:  SwitchingEnv --> Encrypted? --no--> DefaultEnv
                                                  --yes-> EncryptedEnv
```
The state of a file (plaintext or encrypted) is stored in a file registry. This records the list of all
encrypted files by filename and is persisted to disk in a file named COCKROACHDB_REGISTRY.
For every file being operated on, the switching env must look up its existing encryption state in the registry (for existing files) or the
desired encryption state (for new files). If the file is plaintext, the operation is passed down to the DefaultEnv.
If the file is encrypted, it is passed down to the EncryptedEnv. For a new file, we must successfully
persist its state in the registry before proceeding with the operation.
Most SwitchingEnv methods will perform something like the following:

```
OpOnFile(filename):
  // Determine whether the file uses encryption (existing files)
  // or encryption is desired (new files).
  if !registry.HasFile(filename):
    useEncryption = lookup desired encryption (from --enterprise-encryption flag)
    add filename to registry
    persist registry to disk; error out on failure
  else:
    useEncryption = get file encryption state from registry

  // Perform the operation through the appropriate Env.
  if useEncryption:
    EncryptedEnv->OpOnFile(filename)
  else:
    DefaultEnv->OpOnFile(filename)
```
The registry may accumulate entries for nonexistent files if a write fails after an entry is added, or if removing an entry fails after a file is deleted. It will also gather entries for files never deleted through rocksdb (eg: archives). We can clean these up with periodic garbage collection.
The registry is a new file containing encryption status information for files written through rocksdb.
This is similar to rocksdb's MANIFEST. We intentionally do not call it manifest to avoid confusion.
It is stored in the base rocksdb directory for the store and written using a write/close/rename method.
It is always operated on through the DefaultEnv.
Encrypted files are always present in the registry. Plaintext files are not registered as we cannot guarantee their presence when operating on an existing store.
Env operations on files will use the registry in different ways. For example, an existing file absent from the registry is plaintext and handled by the DefaultEnv; if the file does not exist, see "create a new file".

The registry is a serialized protocol buffer:
```proto
enum EncryptionRegistryVersion {
  // The only version so far.
  Base = 0;
}

message EncryptionRegistry {
  // version is currently always Base.
  optional EncryptionRegistryVersion version = 1;
  repeated EncryptedFile files = 2;
}

enum EncryptionType {
  // No encryption applied, not used for the registry.
  Plaintext = 0;
  // AES in counter mode.
  AES_CTR = 1;
}

message EncryptedFile {
  optional string filename = 1;
  // The type of encryption applied.
  optional EncryptionType type = 2;
  // Encryption fields. These may move to a separate AES-CTR message.
  // ID (hash) of the key in use, if any.
  optional bytes key_id = 3;
  // Nonce portion of the IV, of size 96 bits (12 bytes) for AES.
  optional bytes nonce = 4;
  // Counter, allowing 2^32 blocks per file, so 64GiB.
  optional uint32 counter = 5;
}
```
The registry contains all information needed to find the encryption key used for a given file and encrypt/decrypt it.
Rocksdb has an EncryptedEnv introduced in PR 2424.
It adds a 4KiB data block at the beginning of each file with a nonce and possibly encrypted extra information.
We opt to use a slightly modified (mostly simplified) version of this encrypted env because the stock EncryptedEnv does not support multiple keys. We will use a modified version of the existing EncryptedEnv without the data prefix.
The encrypted env uses a CipherStream for each file, with the cipher stream containing the necessary
information to perform encryption and decryption (cipher algorithm, key, nonce, and counter).
It also holds a reference to a key manager which can provide the active key and any older keys held.
Two instances of the encrypted env are in use: one backed by the store key manager (to encrypt the data keys file) and one backed by the data key manager (to encrypt all other files).
We introduce two levels of encryption with their corresponding keys: store keys, provided by the user, and data keys, generated by cockroach and persisted in the COCKROACHDB_DATA_KEYS file.
We have three distinct statuses for keys:
Store keys consist of exactly two keys: the active key, and the previous key.
They are stored in separate files containing the raw key data (no encoding).
Specifying the keys in use is done through the encryption flag fields:
* key: path to the active key, or plain for plaintext. If not specified, plain is the default
* old_key: path to the previous key, or plain for plaintext. If not specified, plain is the default

The size of the raw key in the file dictates the cipher variant to use. Keys can be 16, 24, or 32 bytes long, corresponding to AES-128, AES-192, and AES-256 respectively.
Key files are opened in read-only mode by cockroach.
The key manager is responsible for holding all keys used in encryption. It is used by the encrypted env and provides the following interfaces:
* GetActiveKey: returns the currently active key
* GetKey(key hash): returns the key matching the key hash, if any

We identify two types of key managers:
The store key manager holds the current and previous store keys as specified through the --enterprise-encryption
flag.
Since the keys are externally provided, there is no concept of key rotation.
The data key manager holds the dynamically-generated data keys.
Keys are persisted to the COCKROACHDB_DATA_KEYS file using the write/close/rename method and encrypted
through an encrypted env using the store key manager.
The manager periodically generates a new data key (see Rotating data keys), keeps the previously-active key in the list of existing keys, and marks the new key as active.
Keys must be successfully persisted to the COCKROACHDB_DATA_KEYS file before use.
Rotating the store keys consists of specifying:
* key points to a new key file, or plain to switch to plaintext
* old_key points to the key file previously used

Upon starting (or other signal), cockroach decrypts the data keys file and re-encrypts it with the new key. If rotation is done through a flag (as opposed to another signal), this is done before starting rocksdb.
An ID is computed for each key by taking the hash (sha-256) of the raw key. This key ID is stored in plaintext
to indicate which store key is used to decode the data keys file.
Any change in the active store key (actual key or key size) triggers a data key rotation.
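A minimal sketch of the key ID derivation described above (the sha256 hash of the raw key, safe to store in plaintext); `keyID` is a hypothetical helper, not cockroach code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// keyID derives the ID of a key as the sha256 hash of its raw bytes,
// hex-encoded. The hash identifies the key without revealing it.
func keyID(rawKey []byte) string {
	sum := sha256.Sum256(rawKey)
	return hex.EncodeToString(sum[:])
}

func main() {
	key := make([]byte, 16) // a 128-bit all-zero key, for illustration
	fmt.Println(len(keyID(key)))
}
```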
The data keys file is an encoded protocol buffer:

```proto
message DataKeysRegistry {
  // Ordering does not matter.
  repeated DataKey data_keys = 1;
  repeated StoreKey store_keys = 2;
}

// EncryptionType is shared with the registry EncryptionType.
enum EncryptionType {
  // No encryption applied.
  Plaintext = 0;
  // AES in counter mode.
  AES_CTR = 1;
}

// Information about the store key, but not the key itself.
message StoreKey {
  // The ID (hash) of this key.
  optional bytes key_id = 1;
  // Whether this is the active (latest) key.
  optional bool active = 2;
  // First time this key was seen (in seconds since epoch).
  optional int32 creation_time = 3;
}

// Actual data keys and related information.
message DataKey {
  // The ID (hash) of this key.
  optional bytes key_id = 1;
  // Whether this is the active (latest) key.
  optional bool active = 2;
  // The type of encryption (aka cipher) used with this key.
  optional EncryptionType encryption_type = 3;
  // Time at which the key was created (in seconds since epoch).
  optional int32 creation_time = 4;
  // The raw key.
  optional bytes key = 5;
  // True if we ever wrote the data keys file in plaintext.
  optional bool was_exposed = 6;
  // ID of the active store key at creation time.
  optional bytes creator_store_key_id = 7;
}
```
The store_keys field is needed to keep track of store key ages and statuses. We only need to keep the
active key but may keep previous keys for history. It does not store the actual key, only the key hash.
The data_keys field contains all in-use keys (data encrypted with those keys is still live) and all information
needed to determine ciphers, ages, related store keys, etc.
was_exposed indicates whether the key was ever written to disk as plaintext (encryption was disabled at the
store level). This will be surfaced in encryption status reports. Data encrypted by an exposed key is effectively
as insecure as plaintext.
creator_store_key_id is the ID of the active store key when this key was created. This enables, among other things, comparing creator_store_key_id against the active store key ID: a mismatch triggers data key rotation.

To generate a new data key, we look up the following:
* the desired cipher and key size (eg: AES128)

If the cipher is other than plaintext, we generate a key of the desired length using the pseudorandom CryptoPP::OS_GenerateRandomBlock(blocking=false) (see Random number generator for alternatives).
We then generate the following new key entry:
* key_id: the hash (sha256) of the raw key
* was_exposed: set if the active store key is plaintext

Rotation is the act of using a new key as the active encryption key. This can be due to:
When a new key has been generated (see above), we build a temporary list of data keys (using the existing
data keys and the new key).
If the current store key encryption type is plaintext, set was_exposed = true for all data keys.
We write the file with encryption to COCKROACHDB_DATA_KEYS. Upon successful write, we trigger a data key file reload.
We use a write/close/rename method to ensure correct file contents.
Key generation is done inline at startup (we may as well wait for the new key before proceeding), but in the background for automated changes while the system is already running.
We need to report basic information about the current status of encryption.
At the very least, we should have:
With the following information:
We can report the following encryption status:
* plaintext: plaintext data
* AES-<size>: encrypted with AES (one entry for each key size)
* AES-<size> EXPOSED: encrypted, but the data key was exposed at some point

Active key IDs and ciphers are known at all times. We need to log them when they change (indicating successful key rotation) and propagate the information to the Go layer.
Fraction of data encrypted is a bit trickier. We need to:
We can find the list of all in-use files the same way rocksdb's backup does, by calling:
* rocksdb::GetLiveFiles: retrieve the list of all files in the database
* rocksdb::GetSortedWalFiles: retrieve the sorted list of all WAL files

Note: log encryption is currently Out of scope.
All existing uses of local disk to process data must apply the desired encryption status.
Data tied to a specific store should use the store's rocksdb instance for encryption. Data not necessarily tied to a store should be encrypted if any of the stores on the node is encrypted.
We identify some existing uses of local disk: TODO(mberhault, mjibson, dan): make sure we don't miss anything.
In addition to making sure we cover all existing use cases, we should:
Gating at-rest-encryption on the presence of a valid enterprise license is problematic because we have no contact with the cluster when deciding to use encryption.
For now, we propose a reactive approach to license enforcement. When any node in the cluster uses encryption (determined through node metrics) but we do not have a valid license:
The overall idea is that the cluster is not negatively impacted by the lack of an enterprise license. See Enterprise feature gating for possible alternatives.
Actual code for changes proposed here will be broken into CCL and non-CCL code:
Implementing encryption-at-rest as proposed has a few drawbacks (in no particular order):
While rocksdb-level encryption does not force us to keep encryption-at-rest at this level, it strongly discourages us from implementing it elsewhere.
This means that more fine-grained encryption (eg: per column) will need to fit within this model or will require encryption in a completely different part of the system.
The rocksdb env_encryption functionality is barely tested and has no known open-source uses.
This raises serious concerns about the correctness of the proposed approach.
We can improve testing of this functionality at the rocksdb level as well as within cockroach. A testing plan must be developed and implemented to provide some assurances of correctness.
Proper use of encryption-at-rest requires a reasonable amount of user education, including
A lot of this falls onto proper documentation and admin UI components, but some are choices made here (flag specification, logged information, surfaced encryption status).
The current proposal takes a reactive approach to license enforcement: we show warnings in multiple places if encryption was enabled without an enterprise license.
This is unlike our other enterprise features which simply cannot be used without a license.
There is some discussion of possible ways to solve this in Enterprise feature gating, but this is left as future improvements.
Any files not included in rocksdb's "Live files" will still be encrypted. However, due to not being rewritten, they will become inaccessible as soon as the key is rotated out and GCed.
While we do not currently make use of backups, we have in the past and may again.
The enterprise-related functionality should live in CCL directories as much as possible (pkg/ccl for go code,
c-deps/libroach/ccl for C++ code).
However, a lot of integration is needed. Some (but far from all) examples include:
* the start command and StoreSpec flag parsing
* passing the proper env (Env) for DBImpl construction

This makes hook-based integration of CCL functionality tricky.
Making less code CCL would simplify this. But enterprise enforcement must be taken into account.
There are a few alternatives available in the major aspects of this design as well as in specific areas. We address them all here (in no particular order):
This is Out of scope
Filesystem encryption can be used without requiring coordination with cockroach or rocksdb. While this may be an option in some environments, DBAs do not always have sufficient privileges to use this or may not be willing to.
Filesystem encryption can still be used with cockroach independently of at-rest-encryption. This can be a reasonable solution for non-enterprise users.
Should we choose this alternative, this entire RFC can be ignored.
This is Out of scope
The solution proposed here allows encryption to be enabled or not for individual rocksdb instances. This may not be sufficient for fine-grained encryption.
Database and table-level encryption can be accomplished by integrating store encryption status with zone configs, allowing the placement of certain databases/tables on encrypted disks. This approach is rather heavy-handed and may not be suitable for all cases of database/table-level encryption.
However, this may not be sufficient for more fine-grained encryption (eg: per column). It's not clear how encryption for individual keys/values would work.
We have settled on a two-level key structure.
The current choice of two key levels (store keys vs data keys) is debatable:
Advantages:
Negated advantage:
Cons:
We could instead use a single level of keys where the user-provided keys are directly used to encode the data. This would simplify the logic and reporting (and user understanding). This would however make rotation slower and potentially make integration with third-party services more difficult. User-provided keys would have to be available until no data uses them.
We have settled on tied cipher/key-size specification. This can be changed easily.
The current proposal uses the same cipher and key size for store and data keys.
Pros:
Cons:
The previous version of this RFC proposed using the rocksdb::EncryptedEnv for all files, with encryption state
(plaintext or encrypted) and encryption fields stored in the 4KiB data prefix.
The main issues of that solution are:
We break down future improvements in multiple categories:
The features are listed in no particular order.
Crypto++ can determine support for SSE2 and AES-NI at runtime and fall back to software implementation when not supported.
There are a few things we can do:
We need to find a way to force re-encryption when we want to remove an old key.
While rocksdb regularly creates new files, we may need to force rewrite for less-frequently
updated files. Other files (such as MANIFEST, OPTIONS, CURRENT, IDENTITY, etc...) may need
a different method to rewrite.
Compaction (of the entire key space, or specific ranges determined through live file metadata) may provide the bulk of the needed functionality. However, some files (especially with no updates) will not be rewritten.
Some possible solutions to investigate:
Part of forcing re-encryption includes:
We would prefer not to keep old data keys forever, but we need to be certain that a key is no longer in use before deleting it. How feasible this is depends on the accuracy of our encryption status reporting.
If we choose to ignore non-live files, garbage collection should be reasonably safe.
All encrypted files are stored in the registry. Live rocksdb files will automatically be removed as they are deleted, but any other files will remain forever if not deleted through rocksdb.
We may want to periodically stat all files in our registry and delete the entries for nonexistent files.
The performance impact needs to be measured for a variety of workloads and for all supported ciphers. This is needed to provide some guidance to users.
Guidance on key rotation period would also be helpful. This is dependent on the rocksdb churn, so will depend on the specific workload. We may want to add metrics about data churn to our encryption status reporting.
We may want to automatically mark a store as "encrypted" and make this status available to zone configuration, allowing database/table placement to specify encryption status.
When to mark a store as "encrypted" is not clear. For example: can we mark it as encrypted just because encryption is enabled, or should we wait until encryption usage is at 100%?
If we use the existing store attributes for this marker, we may need to add the concept of "reserved" attributes.
We can export high-level metrics about at-rest-encryption through prometheus. This can include:
The current proposal only reloads store keys at node start time.
We can avoid restarts by triggering a refresh of the store key file when receiving a signal (eg: SIGHUP) or other
conditions (periodic refresh, admin UI endpoint, filesystem polling, etc...)
At the very least, we want cockroach debug tools to continue working correctly with encrypted files.
We should examine which rocksdb-provided tools may need modification as well, possibly involving patches to rocksdb.
We may want to delete old files in a less recoverable way (some filesystems allow un-delete). On SSDs, a single overwrite pass may be sufficient. We do not propose to handle safe deletion on hard drives.
Crypto++ supports multiple block ciphers. It should be reasonably easy to add support for other ciphers.
We can switch to authenticated encryption (eg: Galois Counter Mode, or others) to allow integrity verification of files on disk.
Implementing authenticated encryption would require additional changes to the raw storage format to store the final authentication tag.
We could perform a few checks to ensure data security, such as:
The current proposal does not gate encryption on a valid license because we cannot check the license when initializing the node.
A possible solution to explore is detection when the node joins a cluster, eg: at init. This would still cause issues when removing the license (or on errors loading/validating the license).
Less drastic actions may be possible.