docs/RFCS/20191202_full_cluster_backup_restore.md
Users should be able to BACKUP and RESTORE all relevant information stored
in their cluster, including the relevant information stored in system tables.
Currently, only user data can be backed up and restored, along with very limited metadata (table statistics, if requested). There is no mechanism for a user to easily restore their entire cluster as it appeared at the time of a backup.
This RFC builds on the original Backup & Restore functionality and extends it so that all of the cluster's logical data is included in the backup.
A new syntax is introduced to perform a full cluster backup and restore:
BACKUP TO [...] and RESTORE FULL CLUSTER FROM [...].
Additionally, incremental cluster backups are supported:
> BACKUP TO 'nodelocal:///cluster-backup/1';
> BACKUP TO 'nodelocal:///cluster-backup/2' INCREMENTAL FROM 'nodelocal:///cluster-backup/1';
A user can create an incremental cluster backup, but they must also provide a
full cluster backup and optionally additional incremental backups to build on (as is the
case for non-cluster backups); all of the listed backups must themselves be full cluster backups.
Incremental cluster backups can be restored in the usual way: RESTORE FROM 'nodelocal:///cluster-backup/1', 'nodelocal:///cluster-backup/2'.
Again, every backup listed must be a full cluster backup.
A full cluster RESTORE can only be performed on a fresh cluster with no user
data. Some of the data in the system tables may already be set (for example, the
cluster.organization setting must be set in order to even use this feature);
however, it should be noted that this data will be overwritten by performing a
full cluster RESTORE.
A full cluster BACKUP/RESTORE could be thought of as performing the following steps:
/* Full Cluster Backup */
/* There is no syntax to specify all user tables at once. Assume all user databases are: database_a, database_b, [...]. */
/* Note: BACKUP cannot back up a database while excluding individual tables, but only a subset of the system tables should be backed up. */
BACKUP DATABASE database_a, database_b, [...], system TO 'nodelocal:///cluster-backup/1';
/* Full Cluster Restore */
CREATE DATABASE crdb_system_temporary;
RESTORE system.* FROM 'nodelocal:///cluster-backup/1' WITH into_db='crdb_system_temporary';
/* Restore the user data. */
RESTORE DATABASE database_a, database_b, [...] FROM 'nodelocal:///cluster-backup/1';
/* Restore the system tables. */
BEGIN;
DELETE FROM system.users WHERE true;
INSERT INTO system.users (SELECT * FROM crdb_system_temporary.users);
COMMIT;
BEGIN;
DELETE FROM system.settings WHERE true;
INSERT INTO system.settings (SELECT * FROM crdb_system_temporary.settings);
COMMIT;
[...]
Not all system tables should be included in a backup since some information
relates to the physical properties of a cluster. The existing system tables
have been audited below. New system tables will need to add themselves to the
list of system tables that should be included in a backup. This will initially
be a list of names of system tables maintained inside the backupccl package.
| Table Name | Description | Included | Notes |
|---|---|---|---|
| namespace | Provides the relationship between parentID <-> descriptor name <-> descriptor ID | No | This information should be generated by the restoring cluster. |
| descriptor | Maps ID <-> descriptor proto | No | New descriptors should be made for every RESTOREd table. |
| users | Stores the users in the cluster. | Yes | |
| zones | Stores the zone config protos. | Yes | |
| settings | Stores all the cluster settings. | Yes | |
| leases | Table leases. | No | Leases held in the old cluster are no longer relevant. |
| eventlog | A log of a variety of events (schema changes, node additions, etc.). | No | Most events are not node-specific and would be useful to back up, but this may produce confusing output if restored into a cluster with a different number of nodes. See Future work. |
| rangelog | Range-level events. | No | Ranges on the old and new cluster will not match. |
| ui | A set of KV pairs used by the UI. | Yes | |
| jobs | A list of all jobs that are running or have run. | Yes | |
| web_sessions | Authenticated Admin UI web sessions. | No | This could eventually be moved into the backup. Unclear. |
| table_statistics | Table statistics used by the optimizer. | Yes | This information is currently backed up in the BACKUP manifest to BACKUP and RESTORE table statistics on a per-table level. |
| locations | Stores information about the localities. | Yes | |
| role_members | Contains role-role and user-role relationships for RBAC. | Yes | |
| comments | Stores up to 1 string comment per object ID. | Yes | |
| replication_* | Replication reports (constraint conformance, critical localities, stats). | No | Replication stats should be regenerated when the data is RESTOREd. |
| reports_meta | Metadata about the generation of the replication reports. | No | Same as above. |
| protectedts_* | As proposed by the protected timestamp RFC. | No | RESTORE only restores a snapshot of the data in the backup, not the entire MVCC history. |
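The audit above could be captured as a simple allowlist inside the backupccl package. A minimal sketch, assuming the list is keyed by table name (the variable name and exact placement are illustrative, not the actual backupccl code):

```go
package backupccl

// fullClusterSystemTables enumerates the system tables whose rows should be
// included in a full cluster backup, per the audit above. New system tables
// must be added here explicitly in order to be captured.
var fullClusterSystemTables = []string{
	"users",
	"zones",
	"settings",
	"ui",
	"jobs",
	"table_statistics",
	"locations",
	"role_members",
	"comments",
}
```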
There is no information in the system ranges that should be included in a CLUSTER backup since they all relate to properties of the ranges/nodes.
This RFC assumes that a cluster restore will occur on a fresh cluster with no user data. This allows the data to be restored exactly as it appeared in the backup. Otherwise, it would be necessary to re-key the user tables as well as the contents of the system tables that reference table IDs/KV spans (such as zone configs). Note: this implies that the behavior of any interaction with the restoring cluster is undefined until the restore succeeds. This may be extended in the future; see the Future work section for more details.
Additionally, incremental cluster backups and their restoration are supported using
the same syntax as the existing BACKUP. In addition to checking that the
previous backups cover the necessary span of keyspace and time, a check must be
added to backupPlanHook to verify that every backup that this incremental backup
builds upon is also a cluster backup. Additionally, full cluster restore should
only be permitted on full cluster backups. Therefore it is necessary to add a flag in the backup
manifest (BackupDescriptor) indicating whether or not a given backup is a
full cluster backup. The primary reason for this bit is to ensure that
full cluster restore can only restore full cluster backup files.
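A minimal sketch of that check, assuming the full cluster marker on the manifest described above (the type and function names are illustrative stand-ins, not the actual backupccl code):

```go
package backupccl

import "errors"

// backupManifest is a stand-in for the relevant part of BackupDescriptor.
type backupManifest struct {
	IsFullCluster bool
}

// checkFullClusterChain verifies that every backup an incremental cluster
// backup builds on is itself a full cluster backup.
func checkFullClusterChain(prevBackups []backupManifest) error {
	for _, m := range prevBackups {
		if !m.IsFullCluster {
			return errors.New("incremental cluster backups may only build on full cluster backups")
		}
	}
	return nil
}
```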
Users will mainly interact with this new feature through the new syntax
introduced: BACKUP FULL CLUSTER TO [...] and RESTORE FULL CLUSTER FROM [...].
Additionally, incremental cluster backups are supported:
> BACKUP FULL CLUSTER TO 'nodelocal:///cluster-backup/1';
> BACKUP FULL CLUSTER TO 'nodelocal:///cluster-backup/2' INCREMENTAL FROM 'nodelocal:///cluster-backup/1';
These backups can be restored: RESTORE FULL CLUSTER FROM 'nodelocal:///cluster-backup/1', 'nodelocal:///cluster-backup/2'. Every backup
listed must be a full-cluster backup.
This new syntax introduces a new target, FULL CLUSTER, which can be used instead of specifying particular databases/tables to be restored. Replacing the specific targets with the new FULL CLUSTER target should not result in any UX surprises.
A user can then examine a full cluster backup using the SHOW BACKUP command
(start_time and end_time are omitted from this example for brevity):
root@:26258/default_db> SHOW BACKUP 'nodelocal://1/full-cluster';
database_name | table_name | size_bytes | rows | is_full_cluster |
+---------------+------------+------------+------+-----------------+
some_user_db | foo | 0 | 0 | true |
system | zones | 252 | 0 | true |
system | users | 99 | 0 | true |
...
This command shows the user what type of metadata is stored in the backup. Since users must specify only full cluster backups to build incremental backups, this allows users to inspect a backup to check what cluster information is stored.
With regards to user-visible errors introduced by this feature, users can expect to see an error when:
- A full cluster RESTORE is attempted from a backup that is not a full cluster backup.
- A full cluster RESTORE is attempted on a cluster that already contains user data (i.e. descriptors with IDs greater than or equal to MinNonPredefinedUserDescID).

Note: it is expected that users will be able to perform a non-full cluster RESTORE on a full-cluster BACKUP.
The first difference between a full cluster backup and a regular (non-full cluster) backup is that a full cluster backup includes all user tables in the backup. This can be accomplished by including all tables -- as defined by enumerating the descriptors table -- except for the set explicitly excluded as defined above.
Additionally, all OFFLINE tables need to be included in a BACKUP (they are not today). This ensures that in-progress jobs are able to continue after a full cluster restore. See the Jobs section below.
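A rough sketch of this target selection, using stand-in types rather than the real descriptor protos (fullClusterBackupTargets and descriptorInfo are hypothetical names):

```go
package backupccl

// descriptorInfo is a stand-in for the descriptor fields relevant here.
type descriptorInfo struct {
	Name          string
	IsSystemTable bool
}

// fullClusterBackupTargets keeps every user table (OFFLINE tables included, so
// in-progress jobs can resume after a restore) plus the allowlisted system
// tables.
func fullClusterBackupTargets(all []descriptorInfo, systemAllowlist map[string]bool) []descriptorInfo {
	var targets []descriptorInfo
	for _, d := range all {
		if d.IsSystemTable && !systemAllowlist[d.Name] {
			// Only the audited subset of system tables is backed up.
			continue
		}
		targets = append(targets, d)
	}
	return targets
}
```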
Finally, the backup manifest (BackupDescriptor protobuf) needs to be augmented
with an enum specifying the amount of cluster information stored in the backup.
An enum DescriptorCoverage will be added to the BackupDescriptor and will
have options: RequestedDescriptors, which is the default and is what existing
backups will have going forward, and AllDescriptors for full cluster backup.
This enum is required to prevent a full cluster restore being performed from a
non-full cluster backup file. In particular, this requirement exists because
full cluster RESTORE guarantees that the entire cluster has been RESTOREd (so we
need the entire cluster to be in the backup file).
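A hypothetical Go rendering of that enum (the real definition would live in the BackupDescriptor protobuf; it is shown here only to illustrate the two coverage levels):

```go
package backupccl

// DescriptorCoverage describes how much of the cluster's descriptor space a
// backup covers.
type DescriptorCoverage int32

const (
	// RequestedDescriptors is the default: the backup contains only the
	// targets the user explicitly requested. Existing backups are treated
	// as having this coverage.
	RequestedDescriptors DescriptorCoverage = iota
	// AllDescriptors marks a full cluster backup; only these backups may be
	// used by a full cluster RESTORE.
	AllDescriptors
)
```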
Upon a full cluster restore, the order in which data is restored becomes
relevant. In particular, system.zones must be restored prior to restoring the
user data in order to ensure that the user data is placed in the appropriate
localities, where applicable. The user data is restored next, and finally the
rest of the system tables.
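A high-level sketch of that ordering, with each phase abstracted behind a callback (the function names are illustrative, not the actual restore job code):

```go
package backupccl

import "context"

// runFullClusterRestorePhases enforces the ordering described above: zone
// configs first (so user data lands in the right localities), then the user
// data, then the remaining system tables.
func runFullClusterRestorePhases(
	ctx context.Context,
	restoreZones, restoreUserData, restoreOtherSystemTables func(context.Context) error,
) error {
	phases := []func(context.Context) error{
		restoreZones,
		restoreUserData,
		restoreOtherSystemTables,
	}
	for _, phase := range phases {
		if err := phase(ctx); err != nil {
			return err
		}
	}
	return nil
}
```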
First, a check is performed to ensure that no user data exists in this cluster.
This is achieved by ensuring that no descriptors exist with an ID greater than or
equal to MinNonPredefinedUserDescID. This check also ensures that no other full
cluster restore is in progress, since a full cluster restore would have created
the crdb_system_temporary database in the user descriptor ID space. Then the
DescIDGenerator needs to be restored. This key determines the value of the next
descriptor ID (such as during the creation of a table or database) and is
incremented whenever a descriptor is created. Let MaxID be the maximum
descriptor ID found in the backup; the DescIDGenerator should then be set to
MaxID + 1 so that descriptors created after the restore receive correct IDs.
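A minimal sketch of the MaxID + 1 computation, assuming the descriptor IDs have already been read out of the backup manifest (the actual KV write of the DescIDGenerator key is elided):

```go
package backupccl

// nextDescIDGeneratorValue returns the value the DescIDGenerator key should be
// set to: one past the largest descriptor ID found in the backup (MaxID + 1),
// so that descriptors created after the restore, including the temporary
// crdb_system_temporary database, receive fresh, non-conflicting IDs.
func nextDescIDGeneratorValue(backupDescIDs []uint32) uint32 {
	var maxID uint32
	for _, id := range backupDescIDs {
		if id > maxID {
			maxID = id
		}
	}
	return maxID + 1
}
```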
System tables cannot be restored in the same way as user data tables since
they occupy a fixed keyspace (and thus cannot be re-keyed as we do today for
new tables). First we restore the system tables into a temporary database. The
DescIDGenerator key must be updated prior to creating this temporary database to
ensure no conflicts with a user table that needs to be restored (thus the
ID of this database will be MaxID + 1 and the DescIDGenerator will be
incremented again).
CREATE DATABASE crdb_system_temporary;
RESTORE system.* FROM 'nodelocal://1/full-backup/1' WITH into_db='crdb_system_temporary';
Then, using an internal executor, execute:
BEGIN;
DELETE FROM system.zones WHERE true;
INSERT INTO system.zones (SELECT * FROM crdb_system_temporary.zones);
COMMIT;
Before restoring the user data, we need to ensure that all the user tables and database descriptors are created with the same ID as they have in the backup. This differs from the current implementation which generates a new ID for these items. This allows for a potential future optimization to skip the no-op rekeying. User tables can then be restored normally.
To restore the remainder of the system tables, perform a transaction similar to the one listed above, but rather than only restoring the zones table, restore the rest of the system tables. It is preferable to restore all of the remaining system tables in one transaction in order to ensure atomicity across the restoration of all the system tables. However, there may be a limitation based on the maximum transaction size, in which case the possibility of restoring the system tables one by one could be investigated. That said, the maximum size of a transaction is quite large and is not expected to cause issues.
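A rough sketch of this step, with the internal executor abstracted behind a stand-in interface (sqlExecutor and restoreRemainingSystemTables are hypothetical names; the caller is assumed to open and commit the surrounding transaction):

```go
package backupccl

import (
	"context"
	"fmt"
)

// sqlExecutor is a stand-in for running SQL through the internal executor
// inside an already-open transaction.
type sqlExecutor interface {
	Exec(ctx context.Context, stmt string) error
}

// restoreRemainingSystemTables replaces the contents of each listed system
// table (everything in the allowlist except zones, which was restored earlier)
// with the rows captured in crdb_system_temporary.
func restoreRemainingSystemTables(ctx context.Context, ex sqlExecutor, tables []string) error {
	for _, tbl := range tables {
		if err := ex.Exec(ctx, fmt.Sprintf("DELETE FROM system.%s WHERE true", tbl)); err != nil {
			return err
		}
		insert := fmt.Sprintf(
			"INSERT INTO system.%s (SELECT * FROM crdb_system_temporary.%s)", tbl, tbl)
		if err := ex.Exec(ctx, insert); err != nil {
			return err
		}
	}
	return nil
}
```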
Finally, the temporary crdb_system_temporary database is deleted.
During a cluster backup, a job may be in progress. The state for these jobs
is persisted in the user data and in the system.jobs table. These jobs
will be restored in a running state, but no node will hold a lease on them.
Each such job should be adopted and then continued. For a job to be able to be
continued, all OFFLINE tables need to be included in the BACKUP.
The current implementation of locality-aware BACKUPs should continue to work with cluster backup without further work. BACKUP of the system tables will operate just as it does for the user-data tables, and the relevant leaseholders will back up to the appropriate locality.
The happy path for a full cluster restore is when the restore is started and all nodes remain available until the restore is complete.
Non-cluster restore creates the tables for the user-data tables at the start of
the restore. These tables are in an OFFLINE state - inaccessible to the
user[1]. If there is a failure during the restore, these tables are marked
as DROP and will be removed. Full cluster restore can recover the user-data
tables that it restored this way as well. The difficulty lies in handling
system tables that have already been restored. In practice this will likely only be the
system.zones table, since the remainder of the system tables are restored
in a single transaction near the end of the job; however, the general case is
considered.
Since a full cluster restore must be run on a fresh cluster, the first iteration of full cluster restore could require the cluster to be destroyed if the restore fails. This can likely be improved as detailed in the Future work section.
One reasonable question is why CockroachDB doesn't load the system data the
same way as the user-data tables. One difficulty that this would present is
that user-data restoration happens on new tables, but the system tables in the
new cluster already have data in them. This method would directly ingest the
SSTables for the system table spans, then issue a TimeBoundedDelete. Such a
command does not yet exist, but could be implemented by leveraging
engine.MVCCClearTimeRange, similarly to the batcheval.RevertRange command.
This leaves the possibility of a potentially dirty state in the system
tables. Additionally, the keys in the SSTs would need to have their timestamps
updated to some time greater than the start time of the restore.
The reason we take an approach of loading SSTs directly into the storage layer for user data is that we typically expect a large volume of data. Additionally, we can ensure that this data is not needed or accessed by the user while it is being loaded. Since the size of the system tables is expected to be much smaller than the size of the user-data tables, there are no advantages to this approach and it is more complex.
Additionally, instead of marking a particular backup as "full cluster", the system tables that it holds could be examined. This would allow previous backups that included the system table information to be used for a full cluster restore. One problem with this approach is that if new system tables are added to the list of expected system tables, a mapping between version numbers and the tables expected to be included in each version would need to be maintained. This problem is avoided by marking backups as full cluster, since we can then assume that all system tables included in those backups are safe to restore (and to override the existing ones).
As mentioned, it may be possible to enable cluster restoration on a non-new cluster - however, this raises further complications. Since the vast majority of use cases for this feature are to restore a cluster exactly as it was in the backup, there is little motivation to generalize cluster restoration in this way. In particular, this would require the contents of the system tables to be re-keyed (in addition to the user-data KVs themselves). This would require each system table to provide a way to map each of its rows to an updated row based on a re-keyer.
One large remaining piece of work is how to handle the case where the metadata of the restoring cluster does not match that of the cluster in the backup. For example, if the cluster on which the BACKUP was performed has a given set of localities which do not exist on the cluster being restored, there is currently no way for the user to map the localities from the backed up cluster to the values they should take in the restoring cluster. Since all BACKUP and RESTORE interactions currently happen at the CLI, a major difficulty is providing a powerful enough interface for the user to provide these mappings.
Include system.eventlog in a full cluster backup. One reason for not doing
this initially is that some event logs may be nonsensical if the table is
restored into a cluster with a different number of nodes.
Additionally, one further improvement would be to allow the restoration of a set of tables/databases with their respective configuration. This requires that the RESTORE process find which rows in the system tables are applicable to the given database/table. This also implies that we'd need to add the ability to rewrite values in other tables. This is out of scope for this RFC.
A potential improvement to ensure that the cluster is in a fresh state would
be to mark a cluster for restoration at creation time (similar to cockroach init). This would also prevent any operations from interfering with the restore.
A more graceful failure mode could be implemented which ensures that the cluster's state is guaranteed to be healthy in the case of a failed full cluster RESTORE.
The initial implementation will not consider what happens if there is a failure in the middle of the backup. It will clean up the data following the normal backup procedures. In the case that there is a failure while updating the system tables, the cluster should be started up again. Since we enforce that the cluster we are restoring to has no user data, this is acceptable.
Due to the restriction that full cluster restore can only be performed on a newly created cluster, some users may be surprised when trying to perform a full cluster restore if this assumption is violated.
[1] Offline tables can, however, be referenced when setting zone configs. See #40285.