docs/design/2022-07-03-block-storage-snapshot-based-backup-restore.md
Amazon Elastic Block Store (Amazon EBS) is an easy-to-use, scalable, high-performance block-storage service designed for Amazon Elastic Compute Cloud (Amazon EC2). It provides varied volume types that best fit the workload. The EBS volumes that are attached to an instance are exposed as storage volumes that persist independently from the life of the instance. Most importantly, you can back up the data on your Amazon EBS volumes by taking EBS snapshots.
TiDB is a distributed database providing horizontal scalability, strong consistency, and high availability. It is designed to work in the cloud to make deployment, provisioning, operations, and maintenance flexible.
The current TiDB backup and restore solution provides high backup and restore speed for large-scale TiDB clusters; however, the impact on the cluster being backed up is notable for many customers and hard to reduce with the current implementation. By leveraging the Amazon EBS snapshot feature, we can provide a block-level backup and recovery solution with transaction consistency that minimizes the impact on the target TiDB cluster.
Backup and restore of a TiDB cluster using EBS snapshots on AWS is expected to achieve the following:
A Custom Resource Definition (CRD) is generated based on the Backup/Restore request launched by the customer in TiDB Operator.
The Backup & Restore Controller detects the CRD and then creates a new Pod that loads the corresponding worker to execute the backup or restore work.
Notice: If the user wants to delete the backup (snapshots and metadata), the user can simply remove the backup CR from TiDB Operator.
For backup, as the Backup Worker section states, there are mainly 3 steps:
Step 1 is assumed to take < 5 minutes (including fault tolerance, e.g. retries).
Step 2: TiKV already maintains the resolved_ts component.
Step 3: snapshot time depends on the customer's EBS volume type and the amount of data change. Excluding the first full snapshot, one day of data change on the gp3 volume type took ~10 minutes in our test.
For Restore, as Restore Worker states, there are mainly 3 steps:
Step 1 is assumed to take < 30 minutes.
Step 2 is assumed to take < 10 minutes; since the snapshots are taken in parallel, the amount of unaligned raft log among peers in a Region is small.
Step 3 is assumed to take < 30 minutes. Data is resolved by removing all key-values that have versions larger than resolved_ts, and the data is resolved through the raft layer rather than deleted from disk directly.
Note: to fully initialize EBS volumes, there are extra steps suggested by AWS.
For a TiDB cluster on AWS EBS, the key-value write workflow is as follows:
TiDB->TiKV->OS->EBS
| Layer | Owned by | Description |
| --- | --- | --- |
| SQL | TiDB | Where the user data comes from |
| Transaction | TiDB | The transaction driver of TiDB, which has ACID properties |
| MVCC | TiKV | Multiversion Concurrency Control module in TiKV |
| Raftstore | TiKV | A Multi-Raft implementation in TiKV |
| RocksDB | TiKV | A key-value storage engine that stores user data |
| OS | AWS | EC2 instance, responsible for the local volume and the file system formatted on it |
| Volume | AWS | EBS volume, like a disk in an on-premises data center |
| EBS | AWS | More like an AWS storage pool that can provide disks with various performance characteristics (gp3, io2, etc.) for EC2 |

In TiDB, the transaction layer adopts the Percolator model, which is a two-phase commit protocol. TiKV is a distributed key-value store that uses the Raft consensus algorithm to provide strong consistency. Furthermore, TiKV implements Multi-Raft in Raftstore, providing data consistency and scalability.
AWS EBS is the physical storage layer, and EBS volumes can be backed up by EBS snapshots. In the transaction layer, after the data of the same transaction is encoded by MVCC and processed by the raft layer, the complete transaction data is written to different TiKV EBS volumes. The consistency of the snapshots of these volumes needs to be handled as follows:
TiKV has a resolved_ts component that maintains a timestamp (ts) within each Region. This resolved_ts is the maximum timestamp at which the current Region data is consistent: any data with a timestamp lower than resolved_ts has transaction-layer consistency. Meanwhile, in the latest implementation, TiKV calculates the minimum resolved_ts of the current store and reports it to PD. See ReportMinResolvedTsRequest.
In our solution, we get this resolved_ts from PD and use it as the ts to resolve the backup data in the restore phase.
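As an illustration, here is a minimal Go sketch of how a backup worker could read this store-level minimum resolved_ts from PD's HTTP API (the `/pd/api/v1/min-resolved-ts` endpoint shown later in this document). The `min_resolved_ts` and `is_real_time` field names are assumptions about the response body, not part of this design:

```go
// Sketch: fetch the cluster-wide minimum resolved_ts from PD's HTTP API.
// Assumes the /pd/api/v1/min-resolved-ts endpoint returns a JSON body
// containing a "min_resolved_ts" field.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type minResolvedTsResp struct {
	MinResolvedTs uint64 `json:"min_resolved_ts"` // assumed field name
	IsRealTime    bool   `json:"is_real_time"`    // assumed field name
}

func getMinResolvedTs(pdAddr string) (uint64, error) {
	resp, err := http.Get(fmt.Sprintf("http://%s/pd/api/v1/min-resolved-ts", pdAddr))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var body minResolvedTsResp
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return 0, err
	}
	return body.MinResolvedTs, nil
}

func main() {
	ts, err := getMinResolvedTs("172.16.2.1:2379")
	if err != nil {
		panic(err)
	}
	fmt.Println("resolved_ts for backup:", ts)
}
```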
The key-values of the same transaction may be written to different Raft Groups. In the restore phase, after each Raft Group is brought to consistency, we use the transaction-consistent resolved_ts on each TiKV to delete the data of incomplete transactions. For this step, see the detailed design of the Backup and Restore ResolvedData phase.
For each Raft Group within TiKV, we have to deal with Region metadata and Region data.
On TiKV#1, a write proposal has been applied, but on TiKV#2 and TiKV#3 it has not been applied yet.
For each Raft Group, we process the meta and data of the Region through BR RegionRecover; its workflow is as follows:
Within the RegionRecover phase, the second step starts the raft state machine so that the raft log on each Region's non-leader peers is applied up to the largest log index, achieving Region-level data consistency.
Turn off the PD scheduler. The reason is that when there is a large volume of writes to the TiDB cluster while snapshots are being taken, a peer may be replicated from one TiKV volume to another while the two volumes are snapshotted asynchronously, so data may be lost at the block level. At the same time, replica scheduling makes the epoch version change, and there are many intermediate states to process, which makes such problems very complicated to handle. Currently, PD supports the following methods to turn off peer replication:
./pd-ctl --pd=http://127.0.0.1:2379 config set merge-schedule-limit 0
./pd-ctl --pd=http://127.0.0.1:2379 config set region-schedule-limit 0
./pd-ctl --pd=http://127.0.0.1:2379 config set replica-schedule-limit 0
./pd-ctl --pd=http://127.0.0.1:2379 config set hot-region-schedule-limit 0
After the above schedule limits are set to 0, peer addition, deletion, and replication (from one TiKV to another) are prohibited. At the same time, functions such as merge and scatter across TiKVs are also prohibited. For Regions that may have overlapping ranges due to splits after PD scheduling is resumed, please refer to the Recover function design for details; for details about replicas, please refer to the Q&A.
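For illustration, the same limits can also be set programmatically through PD's HTTP config API; a minimal Go sketch, assuming the standard `/pd/api/v1/config` endpoint that pd-ctl uses and mirroring the pd-ctl commands above:

```go
// Sketch: set PD schedule limits to 0 over PD's HTTP config API,
// mirroring the pd-ctl commands above. Error handling is minimal.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func pausePDScheduling(pdAddr string) error {
	limits := map[string]interface{}{
		"merge-schedule-limit":      0,
		"region-schedule-limit":     0,
		"replica-schedule-limit":    0,
		"hot-region-schedule-limit": 0,
	}
	body, err := json.Marshal(limits)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("http://%s/pd/api/v1/config", pdAddr)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("set config failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := pausePDScheduling("127.0.0.1:2379"); err != nil {
		panic(err)
	}
}
```

In practice, the worker would record the original limit values first so that scheduling can be restored to its previous settings after the snapshots are taken.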
After the TiDB Operator starts the Backup Worker, the backup job starts.
After the TiDB Operator starts the Restore Worker, the restore work starts.
Backup metadata definition
{
"cluster_info": {
"cluster_version": "v6.3.0",
"max_alloc_id": "6000",
"resolved_ts": "456745777823347",
},
"tikv" : {
"replicas": 3,
"stores": [
{
"store_id" : 1,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6244",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine",
"restore_volume_id" : "vol-0e65f40961a9f0001",
"snapshot_id" : "snap-1234567890abcdef0",
},
{
"volume_id" : "vol-0e65f40961a9f6245",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir",
"restore_volume_id" : "vol-0e65f40961a9f0002",
"snapshot_id" : "snap-1234567890abcdef1",
}
]
},
{
"store_id" : 2,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6246",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine",
"restore_volume_id" : "vol-0e65f40961a9f0003",
"snapshot_id" : "snap-1234567890abcdef2",
"fsr-enabled": "false",
},
{
"volume_id" : "vol-0e65f40961a9f6247",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir",
"restore_volume_id" : "vol-0e65f40961a9f0004",
"snapshot_id" : "snap-1234567890abcdef3",
"fsr-enabled": "false",
}
]
},
{
"store_id" : 3,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6248",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine",
"restore_volume_id" : "vol-0e65f40961a9f0005",
"snapshot_id" : "snap-1234567890abcdef4",
"fsr-enabled": "false",
},
{
"volume_id" : "vol-0e65f40961a9f6249",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir",
"restore_volume_id" : "vol-0e65f40961a9f0006",
"snapshot_id" : "snap-1234567890abcdef5",
"fsr-enabled": "false",
}
]
}
]
},
"pd" : {
"replicas" : 3
},
"tidb": {
"replicas" : 3
},
"kubernetes" : {
"pvs" : [],
"pvcs" : [],
"crd_tidb_cluster" : {},
"options" : {}
},
"options" : {}
}
The backup worker implements the following functions:
Obtain the configuration information of the online backup cluster, such as resolved_ts.
Configure cluster PD scheduling, stop replica scheduling, turn off GC during backup, and then turn on GC after backup.
Taking snapshots of the EBS/PV volumes that TiKV runs on.
Worker container contains: 1. backup-manager, 2. BR
backup full --type=aws-ebs --pd "172.16.2.1:2379" -s "s3://bucket/backup_folder" --volumes-file=backup.json
Backup worker workflow
TiDB Operator retrieves the PD address of the target cluster and all TiKV volume information.
TiDB Operator provides --volumes-file=backup.json for the backup cluster and starts the backup job; backup.json contains:
{
"tikv" : {
"replicas": 3,
"stores": [
{
"store_id" : 1,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6244",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine"
},
{
"volume_id" : "vol-0e65f40961a9f6245",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir"
}
]
},
{
"store_id" : 2,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6246",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine"
},
{
"volume_id" : "vol-0e65f40961a9f6247",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir",
}
]
},
{
"store_id" : 3,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6248",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine"
},
{
"volume_id" : "vol-0e65f40961a9f6249",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir",
}
]
}
]
},
"pd" : {
"replicas" : 3
},
"tidb": {
"replicas" : 3
},
"kubernetes" : {
"pvs" : [],
"pvcs" : [],
"crd_tidb_cluster" : {},
"options" : {}
},
"options" : {}
}
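For illustration, a minimal Go sketch of structs that could parse the `--volumes-file` above; the field names mirror the JSON keys shown here, while the actual types used inside BR may differ:

```go
// Sketch: Go structs that mirror the --volumes-file JSON above, plus a small
// loader. Field names follow the JSON keys shown in this document.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type Volume struct {
	VolumeID  string `json:"volume_id"`
	Type      string `json:"type"` // raft-engine.dir or storage.data-dir
	MountPath string `json:"mount_path"`
}

type Store struct {
	StoreID uint64   `json:"store_id"`
	Volumes []Volume `json:"volumes"`
}

type TiKV struct {
	Replicas int     `json:"replicas"`
	Stores   []Store `json:"stores"`
}

type VolumesFile struct {
	TiKV TiKV `json:"tikv"`
}

func loadVolumesFile(path string) (*VolumesFile, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var vf VolumesFile
	if err := json.Unmarshal(data, &vf); err != nil {
		return nil, err
	}
	return &vf, nil
}

func main() {
	vf, err := loadVolumesFile("backup.json")
	if err != nil {
		panic(err)
	}
	for _, s := range vf.TiKV.Stores {
		fmt.Printf("store %d has %d volumes\n", s.StoreID, len(s.Volumes))
	}
}
```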
The restore worker implements the following functions:
Obtain the deployment information of the restore cluster, such as the PD address and the number of TiKV stores.
Restore EBS/PV volumes from snapshots.
Mount the restored volumes and control TiKV startup.
Start BR for data recovery.
Worker container contains: 1. backup-manager, 2. BR
Restore worker workflow:
br restore full --type=aws-ebs --prepare --pd "172.16.2.1:2379" -s "s3:///us-west-2/meta/&sk=xx..." --output=topology.json
The BR command output is as follows:
{
"cluster_info": {
"cluster_version": "v6.3.0",
"max_alloc_id": "6000",
"resolved_ts": "456745777823347",
},
"tikv" : {
"replicas": 3,
"stores": [
{
"store_id" : 1,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6244",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine",
"restore_volume_id" : "vol-0e65f40961a9f0001",
"snapshot_id" : "snap-1234567890abcdef0"
},
{
"volume_id" : "vol-0e65f40961a9f6245",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir",
"restore_volume_id" : "vol-0e65f40961a9f0002",
"snapshot_id" : "snap-1234567890abcdef1"
}
]
},
{
"store_id" : 2,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6246",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine",
"restore_volume_id" : "vol-0e65f40961a9f0003",
"snapshot_id" : "snap-1234567890abcdef2"
},
{
"volume_id" : "vol-0e65f40961a9f6247",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir",
"restore_volume_id" : "vol-0e65f40961a9f0004",
"snapshot_id" : "snap-1234567890abcdef3"
}
]
},
{
"store_id" : 3,
"volumes" : [
{
"volume_id" : "vol-0e65f40961a9f6248",
"type" : "raft-engine.dir",
"mount_path" : "/var/lib/tikv/raft-engine",
"restore_volume_id" : "vol-0e65f40961a9f0005",
"snapshot_id" : "snap-1234567890abcdef4"
},
{
"volume_id" : "vol-0e65f40961a9f6249",
"type" : "storage.data-dir",
"mount_path" : "/var/lib/tikv/data-dir",
"restore_volume_id" : "vol-0e65f40961a9f0006",
"snapshot_id" : "snap-1234567890abcdef5"
}
]
}
]
},
"pd" : {
"replicas" : 3
},
"tidb": {
"replicas" : 3
},
"kubernetes" : {
"pvs" : [],
"pvcs" : [],
"crd_tidb_cluster" : {},
"options" : {}
},
"options" : {}
}
The backup-manager mounts the relevant volume and starts TiKV.
The backup-manager then starts BR again; after BR completes the data consistency restoration, it sets the PD flag back to normal mode and then restarts TiKV to exit recovery mode. For the detailed design, see BR Restore Detailed Design.
br restore full --type=aws-ebs --pd "172.16.2.1:2379" -s "s3:///us-west-2/meta/&sk=xx..."
Backup design
Restore design
Back up the cluster's basic metadata, such as resolved_ts.
Configure cluster scheduling, stop replica scheduling, turn off GC during backup, and turn on GC after backup.
Take snapshots of the EBS/PV volumes.
Backup workflow
const enableTiKVSplitRegion = "enable-tikv-split-region"
scheduleLimitParams := []string{
"hot-region-schedule-limit",
"leader-schedule-limit",
"merge-schedule-limit",
"region-schedule-limit",
"replica-schedule-limit",
enableTiKVSplitRegion,
}
Read resolved_ts and save it in the backup data.
Shut down GC by starting a background safepoint keeper that continuously updates the GC safepoint.
Get any ongoing peer scheduling operators and wait until the scheduling is complete.
After the snapshot call returns (the EBS snapshot API returns immediately), re-enable replica scheduling and GC (a sketch of the snapshot step follows these steps).
Wait for the AWS snapshot to complete
Summarize all backup data information and upload to the target storage S3.
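The following Go sketch illustrates the snapshot step above: create an EBS snapshot for every TiKV volume and wait for all of them to complete, using the AWS SDK for Go (aws-sdk-go v1). It is a simplified sketch; the GC safepoint keeper, scheduler handling, and metadata upload are omitted, and the volume IDs are placeholders from the example metadata:

```go
// Sketch: take EBS snapshots of all TiKV volumes listed in backup.json and
// wait for them to reach the "completed" state, using aws-sdk-go (v1).
// Region and credentials come from the environment; error handling is minimal.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func snapshotVolumes(svc *ec2.EC2, volumeIDs []string) ([]string, error) {
	var snapshotIDs []string
	// The EBS CreateSnapshot call returns immediately; the snapshot is
	// taken at the block level in the background.
	for _, vol := range volumeIDs {
		out, err := svc.CreateSnapshot(&ec2.CreateSnapshotInput{
			VolumeId:    aws.String(vol),
			Description: aws.String("tidb-ebs-backup"),
		})
		if err != nil {
			return nil, err
		}
		snapshotIDs = append(snapshotIDs, aws.StringValue(out.SnapshotId))
	}
	return snapshotIDs, nil
}

func waitSnapshotsCompleted(svc *ec2.EC2, snapshotIDs []string) error {
	return svc.WaitUntilSnapshotCompleted(&ec2.DescribeSnapshotsInput{
		SnapshotIds: aws.StringSlice(snapshotIDs),
	})
}

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	// Volume IDs of the raft-engine and data-dir volumes of every TiKV store.
	vols := []string{"vol-0e65f40961a9f6244", "vol-0e65f40961a9f6245"}

	snaps, err := snapshotVolumes(svc, vols)
	if err != nil {
		panic(err)
	}
	// Scheduling and GC can be re-enabled as soon as CreateSnapshot returns;
	// the backup only needs to wait for the snapshots before uploading metadata.
	if err := waitSnapshotsCompleted(svc, snaps); err != nil {
		panic(err)
	}
	fmt.Println("snapshots completed:", snaps)
}
```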
There are mainly two phases in the implementation: RegionRecover and ResolvedData.
Newly added recovery service/recovery mode
After TiKV starts the raft engine and the KV engine, while establishing the PD client, it actively reads the recovery marker from PD at the very beginning of the connection between TiKV and PD.
TiKV performs data recovery in recovery mode, mainly adjusting the raft state machine and data consistency in the recovery phase. TiKV completes the recovery work in two steps in recovery mode:
Step 1: Report metadata, and adjust local metadata.
Step 2: Delete data
In recovery mode, the main work completed is data consistency recovery.
curl "172.16.5.31:3279/pd/api/v1/admin/snapshot-recovering" -XPOST
For more info, please refer to the PR.
curl "172.16.5.31:3279/pd/api/v1/min-resolved-ts"
curl "172.16.5.31:3279/pd/api/v1/admin/base-alloc-id" -XPOST -d "10000"
message IsSnapshotRecoveringRequest {
RequestHeader header = 1;
}
message IsSnapshotRecoveringResponse {
ResponseHeader header = 1;
bool marked = 2;
}
rpc IsSnapshotRecovering(IsSnapshotRecoveringRequest) returns (IsSnapshotRecoveringResponse) {}
During the BR backup, an admin check interface is used to check whether a Region has an ongoing admin operation:
message CheckAdminRequest {
}
message CheckAdminResponse {
Error error = 1;
metapb.Region region = 2;
bool has_pending_admin = 3;
}
// CheckPendingAdminOp used for snapshot backup. before we start snapshot for a TiKV.
// we need stop all schedule first and make sure all in-flight schedule has finished.
// this rpc check all pending conf change for leader.
rpc CheckPendingAdminOp(CheckAdminRequest) returns (stream CheckAdminResponse) {}
TiKV enters recovery mode when PD is marked as snapshot-recovering.
TiKV reports Region metadata through the following interface:
message RegionMeta {
uint64 region_id = 1;
uint64 peer_id = 2;
uint64 last_log_term = 3;
uint64 last_index = 4;
uint64 commit_index = 5;
uint64 version = 6;
bool tombstone = 7; //reserved, it may be used in late phase for peer check
bytes start_key = 8;
bytes end_key = 9;
}
// read region meta to ready region meta
rpc ReadRegionMeta(ReadRegionMetaRequest) returns (stream RegionMeta) {}
BR RegionRecover command interface
// command to store for recover region
message RecoverRegionRequest {
uint64 region_id = 1;
bool as_leader = 2; // force region_id as leader
bool tombstone = 3; // set Peer to tombstoned in late phase
}
message RecoverRegionResponse {
Error error = 1;
uint64 store_id = 2;
}
// execute the recovery command
rpc RecoverRegion(stream RecoverRegionRequest) returns (RecoverRegionResponse) {}
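To illustrate how BR could combine ReadRegionMeta and RecoverRegion, here is a simplified Go sketch of the leader-selection step: for every Region, pick the reported peer with the largest (last_log_term, last_index) and build a RecoverRegionRequest that forces it to become leader. The structs are local simplifications of the protobuf messages above, and the real implementation also has to handle tombstones and overlapping ranges:

```go
// Sketch: pick, for every Region, the peer whose raft log is most advanced
// and build a RecoverRegionRequest that forces it to become leader.
// RegionMeta / RecoverRegionRequest are local simplifications of the
// protobuf messages defined above.
package main

import "fmt"

type RegionMeta struct {
	RegionID    uint64
	StoreID     uint64 // which store reported this peer
	LastLogTerm uint64
	LastIndex   uint64
}

type RecoverRegionRequest struct {
	RegionID uint64
	AsLeader bool
}

// selectLeaders returns, per store, the recover commands to send to it.
func selectLeaders(reports []RegionMeta) map[uint64][]RecoverRegionRequest {
	best := make(map[uint64]RegionMeta) // region_id -> most advanced peer
	for _, m := range reports {
		b, ok := best[m.RegionID]
		// A peer is "more advanced" if it has a newer term, or the same
		// term but a larger log index.
		if !ok || m.LastLogTerm > b.LastLogTerm ||
			(m.LastLogTerm == b.LastLogTerm && m.LastIndex > b.LastIndex) {
			best[m.RegionID] = m
		}
	}
	cmds := make(map[uint64][]RecoverRegionRequest)
	for regionID, m := range best {
		cmds[m.StoreID] = append(cmds[m.StoreID], RecoverRegionRequest{
			RegionID: regionID,
			AsLeader: true,
		})
	}
	return cmds
}

func main() {
	reports := []RegionMeta{
		{RegionID: 1, StoreID: 1, LastLogTerm: 5, LastIndex: 90},
		{RegionID: 1, StoreID: 2, LastLogTerm: 5, LastIndex: 100},
		{RegionID: 1, StoreID: 3, LastLogTerm: 4, LastIndex: 120},
	}
	fmt.Println(selectLeaders(reports)) // region 1 leader forced on store 2
}
```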
During the BR ResolvedData phase, the following interface deletes data:
// resolve data by resolved_ts
message ResolveKvDataRequest {
uint64 resolved_ts = 1;
}
message ResolveKvDataResponse {
Error error = 1;
uint64 store_id = 2;
uint64 resolved_key_count = 3; // reserved for summary of restore
// cursor of delete key.commit_ts, reserved for progress of restore
// progress is (current_commit_ts - resolved_ts) / (backup_ts - resolved_ts) x 100%
uint64 current_commit_ts = 4;
}
// execute delete data from kv db
rpc ResolveKvData(ResolveKvDataRequest) returns (stream ResolveKvDataResponse) {}
backup phase
backup full --type=aws-ebs --pd "172.16.2.1:2379" -s "s3://bucket/backup_folder" --volumes-file=backup.json
volume prepare phase
br restore full --type=aws-ebs --prepare --pd "172.16.2.1:2379" -s "s3:///us-west-2/meta/&sk=xx..." --output=topology.json
recovery phase
During the recovery phase, TiDB Operator needs to pass two parameters, --pd and --resolved_ts, to BR:
br restore full --type=aws-ebs --pd "172.16.2.1:2379" -s "s3:///us-west-2/meta/&sk=xx..."
If TiKV does not configure report_min_resolved_ts_interval, the backup fails directly.
If stopping the PD scheduler fails, retry. If multiple retries fail within a few minutes, the entire backup fails: stop taking snapshots, remove the metadata, and delete any snapshots already taken.
Snapshot takes too long or fails
In this version, we simply fail the entire backup. Meanwhile, the snapshots already taken are deleted since the backup is a failure.
If the PD cannot be connected, retry is required, and the retry logic can refer to the existing logic of BR.
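For reference, a minimal Go sketch of retry-with-backoff logic of the kind mentioned above; the attempt count and intervals are illustrative only and do not reflect BR's actual parameters:

```go
// Sketch: retry an operation (e.g. connecting to PD) with exponential
// backoff. Retry counts and intervals here are illustrative only.
package main

import (
	"errors"
	"fmt"
	"time"
)

func withRetry(attempts int, baseDelay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		// Exponential backoff: baseDelay, 2*baseDelay, 4*baseDelay, ...
		time.Sleep(baseDelay << uint(i))
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}

func main() {
	err := withRetry(5, time.Second, func() error {
		// Placeholder for "connect to PD".
		return errors.New("pd unreachable")
	})
	fmt.Println(err)
}
```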
Only v6.3.0 or later is supported.
We prefer using an IAM role for backup and restore; however, the access-key/secret-key way can also be used to launch backup and restore. Here we reuse the TiDB Operator logic to handle the security part.
An extra set of permissions is needed for the EBS backup IAM role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:*",
"s3-object-lambda:*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2:AttachVolume",
"ec2:CreateSnapshot",
"ec2:CreateTags",
"ec2:CreateVolume",
"ec2:DeleteSnapshot",
"ec2:DeleteTags",
"ec2:DeleteVolume",
"ec2:DescribeInstances",
"ec2:DescribeSnapshots",
"ec2:DescribeTags",
"ec2:DescribeVolumes",
"ec2:DetachVolume"
],
"Resource": "*"
}
]
}
Notice: the Actions with the ec2 prefix are required for EBS backup and restore.
All blocking issues have been identified. However, some parts of the design may need spiking during the implementation phase.
Use sync-diff-inspector to do a consistency validation.
Use tidb-snapshot with checksum to validate consistency:
Create a TidbCluster with TiDB Operator.
Launch a TPC-C workload to prepare some data.
Apply an EBS backup YAML and check the Commit-Ts.
Use the Commit-Ts to take a session-level tidb_snapshot.
Run an admin checksum on the database (see the sketch after this list).
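A rough Go sketch of the last two steps (session-level tidb_snapshot plus admin checksum); the DSN, Commit-Ts value, and table name are placeholders:

```go
// Sketch: read at the backup's Commit-Ts via the tidb_snapshot session
// variable, then run ADMIN CHECKSUM TABLE in the same session.
package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Use a single connection so the session variable also applies to the
	// checksum statement.
	conn, err := db.Conn(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	commitTs := "456745777823347" // Commit-Ts reported for the backup (placeholder)
	if _, err := conn.ExecContext(ctx, fmt.Sprintf("SET @@tidb_snapshot = '%s'", commitTs)); err != nil {
		panic(err)
	}

	var dbName, tblName string
	var checksum, totalKvs, totalBytes uint64
	row := conn.QueryRowContext(ctx, "ADMIN CHECKSUM TABLE test.sbtest1")
	if err := row.Scan(&dbName, &tblName, &checksum, &totalKvs, &totalBytes); err != nil {
		panic(err)
	}
	fmt.Printf("checksum=%d total_kvs=%d total_bytes=%d\n", checksum, totalKvs, totalBytes)
}
```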
Set up an environment with LVM disks.
Prepare an LVM snapshot script.
Add --skip-AWS to the TiDB Operator backup and restore YAML.
Change some code in BR for --skip-AWS and run the prepare scripts.
Reference