docs/design/2022-09-19-distributed-ddl-reorg.md
This document describes distributed processing for the DDL reorg phase. The current design is based on the premise that only the DDL owner can handle DDL jobs. For jobs with a reorg phase, however, the expectation is that every TiDB node can claim reorg subtasks based on its resource usage.
At present, TiDB already supports parallel processing of DDL jobs on the owner. However, the resources of a single TiDB node are limited: even with a parallel framework, DDL execution speed is capped, and DDL competes for resources with daily operations, affecting metrics such as TiDB's TPS.
DDL jobs can be divided into general jobs and reorg jobs. Correspondingly, improving DDL performance can be divided into improving the performance of all DDL jobs (the time consumed by each schema state change, checking that all TiDB nodes have updated the schema state successfully, etc.) and improving the performance of the reorg stage. The reorg stage is clearly the most time-consuming and resource-consuming one.
Considering the goals of significantly improving DDL performance and TiDB resource utilization, while keeping the design and development relatively stable, we will make the DDL reorg stage distributed.
Take adding an index as an example of the reorg-stage processing logic on the master branch (that is, without the lightning optimization). The simplified steps the owner performs in the reorg stage of an add-index operation:
The reorg worker and backfill worker in this scenario are completely decoupled, i.e., the two roles are not related.
Backfill workers build the associated worker pool to handle subtasks (the small tasks that a DDL job is split into during the reorg phase).
The overall process proposed in this document is roughly as follows:
The existing table structures may be insufficient, so new metadata needs to be added or defined.
Add a new field to the DDLReorgMeta structure in the mysql.tidb_ddl_job table, for example:
```go
type DDLReorgMeta struct {
	... // Some of the original fields
	IsDistReorg bool // Whether to do dist-reorg
}
```
Storing all subtask information in the tidb_ddl_reorg.reorg field could cause lock contention, so subtask information is instead stored in the mysql.tidb_background_subtask table, with the following structure:
```
+-------------------+--------------+------+-----+-------------------------------------+
| Field             | Type         | Null | Key | Comment                             |
+-------------------+--------------+------+-----+-------------------------------------+
| id                | bigint(20)   | NO   | PK  | auto-increment                      |
| namespace         | varchar(256) | NO   | MUL |                                     |
| key               | varchar(256) | NO   | MUL | ele_key, ele_id, ddl_job_id, sub_id |
| ddl_physical_tid  | bigint(20)   | NO   |     |                                     |
| type              | int          | NO   |     | e.g. the ddl_addIndex type          |
| exec_id           | varchar(256) | YES  |     |                                     |
| exec_expired      | timestamp    | YES  |     | TSO                                 |
| state             | varchar(64)  | YES  |     |                                     |
| checkpoint        | longblob     | YES  |     |                                     |
| start_time        | bigint(20)   | YES  |     |                                     |
| state_update_time | bigint(20)   | YES  |     |                                     |
| meta              | longblob     | YES  |     |                                     |
+-------------------+--------------+------+-----+-------------------------------------+
```
Add the following to the BackfillMeta structure:
```go
type BackfillMeta struct {
	CurrKey    kv.Key
	StartKey   kv.Key
	EndKey     kv.Key
	EndInclude bool
	ReorgTp    ReorgType
	...
	*JobMeta // parent job meta
}
```
Add a mysql.tidb_background_subtask_history table to record completed subtasks (including those that ended in a failure state). Its structure is the same as tidb_background_subtask. Considering the number of subtasks, records in the history table are deleted periodically at a later stage.
The general process is simply divided into two parts:
Regarding step 1.b, the current plan is for the reorg worker to check subtask completion periodically via a timer; synchronizing subtask completion through PD, so that completion is pushed rather than polled, can be considered later.
Rules for backfill workers to claim subtasks:
Later, more flexible task segmentation and claim assignment can be supported.
Subtask claim notification method:
Adjust the backfiller and backfillWorker to update their interfaces and make them more explicit and generic when fetching and processing tasks.
backfiller interfaces:

```go
// backfiller existing interfaces:
func BackfillDataInTxn(handleRange reorgBackfillTask) (taskCtx backfillTaskContext, errInTxn error)
func AddMetricInfo(float64)

// backfiller new interfaces:
// get batch tasks
func GetTasks() ([]*BackfillJob, error)
// update task
func UpdateTask(bfJob *BackfillJob) error
func FinishTask(bfJob *BackfillJob) error
// get the backfill context
func GetCtx() *backfillCtx
func String() string
```
backfillWorker:

```go
// In the current implementation, results are passed between the reorg worker
// and the backfill worker via a channel, and tasks run by calling `run`.
// In the new framework, two situations need to be adapted:
// 1. As before, transfer via channel to the reorg worker within the same
//    TiDB-server.
// 2. New: transfer through system tables to reorg workers on different
//    TiDB-servers.
// For early compatibility, implement the two adaptations separately, i.e.,
// keep the original `run` function for case 1 and add `runTask` for case 2.
func (w *backfillWorker) runTask(task *reorgBackfillTask) (result *backfillResult)

// update the reorg subtask's exec_id and exec_lease
func (w *backfillWorker) updateLease(bfJob *BackfillJob) error
func (w *backfillWorker) releaseLease()

// return backfiller-related info
func (w *backfillWorker) String() string
```
WorkerPool (to be unified with the existing WorkerPool later). In the current scheme, the backfill worker obtains subtasks, and the reorg worker checks whether subtasks are completed through regular polling. Combining this with PD watches for communication is under consideration.
When a network partition occurs, or the TiDB node hosting a backfill worker exits abnormally, the worker may leave its subtask unhandled. The current scheme tentatively uses a lease to mark whether the current executor that owns a subtask is still valid; more suitable schemes can be discussed later. The specific operation of this scheme:
During the reorg stage, an error that occurs while backfilling is handled as follows:
When the user executes `admin cancel ddl job`, the job is marked as canceling, as in the original logic. The reorg worker on the DDL owner checks this field, and on finding it canceling, follows a process similar to steps 3-6 of the Failed case.
Since subtasks may be segmented per table region, the mysql.tidb_background_subtask_history table may become particularly large, so a periodic cleanup function needs to be added.
In the first stage, the row count of the entire DDL job can be calculated from the row counts recorded inside the subtasks. The display then remains the same as the original logic.
Subsequent progress can be displayed more intuitively, e.g., as percentages, allowing users to better understand the progress of the reorg phase.
Update and add some new logs and metrics.