docs/design/2022-09-29-reorganize-partition.md
Support ALTER TABLE t REORGANIZE PARTITION p1,p2 INTO (partition pNew1 values...)
TiDB currently lacks support for changing the partitions of a partitioned table; it only supports adding and dropping LIST/RANGE partitions. Supporting REORGANIZE PARTITION will allow a RANGE partitioned table to keep a MAXVALUE partition that catches all values and later split it into new ranges. The same applies to LIST partitions, where one can split or merge different partitions.
When this is implemented, it will also allow future PRs for transforming a non-partitioned table into a partitioned table, removing the partitioning to make a partitioned table a normal non-partitioned table, and COALESCE PARTITION and ADD PARTITION for HASH partitioned tables; these are different ALTER statements but can use the same implementation as REORGANIZE PARTITION.
The operation should be online, and must handle multiple partitions as well as large data sets.
Possible usage scenarios:
These different use cases can have different optimizations, but the generic form must still be solved:
The first implementation should be based on merge-txn transactional batches (row-by-row batch read, update of the record key with the new physical table ID, write), and then create the indexes in batches, index by index and partition by partition. Later we can implement the ingest (Lightning-style) optimization, since the DDL module is evolving to run reorg tasks more efficiently.
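Below is a minimal sketch of the merge-txn copy loop described above, assuming simplified stand-in abstractions (batchReader, batchWriter, locator) instead of TiDB's real kv/tablecodec/backfillWorker APIs; it only illustrates the batching and the rewrite of the physical table ID:

```go
// RowRecord is a copied row: the handle and row data stay unchanged, only the
// physical table ID in the record key changes.
type RowRecord struct {
	Handle int64
	Data   []byte
}

// batchReader reads up to batchSize rows of the old partition starting at
// startHandle; an empty result means the partition is fully copied.
type batchReader interface {
	FetchBatch(oldPartitionID, startHandle int64, batchSize int) (rows []RowRecord, next int64, err error)
}

// batchWriter is one transactional batch (merge-txn style).
type batchWriter interface {
	WriteRecord(newPartitionID, handle int64, data []byte) error
	Commit() error
}

// locator re-evaluates the partition expression on a row and returns the
// physical table ID of the new partition the row belongs to.
type locator func(data []byte) (newPartitionID int64, err error)

// copyPartitionData copies all rows of one old partition into the new
// partitions, committing batch by batch so the operation stays online.
func copyPartitionData(r batchReader, beginTxn func() batchWriter, locate locator,
	oldPartitionID int64, batchSize int) error {
	start := int64(0)
	for {
		rows, next, err := r.FetchBatch(oldPartitionID, start, batchSize)
		if err != nil {
			return err
		}
		if len(rows) == 0 {
			return nil // all rows copied
		}
		txn := beginTxn()
		for _, row := range rows {
			newID, err := locate(row.Data)
			if err != nil {
				return err
			}
			if err := txn.WriteRecord(newID, row.Handle, row.Data); err != nil {
				return err
			}
		}
		if err := txn.Commit(); err != nil {
			return err
		}
		start = next
	}
}
```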
There are two parts of the design:
The schema change states below clarify which steps are done in which schema state transitions.
Since this operation will:
It will use all these schema change stages:
// StateNone means this schema element is absent and can't be used.
StateNone SchemaState = iota
- Check if the table structure after the ALTER is valid
- Use the pre-generated physical table IDs for each new partition (already generated by the client sending the ALTER command).
- Update the metadata with the new partitions (AddingDefinitions) and the partitions to be dropped (DroppingDefinitions), so that new transactions can double write (see the sketch after this list).
- Set placement rules
- Set TiFlash Replicas
- Set legacy Bundles (non-sql placement)
- Set the state to StateDeleteOnly
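A minimal sketch of the metadata staging done in StateNone, using simplified stand-ins for model.TableInfo/model.PartitionInfo; the field names AddingDefinitions and DroppingDefinitions follow the design above, everything else is illustrative:

```go
// partitionDef is a simplified partition definition.
type partitionDef struct {
	ID   int64 // pre-generated physical table ID
	Name string
}

// partitionInfo is a simplified stand-in for the table's partition metadata.
type partitionInfo struct {
	Definitions         []partitionDef // currently visible (public) partitions
	AddingDefinitions   []partitionDef // new partitions: double written, not yet readable
	DroppingDefinitions []partitionDef // old partitions: to be replaced and later deleted
}

// stageReorganizePartition records which partitions are being replaced and
// which new ones are being added, without changing the visible definitions.
// Placement rules and TiFlash replica rules for the new partition IDs would
// also be set here, before moving the job to StateDeleteOnly.
func stageReorganizePartition(pi *partitionInfo, reorganizedNames map[string]bool, newDefs []partitionDef) {
	pi.AddingDefinitions = append(pi.AddingDefinitions, newDefs...)
	for _, def := range pi.Definitions {
		if reorganizedNames[def.Name] {
			pi.DroppingDefinitions = append(pi.DroppingDefinitions, def)
		}
	}
}
```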
// StateDeleteOnly means we can only delete items for this schema element (the new partition).
StateDeleteOnly
- Set the state to StateWriteOnly
// StateWriteOnly means we can use any write operation on this schema element,
// but outer can't read the changed data.
StateWriteOnly
- Set the state to StateWriteReorganization
// StateWriteReorganization means we are re-organizing whole data after write only state.
StateWriteReorganization
- Copy the data from the partitions to be dropped (one at a time) and insert it into the new partitions. This needs a new backfillWorker implementation.
- Recreate the indexes one by one for the new partitions (one partition at a time), creating an element for each index and reusing the addIndexWorker; see the sketch after this list. (Note: this can be optimized in the future, either with the new fast add index implementation based on Lightning, or by writing the index entries at the same time as the records in the previous step, or when the partitioning columns are included in the index or handle.)
- Replace the old partitions with the new partitions in the metadata when the data copying is done
- Set the state to StateDeleteReorganization
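A minimal sketch of the index recreation loop; backfillOneIndex stands in for whatever the reused addIndexWorker machinery ends up looking like, and each (partition, index) pair is treated as its own resumable reorg element:

```go
// backfillIndexes recreates all indexes for every new partition, one index and
// one partition at a time, so progress can be tracked and resumed per element.
func backfillIndexes(newPartitionIDs, indexIDs []int64,
	backfillOneIndex func(partitionID, indexID int64) error) error {
	for _, pid := range newPartitionIDs {
		for _, idxID := range indexIDs {
			if err := backfillOneIndex(pid, idxID); err != nil {
				return err
			}
		}
	}
	return nil
}
```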
// StateDeleteReorganization means we are re-organizing whole data after delete only state.
StateDeleteReorganization - we are using this state in a slightly different way than the comment above says.
This state is needed since we cannot directly move from StateWriteReorganization to StatePublic.
Imagine that StateWriteReorganization is complete and we are updating the schema version. If a transaction that sees the new schema version writes to the new partitions, those changes need to be written to the old partitions as well, so that transactions in other nodes still using the older schema version can see them.
- Remove the notion of new partitions (AddingDefinitions) and partitions to be dropped (DroppingDefinitions); double writing will stop when the state goes to StatePublic.
- Register the range delete of the old partition data (in finishJob / deleteRange).
- Set the state to StatePublic
// StatePublic means this schema element is ok for all write and read operations.
StatePublic
- Table structure is now complete and the table is ready to use with its new partitioning scheme
- Note that there is a background job for the GCWorker to do in its deleteRange function.
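A minimal sketch of what registering the range delete means: each replaced partition's full key range, derived from its physical table ID, is queued so the GC worker's deleteRange can remove the data asynchronously. keyRange and enqueue are illustrative stand-ins for TiDB's key encoding and the delete-range job queue:

```go
// registerOldPartitionDeleteRanges queues one delete range per replaced
// partition; the range covers all records and indexes of that partition.
func registerOldPartitionDeleteRanges(
	jobID int64,
	droppedPartitionIDs []int64,
	keyRange func(physicalTableID int64) (start, end []byte),
	enqueue func(jobID int64, start, end []byte) error,
) error {
	for _, pid := range droppedPartitionIDs {
		start, end := keyRange(pid)
		if err := enqueue(jobID, start, end); err != nil {
			return err
		}
	}
	return nil
}
```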
While the reorganization happens in the background, the normal write path needs to check whether there are any new partitions in the metadata and whether the inserted/updated/deleted row would match a new partition; if so, it must do the same operation in the new partition, just like the add index and modify column operations currently do. (To be implemented in (*partitionedTable) AddRecord/UpdateRecord/RemoveRecord.)
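A minimal sketch of that double-write check, with illustrative function parameters instead of the real (*partitionedTable) methods; "staged" here means the partitions that are visible only to the adjacent schema version (AddingDefinitions before StatePublic, DroppingDefinitions during StateDeleteReorganization):

```go
// addRecordWithDoubleWrite writes a row into the partition selected by the
// public definitions and, if the row also maps to a staged partition, applies
// the same write there so readers on the adjacent schema version see it.
func addRecordWithDoubleWrite(
	row []byte, // encoded row
	locatePublic func(row []byte) (partitionID int64, err error),
	locateStaged func(row []byte) (partitionID int64, ok bool, err error),
	writeRecord func(partitionID int64, row []byte) error,
) error {
	pid, err := locatePublic(row)
	if err != nil {
		return err
	}
	if err := writeRecord(pid, row); err != nil {
		return err
	}
	stagedPID, ok, err := locateStaged(row)
	if err != nil {
		return err
	}
	if ok && stagedPID != pid {
		return writeRecord(stagedPID, row)
	}
	return nil
}
```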
Example of why an extra state between StateWriteReorganize and StatePublic is needed:
-- table:
CREATE TABLE t (a int) PARTITION BY LIST (a) (PARTITION p0 VALUES IN (1,2,3,4,5), PARTITION p1 VALUES IN (6,7,8,9,10));
-- during alter operation:
ALTER TABLE t REORGANIZE PARTITION p0 INTO (PARTITION p0a VALUES IN (1,2,3), PARTITION p0b VALUES IN (4,5));
In the tables below, a partition within parentheses, such as (p0 []), is hidden or waiting to be deleted by GC/DeleteRange. The values within the brackets are the rows in that partition, so p0a [2] means partition p0a contains the row with value 2.
If we go directly from StateWriteReorganize to StatePublic, then clients one schema version behind will not see changes to the new partitions:
| Data (TiKV/Unistore) | TiDB client 1 | TiDB client 2 |
|---|---|---|
| p0 [] p1 [] StateWriteReorganize | | |
| p0 [] p1 [] (p0a [] p0b []) | | |
| (p0 []) p1 [] p0a [] p0b [] StatePublic | | |
| (p0 []) p1 [] p0a [2] p0b [] | StatePublic INSERT INTO t VALUES (2) | |
| (p0 []) p1 [] p0a [2] p0b [] | | StateWriteReorganize SELECT * FROM t => [] (only sees p0,p1) |
But if we add a state between StateWriteReorganize and StatePublic, and double write to the old partitions during that state, it works:
| Data (TiKV/Unistore) | TiDB client 1 | TiDB client 2 |
|---|---|---|
| p0 [] p1 [] (p0a [] p0b []) StateWriteReorganize | | |
| (p0 []) p1 [] p0a [] p0b [] StateDeleteReorganize | | |
| (p0 [2]) p1 [] p0a [2] p0b [] | StateDeleteReorganize INSERT INTO t VALUES (2) | |
| (p0 [2]) p1 [] p0a [2] p0b [] | | StateWriteReorganize SELECT * FROM t => [2] (only sees p0,p1) |
| (p0 [2]) p1 [] p0a [2] p0b [4] StatePublic | | |
| (p0 [2]) p1 [] p0a [2] p0b [4] | StatePublic INSERT INTO t VALUES (4) | |
| (p0 [2]) p1 [] p0a [2] p0b [4] | | StateDeleteReorganize SELECT * FROM t => [2,4] (sees p0a,p0b,p1) |
If any non-retryable error occurs, we will call onDropTablePartition and adjust the logic in that function to also handle the rollback of REORGANIZE PARTITION, in a similar way as it already handles model.ActionAddTablePartition.
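A minimal sketch of that rollback, reusing the simplified partitionInfo type from the StateNone sketch above; the registerDeleteRange parameter is illustrative. The idea is that the new partitions were never public, so clearing the staged metadata and queueing their data for delete-range restores the pre-ALTER state:

```go
// rollbackReorganizePartition undoes a failed reorganize: the half-built new
// partitions are queued for asynchronous deletion and the staged metadata is
// cleared, leaving the original partitions untouched.
func rollbackReorganizePartition(pi *partitionInfo,
	registerDeleteRange func(partitionIDs []int64) error) error {
	newIDs := make([]int64, 0, len(pi.AddingDefinitions))
	for _, def := range pi.AddingDefinitions {
		newIDs = append(newIDs, def.ID)
	}
	if err := registerDeleteRange(newIDs); err != nil {
		return err
	}
	pi.AddingDefinitions = nil
	pi.DroppingDefinitions = nil
	return nil
}
```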
Note that parser support already exists. There should be no issues with upgrading, and downgrade will not be supported during the DDL.
Notes:
Re-use tests from other DDLs, like MODIFY COLUMN, but adjust them for REORGANIZE PARTITION. A separate test plan will be created, and a test report will be written and signed off when the tests are completed.
Correctness and functionality are a higher priority than performance.
Impacts:
Risks: