architecture/design/docdb-automatic-tablet-splitting.md
Tracking GitHub Issue: https://github.com/yugabyte/yugabyte-db/issues/1004
Automatic tablet splitting enables changing the number of tablets (which are splits of data) at runtime. There are a number of scenarios where this is useful:
In use-cases that scan a range of data, the data is stored in the natural sort order (also known as range-sharding). In these usage patterns, it is often impossible to predict a good split boundary ahead of time. For example:
```sql
CREATE TABLE census_stats (
    age     INTEGER,
    user_id INTEGER,
    ...
);
```
In the table above, it is not possible for the database to infer the range of values for age (typically in the 1 to 100 range). It is also impossible to predict the distribution of rows in the table, meaning how many user_id rows will be inserted for each value of age to make an evenly distributed data split. This makes it hard to pick good split points ahead of time.
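As a toy illustration (hypothetical data and code, not part of YugabyteDB), good split boundaries fall out naturally once the data exists, which is exactly what automatic splitting can exploit:

```python
def choose_split_points(sorted_keys, num_tablets):
    """Pick boundaries at even quantiles of the *observed* data -- information
    the database cannot have when the table is first created."""
    n = len(sorted_keys)
    return [sorted_keys[i * n // num_tablets] for i in range(1, num_tablets)]

# Hypothetical, heavily skewed age column: most rows cluster in a narrow band.
ages = sorted([25] * 700 + [40] * 200 + [70] * 100)
print(choose_split_points(ages, 4))  # -> [25, 25, 40], not the naive 25/50/75
```

A static choice such as 25/50/75 would leave most of the data in one tablet, while quantiles of the live data give an even spread.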
In use-cases with a low-cardinality of the primary keys (or the secondary index), hashing is not very effective. For example, if we had a table where the primary key (or index) was the column gender which has only two values Male and Female, hash sharding would not be very effective. However, it is still desirable to use the entire cluster of machines to maximize serving throughput.
This feature is also useful for use-cases where tables begin small and thereby start with a few shards. If these tables grow very large, then nodes continuously get added to the cluster. We may reach a scenario where the number of nodes exceeds the number of tablets. Such cases require tablet splitting to effectively re-balance the cluster.
There are four steps in the lifecycle of tablet splitting. Each of these steps is described below.
The YB-Master continuously monitors tablets and decides when to split a particular tablet. Currently, the data set size across the tablets is used to determine whether any tablet needs to be split; this can be enhanced later to take other factors into account.
Currently, the YB-Master configuration parameter `tablet_split_size_threshold_bytes` is propagated to all YB-TServers by piggybacking on the heartbeat responses. The YB-TServers in turn report a list of tablets whose sizes exceed the `tablet_split_size_threshold_bytes` threshold.
Based on the heartbeats from all the YB-TServers, the YB-Master picks the set of tablets that need to be split. At this point, the split can be initiated.
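The decision step can be sketched as follows (a simplification in Python with made-up structures, not the actual YB-Master code):

```python
# Hypothetical threshold value; the real default comes from the
# tablet_split_size_threshold_bytes flag.
TABLET_SPLIT_SIZE_THRESHOLD_BYTES = 10 * 1024 ** 3

def tablets_to_split(heartbeats):
    """heartbeats: {tserver_id: {tablet_id: sst_size_bytes}} -- the per-leader
    tablet sizes reported back to the master via heartbeats."""
    candidates = set()
    for sizes in heartbeats.values():
        for tablet_id, size in sizes.items():
            if size > TABLET_SPLIT_SIZE_THRESHOLD_BYTES:
                candidates.add(tablet_id)
    return candidates

reports = {"ts1": {"t1": 12 * 1024 ** 3, "t2": 1 * 1024 ** 3},
           "ts2": {"t3": 64 * 1024 ** 3}}
print(tablets_to_split(reports))  # t1 and t3 exceed the threshold
```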
The YB-Master registers two new post-split tablets and increments `SysTablesEntryPB.partitions_version` for the table.
Then, it sends a `SplitTablet()` RPC to the appropriate YB-TServer with the pre-split tablet ID, the post-split tablet IDs, and the split key. Note that a server can split a tablet only if it hosts the leader tablet-peer.
Note: we use two new tablet IDs instead of reusing the old tablet ID plus one new tablet ID; one reason is that the old tablet must remain available during the transition (see the discussion of lagging replicas below).
When the leader tablet server receives the `SplitTablet()` RPC, it adds a special Raft split record containing the split key and the IDs of the two post-split tablets.
Tablet splitting happens at the moment of applying the Raft split record and includes the following steps:
- Split the RocksDB data, both regular and provisional records.
- Duplicate all Raft-related objects.
- Route any new records/operations (received after the split record is added) directly to one of the two new tablets.
Before the split record is applied, the old tablet continues processing read requests in the usual way. Once application of the split record has started, the old tablet rejects new read/write operations; higher layers have to handle this appropriately, update their metadata, and retry against the new tablets.
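This contract can be sketched like so (our own simplified model, not YugabyteDB code): once the split record starts to apply, the old tablet rejects operations, and the higher layer retries against the child whose key range covers the key.

```python
class TabletSplitError(Exception):
    pass

class Tablet:
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper  # visible key range [lower, upper)
        self.split_applied = False
        self.children = []                     # the two post-split tablets
        self.data = {}

    def write(self, key, value):
        if self.split_applied:
            raise TabletSplitError("tablet has been split; retry on children")
        self.data[key] = value

def write_with_retry(tablet, key, value):
    try:
        tablet.write(key, value)
    except TabletSplitError:
        # Higher layer: refresh metadata, then retry on the covering child.
        child = next(c for c in tablet.children if c.lower <= key < c.upper)
        child.write(key, value)

old = Tablet("", "\xff")
old.children = [Tablet("", "m"), Tablet("m", "\xff")]
old.split_applied = True
write_with_retry(old, "q", 1)   # rejected by the old tablet, lands on the right child
```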
Once tablet splitting is complete on the leader of the original pre-split tablet, the master learns about the new tablets from a tablet report embedded into `TSHeartbeatRequestPB`. We can also send this information back in the response to the `TabletServerAdminService.SplitTablet` RPC so that the master learns about the new tablets faster. After leaders are elected for the new tablets, they are switched into the RUNNING state.
We keep the old tablet's Raft group available, but not serving reads/writes, to handle the case when some old-tablet replica has not received the Raft split record and has not been split. For example, the replica was partitioned away before the split record was added to its Raft log, and then rejoins the cluster after the majority has split. There are the following cases:
- The replica rejoins the cluster in less than `log_min_seconds_to_retain` seconds: it will be able to get all Raft log records from the old tablet leader, split the old tablet locally, and then get all Raft log records for the post-split tablets.
- The replica rejoins the cluster in less than `follower_unavailable_considered_failed_sec` but after `log_min_seconds_to_retain` seconds: some of the Raft log records absent on this replica will no longer be available on the old tablet leader, so remote bootstrap will be initiated. (Note: we have logic to prevent the Raft split record itself from being GCed.)
- The replica rejoins the cluster after `follower_unavailable_considered_failed_sec`: in this case, the replica is considered failed and is evicted from the Raft group, so we no longer need to hold the old tablet.

Note: by default `follower_unavailable_considered_failed_sec` = `log_min_seconds_to_retain`, but these flags can be adjusted.
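The three cases can be summarized in a small decision sketch (the flag names come from this document; the values below are hypothetical and deliberately unequal so that all three branches are reachable, unlike the defaults):

```python
# Hypothetical, non-default values (by default the two flags are equal).
LOG_MIN_SECONDS_TO_RETAIN = 300
FOLLOWER_UNAVAILABLE_CONSIDERED_FAILED_SEC = 900

def rejoin_action(downtime_sec):
    """What happens to a replica that missed the split and rejoins later."""
    if downtime_sec >= FOLLOWER_UNAVAILABLE_CONSIDERED_FAILED_SEC:
        return "evicted"          # replica was removed from the Raft group
    if downtime_sec < LOG_MIN_SECONDS_TO_RETAIN:
        return "replay_raft_log"  # split record still in the leader's log
    return "remote_bootstrap"     # needed log records were already GCed
```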
We copy the RocksDB to an additional directory using hard links and add metadata saying that only part of the key range is visible. Remote bootstrap will work right away. The next major compaction will remove all key-value pairs that no longer belong to the new tablet due to the split. Later, we can implement cutting RocksDB without a full compaction (see below in this section).
We store the tablet's key bounds inside the `KvStoreInfo` tablet metadata, and `BoundedRocksDbIterator` filters out non-relevant keys. We can also propagate the key bounds to the regular RocksDB instance so that it knows about non-relevant keys, which lets us implement RocksDB optimizations such as quickly truncating SST files even before the full compaction.
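A minimal sketch of the bounds-filtering idea behind `BoundedRocksDbIterator` (simplified Python; the real iterator wraps RocksDB): both post-split tablets share the same hard-linked files, and each tablet's metadata restricts the visible key range.

```python
class BoundedIterator:
    """Iterate only keys within the tablet's bounds over shared, unsplit data."""
    def __init__(self, kv_pairs, lower, upper):
        self.kvs = sorted(kv_pairs)            # stands in for the shared SSTs
        self.lower, self.upper = lower, upper  # from the tablet metadata

    def __iter__(self):
        return ((k, v) for k, v in self.kvs if self.lower <= k < self.upper)

shared = [("a", 1), ("c", 2), ("m", 3), ("x", 4)]  # hard-linked by both tablets
left = list(BoundedIterator(shared, "", "m"))
right = list(BoundedIterator(shared, "m", "\xff"))
print(left, right)  # [('a', 1), ('c', 2)] [('m', 3), ('x', 4)]
```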
Performance note: remote bootstrap could download irrelevant data from the new tablet if it happens before the compaction.
Snapshots (created by Tablet::CreateSnapshot)
Cutting of RocksDB without full compaction
We have the following distinct types of data in the provisional records (intents) store:
- `SubDocKey` (no `HybridTime`) + `IntentType` + `HybridTime` -> `TxnId` + value of the provisional record
- `TxnId` -> status tablet ID + isolation level
- `TxnId` + `HybridTime` -> main provisional record data key

`TxnId`, `IntentType`, and `HybridTime` are all prefixed with the appropriate value type.
The reverse index by transaction ID is used for getting all provisional records in the tablet related to a particular transaction.
We can't split the provisional records DB at the RocksDB level by some mid-key, because the reverse-index metadata is not sorted by the original key. Instead, we can simply duplicate the provisional records DB and filter by original key inside `docdb::PrepareApplyIntentsBatch`, which is the only function using the provisional records reverse index. Filtering of the main provisional record data will be done inside `BoundedRocksDbIterator`.
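The filtering approach can be sketched as follows (simplified stand-in structures; in YugabyteDB the real logic lives in `docdb::PrepareApplyIntentsBatch`):

```python
def intents_for_txn(reverse_index, txn_id, lower, upper):
    """Walk the reverse index for one transaction and keep only intents whose
    original key falls inside this post-split tablet's bounds -- the index
    itself is not sorted by original key, so we filter rather than split."""
    return [k for k in reverse_index.get(txn_id, []) if lower <= k < upper]

# Both post-split tablets hold a full copy of the index; each sees only
# the intents inside its own key range.
reverse_index = {"txn-1": ["apple", "mango", "zebra"]}
print(intents_for_txn(reverse_index, "txn-1", "", "m"))  # -> ['apple']
```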
`SysTablesEntryPB.partitions_version` is included in `GetTableLocationsResponsePB`.
`YBTable::partitions_` is first populated inside `YBTable::Open`. We refresh it when `MetaCache::ProcessTabletLocations` gets a newer `partitions_version` from `GetTableLocationsResponsePB`; when `YBTable::partitions_` is updated, we also update the meta cache.

We should disallow tablet splitting when the split is requested from a tablet leader that is remote-bootstrapping one of the nodes. The bulk load tool relies on the partitioning being fixed during the load process, so we decided to pre-split and disable dynamic tablet splitting during bulk load.
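The client-side refresh can be sketched as follows (names follow this document; the code itself is an illustrative stand-in for the C++ client):

```python
class TableHandle:
    """Minimal stand-in for the client's view of a table's partitions."""
    def __init__(self):
        self.partitions_version = 0
        self.partitions = []

    def process_tablet_locations(self, resp_version, resp_partitions, meta_cache):
        # Refresh only when the master reports a newer partitions_version.
        if resp_version > self.partitions_version:
            self.partitions_version = resp_version
            self.partitions = resp_partitions
            meta_cache.clear()  # force re-lookup of stale tablet locations

table = TableHandle()
cache = {"t1": "ts1"}                              # stale location entry
table.process_tablet_locations(1, ["", "m"], cache)
print(table.partitions, cache)                     # -> ['', 'm'] {}
```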