src/doc/rgw/multisite-reshard.md
A layout describes a set of rados objects, along with some strategy to distribute things across them. A bucket index layout distributes object names across some number of shards via ceph_str_hash_linux(). Resharding a bucket enacts a transition from one such layout to another. Each layout could represent data differently. For example, a bucket index layout would be used with cls_rgw to write/delete keys, whereas a datalog layout may be used with cls_log to append and trim log entries, then later transition to a layout based on some other primitive like cls_queue or cls_fifo.
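As a rough illustration of what a hash-based index layout does, the sketch below maps object names to shards and regroups them under a new shard count. It is only a sketch: `zlib.crc32` stands in for ceph_str_hash_linux(), and the names `shard_for_key` and `reshard` are hypothetical, not RGW functions.

```python
import zlib


def shard_for_key(object_name: str, num_shards: int) -> int:
    """Pick a shard for an object name.

    zlib.crc32 is a stand-in hash chosen for this sketch; RGW uses
    ceph_str_hash_linux(), which produces different values.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return zlib.crc32(object_name.encode("utf-8")) % num_shards


def reshard(keys, new_num_shards: int):
    """Regroup keys by the shard they map to under the new layout."""
    shards = {i: [] for i in range(new_num_shards)}
    for key in keys:
        shards[shard_for_key(key, new_num_shards)].append(key)
    return shards


if __name__ == "__main__":
    keys = [f"photos/img-{n}.jpg" for n in range(20)]
    for shard_id, names in reshard(keys, 4).items():
        print(shard_id, len(names))
```

Because the shard index depends on the shard count, changing the count moves most keys, which is why a reshard is modeled as a transition between two complete layouts rather than an in-place tweak.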
To reshard a bucket, we currently create a new bucket instance with the desired sharding layout, and switch to that instance when resharding completes. In multisite, though, the metadata master zone is authoritative for all bucket metadata, including the sharding layout and reshard status. Any changes to metadata must take place on the metadata master zone and replicate from there to other zones.
If we want to allow each zone to manage its bucket sharding independently, we can't allow them each to create a new bucket instance, because data sync relies on the consistency of instance ids between zones. We also can't allow metadata sync to overwrite our local sharding information with the metadata master's copy.
That means that the bucket's sharding information needs to be kept private to the local zone's bucket instance, and that information also needs to track all reshard status that's currently spread between the old and new bucket instance metadata: old shard layout, new shard layout, and current reshard progress. To make this information private, we can just prevent metadata sync from overwriting these fields.
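One way to picture "preventing metadata sync from overwriting these fields" is a merge step that takes the metadata master's copy but carries the local sharding fields over unchanged. The sketch below assumes a simplified record; `BucketInstanceInfo`, its field names, and `apply_metadata_sync` are hypothetical and do not mirror the real RGW structures.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class BucketInstanceInfo:
    # Replicated fields (authoritative on the metadata master zone).
    instance_id: str
    owner: str
    # Zone-private sharding fields (hypothetical names for this sketch).
    current_num_shards: int = 1
    target_num_shards: Optional[int] = None
    reshard_status: str = "none"


# Fields that metadata sync must never overwrite with the master's copy.
LOCAL_ONLY_FIELDS = ("current_num_shards", "target_num_shards", "reshard_status")


def apply_metadata_sync(local: BucketInstanceInfo,
                        remote: BucketInstanceInfo) -> BucketInstanceInfo:
    """Accept the master's copy, but keep the local zone's sharding state."""
    preserved = {name: getattr(local, name) for name in LOCAL_ONLY_FIELDS}
    return replace(remote, **preserved)


if __name__ == "__main__":
    local = BucketInstanceInfo("abc.1", "alice", 11, 23, "in-progress")
    remote = BucketInstanceInfo("abc.1", "bob", 4)
    print(apply_metadata_sync(local, remote))
```

In this sketch the owner change from the master is applied while the local shard counts and reshard progress survive the sync.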
This change also affects the rados object names of the bucket index shards, currently of the form .dir.<instance-id>.<shard-id>. Since we need to represent multiple sharding layouts for a single instance-id, we need to add some unique identifier to the object names. This comes in the form of a generation number, incremented with each reshard, like .dir.<instance-id>.<generation>.<shard-id>. The first generation number 0 would be omitted from the object names for backward compatibility.
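The naming rule above can be sketched as a small helper. The function name `index_shard_oid` is hypothetical; the sketch only encodes the two name forms given in the text, with generation 0 omitted for backward compatibility.

```python
def index_shard_oid(instance_id: str, generation: int, shard_id: int) -> str:
    """Build the rados object name for one bucket index shard.

    Generation 0 keeps the legacy form .dir.<instance-id>.<shard-id>;
    later generations use .dir.<instance-id>.<generation>.<shard-id>.
    """
    if generation < 0 or shard_id < 0:
        raise ValueError("generation and shard_id must be non-negative")
    if generation == 0:
        return f".dir.{instance_id}.{shard_id}"
    return f".dir.{instance_id}.{generation}.{shard_id}"


if __name__ == "__main__":
    print(index_shard_oid("abc.1", 0, 3))
    print(index_shard_oid("abc.1", 2, 3))
```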
The bucket replication logs for multisite are stored in the same bucket index shards as the keys that they modify. However, we can't reshard these log entries like we do with normal keys, because other zones need to track their position in the logs. If we shuffle the log entries around between shards, other zones no longer have a way to associate their old shard marker positions with the new shards, and their only recourse would be to restart a full sync. So when resharding buckets, we need to preserve the old bucket index logs so that other zones can finish processing their log entries, while any new events are recorded in the new bucket index logs.
An additional goal is to move replication logs out of omap (so out of the bucket index) into separate rados objects. To enable this, the bucket instance metadata should be able to describe a bucket whose index layout is different from its log layout. For existing buckets, the two layouts would be identical and share the bucket index objects. Alternate log layouts are otherwise out of scope for this design.
To support peer zones that are still processing old logs, the local bucket instance metadata must track the history of all log layouts that haven't been fully trimmed yet. Once bilog trimming advances past an old generation, it can delete the associated rados objects and remove that layout from the bucket instance metadata. To prevent this history from growing too large, we can refuse to reshard bucket index logs until trimming catches up.
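The bookkeeping described here could look roughly like the following. This is a minimal sketch under stated assumptions: `LogLayout`, `LogHistory`, and the `max_generations` cap are hypothetical, and the design does not specify how "too large" is decided.

```python
from dataclasses import dataclass


@dataclass
class LogLayout:
    generation: int
    num_shards: int


class LogHistory:
    """History of log layouts that have not been fully trimmed yet."""

    def __init__(self, num_shards: int, max_generations: int = 4):
        self.layouts = [LogLayout(0, num_shards)]
        # Assumed limit on untrimmed generations; not specified by the design.
        self.max_generations = max_generations

    def current(self) -> LogLayout:
        # New events are always recorded in the newest layout.
        return self.layouts[-1]

    def reshard(self, new_num_shards: int) -> LogLayout:
        """Start a new log generation; old generations stay for peer zones."""
        if len(self.layouts) >= self.max_generations:
            raise RuntimeError("log trimming has not caught up; refusing to reshard")
        layout = LogLayout(self.current().generation + 1, new_num_shards)
        self.layouts.append(layout)
        return layout

    def trim_through(self, generation: int):
        """Remove old layouts up to `generation`, never the current one.

        Returns the removed layouts so the caller can delete their objects.
        """
        current = self.current()
        removed = [l for l in self.layouts
                   if l.generation <= generation and l is not current]
        self.layouts = [l for l in self.layouts
                        if l.generation > generation or l is current]
        return removed


if __name__ == "__main__":
    history = LogHistory(num_shards=4, max_generations=2)
    history.reshard(8)
    print([l.generation for l in history.layouts])
    print([l.generation for l in history.trim_through(0)])
```

Keeping the current layout out of reach of `trim_through` reflects the idea that only fully processed old generations are ever deleted.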
The distinction between index layout and log layout is important, because incremental sync only cares about changes to the log layout. Changes to the index layout would only affect full sync, which uses a custom RGWListBucket extension to list the objects of each index shard separately. But by changing the scope of full sync from per-bucket-shard to per-bucket and using a normal bucket listing to get all objects, we can make full sync independent of the index layout. And once the replication logs are moved out of the bucket index, dynamic resharding is free to change the index layout as much as it wants with no effect on multisite replication.
Modify the existing state machine for bucket reshard to mutate its existing bucket instance instead of creating a new one.
Add fields for log layout. When resharding a bucket whose logs are in the index:
Reframe in terms of log generations, instead of handling SYNCSTOP events with a special Stopped state: