docs/rfcs/031-sharding-static.md
To enable databases with sizes approaching the capacity of a pageserver's disk, it is necessary to break up the storage for the database, or shard it.
Sharding in general is a complex area. This RFC aims to define an initial capability that will permit creating large-capacity databases using a static configuration defined at time of Tenant creation.
In Neon, all data for a Tenant, including all its timelines, is currently stored on a single pageserver. The local storage required may be several times larger than the actual database size, due to LSM write inflation. If a database is larger than one pageserver's disk, the pageserver cannot hold it in local storage, as it must do to provide service to clients.
Prior art in other distributed systems is too broad to capture here: pretty much any scale out storage system does something like this.
Note: the definition of 'large database' is arbitrary, but the lower bound is to ensure that a database that a user might create on a current-gen enterprise SSD should also work well on Neon. The upper bound is whatever postgres can handle: i.e. we must make sure that the pageserver backend is not the limiting factor in the database size.
Impacted components: pageserver, control plane, postgres/smgr.
Key: a postgres page number, qualified by relation. In the sense that the pageserver is a versioned key-value store,
the page number is the key in that store. Key is a literal data type in existing code.
LSN dimension: this just means the range of LSNs (history), when talking about the range of keys and LSNs as a two dimensional space.
When we think of sharding across the two dimensional key/lsn space, this is an opportunity to think about how the two dimensions differ:
The key sharding is the harder part, and also the more urgent one, to support larger-capacity databases. Because distributing historical LSN read work is a relatively simpler problem that most users don't have, we defer it to future work. It is anticipated that some quite simple P2P offload model will enable distributing work for historical reads: a node which is low on space can call out to a peer to ask it to download and serve reads from a historical layer.
Having decided to focus on key sharding, we must next decide how we will map keys to shards. It is proposed to use a "wide striping" approach, to obtain a good compromise between data locality and avoiding entire large relations mapping to the same shard.
We will define two spaces: the key space (postgres pages, defined below) and the space of shards onto which keys are mapped.
Keys are currently defined in the pageserver's getpage@lsn interface as follows:
```rust
pub struct Key {
    pub field1: u8,
    pub field2: u32,
    pub field3: u32,
    pub field4: u32,
    pub field5: u8,
    pub field6: u32,
}

fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
    Key {
        field1: 0x00,
        field2: rel.spcnode,
        field3: rel.dbnode,
        field4: rel.relnode,
        field5: rel.forknum,
        field6: blknum,
    }
}
```
Note: keys for relation metadata are ignored here, as this data will be mirrored to all shards. For distribution purposes, we only care about user data keys.
The properties we want from our Key->Shard mapping are:
- Locality in blknum, such that adjacent blknum values will usually map to
  the same stripe and consequently land on the same shard, even though the overall
  collection of blocks in a relation will be spread over many stripes and therefore
  many shards.
- Uniform distribution: even if there are patterns in relnode values, these do not manifest as patterns
  in data placement.

To accomplish this, the blknum is used to select a stripe, and stripes are assigned to shards in a pseudorandom order via a hash. The motivation for pseudo-random distribution (rather than sequential mapping of stripe to shard) is to avoid I/O hotspots when sequentially reading multiple relations: we don't want all relations' stripes to touch pageservers in the same order.
To map a Key to a shard:
- Hash the Key field 4 (relNode).
- Divide the block number (blknum) field by the stripe size in pages, and combine the
  hash of this with the hash from the previous step.
- Take the combined hash modulo the shard count to select the shard (see the sketch below).

Why don't we use the other fields in the Key? In particular, we do not use
forknum for key mapping, because it distinguishes different classes of data
in the same relation, and we would like to keep the data in a relation together.

For example, consider extreme cases of postgres data layout in a system with 8 shards and a stripe size of 32k pages:
- For a very large relation (e.g. 1TiB, i.e. 128Mi 8KiB pages), the blknum division will break the data up into 4096
  stripes, which will be scattered across the shards.

Data placement will be more uneven on smaller databases.
These uneven cases for small amounts of data do not matter, as long as the stripe size is an order of magnitude smaller than the amount of data we are comfortable holding in a single shard: if our system handles shard sizes up to 10-100GB, then it is not an issue if a tenant has some shards with 256MB size and some shards with 512MB size, even though the standard deviation of shard size within the tenant is very high. Our key mapping scheme provides a statistical guarantee that as the tenant's overall data size increases, uniformity of placement will improve.
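To make the mapping concrete, here is a minimal sketch in Rust of the relNode/blknum striping described above. The hash function, helper name, and parameter choices are illustrative assumptions, not the pageserver's actual implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative sketch: map a (relnode, blknum) key to a shard number.
/// The stripe size is expressed in pages (e.g. 32768 pages = 256MiB of 8KiB pages).
/// Assumes shard_count >= 1.
fn key_to_shard(relnode: u32, blknum: u32, stripe_size: u32, shard_count: u8) -> u8 {
    let mut hasher = DefaultHasher::new();
    // Hash the relNode so that different relations start striping on different shards.
    relnode.hash(&mut hasher);
    // Hash the stripe index (blknum / stripe size): adjacent blocks share a stripe,
    // while successive stripes land on pseudo-randomly chosen shards.
    (blknum / stripe_size).hash(&mut hasher);
    (hasher.finish() % shard_count as u64) as u8
}
```

With this mapping, adjacent blocks in a relation almost always select the same stripe (and therefore the same shard), while blocks one stripe apart will usually land on different shards.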
A ShardIdentity provides the information needed to know whether a particular key belongs to a particular shard.
This structure's size is constant. Note that if we had used a different key mapping scheme, such as consistent hashing with explicit hash ranges assigned to each shard, then the ShardIdentity's size would grow with the shard count: the simpler key mapping scheme used here enables a small, fixed-size ShardIdentity.
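As a rough illustration, a ShardIdentity might look like the following sketch; the field names are assumptions for this RFC, and key_to_shard is the helper sketched above.

```rust
/// Sketch of a ShardIdentity. Its size is independent of the shard count:
/// the shard's own index, the total count, and the stripe size are enough
/// to answer "does this key belong to me?".
pub struct ShardIdentity {
    pub number: u8,       // which shard this is (0..count)
    pub count: u8,        // total number of shards in the tenant
    pub stripe_size: u32, // stripe size in pages
}

impl ShardIdentity {
    pub fn is_key_local(&self, relnode: u32, blknum: u32) -> bool {
        // An unsharded tenant (count < 2) owns the whole keyspace.
        if self.count < 2 {
            return true;
        }
        key_to_shard(relnode, blknum, self.stripe_size, self.count) == self.number
    }
}
```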
Everywhere the Pageserver currently deals with Tenants, it will move to dealing with
TenantShards, which are just a Tenant plus a ShardIdentity telling it which part
of the keyspace it owns. An un-sharded tenant is just a TenantShard whose ShardIdentity
covers the whole keyspace.
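In code terms, a TenantShard could be as simple as the following sketch (names illustrative, assuming the existing Tenant type):

```rust
/// Sketch: a TenantShard is just a Tenant plus the ShardIdentity describing
/// which part of the keyspace it owns.
pub struct TenantShard {
    pub tenant: Tenant,
    pub shard: ShardIdentity,
}
```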
When the pageserver writes layers and index_part.json to remote storage, it must include the shard index & count in the name, to avoid collisions (the count is included for future-proofing, since it may change over time). These keys will also include a generation number: the generation number system will work exactly the same for TenantShards as it does for Tenants today: each shard will have its own generation number.
For tenants with >1 shard, layer files implicitly become sparse: within the key range described in the layer name, the layer file for a shard will only hold the content relevant to stripes assigned to the shard.
For this reason, the LayerFileName within a tenant is no longer unique: different shards may use the same LayerFileName to refer to different data. We may solve this simply by including the shard number in the keys used for layers.
The shard number will be included as a prefix (as part of tenant ID), like this:
```
pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/<layer file name>-<generation>
pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/index_part.json-<generation>
```
Reasons for this particular format:
In practice, we expect shard counts to be relatively small, so a u8 will be sufficient,
and therefore the shard part of the path can be a fixed-length hex string like {:02X}{:02X};
for example, the shard part of a single-shard tenant's prefix will be 0001.
For backward compatibility, we may define a special ShardIdentity that has shard_count==0,
and use this as a cue to construct paths with no prefix at all.
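A minimal sketch of constructing this prefix, including the legacy (shard_count == 0) case; the function name is illustrative:

```rust
/// Sketch: build the remote storage prefix for a tenant shard.
fn tenant_shard_prefix(tenant_id: &str, shard_number: u8, shard_count: u8) -> String {
    if shard_count == 0 {
        // Legacy/unsharded ShardIdentity: keep the existing path with no shard part.
        format!("pageserver/v1/tenants/{tenant_id}")
    } else {
        // Fixed-length hex shard part, e.g. shard 0 of 1 => "...-0001".
        format!("pageserver/v1/tenants/{tenant_id}-{shard_number:02X}{shard_count:02X}")
    }
}
```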
In the phase 1 described in this RFC, shards only reference layers they write themselves. However, when we implement shard splitting in future, it will be useful to enable shards to reference layers written by other shards (specifically the parent shard during a split), so that shards don't have to exhaustively copy all data into their own shard-prefixed keys.
To enable this, the IndexPart structure will be extended to store the (shard number, shard count)
tuple on each layer, such that it can construct paths for layers written by other shards. This
naturally raises the question of who "owns" such layers written by ancestral shards: this problem
will be addressed in phase 2.
For backward compatibility, any index entry without shard information will be assumed to belong to the legacy ShardIdentity (shard_count == 0), i.e. to use the un-prefixed path format.
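For illustration, the per-layer metadata might be extended roughly as follows; the struct and field names are assumptions, not the actual IndexPart schema:

```rust
use serde::{Deserialize, Serialize};

/// Sketch: per-layer metadata in IndexPart, extended with the shard that wrote
/// the layer, so other shards can construct that layer's remote path.
#[derive(Serialize, Deserialize)]
pub struct LayerMetadata {
    pub file_size: u64,
    pub generation: u32,
    /// (shard number, shard count) of the writing shard. Absent in older
    /// indexes: such entries fall back to the legacy ShardIdentity.
    #[serde(default)]
    pub shard: Option<(u8, u8)>,
}
```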
In Phase 1, all shards will subscribe to the safekeeper to download WAL content, and each shard will filter the WAL down to the pages relevant to its own part of the keyspace (sketched below).
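A simplified sketch of that ingest-time filter, reusing the ShardIdentity sketch above; the record type and pre-decoded page list are placeholders:

```rust
/// Sketch: a WAL record with the (relnode, blknum) pairs it touches, pre-decoded
/// for brevity. The real decoding happens during WAL ingest.
struct WalRecord {
    touched: Vec<(u32, u32)>,
}

/// Keep only the pages this shard owns; all other pages in the record are dropped.
fn pages_for_shard(shard: &ShardIdentity, record: &WalRecord) -> Vec<(u32, u32)> {
    record
        .touched
        .iter()
        .copied()
        .filter(|&(relnode, blknum)| shard.is_key_local(relnode, blknum))
        .collect()
}
```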
The pageservers must somehow give the safekeeper correct feedback on remote_consistent_lsn. One solution is for the 0th shard to periodically peek at the IndexParts of all the other shards, and have only the 0th shard populate remote_consistent_lsn. However, this is relatively expensive: if the safekeeper can be made shard-aware, it could instead be taught to use the min() of all shards' remote_consistent_lsns to decide when it is safe to trim the WAL.
No changes needed.
The pageserver doesn't have to do anything special during compaction or GC: it implicitly operates on the subset of keys that map to its ShardIdentity. This will result in sparse layer files, containing keys only in the stripes that this shard owns. Where optimizations currently exist in compaction for spotting "gaps" in the key range, these should be updated to ignore gaps that are due to sharding, to avoid spuriously splitting layers up into stripe-sized pieces.
Compute endpoints will need to be shard-aware: they must know the tenant's shard layout and send each getpage request to the pageserver serving the shard that owns the page (see the sketch below).
Doing this in compute, rather than routing requests via a single pageserver, is necessary to enable sharding tenants without adding latency from extra hops.
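The compute-side change lives in postgres/smgr (C), but the routing decision itself is the same key-to-shard mapping as above. A Rust-flavoured sketch, with an assumed shard-map type provided by the control plane:

```rust
use std::collections::HashMap;

/// Sketch: the shard layout a compute endpoint would need in order to route
/// getpage requests. The shape of this map is an assumption.
struct ShardMap {
    shard_count: u8,
    stripe_size: u32,
    /// shard number -> pageserver connection string
    pageservers: HashMap<u8, String>,
}

impl ShardMap {
    /// Pick the pageserver to ask for a given page, using key_to_shard from above.
    fn pageserver_for(&self, relnode: u32, blknum: u32) -> Option<&str> {
        let shard = key_to_shard(relnode, blknum, self.stripe_size, self.shard_count);
        self.pageservers.get(&shard).map(String::as_str)
    }
}
```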
Tenants, or Projects in the control plane, will each own a set of TenantShards (this will be 1 for small tenants). Logic for placement of tenant shards is just the same as the current logic for placing tenants.
Tenant lifecycle operations like deletion will require fanning-out to all the shards in the tenant. The same goes for timeline creation and deletion: a timeline should not be considered created until it has been created in all shards.
Initially, we will explicitly enable sharding for large tenants only, by setting the shard count at tenant creation time.
In future, this hint mechanism will become optional when we implement automatic re-sharding of tenants.
This section exists to indicate what will likely come next after this phase.
Phases 2a and 2b are amenable to execution in parallel.
Problem: when all shards consume the whole WAL, the network bandwidth used for transmitting the WAL from safekeeper to pageservers is multiplied by a factor of the shard count.
Network bandwidth is not our most pressing bottleneck, but it is likely to become a problem if we set a modest shard count (~8) on a significant number of tenants, especially as those larger tenants which we shard are also likely to have higher write bandwidth than average.
Problem: the number of shards in a tenant is defined at creation time and cannot be changed. This causes excessive sharding for most small tenants, and an upper bound on scale for very large tenants.
To address this, a splitting feature will later be added. One shard can split its
data into a number of children by doing a special compaction operation to generate
image layers broken up child-shard-wise, and then writing out an index_part.json for
each child. This will then require external coordination (by the control plane) to
safely attach these new child shards and then move them around to distribute work.
The opposite merging operation can also be imagined, but is unlikely to be implemented:
once a Tenant has been sharded, the marginal efficiency benefit of merging is unlikely to justify
the risk/complexity of implementing such a rarely-encountered scenario.
Problem: while sharding based on key is good for handling changes in overall database size, it is less suitable for spiky/unpredictable changes in the read workload to historical layers. Sudden increases in historical reads could result in sudden increases in local disk capacity required for a TenantShard.
Example: the extreme case of this would be to run a tenant for a year, then create branches with ancestors at monthly intervals. This could lead to a sudden 12x inflation in the on-disk capacity footprint of a TenantShard, since it would be serving reads from all those disparate historical layers.
If we can respond fast enough, then key-sharding a tenant more finely can help with this, but splitting may be a relatively expensive operation and the increased historical read load may be transient.
A separate mechanism for handling heavy historical reads could be something like a gossip mechanism for pageservers to communicate about their workload, plus a getpage@lsn offload mechanism where one pageserver can ask another to read the necessary layers from remote storage to serve the read. This requires relatively little coordination because it is read-only: any node can service any read. All reads to a particular shard would still flow through one node, but the disk capacity & I/O impact of servicing the read would be distributed.
When a database is growing under a write workload, writes may predominantly hit the end of the keyspace, creating a bandwidth hotspot on that shard. Similarly, if the user is intensively re-writing a particular relation and that relation lived on a single shard, we would not achieve our goal of distributing the write work across shards.
An alternative would be a model with no explicit sharding of work: the pageserver to which a tenant is attached would not hold all layers on its disk; instead, it would call out to peers to have them store some layers, and call out to those peers to request reads in those layers.
This mechanism will work well for distributing work in the LSN dimension, but in the key space dimension it has the major limitation of requiring one node to handle all incoming writes, and compactions. Even if the write workload for a large database fits in one pageserver, it will still be a hotspot and such tenants may still de-facto require their own pageserver.