docs/rfcs/031-sharding-static.md
To enable databases with sizes approaching the capacity of a pageserver's disk, it is necessary to break up the storage for the database, or shard it.
Sharding in general is a complex area. This RFC aims to define an initial capability that will permit creating large-capacity databases using a static configuration defined at time of Tenant creation.
In Neon, all data for a Tenant, including all its timelines, is currently stored on a single pageserver. The local storage required may be several times larger than the actual database size, due to LSM write inflation. If a database is larger than one pageserver's disk, the pageserver cannot hold it in local storage, as it must do to provide service to clients.
Prior art in other distributed systems is too broad to capture here: pretty much any scale out storage system does something like this.
Note: the definition of 'large database' is arbitrary, but the lower bound is to ensure that a database that a user might create on a current-gen enterprise SSD should also work well on Neon. The upper bound is whatever postgres can handle: i.e. we must make sure that the pageserver backend is not the limiting factor in the database size.
Impacted components: pageserver, control plane, postgres/smgr.
Key: a postgres page number, qualified by relation. In the sense that the pageserver is a versioned key-value store,
the page number is the key in that store. Key is a literal data type in existing code.
LSN dimension: this just means the range of LSNs (history), when talking about the range of keys and LSNs as a two dimensional space.
When we think of sharding across the two dimensional key/lsn space, this is an opportunity to think about how the two dimensions differ:
The key sharding is the harder part, and also the more urgent one, to support larger-capacity databases. Because distributing historical LSN read work is a relatively simpler problem that most users don't have, we defer it to future work. It is anticipated that some quite simple P2P offload model will enable distributing work for historical reads: a node which is low on space can call out to a peer to ask it to download and serve reads from a historical layer.
Having decided to focus on key sharding, we must next decide how we will map keys to shards. It is proposed to use a "wide striping" approach, to obtain a good compromise between data locality and avoiding entire large relations mapping to the same shard.
We will define two spaces: the key space (postgres pages, defined below) and the space of shards onto which keys are mapped.
Keys are currently defined in the pageserver's getpage@lsn interface as follows:
```rust
pub struct Key {
    pub field1: u8,
    pub field2: u32,
    pub field3: u32,
    pub field4: u32,
    pub field5: u8,
    pub field6: u32,
}

fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key {
    Key {
        field1: 0x00,
        field2: rel.spcnode,
        field3: rel.dbnode,
        field4: rel.relnode,
        field5: rel.forknum,
        field6: blknum,
    }
}
```
Note: keys for relation metadata are ignored here, as this data will be mirrored to all shards. For distribution purposes, we only care about user data keys.
The properties we want from our Key->Shard mapping are:
- Locality in blknum, such that adjacent blknum values will usually map to
  the same stripe and consequently land on the same shard, even though the overall
  collection of blocks in a relation will be spread over many stripes and therefore
  many shards.
- Uniform distribution: even if there are patterns in relnode values, these do not manifest as patterns
  in data placement.

To accomplish this, the blknum is used to select a stripe, and stripes are assigned to shards in a pseudorandom order via a hash. The motivation for pseudo-random distribution (rather than sequential mapping of stripe to shard) is to avoid I/O hotspots when sequentially reading multiple relations: we don't want all relations' stripes to touch pageservers in the same order.
To map a Key to a shard:
- Hash the Key field 4 (relNode).
- Divide the block number (blknum) field by the stripe size in pages, and combine the
  hash of this with the hash from the previous step.
- Take the combined hash modulo the shard count to select the shard (see the sketch below).

Why don't we use the other fields in the Key? In particular, we do not use
forknum for key mapping, because it distinguishes different classes of data
in the same relation, and we would like to keep the data in a relation together.

For example, consider extreme cases of postgres data layout in a system with 8 shards and a stripe size of 32k pages:
- For a very large relation (e.g. 1TiB, i.e. 128Mi 8KiB pages), the blknum division will break the data up into 4096
  stripes, which will be scattered across the shards.

Data placement will be more uneven on smaller databases.
These uneven cases for small amounts of data do not matter, as long as the stripe size is an order of magnitude smaller than the amount of data we are comfortable holding in a single shard: if our system handles shard sizes up to 10-100GB, then it is not an issue if a tenant has some shards with 256MB size and some shards with 512MB size, even though the standard deviation of shard size within the tenant is very high. Our key mapping scheme provides a statistical guarantee that as the tenant's overall data size increases, uniformity of placement will improve.
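To make the mapping concrete, here is a minimal sketch in Rust of the relNode/blknum striping described above. The hash function, helper name, and parameter choices are illustrative assumptions, not the pageserver's actual implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative sketch: map a (relnode, blknum) key to a shard number.
/// The stripe size is expressed in pages (e.g. 32768 pages = 256MiB of 8KiB pages).
/// Assumes shard_count >= 1.
fn key_to_shard(relnode: u32, blknum: u32, stripe_size: u32, shard_count: u8) -> u8 {
    let mut hasher = DefaultHasher::new();
    // Hash the relNode so that different relations start striping on different shards.
    relnode.hash(&mut hasher);
    // Hash the stripe index (blknum / stripe size): adjacent blocks share a stripe,
    // while successive stripes land on pseudo-randomly chosen shards.
    (blknum / stripe_size).hash(&mut hasher);
    (hasher.finish() % shard_count as u64) as u8
}
```

With this mapping, adjacent blocks in a relation almost always select the same stripe (and therefore the same shard), while blocks one stripe apart will usually land on different shards.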
A ShardIdentity provides the information needed to know whether a particular key belongs to a particular shard.
This structure's size is constant. Note that if we had used a different key mapping scheme, such as consistent hashing with explicit hash ranges assigned to each shard, then the ShardIdentity's size would grow with the shard count: the simpler key mapping scheme used here enables a small, fixed-size ShardIdentity.
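As a rough illustration, a ShardIdentity might look like the following sketch; the field names are assumptions for this RFC, and key_to_shard is the helper sketched above.

```rust
/// Sketch of a ShardIdentity. Its size is independent of the shard count:
/// the shard's own index, the total count, and the stripe size are enough
/// to answer "does this key belong to me?".
pub struct ShardIdentity {
    pub number: u8,       // which shard this is (0..count)
    pub count: u8,        // total number of shards in the tenant
    pub stripe_size: u32, // stripe size in pages
}

impl ShardIdentity {
    pub fn is_key_local(&self, relnode: u32, blknum: u32) -> bool {
        // An unsharded tenant (count < 2) owns the whole keyspace.
        if self.count < 2 {
            return true;
        }
        key_to_shard(relnode, blknum, self.stripe_size, self.count) == self.number
    }
}
```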
Everywhere the Pageserver currently deals with Tenants, it will move to dealing with
TenantShards, which are just a Tenant plus a ShardIdentity telling it which part
of the keyspace it owns. An un-sharded tenant is just a TenantShard whose ShardIdentity
covers the whole keyspace.
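In code terms, a TenantShard could be as simple as the following sketch (names illustrative, assuming the existing Tenant type):

```rust
/// Sketch: a TenantShard is just a Tenant plus the ShardIdentity describing
/// which part of the keyspace it owns.
pub struct TenantShard {
    pub tenant: Tenant,
    pub shard: ShardIdentity,
}
```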
When the pageserver writes layers and index_part.json to remote storage, it must include the shard index & count in the name, to avoid collisions (the count is included for future-proofing, since it may change over time). These keys will also include a generation number: the generation number system will work exactly the same for TenantShards as it does for Tenants today: each shard will have its own generation number.
For tenants with >1 shard, layer files implicitly become sparse: within the key range described in the layer name, the layer file for a shard will only hold the content relevant to stripes assigned to the shard.
For this reason, the LayerFileName within a tenant is no longer unique: different shards may use the same LayerFileName to refer to different data. We may solve this simply by including the shard number in the keys used for layers.
The shard number will be included as a prefix (as part of tenant ID), like this:
```
pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/<layer file name>-<generation>
pageserver/v1/tenants/<tenant_id>-<shard_number><shard_count>/timelines/<timeline id>/index_part.json-<generation>
```
Reasons for this particular format:
In practice, we expect shard counts to be relatively small, so a u8 will be sufficient,
and therefore the shard part of the path can be a fixed-length hex string like {:02X}{:02X};
for example, the shard part of a single-shard tenant's prefix will be 0001.
For backward compatibility, we may define a special ShardIdentity that has shard_count==0,
and use this as a cue to construct paths with no prefix at all.
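A minimal sketch of constructing this prefix, including the legacy (shard_count == 0) case; the function name is illustrative:

```rust
/// Sketch: build the remote storage prefix for a tenant shard.
fn tenant_shard_prefix(tenant_id: &str, shard_number: u8, shard_count: u8) -> String {
    if shard_count == 0 {
        // Legacy/unsharded ShardIdentity: keep the existing path with no shard part.
        format!("pageserver/v1/tenants/{tenant_id}")
    } else {
        // Fixed-length hex shard part, e.g. shard 0 of 1 => "...-0001".
        format!("pageserver/v1/tenants/{tenant_id}-{shard_number:02X}{shard_count:02X}")
    }
}
```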
In the phase 1 described in this RFC, shards only reference layers they write themselves. However, when we implement shard splitting in future, it will be useful to enable shards to reference layers written by other shards (specifically the parent shard during a split), so that shards don't have to exhaustively copy all data into their own shard-prefixed keys.
To enable this, the IndexPart structure will be extended to store the (shard number, shard count)
tuple on each layer, such that it can construct paths for layers written by other shards. This
naturally raises the question of who "owns" such layers written by ancestral shards: this problem
will be addressed in phase 2.
For backward compatibility, any index entry without shard information will be assumed to belong to the legacy ShardIdentity (shard_count == 0), i.e. to use the un-prefixed path format.
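For illustration, the per-layer metadata might be extended roughly as follows; the struct and field names are assumptions, not the actual IndexPart schema:

```rust
use serde::{Deserialize, Serialize};

/// Sketch: per-layer metadata in IndexPart, extended with the shard that wrote
/// the layer, so other shards can construct that layer's remote path.
#[derive(Serialize, Deserialize)]
pub struct LayerMetadata {
    pub file_size: u64,
    pub generation: u32,
    /// (shard number, shard count) of the writing shard. Absent in older
    /// indexes: such entries fall back to the legacy ShardIdentity.
    #[serde(default)]
    pub shard: Option<(u8, u8)>,
}
```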
In Phase 1, all shards will subscribe to the safekeeper to download WAL content, and each shard will filter the WAL down to the pages relevant to its own part of the keyspace (sketched below).
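A simplified sketch of that ingest-time filter, reusing the ShardIdentity sketch above; the record type and pre-decoded page list are placeholders:

```rust
/// Sketch: a WAL record with the (relnode, blknum) pairs it touches, pre-decoded
/// for brevity. The real decoding happens during WAL ingest.
struct WalRecord {
    touched: Vec<(u32, u32)>,
}

/// Keep only the pages this shard owns; all other pages in the record are dropped.
fn pages_for_shard(shard: &ShardIdentity, record: &WalRecord) -> Vec<(u32, u32)> {
    record
        .touched
        .iter()
        .copied()
        .filter(|&(relnode, blknum)| shard.is_key_local(relnode, blknum))
        .collect()
}
```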
The pageservers must somehow give the safekeeper correct feedback on remote_consistent_lsn. One solution is for the 0th shard to periodically peek at the IndexParts of all the other shards, and have only the 0th shard populate remote_consistent_lsn. However, this is relatively expensive: if the safekeeper can be made shard-aware, it could instead be taught to use the min() of all shards' remote_consistent_lsns to decide when it is safe to trim the WAL.
No changes needed.
The pageserver doesn't have to do anything special during compaction or GC: it implicitly operates on the subset of keys that map to its ShardIdentity. This will result in sparse layer files, containing keys only in the stripes that this shard owns. Where optimizations currently exist in compaction for spotting "gaps" in the key range, these should be updated to ignore gaps that are due to sharding, to avoid spuriously splitting layers up into stripe-sized pieces.
Compute endpoints will need to be shard-aware: they must know the tenant's shard layout and send each getpage request to the pageserver serving the shard that owns the page (see the sketch below).
Doing this in compute, rather than routing requests via a single pageserver, is necessary to enable sharding tenants without adding latency from extra hops.
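The compute-side change lives in postgres/smgr (C), but the routing decision itself is the same key-to-shard mapping as above. A Rust-flavoured sketch, with an assumed shard-map type provided by the control plane:

```rust
use std::collections::HashMap;

/// Sketch: the shard layout a compute endpoint would need in order to route
/// getpage requests. The shape of this map is an assumption.
struct ShardMap {
    shard_count: u8,
    stripe_size: u32,
    /// shard number -> pageserver connection string
    pageservers: HashMap<u8, String>,
}

impl ShardMap {
    /// Pick the pageserver to ask for a given page, using key_to_shard from above.
    fn pageserver_for(&self, relnode: u32, blknum: u32) -> Option<&str> {
        let shard = key_to_shard(relnode, blknum, self.stripe_size, self.shard_count);
        self.pageservers.get(&shard).map(String::as_str)
    }
}
```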
Tenants, or Projects in the control plane, will each own a set of TenantShards (this will be 1 for small tenants). Logic for placement of tenant shards is just the same as the current logic for placing tenants.
Tenant lifecycle operations like deletion will require fanning-out to all the shards in the tenant. The same goes for timeline creation and deletion: a timeline should not be considered created until it has been created in all shards.
Initially, we will explicitly enable sharding for large tenants only, by setting the shard count at tenant creation time.
In future, this hint mechanism will become optional when we implement automatic re-sharding of tenants.
This section exists to indicate what will likely come next after this phase.
Phases 2a and 2b are amenable to execution in parallel.
Problem: when all shards consume the whole WAL, the network bandwidth used for transmitting the WAL from safekeeper to pageservers is multiplied by a factor of the shard count.
Network bandwidth is not our most pressing bottleneck, but it is likely to become a problem if we set a modest shard count (~8) on a significant number of tenants, especially as those larger tenants which we shard are also likely to have higher write bandwidth than average.
Problem: the number of shards in a tenant is defined at creation time and cannot be changed. This causes excessive sharding for most small tenants, and an upper bound on scale for very large tenants.
To address this, a splitting feature will later be added. One shard can split its
data into a number of children by doing a special compaction operation to generate
image layers broken up child-shard-wise, and then writing out an index_part.json for
each child. This will then require external coordination (by the control plane) to
safely attach these new child shards and then move them around to distribute work.
The opposite merging operation can also be imagined, but is unlikely to be implemented:
once a Tenant has been sharded, the marginal efficiency benefit of merging is unlikely to justify
the risk/complexity of implementing such a rarely-encountered scenario.
Problem: while sharding based on key is good for handling changes in overall database size, it is less suitable for spiky/unpredictable changes in the read workload to historical layers. Sudden increases in historical reads could result in sudden increases in local disk capacity required for a TenantShard.
Example: the extreme case of this would be to run a tenant for a year, then create branches with ancestors at monthly intervals. This could lead to a sudden 12x inflation in the on-disk capacity footprint of a TenantShard, since it would be serving reads from all those disparate historical layers.
If we can respond fast enough, then key-sharding a tenant more finely can help with this, but splitting may be a relatively expensive operation and the increased historical read load may be transient.
A separate mechanism for handling heavy historical reads could be something like a gossip mechanism for pageservers to communicate about their workload, plus a getpage@lsn offload mechanism where one pageserver can ask another to read the necessary layers from remote storage to serve the read. This requires relatively little coordination because it is read-only: any node can service any read. All reads to a particular shard would still flow through one node, but the disk capacity & I/O impact of servicing the read would be distributed.
When a database is growing under a write workload, writes may predominantly hit the end of the keyspace, creating a bandwidth hotspot on that shard. Similarly, if the user is intensively re-writing a particular relation and that relation lived on a single shard, we would not achieve our goal of distributing the write work across shards.
An alternative would be a model with no explicit sharding of work: the pageserver to which a tenant is attached would not hold all layers on its disk; instead, it would call out to peers to have them store some layers, and call out to those peers to request reads in those layers.
This mechanism will work well for distributing work in the LSN dimension, but in the key space dimension it has the major limitation of requiring one node to handle all incoming writes, and compactions. Even if the write workload for a large database fits in one pageserver, it will still be a hotspot and such tenants may still de-facto require their own pageserver.