Pageserver: split-brain safety for remote storage through generation numbers

Summary

A scheme of logical "generation numbers" for tenant attachment to pageservers is proposed, along with changes to the remote storage format to include these generation numbers in S3 keys.

Using the control plane as the issuer of these generation numbers enables strong anti-split-brain properties in the pageserver cluster without implementing a consensus mechanism directly in the pageservers.

Motivation

Currently, the pageserver's remote storage format does not provide a mechanism for addressing split brain conditions that may happen when replacing a node or when migrating a tenant from one pageserver to another.

From a remote storage perspective, a split brain condition occurs whenever two nodes both think they have the same tenant attached, and both can write to S3. This can happen in the case of a network partition, pathologically long delays (e.g. suspended VM), or software bugs.

In the current deployment model, control plane guarantees that a tenant is attached to one pageserver at a time, thereby ruling out split-brain conditions resulting from dual attachment (however, there is always the risk of a control plane bug). This control plane guarantee prevents robust response to failures, as if a pageserver is unresponsive we may not detach from it. The mechanism in this RFC fixes this, by making it safe to attach to a new, different pageserver even if an unresponsive pageserver may be running.

Further lack of safety during split-brain conditions blocks two important features where occasional split-brain conditions are part of the design assumptions:

seamless tenant migration (RFC PR)
automatic pageserver instance failure handling (aka "failover") (RFC TBD)

Prior art

020-pageserver-s3-coordination.md
023-the-state-of-pageserver-tenant-relocation.md
026-pageserver-s3-mvcc.md

This RFC has broad similarities to the proposal to implement a MVCC scheme in S3 object names, but this RFC avoids a general purpose transaction scheme in favour of more specialized "generations" that work like a transaction ID that always has the same lifetime as a pageserver process or tenant attachment, whichever is shorter.

Requirements

Accommodate storage backends with no atomic or fencing capability (i.e. work within S3's limitation that there are no atomics and clients can't be fenced)
Don't depend on any STONITH or node fencing in the compute layer (i.e. we will not assume that we can reliably kill and EC2 instance and have it die)
Scoped per-tenant, not per-pageserver; for seamless tenant migration, we need per-tenant granularity, and for failover, we likely want to spread the workload of the failed pageserver instance to a number of peers, rather than monolithically moving the entire workload to another machine. We do not rule out the latter case, but should not constrain ourselves to it.

Design Tenets

These are not requirements, but are ideas that guide the following design:

Avoid implementing another consensus system: we already have a strongly consistent database in the control plane that can do atomic operations where needed, and we also have a Paxos implementation in the safekeeper.
Avoiding locking in to specific models of how failover will work (e.g. do not assume that all the tenants on a pageserver will fail over as a unit).
Be strictly correct when it comes to data integrity. Occasional failures of availability are tolerable, occasional data loss is not.

Non Goals

The changes in this RFC intentionally isolate the design decision of how to define logical generations numbers and object storage format in a way that is somewhat flexible with respect to how actual orchestration of failover works.

This RFC intentionally does not cover:

Failure detection
Orchestration of failover
Standby modes to keep data ready for fast migration
Intentional multi-writer operation on tenants (multi-writer scenarios are assumed to be transient split-brain situations).
Sharding.

The interaction between this RFC and those features is discussed in Appendix B

Impacted Components

pageserver, control plane, safekeeper (optional)

Implementation Part 1: Correctness

Summary

A per-tenant generation number is introduced to uniquely identifying tenant attachments to pageserver processes.
- This generation number increments each time the control plane modifies a tenant (Project)'s assigned pageserver, or when the assigned pageserver restarts.
- the control plane is the authority for generation numbers: only it may increment a generation number.
Object keys are suffixed with the generation number
Safety for multiply-attached tenants is provided by the generation number in the object key: the competing pageservers will not try to write to the same keys.
Safety in split brain for multiple nodes running with the same node ID is provided by the pageserver calling out to the control plane on startup, to re-attach and thereby increment the generations of any attached tenants
Safety for deletions is achieved by deferring the DELETE from S3 to a point in time where the deleting node has validated with control plane that no attachment with a higher generation has a reference to the to-be-DELETEd key.
The control plane is used to issue generation numbers to avoid the need for a built-in consensus system in the pageserver, although this could in principle be changed without changing the storage format.

Generation numbers

A generation number is associated with each tenant in the control plane, and each time the attachment status of the tenant changes, this is incremented. Changes in attachment status include:

Attaching the tenant to a different pageserver
A pageserver restarting, and "re-attaching" its tenants on startup

These increments of attachment generation provide invariants we need to avoid split-brain issues in storage:

If two pageservers have the same tenant attached, the attachments are guaranteed to have different generation numbers, because the generation would increment while attaching the second one.
If there are multiple pageservers running with the same node ID, all the attachments on all pageservers are guaranteed to have different generation numbers, because the generation would increment when the second node started and re-attached its tenants.

As long as the infrastructure does not transparently replace an underlying physical machine, we are totally safe. See the later unsafe case section for details.

Object Key Changes

Generation suffix

All object keys (layer objects and index objects) will contain the attachment generation as a suffix. This suffix is the primary mechanism for protecting against split-brain situations, and enabling safe multi-attachment of tenants:

Two pageservers running with the same node ID (e.g. after a failure, where there is some rogue pageserver still running) will not try to write to the same objects, because at startup they will have re-attached tenants and thereby incremented generation numbers.
Multiple attachments (to different pageservers) of the same tenant will not try to write to the same objects, as each attachment would have a distinct generation.

The generation is appended in hex format (8 byte string representing u32), to all our existing key names. A u32's range limit would permit 27 restarts per second over a 5 year system lifetime: orders of magnitude more than is realistic.

The exact meaning of the generation suffix can evolve over time if necessary, for example if we chose to implement a failover mechanism internally to the pageservers rather than going via the control plane. The storage format just sees it as a number, with the only semantic property being that the highest numbered index is the latest.

Index changes

Since object keys now include a generation suffix, the index of these keys must also be updated. IndexPart currently stores keys and LSNs sufficient to reconstruct key names: this would be extended to store the generation as well.

This will increase the size of the file, but only modestly: layers are already encoded as their string-ized form, so the overhead is about 10 bytes per layer. This will be less if/when the index storage format is migrated to a binary format from JSON.

Visibility

This section doesn't describe code changes, but extends on the consequences of the object key changes given above

Visibility of objects to pageservers

Pageservers can of course list objects in S3 at any time, but in practice their visible set is based on the contents of their LayerMap, which is initialized from the index_part.json.??? that they load.

Starting with the index_part from the most recent previous generation (see loading index_part), a pageserver initially has visibility of all the objects that were referenced in the loaded index. These objects are guaranteed to remain visible until the current generation is superseded, via pageservers in older generations avoiding deletions (see deletion).

The "most recent previous generation" is not necessarily the most recent in terms of walltime, it is the one that is readable at the time a new generation starts. Consider the following sequence of a tenant being re-attached to different pageserver nodes:

Create + attach on PS1 in generation 1
PS1 Do some work, write out index_part.json-0001
Attach to PS2 in generation 2
Read index_part.json-0001
PS2 starts doing some work...
Attach to PS3 in generation 3
Read index_part.json-0001
...PS2 finishes its work: now it writes index_part.json-0002
PS3 writes out index_part.json-0003

In the above sequence, the ancestry of indices is:

0001 -> 0002
     |
     -> 0003

This is not an issue for safety: if the 0002 references some object that is not in 0001, then 0003 simply does not see it, and will re-do whatever work was required (e.g. ingesting WAL or doing compaction). Objects referenced by only the 0002 index will never be read by future attachment generations, and will eventually be cleaned up by a scrub (see scrubbing).

Visibility of LSNs to clients

Because index_part.json is now written with a generation suffix, which data is visible depends on which generation the reader is operating in:

If one was passively reading from S3 from outside of a pageserver, the visibility of data would depend on which index_part.json-<generation> file one had chosen to read from.
If two pageservers have the same tenant attached, they may have different data visible as they're independently replaying the WAL, and maintaining independent LayerMaps that are written to independent index_part.json files. Data does not have to be remotely committed to be visible.
For a pageserver writing with a stale generation, historic LSNs remain readable until another pageserver (with a higher generation suffix) decides to execute GC deletions. At this point, we may think of the stale attachment's generation as having logically ended: during its existence the generation had a consistent view of the world.
For a newly attached pageserver, its highest visible LSN may appears to go backwards with respect to an earlier attachment, if that earlier attachment had not uploaded all data to S3 before the new attachment.

Deletion

Generation number validation

While writes are de-conflicted by writers always using their own generation number in the key, deletions are slightly more challenging: if a pageserver A is isolated, and the true active node is pageserver B, then it is dangerous for A to do any object deletions, even of objects that it wrote itself, because pageserver's B metadata might reference those objects.

We solve this by inserting a "generation validation" step between the write of a remote index that un-links a particular object from the index, and the actual deletion of the object, such that deletions strictly obey the following ordering:

Write out index_part.json: this guarantees that any subsequent reader of the metadata will not try and read the object we unlinked.
Call out to control plane to validate that the generation which we use for our attachment is still the latest.
If step 2 passes, it is safe to delete the object. Why? The check-in with control plane together with our visibility rules guarantees that any later generation will use either the exact index_part.json that we uploaded in step 1, or a successor of it; not an earlier one. In both cases, the index_part.json doesn't reference the key we are deleting anymore, so, the key is invisible to any later attachment generation. Hence it's safe to delete it.

Note that at step 2 we are only confirming that deletions of objects no longer referenced by the specific index_part.json written in step 1 are safe. If we were attempting other deletions concurrently, these would need their own generation validation step.

If step 2 fails, we may leak the object. This is safe, but has a cost: see scrubbing. We may avoid this entirely outside of node failures, if we do proper flushing of deletions on clean shutdown and clean migration.

To avoid doing a huge number of control plane requests to perform generation validation, validation of many tenants will be done in a single request, and deletions will be queued up prior to validation: see Persistent deletion queue for more.

`remote_consistent_lsn` updates

Remote objects are not the only kind of deletion the pageserver does: it also indirectly deletes WAL data, by feeding back remote_consistent_lsn to safekeepers, as a signal to the safekeepers that they may drop data below this LSN.

For the same reasons that deletion of objects must be guarded by an attachment generation number validation step, updates to remote_consistent_lsn are subject to the same rules, using an ordering as follows:

upload the index_part that covers data up to LSN L0 to S3
Call out to control plane to validate that the generation which we use for our attachment is still the latest.
advance the remote_consistent_lsn that we advertise to the safekeepers to L0

If step 2 fails, then the remote_consistent_lsn advertised to safekeepers will not advance again until a pageserver with the latest generation is ready to do so.

Note: at step 3 we are not advertising the latest remote_consistent_lsn, we are advertising the value in the index_part that we uploaded in step 1. This provides a strong ordering guarantee.

Internally to the pageserver, each timeline will have two remote_consistent_lsn values: the one that reflects its latest write to remote storage, and the one that reflects the most recent validation of generation number. It is only the latter value that may be advertised to the outside world (i.e. to the safekeeper).

The control plane remains unaware of remote_consistent_lsn: it only has to validate the freshness of generation numbers, thereby granting the pageserver permission to share the information with the safekeeper.

For convenience, in subsequent sections and RFCs we will use "deletion" to mean both deletion of objects in S3, and updates to the remote_consistent_lsn, as updates to the remote consistent LSN are de-facto deletions done via the safekeeper, and both kinds of deletion are subject to the same generation validation requirement.

Pageserver attach/startup changes

Attachment

Calls to /v1/tenant/{tenant_id}/attach are augmented with an additional generation field in the body.

The pageserver does not persist this: a generation is only good for the lifetime of a process.

Finding the remote indices for timelines

Because index files are now suffixed with generation numbers, the pageserver cannot always GET the remote index in one request, because it can't always know a-priori what the latest remote index is.

Typically, the most recent generation to write an index would be our own generation minus 1. However, this might not be the case: the previous node might have started and acquired a generation number, and then crashed before writing out a remote index.

In the general case and as a fallback, the pageserver may list all the index_part.json files for a timeline, sort them by generation, and pick the highest that is <= its current generation for this attachment. The tenant should never load an index with an attachment generation newer than its own. These two rules combined ensure that objects written by later generations are never visible to earlier generations.

Note that if a given attachment picks an index part from an earlier generation (say n-2), but crashes & restarts before it writes its own generation's index part, next time it tries to pick an index part there may be an index part from generation n-1. It would pick the n-1 index part in that case, because it's sorted higher than the previous one from generation n-2. So, above rules guarantee no determinism in selecting the index part. are allowed to be attached with stale attachment generations during a multiply-attached phase in a migration, and in this instance if the old location's pageserver restarts, it should not try and load the newer generation's index.

To summarize, on starting a timeline, the pageserver will:

Issue a GET for index_part.json-<my generation - 1>
If 1 failed, issue a ListObjectsv2 request for index_part.json* and pick the newest.

One could optimize this further by using the control plane to record specifically which generation most recently wrote an index_part.json, if necessary, to increase the probability of finding the index_part.json in one GET. One could also improve the chances by having pageservers proactively write out index_part.json after they get a new generation ID.

Re-attachment on startup

On startup, the pageserver will call out to an new control plane /re-attach API (see Generation API). This returns a list of tenants that should be attached to the pageserver, and their generation numbers, which the control plane will increment before returning.

The pageserver should still scan its local disk on startup, but should delete any local content for tenants not indicated in the /re-attach response: their absence is an implicit detach operation.

Note if a tenant is omitted from the re-attach response, its local disk content will be deleted. This will change in subsequent work, when the control plane gains the concept of a secondary/standby location: a node with local content may revert to this status and retain some local content.

Cleaning up previous generations' remote indices

Deletion of old indices is not necessary for correctness, although it is necessary to avoid the ListObjects fallback in the previous section becoming ever more expensive.

Once the new attachment has written out its index_part.json, it may asynchronously clean up historic index_part.json objects that were found.

We may choose to implement this deletion either as an explicit step after we write out index_part for the first time in a pageserver's lifetime, or for simplicity just do it periodically as part of the background scrub (see scrubbing);

Control Plane Changes

Store generations for attaching tenants

The Project table must store the generation number for use when attaching the tenant to a new pageserver.
The /v1/tenant/:tenant_id/attach pageserver API will require the generation number, which the control plane can supply by simply incrementing the Project's generation number each time the tenant is attached to a different server: the same database transaction that changes the assigned pageserver should also change the generation number.

Generation API

This section describes an API that could be provided directly by the control plane, or built as a separate microservice. In earlier parts of the RFC, when we discuss the control plane providing generation numbers, we are referring to this API.

The API endpoints used by the pageserver to acquire and validate generation numbers are quite simple, and only require access to some persistent and linerizable storage (such as a database).

Building this into the control plane is proposed as a least-effort option to exploit existing infrastructure and implement generation number issuance in the same transaction that mandates it (i.e., the transaction that updates the Project assignment to another pageserver). However, this is not mandatory: this "Generation Number Issuer" could be built as a microservice. In practice, we will write such a miniature service anyway, to enable E2E pageserver/compute testing without control plane.

The endpoints required by pageservers are:

`/re-attach`

Request: {node_id: <u32>}
Response:
- 200 {tenants: [{id: <TenantId>, gen: <u32>}]}
- 404: unknown node_id
- (Future: 429: flapping detected, perhaps nodes are fighting for the same node ID, or perhaps this node was in a retry loop)
- (On unknown tenants, omit tenant from tenants array)
Server behavior: query database for which tenants should be attached to this pageserver.
- for each tenant that should be attached, increment the attachment generation and include the new generation in the response
Client behavior:
- for all tenants in the response, activate with the new generation number
- for any local disk content not referenced in the response, act as if we had been asked to detach it (i.e. delete local files)

Note the node_id in this request will change in future if we move to ephemeral node IDs, to be replaced with some correlation ID that helps the control plane realize if a process is running with the same storage as a previous pageserver process (e.g. we might use EC instance ID, or we might just write some UUID to the disk the first time we use it)

`/validate`

Request: {'tenants': [{tenant: <tenant id>, attach_gen: <gen>}, ...]}'
Response:
- 200 {'tenants': [{tenant: <tenant id>, status: <bool>}...]}
- (On unknown tenants, omit tenant from tenants array)
Purpose: enable the pageserver to discover for the given attachments whether they are still the latest.
Server behavior: this is a read-only operation: simply compare the generations in the request with the generations known to the server, and set status to true if they match.
Client behavior: clients must not do deletions within a tenant's remote data until they have received a response indicating the generation they hold for the attachment is current.

Use of `/load` and `/ignore` APIs

Because the pageserver will be changed to only attach tenants on startup based on the control plane's response to a /re-attach request, the load/ignore APIs no longer make sense in their current form.

The /load API becomes functionally equivalent to attach, and will be removed: any location that used /load before should just attach instead.

The /ignore API is equivalent to detaching, but without deleting local files.

Timeline/Branch creation & deletion

All of the previous arguments for safety have described operations within a timeline, where we may describe a sequence that includes updates to index_part.json, and where reads and writes are coming from a postgres endpoint (writes via the safekeeper).

Creating or destroying timeline is a bit different, because writes are coming from the control plane.

We must be safe against scenarios such as:

A tenant is attached to pageserver B while pageserver A is in the middle of servicing an RPC from the control plane to create or delete a tenant.
A pageserver A has been sent a timeline creation request but becomes unresponsive. The tenant is attached to a different pageserver B, and the timeline creation request is sent there too.

Timeline Creation

If some very slow node tries to do a timeline creation after a more recent generation node has already created the timeline and written some data into it, that must not cause harm. This is provided in timeline creations by the way all the objects within the timeline's remote path include a generation suffix: a slow node in an old generation that attempts to "create" a timeline that already exists will just emit an index_part.json with an old generation suffix.

Timeline IDs are never reused, so we don't have to worry about the case of create/delete/create cycles. If they were re-used during a disaster recovery "un-delete" of a timeline, that special case can be handled by calling out to all available pageservers to check that they return 404 for the timeline, and to flush their deletion queues in case they had any deletions pending from the timeline.

The above makes it safe for control plane to change the assignment of tenant to pageserver in control plane while a timeline creation is ongoing. The reason is that the creation request against the new assigned pageserver uses a new generation number. However, care must be taken by control plane to ensure that a "timeline creation successful" response from some pageserver is checked for the pageserver's generation for that timeline's tenant still being the latest. If it is not the latest, the response does not constitute a successful timeline creation. It is acceptable to discard such responses, the scrubber will clean up the S3 state. It is better to issue a timeline deletion request to the stale attachment.

Timeline Deletion

Tenant/timeline deletion operations are exempt from generation validation on deletes, and therefore don't have to go through the same deletion queue as GC/compaction layer deletions. This is because once a delete is issued by the control plane, it is a promise that the control plane will keep trying until the deletion is done, so even stale pageservers are permitted to go ahead and delete the objects.

The implications of this for control plane are:

During timeline/tenant deletion, the control plane must wait for the deletion to be truly complete (status 404) and also handle the case where the pageserver becomes unavailable, either by waiting for a replacement with the same node_id, or by *re-attaching the tenant elsewhere.
The control plane must persist its intent to delete a timeline/tenant before issuing any RPCs, and then once it starts, it must keep retrying until the tenant/timeline is gone. This is already handled by using a persistent Operation record that is retried indefinitely.

Timeline deletion may result in a special kind of object leak, where the latest generation attachment completes a deletion (including erasing all objects in the timeline path), but some slow/partitioned node is writing into the timeline path with a stale generation number. This would not be caught by any per-timeline scrubbing (see scrubbing), since scrubbing happens on the attached pageserver, and once the timeline is deleted it isn't attached anywhere. This scenario should be pretty rare, and the control plane can make it even rarer by ensuring that if a tenant is in a multi-attached state (e.g. during migration), we wait for that to complete before processing the deletion. Beyond that, we may implement some other top-level scrub of timelines in an external tool, to identify any tenant/timeline paths that are not found in the control plane database.

Examples

Deletion, node restarts partway through:
- By the time we returned 202, we have written a remote delete marker
- Any subsequent incarnation of the same node_id will see the remote delete marker and continue to process the deletion
- If the original pageserver is lost permanently and no replacement with the same node_id is available, then the control plane must recover by re-attaching the tenant to a different node.
Creation, node becomes unresponsive partway through.
- Control plane will see HTTP request timeout, keep re-issuing request to whoever is the latest attachment point for the tenant until it succeeds.
- Stale nodes may be trying to execute timeline creation: they will write out index_part.json files with stale attachment generation: these will be eventually cleaned up by the same mechanism as other old indices.

Unsafe case on badly behaved infrastructure

This section is only relevant if running on a different environment than EC2 machines with ephemeral disks.

If we ever run pageservers on infrastructure that might transparently restart a pageserver while leaving an old process running (e.g. a VM gets rescheduled without the old one being fenced), then there is a risk of corruption, when the control plane attaches the tenant, as follows:

If the control plane sends an /attach request to node A, then node A dies and is replaced, and the control plane's retries the request without incrementing that attachment ID, then it could end up with two physical nodes both using the same generation number.
This is not an issue when using EC2 instances with ephemeral storage, as long as the control plane never re-uses a node ID, but it would need re-examining if running on different infrastructure.
To robustly protect against this class of issue, we would either:
- add a "node generation" to distinguish between different processes holding the same node_id.
- or, dispense with static node_id entirely and issue an ephemeral ID to each pageserver process when it starts.

Implementation Part 2: Optimizations

Persistent deletion queue

Between writing our a new index_part.json that doesn't reference an object, and executing the deletion, an object passes through a window where it is only referenced in memory, and could be leaked if the pageserver is stopped uncleanly. That introduces conflicting incentives: on the one hand, we would like to delay and batch deletions to

minimize the cost of the mandatory validations calls to control plane, and
minimize cost for DeleteObjects requests. On the other hand we would also like to minimize leakage by executing deletions promptly.

To resolve this, we may make the deletion queue persistent and then executing these in the background at a later time.

Note: The deletion queue's reason for existence is optimization rather than correctness, so there is a lot of flexibility in exactly how the it should work, as long as it obeys the rule to validate generations before executing deletions, so the following details are not essential to the overall RFC.

Scope

The deletion queue will be global per pageserver, not per-tenant. There are several reasons for this choice:

Use the queue as a central point to coalesce validation requests to the control plane: this avoids individual Timeline objects ever touching the control plane API, and avoids them having to know the rules about validating deletions. This separation of concerns will avoid burdening the already many-LoC Timeline type with even more responsibility.
Decouple the deletion queue from Tenant attachment lifetime: we may "hibernate" an inactive tenant by tearing down its Tenant/Timeline objects in the pageserver, without having to wait for deletions to be done.
Amortize the cost of I/O for the persistent queue, instead of having many tiny queues.
Coalesce deletions into a smaller number of larger DeleteObjects calls

Because of the cost of doing I/O for persistence, and the desire to coalesce generation validation requests across tenants, and coalesce deletions into larger DeleteObjects requests, there will be one deletion queue per pageserver rather than one per tenant. This has the added benefit that when deactivating a tenant, we do not have to drain their deletion queue: deletions can proceed for a tenant whose main Tenant object has been torn down.

Flow of deletion

The flow of a deletion is becomes:

Need for deletion of an object (=> layer file) is identified.
Unlink the object from all the places that reference it (=> index_part.json).
Enqueue the deletion to a persistent queue. Each entry is tenant_id, attachment_generation, S3 key.
Validate & execute in batches: 4.1 For a batch of entries, call into control plane. 4.2 For the subset of entries that passed validation, execute a DeleteObjects S3 DELETE request for their S3 keys.

As outlined in the Part 1 on correctness, it is critical that deletions are only executed once the key is not referenced anywhere in S3. This property is obviously upheld by the scheme above.

We Accept Object Leakage In Acceptable Circumstances

If we crash in the flow above between (2) and (3), we lose track of unreferenced object. Further, enqueuing a single to the persistent queue may not be durable immediately to amortize cost of flush to disk. This is acceptable for now, it can be caught by the scrubber.

There are various measures we can take to improve this in the future.

Cap amount of time until enqueued entry becomes durable (timeout for flush-to-tisk)
Proactively flush:
- On graceful shutdown, as we anticipate that some or all of our attachments may be re-assigned while we are offline.
- On tenant detach.
For each entry, keep track of whether it has passed (2). Only admit entries to (4) one they have passed (2). This requires re-writing / two queue entries (intent, commit) per deletion.

The important take-away with any of the above is that it's not disastrous to leak objects in exceptional circumstances.

Operations that may skip the queue

Deletions of an entire timeline are exempt from generation number validation. Once the control plane sends the deletion request, there is no requirement to retain the readability of any data within the timeline, and all objects within the timeline path may be deleted at any time from the control plane's deletion request onwards.

Since deletions of smaller timelines won't have enough objects to compose a full sized DeleteObjects request, it is still useful to send these through the last part of the deletion pipeline to coalesce with other executing deletions: to enable this, the deletion queue should expose two input channels: one for deletions that must be processed in a generation-aware way, and a fast path for timeline deletions, where that fast path may skip validation and the persistent queue.

Cleaning up orphan objects (scrubbing)

An orphan object is any object which is no longer referenced by a running node or by metadata.

Examples of how orphan objects arise:

A node PUTs a layer object, then crashes before it writes the index_part.json that references that layer.
A stale node carries on running for some time, and writes out an unbounded number of objects while it believes itself to be the rightful writer for a tenant.
A pageserver crashes between un-linking an object from the index, and persisting the object to its deletion queue.

Orphan objects are functionally harmless, but have a small cost due to S3 capacity consumed. We may clean them up at some time in the future, but doing a ListObjectsv2 operation and cross referencing with the latest metadata to identify objects which are not referenced.

Scrubbing will be done only by an attached pageserver (not some third party process), and deletions requested during scrub will go through the same validation as all other deletions: the attachment generation must be fresh. This avoids the possibility of a stale pageserver incorrectly thinking than an object written by a newer generation is stale, and deleting it.

It is not strictly necessary that scrubbing be done by an attached pageserver: it could also be done externally. However, an external scrubber would still require the same validation procedure that a pageserver's deletion queue performs, before actually erasing objects.

Operational impact

Availability

Coordination of generation numbers via the control plane introduce a dependency for certain operations:

Starting new pageservers (or activating pageservers after a restart)
Executing enqueued deletions
Advertising updated remote_consistent_lsn to enable WAL trimming

Item 1. would mean that some in-place restarts that previously would have resumed service even if the control plane were unavailable, will now not resume service to users until the control plane is available. We could avoid this by having a timeout on communication with the control plane, and after some timeout, resume service with the previous generation numbers (assuming this was persisted to disk). However, this is unlikely to be needed as the control plane is already an essential & highly available component. Also, having a node re-use an old generation number would complicate reasoning about the system, as it would break the invariant that a generation number uniquely identifies a tenant's attachment to a given pageserver process: it would merely identify the tenant's attachment to the pageserver machine or its on-disk-state.

Item 2. is a non-issue operationally: it's harmless to delay deletions, the only impact of objects pending deletion is the S3 capacity cost.

Item 3. could be an issue if safekeepers are low on disk space and the control plane is unavailable for a long time. If this became an issue, we could adjust the safekeeper to delete segments from local disk sooner, as soon as they're uploaded to S3, rather than waiting for remote_consistent_lsn to advance.

For a managed service, the general approach should be to make sure we are monitoring & respond fast enough that control plane outages are bounded in time.

There is also the fact that control plane runs in a single region. The latency for distant regions is not a big concern for us because all request types added by this RFC are either infrequent or not in the way of the data path. However, we lose region isolation for the operations listed above. The ongoing work to split console and control will give us per-region control plane, and all operations in this RFC can be handled by these per-region control planes. With that in mind, we accept the trade-offs outlined in this paragraph.

We will also implement an "escape hatch" config generation numbers, where in a major disaster outage, we may manually run pageservers with a hand-selected generation number, so that we can bring them online independently of a control plane.

Rollout

Although there is coupling between components, we may deploy most of the new data plane components independently of the control plane: initially they can just use a static generation number.

Phase 1

The pageserver is deployed with some special config to:

Always act like everything is generation 1 and do not wait for a control plane issued generation on attach
Skip the places in deletion and remote_consistent_lsn updates where we would call into control plane

Phase 2

The control plane changes are deployed: control plane will now track and increment generation numbers.

Phase 3

The pageserver is deployed with its control-plane-dependent changes enabled: it will now require the control plane to service re-attach requests on startup, and handle generation validation requests.

On-disk backward compatibility

Backward compatibility with existing data is straightforward:

When reading the index, we may assume that any layer whose metadata doesn't include generations will have a path without generation suffix.
When locating the index file on attachment, we may use the "fallback" listing path and if there is only an index without generation suffix, that is the one we load.

It is not necessary to re-write existing layers: even new index files will be able to represent generation-less layers.

On-disk forward compatibility

We will do a two phase rollout, probably over multiple releases because we will naturally have some of the read-side code ready before the overall functionality is ready:

Deploy pageservers which understand the new index format and generation suffixes in keys, but do not write objects with generation numbers in the keys.
Deploy pageservers that write objects with generation numbers in the keys.

Old pageservers will be oblivious to generation numbers. That means that they can't read objects with generation numbers in the name. This is why we must first step must deploy the ability to read, before the second step starts writing them.

Frequently Asked Questions

Why a generation suffix rather than prefix?

The choice is motivated by object listing, since one can list by prefix but not suffix.

In finding remote indices, we rely on being able to do a prefix listing for <tenant>/<timeline>/index_part.json*. That relies on the prefix listing.

The converse case of using a generation prefix and listing by generation is not needed: one could imagine listing by generation while scrubbing (so that a particular generation's layers could be scrubbed), but this is not part of normal operations, and the scrubber probably won't work that way anyway.

Wouldn't it be simpler to have a separate deletion queue per timeline?

Functionally speaking, we could. That's how RemoteTimelineClient currently works, but this approach does not map well to a long-lived persistent queue with generation validation.

Anything we do per-timeline generates tiny random I/O, on a pageserver with tens of thousands of timelines operating: to be ready for high scale, we should:

A) Amortize costs where we can (e.g. a shared deletion queue)
B) Expect to put tenants into a quiescent state while they're not busy: i.e. we shouldn't keep a tenant alive to service its deletion queue.

This was discussed in the scope part of the deletion queue section.

Appendix A: Examples of use in high availability/failover

The generation numbers proposed in this RFC are adaptable to a variety of different failover scenarios and models. The sections below sketch how they would work in practice.

In-place restart of a pageserver

"In-place" here means that the restart is done before any other element in the system has taken action in response to the node being down.

After restart, the node issues a re-attach request to the control plane, and receives new generation numbers for all its attached tenants.
Tenants may be activated with the generation number in the re-attach response.
If any of its attachments were in fact stale (i.e. had be reassigned to another node while this node was offline), then
- the re-attach response will inform the tenant about this by not including the tenant of this by not incrementing the generation for that attachment.
- This will implicitly block deletions in the tenant, but as an optimization the pageserver should also proactively stop doing S3 uploads when it notices this stale-generation state.
- The control plane is expected to eventually detach this tenant from the pageserver.

If the control plane does not include a tenant in the re-attach response, but there is still local state for the tenant in the filesystem, the pageserver deletes the local state in response and does not load/active the tenant. See the earlier section on pageserver startup for details. Control plane can use this mechanism to clean up a pageserver that has been down for so long that all its tenants were migrated away before it came back up again and asked for re-attach.

Failure of a pageserver

In this context, read "failure" as the most ambiguous possible case, where a pageserver is unavailable to clients and control plane, but may still be executing and talking to S3.

Case A: re-attachment to other nodes

Let's say node 0 becomes unresponsive in a cluster of three nodes 0, 1, 2.
Some external mechanism notices that the node is unavailable and initiates movement of all tenants attached to that node to a different node according to some distribution rule. In this example, it would mean incrementing the generation of all tenants that were attached to node 0, as each tenant's assigned pageserver changes.
A tenant which is now attached to node 1 will also still be attached to node 0, from the perspective of node 0. Node 0 will still be using its old generation, node 1 will be using a newer generation.
S3 writes will continue from nodes 0 and 1: there will be an index_part.json-00000001 _and* an index_part.json-00000002. Objects written under the old suffix after the new attachment was created do not matter from the rest of the system's perspective: the endpoints are reading from the new attachment location. Objects written by node 0 are just garbage that can be cleaned up at leisure. Node 0 will not do any deletions because it can't synchronize with control plane, or if it could, its deletion queue processing would get errors for the validation requests.

Case B: direct node replacement with same node_id and drive

This is the scenario we would experience if running pageservers in some dynamic VM/container environment that would auto-replace a given node_id when it became unresponsive, with the node's storage supplied by some network block device that is attached to the replacement VM/container.

Let's say node 0 fails, and there may be some other peers but they aren't relevant.
Some external mechanism notices that the node is unavailable, and creates a "new node 0" (Node 0b) which is a physically separate server. The original node 0 (Node 0a) may still be running, because we do not assume the environment fences nodes.
On startup, node 0b re-attaches and gets higher generation numbers for all tenants.
S3 writes continue from nodes 0a and 0b, but the writes do not collide due to different generation in the suffix, and the writes from node 0a are not visible to the rest of the system because endpoints are reading only from node 0b.

Appendix B: interoperability with other features

Sharded Keyspace

The design in this RFC maps neatly to a sharded keyspace design where subsets of the key space for a tenant are assigned to different pageservers:

the "unit of work" for attachments becomes something like a TenantShard rather than a Tenant
TenantShards get generation numbers just as Tenants do.
Write workload (ingest, compaction) for a tenant is spread out across pageservers via TenantShards, but each TenantShard still has exactly one valid writer at a time.

Read replicas

This section is about a passive reader of S3 pageserver state, not a postgres read replica

For historical reads to LSNs below the remote persistent LSN, any node may act as a reader at any time: remote data is logically immutable data, and the use of deferred deletion in this RFC helps mitigate the fact that remote data is not physically immutable (i.e. the actual data for a given page moves around as compaction happens).

A read replica needs to be aware of generations in remote data in order to read the latest metadata (find the index_part.json with the latest suffix). It may either query this from the control plane, or find it with ListObjectsv2 request

Seamless migration

To make tenant migration totally seamless, we will probably want to intentionally double-attach a tenant briefly, serving reads from the old node while waiting for the new node to be ready.

This RFC enables that double-attachment: two nodes may be attached at the same time, with the migration destination having a higher generation number. The old node will be able to ingest and serve reads, but not do any deletes. The new node's attachment must also avoid deleting layers that the old node may still use. A new piece of state will be needed for this in the control plane's definition of an attachment.

Warm secondary locations

To enable faster tenant movement after a pageserver is lost, we will probably want to spend some disk capacity on keeping standby locations populated with local disk data.

There's no conflict between this RFC and that: implementing warm secondary locations on a per-tenant basis would be a separate change to the control plane to store standby location(s) for a tenant. Because the standbys do not write to S3, they do not need to be assigned generation numbers. When a tenant is re-attached to a standby location, that would increment the tenant attachment generation and this would work the same as any other attachment change, but with a warm cache.

Ephemeral node IDs

This RFC intentionally avoids changing anything fundamental about how pageservers are identified and registered with the control plane, to avoid coupling the implementation of pageserver split brain protection with more fundamental changes in the management of the pageservers.

Moving to ephemeral node IDs would provide an extra layer of resilience in the system, as it would prevent the control plane accidentally attaching to two physical nodes with the same generation, if somehow there were two physical nodes with the same node IDs (currently we rely on EC2 guarantees to eliminate this scenario). With ephemeral node IDs, there would be no possibility of that happening, no matter the behavior of underlying infrastructure.

Nothing fundamental in the pageserver's handling of generations needs to change to handle ephemeral node IDs, since we hardly use the node_id anywhere. The /re-attach API would be extended to enable the pageserver to obtain its ephemeral ID, and provide some correlation identifier (e.g. EC instance ID), to help the control plane re-attach tenants to the same physical server that previously had them attached.

Pageserver: split-brain safety for remote storage through generation numbers

Pageserver: split-brain safety for remote storage through generation numbers

Summary

Motivation

Prior art

Requirements

Design Tenets

Non Goals

Impacted Components

Implementation Part 1: Correctness

Summary

Generation numbers

Object Key Changes

Generation suffix

Index changes

Visibility

Visibility of objects to pageservers

Visibility of LSNs to clients

Deletion

Generation number validation

remote_consistent_lsn updates

Pageserver attach/startup changes

Attachment

Finding the remote indices for timelines

Re-attachment on startup

Cleaning up previous generations' remote indices

Control Plane Changes

Store generations for attaching tenants

Generation API

/re-attach

/validate

Use of /load and /ignore APIs

Timeline/Branch creation & deletion

Timeline Creation

Timeline Deletion

Examples

Unsafe case on badly behaved infrastructure

Implementation Part 2: Optimizations

Persistent deletion queue

Scope

Flow of deletion

We Accept Object Leakage In Acceptable Circumstances

Operations that may skip the queue

Cleaning up orphan objects (scrubbing)

Operational impact

Availability

Rollout

Phase 1

Phase 2

Phase 3

On-disk backward compatibility

On-disk forward compatibility

Frequently Asked Questions

Why a generation suffix rather than prefix?

Wouldn't it be simpler to have a separate deletion queue per timeline?

Appendix A: Examples of use in high availability/failover

In-place restart of a pageserver

Failure of a pageserver

Case A: re-attachment to other nodes

Case B: direct node replacement with same node_id and drive

Appendix B: interoperability with other features

Sharded Keyspace

Read replicas

Seamless migration

Warm secondary locations

Ephemeral node IDs

`remote_consistent_lsn` updates

`/re-attach`

`/validate`

Use of `/load` and `/ignore` APIs