docs/rfcs/028-pageserver-migration.md
The preceding generation numbers RFC may be thought of as "making tenant migration safe". Following that, this RFC is about how those migrations are to be done: seamlessly, quickly, and efficiently.
These goals are in priority order: if we have to sacrifice efficiency to make a migration seamless for clients, we will do so, etc.
This is accomplished by introducing two high level changes: warm secondary locations, which maintain a cache of a tenant's layers on another pageserver, and finer-grained attachment states for coordinating the handover between pageservers.
Migrating tenants between pageservers is essential to operating a service at scale, in several contexts: recovering when a pageserver node fails, balancing load across nodes, and draining nodes for maintenance or decommissioning.
Currently, a migration consists of detaching the tenant from the old pageserver, then attaching it to the new one.
Once generation numbers are implemented, the detach step is no longer critical for correctness, so we can attach to the new pageserver before detaching from the old one.
However, this still does not meet our seamless/fast/efficient goals: the newly attached pageserver starts with a cold cache, and must download layers from S3 and replay recent WAL before it can serve reads promptly.
The user expectation for availability is that a migration should be invisible to endpoints: reads and writes continue throughout, with no more than a brief dip in quality of service.
Impacted components: pageserver, control plane.
- Attachment: a tenant is attached to a pageserver if it has been issued a generation number, and is running an instance of the Tenant type, ingesting the WAL, and available to serve page reads.
- Location: locations are a superset of attachments. A location is a combination of a tenant and a pageserver. We may attach at a location.
- Secondary location: a location which is not currently attached.
- Warm secondary location: a location which is not currently attached, but endeavors to maintain a warm local cache of layers. We avoid calling this a warm standby to avoid confusion with similar postgres features.
To enable faster migrations, we will identify at least one secondary location for each tenant. This secondary location will keep a warm cache of layers for the tenant, so that if it is later attached, it can catch up with the latest LSN quickly: rather than downloading everything, it only has to replay the recent part of the WAL to advance from the remote_consistent_lsn to the most recent LSN in the WAL.
The control plane is responsible for selecting secondary locations, and calling into pageservers to configure tenants into a secondary mode at this new location, as well as attaching the tenant in its existing primary location.
The attached pageserver for a tenant will publish a layer heatmap to advise secondaries of which layers should be downloaded.
Currently, we consider a tenant to be in one of two states on a pageserver: attached (a running Tenant object, with layers on local disk) or detached (no local state beyond what is in remote storage).
We will extend this with finer-grained modes, whose purpose will become clear in later sections: Detached, Secondary, AttachedSingle, AttachedMulti, and AttachedStale.
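For concreteness, these modes could be modeled as an enum along the following lines (a sketch using this RFC's state names, not the actual pageserver source):

```rust
// Sketch only: state names follow this RFC, not the pageserver's code.
pub enum LocationMode {
    /// No local state for the tenant beyond what exists in remote storage.
    Detached,
    /// Not attached; may maintain a warm local cache of layers.
    Secondary,
    /// The sole attached location: may upload to and delete from S3.
    AttachedSingle,
    /// Attached while another (stale) attachment may still serve reads:
    /// remote deletions are blocked.
    AttachedMulti,
    /// Attached with a stale generation: S3 uploads are blocked.
    AttachedStale,
}
```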
To control these finer-grained states, a new pageserver API endpoint will be added.
Define old location and new location as "Node A" and "Node B". Consider the case where both nodes are available, and Node B was previously configured as a secondary location for the tenant we are migrating.
The cutover procedure is orchestrated by the control plane, calling into the pageservers' APIs:

1. Initially, the tenant is attached to Node A, with Node B as a warm secondary.
2. Call Node A: flush to S3 and enter AttachedStale. Call Node B: enter AttachedMulti with a freshly incremented generation.
3. Call Node B: download layers according to the latest heatmap; poll until the download is complete.
4. Poll Nodes A and B until Node B's visible WAL has caught up with Node A's.
5. Configure endpoints to use Node B.
6. Call Node B: enter AttachedSingle.
7. Call Node A: enter Secondary.
The following table summarizes how the state of the system advances:
| Step | Node A | Node B | Node used by endpoints |
|---|---|---|---|
| 1 (initial) | AttachedSingle | Secondary | A |
| 2 | AttachedStale | AttachedMulti | A |
| 3 | AttachedStale | AttachedMulti | A |
| 4 | AttachedStale | AttachedMulti | A |
| 5 (cutover) | AttachedStale | AttachedMulti | B |
| 6 | AttachedStale | AttachedSingle | B |
| 7 (final) | Secondary | AttachedSingle | B |
The procedure described for a clean handover from a live node to a secondary is also used for failure cases and for migrations to a location that is not configured as a secondary, by simply skipping irrelevant steps, as described in the following sections.
If node A is unavailable, then all calls into node A are skipped, and we don't wait for B to catch up before switching the endpoints over to use B.
If node B is initially in the Detached state, the procedure is otherwise identical, but since Node B starts from Detached rather than Secondary, the layer download and WAL catch-up will take much longer.
We might do this if the tenant has no secondary location configured, or if the node hosting its secondary location is itself unavailable.
In the final step of the migration, we generally request the original node to enter a Secondary state. This is typical if we are doing a planned migration during maintenance, or to balance CPU/network load away from a node.
One might also want to permanently migrate away: this can be done by simply removing the secondary location after the migration is complete, or as an optimization by substituting the Detached state for the Secondary state in the final step.
```mermaid
sequenceDiagram
    participant CP as Control plane
    participant A as Node A
    participant B as Node B
    participant E as Endpoint
    CP->>A: PUT Flush & go to AttachedStale
    note right of A: A continues to ingest WAL
    CP->>B: PUT AttachedMulti
    CP->>B: PUT Download layers from latest heatmap
    note right of B: B downloads from S3
    loop Poll until download complete
        CP->>B: GET download status
    end
    activate B
    note right of B: B ingests WAL
    loop Poll until catch up
        CP->>B: GET visible WAL
        CP->>A: GET visible WAL
    end
    deactivate B
    CP->>E: Configure to use Node B
    E->>B: Connect for reads
    CP->>B: PUT AttachedSingle
    CP->>A: PUT Secondary
```
This case is far simpler: we may skip straight to our intended end state.
```mermaid
sequenceDiagram
    participant A as Node A
    participant CP as Control plane
    participant B as Node B
    participant E as Endpoint
    note right of A: Node A offline
    activate A
    CP->>B: PUT AttachedSingle
    CP->>E: Configure to use Node B
    E->>B: Connect for reads
    deactivate A
```
Ordinarily, an attached pageserver whose generation is the latest may delete layers at will (e.g. during compaction). If a previous generation pageserver is also still attached, and in use by endpoints, then this layer deletion could lead to a loss of availability for the endpoint when reading from the previous generation pageserver.
The AttachedMulti state simply disables deletions. These will be enqueued
in RemoteTimelineClient until the control plane transitions the
node into AttachedSingle, which unblocks deletions. Other remote storage operations
such as uploads are not blocked.
AttachedMulti is not required for data safety, only to preserve availability on pageservers running with stale generations.
A node enters AttachedMulti only when explicitly asked to by the control plane. It should only remain in this state for the duration of a migration.
If a control plane bug leaves the node in AttachedMulti for a long time, then we must avoid unbounded memory use from enqueued deletions. This may be accomplished simply, by dropping enqueued deletions when some modest threshold of delayed deletions (e.g. 10k layers per tenant) is reached. As with all deletions, it is safe to skip them, and the leaked objects will be eventually cleaned up by scrub or by timeline deletion.
During AttachedMulti, the Tenant is free to drop layers from local disk in response to disk pressure: only the deletion of remote layers is blocked.
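A minimal sketch of this deferral, with illustrative names and threshold (the real pageserver types may differ):

```rust
// Illustrative sketch: deferring remote deletions in AttachedMulti.
const MAX_PENDING_DELETIONS: usize = 10_000; // modest per-tenant threshold

#[derive(PartialEq)]
enum AttachMode {
    AttachedSingle,
    AttachedMulti,
    AttachedStale,
}

struct RemoteClient {
    mode: AttachMode,
    pending_deletions: Vec<String>, // layer object names awaiting deletion
}

impl RemoteClient {
    fn schedule_layer_deletion(&mut self, layer: String) {
        if self.mode == AttachMode::AttachedMulti {
            // Queue rather than execute, so a stale attachment that endpoints
            // are still reading from cannot have layers deleted underneath it.
            if self.pending_deletions.len() < MAX_PENDING_DELETIONS {
                self.pending_deletions.push(layer);
            }
            // Beyond the threshold, simply drop the deletion: skipping a
            // deletion is always safe, and the leaked object is cleaned up
            // later by a scrub or by timeline deletion.
        } else {
            delete_remote_object(&layer);
        }
    }
}

fn delete_remote_object(_layer: &str) {
    // Issue the actual S3 DeleteObject here.
}
```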
Currently, a pageserver with a stale generation number will continue to upload layers, but be prevented from completing deletions. This is safe, but inefficient: layers uploaded by this stale generation will not be read back by future generations of pageservers.
The AttachedStale state disables S3 uploads. The stale pageserver will continue to ingest the WAL and write layers to local disk, but not to do any uploads to S3.
A node may enter AttachedStale in two ways: explicitly, when the control plane requests it at the start of a migration, or on startup, when the /re-attach response indicates that the location should be attached with a stale generation.
The AttachedStale state also disables sending consumption metrics from that location: it is interpreted as an indication that some other pageserver is already attached or is about to be attached, and that new pageserver will be responsible for sending consumption metrics.
Over long periods of time, a tenant location in AttachedStale will accumulate data on local disk, as it cannot evict any layers written since it entered the AttachedStale state. We rely on the control plane to revert the location to Secondary or Detached at the end of a migration.
This scenario is particularly noteworthy when evacuating all tenants on a pageserver: since all the attached tenants will go into AttachedStale, we will be doing no uploads at all, therefore ingested data will cause disk usage to increase continuously. Under nominal conditions, the available disk space on pageservers should be sufficient to complete the evacuation before this becomes a problem, but we must also handle the case where we hit a low disk situation while in this state.
The concept of disk pressure already exists in the pageserver: the disk_usage_eviction_task
touches each Tenant when it determines that a low-disk condition requires
some layer eviction. Having selected layers for eviction, the eviction
task calls Timeline::evict_layers.
Safety: If evict_layers is called while in AttachedStale state, and some of the to-be-evicted layers are not yet uploaded to S3, then the block on uploads is lifted for those layers so they can be uploaded before being evicted. This will result in leaking some objects once the migration is complete, but enables the node to manage its disk space properly: if a node is left with some tenants in AttachedStale indefinitely due to a network partition or control plane bug, those tenants will not cause a full-disk condition.
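A sketch of that safety valve, with hypothetical names (the actual code path runs through Timeline::evict_layers):

```rust
// Illustrative only: evicting a not-yet-uploaded layer in AttachedStale
// lifts the upload block for that layer, trading a possibly-leaked S3
// object for bounded local disk usage.
use std::path::{Path, PathBuf};

struct LayerState {
    path: PathBuf,
    uploaded_to_s3: bool,
}

fn evict_layer(attached_stale: bool, layer: &LayerState) -> std::io::Result<()> {
    if attached_stale && !layer.uploaded_to_s3 {
        // Upload before eviction so no data is lost; the object may be
        // leaked once the migration completes, which is acceptable.
        upload_to_s3(&layer.path);
    }
    std::fs::remove_file(&layer.path) // free the local disk space
}

fn upload_to_s3(_path: &Path) {
    // Issue the actual S3 PutObject here.
}
```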
The secondary location's job is to serve reads with the same quality of service as the original location was serving them around the time of a migration. This does not mean the secondary location needs the whole set of layers: inactive layers that might soon be evicted on the attached pageserver need not be downloaded by the secondary. A totally idle tenant only needs to maintain enough on-disk state to enable a fast cold start (i.e. the most recent image layers are typically sufficient).
To enable this, we introduce the concept of a layer heatmap, which acts as an advisory input to secondary locations to decide which layers to download from S3.
The attached pageserver, if in state AttachedSingle, periodically uploads a serialized heatmap to S3. It may skip this if nothing has changed since the last upload (e.g. if the tenant is totally idle).
Additionally, when the tenant is flushed to remote storage prior to a migration (the first step in cutover procedure), the heatmap is written out. This enables a future attached pageserver to get an up to date view when deciding which layers to download.
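The serialization format is not pinned down here; for illustration, a heatmap entry might carry at least a layer name, the access time observed on the attached location, and whether the layer is resident there (field names are assumptions):

```rust
// Hypothetical heatmap schema; field names are assumptions.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct HeatmapEntry {
    layer_file_name: String,
    /// Last access time on the attached pageserver (unix seconds). Secondaries
    /// use this, rather than local file atime, for eviction decisions.
    last_access_unix: u64,
    /// Whether the layer is currently on the attached location's local disk.
    resident: bool,
}

#[derive(Serialize, Deserialize)]
struct Heatmap {
    entries: Vec<HeatmapEntry>,
}
```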
Secondary warm locations run a simple loop, sketched below, implemented separately from the main Tenant type (which represents attached tenants):

1. Periodically download the tenant's latest heatmap from S3.
2. Download any layers referenced by the heatmap that are not already on local disk.
3. Evict local layers under disk pressure, using the heatmap's access times (see below).
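A minimal sketch of that loop, with hypothetical helper names standing in for the real implementation:

```rust
// Illustrative sketch of a warm secondary's maintenance loop.
use std::{collections::HashSet, thread, time::Duration};

struct HeatmapLayer {
    name: String,
    last_access_unix: u64, // access time published by the attached location
}

fn secondary_loop(tenant_id: &str, local_layers: &mut HashSet<String>, evicted_cutoff: u64) {
    loop {
        for layer in download_heatmap(tenant_id) {
            // Skip layers already present and, to avoid flapping, layers whose
            // access time does not exceed that of the most recent eviction.
            if !local_layers.contains(&layer.name) && layer.last_access_unix > evicted_cutoff {
                download_layer(tenant_id, &layer.name);
                local_layers.insert(layer.name);
            }
        }
        thread::sleep(Duration::from_secs(60)); // hypothetical polling period
    }
}

fn download_heatmap(_tenant: &str) -> Vec<HeatmapLayer> {
    Vec::new() // fetch and deserialize the tenant's heatmap from S3 here
}

fn download_layer(_tenant: &str, _layer: &str) {
    // Fetch the layer file from S3 to local disk here.
}
```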
Note that the heatmap is only advisory: if a secondary location has plenty of disk space, it may choose to retain layers that aren't referenced by the heatmap, as long as they are still referenced by the IndexPart. Conversely, if a node is very low on disk space, it might opt to raise the heat threshold required to download and retain layers, until more disk space is available.
Secondary locations are subject to eviction on disk pressure, just as attached locations are. For eviction purposes, the access time of a layer in a secondary location will be the access time given in the heatmap, rather than the literal time at which the local layer file was accessed.
The heatmap will indicate which layers are in local storage on the attached location. The secondary will always attempt to get back to having that set of layers on disk, but to avoid flapping, it will remember the access time of the layer it was most recently asked to evict, and layers whose access time is below that will not be re-downloaded.
The resulting behavior is that after a layer is evicted from a secondary location, it is only re-downloaded once the attached pageserver accesses the layer and uploads a heatmap reflecting that access time. On a pageserver restart, the secondary location will attempt to download all layers in the heatmap again, if they are not on local disk.
This behavior will be slightly different when secondary locations are used for "low energy tenants", but that is beyond the scope of this RFC.
Currently, the /tenant/<tenant_id>/config API defines various
tunables like compaction settings, which apply to the tenant irrespective
of which pageserver it is running on.
A new "location config" structure will be introduced, which defines configuration which is per-tenant, but local to a particular pageserver, such as the attachment mode and whether it is a secondary.
The pageserver will expose a new per-tenant API for setting
the state: /tenant/<tenant_id>/location/config.
Body content:
```
{
  state: enum{Detached, Secondary, AttachedSingle, AttachedMulti, AttachedStale},
  generation: Option<u32>,
  configuration: Option<TenantConfig>,
  flush: bool
}
```
The existing /attach and /detach endpoints will have the same
behavior as calling /location/config with the AttachedSingle and Detached
states respectively. These endpoints will be deprecated and later
removed.
The generation attribute is mandatory for entering AttachedSingle or
AttachedMulti.
The configuration attribute is mandatory when entering any state other
than Detached. This configuration is the same as the body for
the existing /tenant/<tenant_id>/config endpoint.
The flush argument indicates whether the pageserver should flush
to S3 before proceeding: this only has any effect if the node is
currently in AttachedSingle or AttachedMulti. This is used
during the first phase of migration, when transitioning the
old pageserver to AttachedStale.
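One way the request body might be modeled (a sketch mirroring the schema above, not the actual API types):

```rust
// Sketch of the /location/config request body described above.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum LocationState {
    Detached,
    Secondary,
    AttachedSingle,
    AttachedMulti,
    AttachedStale,
}

#[derive(Serialize, Deserialize)]
struct LocationConfigRequest {
    state: LocationState,
    /// Mandatory when entering AttachedSingle or AttachedMulti.
    generation: Option<u32>,
    /// Same schema as the existing /tenant/<tenant_id>/config body; mandatory
    /// for every state except Detached. Modeled loosely here as raw JSON.
    configuration: Option<serde_json::Value>,
    /// Flush to S3 before changing state; only meaningful when the node is
    /// currently in AttachedSingle or AttachedMulti.
    flush: bool,
}
```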
The /re-attach API response will be extended to include a state as
well as a generation, enabling the pageserver to enter the
correct state for each tenant on startup.
A new table, ProjectLocation, will be added. Each row represents one location: a project, a pageserver, a generation, and a state column of enum(Secondary, AttachedSingle, AttachedMulti).

Notes:

- A Project may have many ProjectLocations.
- The pageserver column in Project now means "to which pageserver should endpoints connect", rather than simply which pageserver is attached.
- The generation column in Project remains, and is incremented and used to set the generation of ProjectLocation rows when they are set into an attached state.
- The Detached state is implicitly represented by the absence of a ProjectLocation row.

Migrations will be implemented as Go functions, within the
existing Operation framework in the control plane. These
operations are persistent, such that they will always keep
trying until completion: this property is important to avoid
leaving garbage behind on pageservers, such as AttachedStale
locations.
During migration, the control plane may encounter failures of either the original or new pageserver, or both.
A migration may be done while the old node is unavailable, in which case the old node may still be running in an AttachedStale state.
In this case, it is undesirable to have the migration Operation
stay alive until the old node eventually comes back online
and can be cleaned up. To handle this, the control plane
should run a background reconciliation process to compare
a pageserver's attachments with the database, and clean up
any that shouldn't be there any more.
Note that there will be no work to do if the old node was really
offline, as during startup it will call into /re-attach and
be updated that way. The reconciliation will only be needed
if the node was unavailable but still running.
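Purely for illustration (the real implementation would be a Go function in the control plane's Operation framework), the reconciliation pass amounts to something like:

```rust
// Sketch: detach any location a pageserver reports that the control plane's
// database no longer records. All names here are hypothetical.
use std::collections::HashSet;

type TenantId = String;

fn reconcile_pageserver(
    expected: &HashSet<TenantId>,      // locations recorded in ProjectLocation
    reported: &[TenantId],             // locations the pageserver reports
    detach: &mut dyn FnMut(&TenantId), // e.g. PUT /location/config { state: Detached }
) {
    for tenant in reported {
        if !expected.contains(tenant) {
            // A leftover location, e.g. AttachedStale from a migration done
            // while this node was unreachable but still running: clean it up.
            detach(tenant);
        }
    }
}
```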
Maintaining warm secondary locations for only some tenants, rather than all, will make sense in future, especially for tiny databases that may be downloaded from S3 in milliseconds when needed.
However, it is not wise to do this immediately, because pageservers host a mixture of higher- and lower-tier workloads. If we had one tenant with a secondary location and nine without, then on a failover those other nine tenants would do a lot of I/O as they recover from S3, which could degrade service for the tenant that had a secondary location.
Until we segregate tenants of different service tiers onto different pageserver nodes, or implement and test QoS to ensure that tenants with secondaries are not harmed by tenants without, we should use the same failover approach for all tenants.
Instead of secondary locations populating their caches from S3, we could have them consume the WAL directly from safekeepers. The downsides of this would be the additional load on the safekeepers from extra consumers, and the cost of replaying the WAL at secondary locations as well as attached ones.
The downside of only updating secondary locations from S3 is that we will have a delay during migration from replaying the LSN range between what's in S3 and what's in the pageserver. This range will be very small on planned migrations, as we have the old pageserver flush to S3 immediately before attaching the new pageserver. On unplanned migrations (old pageserver is unavailable), the range of LSNs to replay is bounded by the flush frequency on the old pageserver. However, the migration doesn't have to wait for the replay: it's just that not-yet-replayed LSNs will be unavailable for read until the new pageserver catches up.
We expect that pageserver reads of the most recent LSNs will be relatively rare, as for an active endpoint those pages will usually still be in the postgres page cache: this leads us to prefer synchronizing from S3 on secondary locations, rather than consuming the WAL from safekeepers.
It is not functionally necessary to keep warm caches on secondary locations at all. However, if we do not, then we would experience a de-facto availability loss in unplanned migrations, as reads to the new node would take an extremely long time (many seconds, perhaps minutes).
Warm caches on secondary locations are necessary to meet our availability goals.
Instead of migrating tenants individually, we could have entire spare nodes, and on a node death, move all its work to one of these spares.
This approach is avoided for several reasons.
We could simplify migrations by making both previous and new nodes go into a readonly state, then flush remote content from the previous node, then activate attachment on the secondary node.
The downside to this approach is a potentially large gap in readability of recent LSNs while loading data onto the new node. To avoid this, it is worthwhile to incur the extra cost of double-replaying the WAL onto old and new nodes' local storage during a migration.
Rather than uploading the heatmap to S3, attached pageservers could make it available to peers.
Currently, pageservers have no peer-to-peer communication, so adding it for heatmaps would incur significant overhead: deploying and configuring the service to allow it, and ensuring that when a new pageserver is deployed, other pageservers are updated to be aware of it.
As well as simplifying implementation, putting heatmaps in S3 will be useful for future analytics purposes -- gathering aggregated statistics on activity patterns across many tenants may be done directly from data in S3.