docs/rfcs/028-pageserver-migration.md
The preceding generation numbers RFC may be thought of as "making tenant migration safe". Following that, this RFC is about how those migrations are to be done: seamlessly, quickly, and efficiently.
These goals are in priority order: if we have to sacrifice efficiency to make a migration seamless for clients, we will do so, etc.
This is accomplished by introducing two high level changes: warm secondary locations, which maintain a cache of a tenant's layers on another pageserver, and finer-grained attachment states for coordinating the handover between pageservers.
Migrating tenants between pageservers is essential to operating a service at scale, in several contexts: recovering when a pageserver node fails, balancing load across nodes, and draining nodes for maintenance or decommissioning.
Currently, a migration consists of detaching the tenant from the old pageserver, then attaching it to the new one.
Once generation numbers are implemented, the detach step is no longer critical for correctness, so we can attach to the new pageserver before detaching from the old one.
However, this still does not meet our seamless/fast/efficient goals: the newly attached pageserver starts with a cold cache, and must download layers from S3 and replay recent WAL before it can serve reads promptly.
The user expectation for availability is that a migration should be invisible to endpoints: reads and writes continue throughout, with no more than a brief dip in quality of service.
Impacted components: pageserver, control plane.
- Attachment: a tenant is attached to a pageserver if it has been issued a generation number, and is running an instance of the Tenant type, ingesting the WAL, and available to serve page reads.
- Location: locations are a superset of attachments. A location is a combination of a tenant and a pageserver. We may attach at a location.
- Secondary location: a location which is not currently attached.
- Warm secondary location: a location which is not currently attached, but endeavors to maintain a warm local cache of layers. We avoid calling this a warm standby to avoid confusion with similar postgres features.
To enable faster migrations, we will identify at least one secondary location for each tenant. This secondary location will keep a warm cache of layers for the tenant, so that if it is later attached, it can catch up with the latest LSN quickly: rather than downloading everything, it only has to replay the recent part of the WAL to advance from the remote_consistent_lsn to the most recent LSN in the WAL.
The control plane is responsible for selecting secondary locations, and calling into pageservers to configure tenants into a secondary mode at this new location, as well as attaching the tenant in its existing primary location.
The attached pageserver for a tenant will publish a layer heatmap to advise secondaries of which layers should be downloaded.
Currently, we consider a tenant to be in one of two states on a pageserver: attached (a running Tenant object, with layers on local disk) or detached (no local state beyond what is in remote storage).
We will extend this with finer-grained modes, whose purpose will become clear in later sections: Detached, Secondary, AttachedSingle, AttachedMulti, and AttachedStale.
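For concreteness, these modes could be modeled as an enum along the following lines (a sketch using this RFC's state names, not the actual pageserver source):

```rust
// Sketch only: state names follow this RFC, not the pageserver's code.
pub enum LocationMode {
    /// No local state for the tenant beyond what exists in remote storage.
    Detached,
    /// Not attached; may maintain a warm local cache of layers.
    Secondary,
    /// The sole attached location: may upload to and delete from S3.
    AttachedSingle,
    /// Attached while another (stale) attachment may still serve reads:
    /// remote deletions are blocked.
    AttachedMulti,
    /// Attached with a stale generation: S3 uploads are blocked.
    AttachedStale,
}
```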
To control these finer-grained states, a new pageserver API endpoint will be added.
Define old location and new location as "Node A" and "Node B". Consider the case where both nodes are available, and Node B was previously configured as a secondary location for the tenant we are migrating.
The cutover procedure is orchestrated by the control plane, calling into the pageservers' APIs:

1. Initially, the tenant is attached to Node A, with Node B as a warm secondary.
2. Call Node A: flush to S3 and enter AttachedStale. Call Node B: enter AttachedMulti with a freshly incremented generation.
3. Call Node B: download layers according to the latest heatmap; poll until the download is complete.
4. Poll Nodes A and B until Node B's visible WAL has caught up with Node A's.
5. Configure endpoints to use Node B.
6. Call Node B: enter AttachedSingle.
7. Call Node A: enter Secondary.
The following table summarizes how the state of the system advances:
| Step | Node A | Node B | Node used by endpoints |
|---|---|---|---|
| 1 (initial) | AttachedSingle | Secondary | A |
| 2 | AttachedStale | AttachedMulti | A |
| 3 | AttachedStale | AttachedMulti | A |
| 4 | AttachedStale | AttachedMulti | A |
| 5 (cutover) | AttachedStale | AttachedMulti | B |
| 6 | AttachedStale | AttachedSingle | B |
| 7 (final) | Secondary | AttachedSingle | B |
The procedure described for a clean handover from a live node to a secondary is also used for failure cases and for migrations to a location that is not configured as a secondary, by simply skipping irrelevant steps, as described in the following sections.
If node A is unavailable, then all calls into node A are skipped, and we don't wait for B to catch up before switching the endpoints over to use B.
If node B is initially in the Detached state, the procedure is otherwise identical, but since Node B starts from Detached rather than Secondary, the layer download and WAL catch-up will take much longer.
We might do this if the tenant has no secondary location configured, or if the node hosting its secondary location is itself unavailable.
In the final step of the migration, we generally request the original node to enter a Secondary state. This is typical if we are doing a planned migration during maintenance, or to balance CPU/network load away from a node.
One might also want to permanently migrate away: this can be done by simply removing the secondary location after the migration is complete, or as an optimization by substituting the Detached state for the Secondary state in the final step.
```mermaid
sequenceDiagram
    participant CP as Control plane
    participant A as Node A
    participant B as Node B
    participant E as Endpoint
    CP->>A: PUT Flush & go to AttachedStale
    note right of A: A continues to ingest WAL
    CP->>B: PUT AttachedMulti
    CP->>B: PUT Download layers from latest heatmap
    note right of B: B downloads from S3
    loop Poll until download complete
        CP->>B: GET download status
    end
    activate B
    note right of B: B ingests WAL
    loop Poll until catch up
        CP->>B: GET visible WAL
        CP->>A: GET visible WAL
    end
    deactivate B
    CP->>E: Configure to use Node B
    E->>B: Connect for reads
    CP->>B: PUT AttachedSingle
    CP->>A: PUT Secondary
```
This case is far simpler: we may skip straight to our intended end state.
```mermaid
sequenceDiagram
    participant A as Node A
    participant CP as Control plane
    participant B as Node B
    participant E as Endpoint
    note right of A: Node A offline
    activate A
    CP->>B: PUT AttachedSingle
    CP->>E: Configure to use Node B
    E->>B: Connect for reads
    deactivate A
```
Ordinarily, an attached pageserver whose generation is the latest may delete layers at will (e.g. during compaction). If a previous generation pageserver is also still attached, and in use by endpoints, then this layer deletion could lead to a loss of availability for the endpoint when reading from the previous generation pageserver.
The AttachedMulti state simply disables deletions. These will be enqueued
in RemoteTimelineClient until the control plane transitions the
node into AttachedSingle, which unblocks deletions. Other remote storage operations
such as uploads are not blocked.
AttachedMulti is not required for data safety, only to preserve availability on pageservers running with stale generations.
A node enters AttachedMulti only when explicitly asked to by the control plane. It should only remain in this state for the duration of a migration.
If a control plane bug leaves the node in AttachedMulti for a long time, then we must avoid unbounded memory use from enqueued deletions. This may be accomplished simply, by dropping enqueued deletions when some modest threshold of delayed deletions (e.g. 10k layers per tenant) is reached. As with all deletions, it is safe to skip them, and the leaked objects will be eventually cleaned up by scrub or by timeline deletion.
During AttachedMulti, the Tenant is free to drop layers from local disk in response to disk pressure: only the deletion of remote layers is blocked.
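A minimal sketch of this deferral, with illustrative names and threshold (the real pageserver types may differ):

```rust
// Illustrative sketch: deferring remote deletions in AttachedMulti.
const MAX_PENDING_DELETIONS: usize = 10_000; // modest per-tenant threshold

#[derive(PartialEq)]
enum AttachMode {
    AttachedSingle,
    AttachedMulti,
    AttachedStale,
}

struct RemoteClient {
    mode: AttachMode,
    pending_deletions: Vec<String>, // layer object names awaiting deletion
}

impl RemoteClient {
    fn schedule_layer_deletion(&mut self, layer: String) {
        if self.mode == AttachMode::AttachedMulti {
            // Queue rather than execute, so a stale attachment that endpoints
            // are still reading from cannot have layers deleted underneath it.
            if self.pending_deletions.len() < MAX_PENDING_DELETIONS {
                self.pending_deletions.push(layer);
            }
            // Beyond the threshold, simply drop the deletion: skipping a
            // deletion is always safe, and the leaked object is cleaned up
            // later by a scrub or by timeline deletion.
        } else {
            delete_remote_object(&layer);
        }
    }
}

fn delete_remote_object(_layer: &str) {
    // Issue the actual S3 DeleteObject here.
}
```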
Currently, a pageserver with a stale generation number will continue to upload layers, but be prevented from completing deletions. This is safe, but inefficient: layers uploaded by this stale generation will not be read back by future generations of pageservers.
The AttachedStale state disables S3 uploads. The stale pageserver will continue to ingest the WAL and write layers to local disk, but not to do any uploads to S3.
A node may enter AttachedStale in two ways: explicitly, when the control plane requests it at the start of a migration, or on startup, when the /re-attach response indicates that the location should be attached with a stale generation.
The AttachedStale state also disables sending consumption metrics from that location: it is interpreted as an indication that some other pageserver is already attached or is about to be attached, and that new pageserver will be responsible for sending consumption metrics.
Over long periods of time, a tenant location in AttachedStale will accumulate data on local disk, as it cannot evict any layers written since it entered the AttachedStale state. We rely on the control plane to revert the location to Secondary or Detached at the end of a migration.
This scenario is particularly noteworthy when evacuating all tenants on a pageserver: since all the attached tenants will go into AttachedStale, we will be doing no uploads at all, therefore ingested data will cause disk usage to increase continuously. Under nominal conditions, the available disk space on pageservers should be sufficient to complete the evacuation before this becomes a problem, but we must also handle the case where we hit a low disk situation while in this state.
The concept of disk pressure already exists in the pageserver: the disk_usage_eviction_task
touches each Tenant when it determines that a low-disk condition requires
some layer eviction. Having selected layers for eviction, the eviction
task calls Timeline::evict_layers.
Safety: If evict_layers is called while in AttachedStale state, and some of the to-be-evicted layers are not yet uploaded to S3, then the block on uploads is lifted for those layers so they can be uploaded before being evicted. This will result in leaking some objects once the migration is complete, but enables the node to manage its disk space properly: if a node is left with some tenants in AttachedStale indefinitely due to a network partition or control plane bug, those tenants will not cause a full-disk condition.
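A sketch of that safety valve, with hypothetical names (the actual code path runs through Timeline::evict_layers):

```rust
// Illustrative only: evicting a not-yet-uploaded layer in AttachedStale
// lifts the upload block for that layer, trading a possibly-leaked S3
// object for bounded local disk usage.
use std::path::{Path, PathBuf};

struct LayerState {
    path: PathBuf,
    uploaded_to_s3: bool,
}

fn evict_layer(attached_stale: bool, layer: &LayerState) -> std::io::Result<()> {
    if attached_stale && !layer.uploaded_to_s3 {
        // Upload before eviction so no data is lost; the object may be
        // leaked once the migration completes, which is acceptable.
        upload_to_s3(&layer.path);
    }
    std::fs::remove_file(&layer.path) // free the local disk space
}

fn upload_to_s3(_path: &Path) {
    // Issue the actual S3 PutObject here.
}
```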
The secondary location's job is to serve reads with the same quality of service as the original location was serving them around the time of a migration. This does not mean the secondary location needs the whole set of layers: inactive layers that might soon be evicted on the attached pageserver need not be downloaded by the secondary. A totally idle tenant only needs to maintain enough on-disk state to enable a fast cold start (i.e. the most recent image layers are typically sufficient).
To enable this, we introduce the concept of a layer heatmap, which acts as an advisory input to secondary locations to decide which layers to download from S3.
The attached pageserver, if in state AttachedSingle, periodically uploads a serialized heatmap to S3. It may skip this if nothing has changed since the last upload (e.g. if the tenant is totally idle).
Additionally, when the tenant is flushed to remote storage prior to a migration (the first step in cutover procedure), the heatmap is written out. This enables a future attached pageserver to get an up to date view when deciding which layers to download.
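The serialization format is not pinned down here; for illustration, a heatmap entry might carry at least a layer name, the access time observed on the attached location, and whether the layer is resident there (field names are assumptions):

```rust
// Hypothetical heatmap schema; field names are assumptions.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct HeatmapEntry {
    layer_file_name: String,
    /// Last access time on the attached pageserver (unix seconds). Secondaries
    /// use this, rather than local file atime, for eviction decisions.
    last_access_unix: u64,
    /// Whether the layer is currently on the attached location's local disk.
    resident: bool,
}

#[derive(Serialize, Deserialize)]
struct Heatmap {
    entries: Vec<HeatmapEntry>,
}
```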
Secondary warm locations run a simple loop, sketched below, implemented separately from the main Tenant type (which represents attached tenants):

1. Periodically download the tenant's latest heatmap from S3.
2. Download any layers referenced by the heatmap that are not already on local disk.
3. Evict local layers under disk pressure, using the heatmap's access times (see below).
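A minimal sketch of that loop, with hypothetical helper names standing in for the real implementation:

```rust
// Illustrative sketch of a warm secondary's maintenance loop.
use std::{collections::HashSet, thread, time::Duration};

struct HeatmapLayer {
    name: String,
    last_access_unix: u64, // access time published by the attached location
}

fn secondary_loop(tenant_id: &str, local_layers: &mut HashSet<String>, evicted_cutoff: u64) {
    loop {
        for layer in download_heatmap(tenant_id) {
            // Skip layers already present and, to avoid flapping, layers whose
            // access time does not exceed that of the most recent eviction.
            if !local_layers.contains(&layer.name) && layer.last_access_unix > evicted_cutoff {
                download_layer(tenant_id, &layer.name);
                local_layers.insert(layer.name);
            }
        }
        thread::sleep(Duration::from_secs(60)); // hypothetical polling period
    }
}

fn download_heatmap(_tenant: &str) -> Vec<HeatmapLayer> {
    Vec::new() // fetch and deserialize the tenant's heatmap from S3 here
}

fn download_layer(_tenant: &str, _layer: &str) {
    // Fetch the layer file from S3 to local disk here.
}
```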
Note that the heatmap is only advisory: if a secondary location has plenty of disk space, it may choose to retain layers that aren't referenced by the heatmap, as long as they are still referenced by the IndexPart. Conversely, if a node is very low on disk space, it might opt to raise the heat threshold required to download and retain layers, until more disk space is available.
Secondary locations are subject to eviction on disk pressure, just as attached locations are. For eviction purposes, the access time of a layer in a secondary location will be the access time given in the heatmap, rather than the literal time at which the local layer file was accessed.
The heatmap will indicate which layers are in local storage on the attached location. The secondary will always attempt to get back to having that set of layers on disk, but to avoid flapping, it will remember the access time of the layer it was most recently asked to evict, and layers whose access time is below that will not be re-downloaded.
The resulting behavior is that after a layer is evicted from a secondary location, it is only re-downloaded once the attached pageserver accesses the layer and uploads a heatmap reflecting that access time. On a pageserver restart, the secondary location will attempt to download all layers in the heatmap again, if they are not on local disk.
This behavior will be slightly different when secondary locations are used for "low energy tenants", but that is beyond the scope of this RFC.
Currently, the /tenant/<tenant_id>/config API defines various
tunables like compaction settings, which apply to the tenant irrespective
of which pageserver it is running on.
A new "location config" structure will be introduced, which defines configuration which is per-tenant, but local to a particular pageserver, such as the attachment mode and whether it is a secondary.
The pageserver will expose a new per-tenant API for setting
the state: /tenant/<tenant_id>/location/config.
Body content:
```
{
  state: enum{Detached, Secondary, AttachedSingle, AttachedMulti, AttachedStale},
  generation: Option<u32>,
  configuration: Option<TenantConfig>,
  flush: bool
}
```
The existing /attach and /detach endpoints will have the same
behavior as calling /location/config with the AttachedSingle and Detached
states respectively. These endpoints will be deprecated and later
removed.
The generation attribute is mandatory for entering AttachedSingle or
AttachedMulti.
The configuration attribute is mandatory when entering any state other
than Detached. This configuration is the same as the body for
the existing /tenant/<tenant_id>/config endpoint.
The flush argument indicates whether the pageserver should flush
to S3 before proceeding: this only has any effect if the node is
currently in AttachedSingle or AttachedMulti. This is used
during the first phase of migration, when transitioning the
old pageserver to AttachedStale.
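One way the request body might be modeled (a sketch mirroring the schema above, not the actual API types):

```rust
// Sketch of the /location/config request body described above.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum LocationState {
    Detached,
    Secondary,
    AttachedSingle,
    AttachedMulti,
    AttachedStale,
}

#[derive(Serialize, Deserialize)]
struct LocationConfigRequest {
    state: LocationState,
    /// Mandatory when entering AttachedSingle or AttachedMulti.
    generation: Option<u32>,
    /// Same schema as the existing /tenant/<tenant_id>/config body; mandatory
    /// for every state except Detached. Modeled loosely here as raw JSON.
    configuration: Option<serde_json::Value>,
    /// Flush to S3 before changing state; only meaningful when the node is
    /// currently in AttachedSingle or AttachedMulti.
    flush: bool,
}
```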
The /re-attach API response will be extended to include a state as
well as a generation, enabling the pageserver to enter the
correct state for each tenant on startup.
A new table, ProjectLocation, will be added. Each row represents one location: a project, a pageserver, a generation, and a state column of enum(Secondary, AttachedSingle, AttachedMulti).

Notes:

- A Project may have many ProjectLocations.
- The pageserver column in Project now means "to which pageserver should endpoints connect", rather than simply which pageserver is attached.
- The generation column in Project remains, and is incremented and used to set the generation of ProjectLocation rows when they are set into an attached state.
- The Detached state is implicitly represented by the absence of a ProjectLocation row.

Migrations will be implemented as Go functions, within the
existing Operation framework in the control plane. These
operations are persistent, such that they will always keep
trying until completion: this property is important to avoid
leaving garbage behind on pageservers, such as AttachedStale
locations.
During migration, the control plane may encounter failures of either the original or new pageserver, or both.
A migration may be done while the old node is unavailable, in which case the old node may still be running in an AttachedStale state.
In this case, it is undesirable to have the migration Operation
stay alive until the old node eventually comes back online
and can be cleaned up. To handle this, the control plane
should run a background reconciliation process to compare
a pageserver's attachments with the database, and clean up
any that shouldn't be there any more.
Note that there will be no work to do if the old node was really
offline, as during startup it will call into /re-attach and
be updated that way. The reconciliation will only be needed
if the node was unavailable but still running.
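Purely for illustration (the real implementation would be a Go function in the control plane's Operation framework), the reconciliation pass amounts to something like:

```rust
// Sketch: detach any location a pageserver reports that the control plane's
// database no longer records. All names here are hypothetical.
use std::collections::HashSet;

type TenantId = String;

fn reconcile_pageserver(
    expected: &HashSet<TenantId>,      // locations recorded in ProjectLocation
    reported: &[TenantId],             // locations the pageserver reports
    detach: &mut dyn FnMut(&TenantId), // e.g. PUT /location/config { state: Detached }
) {
    for tenant in reported {
        if !expected.contains(tenant) {
            // A leftover location, e.g. AttachedStale from a migration done
            // while this node was unreachable but still running: clean it up.
            detach(tenant);
        }
    }
}
```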
Maintaining warm secondary locations for only some tenants, rather than all, will make sense in future, especially for tiny databases that may be downloaded from S3 in milliseconds when needed.
However, it is not wise to do this immediately, because pageservers host a mixture of higher- and lower-tier workloads. If we had one tenant with a secondary location and nine without, then on a failover those other nine tenants would do a lot of I/O as they recover from S3, which could degrade service for the tenant that had a secondary location.
Until we segregate tenants of different service tiers onto different pageserver nodes, or implement and test QoS to ensure that tenants with secondaries are not harmed by tenants without, we should use the same failover approach for all tenants.
Instead of secondary locations populating their caches from S3, we could have them consume the WAL directly from safekeepers. The downsides of this would be the additional load on the safekeepers from extra consumers, and the cost of replaying the WAL at secondary locations as well as attached ones.
The downside of only updating secondary locations from S3 is that we will have a delay during migration from replaying the LSN range between what's in S3 and what's in the pageserver. This range will be very small on planned migrations, as we have the old pageserver flush to S3 immediately before attaching the new pageserver. On unplanned migrations (old pageserver is unavailable), the range of LSNs to replay is bounded by the flush frequency on the old pageserver. However, the migration doesn't have to wait for the replay: it's just that not-yet-replayed LSNs will be unavailable for read until the new pageserver catches up.
We expect that pageserver reads of the most recent LSNs will be relatively rare, as for an active endpoint those pages will usually still be in the postgres page cache: this leads us to prefer synchronizing from S3 on secondary locations, rather than consuming the WAL from safekeepers.
It is not functionally necessary to keep warm caches on secondary locations at all. However, if we do not, then we would experience a de-facto availability loss in unplanned migrations, as reads to the new node would take an extremely long time (many seconds, perhaps minutes).
Warm caches on secondary locations are necessary to meet our availability goals.
Instead of migrating tenants individually, we could have entire spare nodes, and on a node death, move all its work to one of these spares.
This approach is avoided for several reasons.
We could simplify migrations by making both previous and new nodes go into a readonly state, then flush remote content from the previous node, then activate attachment on the secondary node.
The downside to this approach is a potentially large gap in readability of recent LSNs while loading data onto the new node. To avoid this, it is worthwhile to incur the extra cost of double-replaying the WAL onto old and new nodes' local storage during a migration.
Rather than uploading the heatmap to S3, attached pageservers could make it available to peers.
Currently, pageservers have no peer-to-peer communication, so adding it for heatmaps would incur significant overhead: deploying and configuring the service to allow it, and ensuring that when a new pageserver is deployed, other pageservers are updated to be aware of it.
As well as simplifying implementation, putting heatmaps in S3 will be useful for future analytics purposes -- gathering aggregated statistics on activity patterns across many tenants may be done directly from data in S3.