docs/rfcs/2025-03-17-compute-prewarm.md
Created on 2025-03-17 Implemented on TBD Author: Alexey Kondratov (@ololobus)
This RFC describes an approach to reduce performance degradation due to missing caches after compute node restart, i.e.:
Neon currently implements several features that guarantee high uptime of compute nodes:
This helps us to be well-within the uptime SLO of 99.95% most of the time. Problems begin when we go up to multi-TB workloads and 32-64 CU computes. During restart, compute loses all caches: LFC, shared buffers, file system cache. Depending on the workload, it can take a lot of time to warm up the caches, so that performance could be degraded and might be even unacceptable for certain workloads. The latter means that although current approach works well for small to medium workloads, we still have to do some additional work to avoid performance degradation after restart of large instances.
Postgres, compute_ctl, Control plane, Endpoint storage for unlogged storage of compute files. For the latter, we will need to implement a uniform abstraction layer on top of S3, ABS, etc., but S3 is used in text interchangeably with 'endpoint storage' for simplicity.
We are going to extend the current compute spec with the following attributes
struct ComputeSpec {
/// [All existing attributes]
...
/// Whether to do auto-prewarm at start or not.
/// Default to `false`.
pub lfc_auto_prewarm: bool
/// Interval in seconds between automatic dumps of
/// LFC state into S3. Default `None`, which means 'off'.
pub lfc_dump_interval_sec: Option<i32>
}
When lfc_dump_interval_sec is set to N, compute_ctl will periodically dump the LFC state
and store it in S3, so that it could be used either for auto-prewarm after restart or by replica
during the rolling restart. For enabling periodic dumping, we should consider the following value
lfc_dump_interval_sec=300 (5 minutes), same as in the upstream's pg_prewarm.autoprewarm_interval.
When lfc_auto_prewarm is set to true, compute_ctl will start prewarming the LFC upon restart
iif some of the previous states is present in S3.
POST /store_lfc_state -- dump LFC state using Postgres SQL interface and store result in S3.
This has to be a blocking call, i.e. it will return only after the state is stored in S3.
If there is any concurrent request in progress, we should return 429 Too Many Requests,
and let the caller to retry.
GET /dump_lfc_state -- dump LFC state using Postgres SQL interface and return it as is
in text format suitable for the future restore/prewarm. This API is not strictly needed at
the end state, but could be useful for a faster prototyping of a complete rolling restart flow
with prewarm, as it doesn't require persistent for LFC state storage.
POST /restore_lfc_state -- restore/prewarm LFC state with request
RestoreLFCStateRequest:
oneOf:
- type: object
required:
- lfc_state
properties:
lfc_state:
type: string
description: Raw LFC content dumped with GET `/dump_lfc_state`
- type: object
required:
- lfc_cache_key
properties:
lfc_cache_key:
type: string
description: |
endpoint_id of the source endpoint on the same branch
to use as a 'donor' for LFC content. Compute will look up
LFC content dump in S3 using this key and do prewarm.
where lfc_state and lfc_cache_key are mutually exclusive.
The actual prewarming will happen asynchronously, so the caller need to check the
prewarm status using the compute's standard GET /status API.
GET /status -- extend existing API with following attributes
struct ComputeStatusResponse {
// [All existing attributes]
...
pub prewarm_state: PrewarmState
}
/// Compute prewarm state. Will be stored in the shared Compute state
/// in compute_ctl
struct PrewarmState {
pub status: PrewarmStatus
/// Total number of pages to prewarm
pub pages_total: i64
/// Number of pages prewarmed so far
pub pages_processed: i64
/// Optional prewarm error
pub error: Option<String>
}
pub enum PrewarmStatus {
/// Prewarming was never requested on this compute
Off,
/// Prewarming was requested, but not started yet
Pending,
/// Prewarming is in progress. The caller should follow
/// `PrewarmState::progress`.
InProgress,
/// Prewarming has been successfully completed
Completed,
/// Prewarming failed. The caller should look at
/// `PrewarmState::error` for the reason.
Failed,
/// It is intended to be used by auto-prewarm if none of
/// the previous LFC states is available in S3.
/// This is a distinct state from the `Failed` because
/// technically it's not a failure and could happen if
/// compute was restart before it dumped anything into S3,
/// or just after the initial rollout of the feature.
Skipped,
}
POST /promote -- this is a blocking API call to promote compute replica into primary.
This API should be very similar to the existing POST /configure API, i.e. accept the
spec (primary spec, because originally compute was started as replica). It's a distinct
API method because semantics and response codes are different:
200 OK.compute_ctl
will return 412 Precondition Failed.429 Too Many Requests.500 Internal Server Error
will be returned.The complete flow will be present as a sequence diagram in the next section, but here we just want to list some important steps that have to be done by control plane during the rolling restart via warm replica, but without much of low-level implementation details.
Register the 'intent' of the instance restart, but not yet interrupt any workload at
primary and also accept new connections. This may require some endpoint state machine
changes, e.g. introduction of the pending_restart state. Being in this state also
mustn't prevent any other operations except restart: suspend, live-reconfiguration
(e.g. due to notify-attach call from the storage controller), deletion.
Start new replica compute on the same timeline and start prewarming it. This process may take quite a while, so the same concurrency considerations as in 1. should be applied here as well.
When warm replica is ready, control plane should:
3.1. Terminate the primary compute. Starting from here, this is a critical section, if anything goes off, the only option is to start the primary normally and proceed with auto-prewarm.
3.2. Send cache invalidation message to all proxies, notifying them that all new connections should request and wait for the new connection details. At this stage, proxy has to also drop any existing connections to the old primary, so they didn't do stale reads.
3.3. Attach warm replica compute to the primary endpoint inside control plane metadata database.
3.4. Promote replica to primary.
3.5. When everything is done, finalize the endpoint state to be just active.
sequenceDiagram
autonumber
participant proxy as Neon proxy
participant cplane as Control plane
participant primary as Compute (primary)
box Compute (replica)
participant ctl as compute_ctl
participant pg as Postgres
end
box Endpoint unlogged storage
participant s3proxy as Endpoint storage service
participant s3 as S3/ABS/etc.
end
cplane ->> primary: POST /store_lfc_state
primary -->> cplane: 200 OK
cplane ->> ctl: POST /restore_lfc_state
activate ctl
ctl -->> cplane: 202 Accepted
activate cplane
cplane ->> ctl: GET /status: poll prewarm status
ctl ->> s3proxy: GET /read_file
s3proxy ->> s3: read file
s3 -->> s3proxy: file content
s3proxy -->> ctl: 200 OK: file content
proxy ->> cplane: GET /proxy_wake_compute
cplane -->> proxy: 200 OK: old primary conninfo
ctl ->> pg: prewarm LFC
activate pg
pg -->> ctl: prewarm is completed
deactivate pg
ctl -->> cplane: 200 OK: prewarm is completed
deactivate ctl
deactivate cplane
cplane -->> cplane: reassign replica compute to endpoint,
start terminating the old primary compute
activate cplane
cplane ->> proxy: invalidate caches
proxy ->> cplane: GET /proxy_wake_compute
cplane -x primary: POST /terminate
primary -->> cplane: 200 OK
note over primary: old primary
compute terminated
cplane ->> ctl: POST /promote
activate ctl
ctl ->> pg: pg_ctl promote
activate pg
pg -->> ctl: done
deactivate pg
ctl -->> cplane: 200 OK
deactivate ctl
cplane -->> cplane: finalize operation
cplane -->> proxy: 200 OK: new primary conninfo
deactivate cplane
It's currently known that pageserver can sustain about 3000 RPS per shard for a few running computes. Large tenants are usually split into 8 shards, so the final formula may look like this:
8 shards * 3000 RPS * 8 KB =~ 190 MB/s
so depending on the LFC size, prewarming will take at least:
In total, one pageserver is normally capped by 30k RPS, so it obviously can't sustain many computes doing prewarm at the same time. Later, we may need an additional mechanism for computes to throttle the prewarming requests gracefully.
We consider following failures while implementing this RFC:
Compute got interrupted/crashed/restarted during prewarm. The caller -- control plane -- should detect that and start prewarm from the beginning.
Control plane promotion request timed out or hit network issues. If it never reached the
compute, control plane should just repeat it. If it did reach the compute, then during
retry control plane can hit 409 as previous request triggered the promotion already.
In this case, control plane need to retry until either 200 or
permanent error 500 is returned.
Compute got interrupted/crashed/restarted during promotion. At restart it will ask for a spec from control plane, and its content should signal compute to start as primary, so it's expected that control plane will continue polling for certain period of time and will discover that compute is ready to accept connections if restart is fast enough.
Any other unexpected failure or timeout during prewarming. This failure mustn't be fatal, control plane has to report failure, terminate replica and keep primary running.
Any other unexpected failure or timeout during promotion. Unfortunately, at this moment we already have the primary node stopped, so the only option is to start primary again and proceed with auto-prewarm.
Any unexpected failure during auto-prewarm. This failure mustn't be fatal,
compute_ctl has to report the failure, but do not crash the compute.
Control plane failed to confirm that old primary has terminated. This can happen, especially in the future HA setup. In this case, control plane has to ensure that it sent VM deletion and pod termination requests to k8s, so long-term we do not have two running primaries on the same timeline.
There are two security implications to consider:
Access to compute_ctl API. It has to be accessible from the outside of compute, so all
new API methods have to be exposed on the external HTTP port and must be authenticated
with JWT.
Read/write only your own LFC state data in S3. Although it's not really a security concern,
since LFC state is just a mapping of blocks present in LFC at certain moment in time;
it still has to be highly restricted, so that i) only computes on the same timeline can
read S3 state; ii) each compute can only write to the path that contains it's endpoint_id.
Both of this must be validated by Endpoint storage service using the JWT token provided by compute_ctl.
Currently, we only label computes with endpoint_id after attaching them to the endpoint.
In this proposal, this means that temporary replica will remain unlabelled until it's promoted
to primary. We can also hide it from users in the control plane API, but what to do with
billing and monitoring is still unclear.
We can probably mark it as 'billable' and tag with project_id, so it will be billed, but
not interfere in any way with the current primary monitoring.
Another thing to consider is how logs and metrics export will switch to the new compute. It's expected that OpenTelemetry collector will auto-discover the new compute and start scraping metrics from it.
It's still an open question whether we need auto-prewarm at all. The author's gut-feeling is that yes, we need it, but might be not for all workloads, so it could end up exposed as a user-controllable knob on the endpoint. There are two arguments for that:
Auto-prewarm existing in upstream's pg_prewarm, probably for a reason.
There are still could be 2 flows when we cannot perform the rolling restart via the warm replica: i) any failure or interruption during promotion; ii) wake up after scale-to-zero. The latter might be challenged as well, i.e. one can argue that auto-prewarm may and will compete with user-workload for storage resources. This is correct, but it might as well reduce the time to get warm LFC and good performance.
There are many things to consider here, but three items just off the top of my head:
How to properly start the walproposer inside Postgres.
What to do with logical replication. Currently, we do not include logical replication slots
inside basebackup, because nobody advances them at replica, so they just prevent the WAL
deletion. Yet, we do need to have them at primary after promotion. Starting with Postgres 17,
there is a new feature called
logical replication failover
and synchronized_standby_slots setting, but we need a plan for the older versions. Should we
request a new basebackup during promotion?
How do we guarantee that replica will receive all the latest WAL from safekeepers? Do some 'shallow' version of sync safekeepers without data copying? Or just a standard version of sync safekeepers?
The proposal already assumes one of the alternatives -- do not have any persistent storage for LFC state. This is possible to implement faster with the proposed API, but it means that we do not implement auto-prewarm yet.
At the end of implementing this RFC we should have two high-level settings that enable:
It also has to be decided what's the criteria for enabling one or both of these flows for certain clients.