docs/rfcs/037-storage-controller-restarts.md
This RFC describes the issues around the current storage controller restart procedure and proposes an implementation which reduces downtime to a few milliseconds on the happy path.
Storage controller upgrades (restarts, more generally) can cause multi-second availability gaps. While the storage controller does not sit on the main data path, it's generally not acceptable to block management requests for extended periods of time (e.g. https://github.com/neondatabase/neon/issues/8034).
The storage controller runs in a Kubernetes Deployment configured with one replica and the Recreate strategy. In non-Kubernetes terms: during an upgrade, the currently running storage controller is stopped and, only after, a new instance is created.
At start-up, the storage controller calls into all the pageservers it manages (retrieved from the database) to learn the latest locations of all tenant shards present on them. This is usually fast, but can push into tens of seconds under unfavourable circumstances, such as when pageservers are heavily loaded or unavailable.
There are probably as many ways of handling restarts gracefully as there are distributed systems.
At a very high level, the proposed idea is to start a new storage controller instance while the previous one is still running and cut over to it when it becomes ready. The new instance should coordinate with the existing one and take over responsibility gracefully. While the controller has built-in safety against split-brain situations (via generation numbers), we'd like to avoid such scenarios: they can lead to availability issues for tenants that underwent changes while two controllers were operating at the same time, and they require operator intervention to remedy.
On the Kubernetes configuration side, the proposal is to update the storage controller Deployment
to use spec.strategy.type=RollingUpdate, spec.strategy.rollingUpdate.maxSurge=1 and spec.strategy.rollingUpdate.maxUnavailable=0.
Under the hood, Kubernetes creates a new replica set and adds one pod to it (maxSurge=1). The old replica set does not
scale down until the new replica set has one replica in the ready state (maxUnavailable=0).
The various possible failure scenarios are investigated in the Handling Failures section.
This section describes the primitives required on the storage controller side and the flow of the happy path.
A new table should be added to the storage controller database for leader synchronization during startup.
This table will always contain at most one row. The proposed name for the table is leader and the schema
contains two elements:
- hostname: the hostname of the current storage controller leader - should be addressable
from other pods in the deployment
- start_timestamp: the start timestamp of the current storage controller leader (UTC timezone) -
only required for failure case handling: see Previous Leader Crashes Before New Leader Readiness

Storage controllers will read the leader row at start-up and then update it to mark themselves as the leader
at the end of the start-up sequence. We want compare-and-exchange semantics for the update: avoid the
situation where two concurrent updates succeed and overwrite each other. The default Postgres isolation
level is READ COMMITTED, which isn't strict enough here. The update transaction should use at least the
REPEATABLE READ isolation level in order to prevent lost updates. Currently, the storage controller uses the
stricter SERIALIZABLE isolation level for all transactions, which more than suits our needs here.
```sql
START TRANSACTION ISOLATION LEVEL REPEATABLE READ;

UPDATE leader SET hostname = '<new_hostname>', start_timestamp = '<new_start_ts>'
WHERE hostname = '<old_hostname>' AND start_timestamp = '<old_start_ts>';

COMMIT;
```
If the transaction fails or if no rows have been updated, then the compare-and-exchange is regarded as a failure.
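To make the semantics concrete, here is a minimal Rust sketch of the compare-and-exchange, written against tokio-postgres with the with-chrono-0_4 feature enabled. The controller's actual database layer will differ; LeaderRecord and the function shape are illustrative assumptions.

```rust
use tokio_postgres::{Client, IsolationLevel};

/// Identity of a leader as stored in the leader table (illustrative).
pub struct LeaderRecord {
    pub hostname: String,
    pub start_timestamp: chrono::DateTime<chrono::Utc>,
}

/// Attempt the compare-and-exchange. Returns Ok(true) only if exactly one
/// row matched the previous leader's identity and was overwritten with ours.
pub async fn cas_leader(
    client: &mut Client,
    prev: &LeaderRecord,
    new: &LeaderRecord,
) -> Result<bool, tokio_postgres::Error> {
    // REPEATABLE READ (or the controller's default SERIALIZABLE) makes
    // concurrent updates conflict instead of silently overwriting each other.
    let txn = client
        .build_transaction()
        .isolation_level(IsolationLevel::RepeatableRead)
        .start()
        .await?;

    let rows_updated = txn
        .execute(
            "UPDATE leader SET hostname = $1, start_timestamp = $2 \
             WHERE hostname = $3 AND start_timestamp = $4",
            &[
                &new.hostname,
                &new.start_timestamp,
                &prev.hostname,
                &prev.start_timestamp,
            ],
        )
        .await?;

    txn.commit().await?;

    // Zero rows updated means another instance won the race.
    Ok(rows_updated == 1)
}
```

A serialization conflict surfaces as an Err, so both failure modes (transaction error and lost race) look the same to the caller, which is all the start-up sequence needs.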
A new HTTP endpoint should be added to the storage controller: POST /control/v1/step_down. Upon receiving this
request, the leader cancels any pending reconciles and goes into a mode where it replies with 503 to all other APIs
and does not issue any location configurations to its pageservers. The successful HTTP response returns a serialized
snapshot of the observed state.
If further step down requests come in after the initial one, they are handled the same way and the observed state is returned again (required for failure scenario handling - see Handling Failures).
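Sketched below is one possible shape for the handler, using axum; Service, ObservedState and their contents are hypothetical stand-ins for the controller's real types.

```rust
use std::sync::Arc;
use axum::{extract::State, http::StatusCode, Json};
use serde::Serialize;

/// Hypothetical snapshot of the in-memory observed state.
#[derive(Serialize)]
struct ObservedState {
    // ... latest known location configurations per tenant shard ...
}

/// Hypothetical handle on the controller's service state.
struct Service;

impl Service {
    /// On the first call: cancels pending reconciles and flips into a mode
    /// where all other APIs answer 503 and no location configurations are
    /// issued to pageservers. On repeated calls: just snapshots again,
    /// which keeps the endpoint idempotent.
    async fn step_down(&self) -> ObservedState {
        ObservedState {}
    }
}

/// POST /control/v1/step_down
async fn handle_step_down(
    State(service): State<Arc<Service>>,
) -> (StatusCode, Json<ObservedState>) {
    (StatusCode::OK, Json(service.step_down().await))
}
```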
At start-up, the first thing the storage controller does is retrieve the sole row from the new
leader table. If such an entry exists, it sends a POST /control/v1/step_down request to the current
leader. This should be retried a few times with a short backoff (see [1]). The aspiring leader loads the
returned observed state into memory and the start-up sequence proceeds as usual, but without querying the
pageservers in order to build up the observed state.
Before doing any reconciliations or persistence changes, update the leader database table as described in the Database Table For Leader Synchronization
section. If this step fails, the storage controller process exits.
Note that no row will exist in the leader table for the first graceful restart. In that case, force-update the leader table
(without the WHERE clause) and proceed with the pre-existing start-up procedure (i.e. build the observed state by querying pageservers).
Summary of proposed new start-up sequence:
1. Retrieve the current leader row from the leader table.
2. If a leader exists, request that it step down via POST /control/v1/step_down, retrying a few times with a short backoff (see [1]).
3. On success, load the returned observed state into memory; on failure, fall back to building the observed state by querying pageservers.
4. Update the leader table via compare-and-exchange to mark this instance as the leader; exit if the update fails.
5. Proceed with reconciliations and serving requests.

Some things to note from the steps above:
[1] The API call might fail because there's no storage controller running (i.e. restart), so we don't want to extend the unavailability period by much. We still want to retry since that's not the common case.
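Putting the steps together, a simplified sketch of the aspiring leader's start-up flow follows. Service, Database and self_identity() are hypothetical stand-ins (LeaderRecord is the shape from the earlier sketch), and the retry count and backoff are placeholders.

```rust
use std::time::Duration;

/// Simplified sketch of the proposed start-up sequence.
async fn startup(service: &mut Service, db: &Database) -> anyhow::Result<()> {
    let me = self_identity();
    match db.get_leader().await? {
        Some(prev) => {
            // [1]: retry briefly; the previous leader may simply be gone.
            let mut observed = None;
            for _ in 0..3 {
                if let Ok(state) = service.request_step_down(&prev).await {
                    observed = Some(state);
                    break;
                }
                tokio::time::sleep(Duration::from_millis(250)).await;
            }
            match observed {
                // Happy path: adopt the previous leader's observed state
                // without any pageserver calls.
                Some(state) => service.load_observed_state(state),
                // Leader unreachable (e.g. it crashed): fall back to the
                // pre-existing start-up procedure.
                None => service.build_observed_state_from_pageservers().await?,
            }
            // Compare-and-exchange against the identity we read; losing
            // the race means another instance took over, so exit.
            if !db.cas_leader(&prev, &me).await? {
                anyhow::bail!("leader table compare-and-exchange failed");
            }
        }
        // First graceful restart: no row to compare against, so force the
        // update and run the pre-existing procedure.
        None => {
            service.build_observed_state_from_pageservers().await?;
            db.force_set_leader(&me).await?;
        }
    }
    Ok(())
}
```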
The storage controller may crash or be restarted outside of roll-outs. When a new pod is created, its call to
/step_down will fail since the previous leader is no longer reachable. In this case, perform the pre-existing
start-up procedure and update the leader table (with the WHERE clause). If the update fails, the storage controller
exits and consistency is maintained.
When the previous leader (P1) crashes before the new leader (P2) passes the readiness check, Kubernetes will
reconcile the old replica set and create a new pod for it (P1'). The /step_down API call will fail for P1'
(see [2]).
Now we have two cases to consider:
- P2 updates the leader table first: the database update from P1' will fail and P1' will exit, or be terminated by Kubernetes, depending on timings.
- P1' updates the leader table first: the hostname field of the leader row stays the same, but the start_timestamp field changes. The database update from P2 will fail (since start_timestamp does not match). P2 will exit and Kubernetes will create a new replacement pod for it (P2'). Now the entire dance starts again, but with P1' as the leader and P2' as the aspiring leader.

[2] P1 and P1' may (more likely than not) be the same pod and have the same hostname. The implementation should avoid this self-reference and fail the API call at the client if the persisted hostname matches the current one.
The deployment's replica sets already satisfy the deployment's replica count requirements and the Kubernetes deployment rollout will just clean up the dead pod.
The deployment controller scales up the new replica set by creating a new pod. The entire procedure is repeated with the new pod.
This feels very unlikely, but should be considered in any case. P2 (the new aspiring leader) fails the /step_down
API call into P1 (the current leader). P2 proceeds with the pre-existing startup procedure and updates the leader table.
Kubernetes will terminate P1, but there may be a brief period where both storage controllers can drive reconciles.
As we've seen in the previous section, we can end up with two storage controllers running at the same time. The split-brain duration is not bounded, since the Kubernetes controller might become partitioned from the pods (unlikely, though). While these scenarios are not fatal, they can cause tenant unavailability, so we'd like to reduce the chances of this happening. The rest of this section sketches some safety measures; it's likely overkill to implement all of them, however.
The storage controller has two types of side effects: location config requests into pageservers and compute notifications into the control plane. Before issuing either, the storage controller could check that it is indeed still the leader by querying the database, as sketched below. Side effects might still be applied if they race with the database update, but the situation will eventually be detected. The storage controller process should terminate in these cases.
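A minimal sketch of such a pre-side-effect guard, reusing the hypothetical Database handle and the LeaderRecord shape from the earlier sketches:

```rust
/// Called immediately before issuing a location config request to a
/// pageserver or a compute notification to the control plane.
async fn ensure_still_leader(db: &Database, me: &LeaderRecord) -> anyhow::Result<()> {
    let current = db
        .get_leader()
        .await?
        .ok_or_else(|| anyhow::anyhow!("no leader row"))?;
    if current.hostname != me.hostname || current.start_timestamp != me.start_timestamp {
        // Another instance took over. A side effect may still slip through
        // if it races with the database update, but we detect the situation
        // here and terminate instead of continuing to drive reconciles.
        std::process::exit(1);
    }
    Ok(())
}
```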
Up until now, the leadership defined by this RFC is static. In order to bound the length of the split-brain scenario, we could require leadership to be renewed periodically. Two new columns would be added to the leader table:
- last_renewed: timestamp indicating when the lease was last renewed
- lease_duration: duration indicating the amount of time after which the lease expires

The leader periodically attempts to renew the lease by checking that it is in fact still the legitimate leader and updating last_renewed in the
same transaction. If the update fails, the process exits. New storage controller instances wishing to become leaders must wait for the current lease
to expire before acquiring leadership if they have not successfully received a response to the /step_down request.
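A sketch of the renewal loop under the same assumptions (hypothetical Database handle; renewing at a third of the lease duration is an arbitrary illustrative choice):

```rust
use std::time::Duration;

/// Runs for the lifetime of a leader. Exiting the process on any failed
/// renewal keeps the invariant that an expired lease implies a dead leader.
async fn renew_lease_loop(db: Database, me: LeaderRecord, lease_duration: Duration) {
    let mut ticker = tokio::time::interval(lease_duration / 3);
    loop {
        ticker.tick().await;
        // One transaction: prove we are still the legitimate leader and
        // bump last_renewed. Zero rows updated means we lost leadership.
        let renewed = db
            .execute(
                "UPDATE leader SET last_renewed = now() \
                 WHERE hostname = $1 AND start_timestamp = $2",
                &[&me.hostname, &me.start_timestamp],
            )
            .await;
        if !matches!(renewed, Ok(1)) {
            std::process::exit(1);
        }
    }
}
```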
Each time that leadership changes, we can bump a term integer column in the leader table. This term uniquely identifies a leader.
Location config requests and re-attach responses can include this term. On the pageserver side, keep the latest term in memory and refuse
anything which contains a stale term (i.e. smaller than the current one).
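On the pageserver side, the in-memory check could be as small as the following sketch (names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Highest storage controller term seen by this pageserver. Kept in memory
/// only, as described above; it re-converges after a pageserver restart via
/// the next request that carries a term.
static LATEST_TERM: AtomicU64 = AtomicU64::new(0);

/// Returns true if a request carrying `term` should be accepted;
/// stale (smaller) terms are refused.
fn admit_term(term: u64) -> bool {
    // fetch_max atomically records the highest term observed so far and
    // returns the previous value.
    let prev = LATEST_TERM.fetch_max(term, Ordering::SeqCst);
    term >= prev
}
```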
The storage controller should expose a metric which describes its state (Active | WarmingUp | SteppedDown).
Per-region alerts should be added on this metric which trigger when:
- no storage controller has been in the Active state for an extended period of time
- more than one storage controller is in the Active state

An alert that periodically verifies that the leader table is in sync with the metric above would be very useful.
We'd have to expose a read-only view of the storage controller database to Grafana (perhaps this is already done).
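One way to expose the state metric is a one-hot gauge vector via the prometheus crate; the metric name and label values below are assumptions.

```rust
use prometheus::{register_int_gauge_vec, IntGaugeVec};

/// Register a gauge with one series per state; exactly one series is 1.
fn register_state_metric() -> IntGaugeVec {
    register_int_gauge_vec!(
        "storage_controller_state",
        "Current storage controller state",
        &["state"]
    )
    .expect("failed to register metric")
}

/// Flip the gauge to reflect the current state.
fn set_state(gauge: &IntGaugeVec, current: &str) {
    for state in ["Active", "WarmingUp", "SteppedDown"] {
        gauge
            .with_label_values(&[state])
            .set((state == current) as i64);
    }
}
```

With this shape, summing the Active series across pods and alerting when the sum differs from 1 for too long covers both conditions above.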
Kubernetes has a lease primitive which can be used to implement leader election. Only one instance may hold a lease at any given time. This lease needs to be periodically renewed and has an expiration period.
In our case, it would work something like this:
- the aspiring leader acquires the lease at start-up and only proceeds once it holds it
- the leader keeps renewing the lease periodically while it runs
- /step_down deletes the lease or stops it from renewing

The kubert crate implements a lightweight lease API, but it's still not exactly trivial to implement.
This approach has the benefit of baked-in observability (kubectl describe lease), but it moves leadership state out of the storage controller's database and adds a dependency on the Kubernetes API.
To my mind, the "row in database" approach is straightforward enough that we don't have to offload this to something external.