docs/rfcs/037-storage-controller-restarts.md
This RFC describes the issues around the current storage controller restart procedure and proposes an implementation which reduces downtime to a few milliseconds on the happy path.
Storage controller upgrades (restarts, more generally) can cause multi-second availability gaps. While the storage controller does not sit on the main data path, it's generally not acceptable to block management requests for extended periods of time (e.g. https://github.com/neondatabase/neon/issues/8034).
The storage controller runs in a Kubernetes Deployment configured with one replica and the Recreate strategy. In non-Kubernetes terms: during an upgrade, the currently running storage controller is stopped and, only after, a new instance is created.
At start-up, the storage controller calls into all the pageservers it manages (retrieved from the database) to learn the latest locations of all tenant shards present on them. This is usually fast, but can push into tens of seconds under unfavourable circumstances, such as when pageservers are heavily loaded or unavailable.
There are probably as many ways of handling restarts gracefully as there are distributed systems.
At a very high level, the proposed idea is to start a new storage controller instance while the previous one is still running and cut over to it when it becomes ready. The new instance should coordinate with the existing one and take over responsibility gracefully. While the controller has built-in safety against split-brain situations (via generation numbers), we'd like to avoid such scenarios: they can lead to availability issues for tenants that underwent changes while two controllers were operating at the same time, and they require operator intervention to remedy.
On the Kubernetes configuration side, the proposal is to update the storage controller Deployment
to use spec.strategy.type=RollingUpdate, spec.strategy.rollingUpdate.maxSurge=1 and spec.strategy.rollingUpdate.maxUnavailable=0.
Under the hood, Kubernetes creates a new replica set and adds one pod to it (maxSurge=1). The old replica set does not
scale down until the new replica set has one replica in the ready state (maxUnavailable=0).
The various possible failure scenarios are investigated in the Handling Failures section.
This section describes the primitives required on the storage controller side and the flow of the happy path.
A new table should be added to the storage controller database for leader synchronization during startup.
This table will always contain at most one row. The proposed name for the table is leader and the schema
contains two elements:
- hostname: the hostname of the current storage controller leader - should be addressable
from other pods in the deployment
- start_timestamp: the start timestamp of the current storage controller leader (UTC timezone) -
only required for failure case handling: see Previous Leader Crashes Before New Leader Readiness

Storage controllers will read the leader row at start-up and then update it to mark themselves as the leader
at the end of the start-up sequence. We want compare-and-exchange semantics for the update: avoid the
situation where two concurrent updates succeed and overwrite each other. The default Postgres isolation
level is READ COMMITTED, which isn't strict enough here. The update transaction should use at least the
REPEATABLE READ isolation level in order to prevent lost updates. Currently, the storage controller uses the
stricter SERIALIZABLE isolation level for all transactions, which more than suits our needs here.
```sql
START TRANSACTION ISOLATION LEVEL REPEATABLE READ;

UPDATE leader SET hostname = '<new_hostname>', start_timestamp = '<new_start_ts>'
WHERE hostname = '<old_hostname>' AND start_timestamp = '<old_start_ts>';

COMMIT;
```
If the transaction fails or if no rows have been updated, then the compare-and-exchange is regarded as a failure.
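To make the semantics concrete, here is a minimal Rust sketch of the compare-and-exchange, written against tokio-postgres with the with-chrono-0_4 feature enabled. The controller's actual database layer will differ; LeaderRecord and the function shape are illustrative assumptions.

```rust
use tokio_postgres::{Client, IsolationLevel};

/// Identity of a leader as stored in the leader table (illustrative).
pub struct LeaderRecord {
    pub hostname: String,
    pub start_timestamp: chrono::DateTime<chrono::Utc>,
}

/// Attempt the compare-and-exchange. Returns Ok(true) only if exactly one
/// row matched the previous leader's identity and was overwritten with ours.
pub async fn cas_leader(
    client: &mut Client,
    prev: &LeaderRecord,
    new: &LeaderRecord,
) -> Result<bool, tokio_postgres::Error> {
    // REPEATABLE READ (or the controller's default SERIALIZABLE) makes
    // concurrent updates conflict instead of silently overwriting each other.
    let txn = client
        .build_transaction()
        .isolation_level(IsolationLevel::RepeatableRead)
        .start()
        .await?;

    let rows_updated = txn
        .execute(
            "UPDATE leader SET hostname = $1, start_timestamp = $2 \
             WHERE hostname = $3 AND start_timestamp = $4",
            &[
                &new.hostname,
                &new.start_timestamp,
                &prev.hostname,
                &prev.start_timestamp,
            ],
        )
        .await?;

    txn.commit().await?;

    // Zero rows updated means another instance won the race.
    Ok(rows_updated == 1)
}
```

A serialization conflict surfaces as an Err, so both failure modes (transaction error and lost race) look the same to the caller, which is all the start-up sequence needs.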
A new HTTP endpoint should be added to the storage controller: POST /control/v1/step_down. Upon receiving this
request, the leader cancels any pending reconciles and goes into a mode where it replies with 503 to all other APIs
and does not issue any location configurations to its pageservers. The successful HTTP response returns a serialized
snapshot of the observed state.
If further step down requests come in after the initial one, they are handled the same way and the observed state is returned again (required for failure scenario handling - see Handling Failures).
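Sketched below is one possible shape for the handler, using axum; Service, ObservedState and their contents are hypothetical stand-ins for the controller's real types.

```rust
use std::sync::Arc;
use axum::{extract::State, http::StatusCode, Json};
use serde::Serialize;

/// Hypothetical snapshot of the in-memory observed state.
#[derive(Serialize)]
struct ObservedState {
    // ... latest known location configurations per tenant shard ...
}

/// Hypothetical handle on the controller's service state.
struct Service;

impl Service {
    /// On the first call: cancels pending reconciles and flips into a mode
    /// where all other APIs answer 503 and no location configurations are
    /// issued to pageservers. On repeated calls: just snapshots again,
    /// which keeps the endpoint idempotent.
    async fn step_down(&self) -> ObservedState {
        ObservedState {}
    }
}

/// POST /control/v1/step_down
async fn handle_step_down(
    State(service): State<Arc<Service>>,
) -> (StatusCode, Json<ObservedState>) {
    (StatusCode::OK, Json(service.step_down().await))
}
```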
At start-up, the first thing the storage controller does is retrieve the sole row from the new
leader table. If such an entry exists, it sends a POST /control/v1/step_down request to the current
leader. This should be retried a few times with a short backoff (see [1]). The aspiring leader loads the
returned observed state into memory and the start-up sequence proceeds as usual, but without querying the
pageservers in order to build up the observed state.
Before doing any reconciliations or persistence changes, update the leader database table as described in the Database Table For Leader Synchronization
section. If this step fails, the storage controller process exits.
Note that no row will exist in the leader table for the first graceful restart. In that case, force-update the leader table
(without the WHERE clause) and proceed with the pre-existing start-up procedure (i.e. build the observed state by querying pageservers).
Summary of proposed new start-up sequence:
1. Retrieve the current leader row from the leader table.
2. If a leader exists, request that it step down via POST /control/v1/step_down, retrying a few times with a short backoff (see [1]).
3. On success, load the returned observed state into memory; on failure, fall back to building the observed state by querying pageservers.
4. Update the leader table via compare-and-exchange to mark this instance as the leader; exit if the update fails.
5. Proceed with reconciliations and serving requests.

Some things to note from the steps above:
[1] The API call might fail because there's no storage controller running (i.e. restart), so we don't want to extend the unavailability period by much. We still want to retry since that's not the common case.
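Putting the steps together, a simplified sketch of the aspiring leader's start-up flow follows. Service, Database and self_identity() are hypothetical stand-ins (LeaderRecord is the shape from the earlier sketch), and the retry count and backoff are placeholders.

```rust
use std::time::Duration;

/// Simplified sketch of the proposed start-up sequence.
async fn startup(service: &mut Service, db: &Database) -> anyhow::Result<()> {
    let me = self_identity();
    match db.get_leader().await? {
        Some(prev) => {
            // [1]: retry briefly; the previous leader may simply be gone.
            let mut observed = None;
            for _ in 0..3 {
                if let Ok(state) = service.request_step_down(&prev).await {
                    observed = Some(state);
                    break;
                }
                tokio::time::sleep(Duration::from_millis(250)).await;
            }
            match observed {
                // Happy path: adopt the previous leader's observed state
                // without any pageserver calls.
                Some(state) => service.load_observed_state(state),
                // Leader unreachable (e.g. it crashed): fall back to the
                // pre-existing start-up procedure.
                None => service.build_observed_state_from_pageservers().await?,
            }
            // Compare-and-exchange against the identity we read; losing
            // the race means another instance took over, so exit.
            if !db.cas_leader(&prev, &me).await? {
                anyhow::bail!("leader table compare-and-exchange failed");
            }
        }
        // First graceful restart: no row to compare against, so force the
        // update and run the pre-existing procedure.
        None => {
            service.build_observed_state_from_pageservers().await?;
            db.force_set_leader(&me).await?;
        }
    }
    Ok(())
}
```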
The storage controller may crash or be restarted outside of roll-outs. When a new pod is created, its call to
/step_down will fail since the previous leader is no longer reachable. In this case, perform the pre-existing
start-up procedure and update the leader table (with the WHERE clause). If the update fails, the storage controller
exits and consistency is maintained.
When the previous leader (P1) crashes before the new leader (P2) passes the readiness check, Kubernetes will
reconcile the old replica set and create a new pod for it (P1'). The /step_down API call will fail for P1'
(see [2]).
Now we have two cases to consider:
- P2 updates the leader table first: the database update from P1' will fail and P1' will exit, or be terminated by Kubernetes, depending on timings.
- P1' updates the leader table first: the hostname field of the leader row stays the same, but the start_timestamp field changes. The database update from P2 will fail (since start_timestamp does not match). P2 will exit and Kubernetes will create a new replacement pod for it (P2'). Now the entire dance starts again, but with P1' as the leader and P2' as the aspiring leader.

[2] P1 and P1' may (more likely than not) be the same pod and have the same hostname. The implementation should avoid this self-reference and fail the API call at the client if the persisted hostname matches the current one.
The deployment's replica sets already satisfy the deployment's replica count requirements and the Kubernetes deployment rollout will just clean up the dead pod.
The deployment controller scales up the new replica set by creating a new pod. The entire procedure is repeated with the new pod.
This feels very unlikely, but should be considered in any case. P2 (the new aspiring leader) fails the /step_down
API call into P1 (the current leader). P2 proceeds with the pre-existing startup procedure and updates the leader table.
Kubernetes will terminate P1, but there may be a brief period where both storage controllers can drive reconciles.
As we've seen in the previous section, we can end up with two storage controllers running at the same time. The split-brain duration is not bounded, since the Kubernetes controller might become partitioned from the pods (unlikely, though). While these scenarios are not fatal, they can cause tenant unavailability, so we'd like to reduce the chances of this happening. The rest of this section sketches some safety measures; it's likely overkill to implement all of them, however.
The storage controller has two types of side effects: location config requests into pageservers and compute notifications into the control plane. Before issuing either, the storage controller could check that it is indeed still the leader by querying the database, as sketched below. Side effects might still be applied if they race with the database update, but the situation will eventually be detected. The storage controller process should terminate in these cases.
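A minimal sketch of such a pre-side-effect guard, reusing the hypothetical Database handle and the LeaderRecord shape from the earlier sketches:

```rust
/// Called immediately before issuing a location config request to a
/// pageserver or a compute notification to the control plane.
async fn ensure_still_leader(db: &Database, me: &LeaderRecord) -> anyhow::Result<()> {
    let current = db
        .get_leader()
        .await?
        .ok_or_else(|| anyhow::anyhow!("no leader row"))?;
    if current.hostname != me.hostname || current.start_timestamp != me.start_timestamp {
        // Another instance took over. A side effect may still slip through
        // if it races with the database update, but we detect the situation
        // here and terminate instead of continuing to drive reconciles.
        std::process::exit(1);
    }
    Ok(())
}
```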
Up until now, the leadership defined by this RFC is static. In order to bound the length of the split-brain scenario, we could require leadership to be renewed periodically. Two new columns would be added to the leader table:
- last_renewed: timestamp indicating when the lease was last renewed
- lease_duration: duration indicating the amount of time after which the lease expires

The leader periodically attempts to renew the lease by checking that it is in fact still the legitimate leader and updating last_renewed in the
same transaction. If the update fails, the process exits. New storage controller instances wishing to become leaders must wait for the current lease
to expire before acquiring leadership if they have not successfully received a response to the /step_down request.
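A sketch of the renewal loop under the same assumptions (hypothetical Database handle; renewing at a third of the lease duration is an arbitrary illustrative choice):

```rust
use std::time::Duration;

/// Runs for the lifetime of a leader. Exiting the process on any failed
/// renewal keeps the invariant that an expired lease implies a dead leader.
async fn renew_lease_loop(db: Database, me: LeaderRecord, lease_duration: Duration) {
    let mut ticker = tokio::time::interval(lease_duration / 3);
    loop {
        ticker.tick().await;
        // One transaction: prove we are still the legitimate leader and
        // bump last_renewed. Zero rows updated means we lost leadership.
        let renewed = db
            .execute(
                "UPDATE leader SET last_renewed = now() \
                 WHERE hostname = $1 AND start_timestamp = $2",
                &[&me.hostname, &me.start_timestamp],
            )
            .await;
        if !matches!(renewed, Ok(1)) {
            std::process::exit(1);
        }
    }
}
```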
Each time that leadership changes, we can bump a term integer column in the leader table. This term uniquely identifies a leader.
Location config requests and re-attach responses can include this term. On the pageserver side, keep the latest term in memory and refuse
anything which contains a stale term (i.e. smaller than the current one).
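On the pageserver side, the in-memory check could be as small as the following sketch (names are illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Highest storage controller term seen by this pageserver. Kept in memory
/// only, as described above; it re-converges after a pageserver restart via
/// the next request that carries a term.
static LATEST_TERM: AtomicU64 = AtomicU64::new(0);

/// Returns true if a request carrying `term` should be accepted;
/// stale (smaller) terms are refused.
fn admit_term(term: u64) -> bool {
    // fetch_max atomically records the highest term observed so far and
    // returns the previous value.
    let prev = LATEST_TERM.fetch_max(term, Ordering::SeqCst);
    term >= prev
}
```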
The storage controller should expose a metric which describes its state (Active | WarmingUp | SteppedDown).
Per-region alerts should be added on this metric which trigger when:
- no storage controller has been in the Active state for an extended period of time
- more than one storage controller is in the Active state

An alert that periodically verifies that the leader table is in sync with the metric above would be very useful.
We'd have to expose a read-only view of the storage controller database to Grafana (perhaps this is already done).
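One way to expose the state metric is a one-hot gauge vector via the prometheus crate; the metric name and label values below are assumptions.

```rust
use prometheus::{register_int_gauge_vec, IntGaugeVec};

/// Register a gauge with one series per state; exactly one series is 1.
fn register_state_metric() -> IntGaugeVec {
    register_int_gauge_vec!(
        "storage_controller_state",
        "Current storage controller state",
        &["state"]
    )
    .expect("failed to register metric")
}

/// Flip the gauge to reflect the current state.
fn set_state(gauge: &IntGaugeVec, current: &str) {
    for state in ["Active", "WarmingUp", "SteppedDown"] {
        gauge
            .with_label_values(&[state])
            .set((state == current) as i64);
    }
}
```

With this shape, summing the Active series across pods and alerting when the sum differs from 1 for too long covers both conditions above.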
Kubernetes has a lease primitive which can be used to implement leader election. Only one instance may hold a lease at any given time. This lease needs to be periodically renewed and has an expiration period.
In our case, it would work something like this:
- the aspiring leader acquires the lease at start-up and only proceeds once it holds it
- the leader keeps renewing the lease periodically while it runs
- /step_down deletes the lease or stops it from renewing

The kubert crate implements a lightweight lease API, but it's still not exactly trivial to implement.
This approach has the benefit of baked-in observability (kubectl describe lease), but it moves leadership state out of the storage controller's database and adds a dependency on the Kubernetes API.
To my mind, the "row in database" approach is straightforward enough that we don't have to offload this to something external.