docs/rfcs/2025-07-07-node-deletion-api-improvement.md
Created on 2025-07-07
Implemented on TBD
This RFC describes improvements to the storage controller API for gracefully deleting pageserver nodes.
The basic node deletion API introduced in #8226 has several limitations, which motivate the improvements described below.
In this context, "graceful" node deletion means that users do not experience any disruption or negative effects, provided the system remains in a healthy state (i.e., the remaining pageservers can handle the workload and all requirements are met). To achieve this, the system must perform live migration of all tenant shards from the node being deleted while the node is still running and continue processing all incoming requests. The node is removed only after all tenant shards have been safely migrated.
Although live migrations can be achieved with the existing drain functionality, draining leads to incorrect shard placement, such as shards landing outside their preferred availability zones. This results in unnecessary work to re-optimize a placement that was only just performed.
If we delete a node before its tenant shards are fully moved, the new node will not have all the needed data (e.g. heatmaps) ready, so user requests served by the new node will initially be much slower. If there are many tenant shards, this slowdown affects a large number of users.
Graceful node deletion is more complicated and can introduce new issues. It takes longer, because live migration of each tenant shard can last several minutes. Using non-blocking accessors may also cause deletion to wait if other processes are holding the inner state lock. It is also trickier because we must handle other requests, such as drain and fill, at the same time.
To resolve the problem of deleted nodes re-adding themselves, a tombstone mechanism was introduced as part of the node's stored information. Each node has a separate `NodeLifecycle` field with two possible states: `Active` and `Deleted`. When node deletion completes, the database row is not deleted; instead, its `NodeLifecycle` column is switched to `Deleted`. Nodes with the `Deleted` lifecycle are treated as if the row were absent by most handlers, with several exceptions: the reattach and register functionality must be aware of tombstones. Additionally, new debug handlers are available for listing and deleting tombstones via the `/debug/v1/tombstone` path.
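For illustration only, hypothetical calls to these tombstone debug endpoints (the per-node sub-path, the response format, and the `reqwest` usage are assumptions of this sketch, not confirmed API details):

```rust
/// List tombstones, then purge one, via the /debug/v1/tombstone path.
/// A sketch using reqwest; the base URL, the per-node sub-path, and the
/// response format are assumptions, not the confirmed API.
async fn purge_tombstone(client: &reqwest::Client, node_id: u64) -> anyhow::Result<()> {
    let tombstones = client
        .get("http://storage-controller:1234/debug/v1/tombstone")
        .send()
        .await?
        .text()
        .await?;
    println!("current tombstones: {tombstones}");

    client
        .delete(format!(
            "http://storage-controller:1234/debug/v1/tombstone/{node_id}"
        ))
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```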
The problem of making node deletion graceful is complex and involves several challenges; see below for a detailed breakdown of the proposed changes and mechanisms.
A new `NodeLifecycle` enum and a matching database field with these values:

- `Active`: The normal state. All operations are allowed.
- `ScheduledForDeletion`: The node is marked to be deleted soon. Deletion may be in progress or will happen later, but the node will eventually be removed. All operations are allowed.
- `Deleted`: The node is fully deleted. No operations are allowed, and the node cannot be brought back. The only remaining action is to remove its record from the database. Any attempt to register a node in this state will fail. This state persists across storage controller restarts.
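A minimal Rust sketch of this enum (the variants are as listed above; the derives and database mapping are assumptions of the sketch):

```rust
/// Lifecycle of a pageserver node as stored by the storage controller.
/// A sketch; the real type presumably also derives serialization and
/// database-mapping traits.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeLifecycle {
    /// The normal state: all operations are allowed.
    Active,
    /// The node is marked for deletion. Deletion may be in progress or
    /// deferred, but the node will eventually be removed.
    ScheduledForDeletion,
    /// The node is fully deleted (tombstoned). The row is kept so the
    /// node cannot silently re-register; only tombstone removal remains.
    Deleted,
}
```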
State transition:

```
        +--------------------+
  +-----| Active             |<---------------------+
  |     +--------------------+                      |
  |                     ^                           |
  | start_node_delete   | cancel_node_delete        |
  v                     |                           |
+----------------------------------+                |
|       ScheduledForDeletion       |                |
+----------------------------------+                |
     |                                              |
     |                     node_register            |
     |                                              |
     | delete_node (at the finish)                  |
     |                                              |
     v                                              |
+---------+       tombstone_delete       +----------+
| Deleted |----------------------------->|  no row  |
+---------+                              +----------+
```
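To make the diagram concrete, here is a hedged sketch of the legal transitions as a pure function; `transition_allowed` is a hypothetical helper, and `None` models the "no row" state after tombstone removal:

```rust
/// Returns true if a lifecycle transition from the diagram above is legal.
/// `None` represents the absence of a database row ("no row").
fn transition_allowed(from: Option<NodeLifecycle>, to: Option<NodeLifecycle>) -> bool {
    use NodeLifecycle::*;
    match (from, to) {
        // start_node_delete
        (Some(Active), Some(ScheduledForDeletion)) => true,
        // cancel_node_delete
        (Some(ScheduledForDeletion), Some(Active)) => true,
        // delete_node (at the finish)
        (Some(ScheduledForDeletion), Some(Deleted)) => true,
        // tombstone_delete removes the row entirely
        (Some(Deleted), None) => true,
        // node_register recreates the row in the Active state
        (None, Some(Active)) => true,
        _ => false,
    }
}
```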
A new `Deleting` variant is added to the `NodeSchedulingPolicy` enum. It indicates that the deletion function is currently running for the node. Only one node can have the `Deleting` policy at a time.
The `NodeSchedulingPolicy::Deleting` state is persisted in the database. However, after a storage controller restart, any node previously marked as `Deleting` will have its scheduling policy reset to `Pause`. The policy will only transition back to `Deleting` when the deletion operation is actively started again, as triggered by the node's `NodeLifecycle::ScheduledForDeletion` state.
`NodeSchedulingPolicy` transition details (see also the sketch after this list):
- When `node_delete` begins, set the policy to `NodeSchedulingPolicy::Deleting`.
- When `node_delete` is cancelled (for example, due to a concurrent drain operation), revert the policy to its previous value. The policy is persisted in the storcon DB.
- When `node_delete` completes, the final value of the scheduling policy is irrelevant, since `NodeLifecycle::Deleted` prevents any further access to this field.
- The deletion process cannot be initiated for nodes currently undergoing deployment-related operations (`Draining`, `Filling`, or `PauseForRestart` policies). Deletion will only be triggered once the node transitions to either the `Active` or `Pause` state.
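A hedged sketch of the restart behaviour described above (the `NodeSchedulingPolicy` variants other than `Deleting` already exist in the storage controller; the helper function itself is hypothetical):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeSchedulingPolicy {
    Active,
    Draining,
    Filling,
    Pause,
    PauseForRestart,
    Deleting, // the new variant proposed by this RFC
}

/// On storage controller startup, a node that crashed mid-deletion must
/// not stay in Deleting: reset it to Pause. The deletion task relaunched
/// for NodeLifecycle::ScheduledForDeletion nodes flips the policy back
/// to Deleting once it actually starts running.
fn policy_after_restart(policy: NodeSchedulingPolicy) -> NodeSchedulingPolicy {
    match policy {
        NodeSchedulingPolicy::Deleting => NodeSchedulingPolicy::Pause,
        other => other,
    }
}
```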
A replacement for `Option<OperationHandler> ongoing_operation`, the `OperationTracker` is a dedicated service-state object responsible for managing all long-running node operations (drain, fill, delete) with robust concurrency control. Its key responsibilities include ensuring that at most one such operation runs at a time, supporting cancellation, and coordinating with the persisted scheduling policy.
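As an illustration of those responsibilities, a minimal sketch of what such a tracker could look like (the field layout, locking scheme, and method names are all assumptions of this sketch, not the actual implementation):

```rust
use std::sync::Mutex;
use tokio_util::sync::CancellationToken;

/// The kind of long-running background operation a node may undergo.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OperationKind {
    Drain,
    Fill,
    Delete,
}

/// Replaces `Option<OperationHandler> ongoing_operation`: at most one
/// long-running node operation may be active at a time. Cleanup on
/// completion is elided from this sketch.
struct OperationTracker {
    inner: Mutex<Option<(OperationKind, CancellationToken)>>,
}

impl OperationTracker {
    /// Claim the tracker for a new operation; fails if another operation
    /// is already running (queueing is not supported).
    fn try_start(&self, kind: OperationKind) -> Result<CancellationToken, OperationKind> {
        let mut slot = self.inner.lock().unwrap();
        if let Some((running, _)) = &*slot {
            return Err(*running);
        }
        let token = CancellationToken::new();
        *slot = Some((kind, token.clone()));
        Ok(token)
    }

    /// Cancel the ongoing operation if it matches `kind`; a mismatch
    /// corresponds to the 400 Bad Request case described later.
    fn cancel(&self, kind: OperationKind) -> bool {
        let slot = self.inner.lock().unwrap();
        match &*slot {
            Some((running, token)) if *running == kind => {
                token.cancel();
                true
            }
            _ => false,
        }
    }
}
```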
When deleting a node, each attached tenant shard is live-migrated away from it; this process safely moves all attached shards before the node is deleted.

Each secondary tenant shard is likewise relocated, ensuring all secondary shards are safely moved before the node is deleted. A hedged sketch of the overall loop follows.
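In this sketch, `NodeId`, `attached_shards_on`, `secondary_shards_on`, `live_migrate_away`, `relocate_secondary`, and `mark_node_deleted` are hypothetical placeholders; the concrete per-shard steps are not spelled out here:

```rust
/// Drive graceful deletion of a node: move every shard away, then
/// tombstone the node. Error handling, retries, and scheduler
/// integration are omitted from this sketch.
async fn delete_node(node_id: NodeId, cancel: CancellationToken) -> anyhow::Result<()> {
    // Attached shards first: live-migrate each one to a node chosen by
    // the scheduler, keeping the source serving requests until cutover.
    for shard in attached_shards_on(node_id) {
        if cancel.is_cancelled() {
            anyhow::bail!("node deletion cancelled");
        }
        live_migrate_away(shard, node_id).await?;
    }
    // Then secondary shards: create a replacement secondary elsewhere
    // and drop the copy on the node being deleted.
    for shard in secondary_shards_on(node_id) {
        if cancel.is_cancelled() {
            anyhow::bail!("node deletion cancelled");
        }
        relocate_secondary(shard, node_id).await?;
    }
    // Only now is it safe to flip NodeLifecycle to Deleted.
    mark_node_deleted(node_id).await
}
```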
In case of a storage controller failure and subsequent restart, the system behavior depends on the `NodeLifecycle` state:

- `NodeLifecycle` is `Active`: no action is taken for this node.
- `NodeLifecycle` is `Deleted`: the node will not be re-added.
- `NodeLifecycle` is `ScheduledForDeletion`: a deletion background task will be launched for this node.

In case of a pageserver node failure during deletion, the behavior depends on the `force` flag:

- `force` is set: the node deletion will proceed regardless of the node's availability.
- `force` is not set: the deletion will be retried a limited number of times. If the node remains unavailable, the deletion process will pause and automatically resume when the node becomes healthy again.
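The restart decision above maps naturally onto a small function; a sketch with hypothetical names:

```rust
/// What the storage controller does for each node found in the database
/// when it restarts after a failure.
enum StartupAction {
    /// NodeLifecycle::Active: nothing to do.
    Nothing,
    /// NodeLifecycle::Deleted: keep the tombstone; the node is not re-added.
    KeepTombstone,
    /// NodeLifecycle::ScheduledForDeletion: relaunch the deletion task.
    SpawnDeletionTask,
}

fn on_startup(lifecycle: NodeLifecycle) -> StartupAction {
    match lifecycle {
        NodeLifecycle::Active => StartupAction::Nothing,
        NodeLifecycle::Deleted => StartupAction::KeepTombstone,
        NodeLifecycle::ScheduledForDeletion => StartupAction::SpawnDeletionTask,
    }
}
```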
The following sections describe the behavior when different types of requests arrive at the storage controller and how they interact with ongoing operations.
Handler: `PUT /control/v1/node/:node_id/delete`

On success, the node transitions to `NodeLifecycle::ScheduledForDeletion`. If the node is already in `NodeLifecycle::ScheduledForDeletion`:

- 200 OK: there is already an ongoing deletion request for this node

Handler: `DELETE /control/v1/node/:node_id/delete`

When the node is in `NodeLifecycle::ScheduledForDeletion`, the deletion request is cancelled and the node returns to `NodeLifecycle::Active`. Possible error responses:

- 404 Not Found: there is no current deletion request for this node
- 409 Conflict: queueing of drain/fill processes is not supported
- 400 Bad Request: the cancellation request is incorrect; the operations are not the same

The `force` flag is implemented and provides fast, failure-tolerant node removal (e.g., when a pageserver node does not respond).
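For illustration, a hypothetical client call against the deletion endpoint (the base URL, the `force` query parameter name, and the `reqwest` usage are assumptions of this sketch):

```rust
use reqwest::StatusCode;

/// Start a graceful deletion of a node and interpret the documented
/// status code. Pass force=true for fast, failure-tolerant removal.
async fn start_deletion(
    client: &reqwest::Client,
    node_id: u64,
    force: bool,
) -> anyhow::Result<()> {
    let url = format!("http://storage-controller:1234/control/v1/node/{node_id}/delete");
    let resp = client
        .put(url)
        .query(&[("force", force.to_string())])
        .send()
        .await?;
    match resp.status() {
        // 200 OK also covers the idempotent case: a deletion request is
        // already ongoing for this node.
        StatusCode::OK => println!("deletion scheduled (or already ongoing)"),
        other => println!("unexpected status: {other}"),
    }
    Ok(())
}
```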