docs/rfcs/20221018_cluster_bootstrap.md
This document describes upcoming changes to the steps Redpanda performs when instantiating a cluster, with the intent of simplifying node configuration and eliminating the class of incidents caused by incorrect configuration of node IDs and seed servers. The user-facing effect of this proposal is that every node's configuration can be made identical -- no special node and no per-node specifics will be required when configuring a cluster.
TL;DR
seed_servers config, and this heterogeneous configuration requirement causes trouble when not implemented correctly.

Current implementation of cluster formation:
The above requirements of today's configuration have led to a poor experience for customers, including incidents as severe as a controller topic suffering split brain.
Since this project addresses several issues, the overhaul of bootstrapping involves a few disparate tasks:
Rather than relying on a single node to be the root node, the entire seed servers list will be taken to be the initial Raft group for the controller. Once all the seed servers are up, each seed server should determine whether all seed servers have the same seed servers list. If so, the nodes are primed to start a cluster and should await the instruction to do so. While waiting to start, we may initialize most subsystems, but must not initialize the controller, the bulk of the admin server, the Kafka API, and others.
The initialization on seed servers (non-seed servers continue with today’s join-until-success logic), in short:
The initial seed server check will be done via a new RPC endpoint with the following request and response structure:
```cpp
struct initial_cluster_info_request {
    // Empty.
};

struct initial_cluster_info_reply {
    // A cluster UUID, if one already exists, indicating we should join.
    string cluster_uuid;
    // Manually assigned node ID, if any, so we can form our initial mapping of UUIDs.
    int node_id;
    // UUID of the destination node, so we can form our initial mapping of UUIDs.
    string node_uuid;
    // Seeds from the destination server. Maybe useful to log if there's a problem.
    // Useful in detecting the empty_seed_servers_forms_cluster=True case.
    vector<string> seed_servers;
};
```
The subsystem that serves these RPCs must initialize before the controller forms its initial Raft group. As such, we will instantiate an RPC protocol, on the seed servers only, that accepts initial_cluster_info calls. Once the controller has initialized, we will register the regular RPC protocol used for the rest of runtime.
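To make the pre-formation check concrete, here is a minimal sketch of the loop each seed server could run before allowing controller creation. The helper fetch_initial_cluster_info, the bootstrap_decision enum, and the simplified reply struct are illustrative assumptions, not existing Redpanda APIs.

```cpp
#include <optional>
#include <string>
#include <vector>

// Simplified stand-in for the reply type defined above.
struct initial_cluster_info_reply {
    std::string cluster_uuid; // empty when no cluster has been formed yet
    int node_id;
    std::string node_uuid;
    std::vector<std::string> seed_servers;
};

// Hypothetical transport call; nullopt when the peer is unreachable.
std::optional<initial_cluster_info_reply>
fetch_initial_cluster_info(const std::string& addr);

enum class bootstrap_decision { join_existing, form_cluster, retry_later };

// One pass of the pre-formation check run by each seed server.
bootstrap_decision
check_seed_servers(const std::vector<std::string>& my_seeds) {
    for (const auto& addr : my_seeds) {
        auto reply = fetch_initial_cluster_info(addr);
        if (!reply) {
            // A peer is not reachable yet; check again later.
            return bootstrap_decision::retry_later;
        }
        if (!reply->cluster_uuid.empty()) {
            // A cluster already exists; join it rather than forming a new one.
            return bootstrap_decision::join_existing;
        }
        if (reply->seed_servers != my_seeds) {
            // Mismatched seed lists: refuse to form a cluster.
            return bootstrap_decision::retry_later;
        }
    }
    // All seed servers are up, report no cluster, and agree on the seed list.
    return bootstrap_decision::form_cluster;
}
```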
Alternate approaches to initial controller group sizing
Alternate approaches to controller group initialization
wait_for_init to be false
Alternate approaches to initial_cluster_info endpoint
It doesn’t seem unreasonable that some clusters knowingly start out with cluster size 1, and then grow as use cases grow more mature on the cluster. Additionally, the existing user experience for developer clusters relies heavily on being able to not provide a seed server list.
To that end, we will continue to support supplying an empty seed servers list or a seed server list of size one, either of which will be construed to mean starting a cluster of size one.
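As a rough sketch of that interpretation (the helper name is hypothetical):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: an empty seed list, or a list with a single entry,
// is construed as forming a cluster of size one.
bool starts_single_node_cluster(const std::vector<std::string>& seed_servers) {
    return seed_servers.size() <= 1;
}
```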
Alternatives
minimum_initial_cluster_size configuration

A new controller log command will be added to indicate the initial desired state of the cluster, including initial admin secrets, license, etc. This will be a new controller command with roughly the following payload:
```cpp
struct cluster_init_cmd_data {
    string cluster_uuid;
    vector<tuple<uuid, node_id, addr>> seed_servers;
    vector<pair<credential_user, scram_credential>> user_creds;
    string license;
};
```
Upon election, the controller group leader should determine whether this message already exists by checking its kv-store for a cluster UUID. If not, it will replicate a new message. The contents of the message must be known before the controller group is formed.
Only once this entry is replicated and applied to a node will that node open up its admin server, Kafka ports, etc. and continue with bootstrapping. Applying this message must also update the corresponding state in the members table, feature table, and security manager.
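A minimal sketch of the leader-side flow described above, assuming hypothetical kv_store and controller_stm interfaces in place of Redpanda's real kv-store and controller replication path:

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins for the node-local kv-store and the controller
// replication path.
struct kv_store {
    std::optional<std::string> get_cluster_uuid() const;
    void put_cluster_uuid(const std::string& uuid);
};

struct controller_stm {
    // Replicates a cluster_init_cmd_data payload through the controller log.
    bool replicate_cluster_init(const std::string& cluster_uuid);
};

std::string generate_uuid(); // assumed UUID generator

// Run by the controller leader once it wins the initial election.
void maybe_bootstrap_cluster(kv_store& kvs, controller_stm& controller) {
    if (kvs.get_cluster_uuid()) {
        // Bootstrap has already been recorded; nothing to replicate.
        return;
    }
    // The command's contents (UUID, seed list, initial credentials, license)
    // must be decided before replication so every replica applies identical
    // state.
    auto uuid = generate_uuid();
    if (controller.replicate_cluster_init(uuid)) {
        kvs.put_cluster_uuid(uuid);
    }
    // Each node opens its admin server and Kafka listeners only after
    // *applying* the replicated command, not on the leader's replication path.
}
```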
It should be noted that clusters today already initialize a cluster UUID as a part of the metrics subsystem. This old cluster UUID is used for phone home.
Alternatives
Upon starting up, if there is not already a UUID in the local kv-store, the node will generate a UUID and persist it in the local kv-store. This will be the node's UUID.
The join_node request messages already include an unused node_uuid field and an existing node_id field (moving forward, just a hint). The response node_id field will indicate what node_id has been assigned.
Attempts to join a cluster with the same node_uuid after having already joined that cluster with that UUID should fail. This would indicate a bug wherein a node that has data attempts to join a cluster twice.
Attempts to join a cluster with an in-use node_id should fail, if the UUID in the request differs from the one persisted in the members table.
Auto-generated node_ids will be monotonically increasing for now, even when a node is decommissioned and recommissioned; such a node will be assigned a new node_uuid and a new node_id.
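The rules above could be enforced with a check of the following shape; the members_table alias and validate_join helper are simplified assumptions for illustration, not the actual implementation:

```cpp
#include <map>
#include <optional>
#include <string>

// Simplified view of the members table: node_id -> node_uuid.
using members_table = std::map<int, std::string>;

struct join_result {
    bool accepted;
    int assigned_node_id; // meaningful only when accepted
};

// Applies the join rules described above. `requested_id` is empty when the
// joining node asks for an automatically assigned ID.
join_result validate_join(
  const members_table& members,
  const std::string& node_uuid,
  std::optional<int> requested_id,
  int next_auto_id) {
    for (const auto& [id, uuid] : members) {
        if (uuid == node_uuid) {
            // This UUID already joined once; a second join indicates a bug.
            return {false, -1};
        }
    }
    if (requested_id) {
        auto it = members.find(*requested_id);
        if (it != members.end() && it->second != node_uuid) {
            // The requested node_id is already in use by a different node.
            return {false, -1};
        }
        return {true, *requested_id};
    }
    // Auto-assigned IDs are monotonically increasing.
    return {true, next_auto_id};
}
```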
Today, nodes only send join requests after they have started their RPC server and are ready to begin accepting Raft requests for their node ID. In the future, when nodes start up without a node ID, we need to uphold this ordering to avoid extraneous delays in bootstrapping. To that end, assignment of node IDs and joining a cluster will be two separate operations. The controller provides a means to perform operations serially and in a deterministic order, so it seems like an ideal place to assign node IDs.
When starting up with no node ID, a node will send a join request to the seed servers to replicate a controller command that registers its UUID with an unused node ID. Since the ordering of controller apply operations is deterministic, the assigned IDs are deterministic as well and can be inferred from their order in the log with a simple counter.
Initial node IDs for the seed servers are taken to be their indices in the seed servers list.
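A sketch of how every replica could derive the same UUID-to-ID mapping by replaying the controller log; the register_node_uuid_cmd type and assign_node_ids function are illustrative names, not the actual commands:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical shape of a node-registration controller command.
struct register_node_uuid_cmd {
    std::string node_uuid;
};

// Rebuilds the UUID -> node_id mapping by replaying the controller log.
// Seed servers take their index in the seed list; every later registration
// takes the next value of a simple counter, so every replica computes the
// same assignments.
std::map<std::string, int> assign_node_ids(
  const std::vector<std::string>& seed_node_uuids,
  const std::vector<register_node_uuid_cmd>& log_commands) {
    std::map<std::string, int> assignments;
    int next_id = 0;
    for (const auto& uuid : seed_node_uuids) {
        assignments[uuid] = next_id++; // seed index == initial node ID
    }
    for (const auto& cmd : log_commands) {
        if (assignments.count(cmd.node_uuid) == 0) {
            assignments[cmd.node_uuid] = next_id++;
        }
    }
    return assignments;
}
```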
Alternatives to node ID assignment mechanism
Alternatives to initial node ID assignment
{seed_server_addr => seed_server_node_id}.
Alternatives to mechanics of node ID assignment before join
configuration_update that indicates new UUIDs
replicas_to_add field

Superuser configuration
Continue using environment variables, and eventually implement as a part of an initialize command.
Alternatives
NOTE: as of v22.3, this is not yet implemented.
Supply as a part of the bootstrap config YAML, and eventually implement as a part of an initialize command.
Alternatives
Single node clusters will not be a special case. Everything else about the cluster will still reflect the changes laid out in this RFC. Cluster UUIDs will still be assigned, node UUIDs will still be generated, joining nodes will be automatically assigned node IDs, etc.
Alternatives
An empty_seed_servers_starts_cluster option that means seed servers should attempt to join if they receive an initial_cluster_info response with an empty seed_servers list
NOTE: as of v22.3, this is not yet implemented.
Only once the rest of this project is done will we consider implementing an init command.
The rpk init tool will trigger the request to start the cluster, and will return only after the cluster has formed successfully, returning its cluster UUID.
The tool will take a single address as input, potentially localhost. If that node is a seed server and is awaiting initialization, it will proceed with cluster formation and forward the request to all other seed servers. If the node is not a seed server, it will simply forward the request via RPC to each seed server. If a cluster has already been formed (as indicated by a persisted cluster UUID), the server will return immediately with the cluster UUID.
The tool will periodically retry the request until a cluster UUID is returned, at which point it will exit.
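Server-side, the handling described above might look roughly like this sketch; node_state, forward_start_request, and begin_cluster_formation are assumed names standing in for the real node state and RPC plumbing:

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical node-local state relevant to a cluster-start request.
struct node_state {
    std::string self_addr;
    bool is_seed;
    std::vector<std::string> seed_servers;
    std::optional<std::string> cluster_uuid; // set once a cluster exists
};

void forward_start_request(const std::string& seed_addr); // assumed RPC call
void begin_cluster_formation(node_state& node);           // assumed hook

// Returns the cluster UUID once the cluster exists, or nullopt so the caller
// (e.g. rpk init) knows to retry later.
std::optional<std::string> handle_cluster_start(node_state& node) {
    if (node.cluster_uuid) {
        // Already formed: report the persisted UUID immediately.
        return node.cluster_uuid;
    }
    if (node.is_seed) {
        // A waiting seed server proceeds with formation itself...
        begin_cluster_formation(node);
    }
    // ...and in either case relays the request to the other seed servers.
    for (const auto& addr : node.seed_servers) {
        if (addr != node.self_addr) {
            forward_start_request(addr);
        }
    }
    return std::nullopt;
}
```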
Alternatives
rpk cluster start
NOTE: as of v22.3, this is not yet implemented.
Either run from localhost, or run with a superuser defined in the bootstrap YAML and RP_BOOTSTRAP_USER.
Since we are still going to support the existing pattern of configuration, we won’t have to immediately update any k8s operator, ansible, etc. deployments. Some already have mechanisms to avoid starting up with an empty seed servers list after initial bootstrap.
Moving forward, at the very least, we should remove the generation of node ID from these deployments.
Longer term, we should be very intentional about the seed servers list, e.g. at least one seed in each AZ, to avoid creating a new cluster when an AZ goes down.
NOTE: as of v22.3, this is not yet implemented.
There already exists a cluster UUID that is inferred by every node, as it is based on the controller log. We hash the first controller messages and timestamps, and use that as the seed for the cluster UUID, deterministically ensuring it’s identical on every node. This gets persisted in the config manager.
Moving forward, the metrics reporter should check to see if a cluster ID exists in the config, and pull it directly from the shard-local config if so. If not, it should use the cluster UUID.
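In other words, the lookup preference could be as simple as the following sketch, with configured_cluster_id and bootstrap_cluster_uuid standing in for the real accessors:

```cpp
#include <optional>
#include <string>

// Hypothetical accessors for the two possible sources of a cluster identifier.
std::optional<std::string> configured_cluster_id(); // from shard-local config
std::string bootstrap_cluster_uuid();               // from the bootstrap record

// The metrics reporter prefers the configured cluster ID and falls back to
// the cluster UUID assigned at bootstrap.
std::string reporting_cluster_id() {
    if (auto id = configured_cluster_id()) {
        return *id;
    }
    return bootstrap_cluster_uuid();
}
```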