membership.md
Simon (@superfell) and I (@ongardie) talked through reworking this library's cluster membership changes last Friday. We don't see a way to split this into independent patches, so we're taking the next best approach: submitting the plan here for review, then working on an enormous PR. Your feedback would be appreciated. (@superfell is out this week, however, so don't expect him to respond quickly.)
These are the main goals:
peers.json file, to avoid issues of consistency between that and the log/snapshot.We propose to re-define a configuration as a set of servers, where each server includes an address (as it does today) and a mode that is either:
All changes to the configuration will be done by writing a new configuration to the log. The new configuration will be in affect as soon as it is appended to the log (not when it is committed like a normal state machine command). Note that, per my dissertation, there can be at most one uncommitted configuration at a time (the next configuration may not be created until the prior one has been committed). It's not strictly necessary to follow these same rules for the nonvoter/staging servers, but we think its best to treat all changes uniformly.
Each server will track two configurations:
When there's no membership change happening, these two will be the same. The latest configuration is almost always the one used, except:
We propose the following operations for clients to manipulate the cluster configuration:
This diagram, of which I'm quite proud, shows the possible transitions:
+-----------------------------------------------------------------------------+
| |
| Start -> +--------+ |
| ,------<------------| | |
| / | absent | |
| / RemovePeer--> | | <---RemovePeer |
| / | +--------+ \ |
| / | | \ |
| AddNonvoter | AddVoter \ |
| | ,->---' `--<-. | \ |
| v / \ v \ |
| +----------+ +----------+ +----------+ |
| | | ---AddVoter--> | | -log caught up --> | | |
| | nonvoter | | staging | | voter | |
| | | <-DemoteVoter- | | ,- | | |
| +----------+ \ +----------+ / +----------+ |
| \ / |
| `--------------<---------------' |
| |
+-----------------------------------------------------------------------------+
While these operations aren't quite symmetric, we think they're a good set to capture the possible intent of the user. For example, if I want to make sure a server doesn't have a vote, but the server isn't part of the configuration at all, it probably shouldn't be added as a nonvoting server.
Each of these application-level operations will be interpreted by the leader and, if it has an effect, will cause the leader to write a new configuration entry to its log. Which particular application-level operation caused the log entry to be written need not be part of the log entry.
This is a non-exhaustive list, but we came up with a few things:
peers.json file introduces the possibility of getting out of sync with the log and snapshot, and it's hard to maintain this atomically as the log changes. It's not clear whether it's meant to track the committed or latest configuration, either.peers.json but should initialize the log or snapshot with an application-provided configuration entry.Again, we're looking for feedback here before we start working on this. Here are some questions to think about: