docs/RFCS/20180411_finalize_cluster_upgrade_automatically.md
We will add an auto-finalization feature to help operators finalize their cluster upgrade process without manually running the SET CLUSTER SETTING version SQL command.
Meanwhile, we will also add a feature that allows operators to opt out of auto-finalization to preserve the ability to downgrade.
Previously, to finalize a cluster upgrade, operators had to manually run:
SET CLUSTER SETTING version = crdb_internal.node_executable_version();
However, this step is easy to forget, and without it we cannot unlock backwards-incompatible features. Therefore, we need a version check mechanism that automatically runs the finalization command once it knows that all nodes in the cluster are running the new version.
Because this finalization step is irreversible, we want to allow
operators to make their own decisions on whether to use this auto-finalization
feature or not. For example, operators who want to opt out of auto-upgrade from
version 2.0 to its next version should run:
SET CLUSTER SETTING cluster.preserve_downgrade_option = '2.0';
Please note that the opt-out is on a per-version basis. That is to say,
if operators want to keep using manual upgrade for version 2.1, they have to
re-run the above command with 2.0 replaced with 2.1. Otherwise, the cluster
version will still be upgraded automatically.
If operators change their mind after opting out, they can run the following command to switch back to auto upgrade:
RESET CLUSTER SETTING cluster.preserve_downgrade_option;
Finally, trying to set the cluster setting version when operators have already
set cluster.preserve_downgrade_option is not allowed. Operators have to reset
the cluster.preserve_downgrade_option to allow auto upgrade to happen.
cluster.preserve_downgrade_option
The field should be added to CLUSTER SETTING in the format of:
cluster.preserve_downgrade_option = 2.0
By default, the value should be "", an empty string, to indicate we are always
using auto upgrade. We validate that the value being stored in this field matches
the current cluster version before storing it.
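The validation rule above, together with the restriction on setting version while the opt-out is in place, can be sketched as follows. This is a minimal illustration in Python, not the actual implementation: the `ClusterSettings` class, its method names, and `SettingError` are hypothetical stand-ins for the real cluster settings machinery.

```python
class SettingError(Exception):
    """Raised when a setting change violates the rules described above."""


class ClusterSettings:
    """Hypothetical in-memory stand-in for the two settings involved."""

    def __init__(self, version):
        self.version = version               # current cluster version, e.g. "2.0"
        self.preserve_downgrade_option = ""  # "" means auto upgrade is in use

    def set_preserve_downgrade_option(self, value):
        # Only the current cluster version is accepted as a value.
        if value != self.version:
            raise SettingError(
                "cannot set cluster.preserve_downgrade_option to %r; "
                "cluster version is %r" % (value, self.version)
            )
        self.preserve_downgrade_option = value

    def reset_preserve_downgrade_option(self):
        # Mirrors RESET CLUSTER SETTING: back to the default empty string.
        self.preserve_downgrade_option = ""

    def set_version(self, new_version):
        # Changing the version is rejected while the opt-out is set.
        if self.preserve_downgrade_option != "":
            raise SettingError(
                "cannot set version while cluster.preserve_downgrade_option is set"
            )
        self.version = new_version


if __name__ == "__main__":
    settings = ClusterSettings("2.0")
    settings.set_preserve_downgrade_option("2.0")  # accepted: matches cluster version
    settings.reset_preserve_downgrade_option()     # opt back in to auto upgrade
    settings.set_version("2.1")                    # allowed once the opt-out is cleared
    print(settings.version, repr(settings.preserve_downgrade_option))
```

In this sketch the only way to clear the option is the reset method, which matches the test step that rejects any value other than the current cluster version.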
AND
Check crdb_internal.gossip_nodes and verify that all live nodes are running the new version.
AND
Check NodeLivenessStatus from the liveness server and verify that all non-decommissioned nodes are alive (no missing nodes).
If these conditions hold, run SET CLUSTER SETTING version = crdb_internal.node_executable_version();.
upgrade.go:
Test steps covered by upgrade.go (sleep between each step for a certain amount of time if necessary):
Start a cluster of N nodes running v2.0.0.
Perform a rolling upgrade for nodes 1 through N-1. At every iteration, check
that the cluster version is not upgraded.
Stop node N-1.
Perform an upgrade for node N, which was running v2.0.0.
Check that the cluster version is not upgraded.
Decommission node N-2 (decommissioned nodes should not affect auto upgrade).
Check that cluster.preserve_downgrade_option cannot be set to any value besides
2.0, which is the current cluster version.
Set cluster.preserve_downgrade_option to be 2.0.
Restart node N-1, which was previously force stopped.
Check that the cluster version is not upgraded.
Check that the cluster setting version cannot be set until cluster.preserve_downgrade_option
is cleared.
Reset cluster.preserve_downgrade_option.
Check that the cluster version is upgraded to new version.
Check that cluster.preserve_downgrade_option has been reset.
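The behavior these steps exercise can be modeled with a small decision function. The sketch below is an illustration only: the `Node` dataclass and `should_auto_finalize` are hypothetical names, and versions are compared as plain strings. It encodes the conditions this document states: the opt-out is not set for the current cluster version, every non-decommissioned node is alive, and every live node runs the new version.

```python
from dataclasses import dataclass


@dataclass
class Node:
    binary_version: str           # version of the binary the node runs, e.g. "2.1"
    live: bool = True
    decommissioned: bool = False


def should_auto_finalize(nodes, cluster_version, new_version,
                         preserve_downgrade_option=""):
    """Return True when all conditions for auto-finalization hold in this model."""
    # Opt-out: the operator pinned the current cluster version.
    if preserve_downgrade_option == cluster_version:
        return False
    # Nothing to finalize if the cluster is already at the new version.
    if cluster_version == new_version:
        return False
    # Every non-decommissioned node must be alive (no missing nodes).
    if any(not n.live for n in nodes if not n.decommissioned):
        return False
    # Every live node must already run the new version.
    return all(n.binary_version == new_version for n in nodes if n.live)


if __name__ == "__main__":
    # Four nodes on 2.0; upgrade the first three, as in the rolling upgrade step.
    nodes = [Node("2.0") for _ in range(4)]
    for n in nodes[:3]:
        n.binary_version = "2.1"
    print(should_auto_finalize(nodes, "2.0", "2.1"))  # False: one node still on 2.0
    nodes[3].binary_version = "2.1"
    print(should_auto_finalize(nodes, "2.0", "2.1"))  # True: all live nodes on 2.1
```

Walking the model through the steps above (stop a node, upgrade the last node, decommission another, set and then reset the opt-out) yields "not upgraded" at every check until the final reset.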
TestClusterVersionUpgrade:
TestClusterVersionUpgrade follows test steps similar to those of upgrade.go. Since
it serves as a quick check for CI, it's not as comprehensive as the roachtest.
If the user opts out of auto upgrade, we should show a banner that alerts users when all the nodes are running a higher version than the cluster version and instructs them to manually run the upgrade command.
If all live nodes are running the newest version but some nodes are down and not decommissioned, we should have a banner to alert operators to either revive or decommission the down nodes.
Let the operator know whether auto-upgrade is disabled at the current version.
If cluster.preserve_downgrade_option is equal to cluster.version,
auto-upgrade is disabled. Otherwise it's enabled.
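The three UI behaviors above could be derived from a handful of inputs, as in the following sketch. It is one possible reading of the rules, not a description of the admin UI code: `upgrade_banners`, `parse_version`, the banner constants, and the choice to show the down-node banner ahead of the manual-upgrade banner are all assumptions made for illustration.

```python
MANUAL_UPGRADE_BANNER = "All nodes run a newer version; run the upgrade command manually."
DOWN_NODES_BANNER = "Some nodes are down; revive or decommission them."


def parse_version(v):
    """Turn "2.1" into (2, 1) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))


def upgrade_banners(cluster_version, preserve_downgrade_option,
                    live_node_versions, down_nodes):
    """Return (auto_upgrade_disabled, banners) for the given cluster state.

    live_node_versions: binary versions of the live nodes.
    down_nodes: number of nodes that are down and not decommissioned.
    """
    # Auto-upgrade is disabled only when the opt-out matches the cluster version.
    auto_upgrade_disabled = preserve_downgrade_option == cluster_version
    current = parse_version(cluster_version)
    all_live_ahead = bool(live_node_versions) and all(
        parse_version(v) > current for v in live_node_versions
    )
    banners = []
    if all_live_ahead and down_nodes > 0:
        # Down nodes take priority here (an assumption of this sketch).
        banners.append(DOWN_NODES_BANNER)
    elif all_live_ahead and auto_upgrade_disabled:
        banners.append(MANUAL_UPGRADE_BANNER)
    return auto_upgrade_disabled, banners


if __name__ == "__main__":
    print(upgrade_banners("2.0", "2.0", ["2.1", "2.1"], 0))
    print(upgrade_banners("2.0", "", ["2.1", "2.1"], 1))
```

A stale opt-out value (for example "2.0" on a cluster already at 2.1) leaves auto-upgrade enabled in this model, which matches the per-version behavior of the opt-out.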