docs/rfcs/20190926_cluster_controller.md
two-phase local node bootstrap.
controller is our global partition. It is a model::ntp("redpanda", "controller", 0) alias. controller is effectively a global metadata cache. It is a compacted topic.
Upon local node start, restart, crash, etc, we bootstrap by read the controller contents entirely. We expect the ammount of data to be actually small, i.e.: less than 1GB of content.
While we are reading the contents, we delegate to
sharded<cluster::partition_manager> pm the recovery of particular ntp via
cluster::partition_manager>::manage(model::ntp, raft::group_id).
This partition manager holds one storage::log_manager instance. The
storage::log_manager::manage(ntp) will perform the filesystem recovery of
the log segments for a given ntp. Note that a ntp may have thousands
of individual log segments.
This proposal removes the knowledge from the storage about which core ntp
belong to. It pushes this to the place of knowledge cluster::controller
+ Kafka +-----------+
| + |
| | |
| (a) | |
| | |
| | | (h)
| | |
| v |
| cluster |
| + |
| | |
| (b) | |
| | |
| v | read-path [h]
| consensus::raft |
write-path | + + |
[ a-g ] | (d) | (e)| |
| v v |
| RPC storage<----+
| +
| | (leader / replica 1)
| |
| |
| |
| (f) | (g)
| +------v-------+
| | |
| v v
+ replica-2 replica-3
We need to have a sound bootstraping and log recovery mechanism upon a machine restart (from shutdown, reboot, crash, etc)
cluster::controller reads the full cluster::partition. It will scan for
all the assignments for model::node_id belongs to the machine.
It will build 2 indexes. model::ntp -> seastar::shard_id and a
raft::group_id -> seastar::shard_id. This abstraction is captured by
the cluster::shard_table.
cluster::controller copies the full assignment to all cores.
Once cluster::shard_table is built (and copied to all cores).
We will do a concurrent recovery of all the model::ntp. First,
we delegate the recovery to the sharded<cluster::partition_manager>.
Every time we ask the
cluster::partition_manager>::manage(model::ntp, raft::group_id) to
manage an ntp and raft group, it will create an entry on the
storage::log_manager via the storage::log_manager::manage(ntp).
The storage::log_manager::manage(ntp) will perform the particular ntp
recovery of all log segments on disk.
Note that this operation is happening concurrently. Currently the
storage::log_manager::manage(ntp) uses the seastar::default_priority
so it means that recovery is happening fairly across all cores and within
each core doing all the IO for recovery.
This RFC is attached to an interface prototype.
Why should we not do this?
Recovery now has a constant cost before actual machine bootstrap can occur. However, as long as controller is small (1GB), two-phase recovery will be a negligible cost since the majority of the cost will be on recovering the actual log segments.
In addition, when we hit this bottleneck, we can create a secondary log with only the assignments for this machine which maybe quite small (thousands) A cache of assignments.
log_manager level
** Wiring of components happens at main.ccmodel::parition_id::type % smp::core_count is fragile and offers no real
load balacing system. For example, core0 will always do a little more work
in most systems because it is the only core guaranteed to exist.
Second, having a cluster level knowledge gives us the opportunity to optimize
physical core placement.
Third, having a global component allows us to do global optimization
strategies rather than node-local-maxima strategies