docs/dev/view-building-coordinator.md
The view building coordinator is responsible for building views from tablet-based keyspaces.
In contrast to vnode-based views, which are built by a node-local view builder, the view building coordinator is a single entity in the whole cluster, running on the Raft leader alongside the topology coordinator.
The coordinator state is stored in the group0 tables described below and
is loaded into `view_building_state_machine` as group0 state is applied.
Currently the coordinator processes at most one base table at a time, building all views for that base table.
The whole view building process is split into smaller view building tasks.
Each task is associated with a particular tablet replica (host_id, shard, tablet_id) of a certain base table.
There are 2 types of tasks:
- `build_range` - generate view updates from the tablet replica of the base table to build a view
- `process_staging` - process (generate view updates and move to the base directory) all staging sstables associated with the tablet replica of the base table

View building tasks are created when:
- `build_range` tasks:
- `process_staging` tasks:

view_building_worker

The group0 state stores only tasks that haven't been completed yet or were aborted but haven't been cleaned up yet.
When a task is created, it is stored in the group0 state (`system.view_building_tasks`) to be processed in the future.
Then, at some point, the view building coordinator decides to process it by sending a `work_on_view_building_tasks` RPC to a worker.
Unless the task was aborted, the worker will eventually reply that the task was finished. After the coordinator gets the response from the worker,
it temporarily saves the list of IDs of finished tasks and removes those tasks from the group0 state (permanently marking them as finished) in 200ms intervals. (*)
This batching of finished-task removals is done to reduce the number of generated group0 operations.
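The batching described above can be sketched roughly as follows. This is an illustrative model only: `finished_task_batcher` and its members are hypothetical names invented for this sketch, not the actual coordinator code.

```c++
#include <chrono>
#include <cstddef>
#include <unordered_set>
#include <vector>

using task_id = unsigned;

// Hypothetical sketch: the coordinator buffers IDs of finished tasks and
// flushes them in one batched group0 removal every `_flush_interval`,
// instead of issuing one group0 operation per finished task.
class finished_task_batcher {
    std::unordered_set<task_id> _pending;           // finished, not yet removed from group0
    std::chrono::milliseconds _flush_interval{200}; // batching interval from the doc

public:
    void mark_finished(task_id id) { _pending.insert(id); }

    // Invoked by a timer every `_flush_interval`; returns the batch that would
    // be removed from system.view_building_tasks in a single group0 operation.
    std::vector<task_id> flush() {
        std::vector<task_id> batch(_pending.begin(), _pending.end());
        _pending.clear();
        return batch;
    }

    std::size_t pending() const { return _pending.size(); }
};
```

Note that using a set also deduplicates repeated "task finished" notifications for the same ID.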
On the other hand, view building tasks can also be aborted, for 2 main reasons:
- the keyspace or view is dropped - the task is simply removed,
- a tablet operation touches the task's replica - the task is kept, but its `aborted` flag is set to true. We need to keep the task information
  to create new adjusted tasks (if the operation succeeded) or to roll them back (if the operation failed).
Once a task is aborted by setting the flag, this cannot be revoked, so rolling back a task means creating its duplicate and removing the original task.

(*) Because there is a time gap between when the coordinator learns that a task is finished (from the RPC response) and when the task is marked as completed, the coordinator may lose this information (e.g. due to a Raft leader change). However, each view building worker keeps track of finished tasks locally, so when a new coordinator sends an RPC with the same view building tasks, the worker will immediately respond that those tasks were completed. In the worst case, when both the coordinator and worker nodes are restarted, the information is lost completely and the work has to be redone. This is safe because view building tasks are idempotent.
View building task struct:
```c++
struct view_building_task {
    enum class task_type {
        build_range,
        process_staging,
    };

    utils::UUID id;
    task_type type;
    bool aborted;
    table_id base_id;
    table_id view_id; // default value when `type == task_type::process_staging`
    locator::tablet_replica replica;
    locator::tablet_id tid;
};
```
State machine:
```mermaid
stateDiagram-v2
    state "the task is alive" as NORMAL
    state "aborted flag is set to true" as ABORTED

    [*] --> NORMAL: task is created
    NORMAL --> work_on_view_building_tasks: view building coordinator sends an RPC
    work_on_view_building_tasks --> [*]: the coordinator gets a response from the RPC call if the task wasn't aborted in the meantime
    NORMAL --> [*]: aborted due to keyspace/view drop
    NORMAL --> ABORTED: aborted due to tablet operation
    ABORTED --> [*]: new adjusted task is created
    ABORTED --> [*]: the task is rolled back with a new ID

    state work_on_view_building_tasks {
        state "view building worker is executing the task" as EXECUTE
        state "the worker saves information that the task was finished locally" as DONE

        [*] --> EXECUTE
        EXECUTE --> DONE
        DONE --> [*]
    }
```
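The diagram can also be expressed as a transition function. This is an illustrative model only: the `event` names are invented for this sketch, and removal from group0 is modeled as `std::nullopt`.

```c++
#include <optional>

// A task is either alive (NORMAL) or has its aborted flag set (ABORTED);
// completion, a drop-related abort, or cleanup removes it from group0.
enum class task_state { normal, aborted };

enum class event {
    worker_finished,          // RPC response; task wasn't aborted in the meantime
    keyspace_or_view_dropped, // task is simply removed
    tablet_operation,         // sets the aborted flag
    adjusted_task_created,    // aborted task replaced by adjusted tasks
    rolled_back,              // rollback re-creates the task under a new ID
};

// Returns the next state, or nullopt when the task is removed from group0.
std::optional<task_state> transition(task_state s, event e) {
    if (s == task_state::normal) {
        switch (e) {
        case event::worker_finished:
        case event::keyspace_or_view_dropped: return std::nullopt;
        case event::tablet_operation: return task_state::aborted;
        default: return s;
        }
    }
    // aborted: the flag cannot be unset; the task only disappears
    switch (e) {
    case event::adjusted_task_created:
    case event::rolled_back: return std::nullopt;
    default: return s;
    }
}
```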
The most important table is `system.view_building_tasks`, which stores all unfinished view building tasks:

```sql
CREATE TABLE system.view_building_tasks (
    key text,
    id timeuuid,
    type text,
    aborted boolean,
    base_id uuid,
    view_id uuid,          -- NULL for "process_staging" tasks
    base_tablet_id bigint,
    host_id uuid,          -- Host of the tablet replica
    shard int,             -- Shard of the tablet replica
    PRIMARY KEY (key, id)
)
```
The view building coordinator stores the currently processed base table in `system.scylla_local`
under the `view_building_processing_base` key.
The entry is managed by group0.
Once a view is built, an entry in `system.built_views` is created. Before the view building coordinator was introduced,
this table was node-local. Now the table is partially managed by group0,
meaning that entries for tablet-based keyspaces are managed by group0, while
entries for vnode-based keyspaces remain node-local.
The view building worker is a node-local service responsible for executing view building tasks. It handles the `work_on_view_building_tasks` RPC and responds to the coordinator once the tasks are finished. The worker also observes the group0 state to notice when tasks are aborted (either by deleting them or by setting the `aborted` flag).
The worker groups multiple view building tasks into a batch and it can execute only one batch per shard at a time (it's the coordinator's responsibility to schedule only one batch per tablet replica).
Tasks can be in one batch only if they have the same:
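A rough sketch of the grouping rule. Since the exact list of shared attributes is not spelled out above, this sketch assumes the batch key includes at least the tablet replica (host, shard) and the task type; the struct and function names are hypothetical.

```c++
#include <map>
#include <tuple>
#include <vector>

// Simplified stand-in for a view building task, for illustration only.
struct task {
    unsigned id;
    unsigned host;  // stand-in for host_id
    unsigned shard;
    bool staging;   // stand-in for task_type
};

using batch_key = std::tuple<unsigned, unsigned, bool>;

// Group task IDs so that a batch only contains tasks sharing the same
// tablet replica and type (an assumed batch key).
std::map<batch_key, std::vector<unsigned>> make_batches(const std::vector<task>& tasks) {
    std::map<batch_key, std::vector<unsigned>> batches;
    for (const auto& t : tasks) {
        batches[{t.host, t.shard, t.staging}].push_back(t.id);
    }
    return batches;
}
```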
The view building worker doesn't mark tasks as finished (it performs no group0 operations, with one exception). Instead, it saves the IDs of finished tasks, and it is the coordinator who asks the worker for the result of given tasks using the following RPC call:

```c++
verb [[cancellable]] work_on_view_building_tasks(raft::term_t term, shard_id shard, std::vector<utils::UUID> tasks_ids) -> std::vector<utils::UUID>
```
The worker registers a handler for the RPC, which:
The view building coordinator needs to react to the following tablet operations:
- tablet migration - affects tasks with `base_id == tablet operation table_id`, `replica == source replica` and matching tablet ids:
- tablet resize - affects tasks with `base_id == tablet operation table_id`:
  - split: tablet id `n -> (2n, 2n+1)`
  - merge: tablet id `n -> n/2`, if such a task doesn't exist yet

In all of these cases, the tasks are aborted at the start of the operation and new, adjusted tasks are created at the end. In case of failure, new copies of the aborted tasks are created during rollback.
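The tablet-id remapping used when re-creating aborted tasks after a resize can be sketched as follows (hypothetical helper names; real tasks carry a replica and type in addition to the tablet id):

```c++
#include <set>
#include <utility>
#include <vector>

// A split replaces tablet n with tablets 2n and 2n+1, so one aborted
// task becomes two adjusted tasks.
std::pair<unsigned, unsigned> split_tablet(unsigned n) {
    return {2 * n, 2 * n + 1};
}

// A merge maps tablet n to n/2; a new task is created only if a task for
// the merged tablet doesn't exist yet (two siblings merge into one).
std::vector<unsigned> merge_tablets(const std::vector<unsigned>& ids) {
    std::set<unsigned> seen;
    std::vector<unsigned> out;
    for (auto n : ids) {
        unsigned merged = n / 2;
        if (seen.insert(merged).second) {
            out.push_back(merged);
        }
    }
    return out;
}
```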
The view building coordinator can also handle staging sstables using process_staging view building tasks.
We do this because we don't want to start generating view updates from a staging sstable prematurely,
before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).
Firstly, `db::view::check_needs_view_update_path()` now returns `db::view::sstable_destination_decision`
instead of a bool value determining whether the sstable should go to the base or the staging directory.

```c++
enum class sstable_destination_decision {
    normal_directory,              // use normal sstable directory
    staging_directly_to_generator, // use staging directory and register the sstable to the view update generator
    staging_managed_by_vbc,        // use staging directory and notify the view building worker
};
```
For vnode-based sstables, the function works the same, but for tablet-based sstables the logic is: