doc/developer/design/20240117_decoupled_storage_controller.md
As part of the platform v2 work (specifically use-case isolation), we want to develop a scalable and isolated serving layer made up of multiple processes that interact with distributed primitives at the moments where coordination is required.
The way the compute controller and storage controller currently work does not suit that design, because they are coupled together inside one process.
Overarching/long-term goal: make it possible to run controller code distributed across different processes, for example one compute controller per cluster that is local to that cluster.
Finer-grained goals:
persist-txn. But there is more work in
that area, specifically we need to replace the current process-level write
lock.

Advantages of this approach:
Implications:
num_collection * num_clusters
since handles, where before it was only num_collection handles.

We can move table-related things out into a TableWriter because the StorageController doesn't do much with/to tables, except:
For both of these use cases, the StorageController can be given access to a StorageCollections and acquire read holds the same way as everyone else (compute and the adapter). Upper updates will no longer have to flow through a special channel: the StorageCollections will keep uppers and sinces up to date the same way as for other collections, through persist pubsub.
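To make the "same as everyone else" idea concrete, here is a minimal sketch in which several clients share one handle and acquire read holds through it. The types below (`StorageCollections`, `ReadHold`, `implied_since`) are simplified, hypothetical stand-ins written for this illustration. They are not the real Materialize API, and persist pubsub is not modeled at all.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the real identifier and timestamp types.
type CollectionId = u64;
type Timestamp = u64;

/// Shared bookkeeping: per collection, how many holds exist at each timestamp.
#[derive(Default)]
struct Inner {
    holds: BTreeMap<CollectionId, BTreeMap<Timestamp, usize>>,
}

/// A cloneable handle that every client (compute, adapter, storage
/// controller) could be given. Illustrative only.
#[derive(Clone, Default)]
struct StorageCollections {
    inner: Arc<Mutex<Inner>>,
}

/// A read hold that releases itself when dropped.
struct ReadHold {
    id: CollectionId,
    ts: Timestamp,
    inner: Arc<Mutex<Inner>>,
}

impl StorageCollections {
    /// Registers a hold on `id` at `ts` and returns a guard for it.
    fn acquire_read_hold(&self, id: CollectionId, ts: Timestamp) -> ReadHold {
        let mut inner = self.inner.lock().unwrap();
        *inner.holds.entry(id).or_default().entry(ts).or_insert(0) += 1;
        ReadHold { id, ts, inner: Arc::clone(&self.inner) }
    }

    /// The since implied by outstanding holds: the smallest held timestamp.
    fn implied_since(&self, id: CollectionId) -> Option<Timestamp> {
        let inner = self.inner.lock().unwrap();
        inner.holds.get(&id).and_then(|held| held.keys().next().copied())
    }
}

impl Drop for ReadHold {
    fn drop(&mut self) {
        let mut guard = self.inner.lock().unwrap();
        let inner: &mut Inner = &mut guard;
        let mut collection_is_empty = false;
        if let Some(per_collection) = inner.holds.get_mut(&self.id) {
            // Decrement the count for this timestamp and note whether it hit zero.
            let now_zero = match per_collection.get_mut(&self.ts) {
                Some(count) => {
                    *count -= 1;
                    *count == 0
                }
                None => false,
            };
            if now_zero {
                per_collection.remove(&self.ts);
            }
            collection_is_empty = per_collection.is_empty();
        }
        if collection_is_empty {
            inner.holds.remove(&self.id);
        }
    }
}
```

In this sketch the storage controller is just another holder of a cloned `StorageCollections`. Dropping the last hold at the smallest timestamp is what lets the implied since move forward.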
If we want to achieve full physical use-case isolation, where we have the
serving work (and therefore also the controller work) of an environment split
across multiple processes and not one centralized environmentd, we also need
StorageController to work in that world. That is, it needs to become more like
ComputeController where there is a per-cluster controller and not one
monolithic controller inside environmentd.
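The handle-count implication of per-cluster controllers can be checked with a small sketch. It assumes the worst case, where every cluster holds its own since handle for every collection. The function names and identifier types are hypothetical and exist only for this illustration.

```rust
use std::collections::BTreeSet;

// Hypothetical identifiers, used only for this illustration.
type CollectionId = u64;
type ClusterId = u64;

/// Centralized controller: one since handle per collection.
fn centralized_handles(collections: &[CollectionId]) -> BTreeSet<CollectionId> {
    collections.iter().copied().collect()
}

/// Per-cluster controllers: in the assumed worst case, every cluster holds
/// its own since handle for every collection.
fn per_cluster_handles(
    clusters: &[ClusterId],
    collections: &[CollectionId],
) -> BTreeSet<(ClusterId, CollectionId)> {
    clusters
        .iter()
        .flat_map(|cluster| {
            collections
                .iter()
                .map(move |collection| (*cluster, *collection))
        })
        .collect()
}
```

With 3 clusters and 4 collections, the per-cluster set has 12 entries while the centralized set has 4, matching the num_collection * num_clusters estimate.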
We only need #1 for use-case isolation Milestone 2, where we want better
isolated components and use-case isolation within the single environmentd.
For Milestone 3, full physical use-case isolation, we also need #2.
We can keep a centralized StorageController that runs as a singleton in one process. Whenever other processes want to, for example, acquire or release read holds they have to talk to this process via RPC.
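A rough sketch of what this alternative implies for callers follows. An in-process channel stands in for the RPC layer, and the `Request` enum and helper functions are hypothetical names invented for this illustration, not an existing protocol.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::{self, Sender};
use std::thread;

// Hypothetical stand-ins for the real identifier and timestamp types.
type CollectionId = u64;
type Timestamp = u64;

/// Requests that other processes would have to send to the singleton.
enum Request {
    AcquireReadHold { id: CollectionId, ts: Timestamp, reply: Sender<()> },
    ReleaseReadHold { id: CollectionId, ts: Timestamp, reply: Sender<()> },
    Since { id: CollectionId, reply: Sender<Option<Timestamp>> },
}

/// Starts the singleton controller on its own thread (a stand-in for a
/// separate process) and returns the "RPC" endpoint for talking to it.
fn spawn_central_controller() -> Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        let mut holds: BTreeMap<CollectionId, Vec<Timestamp>> = BTreeMap::new();
        for request in rx {
            match request {
                Request::AcquireReadHold { id, ts, reply } => {
                    holds.entry(id).or_default().push(ts);
                    let _ = reply.send(());
                }
                Request::ReleaseReadHold { id, ts, reply } => {
                    if let Some(held) = holds.get_mut(&id) {
                        if let Some(pos) = held.iter().position(|held_ts| *held_ts == ts) {
                            held.swap_remove(pos);
                        }
                    }
                    let _ = reply.send(());
                }
                Request::Since { id, reply } => {
                    let since = holds.get(&id).and_then(|held| held.iter().min().copied());
                    let _ = reply.send(since);
                }
            }
        }
    });
    tx
}

/// Every operation is a full round trip to the singleton.
fn acquire(controller: &Sender<Request>, id: CollectionId, ts: Timestamp) {
    let (reply, response) = mpsc::channel();
    controller.send(Request::AcquireReadHold { id, ts, reply }).unwrap();
    response.recv().unwrap();
}

fn release(controller: &Sender<Request>, id: CollectionId, ts: Timestamp) {
    let (reply, response) = mpsc::channel();
    controller.send(Request::ReleaseReadHold { id, ts, reply }).unwrap();
    response.recv().unwrap();
}

fn since(controller: &Sender<Request>, id: CollectionId) -> Option<Timestamp> {
    let (reply, response) = mpsc::channel();
    controller.send(Request::Since { id, reply }).unwrap();
    response.recv().unwrap()
}
```

Even in this toy form, each acquire or release blocks on a reply from the single controller, which is the coupling this alternative would keep.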
Arguments against this alternative:
SinceHandles (and other resources) for each cluster makes it clearer
who is holding on to things.

None so far.