docs/RFCS/20150820_store_pool.md
Add a new StorePool service on each node that monitors all the stores and
reports on their current status and health. Based on just a store ID, the
pool will report the health of the store. Initially, health will only indicate
whether the store is dead or alive, but it will expand to include other factors
in the future. This will also be the ideal location to add any calculations about
which store would be best suited to take on a new replica, subsuming some of
the work from the allocator.
This new service is intended to work alongside #2153 and #2171.
Decisions about when to add or remove replicas for rebalancing and repair require knowledge of the health of other stores. There needs to be a local source of truth for those decisions.
Add a new configuration setting called TimeUntilStoreDead, which contains
the number of seconds a store may go unheard from before it is
considered dead. The default value will be 5 minutes.
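As a rough illustration of the setting described above, here is a minimal Go sketch. The `StorePoolConfig` type and the `NewStorePoolConfig` helper are hypothetical names introduced only for this example, and falling back to the default for non-positive values is an assumption rather than something this RFC specifies.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultTimeUntilStoreDead mirrors the 5 minute default proposed in this RFC.
const DefaultTimeUntilStoreDead = 5 * time.Minute

// StorePoolConfig is a hypothetical container for the new setting.
type StorePoolConfig struct {
	// TimeUntilStoreDead is how long a store may go unheard from before
	// it is considered dead.
	TimeUntilStoreDead time.Duration
}

// NewStorePoolConfig builds a config from a value given in seconds.
// Assumption: zero or negative input falls back to the default.
func NewStorePoolConfig(seconds int64) StorePoolConfig {
	if seconds <= 0 {
		return StorePoolConfig{TimeUntilStoreDead: DefaultTimeUntilStoreDead}
	}
	return StorePoolConfig{TimeUntilStoreDead: time.Duration(seconds) * time.Second}
}

func main() {
	fmt.Println(NewStorePoolConfig(0).TimeUntilStoreDead)  // 5m0s
	fmt.Println(NewStorePoolConfig(90).TimeUntilStoreDead) // 1m30s
}
```

Storing the value as a `time.Duration` avoids repeated unit conversions when it is later compared against elapsed time.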
Add a new service called StorePool that starts when the node is started.
This new service will run until the stopper is called and have access to
gossip.
StorePool will maintain a map of store IDs to store descriptors and a variety
of health statistics about each store. It will also maintain a LastUpdatedTime
for each store, which will be set whenever that store's descriptor is updated.
When this happens, if the store was previously marked as dead, it will be
restored. To maintain this map, a callback from gossip for store descriptors
will be added. When the time since LastUpdatedTime exceeds TimeUntilStoreDead,
the store is considered dead and any replicas on this store may be removed.
Note that the work to remove replicas is performed elsewhere.
StorePool will maintain a timespan timeUntilNextDead, which is calculated by
taking the LastUpdatedTime closest to expiring (the earliest one across all the
stores) and adding TimeUntilStoreDead. It will also record the store ID
associated with that timeout.
StorePool will trigger on timeUntilNextDead and, when triggered, check whether
that store has been updated in the meantime.
If the store hasn't been updated, it will be marked as dead.
StorePool will then calculate the next timeUntilNextDead to wake up the service.
No drawbacks come to mind right now, other than that we're adding a new service, which should be very lightweight.
If RFCs #2153 and #2171 aren't implemented, should we consider another option?