docs/dev/hinted_handoff_design.md
Hinted Handoff is a feature that allows replaying failed writes. The mutation and the destination replica are saved in a log and replayed later according to the feature configuration.
$SCYLLA_HOME/hints
Once the WRITE mutation fails with a timeout we create a hints_queue for the target node.
Each hint is specified by:
Hints are appended to the hints_queue until (all this should be done using the existing or slightly modified commitlog API):
While hints are being appended to the queue, its files are closed and flushed to disk once they reach the maximum allowed size (32MB) or when the queue is forcefully flushed (see "Hints sending" below).
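The rollover rule above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `HintsQueueSketch`, `append`, and `flush` are hypothetical names, no real files are written, and the actual behavior is whatever the commitlog infrastructure provides.

```python
from dataclasses import dataclass, field

MAX_SEGMENT_BYTES = 32 * 1024 * 1024  # the 32MB limit mentioned in the text


@dataclass
class HintsQueueSketch:
    """Hypothetical model of one destination's hints queue (not Scylla code)."""

    max_segment_bytes: int = MAX_SEGMENT_BYTES
    closed_segments: list = field(default_factory=list)  # sizes of sealed files
    active_bytes: int = 0  # bytes written to the currently open file

    def append(self, hint: bytes) -> None:
        """Append a hint and seal the active file once it reaches the limit."""
        self.active_bytes += len(hint)
        if self.active_bytes >= self.max_segment_bytes:
            self.flush()

    def flush(self) -> None:
        """Forcefully close the active file; a no-op if nothing was written."""
        if self.active_bytes > 0:
            self.closed_segments.append(self.active_bytes)
            self.active_bytes = 0
```

In this model a file is sealed either by `append` when the size limit is reached or by an explicit `flush`, which mirrors the two triggers named in the paragraph.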
We are going to reuse the commitlog infrastructure for writing hints to disk - it provides both the internal buffering and the memory consumption control.
Hints to the specific destination are stored under the hints_directory/<shard ID>/<node host ID> directory.
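As an illustration of this layout, the following sketch builds the per-destination path. The helper name `hint_dir` is hypothetical, and the example assumes host IDs are rendered as UUID strings; the base directory and UUID in the usage note are made-up values.

```python
from pathlib import PurePosixPath
from uuid import UUID


def hint_dir(hints_directory: str, shard_id: int, host_id: UUID) -> PurePosixPath:
    """Build hints_directory/<shard ID>/<node host ID> for one destination."""
    return PurePosixPath(hints_directory) / str(shard_id) / str(host_id)
```

For example, `hint_dir("/tmp/hints", 3, UUID("8d5ed9f4-7764-4dbd-bad8-43fddce94b7c"))` yields `/tmp/hints/3/8d5ed9f4-7764-4dbd-bad8-43fddce94b7c`.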
Scylla is moving away from using IP addresses to identify nodes in its internals and that role is being taken over by host IDs. Hinted Handoff is no exception to that and the module uses the new type now.
However, to support upgrading Scylla from a version where Hinted Handoff still used IP addresses, a migration process has been introduced. Its purpose is to rename the existing hint directories on disk so that every name is a valid host ID.
When the whole cluster starts using a version of Scylla that supports host-ID based Hinted Handoff, the module is suspended (i.e. no new hints are accepted and no hints are being sent) and we start renaming hint directories to host IDs. Hinted Handoff does NOT work until the migration process has finished.
As a side effect, all sync points that were created up to then will be canceled, i.e. waiting on such a sync point fails with an exception instead of yielding a resolved future.
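The cancellation semantics can be sketched as follows. Python's `concurrent.futures.Future` is used here purely as a stand-in for the futures Scylla uses internally, and `SyncPointCanceled` and `cancel_sync_points` are hypothetical names introduced for the example.

```python
from concurrent.futures import Future


class SyncPointCanceled(Exception):
    """Hypothetical error type for a sync point dropped by the migration."""


def cancel_sync_points(pending: list) -> None:
    """Fail every outstanding sync point instead of resolving it."""
    for fut in pending:
        if not fut.done():
            fut.set_exception(SyncPointCanceled("hint directory migration started"))
```

A caller waiting on one of these futures then observes the exception rather than a successful result.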
Another major consequence of the migration process is that hints may be lost. If there is no corresponding host ID for a given
IP address in locator::token_metadata, or if renaming a directory fails, the directory is removed along with all of its
contents. In that case, a warning is issued.
Migration won't be started if a node is being stopped or drained.
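The renaming and removal rules described above can be sketched for a single shard directory. This is a minimal illustration, not the actual implementation: `migrate_shard_dir` is a hypothetical helper, the plain dictionary `ip_to_host_id` stands in for the lookup in locator::token_metadata, and the suspension of the module and the stop/drain check are not modeled.

```python
import ipaddress
import logging
import shutil
from pathlib import Path

log = logging.getLogger("hints_migration_sketch")


def migrate_shard_dir(shard_dir: Path, ip_to_host_id: dict) -> None:
    """Rename IP-named hint directories to host IDs, dropping unmappable ones."""
    # sorted() materializes the listing so renames do not disturb iteration.
    for entry in sorted(shard_dir.iterdir()):
        if not entry.is_dir():
            continue
        try:
            ipaddress.ip_address(entry.name)
        except ValueError:
            continue  # not an IP address; assumed to be named by host ID already
        host_id = ip_to_host_id.get(entry.name)
        if host_id is not None:
            try:
                entry.rename(shard_dir / host_id)
                continue
            except OSError:
                pass  # renaming failed; fall through to removal
        # No mapping or failed rename: the hints in this directory are lost.
        log.warning("removing hint directory %s: migration failed", entry)
        shutil.rmtree(entry, ignore_errors=True)
```

Directories whose names do not parse as IP addresses are left untouched in this sketch, on the assumption that they already use host IDs.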