docs/hnsw-index-replication.md
An HNSW (Hierarchical Navigable Small World) vector index in Dragonfly is global: a
single graph per (index, field) pair is shared across all shards, while the documents
it indexes are distributed per-shard. The full-sync RDB stream is by construction
per-shard. Replication must therefore carry two orthogonal pieces of state — the graph
and each shard's key space — and reassemble them on the replica, including the case where
master and replica shard counts differ.
This document specifies the wire format, the master/replica protocol, the index state machine, and the invariants that make the scheme safe under concurrent writes.
DocId is a 32-bit shard-local document identifier. GlobalDocId is a 64-bit value
containing the shard id in its upper 32 bits and the DocId in its lower 32 bits, and is
the only identifier used inside the HNSW graph. Because GlobalDocId encodes the
master's shard id, it must be rewritten when shard counts differ (§6).
(index_name, field_name), addressed by
GlobalDocId.DocIds. DocIds
are issued independently per shard; together with the shard id they form the
GlobalDocId stored in the graph.HnswNodeData — internal_id, global_id, level, and for each level l in
0..level the neighbour-link list (so level + 1 link arrays). Vector payloads are
not part of this record; they are restored from the normal key stream.
Capacity, top level and element count are derived from the node stream itself during
restore; the entry point is the only graph-level parameter carried on the wire, and it
travels in the RDB_OPCODE_VECTOR_INDEX header (§3.2). Empty graphs are not
transmitted; the replica rebuilds them from the (empty) keyspace.
| Key | Payload | Scope |
|---|---|---|
search-index | "<index_name> <FT.CREATE arguments…>" | Summary flow and SINGLE_SHARD_WITH_SUMMARY flows unconditionally; per-shard flows only if the replica advertises capability VER6 or later. Omitted entirely from RDB-to-disk saves. |
search-synonyms | "<index_name> <group_id> <terms…>" | Summary flow. |
shard-count | integer | Summary and SINGLE_SHARD_WITH_SUMMARY flows. Load-bearing: controls the remap branch (§6). |
RDB_OPCODE_VECTOR_INDEX (value 222). One block per non-empty global HNSW index,
emitted by shard 0 only.
opcode u8 = 222
index_key string "<index_name>:<field_name>"
enterpoint_node len
elements_number len
repeated elements_number times:
internal_id u32 (little-endian raw)
global_id u64
level u32
for l in 0..=level:
links_num u32
links u32 × links_num
index_key is split on the final :. enterpoint_node and elements_number are RDB
varints (SaveLen/LoadLen); the per-node fields are packed little-endian. Indices
whose graph is empty are not emitted at all.
RDB_OPCODE_SHARD_DOC_INDEX (value 223). One block per HNSW-indexed index, emitted by
every shard.
opcode u8 = 223
shard_id len
index_name string
mapping_count len
repeated mapping_count times:
key string
doc_id len
In order of appearance in the RDB stream:
search-index is written for every index. The summary flow also
writes search-synonyms once per synonym group and shard-count once.RDB_OPCODE_SHARD_DOC_INDEX block per
HNSW-indexed index, containing a snapshot of that shard's key→DocId table.kBuilding → kSerializing transition to every shard; indices in other states are
left as-is. It then acquires the read half of each global index's MRMW mutex in
turn and emits one RDB_OPCODE_VECTOR_INDEX block per index, releasing the lock
and flushing the serializer at the end of each one.kSerializing and returns those indices to kBuilding.The graph dump is additionally gated by the --serialize_hnsw_index flag on the master.
When the flag is off, steps 2 and 3 are skipped and the replica rebuilds every index
from the key stream.
Inline processing:
search-index dispatches FT.CREATE with idempotent semantics (existing definitions
are left in place).shard-count is recorded and used to select the restore branch.RDB_OPCODE_SHARD_DOC_INDEX block is parked as a pending mapping keyed by the
master's shard id.RDB_OPCODE_VECTOR_INDEX is either restored in place (same shard count) or parked
as pending nodes (different shard count).Post-load reconciliation, run once after the stream has been fully consumed:
kRestoring) or to rebuild the index from scratch
(graph missing or inconsistent, → kBuilding directly).GlobalDocId. The key→DocId mapping is
revalidated against the snapshot before any removal, because concurrent fibers may
reuse freed DocIds. Nodes whose document is missing are removed from the graph; keys
whose hydration cannot complete immediately are queued for the final drain.kRestoring → kBuilding. The drain is
deferred until all shards have completed because the graph is global; a single
shard cannot mark itself ready while others may still install vector data into the
same graph. Shards that took the full-rebuild path are already in kBuilding and
skip the drain.The replica's --deserialize_hnsw_index flag mirrors the master flag: when off, both
opcodes are skipped on arrival and every index is rebuilt from the key stream.
Each shard holds one state per index, drawn from the four values
{kProhibit, kRestoring, kSerializing, kBuilding}. States on different shards and on
different indices are independent. The state reflects the index's phase (loading,
restoring, serializing, steady-state) rather than the node's role — any instance
visits master-side states while saving and replica-side states while loading — but
in a typical replication topology each side only visits its own subset.
Replica side. The replica visits kProhibit first, then either rebuilds from
scratch or hydrates a restored graph, ending in kBuilding.
| State | When entered | Mutations | Exit |
|---|---|---|---|
kProhibit | Default at InitIndex while the shard is in LOADING. The shard has not yet decided whether to rebuild from scratch or restore from RDB graph data. | Buffered into the pending-updates list (may be discarded — see exit). | Full-rebuild path: directly to kBuilding; the pending-updates list is cleared, not replayed (the rebuild reindexes every document from the key stream). Restore path: to kRestoring. |
kRestoring | Restore path only: graph data was installed from RDB and Rebuild(is_restored=true) ran, leaving vector payloads to be hydrated from the key stream. | Buffered for replay. | Drain → kBuilding, after hydration completes on every shard. |
kBuilding | Reached directly via full rebuild, or via the drain out of kRestoring. Terminal under a single full-sync. | Applied inline to the graph. | Promoted to kSerializing only if this replica later runs its own snapshot. |
The distinction between kProhibit and kRestoring is which path was chosen:
kProhibit means the shard is still buffering ops without a commitment to replay
them, and the full-rebuild exit will discard the buffer; kRestoring means the shard
has committed to keeping the buffer and replaying it on drain.
Master side. The master starts in kBuilding and is briefly promoted to
kSerializing for the duration of each graph dump.
| State | When entered | Mutations | Exit |
|---|---|---|---|
kBuilding | Steady state on the master. | Applied inline to the graph. | Promoted to kSerializing at the start of every graph dump. |
kSerializing | Set right before a graph dump, only on indices currently in kBuilding. | Buffered for replay. | Drain → kBuilding, after the dump completes. |
Transitions. Drains replay the pending-updates list against the current database
state; the full-rebuild exit out of kProhibit is the only state change that
discards buffered ops instead of replaying them.
kProhibit → kBuilding — full-rebuild path on the replica: the shard discards any buffered ops and rebuilds the index from scratch from the key stream.kProhibit → kRestoring — restore path on the replica: graph data was installed from RDB; the shard now needs to hydrate vector payloads.kRestoring → kBuilding — drain at the end of post-load reconciliation, after all shards finish hydration.kBuilding → kSerializing — issued at the start of the graph dump on the side that is saving.kSerializing → kBuilding — drain at the end of the graph dump.There is no direct transition between kRestoring and kSerializing.
The global graph is mutated only when the owning shard is in kBuilding. All other
states divert Add and Remove calls to a per-shard pending-updates list, which is
replayed on drain.
While a graph dump is in progress, shard-level writes are accepted at the key layer; their index side-effects are buffered. The MRMW read lock held on the graph prevents any buffered mutation from committing until the dump completes.
A graph can be configured to store pointers into hash field sds rather than copying
vectors. In copy mode this invariant is a no-op. In borrowed mode, any mutation that
would free or overwrite such an sds while the index is not in kBuilding must retain
the original sds until the drain. Retention is per-shard, per-field. The containing
PrimeValue additionally suppresses defragmentation for the lifetime of the node to
prevent relocation of borrowed storage.
GlobalDocIds are unique across shards by construction. On fresh mapping load and on
remap (§6), per-shard DocId counters issue DocIds in the same order in which keys are
later installed, so counter values coincide with the DocIds materialized by the replica's
key index.
Graph structure is restored before vector payloads are hydrated. A node whose document
is missing in the local database at hydration is removed from the graph. Nodes whose
vector data cannot be installed immediately are deferred to the post-load drain
(§4.2 step 5); they exist briefly with stale payloads but are never reachable by a
search query, which requires the index to be in kBuilding.
When the shard-count AUX value differs from the replica's shard count, GlobalDocIds
carried in graph blocks refer to the master's shard layout and must be rewritten before
the graph can be installed.
(index, master_shard, key, old_doc_id) received in
the mapping stream, compute the replica target shard for the key and issue a fresh
DocId from a per-target counter. The counter order must match the order in which the
replica later installs keys, so that counter values coincide with materialized DocIds.global_id in parked node records through the
table, then restore the graph. Any index that cannot be fully remapped is added to a
failed set.| Master shards | Replica shards | Mappings present | Action |
|---|---|---|---|
| N | N | — | Direct restore on each shard at opcode arrival; no remap. |
| N | M (≠ N) | yes | Park nodes; remap, restore, pre-distributed mapping apply at post-load. |
| N | M (≠ N) | no, or remap fails | Discard graph, rebuild from key stream. |
The wire format is gated on the replica capability VER6, advertised through
REPLCONF capa dragonfly, and on the master flag --serialize_hnsw_index. Replicas
below VER6 receive only the FT.CREATE definition through the summary flow and
reconstruct each index from the key stream alone. RDB saves to disk omit all
search-index data: search indices are a replication-only concern.