docs/src/guide/distributed_indexing.md
!!! warning Lance exposes public APIs that can be integrated into an external distributed index build workflow, but Lance itself does not provide a full distributed scheduler or end-to-end orchestration layer.
This page describes the current model, terminology, and execution flow so
that callers can integrate these APIs correctly.
Distributed index build in Lance follows the same high-level pattern as distributed write:
For vector indices, the worker outputs are segments stored directly
under indices/<segment_uuid>/. Lance can turn these outputs into one or more
physical segments and then commit them as one logical index.
This guide uses the following terms consistently:
execute_uncommitted() under
indices/<segment_uuid>/For example, a distributed vector build may create a layout like:
indices/<segment_uuid_0>/
├── index.idx
└── auxiliary.idx
indices/<segment_uuid_1>/
├── index.idx
└── auxiliary.idx
indices/<segment_uuid_2>/
├── index.idx
└── auxiliary.idx
After segment build, Lance produces one or more segment directories:
indices/<physical_segment_uuid_0>/
├── index.idx
└── auxiliary.idx
indices/<physical_segment_uuid_1>/
├── index.idx
└── auxiliary.idx
These physical segments are then committed together as one logical index. In the
common no-merge case, the input segments are already the physical
segments and build_all() returns them unchanged.
There are two parties involved in distributed indexing:
Lance does not provide a distributed scheduler. The caller is responsible for launching workers and driving the overall workflow.
The current model for distributed vector indexing has two layers of parallelism.
First, multiple workers build segments in parallel:
create_index_builder(...).fragments(...).execute_uncommitted()
or Python create_index_uncommitted(..., fragment_ids=...)indices/<segment_uuid>/Then the caller decides whether those existing segments should be committed as-is or merged into larger segments:
commit_existing_index_segments(...), ormerge_existing_index_segments(...) for each caller-defined groupcommit_existing_index_segments(...)Within a single commit, built segments must have disjoint fragment coverage.
Internally, Lance models distributed vector segment build as:
The merge step is driven directly by the IndexMetadata returned from
execute_uncommitted().
This is intentionally a storage-level model:
The caller chooses the final segment grouping:
The grouping decision is separate from worker build. Workers only build segments; Lance applies the segment build policy when it plans physical segments.
The caller is expected to know:
Lance is responsible for:
If a staging root or built segment directory is never committed, it remains an
unreferenced index directory under _indices/. These artifacts are cleaned up
by cleanup_old_versions(...) using the same age-based rules as other
unreferenced index files.
This split keeps distributed scheduling outside the storage engine while still letting Lance own the on-disk index format.