# Distributed Indexing
!!! warning
    Lance exposes public APIs that can be integrated into an external
    distributed index build workflow, but Lance itself does not provide a full
    distributed scheduler or end-to-end orchestration layer.
This page describes the current model, terminology, and execution flow so
that callers can integrate these APIs correctly.
Distributed index build in Lance follows the same high-level pattern as
distributed write: workers produce uncommitted outputs in parallel, and a
single caller commits them in one step.
For vector indices, the worker outputs are segments stored directly
under `indices/<segment_uuid>/`. Lance can turn these outputs into one or more
physical segments and then commit them as one logical index.
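The build-then-commit flow described above can be sketched with a small toy
model. All names below (`worker_build_segment`, `commit_logical_index`) are
hypothetical stand-ins for illustration, not Lance APIs:

```python
import uuid

def worker_build_segment() -> str:
    """Hypothetical worker step: build one uncommitted segment and
    return the directory it was written to."""
    seg_id = uuid.uuid4().hex
    return f"indices/{seg_id}/"

def commit_logical_index(segment_dirs: list[str]) -> dict:
    """Hypothetical commit step: publish the physical segments
    together as one logical index."""
    return {"index": "vector_idx", "segments": sorted(segment_dirs)}

# Three workers build segments independently; one commit publishes them.
dirs = [worker_build_segment() for _ in range(3)]
manifest = commit_logical_index(dirs)
assert len(manifest["segments"]) == 3
```

The key property this models is that segment builds are independent and only
the final commit ties them together into one index.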
This guide uses the following terms consistently:

- **Segment**: the output of a single worker's `execute_uncommitted()` call,
  stored under `indices/<segment_uuid>/`.
- **Physical segment**: a segment directory that is committed as part of the
  index.
- **Logical index**: the committed index, composed of one or more physical
  segments.

For example, a distributed vector build may create a layout like:
```
indices/<segment_uuid_0>/
├── index.idx
└── auxiliary.idx
indices/<segment_uuid_1>/
├── index.idx
└── auxiliary.idx
indices/<segment_uuid_2>/
├── index.idx
└── auxiliary.idx
```
After the segment build step, Lance produces one or more physical segment
directories:
```
indices/<physical_segment_uuid_0>/
├── index.idx
└── auxiliary.idx
indices/<physical_segment_uuid_1>/
├── index.idx
└── auxiliary.idx
```
These physical segments are then committed together as one logical index. In
the common no-merge case, the input segments are already the physical segments
and `build_all()` returns them unchanged.
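One way to picture how input segments become physical segments is a greedy
size-threshold grouping. The sketch below is purely illustrative, assuming a
hypothetical `plan_physical_segments` helper; it is not Lance's actual
planning algorithm:

```python
def plan_physical_segments(segments, target_bytes):
    """Greedy grouping sketch: pack input segments into plans until
    adding the next segment would exceed the target size.

    `segments` is a list of (uuid, size_bytes) tuples. This models the
    merge vs. no-merge distinction only, not Lance's real policy.
    """
    plans, current, current_bytes = [], [], 0
    for seg_uuid, size in segments:
        if current and current_bytes + size > target_bytes:
            plans.append(current)          # close the current physical segment
            current, current_bytes = [], 0
        current.append(seg_uuid)
        current_bytes += size
    if current:
        plans.append(current)
    return plans

segments = [("seg_0", 600), ("seg_1", 500), ("seg_2", 400)]
# A large target merges inputs into fewer physical segments.
print(plan_physical_segments(segments, target_bytes=1200))
# A small target leaves every input unchanged (the no-merge case).
print(plan_physical_segments(segments, target_bytes=500))
```

Under the large target, `seg_0` and `seg_1` merge into one physical segment;
under the small target, each input passes through as its own physical segment.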
There are two parties involved in distributed indexing:

- **The caller**: an external scheduler or orchestration layer that launches
  workers and drives the overall workflow.
- **Lance**: the storage engine, which provides the build and commit APIs and
  owns the on-disk index format.

Lance does not provide a distributed scheduler; the caller is responsible for
launching workers and driving the overall workflow.
The current model for distributed vector indexing has two layers of parallelism.
First, multiple workers build segments in parallel:
```
create_index_builder(...).fragments(...).execute_uncommitted()
```

or, in Python, `create_index_uncommitted(..., fragment_ids=...)`. Each worker
writes its segment under `indices/<segment_uuid>/`. Then the caller turns those
existing segments into one or more physical segments:
1. Call `create_index_segment_builder()`.
2. Provide the inputs with `with_segments(...)`.
3. Set the size policy with `with_target_segment_bytes(...)`.
4. Call `plan()` to get `Vec<IndexSegmentPlan>`.

At that point the caller has two execution choices:
- Call `build(plan)` for each plan and run those builds in parallel, or
- call `build_all()` to let Lance build every planned segment on the current
  node.

After the physical segments are built, publish them with
`commit_existing_index_segments(...)`.
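The two layers of parallelism plus the final commit can be sketched as a
self-contained orchestration example. Here `build_segment`,
`build_physical_segment`, and `commit` are hypothetical stubs standing in for
the real worker, segment build, and commit calls named in this guide:

```python
from concurrent.futures import ThreadPoolExecutor

def build_segment(fragment_ids):
    """Layer 1 stub: one worker builds an uncommitted segment
    covering its assigned fragments."""
    return {"segment": f"seg_{min(fragment_ids)}", "fragments": fragment_ids}

def build_physical_segment(plan):
    """Layer 2 stub: build one planned physical segment from
    existing input segments."""
    return {"physical": "+".join(s["segment"] for s in plan)}

def commit(physical_segments):
    """Final step stub: publish all physical segments as one logical index."""
    return sorted(p["physical"] for p in physical_segments)

with ThreadPoolExecutor() as pool:
    # Layer 1: workers index disjoint fragment ranges in parallel.
    segments = list(pool.map(build_segment, [[0, 1], [2, 3], [4, 5]]))
    # Plan step: here trivially one plan per input segment (the no-merge case).
    plans = [[seg] for seg in segments]
    # Layer 2: planned physical-segment builds also run in parallel.
    physical = list(pool.map(build_physical_segment, plans))

manifest = commit(physical)
print(manifest)  # ['seg_0', 'seg_2', 'seg_4']
```

The real workflow replaces each stub with the corresponding Lance call, but the
shape is the same: parallel worker builds, a single plan step, parallel
physical-segment builds, and one commit.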
Internally, Lance models distributed vector segment build as a plan step
followed by a build step. The plan step is driven by the segment metadata
returned from `execute_uncommitted()` and any additional inputs requested by
the segment build APIs.
This is intentionally a storage-level model: plans describe index directories
and files on storage, not tasks on particular machines. When Lance builds
segments from existing inputs, it may either merge several input segments into
one physical segment or pass an input segment through unchanged. The grouping
decision is separate from worker build: workers only build segments, and Lance
applies the segment build policy when it plans physical segments.
The caller is expected to know how to launch workers, how to assign fragments
to them, and how to run the planned segment builds. Lance is responsible for
planning physical segments, building them, and committing the result as one
logical index in the on-disk format it owns.
This split keeps distributed scheduling outside the storage engine while still letting Lance own the on-disk index format.