skills/lance-user-guide/references/index-selection.md
Use this file when the user asks "which index should I use" or "how do I tune it".
Always confirm:
| Workload | Recommended starting point | Notes |
|---|---|---|
| Filter-only scans (scanner(filter=...)) | Create a scalar index on the filtered column | Choose scalar index type based on predicate shape and cardinality |
| Vector search only (nearest=...) on large data | Build a vector index | Start with IVF_PQ if you need compression; tune nprobes / refine_factor |
| Vector search + selective filter | Scalar index for filter + vector index for search | Use prefilter=True when you need true top-k among filtered rows |
| Vector search + non-selective filter | Vector index only (or scalar index optional) | Consider prefilter=False for speed; accept fewer than k results |
| Text search | Create an INVERTED scalar index | Use full_text_query=... when available; note that FTS is not a universal alias in all SDK versions |
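
The table above can be read as a small decision procedure. The sketch below encodes it in plain Python so the branches are explicit. The `Workload` class and `recommend_indices` function are hypothetical names made up for illustration, not part of the Lance API, and the returned strings only paraphrase the table rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a query workload (not a Lance type)."""
    vector_search: bool = False
    has_filter: bool = False
    filter_is_selective: bool = False
    text_search: bool = False


def recommend_indices(w: Workload) -> list[str]:
    """Return the starting-point recommendation from the workload table."""
    if w.text_search:
        return ["INVERTED scalar index"]
    if w.vector_search and w.has_filter:
        if w.filter_is_selective:
            # Selective filter: index both sides and keep true top-k.
            return ["scalar index on the filter column",
                    "vector index",
                    "prefilter=True"]
        # Non-selective filter: the vector index does most of the work.
        return ["vector index",
                "optional scalar index",
                "consider prefilter=False"]
    if w.vector_search:
        return ["vector index (IVF_PQ if compression is needed)"]
    if w.has_filter:
        return ["scalar index on the filtered column"]
    return []  # no filter and no search: nothing to recommend


if __name__ == "__main__":
    print(recommend_indices(Workload(vector_search=True, has_filter=True,
                                     filter_is_selective=True)))
```

Whether a filter counts as "selective" is a judgment call that this sketch leaves to the caller; the table does not give a threshold.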
Vector index names typically follow a pattern like {clustering}_{sub_index}_{quantization}.
Common combinations:
- IVF_PQ: IVF clustering + PQ compression
- IVF_HNSW_SQ: IVF clustering + HNSW + SQ
- IVF_SQ: IVF clustering + SQ
- IVF_RQ: IVF clustering + RQ
- IVF_FLAT: IVF clustering + no quantization (exact vectors within clusters)

If you are unsure which types are supported in the user's environment, recommend starting with IVF_PQ and fall back to "try and see" (the API will error on unsupported types).
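
To make the naming pattern concrete, here is a minimal sketch that splits an index type name into its parts. It assumes, based only on the combinations listed above, that two-part names such as IVF_PQ omit the sub-index and that three-part names carry all three components. `IndexName` and `split_index_name` are hypothetical helpers, not Lance functions, and the sketch does not check whether a given type is actually supported.

```python
from typing import NamedTuple, Optional


class IndexName(NamedTuple):
    """Hypothetical container for the parts of a vector index type name."""
    clustering: str
    sub_index: Optional[str]
    quantization: str


def split_index_name(name: str) -> IndexName:
    """Split names like 'IVF_PQ' or 'IVF_HNSW_SQ' into their components."""
    parts = name.upper().split("_")
    if len(parts) == 2:
        # Assumption: two-part names are {clustering}_{quantization}.
        return IndexName(parts[0], None, parts[1])
    if len(parts) == 3:
        return IndexName(parts[0], parts[1], parts[2])
    raise ValueError(f"unexpected index type name: {name!r}")


if __name__ == "__main__":
    for candidate in ("IVF_PQ", "IVF_HNSW_SQ", "IVF_FLAT"):
        print(candidate, "->", split_index_name(candidate))
```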
Start with:
- index_type="IVF_PQ"
- target_partition_size: start with 8192 and adjust based on the dataset size and latency/recall needs
- num_sub_vectors: choose a value that divides the vector dimension

Practical warning (performance):
- (dimension / num_sub_vectors) % 8 == 0 is a common sweet spot for faster index creation.

Tune recall vs latency with:
- nprobes: how many IVF partitions to search
- refine_factor: how many candidates to re-rank to improve accuracy

When a user reports "too slow" or "bad recall", ask for:
- nprobes, refine_factor, and index type
- prefilter

Choose scalar index type based on the filter expression:
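- BTREE
- BITMAP
- LABEL_LIST
- contains(...) filters on strings: start with NGRAM
- Text search: INVERTED
- ZONEMAP (when appropriate)
- BLOOMFILTER (inexact)
- RTREE

Lance scalar indices are created on physical columns. If you want to index a JSON sub-field:
- materialize the sub-field as a physical column (for example with add_columns)
- create the scalar index on that column

Example (Python, using SQL expressions):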
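```python
import lance

ds = lance.dataset(uri)  # uri: path or URI of an existing dataset
# Materialize the JSON sub-field as a physical column, then index it.
ds.add_columns({"country": "json_extract(payload, '$.country')"})
ds.create_scalar_index("country", "BTREE", replace=True)
```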
If you cannot confidently map the filter to an index type, recommend BTREE as a safe baseline and confirm via a small benchmark on representative queries.
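
The "small benchmark" can be as simple as timing a handful of representative filters before and after creating the index. The sketch below is a generic timing harness in plain Python. `median_latency_ms` and its `run_query` callback are hypothetical names, and nothing here is specific to Lance.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def median_latency_ms(
    run_query: Callable[[str], object],
    filters: Iterable[str],
    repeats: int = 5,
) -> Dict[str, float]:
    """Time each filter `repeats` times and return the median in milliseconds."""
    results: Dict[str, float] = {}
    for expr in filters:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(expr)  # the caller decides how the query is executed
            samples.append((time.perf_counter() - start) * 1000.0)
        results[expr] = statistics.median(samples)
    return results


if __name__ == "__main__":
    # Stand-in query function so the sketch runs without a dataset.
    fake_query = lambda expr: sum(range(10_000))
    for expr, ms in median_latency_ms(fake_query, ["country = 'US'"]).items():
        print(f"{expr}: {ms:.3f} ms")
```

With a real dataset, `run_query` might be something like `lambda f: ds.to_table(filter=f)`. Run it once without the index and once with each candidate index type, using filters that match the user's actual queries.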