N-gram Index

N-gram indices break text into overlapping sequences (trigrams) for efficient substring matching. They provide fast text search by indexing all 3-character sequences in the text after applying ASCII folding and lowercasing.
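The normalization and trigram steps described above can be sketched roughly as follows. This is an illustration only, not the index's actual tokenizer: the function names `normalize` and `trigrams` are hypothetical, and ASCII folding is approximated by stripping Unicode combining marks, which does not cover every character a full folding filter would handle.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate ASCII folding by dropping combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.lower()


def trigrams(text: str) -> list[str]:
    """Return every overlapping 3-character window of the normalized text."""
    s = normalize(text)
    # Texts shorter than 3 characters yield no trigrams.
    return [s[i:i + 3] for i in range(len(s) - 2)]


print(trigrams("Café"))  # ['caf', 'afe']
```

Because the windows overlap by two characters, a text of n normalized characters produces n - 2 trigrams.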

Index Details

protobuf
%%% proto.message.NGramIndexDetails %%%

Storage Layout

The N-gram index stores tokenized text as trigrams with their posting lists:

  1. ngram_postings.lance - Trigram tokens and their posting lists

File Schema

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| tokens | UInt32 | true | Hashed trigram token |
| posting_list | Binary | false | Compressed bitmap of row IDs containing the token |
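As a rough mental model of this schema, each row pairs a 32-bit token hash with the set of row IDs whose text contains that trigram. The sketch below makes several assumptions that the text above does not specify: CRC-32 stands in for the hash function, a plain Python `set` stands in for the compressed bitmap, and `token_hash` and `build_postings` are hypothetical names.

```python
import zlib
from collections import defaultdict


def token_hash(trigram: str) -> int:
    # Stand-in hash (assumption): CRC-32 fits in an unsigned 32-bit integer.
    return zlib.crc32(trigram.encode("utf-8"))


def build_postings(rows: dict[int, str]) -> dict[int, set[int]]:
    """Map each hashed trigram to the set of row IDs that contain it."""
    postings: dict[int, set[int]] = defaultdict(set)
    for row_id, text in rows.items():
        s = text.lower()
        for i in range(len(s) - 2):
            postings[token_hash(s[i:i + 3])].add(row_id)
    return dict(postings)


rows = {0: "lance", 1: "dance", 2: "lantern"}
postings = build_postings(rows)
print(sorted(postings[token_hash("anc")]))  # [0, 1]
print(sorted(postings[token_hash("lan")]))  # [0, 2]
```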

Accelerated Queries

The N-gram index provides inexact results for the following query types:

| Query Type | Description | Operation | Result Type |
|------------|-------------|-----------|-------------|
| contains | Substring search in text | Finds all trigrams in query, intersects posting lists | AtMost |
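The `contains` operation can be sketched as below. Intersecting posting lists yields every row that has all of the query's trigrams, but those trigrams need not be adjacent in the row, so the result is a superset of the true matches. That is presumably what the `AtMost` result type signals, and why candidates are rechecked against the actual text. The names `build_index` and `contains_candidates` are hypothetical, hashing is omitted for readability, and returning `None` for queries shorter than three characters is an assumption of this sketch.

```python
from typing import Optional


def trigrams(text: str) -> set[str]:
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def build_index(rows: dict[int, str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = {}
    for row_id, text in rows.items():
        for gram in trigrams(text):
            index.setdefault(gram, set()).add(row_id)
    return index


def contains_candidates(index: dict[str, set[int]], query: str) -> Optional[set[int]]:
    """Return candidate row IDs, or None if the query has no trigrams."""
    grams = trigrams(query)
    if not grams:
        return None  # assumption: the index cannot narrow such queries
    # A trigram missing from the index has an empty posting list.
    return set.intersection(*(index.get(g, set()) for g in grams))


rows = {0: "xxabcdexx", 1: "abcd bcde", 2: "abcxx"}
index = build_index(rows)

candidates = contains_candidates(index, "abcde")
matches = {r for r in candidates if "abcde" in rows[r].lower()}

print(sorted(candidates))  # [0, 1]
print(sorted(matches))     # [0]
```

Row 1 contains the trigrams `abc`, `bcd`, and `cde` but never the full substring `abcde`, so it survives the intersection and is only removed by the recheck.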