docs/fts.md
This document proposes using the https://github.com/quickwit-oss/tantivy library to provide full-text search capabilities in tursodb.
Introduction and terminology:
Term: a normalized token extracted from text during indexing.
Posting list (also referred to as an inverted list): the list of all documents that contain a given term, along with the metadata needed for scoring results.
Document: Tantivy’s unit of indexing and retrieval, analogous to a single row in a db table.
Field: Analogous to a column in a table
Position: The sequential offset/token index of a term within a field.
Segment: a self-contained and immutable chunk of the index. Each commit writes one or more new segments to disk; searches query all active segments in parallel.
Inside each segment, each file extension corresponds to a specific data structure (term dictionary, postings, positions, stored fields, etc.).
More details on everything here: https://fulmicoton.gitbooks.io/tantivy-doc/content/step-by-step.html
At a high level, a Tantivy index is a self-contained directory containing immutable “segments” of inverted index data files.
“Your index is in fact divided into smaller independent indexes called segments. The reason for this division is explained in Incremental Indexing. The UUID part stands as the segment name, while the extension expresses which data structure is stored in the file. The last file, meta.json, contains the index metadata (the list of segments, the schema, etc.) in JSON format.”
source: https://fulmicoton.gitbooks.io/tantivy-doc/content/index-files.html
- IndexWriter owns the memory arena and background worker threads.
- add_document() tokenizes and indexes the text fields into in-memory postings.
- When commit() is called, Tantivy serializes a new segment (posting lists, term dictionary, stored field data) and updates meta.json (all segment files are write-once, read-many, with the exception of the per-index meta.json).
- commit() is atomic: either the new segment is visible or it is not at all.
- delete_term() records tombstones that are applied at merge time (all updates are just “delete + insert”).
- Tantivy’s I/O is implemented through the Directory trait (https://github.com/quickwit-oss/tantivy/blob/dabcaa58093a3f7f10e98a5a3b06cfe2370482f9/src/directory/directory.rs#L107), which is quite similar to our own IO / File traits.
Tantivy includes MmapDirectory (mmap’d file-based segments, the default) and RamDirectory (in-memory, mostly for testing). The default will not work for our use case, since we cannot create a whole on-disk directory with a pile of files for every index.
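To make the write path concrete, here is a minimal sketch against Tantivy's public API (RamDirectory-backed for illustration only; our implementation would plug in a custom Directory):

```rust
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn write_path_sketch() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    // In-memory directory, for illustration only.
    let index = Index::create_in_ram(schema);

    // The writer owns the memory arena and background worker threads.
    let mut writer = index.writer(64 * 1024 * 1024)?; // memory budget in bytes
    writer.add_document(doc!(title => "Hello", body => "hello world"))?;
    // Serializes a new segment and atomically updates meta.json.
    writer.commit()?;
    Ok(())
}
```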
Readers pick up newly committed segments via reader.reload() or an automatic reload. The overall usage lifecycle is:
- a Schema describing the indexed fields,
- an Index using a Directory,
- an IndexWriter, on which we call add_document() or delete_term(),
- commit() to persist and publish new data,
- an IndexReader/Searcher + QueryParser + TopDocs to execute searches.

From the user perspective, the syntax for using full-text search features in tursodb will look like the following:
CREATE INDEX idx_posts
ON posts USING fts (title, body);
You can configure the tokenizer used for text processing via the WITH clause:
-- Use raw tokenizer for exact-match fields (IDs, tags)
CREATE INDEX idx_tags ON articles USING fts (tag) WITH (tokenizer = 'raw');
-- Use ngram tokenizer for autocomplete/substring matching
CREATE INDEX idx_products ON products USING fts (name) WITH (tokenizer = 'ngram');
| Tokenizer | Description | Use Case |
|---|---|---|
| default | Lowercase, punctuation split, 40-character token limit | General English text |
| raw | No tokenization; exact match only | IDs, UUIDs, tags, categories |
| simple | Basic whitespace/punctuation split | Simple text without lowercasing |
| whitespace | Split on whitespace only | Space-separated tokens |
| ngram | 2-3 character n-grams | Autocomplete, substring matching |
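Internally, each tokenizer name would map onto a Tantivy tokenizer registered on the index. A minimal sketch for the ngram option (assuming a recent tantivy where NgramTokenizer::new returns a Result; the exact wiring in tursodb is an assumption):

```rust
use tantivy::tokenizer::NgramTokenizer;

// Register 2-3 character n-grams under the name referenced by
// WITH (tokenizer = 'ngram'); `index` is an open tantivy::Index.
fn register_ngram(index: &tantivy::Index) -> tantivy::Result<()> {
    let ngram = NgramTokenizer::new(2, 3, false)?; // min=2, max=3, not prefix-only
    index.tokenizers().register("ngram", ngram);
    Ok(())
}
```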
You can configure relative importance of indexed columns using the weights parameter:
-- Title matches are 2x more important than body matches
CREATE INDEX idx_articles ON articles USING fts (title, body)
WITH (weights = 'title=2.0,body=1.0');
-- Combined with tokenizer
CREATE INDEX idx_docs ON docs USING fts (name, description)
WITH (tokenizer = 'simple', weights = 'name=3.0,description=1.0');
Weight behavior:
- Weights default to 1.0 for all fields.
- Each field's contribution to the fts_score() result is scaled by its weight.

-- Default tokenizer: "Hello World" → ["hello", "world"]
-- Searches for "hello" or "HELLO" will match
-- Raw tokenizer: "user-123" → ["user-123"]
-- Only exact match "user-123" will work, "user" won't match
-- Ngram tokenizer: "iPhone" → ["iP", "iPh", "Ph", "Pho", "ho", "hon", "on", "one", "ne"]
-- Search for "Pho" will match documents containing "iPhone"
Three FTS functions are provided:
- fts_score(col1, col2, ..., 'query') - Returns the BM25 relevance score for each matching row
- fts_match(col1, col2, ..., 'query') - Returns a boolean indicating if the row matches (used in the WHERE clause)
- fts_highlight(col1, col2, ..., before_tag, after_tag, 'query') - Returns text with matching terms wrapped in tags

WHERE col MATCH 'query' is available for use instead of fts_match(col, 'query').
-- Get scores for matching documents, ordered by relevance
SELECT fts_score(title, body, 'database') as score, id, title
FROM articles
ORDER BY score DESC
LIMIT 10;
-- Simple match filter
SELECT id, title FROM articles WHERE fts_match(body, 'science') LIMIT 10;
The fts_highlight function wraps matching query terms with custom tags for display:
-- Basic highlighting (single column)
SELECT fts_highlight('Learn about database optimization', '<b>', '</b>', 'database');
-- Returns: "Learn about <b>database</b> optimization"
-- Multiple columns - text is concatenated with spaces
SELECT fts_highlight(title, body, '<mark>', '</mark>', 'database') as highlighted
FROM articles
WHERE fts_match(title, body, 'database');
-- If title='Database Design' and body='Learn about optimization',
-- Returns: "<mark>Database</mark> Design Learn about optimization"
-- Use with FTS queries to highlight matched content
SELECT
id,
title,
fts_highlight(body, '<mark>', '</mark>', 'database') as highlighted_body
FROM articles
WHERE fts_match(title, body, 'database')
ORDER BY fts_score(title, body, 'database') DESC;
-- Multiple terms are highlighted
SELECT fts_highlight('The quick brown fox', '<em>', '</em>', 'quick fox');
-- Returns: "The <em>quick</em> brown <em>fox</em>"
The FTS index recognizes and optimizes the following query patterns:
| Pattern | Description |
|---|---|
| SELECT fts_score(...) as score FROM t ORDER BY score DESC LIMIT ? | Score with ORDER BY and LIMIT |
| SELECT fts_score(...) as score FROM t WHERE fts_match(...) ORDER BY score DESC LIMIT ? | Combined score + match with ORDER BY and LIMIT |
| SELECT fts_score(...) as score FROM t WHERE fts_match(...) ORDER BY score DESC | Combined, without LIMIT |
| SELECT fts_score(...) as score FROM t WHERE fts_match(...) LIMIT ? | Combined, with LIMIT only |
| SELECT fts_score(...) as score FROM t WHERE fts_match(...) | Combined, without ORDER BY or LIMIT |
| SELECT * FROM t WHERE fts_match(...) LIMIT ? | Match filter with LIMIT |
| SELECT * FROM t WHERE fts_match(...) | Match filter only |
Queries that don't exactly match the predefined patterns still work via function recognition. When fts_match() or fts_score() calls are detected in a query, the FTS index is used even with:
- extra selected columns
- additional WHERE conditions (e.g. AND category = 'tech')
- ORDER BY on non-score columns

-- Complex query with extra columns and WHERE conditions
SELECT id, author, title, category, views, fts_score(title, body, 'Rust') as score
FROM articles
WHERE fts_match(title, body, 'Rust')
AND category = 'tech'
AND views > 100
ORDER BY score DESC;
-- ORDER BY non-score column
SELECT id, title FROM docs WHERE fts_match(title, body, 'Rust') ORDER BY created_at DESC;
The query string passed to fts_match/fts_score supports Tantivy's QueryParser syntax:
| Syntax | Example | Description |
|---|---|---|
| Single term | database | Match documents containing "database" |
| Multiple terms (OR) | database sql | Match documents with "database" OR "sql" |
| AND operator | database AND sql | Match documents with both terms |
| NOT operator | database NOT nosql | Match "database" but exclude "nosql" |
| Phrase search | "full text search" | Match exact phrase |
| Prefix search | data* | Match terms starting with "data" |
| Column filter | title:database | Match "database" only in title field |
| Boosting | title:database^2 body:database | Boost title matches |
This syntax can be improved on in the future; maybe eventually we can support some of the fancier Elasticsearch/ParadeDB syntax.
DML statements (INSERT, UPDATE, DELETE) work automatically with FTS indexes:
- Writes are batched; documents are committed to the Tantivy index in groups (see BATCH_COMMIT_SIZE below).

-- These trigger automatic FTS index updates
INSERT INTO articles VALUES (1, 'Title', 'Body text');
UPDATE articles SET body = 'New body' WHERE id = 1;
DELETE FROM articles WHERE id = 1;
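A hedged sketch of the batching implied by BATCH_COMMIT_SIZE (the FtsWriter wrapper and its fields are assumptions, not the actual implementation; assumes tantivy 0.21+ naming):

```rust
// Hypothetical wrapper: buffers documents and commits once per batch.
struct FtsWriter {
    writer: tantivy::IndexWriter,
    pending: usize,
}

impl FtsWriter {
    const BATCH_COMMIT_SIZE: usize = 1000;

    fn add(&mut self, doc: tantivy::TantivyDocument) -> tantivy::Result<()> {
        self.writer.add_document(doc)?;
        self.pending += 1;
        if self.pending >= Self::BATCH_COMMIT_SIZE {
            self.writer.commit()?; // publish one new segment for the batch
            self.pending = 0;
        }
        Ok(())
    }
}
```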
For a SELECT query over an FTS-indexed column, we look up which FTS index covers t.name in the in-memory schema representation and construct a Tantivy query for that index with QueryParser::parse_query("tursodb"). Tantivy returns a list of (score, doc_address) pairs, which we translate into (rowid, score) pairs because we stored the table’s PK or rowid as a field during indexing. An index lookup/SeekRowid for each of those result rows from the internal FtsQuery function then fetches the t.* columns, and no extra sort is required.
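A minimal sketch of that read path using Tantivy's public API (field wiring and error plumbing simplified):

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::Field;
use tantivy::{DocAddress, Index};

fn search_sketch(index: &Index, name: Field)
    -> Result<Vec<(f32, DocAddress)>, Box<dyn std::error::Error>> {
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let parser = QueryParser::for_index(index, vec![name]);
    let query = parser.parse_query("tursodb")?;
    // (score, DocAddress) pairs, best first; each DocAddress is then
    // mapped back to a rowid via the stored PK/rowid field.
    Ok(searcher.search(&query, &TopDocs::with_limit(10))?)
}
```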
If there are additional filters on t (WHERE t.created_at > ...), we should probably prefer an ‘FTS-first’ strategy: run the Tantivy query, materialize the results into an ephemeral table of some sort, and then discard rows that fail the post-filter. TODO: expand on this a bit more.
CREATE INDEX statements are already stored in the schema table. When we parse the sqlite_schema table, which contains the DDL statement that created each index, we build the in-memory schema representation that lets us route these queries to Tantivy.
We implement Tantivy’s Directory over our pager/B-tree, as a regular table. We do not reinterpret Tantivy’s files; we store them exactly as Tantivy names and writes them. We should probably store the files as chunks of 256-512 KB blobs.
One table and one index per FTS index:
Data table:
CREATE TABLE fts_dir_{idx_id} (
path TEXT NOT NULL,
chunk_no INTEGER NOT NULL,
bytes BLOB NOT NULL
);
Index:
CREATE INDEX IF NOT EXISTS idx_name ON table_name USING backing_btree (path, chunk_no, bytes)
backing_btree creates a B-tree that stores all columns without rowid indirection.
This allows direct cursor access with the exact key structure: an index cursor can SeekGE (path, chunk_no), where chunk_no is computed from the offset requested by read_bytes on the file handle, as sketched below.
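Mapping a requested byte range onto chunk keys is simple arithmetic. A sketch, assuming a fixed chunk size (the 1 MB DEFAULT_CHUNK_SIZE listed below):

```rust
const CHUNK_SIZE: usize = 1024 * 1024; // DEFAULT_CHUNK_SIZE

/// Chunk numbers (inclusive start, exclusive end) covering `range`,
/// i.e. the `chunk_no` keys to SeekGE in the backing B-tree.
fn chunks_for_range(range: std::ops::Range<usize>) -> std::ops::Range<usize> {
    let first = range.start / CHUNK_SIZE;
    let last = range.end.div_ceil(CHUNK_SIZE);
    first..last
}
```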
The architecture uses a hybrid approach that balances memory usage and performance:
┌─────────────────────────────────────────────────────────────┐
│ HybridBTreeDirectory │
├─────────────────────────────────────────────────────────────┤
│ File Catalog (always in memory) │
│ ├── path -> FileMetadata { size, num_chunks, category } │
│ │
│ Hot Cache (metadata + term dictionaries) │
│ ├── meta.json, .managed.json (always loaded) │
│ ├── .term files (loaded on first access) │
│ ├── .fast, .fieldnorm (small, frequently accessed) │
│ │
│ Chunk LRU Cache (lazy-loaded segment data) │
│ ├── .idx, .pos, .store chunks │
│ └── Eviction when over memory budget │
└─────────────────────────────────────────────────────────────┘
Files are classified based on their role in Tantivy operations:
| Category | Files | Behavior |
|---|---|---|
| Metadata | meta.json, .managed.json, .lock | Always in hot cache |
| TermDictionary | *.term | Hot, loaded on first access |
| FastFields | *.fast, *.fieldnorm | Hot, small and frequent |
| SegmentData | *.idx, *.pos, *.store | Lazy-loaded on demand |
The FTS cursor uses an async state machine to load the index, with additional states for write operations. The state machine is driven by FtsCursor, which handles the async IO integration with our pager.
Since blocking reads are acceptable in the query path:
impl FileHandle for LazyFileHandle {
fn read_bytes(&self, range: Range<usize>) -> io::Result<OwnedBytes> {
// 1. Check hot cache
// 2. Calculate required chunks from byte range
// 3. For each chunk: check LRU cache, or blocking BTree fetch
// 4. Assemble and return result
}
}
The get_chunk_blocking method creates a temporary BTree cursor and loops on pager.io.step() until the chunk is fetched.
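A hedged sketch of that loop; names like open_btree_cursor, SeekResult, and the pager handle are illustrative assumptions, not the actual APIs:

```rust
impl LazyFileHandle {
    // Hypothetical: fetch one chunk synchronously by driving async IO to completion.
    fn get_chunk_blocking(&self, path: &str, chunk_no: u64) -> std::io::Result<Vec<u8>> {
        let mut cursor = self.open_btree_cursor()?; // temporary cursor on fts_dir_{idx_id}
        loop {
            match cursor.seek_ge(&(path, chunk_no))? {
                SeekResult::Found(row) => return Ok(row.bytes), // chunk payload
                SeekResult::Io => self.pager.io.step()?,        // run one IO step, retry
            }
        }
    }
}
```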
pub const DEFAULT_CHUNK_CACHE_BYTES: usize = 128 * 1024 * 1024; // 128MB
pub const DEFAULT_HOT_CACHE_BYTES: usize = 64 * 1024 * 1024; // 64MB
The ChunkLruCache evicts least-recently-used chunks when over capacity. Each FTS index has its own cache for isolation.
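A minimal sketch of that eviction behavior (using the lru crate for brevity; the real structure is an assumption):

```rust
use lru::LruCache;

struct ChunkLruCache {
    map: LruCache<(String, u64), Vec<u8>>, // (path, chunk_no) -> chunk bytes
    used_bytes: usize,
    budget_bytes: usize, // e.g. DEFAULT_CHUNK_CACHE_BYTES
}

impl ChunkLruCache {
    fn put(&mut self, key: (String, u64), bytes: Vec<u8>) {
        self.used_bytes += bytes.len();
        if let Some(old) = self.map.put(key, bytes) {
            self.used_bytes -= old.len(); // replaced an existing chunk
        }
        // Evict least-recently-used chunks until back under budget.
        while self.used_bytes > self.budget_bytes {
            match self.map.pop_lru() {
                Some((_, evicted)) => self.used_bytes -= evicted.len(),
                None => break,
            }
        }
    }
}
```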
| Component | Purpose |
|---|---|
| FileCategory | Classifies files for caching decisions |
| FileMetadata | Stores file size, chunk count, category |
| ChunkLruCache | Memory-bounded LRU cache for segment chunks |
| HybridBTreeDirectory | Implements Tantivy's Directory trait |
| LazyFileHandle | Fetches chunks on demand |
| Index Size | Approx. memory, old (CachedBTreeDirectory) | Approx. memory, new (HybridBTreeDirectory) |
|---|---|---|
| 100MB | ~100MB | ~25MB |
| 500MB | ~500MB | ~80MB |
| 1GB | ~1GB | ~150MB |
DEFAULT_MEMORY_BUDGET_BYTES = 64 MB // Tantivy IndexWriter memory budget
DEFAULT_CHUNK_SIZE = 1 MB // BTree blob chunk size
DEFAULT_HOT_CACHE_BYTES = 64 MB // Bounded LRU cache for metadata/term dicts
DEFAULT_CHUNK_CACHE_BYTES = 128 MB // Bounded LRU cache for segment chunks
BATCH_COMMIT_SIZE = 1000 // Documents per Tantivy commit
Both the hot cache and chunk cache use approximate LRU eviction to stay within their memory budgets, preventing unbounded memory growth with many FTS indexes.
The OPTIMIZE INDEX command merges all Tantivy segments into a single segment, which can improve query performance and reduce storage overhead. This is especially useful after bulk inserts that create many small segments.
-- Optimize a specific FTS index
OPTIMIZE INDEX fts_articles;
-- Optimize all FTS indexes in the database
OPTIMIZE INDEX;
When to use:
- After bulk inserts or large migrations that have created many small segments
- When query performance degrades because searches must fan out across many segments

What it does:
- Merges all live segments into a single segment
- Applies delete tombstones during the merge, reclaiming space from deleted/updated documents
Note: Optimization can take time on large indexes. For very large indexes with millions of documents, consider running this during off-peak hours.
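Internally, OPTIMIZE INDEX plausibly maps onto Tantivy's merge API. A sketch, assuming a recent tantivy with Index::searchable_segment_ids and IndexWriter::merge:

```rust
fn optimize_index(index: &tantivy::Index, writer: &mut tantivy::IndexWriter)
    -> tantivy::Result<()> {
    let segment_ids = index.searchable_segment_ids()?;
    if segment_ids.len() > 1 {
        // The merge applies delete tombstones and writes one new segment.
        writer.merge(&segment_ids).wait()?;
    }
    Ok(())
}
```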
The FTS implementation has some known limitations that may be addressed in future versions:
| Limitation | Description |
|---|---|
| No snippet() function | Cannot return context snippets around matches (highlighting is available) |
| No automatic segment merging | Uses NoMergePolicy - use OPTIMIZE INDEX for manual segment merging |
| No read-your-writes in transaction | FTS changes within a transaction aren't visible to queries until COMMIT |
| No table-level MATCH | FTS5-style WHERE table MATCH 'query' is not supported; use WHERE col MATCH 'query' or fts_match() |
Note on transactions: ROLLBACK works correctly - FTS data is stored in BTrees that participate in the same WAL transaction as table data. When a transaction is rolled back, both table changes and FTS index changes are discarded together.
| Feature | SQLite FTS5 | tursodb FTS |
|---|---|---|
| MATCH operator | WHERE t MATCH 'query' | WHERE col MATCH 'query' or fts_match(col, 'query') |
| Ranking | bm25(t), rank column | fts_score(col, 'query') |
| Highlighting | highlight(t, ...) | fts_highlight(cols..., before, after, query) ✓ |
| Snippets | snippet(t, ...) | Not implemented |
| Boolean operators | AND, OR, NOT | AND, OR, NOT ✓ |
| Phrase search | "exact phrase" | "exact phrase" ✓ |
| Prefix search | word* | word* ✓ |
| Column filter | col:term | col:term ✓ |
| Tokenizers | unicode61, ascii, porter | default, raw, simple, whitespace, ngram ✓ |
| External content | contentless tables | Not supported |