docs/token-search.md
Token search splits multi-word queries into individual terms, fuzzy-matches each term independently using the Bitap algorithm, and ranks results using BM25-style IDF weighting.
The default fuzzy search treats the entire query as a single pattern. That works well for short, single-word lookups like "javscript" → "JavaScript". For multi-word queries like "javascript design patterns", however, a single Bitap search cannot match each word independently, and longer queries run into the 32-character pattern limit.
Token search is designed for these cases:

- "react state management"

If your queries are typically one word or a short phrase, the default fuzzy search is simpler and faster.
const fuse = new Fuse(docs, {
  useTokenSearch: true,
  keys: ['title', 'author', 'description']
})

fuse.search('javascrpt paterns')
// → [{ item: { title: 'JavaScript Patterns', ... }, score: 0.12 }]
All standard options work as before: includeScore, includeMatches, keys with weights, threshold, limit, shouldSort, etc.
Tokenization — The query is split into individual words using a unicode-aware regex (/[\p{L}\p{M}\p{N}_]+/gu) by default. Each word becomes a separate fuzzy search. The default handles CJK, Cyrillic, Greek, Arabic, Hebrew, Devanagari, etc. out of the box. You can override it with the tokenize option.
Per-term fuzzy matching — Each term is matched against each field using the Bitap algorithm with ignoreLocation: true, so terms can appear anywhere in the field. This means multi-word queries are no longer limited by the 32-character Bitap pattern cap.
IDF weighting — An inverted index is built at construction time. The IDF weight for each term uses the BM25 formula:
idf = log(1 + (fieldCount - docFreq + 0.5) / (docFreq + 0.5))
Rare terms (appearing in fewer documents) are weighted higher than common terms. A match on a distinctive word contributes more to the score than a match on a ubiquitous one.
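The tokenization and IDF steps can be tried outside Fuse with a minimal sketch. It uses the default regex and the formula quoted above. The names `tokenize`, `buildDocFreq` and `idf` are hypothetical helpers, not Fuse's internal API, and each string is treated as one indexed field value, so `fieldCount` is assumed to be the number of strings.

```javascript
// Default tokenizer regex quoted in the text; lowercasing stands in for case-folding.
const WORD = /[\p{L}\p{M}\p{N}_]+/gu
const tokenize = (text) => text.toLowerCase().match(WORD) ?? []

// Hypothetical index: term -> number of field values that contain it.
function buildDocFreq(texts) {
  const docFreq = new Map()
  for (const text of texts) {
    // A Set counts each term at most once per field value.
    for (const term of new Set(tokenize(text))) {
      docFreq.set(term, (docFreq.get(term) ?? 0) + 1)
    }
  }
  return docFreq
}

// BM25-style IDF, written exactly as in the formula above.
const idf = (fieldCount, docFreq) =>
  Math.log(1 + (fieldCount - docFreq + 0.5) / (docFreq + 0.5))

const texts = [
  'JavaScript Patterns',
  'JavaScript: The Good Parts',
  'Learning JavaScript',
  'Design Patterns'
]
const docFreq = buildDocFreq(texts)

const idfJavascript = idf(texts.length, docFreq.get('javascript')) // in 3 of 4 texts
const idfDesign = idf(texts.length, docFreq.get('design')) // in 1 of 4 texts

console.log({ idfJavascript, idfDesign })
```

In this toy collection "design" gets a weight of about 1.20 and "javascript" about 0.36, so a match on the rarer word counts for more.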
Score combination — Per-term scores are combined additively with IDF weights, then normalized to Fuse's 0–1 range (0 = perfect match).

Word order does not matter: "patterns javascript" and "javascript patterns" produce identical results.

tokenMatch

By default, token search combines query words with OR (tokenMatch: 'any'): a record is returned if it matches any one word. That's right for ranked search, where you want the best matches first and partial matches still surface.
For filtering, where adding a word should narrow the list, set tokenMatch: 'all' (AND). A record is then returned only when every query word matches somewhere in it.
const list = ['red shirt', 'red hat', 'blue shirt']

new Fuse(list, { useTokenSearch: true })
  .search('red shirt')
  .map((r) => r.item)
// 'any' (default): ['red shirt', 'red hat', 'blue shirt'] ← matches either word

new Fuse(list, { useTokenSearch: true, tokenMatch: 'all' })
  .search('red shirt')
  .map((r) => r.item)
// 'all': ['red shirt'] ← matches both words
'all' is evaluated per record, across all fields and array elements — not per field. So each word only has to appear somewhere in the record:
const products = [
  { title: 'Red', description: 'cotton shirt' }, // "red" + "shirt" in different fields
  { title: 'Red dress', description: 'silk' }
]

new Fuse(products, {
  useTokenSearch: true,
  tokenMatch: 'all',
  keys: ['title', 'description']
})
  .search('red shirt')
  .map((r) => r.item)
// → [{ title: 'Red', description: 'cotton shirt' }] (the second has no "shirt" anywhere)
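The per-record rule can be illustrated without Fuse. In this sketch, plain substring matching stands in for the fuzzy Bitap match, and `wordMatchesRecord` and `filterRecords` are hypothetical helpers written for this example only. It shows the any/all semantics, not Fuse's implementation.

```javascript
// Hypothetical helper: does `word` occur in any of the record's keyed fields?
// Substring matching is an assumption standing in for fuzzy matching.
const wordMatchesRecord = (word, record, keys) =>
  keys.some((key) => String(record[key] ?? '').toLowerCase().includes(word))

function filterRecords(records, query, keys, tokenMatch = 'any') {
  const words = query.toLowerCase().match(/[\p{L}\p{M}\p{N}_]+/gu) ?? []
  return records.filter((record) =>
    tokenMatch === 'all'
      ? words.every((word) => wordMatchesRecord(word, record, keys)) // AND across the whole record
      : words.some((word) => wordMatchesRecord(word, record, keys)) // OR across the whole record
  )
}

const products = [
  { title: 'Red', description: 'cotton shirt' },
  { title: 'Red dress', description: 'silk' }
]
const keys = ['title', 'description']

const all = filterRecords(products, 'red shirt', keys, 'all')
const any = filterRecords(products, 'red shirt', keys, 'any')

console.log(all.map((p) => p.title), any.map((p) => p.title))
```

With 'all', only the first product survives, because "red" and "shirt" may sit in different fields of the same record. With 'any', both products are kept.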
Notes:

- 'all' changes only which records are returned, not how survivors are ranked; the IDF scoring above is unchanged.
- tokenMatch only affects token search. It has no effect unless useTokenSearch is true, and it's separate from the logical $and / $or operators, which combine keyed clauses rather than the words of one query.

The default tokenizer (/[\p{L}\p{M}\p{N}_]+/gu) treats any Unicode letter, mark, or number as part of a word. That works well for most natural-language text, but two cases need an override:
- Tokens that contain punctuation, such as node.js, c++, U.S.A, file paths, or hashtags. Pass a custom regex that includes the punctuation you want to keep.
- Scripts that don't separate words with whitespace, such as CJK. Use Intl.Segmenter to split into actual words.

const fuse = new Fuse(docs, {
  useTokenSearch: true,
  keys: ['text'],
  // keep dots, plusses, and dashes inside tokens
  tokenize: /[\w.+-]+/g
})

fuse.search('node.js') // matches docs containing the literal "node.js"
The regex must have the global (g) flag, otherwise only the first token of each text is returned. In development builds, a missing g flag emits a one-time console.warn.
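The effect of the g flag is visible in plain JavaScript. The sketch below compares `String.prototype.match` with and without it. `withGlobalFlag` is a hypothetical guard you could apply to your own regex before passing it as tokenize; Fuse is not documented to do this.

```javascript
const text = 'install node.js and c++ toolchains'
const pattern = /[\w.+-]+/ // note: no g flag

// Without the g flag, match() stops at the first token.
const firstOnly = text.match(pattern)

// Hypothetical guard: rebuild the regex with g added when it is missing.
function withGlobalFlag(regex) {
  return regex.global ? regex : new RegExp(regex.source, regex.flags + 'g')
}

// With the g flag, match() returns every token.
const tokens = text.match(withGlobalFlag(pattern)) ?? []

console.log(firstOnly[0], tokens)
```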
For CJK and other scripts that don't separate words with whitespace, the function form lets you plug in Intl.Segmenter — a built-in API in modern browsers and Node that does locale-aware word segmentation. Filter by isWordLike so punctuation and whitespace segments are dropped:
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' })

const fuse = new Fuse(docs, {
  useTokenSearch: true,
  keys: ['text'],
  tokenize: (text) =>
    Array.from(segmenter.segment(text), (s) => (s.isWordLike ? s.segment : null))
      .filter(Boolean)
})
The function receives the field/query text after case-folding and diacritic-stripping (per isCaseSensitive / ignoreDiacritics) and must return string[]. It must be deterministic — non-deterministic tokenizers silently break document-frequency accounting. Function tokenizers are not supported by FuseWorker because they cannot be transferred to a Web Worker via postMessage.
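One way to keep a function tokenizer deterministic across environments is to pick the segmentation strategy once, up front. The sketch below assumes a hypothetical `makeTokenizer` factory (not part of Fuse). It uses Intl.Segmenter when available and otherwise falls back to the default regex from this page, which does not segment CJK text into words.

```javascript
const WORD = /[\p{L}\p{M}\p{N}_]+/gu

// Hypothetical factory: returns a deterministic (text) => string[] tokenizer.
function makeTokenizer(locale) {
  // Fallback for runtimes without Intl.Segmenter: plain regex tokenization.
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return (text) => text.match(WORD) ?? []
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  return (text) => {
    const words = []
    for (const part of segmenter.segment(text)) {
      // isWordLike is false for punctuation and whitespace segments.
      if (part.isWordLike) words.push(part.segment)
    }
    return words
  }
}

const tokenize = makeTokenizer('en')
const tokens = tokenize('hello, world')

console.log(tokens)
```

The returned function has the shape described above (string in, string[] out) and could be passed as the tokenize option. The same input always yields the same tokens within one runtime.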
- Use threshold to control fuzziness. The default of 0.6 is permissive; for tighter matching, try 0.3 or 0.4.
- Use limit when you only need the top N results. This also improves performance via heap-based selection.

The inverted index is maintained as you modify the collection:
const fuse = new Fuse(docs, { useTokenSearch: true, keys: ['title'] })
// Adding a document updates the inverted index
fuse.add({ title: 'New Book' })
// Removing documents also updates the index
fuse.remove((doc) => doc.title === 'Old Book')
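To make "maintained" concrete, here is a minimal sketch of how document frequencies could be kept in sync as documents come and go. `DocFreqIndex` is a hypothetical structure for illustration, not Fuse's actual index, and it assumes `remove` is only called with text that was previously added.

```javascript
const tokenize = (text) => text.toLowerCase().match(/[\p{L}\p{M}\p{N}_]+/gu) ?? []

// Hypothetical structure: term -> number of documents containing it.
class DocFreqIndex {
  constructor() {
    this.docFreq = new Map()
    this.size = 0
  }

  add(text) {
    for (const term of new Set(tokenize(text))) {
      this.docFreq.set(term, (this.docFreq.get(term) ?? 0) + 1)
    }
    this.size += 1
  }

  // Assumes `text` was added earlier; counts that reach zero are dropped.
  remove(text) {
    for (const term of new Set(tokenize(text))) {
      const next = (this.docFreq.get(term) ?? 0) - 1
      if (next > 0) this.docFreq.set(term, next)
      else this.docFreq.delete(term)
    }
    this.size -= 1
  }
}

const index = new DocFreqIndex()
index.add('Old Book')
index.add('New Book')
index.remove('Old Book')

console.log(index.size, Object.fromEntries(index.docFreq))
```

Because IDF depends on both the per-term counts and the collection size, both have to change on every add and remove for the weights to stay correct.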
Benchmarked on randomly generated documents with 2 keys (title + body):
| Metric | 100 docs | 1,000 docs | 5,000 docs |
|---|---|---|---|
| Index creation overhead | 2.5x | 5.2x | 5.5x |
| Single-term search | 1.8x | 1.8x | 1.7x |
| Multi-term search | 1.3x | 1.3x | 1.2x |
Index creation is a one-time cost (46ms for 5,000 docs). Search overhead is 1.2–1.8x depending on query complexity, primarily because each query term runs its own Bitap search. The inverted index lookup itself is O(1) per term.
Token search is available only in the full build (fuse.js / fuse.mjs). It is left out of the basic build to keep bundle size small, and using useTokenSearch: true with the basic build throws an error. To use token search, use the full build.