By default, ParadeDB uses the same tokenizer at both index time and search time. This makes sense for most cases — you want queries tokenized the same way the data was indexed.
But sometimes you need different tokenizers. The classic example is autocomplete:
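- At index time, edge ngram expands "shoes" → s, sh, sho, shoe, shoes
- At search time, the query "sho" should stay a single token → sho

If you used edge ngram at search time too, typing "sho" would produce s, sh, sho, and the short tokens s and sh would match far too many documents.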
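To make the prefix expansion concrete, here is a minimal sketch in plain PostgreSQL that emulates edge ngram tokenization with left() and generate_series(). It only illustrates the idea and is not how ParadeDB builds its tokens internally; the bounds 1 and 10 mirror the ngram(1, 10, 'prefix_only=true') settings used on this page.

```sql
-- Emulate edge ngrams (prefixes of length 1..10) for one input string.
-- Plain PostgreSQL only; an illustration, not ParadeDB's tokenizer.
SELECT left('shoes', n) AS token
FROM generate_series(1, least(length('shoes'), 10)) AS n
ORDER BY n;
-- Returns: s, sh, sho, shoe, shoes
```

Swapping 'shoes' for 'sho' yields s, sh, sho, which is the over-broad token set you would get if the query were edge-ngrammed as well.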
Set search_tokenizer as a WITH option on the index to define a default search-time tokenizer for all text and JSON fields:
CREATE INDEX search_idx ON products
USING bm25 (
id,
(title::pdb.ngram(1, 10, 'prefix_only=true'))
) WITH (key_field='id', search_tokenizer='unicode_words');
With this configuration:
- At index time, title is tokenized with edge ngram to create prefix tokens.
- At search time, queries against title automatically use the unicode_words tokenizer.

The search_tokenizer value can include parameters, e.g. search_tokenizer='simple(lowercase=false)'.
Because search_tokenizer only affects query-time behavior, you can change it without reindexing:
ALTER INDEX search_idx SET (search_tokenizer = 'simple(lowercase=false)');
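To confirm which value is currently in effect, one option is to read the index's stored options from the PostgreSQL catalog. This is a sketch that assumes ParadeDB keeps its WITH options in pg_class.reloptions, as PostgreSQL does for index storage parameters in general; the exact formatting of the stored value is not guaranteed.

```sql
-- Inspect the WITH options recorded for the index.
-- Assumption: search_tokenizer is stored in reloptions like other index options.
SELECT c.relname, c.reloptions
FROM pg_class AS c
WHERE c.relname = 'search_idx'
  AND c.relkind = 'i';
```

If the assumption holds, the result should list key_field and search_tokenizer among the options, and the entry should change after the ALTER INDEX statement above.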
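The following end-to-end example creates a small table, indexes title with edge ngram, and searches it with the unicode_words search tokenizer:
CREATE TABLE products (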
id serial8 NOT NULL PRIMARY KEY,
title text
);
INSERT INTO products (title) VALUES
('shoes'), ('shirt'), ('shorts'), ('shoelaces'), ('socks');
CREATE INDEX idx_products ON products USING bm25
(id, (title::pdb.ngram(1, 10, 'prefix_only=true')))
WITH (key_field = 'id', search_tokenizer = 'unicode_words');
-- "sho" stays as one token → matches shoes, shorts, shoelaces
SELECT id, title FROM products WHERE title ||| 'sho' ORDER BY id;
-- "s" stays as one token → matches all five titles
SELECT id, title FROM products WHERE title ||| 's' ORDER BY id;
Without search_tokenizer, the query 'sho' would be edge-ngrammed into s, sh, sho and match
every title starting with s — not just those starting with sho.
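The difference can be reproduced without ParadeDB by emulating both sides in plain PostgreSQL. The sketch below is only a model of the behavior: it treats each title's prefixes as its index tokens and then matches them against the query as a single token and as an edge-ngrammed query. The CTE names are hypothetical, and real BM25 matching and scoring are not modeled.

```sql
WITH titles(title) AS (
    VALUES ('shoes'), ('shirt'), ('shorts'), ('shoelaces'), ('socks')
),
-- Emulated index-time tokens: every prefix of length 1..10 of each title.
index_tokens AS (
    SELECT t.title, left(t.title, n) AS token
    FROM titles AS t
    CROSS JOIN LATERAL generate_series(1, least(length(t.title), 10)) AS n
),
-- Query kept as one token, as with search_tokenizer = 'unicode_words'.
query_as_is(token) AS (
    VALUES ('sho')
),
-- Query expanded into prefixes, as if edge ngram ran at search time too.
query_edge(token) AS (
    SELECT left('sho', n) FROM generate_series(1, length('sho')) AS n
)
SELECT 'single token' AS query_mode,
       array_agg(DISTINCT i.title ORDER BY i.title) AS matched
FROM index_tokens AS i
JOIN query_as_is AS q USING (token)
UNION ALL
SELECT 'edge ngram',
       array_agg(DISTINCT i.title ORDER BY i.title)
FROM index_tokens AS i
JOIN query_edge AS q USING (token);
```

In this model the single-token query matches shoelaces, shoes, and shorts, while the edge-ngrammed query matches all five titles, because the token s is a prefix of every one of them.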
You can still override the search tokenizer for a specific query by casting the query string:
-- Force edge ngram tokenization at query time
SELECT id, title FROM products WHERE title ||| 'sho'::pdb.ngram(1, 10, 'prefix_only=true') ORDER BY id;
When resolving which tokenizer to use at search time, ParadeDB checks in this order:
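1. A query-level cast, e.g. 'sho'::pdb.ngram(...) (highest priority)
2. The index-level search_tokenizer, e.g. WITH (search_tokenizer='unicode_words')
3. The field's index-time tokenizer (the default when neither of the above is set)

Any available tokenizer can be used as a search_tokenizer: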
unicode_words, simple, whitespace, ngram, literal, literal_normalized, chinese_compatible,
lindera, icu, jieba, source_code.