Search Tokenizer - ParadeDB

By default, ParadeDB uses the same tokenizer at both index time and search time. This is the right choice in most cases, because a query only matches when it is tokenized the same way the data was indexed.

But sometimes you need different tokenizers. The classic example is autocomplete:

Index time (edge ngram): "shoes" → s, sh, sho, shoe, shoes
Search time (unicode_words): "sho" → sho

If you used edge ngram at search time too, typing "sho" would produce s, sh, and sho, so the query would match every document containing a word that starts with s, not only those that start with sho.
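The difference between the two tokenizations can be reproduced outside the database. The following Python sketch is only a simulation: edge_ngrams and whole_word are hypothetical helpers written for this illustration, not ParadeDB functions, and whole_word is a simplified stand-in for a word tokenizer such as unicode_words.

```python
def edge_ngrams(text, min_gram=1, max_gram=10):
    """Return the prefixes of text with lengths from min_gram to max_gram."""
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


def whole_word(text):
    """Simplified stand-in for a word tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


# Tokens stored in the index for the document "shoes".
index_tokens = edge_ngrams("shoes")

# The query "sho" tokenized two ways.
query_as_word = whole_word("sho")      # separate search-time tokenizer
query_as_ngrams = edge_ngrams("sho")   # index tokenizer reused at search time

print(index_tokens)     # ['s', 'sh', 'sho', 'shoe', 'shoes']
print(query_as_word)    # ['sho']
print(query_as_ngrams)  # ['s', 'sh', 'sho']
```

With the word tokenizer, the query contributes only the token sho. With edge ngram, it also contributes s and sh, which are present in the index for any word beginning with those letters.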

Usage

Set search_tokenizer as a WITH option on the index to define a default search-time tokenizer for all text and JSON fields:

sql

CREATE INDEX search_idx ON products
USING bm25 (
  id,
  (title::pdb.ngram(1, 10, 'prefix_only=true'))
) WITH (key_field='id', search_tokenizer='unicode_words');

With this configuration:

Index time: title is tokenized with edge ngram to create prefix tokens
Search time: queries against title automatically use the unicode_words tokenizer

The search_tokenizer value can include parameters, e.g. search_tokenizer='simple(lowercase=false)'.

Because search_tokenizer only affects query-time behavior, you can change it without reindexing:

sql

ALTER INDEX search_idx SET (search_tokenizer = 'simple(lowercase=false)');

Example

sql

CREATE TABLE products (
    id serial8 NOT NULL PRIMARY KEY,
    title text
);
INSERT INTO products (title) VALUES
    ('shoes'), ('shirt'), ('shorts'), ('shoelaces'), ('socks');

CREATE INDEX idx_products ON products USING bm25
    (id, (title::pdb.ngram(1, 10, 'prefix_only=true')))
    WITH (key_field = 'id', search_tokenizer = 'unicode_words');

-- "sho" stays as one token → matches shoes, shorts, shoelaces
SELECT id, title FROM products WHERE title ||| 'sho' ORDER BY id;

-- "s" stays as one token → matches all five titles
SELECT id, title FROM products WHERE title ||| 's' ORDER BY id;

Without search_tokenizer, the query 'sho' would be edge-ngrammed into s, sh, sho and match every title starting with s (all five rows in this table), not just the three starting with sho.
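The two outcomes can be checked with a small in-memory model of the index. This is a sketch under stated assumptions, not how ParadeDB stores or queries data: build_index and match_any are hypothetical helpers, the index is a plain dictionary from token to titles, and a query is assumed to match a title when any of its tokens is present.

```python
from collections import defaultdict

TITLES = ["shoes", "shirt", "shorts", "shoelaces", "socks"]


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Return the prefixes of text with lengths from min_gram to max_gram."""
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


def build_index(titles):
    """Map every edge-ngram token to the set of titles that produce it."""
    index = defaultdict(set)
    for title in titles:
        for token in edge_ngrams(title):
            index[token].add(title)
    return index


def match_any(index, query_tokens):
    """Return the titles that contain at least one of the query tokens."""
    hits = set()
    for token in query_tokens:
        hits |= index.get(token, set())
    return sorted(hits)


index = build_index(TITLES)

# Query kept as the single token "sho" (separate search-time tokenizer).
with_search_tokenizer = match_any(index, ["sho"])

# Query edge-ngrammed into s, sh, sho (index tokenizer reused).
without_search_tokenizer = match_any(index, edge_ngrams("sho"))

print(with_search_tokenizer)     # ['shoelaces', 'shoes', 'shorts']
print(without_search_tokenizer)  # ['shirt', 'shoelaces', 'shoes', 'shorts', 'socks']
```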

Overriding at Query Time

You can still override the search tokenizer for a specific query by casting the query string:

sql

-- Force edge ngram tokenization at query time
SELECT id, title FROM products WHERE title ||| 'sho'::pdb.ngram(1, 10, 'prefix_only=true') ORDER BY id;

Priority

When resolving which tokenizer to use at search time, ParadeDB checks in this order:

1. Query-level cast, e.g. 'sho'::pdb.ngram(...) (highest priority)
2. Index-level WITH option, e.g. WITH (search_tokenizer='unicode_words')
3. Index-time tokenizer: the tokenizer used to build the index (fallback)
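The resolution order above amounts to a first-match fallback chain. The Python sketch below models only that ordering. The function resolve_search_tokenizer and its parameter names are hypothetical, chosen for this illustration, and do not correspond to any ParadeDB API.

```python
def resolve_search_tokenizer(index_time_tokenizer, index_option=None, query_cast=None):
    """Pick the search-time tokenizer using the documented priority order."""
    if query_cast is not None:
        # 1. A cast on the query string wins over everything else.
        return query_cast
    if index_option is not None:
        # 2. Otherwise use the search_tokenizer WITH option, if one is set.
        return index_option
    # 3. Fall back to the tokenizer the index was built with.
    return index_time_tokenizer


# Index built with edge ngram, no search_tokenizer option, no cast.
print(resolve_search_tokenizer("ngram"))
# → ngram

# Same index with search_tokenizer='unicode_words'.
print(resolve_search_tokenizer("ngram", index_option="unicode_words"))
# → unicode_words

# A query-level cast overrides the index option.
print(resolve_search_tokenizer("ngram", index_option="unicode_words", query_cast="ngram"))
# → ngram
```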

Supported Tokenizers

Any available tokenizer can be used as a search_tokenizer: unicode_words, simple, whitespace, ngram, literal, literal_normalized, chinese_compatible, lindera, icu, jieba, source_code.