
Unicode


The unicode tokenizer splits text on word boundaries defined in Unicode Standard Annex #29. All tokens are lowercased by default.
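For intuition only, the behavior can be approximated in Python. This is a rough sketch, not ParadeDB's implementation: `\w+` is only an approximation of the full Annex #29 word-boundary rules.

```python
import re

def unicode_words_approx(text: str) -> list[str]:
    # Approximate UAX #29 word segmentation with \w+ runs,
    # then lowercase each token (the tokenizer's default).
    return [w.lower() for w in re.findall(r"\w+", text)]

print(unicode_words_approx("Tokenize me!"))  # ['tokenize', 'me']
```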

This tokenizer is the default text tokenizer. If no tokenizer is specified for a text field, the unicode tokenizer will be used (unless the text field is the key field, in which case the text is not tokenized).

```sql
-- The following two configurations are equivalent
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description)
WITH (key_field='id');

CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.unicode_words))
WITH (key_field='id');
```

To get a feel for this tokenizer, run the following command and replace the text with your own:

```sql
SELECT 'Tokenize me!'::pdb.unicode_words::text[];
```

```
     text
---------------
 {tokenize,me}
(1 row)
```

Remove Emojis

By default, emojis in the source text are preserved. To remove emojis, set remove_emojis to true.

```sql
SELECT 'Tokenize me! 😊'::pdb.unicode_words('remove_emojis=true')::text[];
```

```
     text
---------------
 {tokenize,me}
(1 row)
```
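As a rough illustration of what emoji removal amounts to (this is not how ParadeDB implements the option), most emoji code points fall in the Unicode "Symbol, other" (`So`) category and can be filtered out by category:

```python
import unicodedata

def strip_emojis(text: str) -> str:
    # Hypothetical helper: drop characters whose Unicode category is
    # "So" ("Symbol, other"), which covers most single-code-point emoji.
    # Multi-code-point sequences (skin tones, ZWJ joins) need more care.
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")

print(strip_emojis("Tokenize me! 😊"))  # "Tokenize me! "
```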