
Jieba


The Jieba tokenizer segments Chinese text using both a dictionary and statistical models. It is generally better at resolving ambiguous Chinese word boundaries than the Chinese Lindera and Chinese-compatible tokenizers, but the tradeoff is that it is slower.

sql
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.jieba))
WITH (key_field='id');
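
Once the index exists, full-text queries against the Jieba-tokenized column can use ParadeDB's `@@@` search operator. The sketch below assumes the `mock_items` table from above and uses a hypothetical Chinese query term; the query string is segmented with the same tokenizer at search time.

sql
-- Search the Jieba-tokenized description column for a Chinese term
SELECT id, description
FROM mock_items
WHERE description @@@ '你好'
LIMIT 5;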

To get a feel for this tokenizer, run the following command and replace the text with your own:

sql
SELECT 'Hello world! 你好!'::pdb.jieba::text[];

              text
--------------------------------
 {hello," ",world,!," ",你好,!}
(1 row)