third_party/sentencepiece/src/doc/piece_constraints.md
SentencePiece is designed to operate directly on raw, un-pretokenized text. Instead of segmenting the input before tokenization, it applies a set of constraints during training to determine which subwords (pieces) are valid candidates for the vocabulary.
Many subword tokenizers require a pre-tokenization step (e.g., space-splitting or regex rules), which introduces several issues.
To avoid these issues, SentencePiece operates directly on raw text and applies piece constraints during training to construct a static vocabulary. Because these constraints do not run during inference, the resulting model is completely self-contained, guaranteeing consistent, safe, and fast tokenization across all platforms. For cases where space-separated tokenization is desired, simple constraints like split_by_whitespace=true can safely approximate this behavior.
The following table describes the flags that control whether a candidate subword is considered a valid sentencepiece.
| Flag Name | Default Value | Description | Examples |
|---|---|---|---|
| max_sentencepiece_length | 16 | Maximum length (in Unicode characters) of a vocabulary piece. | If 16, understanding (13 chars) is valid, counterunderstanding (20 chars) is invalid. |
| split_by_unicode_script | true | Prevents a single piece from crossing Unicode script boundaries (e.g., mixing Latin and Han scripts). | If true, hello世界 is invalid (must be split into hello and 世界). If false, hello世界 can be a single piece. Note: Hiragana and Katakana are internally merged with Han (Kanji) script, allowing Japanese mixed-script words (e.g., おいしい屋) to remain in a single piece. |
| split_by_number | true | Treats numbers as a separate script. When split_by_unicode_script is true, this prevents numbers from mixing with alphabetical letters in a single piece. | If true, temp20a is invalid. If false, temp20a can be a single piece. |
| split_digits | false | Forces all digits (0-9) to be split into individual pieces of length 1. | If true, 1999 must be split into 1, 9, 9, 9. 19 is an invalid piece. |
| split_by_whitespace | true | Prevents pieces from crossing whitespace boundaries. Whitespace (represented by the meta-symbol ▁) can only appear at the boundary (prefix or suffix). | If true, foo▁bar is invalid. If false, foo▁bar (representing "foo bar") can be a single piece. |
| treat_whitespace_as_suffix | false | Controls the position of the whitespace meta-symbol. If false, whitespace must appear as a prefix. If true, whitespace must appear as a suffix. | If false (prefix): ▁hello is valid, hello▁ is invalid. If true (suffix): hello▁ is valid, ▁hello is invalid. |
| allow_whitespace_only_pieces | false | Allows pieces that consist entirely of whitespace characters. | If false, ▁▁ is invalid (though a single ▁ is allowed). If true, ▁▁ is a valid piece. |
| pretokenization_delimiter | (empty string) | Defines a pre-tokenization delimiter. When specified, pieces crossing this delimiter cannot be included in the vocabulary. The delimiter itself is removed from the text during training, but it acts as a hard boundary. (Unigram model only) | (See detailed section below) |
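
The length, digit, and whitespace rules in the table can be sketched as a small validity check. This is a simplified illustration written for this page, not SentencePiece's actual implementation: the function name `is_valid_piece` is hypothetical, digits are assumed to mean ASCII 0-9 as the table states, and the script-based flags (split_by_unicode_script, split_by_number) are left out because the Python standard library has no Unicode script lookup.

```python
WS = "\u2581"  # the whitespace meta-symbol ▁


def is_valid_piece(
    piece,
    *,
    max_sentencepiece_length=16,
    split_digits=False,
    split_by_whitespace=True,
    treat_whitespace_as_suffix=False,
    allow_whitespace_only_pieces=False,
):
    """Return True if `piece` passes the (simplified) constraints in the table."""
    # Length is counted in Unicode characters (Python str length).
    if not piece or len(piece) > max_sentencepiece_length:
        return False

    # Whitespace-only pieces: a single ▁ is always allowed, longer runs
    # only when allow_whitespace_only_pieces is set.
    if all(ch == WS for ch in piece):
        return len(piece) == 1 or allow_whitespace_only_pieces

    # split_digits: any piece containing a digit must have length 1.
    if split_digits and len(piece) > 1 and any(ch in "0123456789" for ch in piece):
        return False

    # split_by_whitespace: ▁ may only sit at the allowed edge
    # (prefix by default, suffix when treat_whitespace_as_suffix is set).
    if split_by_whitespace:
        interior = piece[:-1] if treat_whitespace_as_suffix else piece[1:]
        if WS in interior:
            return False

    return True
```

With the defaults, `is_valid_piece("▁hello")` is true and `is_valid_piece("foo▁bar")` is false, matching the examples column above.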
The pretokenization_delimiter flag (available only in unigram mode) is the most general way to introduce arbitrary segmentation boundaries or control constraints into a SentencePiece model. It allows you to enforce hard boundaries based on external pre-tokenization (e.g., word segmenters like MeCab, syntax parsers, or custom rules) without requiring those tools at inference time.
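When you specify a delimiter (e.g., pretokenization_delimiter="||||"):

- The training text marks the desired boundaries with the delimiter (e.g., Sentence||||Piece||||is||||cool).
- Pieces that cross a delimiter cannot be included in the vocabulary.
- The delimiter itself is removed from the text during training, but it acts as a hard boundary.

A common use case is training a model that respects morphological (word) boundaries for languages like Japanese or Chinese, without needing a morphological analyzer during inference.

1. Segment the training corpus with a morphological analyzer and join the resulting tokens with ||||: 形態素||||の||||一般||||的||||な||||性質
2. Train with --pretokenization_delimiter="||||".
3. At inference time, pass raw, unsegmented text (e.g., 形態素の一般的な性質) to the model. The tokenizer will naturally split at the learned boundaries, mimicking the morphological analyzer's behavior.

For a detailed walkthrough of this approach, see the article: Making SentencePiece Segmentation MeCab-like (in Japanese).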
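
As a rough sketch of this workflow in Python: the helper below joins already-segmented sentences with the delimiter and writes them as a training corpus, and a second helper passes the delimiter to the trainer. The function names, the file name, and the vocabulary size are illustrative assumptions. The segmentation itself is assumed to come from an external tool such as MeCab, and the installed SentencePiece version is assumed to support pretokenization_delimiter.

```python
from pathlib import Path

DELIM = "||||"


def write_delimited_corpus(segmented_sentences, path, delimiter=DELIM):
    """Join each pre-segmented sentence with the delimiter, one sentence per line."""
    lines = [delimiter.join(tokens) for tokens in segmented_sentences]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def train_with_boundaries(corpus_path, model_prefix, vocab_size, delimiter=DELIM):
    """Train a unigram model that treats the delimiter as a hard boundary."""
    # Imported lazily so the corpus helper above works without sentencepiece.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input=str(corpus_path),
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="unigram",  # the flag is documented as unigram-only
        pretokenization_delimiter=delimiter,
    )


# Example usage (hypothetical file name; segmentation produced elsewhere):
# segmented = [["形態素", "の", "一般", "的", "な", "性質"]]
# write_delimited_corpus(segmented, "corpus.delimited.txt")
# train_with_boundaries("corpus.delimited.txt", "my_model", 8000)
```

The analyzer is only needed to prepare the training corpus. The resulting model file is then used on raw text as usual.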
Here is how to specify these piece constraint flags using the C++ CLI or Python API.
You can pass the flags directly to spm_train:
```bash
spm_train \
  --input=corpus.txt \
  --model_prefix=my_model \
  --vocab_size=8000 \
  --split_by_unicode_script=true \
  --split_digits=true
```
Specify the flags as keyword arguments in SentencePieceTrainer.train():
```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_model',
    vocab_size=8000,
    split_by_unicode_script=True,
    split_digits=True,
)
```
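
One way to check that a constraint took effect is to encode a sample string with the trained model and inspect the pieces. The sketch below assumes the model was written to my_model.model (the default output name for model_prefix='my_model'). The helper names `pieces_for` and `digits_are_isolated` are hypothetical, and the sample text is arbitrary.

```python
def pieces_for(text, model_file="my_model.model"):
    """Encode text into string pieces with a trained model."""
    # Imported lazily so the pure-Python check below works without sentencepiece.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_file)
    return sp.encode(text, out_type=str)


def digits_are_isolated(pieces):
    """True if every piece containing a digit 0-9 is exactly one character long."""
    return all(
        len(piece) == 1
        for piece in pieces
        if any(ch in "0123456789" for ch in piece)
    )


# Example usage (requires the trained model file):
# pieces = pieces_for("Released in 1999")
# print(pieces, digits_are_isolated(pieces))
```

For a model trained with split_digits=True, `digits_are_isolated` should hold for any encoded input, since multi-character pieces containing digits are excluded from the vocabulary.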