third_party/sentencepiece/src/doc/options.md
This document describes the training options available in SentencePiece. These options can be passed as command-line flags to spm_train or as keyword arguments to the Python API sentencepiece.SentencePieceTrainer.train().
Here is a minimal example to train a model using a raw text corpus (input.txt):
spm_train --input=input.txt --model_prefix=m --vocab_size=8000
import sentencepiece as spm
spm.SentencePieceTrainer.train(input='input.txt', model_prefix='m', vocab_size=8000)
Both methods will output m.model and m.vocab files.
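Since the same options can be given either as keyword arguments or as command-line flags, it can be handy to keep them in one dictionary and render the flags from it. The sketch below is only an illustration: `to_spm_train_flags` is a hypothetical helper (not part of SentencePiece), and it assumes booleans are written as `true`/`false` and multi-valued options as comma-separated lists.

```python
def to_spm_train_flags(options):
    """Render a dict of training options as spm_train-style flags.

    Hypothetical helper for illustration; not part of SentencePiece.
    """
    flags = []
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are spelled as lowercase true/false.
            rendered = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Multi-valued options (e.g. several input files) are comma-separated.
            rendered = ",".join(str(item) for item in value)
        else:
            rendered = str(value)
        flags.append(f"--{name}={rendered}")
    return flags


options = {
    "input": ["data/corpus1.txt", "data/corpus2.txt"],
    "model_prefix": "m",
    "vocab_size": 8000,
    "byte_fallback": True,
}

# Print the full command line built from the dictionary.
print(" ".join(["spm_train", *to_spm_train_flags(options)]))
```

The same dictionary could be passed to the Python trainer as keyword arguments, provided multi-file `input` is given in whatever form the API expects.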
These options control the input/output files, the tokenization algorithm type, and general runtime settings.
input (string, default: "")
--input=data/corpus1.txt,data/corpus2.txt
input_format (string, default: "text")
text: Raw text files (one sentence per line).
tsv: Tab-separated values. The input file must contain exactly two columns: word \t frequency (where frequency is a positive integer).
With tsv input and the unigram model, the frequency information is not used during the initial seed vocabulary extraction stage (which uses a suffix array on the raw word list, treating each unique word as appearing once). This can prevent short, high-frequency words (like am or on) from being extracted as candidate pieces, so they end up tokenized as individual characters (e.g., a, m).
To work around this, use model_type=bpe (which correctly utilizes frequencies during training), repeat high-frequency words multiple times in the TSV, or add them to user_defined_symbols.
model_prefix (string, default: "")
Training writes <model_prefix>.model (binary model file) and <model_prefix>.vocab (human-readable vocabulary list).
model_type (string, default: "unigram")
unigram: Unigram language model (recommended). It fits a probabilistic model and prunes the vocabulary.
bpe: Byte-Pair Encoding. It starts with characters and merges frequent pairs.
word: Word segmentation. Splits by space (only useful for languages that use spaces, essentially acting as a word-frequency tokenizer).
char: Character-level segmentation.
vocab_size (int32, default: 8000)
accept_language (string, default: "")
Example: ja,en
num_threads (int32, default: 16)
random_seed (uint32, default: 4294967295 (max uint32))
minloglevel (int, default: 0)
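To make the tsv input format described above concrete, here is a minimal sketch that builds a word \t frequency file from raw text using only the Python standard library. The function name `write_word_freq_tsv` and the whitespace-based word splitting are assumptions made for illustration; any counting scheme that yields positive integer frequencies would do.

```python
import collections
import io


def write_word_freq_tsv(lines, out):
    """Write one `word<TAB>frequency` row per unique word to `out`.

    Hypothetical helper: words are obtained by splitting on whitespace.
    """
    counts = collections.Counter(word for line in lines for word in line.split())
    for word, freq in counts.most_common():
        # Exactly two columns: the word and its positive integer frequency.
        out.write(f"{word}\t{freq}\n")
    return counts


corpus = ["i am on a boat", "i am here"]
buffer = io.StringIO()  # stands in for a real file opened for writing
write_word_freq_tsv(corpus, buffer)
print(buffer.getvalue(), end="")
```

A file produced this way would be passed with input_format=tsv. Given the unigram caveat above, short frequent words may still need one of the listed workarounds.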
These options control how SentencePiece processes and samples the training corpus.
character_coverage (double, default: 0.9995)
Characters outside the covered set are mapped to <unk> (or byte fallback).
Use 1.0 for languages with small alphabets (English, German, etc.). Use 0.9995 (default) for languages with large character sets (Chinese, Japanese, Korean) to prune rare noise characters/emojis.
input_sentence_size (uint64, default: 0)
If 0, the entire corpus is loaded. Setting a limit is highly recommended for very large datasets to prevent running out of memory.
shuffle_input_sentence (bool, default: true)
If true, randomly samples input_sentence_size sentences from the corpus. Only effective when input_sentence_size > 0.
hard_vocab_limit (bool, default: true)
If true, training fails with an error if the corpus does not contain enough unique subwords to reach the requested vocab_size. If false, training succeeds and automatically shrinks the vocabulary size in the output model to the maximum possible size.
train_extremely_large_corpus (bool, default: false)
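The effect of character_coverage can be illustrated with a small frequency-based sketch: keep the most frequent characters until they account for the requested share of all character occurrences, and treat the rest as uncovered. This is a simplified model written for this document, not SentencePiece's actual implementation; `covered_characters` is a hypothetical name.

```python
import collections


def covered_characters(sentences, character_coverage=0.9995):
    """Return the most frequent characters that reach the coverage target.

    Simplified illustration; the real trainer may differ in details.
    """
    counts = collections.Counter(ch for sentence in sentences for ch in sentence)
    total = sum(counts.values())
    kept = set()
    accumulated = 0
    for ch, freq in counts.most_common():
        if accumulated >= character_coverage * total:
            break  # target reached; remaining characters stay uncovered
        kept.add(ch)
        accumulated += freq
    return kept


# 9998 occurrences of "a", plus one "b" and one rare symbol.
corpus = ["a" * 9998, "b", "\u2603"]
print(sorted(covered_characters(corpus, 0.9995)))
print(sorted(covered_characters(corpus, 1.0)))
```

With the default coverage the two rare characters fall outside the covered set, while a coverage of 1.0 keeps every character seen in the corpus.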
These options tune the core vocabulary learning process (primarily for Unigram and BPE).
seed_sentencepiece_size (int32, default: 1000000)
seed_sentencepieces_file (string, default: "")
A file of seed pieces (format: piece \t frequency) used to seed the vocabulary, instead of extracting them from the corpus.
shrinking_factor (double, default: 0.75)
During pruning, the vocabulary is repeatedly reduced to its current size * shrinking_factor until it reaches vocab_size.
num_sub_iterations (int32, default: 2)
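As a rough illustration of how shrinking_factor interacts with seed_sentencepiece_size and vocab_size, the sketch below lists the vocabulary sizes a pruning loop would pass through if each step keeps a shrinking_factor share of the current size and never drops below the target. It is a simplified model under that assumption (the real trainer may use a slightly different target or rounding), and `pruning_schedule` is a hypothetical helper.

```python
def pruning_schedule(seed_size, vocab_size, shrinking_factor=0.75):
    """List vocabulary sizes from the seed size down to vocab_size.

    Simplified model: each step keeps `shrinking_factor` of the current
    size, clamped so it never goes below `vocab_size`.
    """
    if not 0.0 < shrinking_factor < 1.0:
        raise ValueError("shrinking_factor must be between 0 and 1")
    sizes = [seed_size]
    current = seed_size
    while current > vocab_size:
        current = max(vocab_size, int(current * shrinking_factor))
        sizes.append(current)
    return sizes


schedule = pruning_schedule(seed_size=1_000_000, vocab_size=8000)
print(len(schedule) - 1, "pruning steps")
print(schedule[:3], "...", schedule[-1])
```

A smaller shrinking_factor reaches the target in fewer steps but removes more candidates per step; a value closer to 1 prunes more gradually.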
max_sentence_length (int32, default: 4192)
use_all_vocab (bool, default: false)
If true, forces the model to include all words/characters found in the corpus into the vocabulary, ignoring frequency.
vocabulary_output_piece_score (bool, default: true)
If true, outputs the log-likelihood (score) of each piece in the generated <model_prefix>.vocab file (as piece \t score). If false, the vocab file contains only the pieces (one per line).
These options apply training-time constraints to control what constitutes a valid vocabulary piece.
(See Vocabulary Piece Constraints for a detailed explanation of these constraints.)
max_sentencepiece_length (int32, default: 16)
split_by_unicode_script (bool, default: true)
split_by_number (bool, default: true)
Example: abc123.
split_digits (bool, default: false)
Splits all digits into separate pieces (e.g., 123 is guaranteed to be tokenized as 1, 2, 3). Recommended for math or financial applications.
split_by_whitespace (bool, default: true)
pretokenization_delimiter (string, default: "")
treat_whitespace_as_suffix (bool, default: false)
Treats the whitespace marker ▁ as a suffix instead of a prefix (e.g., world▁ instead of ▁world).
allow_whitespace_only_pieces (bool, default: false)
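The difference between split_by_number and split_digits can be pictured as two ways of cutting a string before pieces are learned. The regular expressions below are only an illustration of where piece boundaries are forced, assuming ASCII digits 0-9; they are not how SentencePiece implements these options, and both function names are hypothetical.

```python
import re


def split_by_number_boundaries(text):
    """Cut between runs of digits and runs of non-digits (illustration only)."""
    return re.findall(r"[0-9]+|[^0-9]+", text)


def split_digits_boundaries(text):
    """Additionally isolate every single digit (illustration only)."""
    return re.findall(r"[0-9]|[^0-9]+", text)


print(split_by_number_boundaries("abc123"))
print(split_digits_boundaries("abc123"))
```

In the first case a piece may still contain a whole number such as 123; in the second, no piece can contain more than one digit.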
These options define special symbols (BOS, EOS, PAD, custom control tokens) and how they map to vocabulary IDs.
(See Special Symbols for details on control vs. user-defined symbols and security implications.)
control_symbols (string, default: "")
Example: <user>. Control symbols are never tokenized from raw text and decode to empty strings.
control_symbols_file (string, default: "")
user_defined_symbols (string, default: "")
user_defined_symbols_file (string, default: "")
required_chars (string, default: "")
Characters listed here are always included in the vocabulary (never mapped to <unk>), regardless of the --character_coverage setting.
required_chars_file (string, default: "")
byte_fallback (bool, default: false)
If true, decomposes out-of-vocabulary characters into UTF-8 byte tokens (e.g., <0xE3>), completely avoiding <unk> tokens. Highly recommended for modern LLMs.
unk_id (int32, default: 0), bos_id (int32, default: 1), eos_id (int32, default: 2), pad_id (int32, default: -1)
Set an ID to -1 to disable the symbol (except for unk_id, which cannot be disabled).
unk_piece (string, default: "<unk>"), bos_piece (string, default: "<s>"), eos_piece (string, default: "</s>"), pad_piece (string, default: "<pad>")
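The byte tokens produced by byte_fallback follow directly from a character's UTF-8 encoding. The short sketch below shows how a single out-of-vocabulary character maps to tokens in the <0xNN> form mentioned above; `byte_fallback_pieces` is a hypothetical helper written for illustration, not a SentencePiece function.

```python
def byte_fallback_pieces(char):
    """Return the <0xNN> byte tokens for the UTF-8 encoding of `char`."""
    return [f"<0x{byte:02X}>" for byte in char.encode("utf-8")]


# U+3042 (HIRAGANA LETTER A) is three bytes long in UTF-8.
print(byte_fallback_pieces("\u3042"))
# An ASCII character needs a single byte token.
print(byte_fallback_pieces("A"))
```

Because every possible byte value has its own token, any input string can be represented without resorting to <unk>.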
unk_surface (string, default: " ⁇ " (U+2047 double question mark))
The surface string for <unk>. During decoding, the <unk> token ID is decoded back to this string.
These options control how text is normalized and how spaces are processed before vocabulary learning and tokenization.
(See Text Normalization for details on Unicode normalization and custom rules.)
normalization_rule_name (string, default: "nmt_nfkc")
Name of a built-in normalization rule (one of nmt_nfkc (default), nfkc, nmt_nfkc_cf, nfkc_cf, or identity (no normalization)).
normalization_rule_tsv (string, default: "")
denormalization_rule_tsv (string, default: "")
add_dummy_prefix (bool, default: true)
Adds a dummy whitespace ▁ to the beginning of the sentence to ensure start-of-sentence words are tokenized identically to middle-of-sentence words.
remove_extra_whitespaces (bool, default: true)
escape_whitespaces (bool, default: true)
Replaces whitespace with ▁ to preserve whitespace information in the token sequence.
Note: this is a NormalizerSpec property and can only be modified via Python/C++ API configuration maps, not via spm_train CLI flags (where it defaults to true for normalizer and false for denormalizer).
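The combined effect of the whitespace options can be approximated in a few lines of standard-library Python. This sketch uses plain Unicode NFKC from `unicodedata` as a stand-in for nmt_nfkc (which applies additional rules), and the order of steps is an assumption made for illustration; `preprocess` is a hypothetical function, not SentencePiece's normalizer.

```python
import unicodedata

WHITESPACE_MARKER = "\u2581"  # the "▁" symbol used to mark spaces


def preprocess(text, add_dummy_prefix=True, remove_extra_whitespaces=True,
               escape_whitespaces=True):
    """Approximate the normalization options (illustration only)."""
    # Plain NFKC; nmt_nfkc itself includes extra rules not modeled here.
    text = unicodedata.normalize("NFKC", text)
    if remove_extra_whitespaces:
        # Trim leading/trailing whitespace and collapse internal runs.
        text = " ".join(text.split())
    if add_dummy_prefix:
        text = " " + text
    if escape_whitespaces:
        text = text.replace(" ", WHITESPACE_MARKER)
    return text


# "\uFF28" is a full-width "H", which NFKC folds to ASCII "H".
print(preprocess("  \uFF28ello   world "))
```

With add_dummy_prefix enabled, the first word carries the same leading marker as a word in the middle of a sentence, which is the behavior the option is meant to guarantee.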