docs/source/en/tokenizer_summary.md
Transformers supports three subword tokenization algorithms: byte pair encoding (BPE), Unigram, and WordPiece. They split text into units between words and characters, keeping the vocabulary compact while still capturing meaningful pieces. Common words stay intact as single tokens, while rare or unknown words decompose into subwords.
For instance, "annoyingly" might be split into ["annoying", "ly"] or ["annoy", "ing", "ly"] depending on the vocabulary. Subword splitting lets the model represent unseen words from known subwords.
> [!TIP]
> Subword tokenization is especially useful for languages like Turkish, where you can form long, complex words by stringing subwords together.
Byte pair encoding (BPE) is the most popular tokenization algorithm in Transformers, used by models like Llama, Gemma, Qwen2, and more.
Suppose the training corpus contains these words and frequencies:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

The base vocabulary is ["b", "g", "h", "n", "p", "s", "u"], built from all the characters in the corpus. Each word is split into characters:

```
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
```

BPE counts how often each adjacent symbol pair occurs. "u" and "g" appear together the most, in "hug", "pug", and "hugs", so BPE merges them into "ug" and adds it to the vocabulary:

```
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
```

The next most frequent pair is "u" and "n", which appears in "pun" and "bun", so they merge into "un":

```
("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
```

The vocabulary is now ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]. BPE continues learning merge rules until it reaches the target vocabulary size, which equals the base vocabulary size plus the number of merges. GPT uses BPE with a vocabulary size of 40,478 (478 base tokens + 40,000 merges). Any character not in the base vocabulary maps to an unknown token like "<unk>". In practice, the base vocabulary covers all characters seen during training, so unknown tokens are rare.
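The merge loop above can be sketched in a few lines of Python — a toy illustration over the same corpus, not the optimized implementation real tokenizers use:

```python
from collections import Counter

# Toy corpus: word (split into symbols) -> frequency.
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5,
          ("p", "u", "n"): 12, ("b", "u", "n"): 4,
          ("h", "u", "g", "s"): 5}

def most_frequent_pair(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if word[i:i + 2] == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

pair = most_frequent_pair(corpus)  # ("u", "g"), seen 20 times
corpus = merge(corpus, pair)       # "hug" is now ("h", "ug"), etc.
```

Running the loop once more would pick ("u", "n") and learn the second merge rule from the example.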
Including all Unicode characters would make the base vocabulary enormous. Byte-level BPE uses 256 byte values as the base vocabulary instead, ensuring every word can be tokenized without the "<unk>" token. GPT-2 uses byte-level BPE with a vocabulary size of 50,257 (256 byte tokens + 50,000 merges + special end-of-text token).
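A quick illustration of why 256 byte values are enough as a base vocabulary — any string, whatever characters it contains, becomes a sequence of bytes in 0..255:

```python
text = "héllo 🤗"
# UTF-8 encoding turns every character into one or more bytes,
# so a 256-entry base vocabulary covers all input with no <unk>.
byte_tokens = list(text.encode("utf-8"))
assert all(0 <= b <= 255 for b in byte_tokens)
# A single emoji expands to four bytes in UTF-8.
assert len("🤗".encode("utf-8")) == 4
```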
Unigram is the second most popular tokenization algorithm in Transformers, used by models like T5, BigBird, Pegasus, and more. Unlike BPE, which builds the vocabulary up from characters, Unigram starts with a large vocabulary and prunes it down. Using the same corpus:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

The initial vocabulary contains all base characters plus common substrings:

```
["b", "g", "h", "n", "p", "s", "u", "hu", "ug", "un", "pu", "bu", "gs", "hug", "pug", "pun", "bun", "ugs", "hugs"]
```
Unigram scores how well the current vocabulary tokenizes the training data at each step.
For every token, Unigram measures how much removing the token would increase the overall loss. For example, removing "pu" barely affects the loss because "pug" and "pun" can still be tokenized as ["p", "ug"] and ["p", "un"].
But removing "ug" would significantly increase the loss because "hug", "pug", and "hugs" all rely on it.
Unigram removes the tokens with the lowest loss increase, usually the bottom 10-20%. Base characters always remain so any word can be tokenized. Tokens like "bu", "pu", "gs", "pug", and "bun" are removed because they contributed least to the overall likelihood, leaving:

```
["b", "g", "h", "n", "p", "s", "u", "hu", "ug", "un", "hug", "pun", "ugs", "hugs"]
```
During inference, Unigram can tokenize a word in several ways. "hugs" could become ["hug", "s"], ["h", "ug", "s"], or ["h", "u", "g", "s"]. Unigram picks the highest probability tokenization. Unlike BPE, which is deterministic and based on merge rules, Unigram is probabilistic and can sample different tokenizations during training.
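Picking the highest-probability tokenization can be sketched with a Viterbi-style search. The token probabilities below are made up for illustration; a real Unigram model learns them with EM over the training corpus:

```python
import math

# Hypothetical token probabilities, invented for this sketch.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.07, "ug": 0.10, "hug": 0.15, "ugs": 0.08}

def best_tokenization(word):
    """best[i] holds (log-prob, tokens) of the best split of word[:i]."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][1] is not None:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]
```

With these probabilities, `best_tokenization("hugs")` prefers ["hug", "s"] over ["h", "ug", "s"] because the single "hug" token is more probable than its parts multiplied together.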
SentencePiece is a tokenization library that applies BPE or Unigram directly on raw text. Standard BPE and Unigram assume whitespace separates words, which doesn't work for languages like Chinese and Japanese that don't use spaces.
SentencePiece treats the input as a raw stream and encodes the space itself as a special symbol, "▁", in the vocabulary:

```
("▁hug", 10), ("▁pug", 5), ("▁pun", 12), ("▁bun", 4), ("▁hugs", 5)
```
At decoding, SentencePiece concatenates all tokens and replaces "▁" with a space.
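Because "▁" marks word boundaries, decoding reduces to concatenation plus one replacement. A toy sketch:

```python
# Tokens as a SentencePiece model might emit them, with "▁"
# marking where each word starts.
tokens = ["▁hug", "s", "▁and", "▁pun", "s"]
# Concatenate, turn "▁" back into spaces, drop the leading space.
decoded = "".join(tokens).replace("▁", " ").lstrip()
assert decoded == "hugs and puns"
```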
WordPiece is the tokenization algorithm for BERT-family models like DistilBERT and Electra.
It's similar to BPE and iteratively merges pairs from the bottom up, but differs in how it selects pairs.
Starting from the same corpus split into characters:

```
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
```
WordPiece merges the pair that maximizes the likelihood of the training data, scoring each candidate pair by its frequency relative to the frequencies of its parts:

```
score("u", "g") = frequency("ug") / (frequency("u") × frequency("g"))
```
| pair | frequency | score |
|---|---|---|
| "u" + "g" | 20 | 20 / (36 × 20) = 0.028 |
| "u" + "n" | 16 | 16 / (36 × 16) = 0.028 |
| "h" + "u" | 15 | 15 / (15 × 36) = 0.028 |
| "g" + "s" | 5 | 5 / (20 × 5) = 0.050 |
The score favors merging "g" and "s" because the combined token appears more often than expected given the individual token frequencies. Where BPE simply merges whichever pair appears most often, WordPiece measures how informative each merge is: two tokens that appear together far more than chance predicts get merged first.
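The scores in the table can be reproduced with a short script — a toy sketch over the same corpus, not WordPiece's actual training code:

```python
from collections import Counter

# Corpus as in the example: word (split into symbols) -> frequency.
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5,
          ("p", "u", "n"): 12, ("b", "u", "n"): 4,
          ("h", "u", "g", "s"): 5}

symbol_freq, pair_freq = Counter(), Counter()
for word, freq in corpus.items():
    for symbol in word:
        symbol_freq[symbol] += freq
    for a, b in zip(word, word[1:]):
        pair_freq[(a, b)] += freq

def score(a, b):
    """WordPiece score: pair frequency normalized by symbol frequencies."""
    return pair_freq[(a, b)] / (symbol_freq[a] * symbol_freq[b])

# ("g", "s") wins with 0.05, even though it is the rarest pair.
best = max(pair_freq, key=lambda p: score(*p))
```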
Word-level tokenization splits text into tokens by space, punctuation, or language-specific rules. For example, "Don't you love 🤗 Transformers? We sure do." might be tokenized as:

```
["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
```
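A rough sketch of rule-based word splitting that reproduces this tokenization; the regex is an illustration, not the exact rules any particular tokenizer uses:

```python
import re

def word_tokenize(text):
    # Peel "n't" off contractions, otherwise take runs of word
    # characters or single non-space punctuation marks.
    return re.findall(r"\w+?(?=n't)|n't|\w+|[^\w\s]", text)

word_tokenize("Don't you love 🤗 Transformers? We sure do.")
```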
Vocabulary size becomes extremely large because every unique word requires its own token, including all variations ("love", "loving", "loved", "lovingly"). The resulting embedding matrix is enormous, increasing memory and compute. Words not in the vocabulary map to an "<unk>" token, so the model can't handle new words.
Character-level tokenization splits text into individual characters. The same sentence begins:

```
["D", "o", "n", "'", "t", "y", "o", "u", "l", "o", "v", "e"]
```
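In code this is just splitting a string into its characters; spaces are dropped here to mirror the example:

```python
# Character-level tokenization: every character is its own token.
tokens = [c for c in "Don't you love" if c != " "]
```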
The vocabulary is small and every word can be represented, so there's no "<unk>" problem. But sequences become much longer. A character like "l" carries far less meaning than "love", so performance suffers.