docs/source/en/tokenizer_summary.md
Transformers supports three subword tokenization algorithms: byte pair encoding (BPE), Unigram, and WordPiece. They split text into units between words and characters, keeping the vocabulary compact while still capturing meaningful pieces. Common words stay intact as single tokens, while rare or unknown words decompose into subwords.
For instance, "annoyingly" might be split into ["annoying", "ly"] or ["annoy", "ing", "ly"] depending on the vocabulary. Subword splitting lets the model represent unseen words from known subwords.
> [!TIP]
> Subword tokenization is especially useful for languages like Turkish, where you can form long, complex words by stringing subwords together.
Byte pair encoding (BPE) is the most popular tokenization algorithm in Transformers, used by models like Llama, Gemma, Qwen2, and more.
Suppose the training corpus contains these words and frequencies:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

The base vocabulary is ["b", "g", "h", "n", "p", "s", "u"], built from all the characters in the corpus. Each word is split into characters:

```
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
```

BPE counts how often each adjacent symbol pair occurs. "u" and "g" appear together the most, in "hug", "pug", and "hugs", so BPE merges them into "ug" and adds it to the vocabulary:

```
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
```

The next most frequent pair is "u" and "n", which appears in "pun" and "bun", so they merge into "un":

```
("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
```

The vocabulary is now ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]. BPE continues learning merge rules until it reaches the target vocabulary size, which equals the base vocabulary size plus the number of merges. GPT uses BPE with a vocabulary size of 40,478 (478 base tokens + 40,000 merges). Any character not in the base vocabulary maps to an unknown token like "<unk>". In practice, the base vocabulary covers all characters seen during training, so unknown tokens are rare.
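The merge loop above can be sketched in a few lines of Python — a toy illustration over the same corpus, not the optimized implementation real tokenizers use:

```python
from collections import Counter

# Toy corpus: word (split into symbols) -> frequency.
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5,
          ("p", "u", "n"): 12, ("b", "u", "n"): 4,
          ("h", "u", "g", "s"): 5}

def most_frequent_pair(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if word[i:i + 2] == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

pair = most_frequent_pair(corpus)  # ("u", "g"), seen 20 times
corpus = merge(corpus, pair)       # "hug" is now ("h", "ug"), etc.
```

Running the loop once more would pick ("u", "n") and learn the second merge rule from the example.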
Including all Unicode characters would make the base vocabulary enormous. Byte-level BPE uses 256 byte values as the base vocabulary instead, ensuring every word can be tokenized without the "<unk>" token. GPT-2 uses byte-level BPE with a vocabulary size of 50,257 (256 byte tokens + 50,000 merges + special end-of-text token).
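A quick illustration of why 256 byte values are enough as a base vocabulary — any string, whatever characters it contains, becomes a sequence of bytes in 0..255:

```python
text = "héllo 🤗"
# UTF-8 encoding turns every character into one or more bytes,
# so a 256-entry base vocabulary covers all input with no <unk>.
byte_tokens = list(text.encode("utf-8"))
assert all(0 <= b <= 255 for b in byte_tokens)
# A single emoji expands to four bytes in UTF-8.
assert len("🤗".encode("utf-8")) == 4
```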
Unigram is the second most popular tokenization algorithm in Transformers, used by models like T5, BigBird, Pegasus, and more. Unlike BPE, which builds the vocabulary up from characters, Unigram starts with a large vocabulary and prunes it down. Using the same corpus:

```
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
```

The initial vocabulary contains all base characters plus common substrings:

```
["b", "g", "h", "n", "p", "s", "u", "hu", "ug", "un", "pu", "bu", "gs", "hug", "pug", "pun", "bun", "ugs", "hugs"]
```
Unigram scores how well the current vocabulary tokenizes the training data at each step.
For every token, Unigram measures how much removing the token would increase the overall loss. For example, removing "pu" barely affects the loss because "pug" and "pun" can still be tokenized as ["p", "ug"] and ["p", "un"].
But removing "ug" would significantly increase the loss because "hug", "pug", and "hugs" all rely on it.
Unigram removes the tokens with the lowest loss increase, usually the bottom 10-20%. Base characters always remain so any word can be tokenized. Tokens like "bu", "pu", "gs", "pug", and "bun" are removed because they contributed least to the overall likelihood, leaving:

```
["b", "g", "h", "n", "p", "s", "u", "hu", "ug", "un", "hug", "pun", "ugs", "hugs"]
```
During inference, Unigram can tokenize a word in several ways. "hugs" could become ["hug", "s"], ["h", "ug", "s"], or ["h", "u", "g", "s"]. Unigram picks the highest probability tokenization. Unlike BPE, which is deterministic and based on merge rules, Unigram is probabilistic and can sample different tokenizations during training.
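Picking the highest-probability tokenization can be sketched with a Viterbi-style search. The token probabilities below are made up for illustration; a real Unigram model learns them with EM over the training corpus:

```python
import math

# Hypothetical token probabilities, invented for this sketch.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
         "hu": 0.07, "ug": 0.10, "hug": 0.15, "ugs": 0.08}

def best_tokenization(word):
    """best[i] holds (log-prob, tokens) of the best split of word[:i]."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][1] is not None:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]
```

With these probabilities, `best_tokenization("hugs")` prefers ["hug", "s"] over ["h", "ug", "s"] because the single "hug" token is more probable than its parts multiplied together.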
SentencePiece is a tokenization library that applies BPE or Unigram directly on raw text. Standard BPE and Unigram assume whitespace separates words, which doesn't work for languages like Chinese and Japanese that don't use spaces.
SentencePiece treats the input as a raw stream and encodes the space itself as a special symbol, "▁", in the vocabulary:

```
("▁hug", 10), ("▁pug", 5), ("▁pun", 12), ("▁bun", 4), ("▁hugs", 5)
```
At decoding, SentencePiece concatenates all tokens and replaces "▁" with a space.
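Because "▁" marks word boundaries, decoding reduces to concatenation plus one replacement. A toy sketch:

```python
# Tokens as a SentencePiece model might emit them, with "▁"
# marking where each word starts.
tokens = ["▁hug", "s", "▁and", "▁pun", "s"]
# Concatenate, turn "▁" back into spaces, drop the leading space.
decoded = "".join(tokens).replace("▁", " ").lstrip()
assert decoded == "hugs and puns"
```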
WordPiece is the tokenization algorithm for BERT-family models like DistilBERT and Electra.
It's similar to BPE and iteratively merges pairs from the bottom up, but differs in how it selects pairs.
Starting from the same corpus split into characters:

```
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
```
WordPiece merges the pair that maximizes the likelihood of the training data, scoring each candidate pair by its frequency relative to the frequencies of its parts:

```
score("u", "g") = frequency("ug") / (frequency("u") × frequency("g"))
```
| pair | frequency | score |
|---|---|---|
| "u" + "g" | 20 | 20 / (36 × 20) = 0.028 |
| "u" + "n" | 16 | 16 / (36 × 16) = 0.028 |
| "h" + "u" | 15 | 15 / (15 × 36) = 0.028 |
| "g" + "s" | 5 | 5 / (20 × 5) = 0.050 |
The score favors merging "g" and "s" because the combined token appears more often than expected given the individual token frequencies. Where BPE simply merges whichever pair appears most often, WordPiece measures how informative each merge is: two tokens that appear together far more than chance predicts get merged first.
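The scores in the table can be reproduced with a short script — a toy sketch over the same corpus, not WordPiece's actual training code:

```python
from collections import Counter

# Corpus as in the example: word (split into symbols) -> frequency.
corpus = {("h", "u", "g"): 10, ("p", "u", "g"): 5,
          ("p", "u", "n"): 12, ("b", "u", "n"): 4,
          ("h", "u", "g", "s"): 5}

symbol_freq, pair_freq = Counter(), Counter()
for word, freq in corpus.items():
    for symbol in word:
        symbol_freq[symbol] += freq
    for a, b in zip(word, word[1:]):
        pair_freq[(a, b)] += freq

def score(a, b):
    """WordPiece score: pair frequency normalized by symbol frequencies."""
    return pair_freq[(a, b)] / (symbol_freq[a] * symbol_freq[b])

# ("g", "s") wins with 0.05, even though it is the rarest pair.
best = max(pair_freq, key=lambda p: score(*p))
```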
Word-level tokenization splits text into tokens by space, punctuation, or language-specific rules. For example, "Don't you love 🤗 Transformers? We sure do." might be tokenized as:

```
["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
```
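A rough sketch of rule-based word splitting that reproduces this tokenization; the regex is an illustration, not the exact rules any particular tokenizer uses:

```python
import re

def word_tokenize(text):
    # Peel "n't" off contractions, otherwise take runs of word
    # characters or single non-space punctuation marks.
    return re.findall(r"\w+?(?=n't)|n't|\w+|[^\w\s]", text)

word_tokenize("Don't you love 🤗 Transformers? We sure do.")
```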
Vocabulary size becomes extremely large because every unique word requires its own token, including all variations ("love", "loving", "loved", "lovingly"). The resulting embedding matrix is enormous, increasing memory and compute. Words not in the vocabulary map to an "<unk>" token, so the model can't handle new words.
Character-level tokenization splits text into individual characters. The same sentence begins:

```
["D", "o", "n", "'", "t", "y", "o", "u", "l", "o", "v", "e"]
```
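In code this is just splitting a string into its characters; spaces are dropped here to mirror the example:

```python
# Character-level tokenization: every character is its own token.
tokens = [c for c in "Don't you love" if c != " "]
```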
The vocabulary is small and every word can be represented, so there's no "<unk>" problem. But sequences become much longer. A character like "l" carries far less meaning than "love", so performance suffers.