bindings/python/README.md
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository.
Otherwise, let's dive in!
```bash
pip install tokenizers
```
To use this method, you need to have Rust installed:

```bash
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
```
Once Rust is installed, you can compile by doing the following:
```bash
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
```
`tokenizers` ships dedicated wheels for the free-threaded build of CPython
(`python3.14t`). These wheels declare `Py_MOD_GIL_NOT_USED`, so importing
`tokenizers` does not force the GIL back on; multi-threaded code stays
GIL-free.
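You can check this directly on a free-threaded interpreter; a minimal sketch using `sys._is_gil_enabled()` (available since CPython 3.13):

```python
import sys

import tokenizers  # importing the extension must not force the GIL back on

# On a free-threaded build this stays False after the import;
# on a regular (GIL) build it is always True.
print(sys._is_gil_enabled())
```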
The full mutable API works on 3.14t, the same as on regular CPython.
Setters are thread-safe: the inner tokenizer state is wrapped in a
`std::sync::RwLock`, so concurrent `tokenizer.X = …` assignments from multiple
threads serialize correctly, and concurrent encode operations take a read
guard that blocks writers only briefly.
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import ByteLevel

tok = Tokenizer(BPE())
tok.pre_tokenizer = Whitespace()                    # ✅ thread-safe on 3.14t
tok.post_processor = ByteLevel(trim_offsets=True)
```
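Concurrent encoding works the same way; a minimal sketch of multi-threaded encoding (the model name and thread count are arbitrary choices for illustration):

```python
import threading

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

def worker(text):
    # Each encode takes a read guard on the inner tokenizer, so these
    # calls can run in parallel on a free-threaded build.
    encoding = tokenizer.encode(text)
    print(encoding.tokens)

threads = [threading.Thread(target=worker, args=(f"sentence {i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```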
Caveat: compound mutations are not atomic. Statements like
`tokenizer.post_processor.special_tokens = X` evaluate in two steps from
Python's point of view (read attribute → set attribute on the result). If
another thread swaps `tokenizer.post_processor` between those steps, the
mutation lands on an orphaned component. This is the same class of race
as `dict[k] = v` interleaved with `dict.clear()`; coordinate with a Python
lock if you need the compound update to be atomic.
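For example, a single `threading.Lock` shared by the mutating threads makes the read-then-set pair atomic (a minimal sketch mirroring the attribute names from the caveat above):

```python
import threading

# One lock shared by every thread that mutates this tokenizer's components.
tokenizer_lock = threading.Lock()

def update_special_tokens(tokenizer, special_tokens):
    # The read of `post_processor` and the write to its attribute happen
    # under the same lock, so no other thread can swap the component
    # in between.
    with tokenizer_lock:
        tokenizer.post_processor.special_tokens = special_tokens
```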
For the full thread-safety analysis, see `docs/free-threading-audit.md`.
To load a pretrained tokenizer from the Hugging Face Hub:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
```
We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
these using some `vocab.json` and `merges.txt` files:
```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```
And you can train them just as simply:
```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
```
The provided tokenizers are:

- `CharBPETokenizer`: The original BPE
- `ByteLevelBPETokenizer`: The byte level version of the BPE
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
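For example, `BertWordPieceTokenizer` only needs a `vocab.txt` file (the path below is a placeholder):

```python
from tokenizers import BertWordPieceTokenizer

# WordPiece models ship a single vocab.txt instead of vocab.json + merges.txt
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
```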
Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.
Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:
```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
```
Now, when you want to use this tokenizer, this is as simple as:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
encoded = tokenizer.encode("I can feel the magic, can you?")
```
The compiled PyO3 extension does not expose type annotations, so editors and type checkers would otherwise see most objects as `Any`. To provide full typing support, we use a two-step stub generation process:

1. `tools/stub-gen/`: uses `pyo3-introspection` to analyze the compiled extension and generate the `.pyi` stub files
2. `stub.py`: adds docstrings from the runtime module and generates forwarding `__init__.py` shims

The easiest way to regenerate the stubs is via `make style`:
```bash
cd bindings/python
make style
```
This will:

- build the extension with `maturin develop --release`
- generate the `.pyi` files
- run `stub.py`
- format the result with `ruff`

To run the stub generator directly:
```bash
cd bindings/python
cargo run --manifest-path tools/stub-gen/Cargo.toml
python stub.py
```
The stub generator automatically:

- copies the compiled `.so` to the project root for introspection
- sets `PYTHONHOME` for embedded Python (handles uv/venv environments)
- writes the generated stubs to `py_src/tokenizers/`

If you encounter Python initialization errors, you can manually set `PYTHONHOME`:
```bash
export PYTHONHOME=$(python3 -c 'import sys; print(sys.base_prefix)')
cargo run --manifest-path tools/stub-gen/Cargo.toml
```