# Tokenizers
Megatron Core provides a unified tokenizer system with a HuggingFace-style API for easy tokenizer management and configuration.
The `MegatronTokenizer` class offers a simple, familiar API for loading and managing tokenizers through its `.from_pretrained()` interface. The same call works regardless of the tokenizer backend (SentencePiece, HuggingFace, TikToken, etc.):
```python
from megatron.core.tokenizers import MegatronTokenizer

tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer")
```
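Once loaded, the tokenizer can be used directly. A minimal usage sketch, assuming the `encode()`/`decode()` methods shown in the custom tokenizer example later in this guide:

```python
# Illustrative only: encode()/decode() are assumed here, based on the
# text-tokenizer interface shown in the custom tokenizer example below.
token_ids = tokenizer.encode("Hello, Megatron!")
text = tokenizer.decode(token_ids)
print(token_ids)
print(text)
```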
Configuration is stored in a JSON metadata file that records the tokenizer library and class, along with optional settings such as a chat template. A key benefit is that the correct tokenizer implementation (`SentencePieceTokenizer`, `HuggingFaceTokenizer`, etc.) is selected automatically based on this metadata.

Save tokenizer configuration for reuse:
```python
from megatron.core.tokenizers import MegatronTokenizer

# Create metadata for a SentencePiece tokenizer
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    chat_template="{% for message in messages %}{{ message.content }}{% endfor %}",
)
```
The metadata is saved as `tokenizer_metadata.json` in the tokenizer directory.
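As a quick sanity check, you can inspect the generated file; this is only a sketch, and apart from the `library` key (which appears in the dictionary examples below) the exact contents depend on your Megatron Core version:

```python
import json

# Illustrative: print the generated metadata. Apart from "library",
# the exact keys may differ between Megatron Core versions.
with open("/path/to/tokenizer_metadata.json") as f:
    print(json.load(f))
```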
Load from a directory with metadata:
```python
from megatron.core.tokenizers import MegatronTokenizer

# Load with auto-detected configuration
tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")
```
If metadata is stored separately:
```python
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer.model",
    metadata_path="/path/to/custom/metadata.json",
)
```
Pass metadata as a dictionary:
```python
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="GPT2BPETokenizer",
    metadata_path={"library": "megatron"},
    vocab_file="/path/to/vocab.txt",
)
```
Create model-specific tokenization logic:
```python
from megatron.core.tokenizers import MegatronTokenizer
from megatron.core.tokenizers.text import MegatronTokenizerText


class CustomTokenizer(MegatronTokenizerText):
    def encode(self, text):
        # Custom encoding logic
        return super().encode(text)

    def decode(self, tokens):
        # Custom decoding logic
        return super().decode(tokens)


# Save metadata with custom class
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/tokenizer.model",
    tokenizer_library="sentencepiece",
    tokenizer_class=CustomTokenizer,
)
```
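The tokenizer can then be loaded through the usual `from_pretrained()` call; a sketch, assuming the custom class can be resolved from the metadata at load time:

```python
# Because tokenizer_class was recorded in the metadata, from_pretrained()
# is expected to return a CustomTokenizer instance (assumption: the class
# must be importable in the process that loads it).
tokenizer = MegatronTokenizer.from_pretrained("/path/to/tokenizer.model")
```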
Configure TikToken-based tokenizers:
```python
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="/path/to/tokenizer/model.json",
    metadata_path={"library": "tiktoken"},
    pattern="v2",
    num_special_tokens=1000,
)
```
Use a null tokenizer for testing or non-text models:
```python
tokenizer = MegatronTokenizer.from_pretrained(
    metadata_path={"library": "null-text"},
    vocab_size=131072,
)
```
The tokenizer system integrates seamlessly with Megatron-LM training:
```bash
# Null tokenizer for testing
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type NullTokenizer \
    --vocab-size 131072 \
    ...

# HuggingFace tokenizer with metadata
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model meta-llama/Meta-Llama-3-8B \
    --tokenizer-metadata /path/to/metadata.json \
    ...
```
If `--tokenizer-metadata` is not specified, a default metadata file is generated automatically based on the tokenizer type.
Supported tokenizer libraries:

| Library | Description | Use Case |
|---|---|---|
| HuggingFace | Transformers tokenizers | Most modern LLMs (LLaMA, Mistral, etc.) |
| SentencePiece | Google's tokenizer | GPT-style models, custom vocabularies |
| TikToken | OpenAI's tokenizer | GPT-3.5/GPT-4 style tokenization |
| Megatron | Built-in tokenizers | Legacy GPT-2 BPE |
| Null | No-op tokenizer | Testing, non-text modalities |
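For HuggingFace tokenizers, the same dictionary-based metadata pattern shown above should apply; a hedged sketch, assuming the library string is `"huggingface"` (not confirmed in this guide; check your Megatron Core version):

```python
from megatron.core.tokenizers import MegatronTokenizer

# Assumption: "huggingface" is the library identifier for Transformers
# tokenizers; verify the exact string against your Megatron Core version.
tokenizer = MegatronTokenizer.from_pretrained(
    tokenizer_path="meta-llama/Meta-Llama-3-8B",
    metadata_path={"library": "huggingface"},
)
```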
To create metadata for a LLaMA-style SentencePiece tokenizer:

```python
MegatronTokenizer.write_metadata(
    tokenizer_path="/path/to/llama/tokenizer.model",
    tokenizer_library="sentencepiece",
)
```
To create metadata for the built-in GPT-2 BPE tokenizer (Megatron library):

```python
MegatronTokenizer.write_metadata(
    tokenizer_path="GPT2BPETokenizer",
    tokenizer_library="megatron",
    vocab_file="/path/to/gpt2-vocab.json",
    merge_file="/path/to/gpt2-merges.txt",
)
```
Reference the generated `tokenizer_metadata.json` in your experiment configs.