Tokenizers split text into tokens and map those tokens to the numerical IDs that models consume. They also add special tokens, pad and truncate sequences, and build attention masks.
Automatically load the correct tokenizer for a model:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
Load from local path:
tokenizer = AutoTokenizer.from_pretrained("./local/tokenizer/path")
# Simple encoding
text = "Hello, how are you?"
tokens = tokenizer.encode(text)
print(tokens) # [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# Split text into tokens only (no ID conversion, no special tokens added)
tokens = tokenizer.tokenize(text)
print(tokens) # ['hello', ',', 'how', 'are', 'you', '?']
token_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
text = tokenizer.decode(token_ids)
print(text) # "[CLS] hello, how are you? [SEP]"
# Skip special tokens
text = tokenizer.decode(token_ids, skip_special_tokens=True)
print(text) # "hello, how are you?"
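To make the encode/decode round trip concrete, here is a minimal pure-Python sketch. The vocabulary, `toy_encode`, and `toy_decode` are hypothetical stand-ins for illustration only; they are not the real `bert-base-uncased` vocabulary or the library's implementation.

```python
# Toy vocabulary: a hypothetical stand-in, not the real bert-base-uncased vocab.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, ",": 5, "how": 6, "are": 7, "you": 8, "?": 9}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]"}


def toy_encode(tokens, add_special_tokens=True):
    """Map tokens to IDs, wrapping them in [CLS] ... [SEP] if requested."""
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    if add_special_tokens:
        ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return ids


def toy_decode(ids, skip_special_tokens=False):
    """Map IDs back to tokens, optionally dropping special tokens."""
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [tok for tok in tokens if tok not in SPECIAL_TOKENS]
    return tokens


tokens = ["hello", ",", "how", "are", "you", "?"]
ids = toy_encode(tokens)
print(ids)                                        # [2, 4, 5, 6, 7, 8, 9, 3]
print(toy_decode(ids))                            # includes '[CLS]' and '[SEP]'
print(toy_decode(ids, skip_special_tokens=True))  # ['hello', ',', 'how', 'are', 'you', '?']
```

Unknown tokens fall back to `[UNK]` in this sketch, which mirrors the role of `tokenizer.unk_token` in a word-level vocabulary.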
The __call__ method is the primary tokenization interface:
# Single text
inputs = tokenizer("Hello, how are you?")
# Returns a dict-like BatchEncoding with input_ids, token_type_ids (for BERT), and attention_mask
print(inputs)
# {
#   'input_ids': [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
#   'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
#   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
# }
Multiple texts:
texts = ["Hello", "How are you?"]
inputs = tokenizer(texts, padding=True, truncation=True)
return_tensors: Output format ("pt", "tf", "np")
# PyTorch tensors
inputs = tokenizer("text", return_tensors="pt")
# TensorFlow tensors
inputs = tokenizer("text", return_tensors="tf")
# NumPy arrays
inputs = tokenizer("text", return_tensors="np")
padding: Pad sequences to same length
# Pad to longest sequence in batch
inputs = tokenizer(texts, padding=True)
# Pad to specific length
inputs = tokenizer(texts, padding="max_length", max_length=128)
# No padding
inputs = tokenizer(texts, padding=False)
pad_to_multiple_of: Pad to multiple of specified value
inputs = tokenizer(texts, padding=True, pad_to_multiple_of=8)
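The padding options above can be sketched in plain Python to show how padded IDs and the attention mask relate. `pad_batch` is a hypothetical helper, and right-side padding with pad ID 0 is an assumption (some tokenizers pad on the left or use a different pad ID).

```python
def pad_batch(batch_ids, pad_id=0, pad_to_multiple_of=None):
    """Right-pad every sequence to a shared length and build attention masks."""
    target = max(len(ids) for ids in batch_ids)
    if pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


raw = [[101, 7592, 102], [101, 2129, 2024, 2017, 1029, 102]]

batch = pad_batch(raw)
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0, 0]

batch8 = pad_batch(raw, pad_to_multiple_of=8)
print(len(batch8["input_ids"][0]))  # 8
```

Padding to a multiple of 8 trades a few extra pad tokens for tensor shapes that some hardware handles more efficiently.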
truncation: Limit sequence length
# Truncate to max_length
inputs = tokenizer(text, truncation=True, max_length=512)
# Truncate first sequence in pairs
inputs = tokenizer(text1, text2, truncation="only_first")
# Truncate second sequence
inputs = tokenizer(text1, text2, truncation="only_second")
# Truncate longest first (the strategy used by truncation=True)
inputs = tokenizer(text1, text2, truncation="longest_first", max_length=512)
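The three pair-truncation strategies differ only in which sequence gives up tokens. The sketch below is a simplified illustration, not the library's code: `truncate_pair` is a hypothetical helper, `max_tokens` is assumed to be the budget left after reserving room for special tokens, and removing from the second sequence on ties is an assumption.

```python
def truncate_pair(first, second, max_tokens, strategy="longest_first"):
    """Trim two token lists so that together they hold at most max_tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    first, second = list(first), list(second)
    overflow = len(first) + len(second) - max_tokens
    if overflow <= 0:
        return first, second

    if strategy == "only_first":
        if overflow > len(first):
            raise ValueError("first sequence is too short to absorb the overflow")
        first = first[:len(first) - overflow]
    elif strategy == "only_second":
        if overflow > len(second):
            raise ValueError("second sequence is too short to absorb the overflow")
        second = second[:len(second) - overflow]
    elif strategy == "longest_first":
        # Drop one token at a time from whichever sequence is currently longer.
        for _ in range(overflow):
            if len(first) > len(second):
                first.pop()
            else:
                second.pop()
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return first, second


q = ["what", "is", "the", "capital", "of", "france"]
a = ["paris", "is", "the", "capital"]

print(truncate_pair(q, a, 6, "longest_first"))  # (['what', 'is', 'the'], ['paris', 'is', 'the'])
print(truncate_pair(q, a, 6, "only_first"))     # (['what', 'is'], ['paris', 'is', 'the', 'capital'])
print(truncate_pair(q, a, 6, "only_second"))    # (['what', 'is', 'the', 'capital', 'of', 'france'], [])
```

For question answering, `only_second` is a common choice because it keeps the question intact and trims only the context.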
max_length: Maximum sequence length
inputs = tokenizer(text, max_length=512, truncation=True)
return_attention_mask: Include attention mask (default True)
inputs = tokenizer(text, return_attention_mask=True)
return_token_type_ids: Segment IDs for sentence pairs
inputs = tokenizer(text1, text2, return_token_type_ids=True)
return_offsets_mapping: Character position mapping (Fast tokenizers only)
inputs = tokenizer(text, return_offsets_mapping=True)
return_length: Include sequence lengths
inputs = tokenizer(texts, padding=True, return_length=True)
Access special tokens:
print(tokenizer.cls_token) # [CLS] or <s>
print(tokenizer.sep_token) # [SEP] or </s>
print(tokenizer.pad_token) # [PAD]
print(tokenizer.unk_token) # [UNK]
print(tokenizer.mask_token) # [MASK]
print(tokenizer.eos_token) # End of sequence
print(tokenizer.bos_token) # Beginning of sequence
# Get IDs
print(tokenizer.cls_token_id)
print(tokenizer.sep_token_id)
Manual control:
# Automatically add special tokens (default True)
inputs = tokenizer(text, add_special_tokens=True)
# Skip special tokens
inputs = tokenizer(text, add_special_tokens=False)
special_tokens_dict = {
    "additional_special_tokens": ["<CUSTOM>", "<SPECIAL>"]
}
num_added = tokenizer.add_special_tokens(special_tokens_dict)
print(f"Added {num_added} tokens")
# Resize model embeddings after adding tokens
model.resize_token_embeddings(len(tokenizer))
Tokenize text pairs:
text1 = "What is the capital of France?"
text2 = "Paris is the capital of France."
# Automatically handles separation
inputs = tokenizer(text1, text2, padding=True, truncation=True)
# Results in: [CLS] text1 [SEP] text2 [SEP]
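The pair layout above can be sketched with plain lists to show how `token_type_ids` mark the two segments in BERT-style models. `build_pair` is a hypothetical helper; other model families use different separators and may not use segment IDs at all.

```python
def build_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Assemble [CLS] A [SEP] B [SEP] and the matching BERT-style segment IDs."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


pair_tokens, pair_type_ids = build_pair(["hello"], ["how", "are", "you", "?"])
print(pair_tokens)    # ['[CLS]', 'hello', '[SEP]', 'how', 'are', 'you', '?', '[SEP]']
print(pair_type_ids)  # [0, 0, 0, 1, 1, 1, 1, 1]
```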
Process multiple texts:
texts = ["First text", "Second text", "Third text"]
# Basic batch encoding
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Access individual encodings
for i in range(len(texts)):
    input_ids = batch["input_ids"][i]
    attention_mask = batch["attention_mask"][i]
Use Rust-based tokenizers for speed:
from transformers import AutoTokenizer
# Automatically loads Fast version if available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Check if Fast
print(tokenizer.is_fast) # True
# Force Fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# Force slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
Offset mapping (character positions):
inputs = tokenizer("Hello world", return_offsets_mapping=True)
print(inputs["offset_mapping"])
# [(0, 0), (0, 5), (6, 11), (0, 0)] # [CLS], "Hello", "world", [SEP]
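Offsets are `(start, end)` character indices into the original string, so they can be used to recover each token's source text. The sketch below reuses the offsets shown above as plain data; `token_index_for_char` is a hypothetical helper, and treating empty spans as special tokens is an assumption that holds for this example.

```python
text = "Hello world"
offsets = [(0, 0), (0, 5), (6, 11), (0, 0)]  # copied from the output above

# Special tokens have empty spans, so keep only offsets that cover characters.
spans = [text[start:end] for start, end in offsets if end > start]
print(spans)  # ['Hello', 'world']


def token_index_for_char(offsets, char_pos):
    """Return the index of the token covering char_pos, or None if there is none."""
    for i, (start, end) in enumerate(offsets):
        if start <= char_pos < end:
            return i
    return None


print(token_index_for_char(offsets, 7))  # 2 (the 'world' token)
print(token_index_for_char(offsets, 5))  # None (the space is not part of any token)
```

This kind of lookup is what makes it possible to map a predicted token span back to a character span in tasks such as extractive question answering.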
Token to word mapping:
encoding = tokenizer("Hello world")
word_ids = encoding.word_ids()
print(word_ids) # [None, 0, 1, None] # [CLS]=None, "Hello"=0, "world"=1, [SEP]=None
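A common use of `word_ids()` is aligning word-level labels with subword tokens for token classification. The sketch below works on plain lists; `align_labels` is a hypothetical helper, the example `word_ids` are made up to include a word split into two subwords, and `-100` is the label value conventionally ignored by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Expand word-level labels to token level using word_ids."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS] and [SEP] carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous or label_all_subwords:
            aligned.append(word_labels[wid])
        else:
            # Later subwords of the same word are ignored by default.
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


# Hypothetical example: word 1 was split into two subword tokens.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [0, 3, 0]

print(align_labels(example_word_ids, example_labels))
# [-100, 0, 3, -100, 0, -100]
print(align_labels(example_word_ids, example_labels, label_all_subwords=True))
# [-100, 0, 3, 3, 0, -100]
```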
Save locally:
tokenizer.save_pretrained("./my_tokenizer")
Push to Hub:
tokenizer.push_to_hub("username/my-tokenizer")
Access vocabulary:
vocab = tokenizer.get_vocab()
vocab_size = len(vocab)
# Get token for ID
token = tokenizer.convert_ids_to_tokens(100)
# Get ID for token
token_id = tokenizer.convert_tokens_to_ids("hello")
Get detailed encoding information:
encoding = tokenizer("Hello world", return_tensors="pt")
# Fast-tokenizer helper methods remain available on the returned BatchEncoding
tokens = encoding.tokens()
word_ids = encoding.word_ids()
sequence_ids = encoding.sequence_ids()
Subclass for custom behavior:
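from transformers import BertTokenizerFast
# AutoTokenizer is only a factory; subclass a concrete tokenizer class instead
class CustomTokenizer(BertTokenizerFast):
    def __call__(self, text, **kwargs):
        # Custom preprocessing for a single string or a list of strings
        if isinstance(text, str):
            text = text.lower().strip()
        else:
            text = [t.lower().strip() for t in text]
        return super().__call__(text, **kwargs)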
For conversational models:
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
    {"role": "user", "content": "How are you?"},
]
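# Apply the chat template (requires a tokenizer that defines one; base checkpoints such as bert-base-uncased typically do not)
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
# Tokenize directly; add_generation_prompt=True appends the assistant turn header so the model replies next
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")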
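Conceptually, a chat template renders the list of messages into one prompt string. The sketch below uses a ChatML-style layout purely as an illustration; `render_chatml` is a hypothetical function, and real templates are model-specific and stored with each tokenizer, so actual output will differ.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in a ChatML-style layout (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"},
]

rendered = render_chatml(demo_messages, add_generation_prompt=True)
print(rendered)
```

Because the layout is part of how a model was trained, prefer `apply_chat_template` over hand-written formatting for real models.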
texts = ["I love this!", "I hate this!"]
labels = [1, 0]
inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
# Use with a sequence-classification model (assumes `import torch` and a loaded `model`)
outputs = model(**inputs, labels=torch.tensor(labels))
question = "What is the capital?"
context = "Paris is the capital of France."
inputs = tokenizer(
    question,
    context,
    padding=True,
    truncation=True,
    max_length=384,
    return_tensors="pt",
)
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate (passing **inputs forwards the attention_mask along with input_ids)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )
# Apply to dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
Padding token not set:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
Sequence too long:
# Enable truncation
inputs = tokenizer(text, truncation=True, max_length=512)
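Truncation discards everything past `max_length`. When the tail of a long document matters, a different technique is to split the IDs into overlapping windows. The sketch below illustrates that idea in plain Python; `sliding_windows` is a hypothetical helper, not a library function. The tokenizer exposes related options (`return_overflowing_tokens` and `stride`), which are not reproduced here.

```python
def sliding_windows(ids, window, stride):
    """Split ids into windows of at most `window` tokens.

    Consecutive windows share `stride` tokens of overlap.
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(ids[start:start + window])
        if start + window >= len(ids):
            break
        start += step
    return chunks


long_ids = list(range(10))
windows = sliding_windows(long_ids, window=4, stride=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap keeps a phrase that straddles a window boundary fully visible in at least one window.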
Mismatched vocabulary:
# Always load tokenizer and model from the same checkpoint ("model-id" is a placeholder)
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model-id")
model = AutoModel.from_pretrained("model-id")
Attention mask issues:
# Ensure attention_mask is passed
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
)
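To see why the mask matters, consider mean pooling over token vectors. The sketch below uses plain lists with made-up values; `masked_mean` is a hypothetical helper that shows how padding positions skew an average unless the attention mask excludes them.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


# Two real tokens followed by one padding position (values are made up).
vectors = [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0]]
mask = [1, 1, 0]

print(masked_mean(vectors, mask))       # [3.0, 6.0]  padding excluded
print(masked_mean(vectors, [1, 1, 1]))  # [2.0, 4.0]  padding drags the mean down
```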