Tokenizer

x/imagegen/tokenizer/README.md

Tokenizer for LLM inference supporting the BPE, SentencePiece, and WordPiece algorithms. The goal of this package is to see whether a pure Go tokenizer can be both fast and correct. It primarily supports the imagegen models; however, it (or parts of it) could be considered as a replacement for Ollama's tokenizer in the model package.

Features

  • BPE (Byte Pair Encoding) - GPT-2/Llama style with byte-level encoding
  • SentencePiece - Gemma style with space handling
  • WordPiece - BERT style with ## continuation tokens
  • Parallel encoding - Automatic parallelization for inputs >4KB
  • HuggingFace compatible - Loads tokenizer.json directly

Usage

```go
import (
	"log"

	"github.com/ollama/ollama/x/imagegen/tokenizer"
)

// Load from a HuggingFace model directory
tok, err := tokenizer.Load("./weights/Llama-3.2-1B")
if err != nil {
	log.Fatal(err)
}

// Encode text to token IDs
ids := tok.Encode("Hello, world!", false) // false = don't add BOS

// Decode back to text
text := tok.Decode(ids)

// Check special tokens
if tok.IsEOS(ids[len(ids)-1]) {
	// End of sequence
}
```

Performance

Benchmarks on Apple M3 Max:

| Input Size | Encode    | Decode   | Tokens  |
|------------|-----------|----------|---------|
| 1 KB       | 14.5 MB/s | 267 MB/s | 231     |
| 10 KB      | 10.9 MB/s | 321 MB/s | 2,301   |
| 100 KB     | 8.9 MB/s  | 311 MB/s | 23,001  |
| 1 MB       | 9.6 MB/s  | 321 MB/s | 230,001 |

Comparison with other implementations (10 MB input):

| Implementation  | Encode Speed | Notes                     |
|-----------------|--------------|---------------------------|
| Engine (this)   | ~10 MB/s     | stdlib RE2, parallel >4KB |
| tiktoken (Rust) | ~17 MB/s     | Highly optimized regex    |
| Ollama (Go)     | ~2-3 MB/s    | regexp2 backtracking      |

Performance Opportunities

Potential optimizations not yet implemented:

| Optimization                        | Expected Gain                | Complexity |
|-------------------------------------|------------------------------|------------|
| Aho-Corasick for special tokens     | 2-3x for many special tokens | Medium     |
| Custom regex engine (like tiktoken) | 1.5-2x                       | High       |
| SIMD byte scanning                  | 1.3-1.5x for pretokenizer    | Medium     |
| Assembly BPE merge loop             | 1.2-1.5x                     | High       |
| Memoization for repeated substrings | Variable                     | Low        |

The current bottleneck is the pretokenizer regex (~60% of encode time); tiktoken reaches ~17 MB/s largely thanks to its hand-tuned Rust regex engine.
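The low-complexity memoization idea from the table above can be sketched as follows: natural-language text repeats words heavily, so caching per-pretoken results lets repeated substrings skip the merge loop entirely. This is a hypothetical sketch, not code from this package; `encodeWord` stands in for the expensive BPE merge step.

```go
package main

import "fmt"

// encodeWord stands in for the expensive per-word BPE merge loop
// (hypothetical: here it just returns one ID per rune).
func encodeWord(w string) []int {
	ids := make([]int, 0, len(w))
	for _, r := range w {
		ids = append(ids, int(r))
	}
	return ids
}

// cachedEncoder memoizes per-pretoken results so repeated words
// are encoded only once.
type cachedEncoder struct {
	cache map[string][]int
	hits  int // cache hits, tracked here only to demonstrate reuse
}

func (c *cachedEncoder) encode(words []string) []int {
	var out []int
	for _, w := range words {
		ids, ok := c.cache[w]
		if !ok {
			ids = encodeWord(w)
			c.cache[w] = ids
		} else {
			c.hits++
		}
		out = append(out, ids...)
	}
	return out
}

func main() {
	enc := &cachedEncoder{cache: map[string][]int{}}
	words := []string{"the", "cat", "sat", "on", "the", "mat", "the"}
	ids := enc.encode(words)
	fmt.Println(len(ids), enc.hits) // 20 2
}
```

The gain is workload-dependent ("Variable" in the table): highly repetitive text hits the cache often, while unique text pays a small map-lookup overhead.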

Not Yet Implemented

| Feature              | Used By                   | Notes                         |
|----------------------|---------------------------|-------------------------------|
| Unigram tokenizer    | T5, ALBERT, mBART         | Different algorithm (not BPE) |
| Unicode normalizers  | Some multilingual models  | NFD, NFKC, lowercase, etc.    |
| Custom pretokenizers | Model-specific            | Beyond standard patterns      |

Most HuggingFace models use BPE or SentencePiece, which are fully supported. WordPiece (BERT-style) is also supported with standard [UNK] fallback for out-of-vocabulary characters.
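The WordPiece behavior described above (greedy longest-match with `##` continuation pieces and an `[UNK]` fallback) can be sketched like this. The vocabulary and function names are a toy stand-in for illustration, not this package's API.

```go
package main

import "fmt"

// wordPiece greedily matches the longest vocabulary prefix, then continues
// with "##"-prefixed pieces for the remainder of the word; if no piece
// matches at some position, the whole word falls back to [UNK].
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		end := len(word)
		var match string
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub // continuation piece marker
			}
			if vocab[sub] {
				match = sub
				break
			}
			end-- // shrink the candidate and try again
		}
		if match == "" {
			return []string{"[UNK]"} // out-of-vocabulary fallback
		}
		pieces = append(pieces, match)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"play": true, "##ing": true, "##ed": true}
	fmt.Println(wordPiece("playing", vocab)) // [play ##ing]
	fmt.Println(wordPiece("xyz", vocab))     // [[UNK]]
}
```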

Files

| File              | Description                      |
|-------------------|----------------------------------|
| tokenizer.go      | Main implementation (~1000 lines)|
| tokenizer_test.go | Tests and benchmarks             |
| testdata/         | Mini tokenizer for unit tests    |