Tokenizer

x/imagegen/tokenizer/README.md

Tokenizer for LLM inference supporting the BPE, SentencePiece, and WordPiece algorithms. The goal of this package is to see whether a pure Go tokenizer can be both fast and correct. It primarily supports the imagegen models; however, it (or parts of it) could be considered as a replacement for Ollama's tokenizer in the model package.

Features

  • BPE (Byte Pair Encoding) - GPT-2/Llama style with byte-level encoding
  • SentencePiece - Gemma style with space handling
  • WordPiece - BERT style with ## continuation tokens
  • Parallel encoding - Automatic parallelization for inputs >4KB
  • HuggingFace compatible - Loads tokenizer.json directly

Usage

```go
import (
	"log"

	"github.com/ollama/ollama/x/imagegen/tokenizer"
)

// Load from a HuggingFace model directory
tok, err := tokenizer.Load("./weights/Llama-3.2-1B")
if err != nil {
	log.Fatal(err)
}

// Encode text to token IDs
ids := tok.Encode("Hello, world!", false) // false = don't add BOS

// Decode back to text
text := tok.Decode(ids)

// Check special tokens
if tok.IsEOS(ids[len(ids)-1]) {
	// End of sequence
}
```

Performance

Benchmarks on Apple M3 Max:

| Input Size | Encode    | Decode   | Tokens  |
|------------|-----------|----------|---------|
| 1 KB       | 14.5 MB/s | 267 MB/s | 231     |
| 10 KB      | 10.9 MB/s | 321 MB/s | 2,301   |
| 100 KB     | 8.9 MB/s  | 311 MB/s | 23,001  |
| 1 MB       | 9.6 MB/s  | 321 MB/s | 230,001 |

Comparison with other implementations (10 MB input):

| Implementation  | Encode Speed | Notes                     |
|-----------------|--------------|---------------------------|
| Engine (this)   | ~10 MB/s     | stdlib RE2, parallel >4KB |
| tiktoken (Rust) | ~17 MB/s     | Highly optimized regex    |
| Ollama (Go)     | ~2-3 MB/s    | regexp2 backtracking      |

Performance Opportunities

Potential optimizations not yet implemented:

| Optimization                        | Expected Gain                | Complexity |
|-------------------------------------|------------------------------|------------|
| Aho-Corasick for special tokens     | 2-3x for many special tokens | Medium     |
| Custom regex engine (like tiktoken) | 1.5-2x                       | High       |
| SIMD byte scanning                  | 1.3-1.5x for pretokenizer    | Medium     |
| Assembly BPE merge loop             | 1.2-1.5x                     | High       |
| Memoization for repeated substrings | Variable                     | Low        |

The current bottleneck is the pretokenizer regex (~60% of encode time); tiktoken reaches ~17 MB/s largely thanks to its hand-tuned Rust regex engine.
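The low-complexity memoization idea from the table above can be sketched as follows: natural-language text repeats words heavily, so caching per-pretoken results lets repeated substrings skip the merge loop entirely. This is a hypothetical sketch, not code from this package; `encodeWord` stands in for the expensive BPE merge step.

```go
package main

import "fmt"

// encodeWord stands in for the expensive per-word BPE merge loop
// (hypothetical: here it just returns one ID per rune).
func encodeWord(w string) []int {
	ids := make([]int, 0, len(w))
	for _, r := range w {
		ids = append(ids, int(r))
	}
	return ids
}

// cachedEncoder memoizes per-pretoken results so repeated words
// are encoded only once.
type cachedEncoder struct {
	cache map[string][]int
	hits  int // cache hits, tracked here only to demonstrate reuse
}

func (c *cachedEncoder) encode(words []string) []int {
	var out []int
	for _, w := range words {
		ids, ok := c.cache[w]
		if !ok {
			ids = encodeWord(w)
			c.cache[w] = ids
		} else {
			c.hits++
		}
		out = append(out, ids...)
	}
	return out
}

func main() {
	enc := &cachedEncoder{cache: map[string][]int{}}
	words := []string{"the", "cat", "sat", "on", "the", "mat", "the"}
	ids := enc.encode(words)
	fmt.Println(len(ids), enc.hits) // 20 2
}
```

The gain is workload-dependent ("Variable" in the table): highly repetitive text hits the cache often, while unique text pays a small map-lookup overhead.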

Not Yet Implemented

| Feature              | Used By                   | Notes                         |
|----------------------|---------------------------|-------------------------------|
| Unigram tokenizer    | T5, ALBERT, mBART         | Different algorithm (not BPE) |
| Unicode normalizers  | Some multilingual models  | NFD, NFKC, lowercase, etc.    |
| Custom pretokenizers | Model-specific            | Beyond standard patterns      |

Most HuggingFace models use BPE or SentencePiece, which are fully supported. WordPiece (BERT-style) is also supported with standard [UNK] fallback for out-of-vocabulary characters.
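The WordPiece behavior described above (greedy longest-match with `##` continuation pieces and an `[UNK]` fallback) can be sketched like this. The vocabulary and function names are a toy stand-in for illustration, not this package's API.

```go
package main

import "fmt"

// wordPiece greedily matches the longest vocabulary prefix, then continues
// with "##"-prefixed pieces for the remainder of the word; if no piece
// matches at some position, the whole word falls back to [UNK].
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		end := len(word)
		var match string
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub // continuation piece marker
			}
			if vocab[sub] {
				match = sub
				break
			}
			end-- // shrink the candidate and try again
		}
		if match == "" {
			return []string{"[UNK]"} // out-of-vocabulary fallback
		}
		pieces = append(pieces, match)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"play": true, "##ing": true, "##ed": true}
	fmt.Println(wordPiece("playing", vocab)) // [play ##ing]
	fmt.Println(wordPiece("xyz", vocab))     // [[UNK]]
}
```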

Files

| File              | Description                      |
|-------------------|----------------------------------|
| tokenizer.go      | Main implementation (~1000 lines)|
| tokenizer_test.go | Tests and benchmarks             |
| testdata/         | Mini tokenizer for unit tests    |