Token Estimation Benchmark

Benchmarking the heuristic (string.length / 3.5).ceil against tiktoken tokenizers used by OpenAI models.

Methodology

  • Corpus: 19 Ruby source files from lib/woods/ (1.4 KB to 33.4 KB), including extractors, retrieval pipeline, MCP servers, and utility modules
  • Reference tokenizer: tiktoken_ruby 0.0.15.1 with cl100k_base (GPT-4) and o200k_base (GPT-4o)
  • Heuristic under test: (content.length / 3.5).ceil — the current divisor used across the gem (see the sketch after this list)
  • Content analysis: Additional breakdown of code-only vs. comment/YARD lines across 20 files
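
In code, the per-file measurement described above reduces to a few lines. The following is a minimal sketch of what scripts/token_benchmark.rb computes, assuming tiktoken_ruby's Tiktoken.encoding_for_model / #encode API; the glob and output format here are illustrative:

```ruby
require "tiktoken_ruby"

enc = Tiktoken.encoding_for_model("gpt-4") # resolves to cl100k_base

Dir.glob("lib/woods/**/*.rb").sort.each do |path|
  content   = File.read(path)
  actual    = enc.encode(content).length   # reference token count
  heuristic = (content.length / 3.5).ceil  # heuristic under test
  error_pct = (heuristic - actual) * 100.0 / actual

  puts format("%-28s chars=%6d actual=%5d heuristic=%5d error=%+6.1f%%",
              File.basename(path), content.length, actual, heuristic, error_pct)
end
```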

Results

Per-File Results (cl100k_base, sorted by file size)

| File | Chars | Actual Tokens | Heuristic | Error % | Chars/Token |
|---|---|---|---|---|---|
| model_name_cache.rb | 1,419 | 360 | 406 | +12.8% | 3.94 |
| ast_source_extraction.rb | 1,462 | 330 | 418 | +26.7% | 4.43 |
| circuit_breaker.rb | 3,191 | 760 | 912 | +20.0% | 4.20 |
| text_preparer.rb | 3,788 | 900 | 1,083 | +20.3% | 4.21 |
| query_classifier.rb | 4,586 | 1,153 | 1,311 | +13.7% | 3.98 |
| extracted_unit.rb | 4,963 | 1,151 | 1,418 | +23.2% | 4.31 |
| retriever.rb | 6,424 | 1,419 | 1,836 | +29.4% | 4.53 |
| dependency_graph.rb | 6,767 | 1,662 | 1,934 | +16.4% | 4.07 |
| semantic_chunker.rb | 8,891 | 2,103 | 2,541 | +20.8% | 4.23 |
| concern_extractor.rb | 9,187 | 2,105 | 2,625 | +24.7% | 4.36 |
| context_assembler.rb | 9,196 | 2,009 | 2,628 | +30.8% | 4.58 |
| job_extractor.rb | 10,968 | 2,586 | 3,134 | +21.2% | 4.24 |
| search_executor.rb | 12,254 | 2,597 | 3,502 | +34.8% | 4.72 |
| graph_analyzer.rb | 12,677 | 2,958 | 3,622 | +22.4% | 4.29 |
| controller_extractor.rb | 16,457 | 3,559 | 4,702 | +32.1% | 4.62 |
| extractor.rb | 23,235 | 5,348 | 6,639 | +24.1% | 4.34 |
| console/server.rb | 27,865 | 5,582 | 7,962 | +42.6% | 4.99 |
| graphql_extractor.rb | 32,469 | 7,344 | 9,277 | +26.3% | 4.42 |
| mcp/server.rb | 33,430 | 6,173 | 9,552 | +54.7% | 5.42 |

Statistical Summary

| Metric | cl100k_base (GPT-4) | o200k_base (GPT-4o) |
|---|---|---|
| Mean signed error | +26.2% | +27.1% |
| Mean absolute error | 26.2% | 27.1% |
| Max absolute error | 54.7% | 54.6% |
| Min absolute error | 12.8% | 12.2% |
| Std dev | 9.8% | 9.9% |
| Mean chars/token | 4.41 | 4.45 |
| Range chars/token | 3.94 - 5.42 | 3.92 - 5.41 |

Content Type Analysis (code lines vs. comments/YARD)

| Content Type | Mean Chars/Token | Range |
|---|---|---|
| Code lines only | 4.38 | 3.61 - 5.45 |
| Comment/YARD lines | 4.27 | 3.91 - 4.90 |

Both content types sit above 3.5 chars/token. The assumption in CLAUDE.md that "Ruby code averages ~3.2-3.5 chars/token" is not supported by this corpus.
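
How the code/comment split was produced is not shown in this report. One plausible reconstruction (a hypothetical sketch, not the actual analysis code) is to partition each file's lines on a leading "#" (YARD docs are ordinary Ruby comments) and measure chars/token per group:

```ruby
require "tiktoken_ruby"

# Hypothetical reconstruction of the code-vs-comment breakdown (not the actual
# analysis script): partition lines on a leading "#", measure each group separately.
enc = Tiktoken.encoding_for_model("gpt-4")

comment_lines, code_lines = File.readlines("lib/woods/extracted_unit.rb")
                                .partition { |line| line.lstrip.start_with?("#") }

{ "code lines only" => code_lines, "comment/YARD lines" => comment_lines }.each do |label, lines|
  text = lines.join
  next if text.empty?
  puts "#{label}: #{(text.length.to_f / enc.encode(text).length).round(2)} chars/token"
end
```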

Alternative Divisor Comparison

| Divisor | Mean Abs Error (cl100k) | Max Abs Error |
|---|---|---|
| 3.0 | 47.2% | 80.5% |
| 3.2 | 38.0% | 69.2% |
| 3.5 (current) | 26.2% | 54.7% |
| 3.8 | 16.2% | 42.5% |
| 4.0 | 10.6% | 35.4% |
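
The divisor sweep is mechanical once per-file counts exist. A minimal sketch, seeded with two [chars, actual tokens] pairs taken from the per-file table above:

```ruby
# [chars, actual cl100k tokens] pairs from the per-file table (two shown for brevity)
counts = {
  "retriever.rb"  => [6_424, 1_419],
  "mcp/server.rb" => [33_430, 6_173]
}

# Mean absolute error of ceil(chars / divisor) against the actual token counts
def mean_abs_error(counts, divisor)
  errors = counts.values.map do |(chars, tokens)|
    ((chars / divisor).ceil - tokens).abs * 100.0 / tokens
  end
  errors.sum / errors.size
end

[3.0, 3.2, 3.5, 3.8, 4.0].each do |divisor|
  puts format("divisor %.1f: mean abs error %5.1f%%", divisor, mean_abs_error(counts, divisor))
end
```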

Key Findings

  1. The heuristic always overestimates (positive error across all 19 files). It never underestimates. This makes it conservative (safe for token-limit enforcement), but wasteful.

  2. Mean overestimation is 26.2% (cl100k) — the heuristic thinks files have ~26% more tokens than they actually do. This exceeds the 10% target from the backlog item.

  3. Large files with repetitive structures are worst. mcp/server.rb (54.7% error) and console/server.rb (42.6% error) contain long sequences of similar tool registration blocks. Tokenizers handle repetitive patterns efficiently; the heuristic does not.

  4. cl100k and o200k produce nearly identical results on Ruby source — no meaningful difference between GPT-4 and GPT-4o tokenizers for this corpus.

  5. A divisor of 4.0 reduces mean error to 10.6% while still always overestimating. This is the best simple improvement without adding a dependency.

Practical Impact

The overestimation affects these gem components:

| Component | Effect of 26% Overestimation |
|---|---|
| ExtractedUnit#estimated_tokens | Chunks are created ~20% smaller than necessary |
| TextPreparer#enforce_token_limit | Embedding input truncated ~20% earlier than needed |
| ContextAssembler | Token budget consumed ~20% faster, fewer results returned |
| CostModel::Estimator | Cost estimates ~26% higher than actual |
| SemanticChunker | More chunks per unit, more embedding API calls |

(The ~20% figures are the inverse of the ~26% overestimation: content sized at 3.5 chars/token fills only about 1/1.26 ≈ 79% of the real token budget, so roughly a fifth of the budget goes unused.)

Recommendation

Change the divisor from 3.5 to 4.0. Rationale:

  • Reduces mean error from 26.2% to 10.6% (within the 15% budget for a heuristic)
  • Still always overestimates — maintains the conservative safety property
  • Zero new dependencies
  • Single constant change across 6 files (search for / 3.5 and * 3.5)
  • The original 3.5 figure appears to have been based on minimal-comment pure-code snippets; real Ruby source with YARD docs averages 4.41 chars/token

Do NOT add tiktoken_ruby as a runtime dependency. The 10.6% mean error with divisor 4.0 is acceptable for the gem's use cases (chunking decisions, budget estimates, truncation). Adding a native extension dependency for marginal accuracy gains is not worth the complexity.
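
For illustration, a hypothetical shared helper (not existing Woods API; module and method names are invented here) would turn the multi-file constant change into a single line to tune later:

```ruby
# Hypothetical centralization of the heuristic (not part of the current gem):
# the call sites listed below would delegate here instead of repeating the divisor.
module Woods
  module TokenEstimate
    CHARS_PER_TOKEN = 4.0 # benchmark-derived divisor (3.5 is the value in use today)

    def self.estimate(text)
      (text.length / CHARS_PER_TOKEN).ceil
    end
  end
end

Woods::TokenEstimate.estimate("def hello; puts 'hi'; end") # => 7
```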

Locations Using the Heuristic

All instances of the 3.5 divisor in the codebase:

  • lib/woods/extracted_unit.rb:71-72,98,101 — estimated_tokens, chunking
  • lib/woods/chunking/chunk.rb:43 — token_count
  • lib/woods/embedding/text_preparer.rb:104,107 — enforce_token_limit
  • lib/woods/retrieval/context_assembler.rb:216,225 — truncation, estimate_tokens
  • lib/woods/formatting/base.rb:36 — estimate_tokens

Reproducing This Benchmark

```bash
gem install tiktoken_ruby        # Not a Gemfile dependency — benchmark only
ruby scripts/token_benchmark.rb  # Outputs JSON to stdout
```

The benchmark spec at spec/token_estimation_benchmark_spec.rb validates:

  • Self-consistency of the heuristic (monotonically increasing with content length), sketched below
  • With tiktoken_ruby installed: bounded overestimation, no underestimation > 5%
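
A hypothetical sketch of the first check (not the actual contents of spec/token_estimation_benchmark_spec.rb), illustrating only the monotonicity property:

```ruby
# Hypothetical illustration of the self-consistency check (run with `rspec`);
# the real spec also covers the tiktoken-backed overestimation bounds.
RSpec.describe "token estimation heuristic" do
  let(:estimate) { ->(text) { (text.length / 3.5).ceil } }

  it "never decreases as content grows" do
    samples   = (1..100).map { |n| "def method_#{n}; end\n" * n }
    estimates = samples.map { |text| estimate.call(text) }

    expect(estimates).to eq(estimates.sort)
  end
end
```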