# Chunking Guide

How WeKnora splits uploaded documents before embedding, why the defaults are what they are, and when to change them.

## Why chunking matters

Retrieval-Augmented Generation (RAG) works by embedding small slices of your documents into a vector index, then pulling the most relevant slices back at query time. The way a document is sliced — chunk size, overlap, where the cuts fall — directly drives retrieval recall and the quality of the answers your LLM produces.

Empirically, recursive splitting at ~512 tokens with ~15% overlap is the strongest single-knob baseline: in the Vecta Feb-2026 benchmark across 50 academic papers it reached 69% end-to-end accuracy, beating both semantic chunking and over-engineered hybrids. WeKnora uses that as the foundation and layers smarter strategies on top when the document gives us structural cues.

## Adaptive 3-tier chunking

Set per knowledge base via the editor's Chunking sidebar (or the `strategy` field on the KB-config API).

| Strategy | When picked | What it does |
|---|---|---|
| `auto` (recommended) | Default for new KBs | Profiles the document and picks the strongest tier from the chain below. |
| `heading` | Markdown-style structure | Splits at `#` / `##` / `###` boundaries. Each chunk gets a breadcrumb context header (`# Top > ## Section`) prepended at embedding time. |
| `heuristic` | PDF-style structure | Splits at form feeds (page breaks), numbered sections, multilingual chapter markers (DE / EN / ZH), all-caps titles, and visual separators. |
| `legacy` (= recursive) | Anything else, or as fallback | Pure recursive separator-based splitter (newest version, with priority recursion and overlap-cap fixes). |

A document profiler runs first and counts structural signals (Markdown headings, form-feeds, chapter markers per language, all-caps lines, visual separators, blank-line bursts). Auto-strategy picks the tier chain based on those counts; a validator rejects obviously broken output (e.g. the heading splitter producing 200 single-line chunks) and falls through to the next tier.
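
For intuition, here is a minimal sketch of such a profile-then-fall-through chain. All names, signals, and thresholds are illustrative assumptions, not WeKnora's actual identifiers:

```python
import re

def profile(text: str) -> dict:
    """Count structural signals: Markdown headings, form feeds, all-caps lines."""
    return {
        "md_headings": len(re.findall(r"^#{1,3} ", text, re.M)),
        "form_feeds": text.count("\f"),
        "caps_lines": len(re.findall(r"^[A-Z][A-Z0-9 .\-]{7,}$", text, re.M)),
    }

def chunk_auto(text, heading_split, heuristic_split, legacy_split, is_valid):
    """Pick the first tier whose output passes validation; legacy ends the chain."""
    p = profile(text)
    tiers = []
    if p["md_headings"] >= 3:                         # enough Markdown structure
        tiers.append(heading_split)
    if p["form_feeds"] >= 2 or p["caps_lines"] >= 3:  # PDF-style structure
        tiers.append(heuristic_split)
    tiers.append(legacy_split)                        # always-applicable fallback
    for split in tiers:
        chunks = split(text)
        if is_valid(chunks):   # e.g. reject 200 single-line chunks
            return chunks
    return legacy_split(text)  # last resort if even legacy output was rejected
```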

## Settings reference

### Core

| Setting | Range | Default | Sweet spot for… |
|---|---|---|---|
| Chunk size | 100–4000 chars | 512 | Default works for most cases. 200–400 for FAQs / atomic Q&A. 1000–2000 for narrative documents. |
| Chunk overlap | 0–500 chars | 80 (~15%) | 0 for FAQs and structured records. 80 (default) for general documents. 150–200 for argumentative texts where reasoning crosses chunks. |
| Separators | string list | `["\n\n", "\n", "。", "!", "?", ";", ";"]` | Order matters: the splitter tries higher-priority separators first and only falls back to lower ones when a piece is still oversize. |
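
To make the separator priority concrete, here is a hedged sketch of a recursive separator splitter in the spirit of the legacy tier; the function name and the exact cut/overlap policy are assumptions, not the shipped implementation:

```python
def recursive_split(text, size=512, overlap=80,
                    seps=("\n\n", "\n", "。", "!", "?", ";", ";")):
    """Cut at the highest-priority separator that fits; recurse on the rest."""
    if len(text) <= size:
        return [text]
    for sep in seps:                                # priority order matters
        cut = text.rfind(sep, 0, size)              # last fitting occurrence
        if cut > overlap:                           # ensure forward progress
            head = text[:cut + len(sep)]
            # restart `overlap` chars before the boundary so neighbors share text
            rest = text[cut + len(sep) - overlap:]
            return [head] + recursive_split(rest, size, overlap, seps)
    # no separator usable: hard character cut as a last resort
    return [text[:size]] + recursive_split(text[size - overlap:], size, overlap, seps)
```

Called as `recursive_split(doc_text)`, this yields chunks of roughly 512 chars or less, each repeating about 80 trailing characters of its predecessor, which is exactly the knob pairing the Core table describes.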

### Parent-Child chunking

Two-level retrieval: child chunks (small, embedded for vector match) and parent chunks (larger, returned to the LLM for context).

| Setting | Range | Default | Notes |
|---|---|---|---|
| Enable parent-child | toggle | on | Recommended for documents > 10 pages. Skip for short FAQs to halve storage cost. |
| Parent chunk size | 512–8192 chars | 4096 (~1000 EN tokens) | Larger for long-context LLMs (Claude, GPT-4-Turbo). Smaller (1024–2048) for local LLMs with 4k contexts. |
| Child chunk size | 64–2048 chars | 384 (~95 EN tokens) | 128–256 for Q&A-style precise matching. 512–1024 if your embedder accepts >1000 tokens (E5 / BGE-large). |
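
A minimal sketch of the two-level scheme, reusing the `recursive_split` sketch above; the `Child` type and function names are illustrative assumptions, not WeKnora's internals:

```python
from dataclasses import dataclass

@dataclass
class Child:
    text: str        # small chunk: embedded, used for vector matching
    parent_id: int   # index of the larger chunk returned to the LLM

def build_parent_child(text, parent_size=4096, child_size=384, overlap=80):
    """Split into large parents, then split each parent into small children."""
    parents = recursive_split(text, parent_size, overlap)
    children = [Child(c, i)
                for i, p in enumerate(parents)
                for c in recursive_split(p, child_size, overlap)]
    return parents, children

def expand_to_parents(top_children, parents):
    """At query time: map the best-matching children to deduplicated parents."""
    seen, ctx = set(), []
    for child in top_children:  # already ranked by vector similarity
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            ctx.append(parents[child.parent_id])
    return ctx
```

The vector index only ever sees child texts; the LLM only ever sees parent texts, so matching stays precise while context stays rich.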

### Advanced

| Setting | Range | Default | When to set |
|---|---|---|---|
| Token limit | 0–8192 | 0 (off) | Activate when your embedding model has a small token cap. See the table below. |
| Languages | de / en / zh (multi-select) | empty (auto-detect) | Set explicitly for homogeneous corpora to narrow the heuristic patterns. |

### Token-limit guide per embedding model

| Embedder | Token limit | Recommended `tokenLimit` setting |
|---|---|---|
| OpenAI text-embedding-3-small/large | 8191 | 0 (leave off) |
| Voyage AI voyage-3 | 32000 | 0 |
| Jina-embeddings-v3 | 8192 | 0 |
| Cohere embed-multilingual-v3 | 512 | 400 |
| BGE-base / BGE-large / E5-large | 512 | 400 |
| Sentence-Transformer all-MiniLM-L6-v2 | 256 | 200 |

Rule of thumb: leave at 0 for any modern embedder with a limit above ~2000 tokens. For smaller embedders, set it to ~80% of the model's hard limit so chunks always fit, even for CJK content (which is denser per character).
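
The rule of thumb restated as a tiny helper; the hard-limit map mirrors the table above, and the 2000-token threshold comes straight from the sentence before:

```python
# Hard token caps per embedder (subset of the table above).
HARD_LIMITS = {"cohere-embed-multilingual-v3": 512, "all-MiniLM-L6-v2": 256}

def token_limit(model: str) -> int:
    """0 (off) for big-context embedders, else ~80% of the hard cap."""
    cap = HARD_LIMITS.get(model, 8192)
    return 0 if cap > 2000 else int(cap * 0.8)

token_limit("cohere-embed-multilingual-v3")  # -> 409, close to the 400 above
token_limit("all-MiniLM-L6-v2")              # -> 204, close to the 200 above
```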

## Use-case presets

| Workload | Strategy | Chunk size | Overlap | Parent-Child |
|---|---|---|---|---|
| FAQ / Q&A knowledge base | auto (likely picks legacy) | 200–400 | 0 | off |
| Markdown documentation / wikis | auto (picks heading) | 512 | 80 | on |
| PDF reports with page breaks | auto (picks heuristic) | 800–1200 | 100–150 | on |
| Long-form narrative (books, articles) | auto (picks recursive) | 1000–2000 | 150–200 | on |
| Code documentation | legacy | 800 | 100 | optional |
| Mixed-language corpus | auto, languages = empty | 512 | 80 | on |
| Tabular reports / CSV-derived | legacy | 400 | 0 | off |

## Debugging in the UI

The KB editor's Chunking sidebar has a "Test with sample text" collapsible at the bottom:

1. Paste a Markdown / plain-text snippet (max 64 KB).
2. Click **Run preview**.
3. The panel shows:
   - Selected strategy tier as a colored tag
   - Tiers that were rejected and why (e.g. "too many tiny chunks")
   - Document profile (heading counts, form feeds, chapter markers, detected languages)
   - Size statistics over the full chunk set (avg / min / max / stddev)
   - Per-chunk cards with size in chars plus approximate tokens, position range, the section breadcrumb (when set), and a content preview

This runs read-only against a goroutine-isolated splitter pass (5s timeout) — no DB writes, no embedding API calls. Use it to compare configurations against the same sample before triggering a re-upload.
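
The isolation pattern, sketched in Python terms purely for intuition (the actual implementation is a goroutine with a context timeout; names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def preview(split, text, timeout_s=5.0):
    """Run a splitter pass isolated from request state; give up after 5s."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(split, text)  # no DB writes, no embedding calls
        try:
            return future.result(timeout=timeout_s)
        except TimeoutError:
            return None  # the UI reports a timeout instead of partial chunks
```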

## API

Three endpoints can write the chunking config. The KB-config update endpoint is the one wired to the editor UI; it uses camelCase with a `documentSplitting` envelope:

```http
PUT /api/v1/initialization/config/:kbId
Authorization: Bearer <jwt>
Content-Type: application/json

{
  "documentSplitting": {
    "chunkSize": 512,
    "chunkOverlap": 80,
    "separators": ["\n\n", "\n", "。", "!", "?", ";", ";"],
    "strategy": "auto",
    "tokenLimit": 0,
    "languages": ["de", "en"],
    "enableParentChild": true,
    "parentChunkSize": 4096,
    "childChunkSize": 384
  }
}
```

The `strategy`, `tokenLimit`, and `languages` fields use pointer-based DTOs server-side: omitting them from the payload means "no change"; sending an empty string / 0 / [] explicitly resets the field to its default.
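
A sketch of what those semantics look like from a client, using Python's requests; the host, KB id, and token are placeholders:

```python
import requests

BASE = "https://weknora.example.com"  # placeholder host
HEADERS = {"Authorization": "Bearer <jwt>"}

# Pointer fields (strategy, tokenLimit, languages) are omitted here,
# so they stay unchanged; only chunkSize and chunkOverlap move.
requests.put(f"{BASE}/api/v1/initialization/config/kb-123", headers=HEADERS,
             json={"documentSplitting": {"chunkSize": 1024, "chunkOverlap": 150}})

# Sending the zero value explicitly resets a field to its default:
requests.put(f"{BASE}/api/v1/initialization/config/kb-123", headers=HEADERS,
             json={"documentSplitting": {"languages": []}})  # back to auto-detect
```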

The KB CRUD endpoints (`POST /api/v1/knowledge-bases`, `PUT /api/v1/knowledge-bases/:id`) take the same fields but in snake_case under a `chunking_config` envelope: `{ "chunking_config": { "chunk_size": 512, "chunk_overlap": 80, "strategy": "auto", "token_limit": 0, "languages": ["de", "en"], ... } }`.

The preview endpoint at `POST /api/v1/chunker/preview` uses the snake_case form too, and additionally takes a `text` field with the sample to chunk.
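
One plausible way to drive it from a script, assuming the `chunking_config` envelope sits alongside `text` in the same request body (host and token are placeholders):

```python
import requests

resp = requests.post(
    "https://weknora.example.com/api/v1/chunker/preview",
    headers={"Authorization": "Bearer <jwt>"},
    json={
        "text": "# Title\n\nSome sample body text...",
        "chunking_config": {
            "strategy": "auto",
            "chunk_size": 512,
            "chunk_overlap": 80,
        },
    },
    timeout=10,  # the server aborts its splitter pass after 5s anyway
)
print(resp.json())  # selected tier, rejected tiers, profile, chunk stats
```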

## Known trade-offs

- Tier-1 heading-aware chunking prepends the section breadcrumb to the embedding input, costing ~5% more tokens per chunk in exchange for ~30–50% fewer chunks on structured documents (a net token saving on storage and at query time).
- Strategy switches do not auto-reindex existing documents. After changing a KB's strategy, re-upload affected files (or trigger re-indexing via the UI) to apply the new chunking.
- OCR artifacts in PDFs (vertical-layout text broken character-by-character into separate lines) cannot be fixed by any splitter; this is a parser-side limitation. The heuristic tier still keeps chunks aligned to page boundaries, which mitigates the worst cases.
- The `recursive` strategy value exists in the API for completeness but is intentionally hidden from the UI: it is functionally near-identical to `legacy`, and adding a fifth dropdown option would dilute the meaningful choice between automatic / Markdown / heuristic / legacy.