docs/ParagraphSemanticChunking.md
Paragraph Semantic Chunking (hereafter the P strategy) targets documents with clear sectional structure such as DOCX. Its core goal is to align chunk boundaries with the document's native semantic boundaries (headings, paragraphs, table rows) as much as possible, rather than determining cut points solely from token-length counting.
The P strategy is mainly designed to address the following four categories of problems:
The P strategy is effective only for the .blocks.jsonl structured artifacts produced by the native extraction engine; for unstructured inputs, it automatically falls back to the R strategy (see §8).
| Dimension | R Strategy (Recursive) | V Strategy (SemanticVector) | P Strategy (ParagraphSemantic) |
|---|---|---|---|
| Splitting basis | Cascading character separators (paragraph → newline → Chinese punctuation → whitespace → character) + token budget | Sentence-level embedding distance thresholds (percentile / standard deviation / IQR / gradient) to locate semantic breaks | DOCX outline level with parent_headings + table row boundaries + anchors + hierarchy-aware merging |
| Chunk size control | chunk_token_size hard cap | chunk_token_size is merely an advisory ceiling; when exceeded, secondary splitting via R | target_max hard cap + target_ideal soft target + table threshold + tail-absorption threshold working in concert |
| Table handling | Table-unaware; may cut in the middle of a table | Table-unaware | Tables smaller than `table_max` are kept intact; large tables are sliced by JSON row array / HTML `<tr>` row boundaries and re-wrapped as valid `<table>` |
| Table context | Relies on incidental window coverage | Relies on embedding distance | First slice glues to preceding description, last slice glues to following explanation; bidirectional overlap of bridging text between consecutive large tables |
| Inter-chunk overlap | Global chunk_overlap_token_size | No overlap | No overlap across section boundaries; within the same section, long body falls back to R with overlap by CHUNK_P_OVERLAP_SIZE; bridging text between consecutive large tables may enter both the preceding and following table chunks |
| Heading metadata | Usually none | Usually none | Inherits or promotes heading; appends [part n] suffix after splitting; preserves parent_headings and level |
| Embedding compute cost | None | High (must compute embedding per sentence) | None |
| Input requirements | Any text | Any text + Embedding model | Must have a .blocks.jsonl sidecar (i.e., result of the native engine); otherwise falls back to R |
| Scenario | Recommended | Rationale |
|---|---|---|
| DOCX with clear sectional hierarchy, large tables, fine-grained clauses | P | Fully leverages heading hierarchy and table row boundaries; chunk boundaries best match semantics; avoids cross-topic pollution |
| Documents dominated by prose / commentary / long body without clear sectional structure | V | Splitting by semantic similarity forms natural boundaries at topic shifts, more stable than character splitting |
| Inputs are plain text, Markdown, code, logs, or you want minimum compute overhead | R | No embedding overhead; cascading separators are stable enough for mixed Chinese-English text |
| General configuration (uncertain about file types) | R | P automatically falls back to R when no sidecar is present; V also falls back to R when no Embedding model is available |
| Documents with chaotic heading styles and many pseudo-headings in body | R or V | P depends on the native parser correctly identifying headings; messy headings cause basic chunk boundaries to shift |
| Single-line giant tables or unparsable tables | Any | All three strategies eventually fall back to character-level splitting; P still retains the advantage of table context gluing |
- Native engine: must be explicitly declared in `LIGHTRAG_PARSER`, e.g., `docx:native-P`; otherwise, even if P is specified, chunking falls back to R because the `.blocks.jsonl` artifact is missing.

The P strategy takes as input the `.blocks.jsonl` produced by the native parser in `fixlevel=0` mode. Each `type == "content"` line is treated as one heading-level basic chunk, and table slicing, long-chunk splitting, and hierarchical merging are then performed on top of it:
DOCX
↓ native parser (fixlevel=0)
.blocks.jsonl + sidecars (.tables.json / .equations.json / .drawings.json / .blocks.assets/)
↓ Stage B: slice oversized tables along row boundaries and assign first/middle/last roles
↓ Stage B.1: bidirectional overlap of bridging text between consecutive large tables
↓ Stage C: anchor-driven re-splitting of long text chunks
↓ Stage D: hierarchy-aware two-phase merging
↓ Stage E: [part n] line-level provenance numbering
Final chunk list
Key invariants of the P strategy:
- Text from different `.blocks.jsonl` content lines is never copied into the other chunk, avoiding "misattribution".
- Within the same content line, long-body fallback splitting overlaps by `chunk_overlap_token_size`, reducing mid-sentence cuts in long bodies.

`chunking_by_paragraph_semantic()` receives the following inputs:
| Parameter | Source | Description |
|---|---|---|
| `content` | `full_docs[doc_id].content` | Concatenated merged text, used for fallback when sidecar is missing |
| `blocks_path` | `full_docs[doc_id].lightrag_document_path` | Path to `.blocks.jsonl`, the primary input for the P strategy |
| `chunk_token_size` | `chunk_options.chunk_token_size` / `CHUNK_P_SIZE` | Target hard cap N; defaults to 2000 |
| `chunk_overlap_token_size` | `CHUNK_P_OVERLAP_SIZE` / `chunk_overlap_token_size` | Upper bound for long-body fallback overlap within the same content line and for the table bridging budget; defaults to 100 |
| `tokenizer` | The tokenizer already parsed by LightRAG | Basis for all token counting and text overlap truncation |
The P strategy does not accept split_by_character / split_by_character_only, because the normal path is driven by heading and paragraph structure.
`.blocks.jsonl` Convention

The P strategy only processes `type == "content"` lines. Each content line typically contains:

- `content`: The body text under the heading, possibly including ordinary paragraphs, `<table ... />` tags, `<equation ... />` formulas, `<drawing ... />` graphics.
- `heading`: The current heading.
- `parent_headings`: The chain of parent headings.
- `level`: Heading level (1–9, corresponding to the original outline levels 0–8).
- `positions`: Original paragraph positioning (used for traceability).

The native parser's `fixlevel=0` mode guarantees that "the body under a heading becomes one basic chunk" without performing token-threshold splitting during parsing. Tables are inserted into `content` while staying intact.
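As an illustration of the convention above, the following minimal sketch reads a `.blocks.jsonl` file and keeps only the `type == "content"` records. The function name `load_content_lines` is hypothetical and is not taken from the LightRAG code base; tolerating blank lines is also an assumption.

```python
import json
from pathlib import Path


def load_content_lines(blocks_path: str) -> list[dict]:
    """Return the parsed `type == "content"` records of a .blocks.jsonl file.

    Hypothetical helper for illustration; not the actual LightRAG reader.
    """
    records = []
    for raw in Path(blocks_path).read_text(encoding="utf-8").splitlines():
        raw = raw.strip()
        if not raw:
            continue  # assumption: blank lines are tolerated and skipped
        record = json.loads(raw)
        if record.get("type") == "content":
            records.append(record)
    return records
```

Each returned record is then one heading-level basic chunk, carrying `heading`, `parent_headings`, `level`, and `content` as described above.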
The final output is an ordered list of chunks, where each element is:
{
    "tokens": int,                 # Actual token count (re-measured after merging)
    "content": str,                # Chunk text (may contain <table> tags)
    "chunk_order_index": int,      # Chunk ordering index
    "heading": str,                # Suffix [part n] appended after splitting
    "parent_headings": list[str],  # Parent heading chain; no suffix appended
    "level": int,                  # Heading level
}
Internally, the implementation also temporarily uses fields such as paragraphs, table_chunk_role, uuid, uuid_end, type to assist splitting and merging, but these do not appear in the final output.
`[part n]` Suffix Rules

- When a single `.blocks.jsonl` content line is split into multiple slices, the `heading` field of every slice gets `[part 1]`, `[part 2]` … appended.
- `parent_headings` does not get any suffix.
- The legacy `[表格片段N]` ("table fragment N") suffix is uniformly replaced by `[part n]`.

P strategy thresholds are not fixed constants; they are dynamically derived from `chunk_token_size` (denoted N):
| Name | Formula | Value when N = 2000 | Technical meaning |
|---|---|---|---|
| `target_max` | N | 2000 | Hard upper bound for text chunks |
| `target_ideal` | 0.75 × N | 1500 | Ideal target for text chunks; chunks at or above this value stop participating in ordinary peer merging |
| `table_max` | 0.625 × N | 1250 | Threshold that triggers table slicing |
| `table_ideal` | 0.375 × N | 750 | Ideal size for a table slice |
| `table_min_last` | 0.32 × `table_max` | 400 | Last-slice swallow-back threshold (if the last slice is smaller and can be merged, swallow it back into the previous slice) |
| `small_tail_threshold` | 0.125 × N | 250 | Threshold for tail fragment absorption |
| `max_anchor_candidate_length` | Fixed | 100 chars | Upper bound on paragraph length for candidate anchors in long-chunk splitting |
Proportional constraint relationships: table_max < target_ideal < target_max, table_ideal < table_max. These ratios originate from empirical values in the audit mode (large chunk 8000, small table 5000, ideal table 3000, table tail 1600) and are now proportionally scaled by chunk_token_size.
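The derivation in the table can be written out as a short sketch. The names `PThresholds` and `derive_thresholds` are hypothetical, and rounding down with integer arithmetic is an assumption; the text only gives the ratios and the values for N = 2000.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PThresholds:
    target_max: int
    target_ideal: int
    table_max: int
    table_ideal: int
    table_min_last: int
    small_tail_threshold: int


def derive_thresholds(chunk_token_size: int) -> PThresholds:
    """Scale every threshold from N using the ratios in the table.

    Assumption: fractional results are rounded down.
    """
    n = chunk_token_size
    table_max = n * 5 // 8                      # 0.625 x N
    return PThresholds(
        target_max=n,                           # N
        target_ideal=n * 3 // 4,                # 0.75 x N
        table_max=table_max,
        table_ideal=n * 3 // 8,                 # 0.375 x N
        table_min_last=table_max * 32 // 100,   # 0.32 x table_max
        small_tail_threshold=n // 8,            # 0.125 x N
    )
```

With `derive_thresholds(2000)` this reproduces the "Value when N = 2000" column, and the ordering `table_max < target_ideal < target_max` holds for any positive N large enough that rounding does not collapse the values.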
Heading recognition is performed by the native parser; the P chunker itself does not scan the docx body nor judge heading styles.
In fixlevel=0 mode, the native parser:
- Parses `styles.xml`, builds a style inheritance chain via `<w:basedOn>`, and traces back the effective `<w:outlineLvl>`.
- Traverses `document.xml`, resolving outline levels along the inheritance chain; original outline levels 0–8 are mapped to internal `level` 1–9.
- Maintains `current_heading_stack`, clearing old headings no shallower than the current level when a new heading is encountered, and computing `parent_headings`.
- Converts tables, equations, and drawings into inline tags (`<table id="..." format="json">...</table>` etc.) and writes them to the corresponding sidecars.

The P chunker directly reads `.blocks.jsonl`, treating each content line as an independent unit of processing for subsequent Stages B/C. This implies that `[part n]` numbering is reset independently per original content line.
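The heading-stack bookkeeping performed by the native parser can be pictured with the sketch below. It is not the parser's code: `update_heading_stack` is a hypothetical name, and representing the stack as `(level, heading)` tuples is an assumption.

```python
def update_heading_stack(
    stack: list[tuple[int, str]], level: int, heading: str
) -> list[str]:
    """Push a new heading and return its parent_headings.

    Headings that are not shallower than the new one (level >= new level)
    are cleared first, so only true ancestors remain on the stack.
    """
    while stack and stack[-1][0] >= level:
        stack.pop()
    parents = [text for _, text in stack]
    stack.append((level, heading))
    return parents
```

For example, after "Chapter 1" (level 1) and "1.1" (level 2), a new level-2 heading "1.2" clears "1.1" and gets `["Chapter 1"]` as its `parent_headings`.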
Stage B only processes tables whose token count exceeds `table_max`. Its goal is not merely to split the table, but to split along row boundaries first while preserving the context at the table's boundaries.
- `format="json"`: Slice by the top-level JSON row array.
- `format="html"`: Slice by `<tr>...</tr>` rows.

Before slicing, the `<table {attrs}></table>` wrapper token cost is pre-deducted so that each re-wrapped slice stays under `table_max` as much as possible. Each slice is re-wrapped as a valid `<table>` tag for ease of downstream parsing.
If a row subset, after re-wrapping, still exceeds table_max, further subdivision is performed within that row subset. Only when slicing has converged to a single row that itself exceeds the limit does it degrade to character-level splitting. This mechanism keeps as much valid table structure as possible for table content expressible by row boundaries.
If the token count of the last table slice falls below table_min_last and the result of merging with the previous slice does not exceed table_max, the last slice is swallowed back into the previous slice, reducing useless short table chunks.
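A minimal sketch of row-boundary slicing with last-slice swallow-back follows. It works on pre-measured per-row token counts and returns groups of row indices. The name `slice_table_rows` is hypothetical, and packing rows greedily up to `table_ideal` is an assumption; the text only states that `table_ideal` is the ideal slice size. Recursive subdivision and the character-level fallback for an oversized single row are left out.

```python
def slice_table_rows(
    row_tokens: list[int],
    table_ideal: int,
    table_max: int,
    table_min_last: int,
    wrapper_tokens: int = 0,
) -> list[list[int]]:
    """Group row indices into slices along row boundaries."""
    # Pre-deduct the <table ...></table> wrapper cost from both budgets.
    ideal = table_ideal - wrapper_tokens
    hard = table_max - wrapper_tokens

    slices: list[list[int]] = []
    sizes: list[int] = []
    for index, tokens in enumerate(row_tokens):
        if slices and sizes[-1] + tokens <= ideal:
            slices[-1].append(index)
            sizes[-1] += tokens
        else:
            # Start a new slice; a single oversized row still gets its own slice.
            slices.append([index])
            sizes.append(tokens)

    # Swallow a short last slice back into the previous one when it fits.
    if (
        len(slices) >= 2
        and sizes[-1] < table_min_last
        and sizes[-2] + sizes[-1] <= hard
    ):
        slices[-2].extend(slices.pop())
    return slices
```

With five 300-token rows and the N = 2000 thresholds (750 / 1250 / 400), the trailing single-row slice is swallowed back, giving two slices instead of three.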
Each table slice is assigned an internal field table_chunk_role, and gluing to surrounding paragraphs is decided by role:
| Role | Meaning | Gluing strategy |
|---|---|---|
| `first` | First slice of the original table | Appended to the tail of the current accumulating chunk so that the table's preceding description enters the same chunk as the first slice |
| `middle` | Middle slice of the original table | Output independently to avoid merging with unrelated body |
| `last` | Last slice of the original table | Used as the starting point of a new accumulating chunk so that the following explanation is automatically appended after the last slice |
| `none` | Non-table slice or untouched intact table | Treated as ordinary text chunks |
table_chunk_role is an internal field that does not survive in the final output, but in Stage D it continues to serve as a merging constraint (see §9.1).
When the pattern "large table A, short bridging text, large table B" appears in the same original content line and both tables are split, the bridging text is distributed bidirectionally according to a context budget:
- `prev_budget = min(chunk_overlap_token_size, target_max - current token count of the left last slice)`.
- `next_budget = min(chunk_overlap_token_size, target_max - current token count of the right first slice)`.

Each one-sided budget is additionally capped at `chunk_token_size / 2` to prevent the bridging text from dominating an entire chunk.
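The two budgets can be computed as in the sketch below. `bridge_budgets` is a hypothetical name; clamping negative budgets to zero and using integer division for the `chunk_token_size / 2` cap are assumptions not stated in the text.

```python
def bridge_budgets(
    chunk_overlap_token_size: int,
    chunk_token_size: int,
    left_last_tokens: int,
    right_first_tokens: int,
) -> tuple[int, int]:
    """Return (prev_budget, next_budget) for the bridging text."""
    target_max = chunk_token_size
    side_cap = chunk_token_size // 2  # per-side cap: chunk_token_size / 2

    prev_budget = min(
        chunk_overlap_token_size, target_max - left_last_tokens, side_cap
    )
    next_budget = min(
        chunk_overlap_token_size, target_max - right_first_tokens, side_cap
    )
    # Assumption: a slice that is already full leaves no room (budget 0).
    return max(prev_budget, 0), max(next_budget, 0)
```

With the defaults (overlap 100, N = 2000), a left last slice of 1200 tokens and a right first slice of 1950 tokens yield budgets of 100 and 50 tokens respectively.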
The difference from ordinary adjacent chunk overlap:
Stage C processes content chunks that still exceed target_max after Stage B.
Restore content into paragraphs, then select paragraphs that satisfy all of the following as candidate anchors:
- Does not contain a table tag (`<table`).
- Is no longer than `max_anchor_candidate_length` (100 chars).

Based on the target sub-chunk count, ideal split positions are computed, and the candidate anchor closest to each ideal position is chosen. The chosen anchor is promoted to the new `heading` of the following sub-chunk, while the original heading is written into that sub-chunk's `parent_headings`.
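The anchor selection can be sketched as follows. Only the two candidate conditions that are legible in this section (no table tag, length within `max_anchor_candidate_length`) are checked. `pick_anchors` is a hypothetical name, the whitespace token counter stands in for the real tokenizer, and measuring "closest" by the token offset at which a paragraph starts is an assumption.

```python
from typing import Callable

MAX_ANCHOR_CANDIDATE_LENGTH = 100  # chars, as in the threshold table


def pick_anchors(
    paragraphs: list[str],
    n_sub: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[int]:
    """Return paragraph indices chosen as split anchors for n_sub sub-chunks."""
    offsets, total = [], 0
    for paragraph in paragraphs:
        offsets.append(total)  # token offset where this paragraph starts
        total += count_tokens(paragraph)

    candidates = [
        i
        for i, paragraph in enumerate(paragraphs)
        if i > 0  # assumption: the first paragraph cannot start a new sub-chunk
        and "<table" not in paragraph
        and len(paragraph) <= MAX_ANCHOR_CANDIDATE_LENGTH
    ]

    chosen: set[int] = set()
    for k in range(1, n_sub):
        ideal = total * k / n_sub  # ideal split position in tokens
        free = [i for i in candidates if i not in chosen]
        if not free:
            break
        chosen.add(min(free, key=lambda i: abs(offsets[i] - ideal)))
    return sorted(chosen)
```

Each returned index marks a short paragraph that would become the `heading` of the sub-chunk starting there.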
If no qualifying anchor exists:
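- Paragraphs are packed greedily, tables first, into sub-chunks that do not exceed `target_max`.
- A single paragraph that is still too long falls back to R character splitting (`chunking_by_recursive_character`), using `chunk_overlap_token_size` to keep continuity between adjacent text slices.

The no-anchor fallback path guarantees that the algorithm does not discard content and tries to respect the user-configured chunk size cap.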
Stage D resolves the tension between "chunks too small" and "cross-topic pollution" in fine-grained section scenarios. The core idea is to process from deeper levels to shallower levels, first merging small chunks at the same level, then allowing shallow chunks to absorb deep chunks, while introducing size constraints, table slice role constraints, and heading path constraints.
- Size constraint: a merged chunk must not exceed `target_max`; chunks that have reached `target_ideal` in principle do not continue to participate in ordinary peer merging.
- Table slice role constraint: `middle` table slices are locked as independent; `first` and `last` participate in merging directionally to prevent table boundary context from being incorrectly swallowed.
- Level constraint: peer merging requires the same `level`; cross-level absorption only allows shallow absorbing deep, disallowing deep absorbing shallow in reverse.
- Heading path constraint: the chunks being merged share the same `parent_headings`, or are within a contiguous range constrained by the same parent heading path. This is key to avoiding cross-topic pollution.

For adjacent chunks at the current level, if both are below `target_ideal` and satisfy the above constraints, merge them into one chunk.
Directional rules of table slice roles:
| Chunk role | Can forward-absorb next chunk | Can be absorbed by previous chunk |
|---|---|---|
| `none` | Yes | Yes |
| `first` | No | Yes |
| `middle` | No | No |
| `last` | Yes | No |
If a chunk that has reached target_ideal is followed by a string of peer small chunks, and the total token count of that string is below small_tail_threshold and the actual merged token count does not exceed target_max, then absorb that string in one shot. Stop when encountering a middle table slice.
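The batched tail absorption described above can be sketched as follows. `absorb_small_tail` is a hypothetical name, chunks are modeled as plain dicts with `content`, `level`, and an optional `table_chunk_role`, and treating the absorption as all-or-nothing with a single newline as connector are assumptions.

```python
from typing import Callable


def absorb_small_tail(
    head: dict,
    followers: list[dict],
    *,
    target_ideal: int,
    target_max: int,
    small_tail_threshold: int,
    count_tokens: Callable[[str], int],
) -> tuple[str, int]:
    """Return (merged_text, number_of_followers_absorbed)."""
    if count_tokens(head["content"]) < target_ideal:
        return head["content"], 0  # only saturated chunks absorb a tail

    tail = []
    for chunk in followers:
        # Stop at a middle table slice or when the level changes.
        if chunk.get("table_chunk_role") == "middle":
            break
        if chunk["level"] != head["level"]:
            break
        tail.append(chunk)
    if not tail:
        return head["content"], 0

    tail_tokens = sum(count_tokens(c["content"]) for c in tail)
    merged = "\n".join([head["content"], *(c["content"] for c in tail)])
    # Re-measure the real merged text instead of summing per-chunk counts.
    if tail_tokens < small_tail_threshold and count_tokens(merged) <= target_max:
        return merged, len(tail)
    return head["content"], 0
```

The caller would drop the absorbed followers from the chunk list and keep the head chunk's `heading`.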
For small chunks still unsaturated after Phase A, attempt cross-level merging, but only allow shallow absorbing deep:
- A chunk with the `last` role is allowed to forward-absorb; `middle` still does not participate in merging.

Because merging inserts newline connectors, chunk-by-chunk token summation may underestimate the merged result. Before committing each merge, the actual concatenated text must be re-tokenized, and the merge is committed only after confirming it does not exceed `target_max`.
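The re-measurement step can be illustrated with a toy example. `try_merge` is a hypothetical name, the `"\n\n"` connector is an assumption, and the character counter used below stands in for the real tokenizer purely to make the connector cost visible.

```python
from typing import Callable, Optional


def try_merge(
    left: str,
    right: str,
    target_max: int,
    count_tokens: Callable[[str], int],
    connector: str = "\n\n",
) -> Optional[str]:
    """Return the merged text, or None if the real merged size is too large."""
    merged = left + connector + right
    # Measure the concatenated text itself; the connector also costs tokens.
    if count_tokens(merged) <= target_max:
        return merged
    return None
```

With `len` as the counter, `"aaaa"` and `"bbbb"` sum to 8, which would pass a naive check against a cap of 8, but the merged text measures 10 and the merge is rejected.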
After merging, the main chunk's heading is retained. If multiple part slices are merged, the final heading keeps the part suffix of the main chunk, never additionally concatenating multiple part tags.
The P strategy has multiple layers of fallback protection:
| Trigger | Degradation behavior |
|---|---|
| `blocks_path` missing, unreadable, or no valid content line | Fall back entirely to `chunking_by_recursive_character()`, passing in the parsed `chunk_overlap_token_size` |
| Stage B cannot identify the JSON / HTML structure of a table | That table uses the R strategy's character splitting |
| Stage B finds a single-row table itself exceeding `table_max` | That single row uses the R strategy's character splitting |
| Stage C finds a long chunk with no qualifying short-paragraph anchor | Table first → greedy packing → fall back to R character splitting if a single paragraph is too long |
Important: After the overall fallback, capabilities such as heading hierarchy, table roles, and bidirectional bridging-text overlap are no longer available; however, it still ensures the document produces retrieval chunks and is not silently dropped due to a missing structured sidecar.
| Configuration | Default | Description |
|---|---|---|
| `CHUNK_P_SIZE` | 2000 (when unset, uses `DEFAULT_CHUNK_P_SIZE`; does not fall back to `CHUNK_SIZE`) | P-specific `chunk_token_size`; paragraph semantic merging requires a higher cap than the global default, hence an independent default rather than falling back to `CHUNK_SIZE` |
| `CHUNK_P_OVERLAP_SIZE` | Unset (falls back to `CHUNK_OVERLAP_SIZE`) | P-specific overlap; only affects long-body fallback within the same content line and the table bridging budget. Does not cause table row-level slices to overlap |
| `CHUNK_OVERLAP_SIZE` / `LightRAG(chunk_overlap_token_size=…)` | 100 | Global fallback when no P-specific overlap is set |
For configuration syntax, the priority chain, and runtime overrides via addon_params["chunker"], see FileProcessingConfiguration-zh.md §3.
A typical LIGHTRAG_PARSER setup that enables P:
LIGHTRAG_PARSER=docx:native-P,*:legacy-R
CHUNK_P_SIZE=2000
CHUNK_P_OVERLAP_SIZE=100
Or override per single file:
my-proposal.[native-P].docx
Confirm whether the native parser successfully produced .blocks.jsonl:
ls -l INPUT/__parsed__/<doc>.docx.parsed/<doc>.blocks.jsonl
If the file is missing or empty, the P strategy falls back to R entirely and gains none of P's benefits. Common causes:
- `LIGHTRAG_PARSER=docx:native-...` was not configured.
- Native parsing failed (see `pipeline_status`).

Each line is a JSON object; filter `type == "content"` and inspect whether `heading` / `level` / `parent_headings` match expectations:
jq -c 'select(.type=="content") | {level, heading, parent_headings}' \
INPUT/__parsed__/<doc>.docx.parsed/<doc>.blocks.jsonl | head
If most headings are empty or levels are abnormal, the native parser did not correctly recognize heading styles — in which case P's hierarchical merging and anchor promotion will both fail.
View chunk metadata in the text_chunks storage:
jq '.[] | {heading, level, tokens, parent_headings}' \
rag_storage/kv_store_text_chunks.json | head -30
You should observe:
- Some headings carry `[part 1]` / `[part n]` suffixes (indicating Stage B splitting occurred).
- Most chunks have token counts close to `target_ideal` (indicating Stage D took effect).
- `parent_headings` jumps at boundaries between different sections and stays stable within the same section.

Ideal distribution: most chunks fall in the range [`target_ideal`, `target_max`] (i.e., approximately 1500–2000 tokens when N = 2000); chunks noticeably smaller are usually `middle` table slices (locked as independent) or tail chunks at section boundaries.
If many tail chunks below small_tail_threshold appear, possible causes include:
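- Sections are too fine-grained (chunks with different `parent_headings` cannot merge).
- Many `middle` table slices pile up (the table itself is very large).

If the P strategy does not appear to take effect, investigate in this order: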
1. Does `full_docs[doc_id].process_options` contain `P`?
2. Is `full_docs[doc_id].parse_format` equal to `lightrag`? If it is `raw`, the document is on the legacy path and P automatically falls back to R.
3. Does the `.blocks.jsonl` pointed to by `lightrag_document_path` exist, and is it non-empty?
4. Are there `paragraph_semantic ... fallback to recursive_character` messages in the logs?

If a large table is not sliced, check:

- Whether the table tag is `<table format="json">` or `<table format="html">` (see `.blocks.jsonl`). Tables with an unrecognized format can only undergo character splitting and cannot trigger Stage B's role mechanism.
- Whether the table's token count exceeds `table_max`. Tables below the threshold remain intact and never trigger first/middle/last slicing.

If small chunks are not merged, check:

- Whether the `parent_headings` of adjacent clauses are identical: the parent heading path consistency constraint prevents cross-topic merging.
- Whether `level` is the same: peer merging requires equal `level`; cross-level absorption only allows shallow absorbing deep.
- Whether a `middle` table slice is inserted in the middle: this blocks batched tail absorption.

If a chunk exceeds `target_max`: normally, Stage D's actual token re-measurement rejects oversized merges, but oversized chunks may still occur in the following scenarios:
- A single paragraph exceeds `target_max` with no anchor to split on; eventually it goes through R character splitting but a single chunk still exceeds the limit.

`enforce_chunk_token_limit_before_embedding` performs a final hard cut before embedding; downstream will not actually embed an oversized chunk into the vector store.

`[part n]` Suffixes

- Only `[part 1]` is seen: check whether the slices were merged in Stage D; after merging, the main chunk's part suffix is retained and multiple part tags are not concatenated.
- A `[表格片段N]` suffix appears: this indicates data output by an older chunker; the new version standardizes on `[part n]`, and re-chunking is required.

P-strategy-related log keywords (for grep-based troubleshooting):
- `paragraph_semantic`: module entry
- `fallback to recursive_character`: overall or single-paragraph degradation
- `table_chunk_role`: table role-related
- `bridge`: Stage B.1 bridging text handling
- `anchor`: Stage C anchor selection