docs/FileProcessingPipeline.md
Starting from version v1.5.0 (currently on the dev branch), LightRAG's file processing pipeline has received a major upgrade:
LightRAG Server introduces an intermediate file-processing format: LightRAG Document. This format supports multimodal data such as tables and images, and also includes the document's section/paragraph metadata, which is convenient for content traceability later.
This document is organized from the perspective of LightRAG Server deployment and use: the quick-start configuration that can be applied directly is given first, followed by configuration syntax for content extraction and chunking, storage / directory layout, deduplication, concurrency, and resume rules. Developers who call the LightRAG class directly via Python should jump to Chapter 8: Python SDK Invocation.
All files are processed using the legacy document parsing and chunking strategy. Either leave LIGHTRAG_PARSER unconfigured, or set it to the following value:
LIGHTRAG_PARSER=*:legacy-F
No reliance on external document parsing services or on VLM vision models. Use the new built-in Native engine to parse docx documents with table (t) and equation (e) modality analysis enabled, paired with the P chunking strategy; other documents use the legacy content extractor paired with the more effective R chunking strategy.
LIGHTRAG_PARSER=*:native-teP,*:legacy-R
Enabling multimodal processing requires the MinerU file parsing service and a VLM vision recognition model. Use Native to parse docx files; use MinerU to parse pdf, office, and various image files. All of the above files have image (i), table (t), and equation (e) modality analysis enabled and are paired with the P chunking strategy. Other documents fall back to the legacy content extractor paired with the R chunking strategy.
LIGHTRAG_PARSER=*:native-iteP,*:mineru-iteP,*:legacy-R
VLM_PROCESS_ENABLE=true
VLM_LLM_MODEL=kimi-k2.6
MINERU_API_MODE=local
MINERU_LOCAL_ENDPOINT=http://localhost:8000
Pis LightRAG's native chunking strategy; see Paragraph Semantic Chunking for details. For VLM configuration, see Role-based LLM/VLM Configuration Guide.
LightRAG's file processing configuration is composed of two parts: the content extraction engine determines how the original file is parsed, and the processing options determine whether multimodal analysis is performed after parsing, which chunking method to use, and whether to build a knowledge graph. Typically, the environment variable LIGHTRAG_PARSER is first used to set default rules by file extension, and then a [hint] in the filename overrides individual files. Engine and options can be written in the same configuration fragment, for example docx:native-iet or report.[native-R!].docx.
For backward compatibility, when the configuration is not modified, the upgraded file content extraction behavior remains the original legacy behavior. To enable the new content processing engines, configure as described in this section.
The complete configuration model is as follows:
LIGHTRAG_PARSER=ext:engine-options,ext:engine,*:legacy-R
filename.[ENGINE].ext
filename.[ENGINE-OPTIONS].ext
filename.[-OPTIONS].ext
LIGHTRAG_PARSER is the default rule table, matched by file extension, e.g., pdf:mineru, docx:native-iet.[hint] in a filename is a single-file override rule, e.g., paper.[mineru].pdf, memo.[native-R!].docx.ENGINE is the content extraction engine: legacy, native, mineru, or docling.OPTIONS is a string combination of processing options, e.g., iet, R!, P. The options are ultimately written into process_options and read by subsequent pipeline stages.ENGINE-OPTIONS is only used to separate the engine from the options; it is not part of the options themselves.[-OPTIONS], e.g., [-!]. [abc] without a hyphen is strictly interpreted as an engine name and will raise an error; it will not fall back to being interpreted as options.Common combination examples:
LIGHTRAG_PARSER=pdf:mineru-R,docx:native-ietP,*:legacy-R
MINERU_API_MODE=local
MINERU_LOCAL_ENDPOINT=http://localhost:8000
DOCLING_ENDPOINT=http://localhost:5001
my-proposal.[native-iet].docx # Use the native engine, enable drawing/table/equation analysis
my-memo.[native-R!].docx # Use the native engine, recursive semantic chunking, disable knowledge graph construction
my-proposal.[-!].docx # Use the default engine, only disable knowledge graph construction
my-proposal.[mineru].docx # Use the MinerU engine, all processing options default
LIGHTRAG_PARSERLIGHTRAG_PARSER is used to configure the default content extraction engine for different file extensions; default processing options for the rule can also be appended after the engine:
ext:engine,ext:engine,*:legacy
ext:engine;ext:engine;*:legacy
ext:engine-options
pdf:mineru, not *.pdf:mineru., or a semicolon ;.-options suffix after the engine serves as the default process_options for files matched by this rule. For example, LIGHTRAG_PARSER=docx:native-iet means all .docx files default to the native engine with image, table, and equation analysis enabled.Square brackets in the filename can be used to temporarily specify how a single file is processed:
paper.[mineru-R].pdf
slides.[docling].pptx
memo.[native-P].docx
notes.[-R].md
The content inside the square brackets supports three forms:
[ENGINE] # Specify only the engine; processing options use the default or what LIGHTRAG_PARSER provides
[ENGINE-OPTIONS] # Specify both engine and processing options
[-OPTIONS] # Specify only processing options; the engine still follows LIGHTRAG_PARSER / default rules
When parsing the hint, content without a hyphen must match an engine name exactly (mineru / native / docling / legacy); when there is content before a hyphen, the part before the hyphen is the engine and the part after is the options; when starting with a hyphen, it specifies only options. The legacy [OPTIONS] syntax is no longer valid; for example, [iet] must now be written as [-iet].
| Engine | Description | Supported file formats (extensions) |
|---|---|---|
legacy | Legacy extraction; content is centrally extracted before joining the pipeline | txt md mdx pdf docx pptx xlsx rtf odt tex epub html htm csv json xml yaml yml log conf ini properties sql bat sh c h cpp hpp py java js ts swift go rb php css scss less |
native | Built-in intelligent structured content extractor | docx |
mineru | External MinerU content extraction engine | pdf doc docx ppt pptx xls xlsx png jpg jpeg jp2 webp gif bmp |
docling | External Docling content extraction engine | pdf docx pptx xlsx md html xhtml png jpg jpeg tiff webp bmp |
mineru and docling are external content extraction engines; before enabling related rules, the services must be running first, and the corresponding endpoint/token must be configured in LightRAG.
LightRAG caches the parsing results of the mineru and docling engines locally. Re-uploading the same file usually does not trigger the engine to re-parse the document. To delete the parse cache, you must click the "also delete file" option in the delete-file dialog of the document management interface. Modifying the endpoint addresses and effective extraction parameters of the mineru / docling engines will also invalidate the cache, causing the engine to re-parse the file content on the next upload of the same file.
The MinerU client supports two modes; choose one:
local: self-hosted MinerU service (the official Docker Compose deployment is recommended); LightRAG calls the local container via HTTP.official: directly connects to the MinerU official precise API v4; you need to apply for a token at mineru.net.Local deployment (Docker Compose)
Clone the official opendatalab/MinerU repository to your local machine, enter the docker deployment directory inside the repository, and first build the image:
docker compose -f compose.yaml build
Then start the API service (--profile api is required to enable the HTTP API container; the default listening port is 8000):
docker compose -f compose.yaml --profile api up -d
For image build details, GPU driver setup, model weight locations, etc., refer to the official README: https://github.com/opendatalab/MinerU.
LightRAG-side env configuration
Local mode (self-hosted mineru-api):
MINERU_API_MODE=local
MINERU_LOCAL_ENDPOINT=http://localhost:8000
Official mode (MinerU cloud API):
MINERU_API_MODE=official
MINERU_API_TOKEN=<your_token>
# MINERU_OFFICIAL_ENDPOINT=https://mineru.net # Default value, usually no need to change
For the remaining advanced switches (MINERU_MODEL_VERSION, MINERU_LANGUAGE, MINERU_ENABLE_TABLE / MINERU_ENABLE_FORMULA, MINERU_PAGE_RANGES, MINERU_LOCAL_BACKEND / MINERU_LOCAL_PARSE_METHOD, MINERU_POLL_INTERVAL_SECONDS / MINERU_MAX_POLLS, MINERU_ENGINE_VERSION, LIGHTRAG_FORCE_REPARSE_MINERU, etc.), refer to the MinerU section of the env.example template at the repository root. Note that MINERU_PAGE_RANGES has different semantics in the two modes: official supports a complete list (e.g., 1-3,5,7-9), while local only supports a single page (3) or a simple range (1-10); it does not accept comma-separated lists.
The docling content extraction engine requires an external docling-serve service (v1 async API). Minimal configuration:
DOCLING_ENDPOINT=http://localhost:5001
DOCLING_ENDPOINT is just the base URL (without /v1/convert/file/async). Currently LightRAG uses Docling's standard pipeline to process files. Users can control the behavior of the Docling pipeline through the following environment variables:
| Env | Default | Meaning |
|---|---|---|
DOCLING_DO_OCR | true | OCR master switch |
DOCLING_FORCE_OCR | true | Force OCR per page (mandatory for scanned documents; enabling it for non-scanned documents usually also helps improve layout recognition quality) |
DOCLING_OCR_ENGINE | auto | OCR engine selection (not recommended to change) |
DOCLING_OCR_PRESET | auto | OCR engine preset (not recommended to change) |
DOCLING_OCR_LANG | (empty) | Set per OCR engine requirements (not recommended to change) |
DOCLING_DO_FORMULA_ENRICHMENT | false | Whether to recognize equations in the document and output them in LaTeX format; before enabling, ensure that Docling has downloaded the equation recognition model on the backend (see explanation below) |
When DOCLING_OCR_ENGINE / DOCLING_OCR_PRESET are not configured, they are equivalent to auto; when DOCLING_OCR_LANG is not configured, no language list is passed to docling-serve, and the OCR engine uses its own default. The parse cache signature is computed from these effective parameters, so "not configured" and "explicitly set to the default value" do not invalidate the cache.
Two polling-budget envs (docling-serve uses server-side long-poll; the client does not sleep extra):
| Env | Default | Meaning |
|---|---|---|
DOCLING_POLL_INTERVAL_SECONDS | 5 | Poll interval for awaiting parse results |
DOCLING_MAX_POLLS | 240 | Maximum poll iterations; raises TimeoutError when exceeded; |
| default wait time ≈ 5 × 240 (about 20 minutes) |
Three bundle-cache envs:
| Env | Default | Meaning |
|---|---|---|
DOCLING_ENGINE_VERSION | (empty) | Docling engine version; version changes invalidate the parse cache |
LIGHTRAG_FORCE_REPARSE_DOCLING | false | When set to true/1, the parse cache is not used |
DOCLING_BBOX_ATTRIBUTES | {"origin":"LEFTBOTTOM"} | Default coordinate system for Docling layout |
Prerequisites for DOCLING_DO_FORMULA_ENRICHMENT: the docling-serve side must have the code-formula model weights ready. The adapter is dual-track compatible — when enabled, the text field is LaTeX; when disabled, or when missing weights cause text == orig, it falls back to plain text and does not write equations.json. Therefore the default of false is conservative; turn it on only after confirming the model is ready on the deployment side.
The following uses a Docker-based docling-serve deployment as an example, giving the complete steps from image download to model mounting. After deployment completes, write DOCLING_DO_FORMULA_ENRICHMENT=true into LightRAG's .env to enable LaTeX equation recognition.
Important: the steps below are based on an environment where the GPU supports CUDA 13. If your GPU is older and does not support CUDA 13, replace the image name
docling-serve-cu130:mainin the command and compose file with the tag corresponding to your CUDA version. For the list of available images, see docling-serve Packages.
1. Pull the image
docker pull ghcr.io/docling-project/docling-serve-cu130:main
2. Download models
# Create the docling working directory
mkdir docling
cd docling
# Create the model mount directory
mkdir models
# Copy the existing models inside the container into the models directory
docker run --rm -it \
-v "$(pwd)/models:/opt/app-root/src/models" \
ghcr.io/docling-project/docling-serve-cu130:main \
cp -r /opt/app-root/src/.cache/docling/models /opt/app-root/src/
# Download the equation recognition model
docker run --rm \
-v "$(pwd)/models:/opt/app-root/src/models" \
-e DOCLING_SERVE_ARTIFACTS_PATH="/opt/app-root/src/models" \
ghcr.io/docling-project/docling-serve-cu130:main \
docling-tools models download-hf-repo docling-project/CodeFormulaV2 -o models
3. Create docker-compose.yaml
Create docker-compose.yaml in the docling directory from the previous step, with the following contents:
services:
docling-serve:
image: ghcr.io/docling-project/docling-serve-cu130:main
container_name: docling-serve
ports:
- "5001:5001"
environment:
DOCLING_SERVE_ENABLE_UI: "true"
NVIDIA_VISIBLE_DEVICES: "all"
DOCLING_SERVE_ARTIFACTS_PATH: "/opt/app-root/src/models"
# deploy: # This section is for compatibility with Swarm
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
runtime: nvidia
restart: always
volumes:
- ./models:/opt/app-root/src/models
Then execute docker compose up -d in that directory to start the service. After the container is ready, set the following in LightRAG's .env:
DOCLING_ENDPOINT=http://localhost:5001
DOCLING_DO_FORMULA_ENRICHMENT=true
This enables LightRAG to recognize equations in documents via the local docling-serve and output them in LaTeX form.
Processing options control the behavior of a single file with respect to multimodal analysis, knowledge graph construction, and text chunking. All options are optional; defaults are shown in the table below. At most one chunking method (F/R/V/P) is specified per file; the other options can be combined arbitrarily.
| Option | Type | Default | Meaning |
|---|---|---|---|
i | Multimodal | Off | Enable image analysis (VLM) |
t | Multimodal | Off | Enable table analysis (VLM) |
e | Multimodal | Off | Enable equation analysis (VLM) |
! | Pipeline | Off | Disable entity/relation extraction; do not build the knowledge graph (only the chunks vector index is kept; naive / mix retrieval still works) |
F | Chunking | Default | Fix / fixed-length chunking: legacy method, splits mechanically by fixed token length or by separator (no chunk overlap when splitting by separator) |
R | Chunking | - | Recursive / recursive character chunking (RecursiveCharacterTextSplitter@LangChain): takes a list of separators (default ["\n\n","\n","。","!","?",";",","," ",""], ordered from strongest to weakest semantic boundary). Splits by paragraph (double newline) first; if a chunk is still over the token limit, falls back stepwise to single newline → Chinese sentence-ending punctuation (。!?) → Chinese mid-sentence punctuation (;,) → space → per-character split. The default cascade includes Chinese punctuation, letting Chinese / mixed Chinese-English documents split at semantic boundaries. English .?! is deliberately excluded (literal matching would mis-split 0.95 / e.g.). |
V | Chunking | - | Vector / semantic vector chunking (SemanticChunker@LangChain): first splits text into sentences (the default sentence splitting regex recognizes both English .?! and Chinese 。?!, allowing correct sentence splitting in Chinese / mixed Chinese-English documents), computes embeddings of adjacent sentences, then finds semantic breakpoints based on the specified threshold strategy (e.g., percentile, standard_deviation, or interquartile) for splitting. SemanticChunker itself has no chunk size cap — any semantic chunk that exceeds chunk_token_size is automatically split again by R before persistence (preserving V's non-overlap semantics). This chunking strategy never produces overlapping chunks. |
P | Chunking | - | Paragraph / paragraph semantic chunking (native); splits by heading first and strictly avoids mixing content from the bottom of the previous heading with content from the next heading, which would break semantics. Suited for chunking documents that can accurately identify headings with a clear heading structure. When the body under the same heading is too long and falls back to R, overlap can be preserved according to CHUNK_P_OVERLAP_SIZE; bridging text between adjacent large tables can also be repeated into the surrounding table chunks within that budget. This chunking method can only be applied to lightrag content stored in the sidecar directory. If lightrag content does not exist, it degrades to chunking with R. This chunking method produces far fewer overlapping chunks than the R or F strategies. |
The global multimodal switch
addon_params["enable_multimodal_pipeline"]is deprecated; the related behavior is now uniformly controlled by the file-leveli/t/eoptions. See Appendix A.
Different characters of processing options take effect at different stages of the pipeline:
| Option | Stage | Description |
|---|---|---|
| i/t/e | Analyzing (multimodal analysis) | Determines whether VLM summarization analysis is invoked on the images / tables / equations in the sidecar. The extraction stage is unaffected: the content extraction engine outputs drawings.json / tables.json / equations.json sidecar files based on what the document actually contains. As a result, simply tweaking the i/t/e options to trigger "re-analysis" can complete VLM later without re-parsing the original file. |
| ! | Extraction (entity-relation extraction) | Skips entity/relation extraction and graph writing; chunks are still written to the vector store to retain naive / mix retrieval capabilities. |
| F/R/V/P | Chunking (text chunking) | Determines which chunking strategy to use; does not affect the output of the parsing stage. |
Modality availability is signaled solely by "whether the sidecar file exists"; the content extraction engine does not need to declare its capabilities in meta. If a given document contains no images/tables/equations, the corresponding sidecar is not written; even if the user has enabled
i/t/e, the corresponding modality is silently skipped, butanalyze_multimodallogs an INFO-level line for that document ([analyze_multimodal] sidecar e:equations empty: doc—id ...), making it easy to diagnose "why didn't the VLM run". This is not an error.
LIGHTRAG_PARSER is strictly validated at startup: unknown content extraction engines, malformed extension syntax, explicitly using an unsupported extension, external engines missing endpoint, and illegal characters in processing options all cause startup to fail.parser_routing._engine_is_usable): (a) the engine's capability table supports that extension; (b) if it is an external engine (mineru / docling), the corresponding endpoint/token environment variable is configured. If either check fails, the rule is skipped and the next rule is matched. For example, in *:mineru;html:docling: MinerU does not support the html extension (condition a fails), so html continues to match docling; if MINERU_API_MODE=local but MINERU_LOCAL_ENDPOINT is not set, all PDFs also skip *:mineru and fall to the next rule (condition b fails). This behavior applies to both LIGHTRAG_PARSER rule matching and filename hint engine selection.LIGHTRAG_PARSER. If the engine specified in a hint does not support that extension, the system falls back to the default rules to continue selecting an available engine.LIGHTRAG_PARSER are used; if neither is provided, all defaults are used.legacy; if legacy also does not support the file extension, an error entry is added to the system and the uploaded file remains in the INPUT directory.LIGHTRAG_PARSER defaults or all defaults; a warning is also logged.P is only effective for structured LightRAG Document results extracted by native; for the legacy path or unstructured output, it automatically degrades to R and logs a warning.process_options selects which chunking strategy (F/R/V/P), while chunk_options decides which parameters that chunker uses. The two responsibilities are orthogonal: the former is a single-character selector, the latter is a structured dictionary.
env vars (read once at startup)
│
▼
addon_params["chunker"] (LightRAG instance field, filled by env with legacy fallback)
│
▼ resolve_chunk_options(addon_params, split_by_character=…, split_by_character_only=…)
│
full_docs[doc_id]["chunk_options"] (frozen at enqueue time, an independent snapshot per file)
│
▼
chunker(tokenizer, content, chunk_token_size, **strategy_kwargs) (dispatched by selector during chunking)
addon_params["chunker"] during the LightRAG.__init__ stage (strategy-specific env is read by default_chunker_config(), then _apply_chunk_size_overlay fills in legacy env as a fallback).addon_params["chunker"] is an ObservableAddonParams field; for Server deployments, you only need env / restart for the new values to take effect. To change it at runtime within the Python process (without restarting) and to do per-file overrides, see Chapter 8: Python SDK Invocation.full_docs.chunk_options is frozen at apipeline_enqueue_documents enqueue time: by default it is assembled by resolve_chunk_options(self.addon_params, ...) on the spot; if the caller passes a chunk_options argument, it is persisted as-is (SDK usage, see §8.4).full_docs.chunk_options and dispatches to F/R/V/P by the process_options.chunking selector.All variables in the table below are read into addon_params["chunker"] once when LightRAG is instantiated: strategy-specific env is read by default_chunker_config(), while legacy env (CHUNK_SIZE / CHUNK_OVERLAP_SIZE) is filled in by _apply_chunk_size_overlay into slots that neither strategy env nor legacy constructor fields filled. After modifying env, the service must be restarted (or a new LightRAG instance created) for it to take effect; documents already enqueued hold the frozen snapshot and are unaffected.
| Variable | Default | Type | Scope |
|---|---|---|---|
CHUNK_SIZE | 1200 | int | Legacy top-level chunk_token_size fallback; lower priority than strategy-specific env and the SDK path setting of addon_params["chunker"]["chunk_token_size"] |
CHUNK_OVERLAP_SIZE | 100 | int | Legacy overlap fallback; filled when a strategy has neither a specific env (CHUNK_F_OVERLAP_SIZE / CHUNK_R_OVERLAP_SIZE / CHUNK_P_OVERLAP_SIZE) nor the SDK path's LightRAG(chunk_overlap_token_size=…) |
CHUNK_F_SIZE | unset | int | F strategy-specific chunk_token_size; higher than the top-level legacy fallback (CHUNK_SIZE and the SDK path's LightRAG(chunk_token_size=…)). When unset, F inherits the top-level resolved value. |
CHUNK_F_OVERLAP_SIZE | unset | int | F strategy-specific overlap; higher than the legacy constructor field and CHUNK_OVERLAP_SIZE |
CHUNK_F_SPLIT_BY_CHARACTER | (unset = null) | str? | F pre-split separator; null / empty string = split by token window only |
CHUNK_F_SPLIT_BY_CHARACTER_ONLY | false | bool | F strict mode: no secondary token split; raise error when oversized |
CHUNK_R_SIZE | unset | int | R strategy-specific chunk_token_size; higher than top-level legacy fallback (CHUNK_SIZE and the SDK path's LightRAG(chunk_token_size=…)). When unset, R inherits the top-level resolved value. |
CHUNK_R_OVERLAP_SIZE | unset | int | R strategy-specific overlap; higher than the legacy constructor field and CHUNK_OVERLAP_SIZE |
CHUNK_R_SEPARATORS | ["\n\n","\n","。","!","?",";",","," ",""] | JSON array string | R separator cascade, ordered from strongest to weakest semantic boundary. The default includes Chinese sentence-ending (。!?) and mid-sentence (;,) punctuation, letting Chinese / mixed Chinese-English documents split at semantic boundaries. English .?! is deliberately excluded (literal matching would mis-split numbers and abbreviations). |
CHUNK_V_SIZE | unset | int | V strategy-specific chunk_token_size (hard cap, automatically re-split through R when exceeded); higher than the top-level legacy fallback. When unset, V inherits the top-level resolved value. |
CHUNK_V_BREAKPOINT_THRESHOLD_TYPE | percentile | str | V threshold type; can be percentile / standard_deviation / interquartile / gradient |
CHUNK_V_BREAKPOINT_THRESHOLD_AMOUNT | (unset = null) | float? | V threshold magnitude; null lets LangChain pick the default by type (e.g., percentile=95) |
CHUNK_V_BUFFER_SIZE | 1 | int | V sentence buffer window; the number of adjacent sentences to merge during distance computation |
CHUNK_V_SENTENCE_SPLIT_REGEX | (?<=[.?!])\s+|(?<=[。?!]) | str | V's sentence splitting regex, fed to LangChain's SemanticChunker. The default recognizes both English .?! (requiring trailing whitespace to avoid mis-splitting 0.95) and Chinese 。?! (no whitespace required, fitting Chinese continuous writing). The env value is the raw regex string; no JSON quoting needed. |
CHUNK_P_SIZE | 2000 (DEFAULT_CHUNK_P_SIZE) | int | P strategy-specific chunk_token_size. Unlike R/V, P does NOT inherit the top-level CHUNK_SIZE / LightRAG(chunk_token_size=…) when unset — paragraph-semantic merging needs more headroom than the global default to keep related paragraphs together, so the slot always carries DEFAULT_CHUNK_P_SIZE (2000) instead. |
CHUNK_P_OVERLAP_SIZE | unset | int | P strategy-specific overlap; higher than the legacy constructor field and CHUNK_OVERLAP_SIZE. Used for text overlap when long body text within the same JSONL content line falls back to R, and as the per-side budget for bridging text copied into the adjacent large-table chunks. |
P's internal ratio constants are algorithmic scales and are automatically derived in proportion to chunk_token_size. P always uses an independent chunk_token_size decoupled from the global chain — even when CHUNK_P_SIZE is unset, P falls back to DEFAULT_CHUNK_P_SIZE (2000) rather than the global CHUNK_SIZE, because paragraph-semantic merging needs more headroom than the global default to keep related paragraphs together. Use CHUNK_P_SIZE to override that default per deployment. CHUNK_P_OVERLAP_SIZE only affects P's internal plain-text fallback and table bridging context; it does not let table row-level slices overlap each other. CHUNK_F_SIZE / CHUNK_R_SIZE / CHUNK_V_SIZE work differently — when unset they DO fall back to the top-level chunk_token_size (F is the default global window, R prefers a smaller target to better split sentences, while V — as an advisory ceiling — typically wants to be enlarged to reduce over-splitting).
The final value of each chunking slot is resolved by a specificity-ordered chain (high → low):
addon_params["chunker"] explicit value — field values explicitly written at construction time or set at runtime via the SDK path (see §8.3). Server-only deployments usually don't hit this tier. Most direct; wins everything.CHUNK_F_SIZE / CHUNK_R_SIZE / CHUNK_V_SIZE (per-strategy chunk_token_size), CHUNK_F_OVERLAP_SIZE / CHUNK_R_OVERLAP_SIZE / CHUNK_P_OVERLAP_SIZE (overlap), CHUNK_P_SIZE (P-specific). When the corresponding size env is unset, F/R/V inherit the top-level chunk_token_size. Filled only when the slot is not already occupied by ①.LightRAG(chunk_token_size=…, chunk_overlap_token_size=…); only effective on the SDK path, see §8.2. Strategy-agnostic, "coarse-grained default", fills only the slots still empty.CHUNK_SIZE / CHUNK_OVERLAP_SIZE. Final fallback.Example: CHUNK_R_OVERLAP_SIZE=42 + LightRAG(chunk_overlap_token_size=2) → R sub-dictionary chunk_overlap_token_size=42 (strategy env wins), F / P sub-dictionary chunk_overlap_token_size=2 (no F / P-specific env; the legacy constructor field is filled in).
Special case for P's chunk_token_size: the P chunk_token_size slot does NOT walk the full four-tier chain. When ① is not explicitly provided, it resolves directly via CHUNK_P_SIZE env > DEFAULT_CHUNK_P_SIZE (2000), skipping ③ legacy constructor field LightRAG(chunk_token_size=…) and ④ legacy env CHUNK_SIZE. See the CHUNK_P_SIZE row in §3.2 for the rationale.
Three layers of semantic guarantee:
process_options) also reads full_docs.chunk_options, preventing env drift from breaking consistency.chunk_options for each file (typical usage: a management UI configures separators or V threshold individually for a certain file). These are the input semantics on the SDK path; see §8.4.addon_params["chunker"] (instance field) keeps the sub-dictionaries of all four strategies as the runtime baseline; full_docs[doc_id]["chunk_options"] is a slim snapshot — at enqueue time, only the strategy sub-dictionary selected by process_options is kept (default F), and the parameters of other strategies are discarded, because the processing stage will not read them. When re-parsing, process_options and chunk_options are rewritten together, avoiding residue of old-strategy parameters.
addon_params["chunker"] full baseline (modifiable at runtime via SDK, affecting subsequent enqueues):
{
"chunk_token_size": 1200, // common token cap
"fixed_token": { // F-specific
"chunk_token_size": 1200, // optional; when omitted, inherits the top-level chunk_token_size (seedable via CHUNK_F_SIZE)
"chunk_overlap_token_size": 100,
"split_by_character": null,
"split_by_character_only": false
},
"recursive_character": { // R-specific
"chunk_token_size": 1200, // optional; when omitted, inherits the top-level chunk_token_size
"chunk_overlap_token_size": 100,
"separators": ["\n\n", "\n", "。", "!", "?", ";", ",", " ", ""] // default cascade includes Chinese punctuation
},
"semantic_vector": { // V-specific
"chunk_token_size": 1200, // optional hard cap; re-split through R when exceeded
"breakpoint_threshold_type": "percentile", // percentile | standard_deviation | interquartile | gradient
"breakpoint_threshold_amount": null, // null = LangChain default
"buffer_size": 1,
"sentence_split_regex": "(?<=[.?!])\\s+|(?<=[。?!])" // default regex handles both English and Chinese sentence-ending punctuation
},
"paragraph_semantic": { // P-specific
"chunk_token_size": 2000, // when omitted, resolves from CHUNK_P_SIZE or DEFAULT_CHUNK_P_SIZE (2000);
// does NOT inherit the common chunk_token_size
"chunk_overlap_token_size": 100 // when omitted, inherits the legacy overlap resolution chain
}
}
full_docs[doc_id]["chunk_options"] slim snapshot (projected by selector; example below is for process_options="R"):
{
"chunk_token_size": 1200, // common token cap (kept as a top-level fallback)
"recursive_character": { // the only retained strategy sub-dictionary
"chunk_overlap_token_size": 100,
"separators": ["\n\n", "\n", "。", "!", "?", ";", ",", " ", ""]
}
}
selector → sub-dictionary mapping: F → fixed_token, R → recursive_character, V → semantic_vector, P → paragraph_semantic; without a selector, F is the default. Each sub-dictionary corresponds one-to-one with the keyword-only parameters of the corresponding chunker function; when adding new parameters, no dispatcher change is needed, just add a kwarg to the chunker function.
Old documents at enqueue time don't yet have the chunk_options field; during chunking, the dispatcher calls resolve_chunk_options(self.addon_params, process_options=…) per the current process_options to fall back to a slim snapshot. After upgrading, it is recommended to run a reprocess once to give old documents a slim chunk_options snapshot (aligned with the current process_options).
full_docs FieldsFile enqueue and extraction results are written into full_docs:
| Field | Description |
|---|---|
file_path | Basename of the filename (without directory), preserves the original name provided by the user (including the square-bracket hint), e.g., abc.[native-iet].docx is written as-is. When no valid source is provided, it is saved as unknown_source. The filename hint is not stripped, so the management UI can directly show the user's original naming intent. |
canonical_basename | The canonicalized basename with the processing hint stripped (e.g., abc.docx). Filename deduplication uses this field as the index key, ensuring abc.docx and abc.[native-iet].docx are treated as the same logical document. |
source_path | The original path provided at enqueue time (written only when it contains a directory separator or is an absolute path), used by the native / mineru / docling parsers to locate the actual file. |
parse_format | Content format: pending_parse, raw, lightrag. |
content | When raw, holds the extracted text; when pending_parse, it is an empty string; when lightrag, holds the complete merged text starting with {{LRdoc}} (concatenated body segments of all type=="content" lines in .blocks.jsonl). During chunking, parse_native strips the prefix and hands it to the chunking_func, going through exactly the same code path as raw. |
content_hash | MD5 of the content, used for cross-filename deduplication. For parse_format=raw, takes the hash of text after sanitize_text_for_encoding; for parse_format=lightrag, takes the hash of the *.blocks.jsonl file; for parse_format=pending_parse, not written, filled in after extraction completes. |
lightrag_document_path | When parse_format=lightrag, saves the path to the structured LightRAG Document; new records prefer to save the path relative to INPUT_DIR, e.g., __parsed__/report.docx.parsed/report.blocks.jsonl. Note that the subdirectories and the blocks filename in the path both use the canonicalized basename (without hint). |
parse_engine | The engine that actually completed extraction: legacy, native, mineru, docling. For files awaiting extraction, can also temporarily store the target engine. |
process_options | The original processing options string recorded at enqueue time (without engine name and the separator -), e.g., "iet", "R!", "". Downstream stages take this field as the authoritative source for deciding whether to enable image / table / equation analysis (i/t/e), whether to disable knowledge graph construction (!), and the chunking method (F/R/V/P). An empty string is equivalent to all defaults. |
chunk_options | The frozen snapshot of chunker parameters at enqueue time (slim dictionary: only the strategy sub-dictionary selected by process_options is retained, others discarded). Passed in by the SDK-path caller or assembled by resolve_chunk_options(self.addon_params, process_options=…) from instance fields (containing env defaults) as a fallback (see §3.1). process_options chooses which chunking strategy (F/R/V/P); chunk_options decides which parameters that chunker uses. The downstream process_single_document reads strategy-specific kwargs from this field before chunking; persistence guarantees that old documents behave reproducibly across env changes, resumes, and restarts. Rewritten together with process_options when re-parsing. |
pending_parse indicates the file has been enqueued but extraction is not yet complete. After successful extraction, it is rewritten to raw or lightrag, and content_hash is filled in. On extraction failure, pending_parse and the empty content are kept, making subsequent troubleshooting and retry easier.
The original
file_path(with hint),canonical_basename, andcontent_hashare also synchronized intodoc_status, serving as the deduplication index sources forget_doc_by_file_basename/get_doc_by_content_hash.get_doc_by_file_basenameinternally canonicalizes the input throughcanonicalize_parser_hinted_basenamebefore comparing againstcanonical_basename, soabc.docxandabc.[native-iet].docxalways hit the same document.process_optionsis also mirrored intodoc_status.metadata["process_options"], making it convenient for the management UI to directly display the current file's processing policy.
__parsed__ Directory Structure__parsed__ is the archival and analysis-result directory next to the input directory. It both stores already-processed original documents and the LightRAG Document (lightrag format) files and image assets produced by structured parsing.
legacy local extraction succeeds and enqueueing finishes, the original file is moved into the sibling __parsed__ directory; native / mineru / docling keep the original file first for the pipeline to parse, and only move it to __parsed__ after successful parsing and writing to full_docs. When archived, the original filename (including [hint]) is preserved, e.g., report.[native-iet].docx is archived as __parsed__/report.[native-iet].docx, making it easy to trace the user's original name and processing options.[hint] removed) plus the .parsed suffix, avoiding name conflicts with the archived original file and ensuring that the same logical document continues to point to the same directory when the filename hint or processing options change. For example, the analysis results of report.docx, report.[native].docx, and report.[native-iet].docx are all written into __parsed__/report.docx.parsed/.__parsed__/report.docx.parsed/report.blocks.jsonl; the same directory may also contain report.tables.json, report.drawings.json, report.equations.json, and the report.blocks.assets/ image asset directory. Whether a sidecar is generated is determined by the document content: the parser only writes the corresponding file when the document actually contains tables / images / equations. This is the only signal of modality availability — the engine does not need to declare capabilities in meta. The i/t/e options only determine whether the next stage invokes the VLM for summarization analysis on already-existing sidecars./documents/scan encounters a file with the same name that is already PROCESSED, the input file is treated as already processed and moved to __parsed__, not enqueued as a new document./documents/scan finds multiple files that share the same canonicalized name in the same scan, it prefers the file with a supported engine hint to respect the user's engine selection; if no variant has a hint, it processes the first file in sorted order. Other variants emit warnings and are moved to __parsed__, avoiding files in the same batch overwriting each other. For example, if both abc.docx and abc.[native].docx exist, only abc.[native].docx is processed.__parsed__; this doc_status entry is kept as FAILED duplicate for tracking._001, _002, etc., e.g., report.pdf is archived as report_001.pdf, report_002.pdf. If the analysis result directory name is already taken by a regular file, a number is also appended, e.g., report.docx.parsed_001/.<base>.mineru_raw/The mineru engine writes the complete artifacts returned by the MinerU service (content_list.json + optional full.md / middle.json / layout.pdf / images/, etc.) into the __parsed__/<canonical filename>.mineru_raw/ directory during parsing, and writes _manifest.json as the integrity validation file.
Design goals:
_manifest.json; on hit, the MinerU service call is skipped and the local content_list.json is fed directly through adapter → SidecarWriter.*.mineru_raw/ to compare the original content_list and image assets.drawings.json / tables.json / equations.json generated by MinerU save content_list.json#/N in self_ref, used for looking up the corresponding MinerU original object and its page_idx / bbox, etc.[mineru-...] / [-iet], the MinerU API is called with the canonicalized filename (hint removed), to avoid hint-bearing filenames inside the raw bundle returned by MinerU.Lifecycle:
| Operation | Behavior |
|---|---|
| First parse | Download all artifacts → atomically write _manifest.json. |
| Re-parse (cache hit) | Do not call the MinerU service; do not rewrite artifacts; rerun adapter+Writer to regenerate sidecar (for adapter upgrade scenarios). |
| Re-parse (cache miss) | Clear all files in the directory, then re-download and write manifest. |
DELETE /documents with delete_file=True | *.parsed/, *.mineru_raw/, and the original file are all deleted together. |
DELETE /documents with delete_file=False | All artifacts are preserved; only doc_status and KG data are deleted. |
clear_documents / a full sweep of __parsed__ | Naturally cleared together. |
| scan cycle | Does not actively GC orphan *.mineru_raw/ (only cleared on explicit deletion by the user, to avoid accidentally removing the debug site). |
Force re-parse (bypass cache): set LIGHTRAG_FORCE_REPARSE_MINERU=true.
Concurrency safety: LightRAG mandates canonical_basename uniqueness within the same workspace (HTTP 409 on upload/enqueue), and combined with the pipeline's serialization per document, *.mineru_raw/ has no concurrent write conflicts and needs no extra locks.
_manifest.json invalidation conditions (any triggers a cache miss):
MINERU_ENGINE_VERSION environment variable and the engine_version recorded in manifest are both non-empty but inconsistent;MINERU_API_MODE and the api_mode recorded in manifest are both non-empty but inconsistent;MINERU_OFFICIAL_ENDPOINT / MINERU_LOCAL_ENDPOINT) and the endpoint_signature recorded in manifest are both non-empty but inconsistent;content_list.json size or sha256 does not match manifest;middle.json, etc.) does not match manifest.About the "either side empty → skip" semantics of
engine_version/endpoint_signature: when the field was empty at manifest-write time (e.g.,MINERU_ENGINE_VERSIONwas not configured at first parse), or when the current environment variable is not set, the check is skipped for that item. If the version env was not set at first parse, setting it later does not automatically invalidate the historical cache — this scenario requires manually settingLIGHTRAG_FORCE_REPARSE_MINERU=trueto trigger re-parsing.
<base>.docling_raw/The docling engine extracts the zip artifact returned by docling-serve (DoclingDocument JSON, Markdown, and referenced images) into the __parsed__/<canonical filename>.docling_raw/ directory during parsing, and writes _manifest.json as the integrity validation file. On a subsequent parse, the IR builder reads the .json file in that directory and feeds it to DoclingIRBuilder, no longer calling docling-serve.
Directory layout:
__parsed__/<base>.docling_raw/
├── _manifest.json
├── <base>.json # DoclingDocument JSON (contains pages[].image base64)
├── <base>.md # Markdown form, for human inspection
└── artifacts/
└── image_*.png # image assets referenced by pictures[*].image.uri
Design goals:
_manifest.json; on hit, the upload / poll / download against docling-serve is skipped, and the local .json is fed directly through DoclingIRBuilder → SidecarWriter.*.docling_raw/ to compare the original DoclingDocument JSON, Markdown, and artifacts/ images.Lifecycle:
| Operation | Behavior |
|---|---|
| First parse | POST /v1/convert/file/async upload → long-poll /v1/status/poll/{task_id}?wait=N → GET /v1/result/{task_id} download zip → safe extraction (rejecting absolute paths and ..) → atomically write _manifest.json. |
| Re-parse (cache hit) | Do not call docling-serve; do not rewrite artifacts; rerun adapter+Writer to regenerate sidecar (for adapter upgrade scenarios). |
| Re-parse (cache miss) | Clear all files in the directory, then re-upload / download / write manifest. |
DELETE /documents with delete_file=True | *.parsed/, *.docling_raw/, and the original file are all deleted together. |
DELETE /documents with delete_file=False | All artifacts are preserved; only doc_status and KG data are deleted. |
clear_documents / a full sweep of __parsed__ | Naturally cleared together. |
| scan cycle | Does not actively GC orphan *.docling_raw/ (only cleared on explicit deletion by the user, to avoid accidentally removing the debug site). |
Force re-parse (bypass cache): set LIGHTRAG_FORCE_REPARSE_DOCLING=true.
Concurrency safety: identical to the MinerU path — LightRAG mandates canonical_basename uniqueness within the same workspace (HTTP 409 on upload / enqueue), and combined with the pipeline's serialization per document, *.docling_raw/ has no concurrent write conflicts and needs no extra locks.
_manifest.json invalidation conditions (any triggers a cache miss):
DOCLING_ENDPOINT does not match the endpoint_signature recorded in manifest;DOCLING_ENGINE_VERSION is set and does not match the engine_version recorded in manifest;options_signature does not match — any OCR / equation / pipeline field change triggers it, covering:
DOCLING_DO_OCR / DOCLING_FORCE_OCR / DOCLING_OCR_ENGINE / DOCLING_OCR_PRESET / DOCLING_OCR_LANG / DOCLING_DO_FORMULA_ENRICHMENT;pipeline / target_type / to_formats / image_export_mode (written into the signature to prevent old bundles from being mistakenly reused if these values change in the future);artifacts/ missing or size mismatch;LIGHTRAG_FORCE_REPARSE_DOCLING=true.The "either side empty → skip" semantics of
engine_version/endpoint_signatureis the same as MinerU §4.3: when the field was empty at manifest-write time (first parse withoutDOCLING_ENGINE_VERSIONconfigured) or when the current environment variable is not set, the check is skipped for that item; adding the version number later does not automatically invalidate the historical cache;LIGHTRAG_FORCE_REPARSE_DOCLING=trueis needed to trigger.
File upload, file-parse enqueue, and the text APIs check duplicates against two gates: "filename + content hash". Hitting either is considered a duplicate, and a FAILED record is written without overwriting the existing full_docs. /documents/scan directory scanning uses the same set of indexes, but in order to facilitate automatic retry of unfinished files, it has separate archive and re-process rules for duplicate filenames.
/data/a.pdf, inputs/a.pdf, and a.pdf are all considered the same filename a.pdf.canonical_basename as the index: the supported-engine processing hint at the end of the filename is stripped before comparison, so abc.docx, abc.[native].docx, and abc.[native-iet].docx are considered the same name. Unsupported hints are not stripped; e.g., abc.[draft].docx is still treated by its original filename.doc_status — whether that record is currently PENDING, PARSING, ANALYZING, PROCESSING, FAILED, or PROCESSED — the same-name file is considered a duplicate./documents/scan directory scan:
__parsed__ and skipped.PROCESSED, the file just scanned is treated as already processed; the system emits a warning, moves the input file to the sibling __parsed__ directory, and skips enqueueing.PROCESSED, the scanned file is not skipped simply because of the same name, but also does not re-extract / overwrite the existing record. The specific path depends on the form of the existing record (consistent with the classification rules listed below in the "Why is scan still the exclusive writer" section):
full_docs present → resume path: doc_status is preserved as-is, the source file remains in INPUT/, and the processing loop picks it up by status query (no re-extract, no overwrite of existing status).FAILED with full_docs missing → recognized as an extraction-error stub written by apipeline_enqueue_error_documents: scan deletes the stub and enqueues the current file as a new file. This is the only sub-branch that re-extracts; the purpose is to make "fix the source file, scan again" automatically take effect.file_source, and duplicates are checked by the basename of file_source; lacking a valid file_source returns 400 directly.insert / ainsert / apipeline_enqueue_documents without file_paths, that is allowed; related behavior is detailed in §8.4. Such documents without a source have file_path saved as unknown_source.no-file-path, and unknown_source are all considered unknown sources; they do not block new source-less text from being enqueued, nor do they deduplicate each other as same-named files.The storage backend provides basename direct lookup via get_doc_by_file_basename, internally comparing against the canonical_basename field (the input parameter is first canonicalized through canonicalize_parser_hinted_basename). JsonDocStatusStorage already implements an in-memory traversal; other backends currently fall back to the default implementation (scanning all states and comparing canonical_basename), to be augmented with native indexes in subsequent PRs.
full_docs and doc_status write or fill in the content_hash field according to the content format:
parse_format=raw: the MD5 of the text after sanitize_text_for_encoding.parse_format=lightrag: the MD5 of the *.blocks.jsonl file parsed out of lightrag_document_path. Relative paths are resolved against INPUT_DIR.parse_format=pending_parse: no hash is written yet; it is filled in by subsequent steps after parsing actually completes (to avoid mistakenly judging by empty content).legacy path deduplicates content hashes after locally extracting text and during enqueue; on hit, this record is written as FAILED duplicate, and no new full_docs, chunks, or graph data are generated.native / mineru / docling paths first enqueue with pending_parse; after parsing completes and content_hash is filled in, if another document already has the same hash, this record is stopped before entering analysis, chunking, entity extraction, and graph writing.filename or content_hash in metadata.duplicate_kind for diagnosis. Content-hash duplicates also record metadata.is_duplicate=true, metadata.original_doc_id, and metadata.original_track_id; duplicates discovered only after parsing also have the temporarily-written full_docs deleted.PROCESSED, a log and pipeline status are written; duplicates at the enqueue stage use the LightRAG layer's Duplicate document detected (...) log; content duplicates only discovered after parsing use Duplicate content skipped after parsing and write a pipeline status. Scan archiving does not emit the extra [File Extraction]Duplicate skipped.get_doc_by_content_hash; the naming convention is the same as get_doc_by_file_basename.Within an enqueue batch (the same
apipeline_enqueue_documentscall), basename and content_hash dedup are also performed; on hit, subsequent entries are written asFAILEDdirectly and marked withexisting_status=batch_duplicate. Basename dedup only applies to valid filenames;unknown_source,no-file-path, and empty sources only participate in content-hash dedup.Cross-call concurrent dedup is also guaranteed by the workspace-level serialization lock (see §6.7 enqueue serialization lock (preventing concurrent dedup leakage)): two concurrent enqueues of identical content with different filenames will not both leak past the
content_hashcheck.
To prevent scan / upload / insert from overwriting doc_status / full_docs records of an in-flight pipeline, all write entry points coordinate via the pipeline_status shared dictionary. The pipeline_status_lock per workspace ensures that all transitions in the table below are completed atomically within the lock.
pipeline_status Fields| Field | Semantics |
|---|---|
busy | Generic pipeline-busy flag. Both the processing loop and destructive jobs (clear/delete) set it. busy=True (processing loop) alone does not block enqueue — the loop pulls a doc_status snapshot per batch and checks request_pending between batches for any newly arrived work. |
destructive_busy | A destructive subset of busy: /documents/clear or /documents/{doc_id} (delete) is dropping storages / removing source files. Both reservation and the enqueue last-line guard reject — a concurrent enqueue would write to storage being torn down, and accepted documents would be silently lost. The processing loop does not set this field. |
scanning | The /documents/scan background task is running (entire lifecycle: classification stage + processing stage). Only the /scan endpoint uses it to reject overlapping scans; it does not itself block upload/insert. |
scanning_exclusive | An exclusive subset of scanning: True only during scan's classification phase — run_scanning_process is reading doc_status to classify (already processed / resume / delete stub / archive) and cannot interleave with concurrent writers. Both reservation and the enqueue last-line guard reject. After classification, the flag is cleared immediately, and concurrent uploads are allowed once scan enters the processing phase. |
pending_enqueues | The number of upload/insert calls that have passed _reserve_enqueue_slot but whose bg task has not completed. Used only by the scan endpoint — to decide whether to take the exclusive lock. The bg task releases the slot in finally. |
request_pending | A signal nudging the running processing loop to scan another round. Enqueue sets it after writing to doc_status when busy=True; the processing loop checks it after each batch and re-pulls the snapshot. |
| Entry point | Condition | Behavior |
|---|---|---|
/documents/upload / /documents/text / /documents/texts | scanning_exclusive=True or destructive_busy=True | Throw HTTP 409; do not write file, do not call enqueue |
| Same as above | Otherwise (including pure busy=True, scan-processing-phase scanning=True but scanning_exclusive=False) | Within the lock: pending_enqueues++ reserves a slot → strict name precheck → save file → schedule bg task; the bg task releases the slot in finally |
/documents/scan | busy=True or scanning=True or pending_enqueues>0 | Emit a warning and immediately return scanning_skipped_pipeline_busy; do not schedule a background task |
| Same as above | All idle | Within the lock, set scanning=True then schedule; the task clears the flag in finally upon completion |
/documents/clear / /documents/delete_document | busy=True or scanning=True or pending_enqueues>0 | The endpoint synchronously returns status="busy" and does not schedule a background task |
| Same as above | All idle | The endpoint synchronously within the lock sets busy=True + destructive_busy=True (before delete_document returns deletion_started), and the bg task's finally clears both flags |
apipeline_enqueue_documents internal (last-line guard) | scanning_exclusive=True and from_scan=False, or destructive_busy=True | Throw RuntimeError("Cannot enqueue while scan is classifying / clearing or deleting") |
| Same as above | Anything else (including pure busy=True, scan processing phase) | Enqueue normally; after writing doc_status, if busy=True, automatically nudge request_pending=True |
from_scan=True is a bypass for scan's own background-task enqueue: scan already holds the scanning flag, so it must be allowed to enqueue the files it has scanned.
busy no longer blocks enqueueIn the old version, busy=True always rejected any new enqueue, on the reasoning that "modifying doc_status would interleave with the pipeline worker thread." However, in practice:
apipeline_enqueue_documents always upserts full_docs first, then upserts doc_status. The consistency check at the start of the processing loop only deletes "orphan doc_status rows that have no corresponding full_docs" — a state that cannot occur with concurrent enqueue.get_docs_by_statuses snapshot once; newly written PENDING rows don't disturb the current batch, and the next round re-pulls the snapshot via request_pending to see the new work.request_pending is designed for this: the old version already had the request_pending field — it was designed for "new work arrives while running" — but was gated by busy.With this mechanism enabled in the new contract, users can continue to upload new documents during long batch processing, and the bg task, after writing doc_status, will be automatically picked up by the running loop.
scan not only enqueues the new files it finds, but also reads doc_status to decide what to do with each file:
PROCESSED row → archive source file, skip enqueue.full_docs present → resume path; the source file stays in INPUT/, not archived (the pending-parse parser may still need it); the processing loop picks it up by status query.FAILED with full_docs missing → recognized as an extraction-error stub previously written by apipeline_enqueue_error_documents (consistency check preserves such rows for human review); scan automatically deletes that stub and enqueues the current file as a new file, so that "fix the source file, scan again" takes effect directly.These "read–decide–write" combinations cannot interleave with other writers; otherwise classification decisions would be based on a stale view. So scan must be exclusive, and the scan endpoint will reject when any of busy / scanning / pending_enqueues>0 is present.
After upload passes the reservation but before saving the file, a two-pass check is required:
canonicalize_parser_hinted_basename, traverse the INPUT directory for any existing same-canonical variant (with hint / without hint); 409 on hit.get_existing_doc_by_file_basename with the canonicalized basename; 409 on hit.Both pass → save the file → schedule the bg task → bg task calls apipeline_enqueue_documents to write the store + calls apipeline_process_enqueue_documents to trigger processing.
The old version once allowed upload to silently write a FAILED duplicate entry when a same-name record existed; the new rule is fail-fast, leaving no duplicate traces in doc_status. To replace a same-name document, call the
/documents/{doc_id}delete API first.
When two uploads arrive simultaneously (scan cannot acquire exclusivity at this time):
_reserve_enqueue_slot → pending_enqueues=1, write file, schedule bg task A, return success._reserve_enqueue_slot → pending_enqueues=2, write file, schedule bg task B, return success.apipeline_enqueue_documents → writes doc_status → calls apipeline_process_enqueue_documents → sets busy=True to process A's document.apipeline_enqueue_documents → sees scanning=False, writes normally; after writing, sees busy=True, automatically sets request_pending=True.apipeline_process_enqueue_documents → sees busy=True, sets request_pending=True and returns immediately.request_pending=True, re-pulls the snapshot, and picks up B's PENDING row.busy=False, pending_enqueues=0.No bg task will be falsely rejected due to busy — because enqueue no longer checks busy; the processing loop will not process the same batch repeatedly — because request_pending only takes effect between batches and is cleared before each re-pull.
Inside apipeline_enqueue_documents, "read doc_status to dedupe → write full_docs / doc_status" runs serially under the workspace-level enqueue_serialize lock. Reason: now that concurrent enqueue is allowed during the busy/scan-processing phases, two enqueues with identical content but different filenames (typical scenario: a scan-processing-phase enqueue and an upload arriving together) would, without the lock, race as follows —
doc_status to check content_hash: miss.doc_status to check content_hash: still miss (A hasn't upserted yet).full_docs + doc_status.full_docs + doc_status.Result: both PENDING rows with the same content_hash enter the downstream pipeline, and the row that should have been identified as duplicate_kind=content_hash was not identified.
With the serialization lock, the second enqueue's dedup read is guaranteed to see the row already upserted by the first, taking the normal "no new unique document" early-return path and writing this run as a duplicate_kind=content_hash FAILED row. The lock only covers:
filter_keys (exclude existing by doc_id)full_docs.upsert + doc_status.upsertThe lock does not cover the request_pending nudge (outside the lock; only briefly takes pipeline_status_lock), and does not block the get_docs_by_statuses read of the processing loop (which goes through doc_status's own concurrent reads — a KV-level atomic with the enqueue writes, not contending for the same lock). Lock order: enqueue_serialize → pipeline_status_lock; no deadlock path.
The locks around pipeline_status solve the correctness problem of "who can write"; this section's set of parameters solves the throughput problem of "how many workers run concurrently". The pipeline is divided into 3 stages, each with an independently tunable worker pool:
┌─ q_native ──► [native parser × N1] ─┐
PENDING ─►├─ q_mineru ──► [mineru parser × N2] ─┼─► q_analyze ─►[analyzer × N4] ─► q_process ─►[processor × N5]
└─ q_docling ──► [docling parser × N3] ─┘
At enqueue time, resolve_stored_document_parser_engine puts each document into the corresponding parse queue based on its parser_engine (from LIGHTRAG_PARSER defaults or the filename hint); the three parse queues are completely non-blocking with respect to each other — mineru saturation does not slow down docling or native. After parsing, they enter q_analyze (multimodal analysis) uniformly, and then enter q_process (entity/relation extraction + ingest).
| Environment variable | Default | Effect | Tuning advice |
|---|---|---|---|
MAX_PARALLEL_PARSE_NATIVE | 5 | N1: number of concurrent workers for native parsing (docx / pdf / txt and other pure local processing) | Pure CPU, low memory usage; can be raised to CPU core count |
MAX_PARALLEL_PARSE_MINERU | 2 | N2: number of concurrent workers for MinerU parsing | MinerU has significant GPU/CPU usage; the default of 2 is a modest amount of parallelism. Lower to 1 when resources are tight; with local deployment and ample VRAM, you can set 2–3; when going through MinerU's official cloud service, you can raise it appropriately (subject to cloud quotas). |
MAX_PARALLEL_PARSE_DOCLING | 2 | N3: number of concurrent workers for Docling parsing | Docling is similarly resource-sensitive; the default of 2 is a modest amount of parallelism. Lower to 1 when resources are tight; with local deployment and ample CPU/GPU, you can set 2–3. |
MAX_PARALLEL_ANALYZE | 5 | N4: number of concurrent workers for multimodal analysis (VLM image / table description) | Directly consumes the VLM quota. Recommended ≤ VLM service concurrency cap. |
MAX_PARALLEL_INSERT | 3 | N5: number of concurrent documents at the entity / relation extraction + ingest stage | Recommended MAX_ASYNC / 3, in the range 2–10. This stage triggers multiple LLM calls per document; setting it too high will hit LLM rate limits. This value also serves as the asyncio.Semaphore for an additional constraint (worker count and semaphore value are the same). |
QUEUE_SIZE_PARSE | 20 | Bounded capacity of the parse-input queues (native/MinerU/Docling) | Generally no need to tune. Items here are lightweight doc_ids (the large parsed body is stripped before the analyze stage); this only bounds how many pending docs the pipeline pre-dispatches to parse workers, so tuning has little effect. |
QUEUE_SIZE_ANALYZE | 100 | Bounded capacity of the analyze queue (parse → analyze stage) | Generally no need to tune. For very large batches (thousands or more), can be raised to avoid backpressure at the enqueue side; lower it when memory is tight. |
QUEUE_SIZE_INSERT | 4 | Queue capacity between the analyze → process stage | The process stage is the slowest and most memory-hungry in the pipeline; the queue is deliberately small to provide backpressure to upstream and prevent memory bloat. |
Several key points:
MAX_PARALLEL_INSERT doubles as worker pool size and semaphore cap: the pipeline creates a Semaphore(max_parallel_insert), and each process worker also takes the semaphore before extraction and ingest. So even if you manually raise the worker count, the actual concurrency cap is still bounded by this value — just tune it directly.QUEUE_SIZE_INSERT=4 is intentional — the process stage is slow and memory-hungry; when the queue fills, analyze blocks, and backpressure reaches the parse stage, preventing thousands of parsing results from piling up in memory at once..env (or environment variables), read once at LightRAG construction; restart the service after changing them.Typical tuning scenarios:
MAX_PARALLEL_PARSE_MINERU=2, MAX_PARALLEL_ANALYZE=5, MAX_PARALLEL_INSERT=3 (defaults are fine; lower MINERU to 1 if VRAM is tight).MAX_PARALLEL_PARSE_MINERU=3~5 (depending on cloud quota), others at defaults.MAX_PARALLEL_PARSE_NATIVE=10; MAX_PARALLEL_INSERT derived from MAX_ASYNC/3.MAX_PARALLEL_INSERT (the process stage makes multiple LLM calls per document), then lower MAX_PARALLEL_ANALYZE (VLM is a separate quota).Each time apipeline_process_enqueue_documents starts up, it pulls all documents in PARSING / ANALYZING / PROCESSING / PENDING / FAILED to continue processing. The resume path branches by "whether content has been extracted", ensuring that any document, regardless of its previous progress, has an idempotent result when resumed under the current process_options.
The resume rule only applies to documents whose doc_id already exists in doc_status. New files joining the queue require the file dedup logic in "Concurrency and Reentry Constraints", to avoid new files squeezing out the records of files whose content has already been successfully extracted.
Read full_docs[doc_id]:
parse_format | Verdict |
|---|---|
lightrag and lightrag_document_path file exists | ✅ extracted |
raw and content is non-empty | ✅ extracted |
Other (including pending_parse, missing record) | ❌ not extracted |
Go through the full pipeline (parse_native / parse_mineru / parse_docling → analyze_multimodal → chunking → entity extraction), with each stage's behavior determined by full_docs.process_options. This is the normal flow of a "first-time enqueue".
Always skip parsing (do not call parse_* again), restart from the ANALYZING stage, clear old chunks / entities, and redo per the current process_options:
| Sub-step | Behavior |
|---|---|
| Engine comparison | If the engine implied by process_options ≠ full_docs.parse_engine, only warn, do not re-parse. The extracted content is an immutable fact; re-running a different engine would produce inconsistency. To switch engines, delete the whole document and re-upload it. |
| Old chunks / entities / relations cleanup | Read status_doc.chunks_list to collect old chunk id set, call _purge_doc_chunks_and_kg(doc_id, chunk_ids): delete chunk rows from chunks_vdb / text_chunks; reverse-lookup affected entities / relations by entity_chunks / relation_chunks, directly remove entries that have lost all sources from the graph and vector store, and call rebuild_knowledge_from_chunks to rebuild with the remaining chunks for entries still contributed by other documents; finally delete the index rows of this doc in full_entities / full_relations. After purge completes, status_doc.chunks_list = [] / chunks_count = 0 are reset to avoid the subsequent state-machine upsert writing back old IDs. |
analyze_multimodal | For enabled modalities, every run recomputes the sidecar item analysis and overwrites the existing llm_analyze_result. The LLM analysis cache still applies: a cache hit reuses the previous provider response, so semantic fields usually stay the same and only runtime fields such as analyze_time are rewritten. Cache misses, for example after changing the model or prompt, can produce different saved content. |
| Re-chunk | Pick the strategy by the new process_options.chunking, with parameters read from full_docs.chunk_options (the enqueue snapshot; not overwritten by resume; env changes do not affect old documents that still chunk by the parameters from the moment of enqueue). The LightRAG Document path uses paragraph_semantic when process_options=P, otherwise dispatches to F/R/V by selector. |
| Entity extraction / KG-skip | Determined by the new process_options.skip_kg |
This rule guarantees: when users change
i/t/eand re-upload the same-named document (delete the old doc first, then upload the file with the new hint), multimodal analysis is incrementally filled in; when changingF/R/V/P, chunks and graph are rebuilt; when changing!, KG construction is stopped or restored. Engine changes are considered a "major change", uniformly handled by delete + re-upload, not implicitly happening on the resume path.
This chapter targets developers who directly import the LightRAG class for integration, covering runtime APIs, constructor parameters, and removed legacy interfaces that Server deployments don't use. Server users usually don't need to read this chapter.
from lightrag import LightRAG
rag = LightRAG(working_dir="./rag_storage", ...)
await rag.initialize_storages()
await rag.ainsert("text", file_paths="doc.pdf")
The following behaviors of this invocation style differ from the Server path: you can change addon_params["chunker"] without restarting the process, you can pass per-file chunk_options into apipeline_enqueue_documents, and you can dynamically override the F strategy's pre-split parameters in an ainsert call.
LightRAG(chunk_token_size=…, chunk_overlap_token_size=…) is tier 3 in §3.3's priority chain: "legacy constructor field". Strategy-agnostic and coarse-grained default, fills only slots still empty:
addon_params["chunker"] explicit values (§8.3) and strategy-specific env (§3.2).CHUNK_SIZE / CHUNK_OVERLAP_SIZE.self.chunk_token_size / self.chunk_overlap_token_size are always back-filled to int after __post_init__, so legacy paths still reading these two fields (e.g., the chunk_opts.get("chunk_token_size") or self.chunk_token_size fallback in pipeline.py) continue to work.addon_params["chunker"] at Runtimeaddon_params["chunker"] is an ObservableAddonParams field; it can be modified at runtime:
rag.addon_params["chunker"]["recursive_character"]["separators"] = ["##", "\n", " "]
After modification, subsequent enqueues get the new defaults; already-enqueued documents keep the snapshot from their enqueue moment (see the three layers of semantic guarantee in §3.3). This is tier 1 of §3.3's priority chain: "addon_params["chunker"] explicit value", winning everything.
Server deployments do not have this capability — after changing env, the service must be restarted for it to take effect.
apipeline_enqueue_documents(chunk_options=…)apipeline_enqueue_documents accepts an optional chunk_options argument. When the caller passes a dict / list[dict], it is projected by the current document's process_options into a slim snapshot (keeping only the corresponding strategy sub-dictionary + top-level chunk_token_size) before being persisted to full_docs[doc_id]["chunk_options"]; when not passed, resolve_chunk_options(self.addon_params, process_options=…) assembles one on the spot. Callers can safely pass the full dictionary — the other strategies' sub-dictionaries will be discarded by the dispatcher and won't pollute the store.
Typical usage:
await rag.apipeline_enqueue_documents(
input=["text A", "text B"],
file_paths=["a.[native-R].txt", "b.txt"],
process_options=["R", ""],
chunk_options=[
{"chunk_token_size": 800, "recursive_character": {"separators": ["\n\n", "\n"]}},
{"chunk_token_size": 1500},
],
)
Typical scenarios for per-file personalization: a management UI configures separators or V threshold individually for a certain file; in the future, upload APIs may also accept overrides in form / hint.
Compatibility for not passing file_paths: the core APIs insert / ainsert / apipeline_enqueue_documents still support invocations without file_paths; the file_path of such documents is saved as unknown_source, does not participate in filename dedup, and the document ID continues to be generated from text content.
For apipeline_enqueue_documents's own concurrency constraints (last-line guard, from_scan=True bypass), see the entry-point behavior table in §6.2.
ainsert(split_by_character=…, split_by_character_only=…)LightRAG.ainsert(split_by_character=…, split_by_character_only=…) runtime parameters are overridden into chunk_options.fixed_token by resolve_chunk_options at enqueue time:
None split_by_character overrides the env default;split_by_character_only=True overrides (False is the signature default, indistinguishable from "not specified", so the env default wins).Only effective for the F strategy; other strategies' sub-dictionaries are unaffected.
reprocess_existing_non_processedThe legacy apipeline_enqueue_documents behavior of reprocess_existing_non_processed=True would directly delete non-PROCESSED old records and rebuild them during scan, which conflicts with the rules in §5 / §6; it has been entirely removed. Replacement paths:
/documents/{doc_id} to delete the old document, then upload the same-named new file.