docs/ParserDebugCLI.md
This tool is used to locally debug LightRAG's three content parsing engines (native / mineru / docling). It triggers the LightRAG.parse_<engine> production code path for a single file and outputs the parsing artifacts (sidecar and raw cache) into a flat directory layout. Compared with the production ingestion directory, the only differences are:
- No __parsed__/ intermediate layer: artifacts land directly under the specified parent directory for easy inspection;
- The source file is not archived after parsing (production moves it to <INPUT_DIR>/__parsed__/);
- Any non-empty mineru / docling raw directory is considered valid, skipping _manifest.json validation.

The rest of the flow (IR construction, sidecar writing, full_docs synchronization logic) is identical to production ingestion, making it convenient for troubleshooting parsing-stage issues.
python -m lightrag.parser.cli <input_file> \
--engine {native|mineru|docling} \
[-o <sidecar_parent_dir>] \
[--doc-id <doc-id>] \
[--force-reparse] \
[--preview N]
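When the CLI is driven from a script or a test harness, the argument list can be assembled programmatically. The following is a minimal sketch, not part of LightRAG: `build_cli_command` is a hypothetical helper name, and it only uses the flags listed in the usage above.

```python
import sys
from typing import List, Optional

ENGINES = ("native", "mineru", "docling")


def build_cli_command(
    input_file: str,
    engine: str,
    sidecar_parent_dir: Optional[str] = None,
    doc_id: Optional[str] = None,
    force_reparse: bool = False,
    preview: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `python -m lightrag.parser.cli` (hypothetical helper)."""
    if engine not in ENGINES:
        raise ValueError(f"engine must be one of {ENGINES}, got {engine!r}")
    # Run the module with the same interpreter that runs this script.
    cmd = [sys.executable, "-m", "lightrag.parser.cli", str(input_file), "--engine", engine]
    if sidecar_parent_dir is not None:
        cmd += ["-o", str(sidecar_parent_dir)]
    if doc_id is not None:
        cmd += ["--doc-id", doc_id]
    if force_reparse:
        cmd.append("--force-reparse")
    if preview is not None:
        cmd += ["--preview", str(preview)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with paths that contain spaces.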
| Argument | Description |
|---|---|
| input_file | Path to the source file to parse (positional argument, required). The file must actually exist. |
| --engine | Required: native (only .docx, local parsing) / mineru (PDF/Office documents, calls MinerU service) / docling (PDF/Office documents, calls docling-serve). |
| -o / --sidecar-parent-dir | Parent directory of the sidecar and raw directories. Defaults to the directory containing the source file. |
| --doc-id | Custom document ID. Defaults to doc-<md5(absolute path of source file)> (stable across multiple runs on the same file). |
| --force-reparse | Effective only for mineru / docling: clears the raw directory and forces re-download and re-parse. By default, a non-empty raw directory is reused. |
| --preview N | After parsing completes, prints a preview of the first N blocks (headings + content snippets). Default 5; 0 disables it. |
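The default document ID described above (doc-<md5(absolute path of source file)>) can be approximated as follows. This is a sketch under stated assumptions: that the hash input is the UTF-8 encoded absolute path string, and that the path is made absolute with `Path.resolve()`. The CLI may normalize the path differently, so treat `default_doc_id` as a hypothetical illustration of why the ID is stable across runs rather than as the exact implementation.

```python
import hashlib
from pathlib import Path


def default_doc_id(source_file: str) -> str:
    """Derive a doc ID from the absolute path of the source file (hypothetical helper)."""
    # Assumption: the absolute path is hashed as a UTF-8 string.
    absolute = str(Path(source_file).resolve())
    return "doc-" + hashlib.md5(absolute.encode("utf-8")).hexdigest()
```

Because the input is the path and not the file content, editing the file does not change the ID, while moving or renaming the file does.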
For example, with the input ./inputs/workspace/sample.pdf and the default sidecar parent directory (./inputs/workspace/), the output layout is:
./inputs/workspace/
├── sample.pdf # original file, untouched
├── sample.pdf.parsed/ # ← sidecar output
│ ├── sample.blocks.jsonl # JSONL: first line is meta, each subsequent line is a block
│ ├── sample.blocks.assets/ # image/media assets extracted by native (if any)
│ ├── sample.tables.json # table sidecar (if IR contains tables)
│ ├── sample.drawings.json # drawing/image sidecar (if IR contains drawings)
│ └── sample.equations.json # equation sidecar (if IR contains equations)
└── sample.pdf.<engine>_raw/ # ← raw cache for mineru / docling (native has no such directory)
    ├── _manifest.json # written by the engine download flow; not read by CLI cache validation
    └── <bundle files> # engine-specific raw artifacts (content_list.json / *.json / assets, etc.)
The native engine does not produce a raw directory (parsing is local, with no external service involved).
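The naming rules in the layout above can be summarized in a few lines of code. This is an illustrative sketch inferred from the example tree only; `expected_output_paths` is a hypothetical helper and is not imported from LightRAG.

```python
from pathlib import Path
from typing import Dict, Optional


def expected_output_paths(
    source_file: str,
    engine: str,
    sidecar_parent_dir: Optional[str] = None,
) -> Dict[str, Optional[Path]]:
    """Predict where the CLI places its artifacts (hypothetical helper)."""
    source = Path(source_file)
    # The parent defaults to the directory containing the source file.
    parent = Path(sidecar_parent_dir) if sidecar_parent_dir else source.parent
    # Directory names keep the full file name; files inside use the stem.
    sidecar_dir = parent / f"{source.name}.parsed"
    # The native engine parses locally and has no raw cache directory.
    raw_dir = None if engine == "native" else parent / f"{source.name}.{engine}_raw"
    return {
        "sidecar_dir": sidecar_dir,
        "blocks_jsonl": sidecar_dir / f"{source.stem}.blocks.jsonl",
        "raw_dir": raw_dir,
    }
```

For ./inputs/workspace/sample.pdf with the mineru engine this yields the sample.pdf.parsed/ and sample.pdf.mineru_raw/ directories shown in the tree.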
native engine on .docx (zero network dependency):
python -m lightrag.parser.cli ./inputs/workspace/sample.docx --engine native
# Output: ./inputs/workspace/sample.docx.parsed/ (contains blocks.jsonl + assets)
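To inspect the generated blocks file outside the CLI, it helps to remember that the first JSONL line is the meta record and every later line is a block. A minimal reader might look like the sketch below; `read_blocks_jsonl` is a hypothetical helper, and no particular field names inside the records are assumed.

```python
import json
from typing import Any, Dict, List, Tuple


def read_blocks_jsonl(path: str) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Split a *.blocks.jsonl file into (meta, blocks) (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break parsing.
        records = [json.loads(line) for line in handle if line.strip()]
    if not records:
        raise ValueError(f"{path} is empty; expected a meta line first")
    # First record is the meta line, the remainder are blocks.
    return records[0], records[1:]
```

Calling `meta, blocks = read_blocks_jsonl("./inputs/workspace/sample.docx.parsed/sample.blocks.jsonl")` then gives the block count as `len(blocks)`.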
# First run: download raw bundle + generate sidecar
python -m lightrag.parser.cli ./inputs/workspace/sample.pdf --engine mineru
# Second run (no changes): raw directory non-empty → reused directly → only regenerate sidecar, fast
python -m lightrag.parser.cli ./inputs/workspace/sample.pdf --engine mineru
# The log will show: [parse_mineru] raw cache hit doc_id=... raw_dir=.../sample.pdf.mineru_raw
# Existing ./inputs/workspace/sample.pdf.docling_raw/ (contains docling's JSON output, etc.)
python -m lightrag.parser.cli ./inputs/workspace/sample.pdf --engine docling
# The CLI does not check the manifest; as long as the raw directory is non-empty, the docling-serve call is skipped
Note: this is the equivalent replacement for the "rebuild sidecar from an existing raw directory" scenario that used to live in the legacy python -m lightrag.parser.external.docling debug entry point. Just place the raw directory at the agreed location (<sidecar_parent>/<source>.docling_raw/) to trigger the cache-hit branch.
python -m lightrag.parser.cli ./inputs/workspace/sample.docx \
--engine native -o /tmp/debug_sidecar
# Output: /tmp/debug_sidecar/sample.docx.parsed/
# The source file ./inputs/workspace/sample.docx is not moved
python -m lightrag.parser.cli ./inputs/workspace/sample.pdf \
--engine docling --force-reparse
# raw directory is cleared → docling-serve is called again to download → sidecar regenerated
The mineru / docling engines call external services when the cache misses (first parse or --force-reparse); the required environment variables are identical to production ingestion:
- MinerU: MINERU_API_MODE (local / official), MINERU_API_TOKEN, MINERU_LOCAL_ENDPOINT or MINERU_OFFICIAL_ENDPOINT, optional MINERU_ENGINE_VERSION / MINERU_MODEL_VERSION / MINERU_POLL_INTERVAL_SECONDS / MINERU_MAX_POLLS.
- Docling: DOCLING_ENDPOINT, optional DOCLING_ENGINE_VERSION / DOCLING_DO_OCR / DOCLING_FORCE_OCR / DOCLING_OCR_ENGINE / DOCLING_OCR_PRESET / DOCLING_OCR_LANG / DOCLING_DO_FORMULA_ENRICHMENT / DOCLING_POLL_INTERVAL_SECONDS / DOCLING_MAX_POLLS.

See FileProcessingConfiguration.md for details.
When the cache is hit (the raw directory already exists and is non-empty, and --force-reparse is not passed), no external service environment variables are needed, so parsing output can be reproduced offline.
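The cache-hit rule can be expressed as a small predicate. This is a sketch of the rule as stated in this document (raw directory exists, is non-empty, and --force-reparse is not set), not the CLI's own code; `needs_external_service` is a hypothetical name.

```python
from pathlib import Path


def needs_external_service(raw_dir: str, force_reparse: bool = False) -> bool:
    """Return True when MinerU / docling-serve would have to be called (hypothetical helper)."""
    if force_reparse:
        # --force-reparse clears the raw directory, so the cache always misses.
        return True
    raw = Path(raw_dir)
    # Cache hit: the directory exists and contains at least one entry.
    cache_hit = raw.is_dir() and any(raw.iterdir())
    return not cache_hit
```

A pre-flight check like this can tell you whether the service environment variables must be set before a run.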
| Symptom | Action |
|---|---|
| error: input file does not exist: ... | Check the input_file path; it must be an existing file (not a raw directory). |
| Raw directory exists but sidecar content is still stale | The default behavior is to reuse raw and regenerate sidecar. If the raw itself is outdated or has been replaced, add --force-reparse to clear and re-download. |
| MinerU reports MINERU_API_TOKEN missing / Docling fails to connect to DOCLING_ENDPOINT | A cache miss triggered an external service call. Verify the corresponding environment variables, or confirm whether the raw directory is non-empty (no service is needed when the cache hits). |
| Source file is unexpectedly moved | Should not happen: the CLI has mocked the archive function. If reproducible, please file an issue (a new archive call site may have been added in the pipeline). |
| parse_docling reports produced zero blocks | The main JSON content in docling raw is unparseable or empty. Check whether the *.json files in the raw directory are valid. |
LightRAG.parse_* Production Path

This CLI directly calls the production code paths LightRAG.parse_native / parse_mineru / parse_docling (via the lightweight RAG stand-in in lightrag/parser/debug.py), so:
- IR construction, write_sidecar calls, and _persist_parsed_full_docs behavior are identical;
- the differences from production are applied by monkey-patch inside the CLI; no production code is modified:
  - parsed_artifact_dir_for_source → returns the flat path (no __parsed__/);
  - is_bundle_valid → "raw is valid if non-empty";
  - archive_docx_source_after_full_docs_sync → no-op, source file preserved.

Results can be cross-validated against golden fixtures under tests/parser/docx/golden/native_docx/ (the CLI does not freeze timestamps; just exclude time fields such as created_at when comparing).
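When comparing CLI output against a golden fixture, time fields have to be dropped first because the CLI does not freeze timestamps. One way to do that is sketched below; `strip_time_fields` is a hypothetical helper, and created_at is the only field name taken from this document. Add other keys to `time_keys` if your sidecars contain them.

```python
from typing import Any, FrozenSet


def strip_time_fields(
    value: Any,
    time_keys: FrozenSet[str] = frozenset({"created_at"}),
) -> Any:
    """Return a copy of a parsed JSON value without time-dependent keys (hypothetical helper)."""
    if isinstance(value, dict):
        # Drop the time keys at this level and recurse into the remaining values.
        return {
            key: strip_time_fields(item, time_keys)
            for key, item in value.items()
            if key not in time_keys
        }
    if isinstance(value, list):
        return [strip_time_fields(item, time_keys) for item in value]
    # Scalars (str, int, float, bool, None) are returned unchanged.
    return value
```

Applying it to each parsed JSON record from both the CLI output and the golden file, then comparing with `==`, checks structural equality while ignoring when each file was produced.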