Document Query Plugin

Load, parse, index, and answer questions over local and remote documents, with configurable timeouts and thread-safe parsers.

Features

  • Strategy-pattern parsers - MIME-type routing to dedicated parser classes
  • Centralized fetching - local and HTTP(S) resources are fetched once, size-checked, then passed to parsers
  • LiteParse-first path - fast local parsing for PDFs and supported document/image formats, with legacy fallbacks
  • Adaptive OCR - long PDFs skip OCR automatically to avoid pathological parse times
  • Adaptive indexing - very large extracted documents increase chunk size to keep embedding work bounded
  • Bounded parser execution - sync parsers are offloaded via asyncio.to_thread and globally capped across chats
  • Configurable timeouts - per-document and gather-level timeouts
  • Expanded format support - PDF, HTML, text, YAML, XML, TOML, JS, TS, images, and catch-all Unstructured
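
The bounded execution and timeout features above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: `parse_sync`, `parse_one`, and `parse_all` are hypothetical names, the blocking parser is simulated with a sleep, and the keyword defaults simply mirror the documented `parser_concurrency`, `per_document_timeout`, and `gather_timeout` settings.

```python
import asyncio
import time


def parse_sync(name: str, seconds: float) -> str:
    """Stand-in for a blocking parser (hypothetical); sleeps instead of parsing."""
    time.sleep(seconds)
    return f"parsed:{name}"


async def parse_one(slots: asyncio.Semaphore, name: str, seconds: float,
                    per_document_timeout: float):
    # Only as many jobs as the semaphore allows may parse at the same time.
    async with slots:
        try:
            # Run the blocking parser in a worker thread so the event loop stays free.
            return await asyncio.wait_for(
                asyncio.to_thread(parse_sync, name, seconds),
                timeout=per_document_timeout,
            )
        except asyncio.TimeoutError:
            # The awaiting side gives up; the worker thread itself cannot be
            # interrupted and finishes in the background.
            return None


async def parse_all(docs, parser_concurrency: int = 1,
                    per_document_timeout: float = 60.0,
                    gather_timeout: float = 120.0):
    # Assumption: one semaphore shared by every caller models the global cap.
    slots = asyncio.Semaphore(parser_concurrency)
    jobs = [parse_one(slots, name, seconds, per_document_timeout)
            for name, seconds in docs]
    # Raises asyncio.TimeoutError if the whole batch exceeds gather_timeout.
    return await asyncio.wait_for(asyncio.gather(*jobs), timeout=gather_timeout)
```

In this sketch the per-document timeout starts only once a job holds a parser slot, so time spent waiting in the queue counts against the gather-level timeout alone. Whether the plugin measures it the same way is an assumption.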

Configuration

See default_config.yaml for all options. Key settings:

| Setting | Default | Description |
|---|---|---|
| fetch_timeout | 30 | HTTP fetch timeout (seconds) |
| fetch_retries | 3 | HTTP retry attempts |
| max_remote_bytes | 52428800 | Max remote document size (bytes) |
| per_document_timeout | 60 | Max time for a single document parse |
| gather_timeout | 120 | Max time for all documents combined |
| parser_concurrency | 1 | Max parser jobs running across all chats in one process |
| context_intro_chunks | 2 | Leading chunks included per document for title/abstract grounding |
| chunk_size | 1000 | Text splitter chunk size |
| chunk_overlap | 100 | Text splitter overlap |
| max_index_chunks | 1200 | Maximum indexed chunks before adaptive chunk sizing, or 0 for no cap |
| search_threshold | 0.5 | Similarity search threshold |
| liteparse_enabled | true | Prefer LiteParse before legacy parser fallbacks |
| liteparse_num_workers | 2 | Max LiteParse OCR workers per parser job |
| liteparse_ocr_auto_disable_pages | 30 | Disable OCR for PDFs at or above this effective page count |
| thread_offload | true | Offload sync parsers to thread pool |
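
To make the interaction between chunk_size, chunk_overlap, and max_index_chunks concrete, here is a rough sketch of how adaptive chunk sizing could work. It assumes fixed-size character windows with a constant stride, which is a simplification of what a real text splitter does, and the function names `estimate_chunks` and `effective_chunk_size` are hypothetical rather than taken from the plugin.

```python
import math


def estimate_chunks(text_len: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the chunk count for fixed windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_len <= 0:
        return 0
    if text_len <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new characters covered per chunk
    return math.ceil((text_len - chunk_overlap) / stride)


def effective_chunk_size(text_len: int, chunk_size: int = 1000,
                         chunk_overlap: int = 100,
                         max_index_chunks: int = 1200) -> int:
    """Grow chunk_size just enough to stay within max_index_chunks (0 = no cap)."""
    if max_index_chunks <= 0:
        return chunk_size
    if estimate_chunks(text_len, chunk_size, chunk_overlap) <= max_index_chunks:
        return chunk_size
    # Smallest stride that covers the text in at most max_index_chunks windows.
    stride = math.ceil((text_len - chunk_overlap) / max_index_chunks)
    return stride + chunk_overlap
```

Under these assumptions a 50,000-character document keeps the default chunk size of 1000, while a 5,000,000-character document is raised to 4267 so that the estimated chunk count lands at the cap of 1200.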

LiteParse is installed into the Agent Zero framework runtime from hooks.py during plugin install/startup. If installation fails, the plugin logs the error and continues with the legacy parser fallbacks.

LiteParse always runs in a child process so native parser and OCR failures stay isolated from the Web UI process.
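
The isolation idea can be illustrated with a small sketch that runs a parser in a child Python process and treats any crash or timeout as a recoverable failure. Everything here is hypothetical: the inline `WORKER` script only simulates parsing, `parse_in_child` is not the plugin's function, and the JSON-over-stdout protocol is an assumption made for the example.

```python
import json
import subprocess
import sys

# Hypothetical worker script. A real worker would call the parser here;
# a path ending in ".crash" simulates a native crash via a hard exit.
WORKER = r"""
import json, os, sys
path = sys.argv[1]
if path.endswith(".crash"):
    os._exit(139)
print(json.dumps({"path": path, "text": "extracted text"}))
"""


def parse_in_child(path: str, timeout: float = 60.0):
    """Run the worker in a separate process; return its result or None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER, path],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        # The child died, but the parent process is unaffected and can fall back.
        return None
    return json.loads(proc.stdout)
```

Because the parent only inspects the exit code and stdout, a segfault or abort inside a native library ends the child alone, and the caller can move on to a fallback parser.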

Parsers

| Parser | MIME Types | Backend |
|---|---|---|
| LiteParseParser | PDF, Office/OpenDocument, images | LiteParse |
| PdfParser | application/pdf | PyMuPDF + Tesseract OCR fallback |
| HtmlParser | text/html | Markdownify transformer |
| TextParser | text/*, application/json, YAML, XML, TOML, JS, TS, shell | Direct read |
| ImageParser | image/* | UnstructuredLoader |
| UnstructuredParser | * (catch-all) | UnstructuredLoader hi-res |
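
As a rough illustration of MIME-type routing with a catch-all, the sketch below matches a MIME type against an ordered list of patterns. The `ROUTES` list and `pick_parser` function are hypothetical, the route order is an assumption, and the LiteParse-first step is left out for brevity.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: specific patterns first, the catch-all last.
ROUTES = [
    ("application/pdf", "PdfParser"),
    ("text/html", "HtmlParser"),
    ("text/*", "TextParser"),
    ("application/json", "TextParser"),
    ("image/*", "ImageParser"),
    ("*", "UnstructuredParser"),
]


def pick_parser(mimetype: str) -> str:
    """Return the name of the first parser whose pattern matches the MIME type."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    bare = mimetype.split(";", 1)[0].strip().lower()
    for pattern, parser_name in ROUTES:
        if fnmatchcase(bare, pattern):
            return parser_name
    raise LookupError(f"no parser registered for {mimetype!r}")
```

Ordering matters in this design: text/html must appear before text/* or HTML documents would be read as plain text, and the `*` entry guarantees that every type resolves to some parser.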

Adding a new parser

  1. Create helpers/parsers/<format>.py extending BaseParser
  2. Set mimetypes class attribute
  3. Implement _parse_sync(document, config)
  4. Register in helpers/parsers/__init__.py
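
The steps above might look like the following for a CSV parser. This is a sketch under stated assumptions: the `BaseParser` class shown here is a local stand-in so the example runs on its own, and the real base class may differ. Treating `document` as bytes and `config` as a dict are guesses, `CsvParser` is a hypothetical example format, and registration is represented by a plain list.

```python
import asyncio
import csv
import io


class BaseParser:
    """Local stand-in for the plugin's BaseParser; the real interface may differ."""

    mimetypes: tuple = ()

    def _parse_sync(self, document: bytes, config: dict) -> str:
        raise NotImplementedError

    async def parse(self, document: bytes, config: dict) -> str:
        # Assumption: thread_offload decides whether the sync parser
        # runs in a worker thread or inline.
        if config.get("thread_offload", True):
            return await asyncio.to_thread(self._parse_sync, document, config)
        return self._parse_sync(document, config)


class CsvParser(BaseParser):
    # Step 2: declare which MIME types this parser handles.
    mimetypes = ("text/csv",)

    # Step 3: do the blocking work in _parse_sync.
    def _parse_sync(self, document: bytes, config: dict) -> str:
        text = document.decode("utf-8", errors="replace")
        rows = csv.reader(io.StringIO(text))
        return "\n".join(" | ".join(cell.strip() for cell in row) for row in rows)


# Step 4 (represented here by a list): make the parser discoverable.
PARSERS = [CsvParser()]
```

Keeping the parsing logic synchronous and leaving thread offloading to the base class means a new parser does not need any asyncio code of its own.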