docs/development/data-components.md
The data & data structure components include:
- Document class.
- PdfLoader
- Layout-aware PdfLoader with table parsing:
  - MathPixLoader: to use this loader, you need a MathPix API key; refer to the MathPix docs for more information.
  - OCRLoader: this loader uses lib-table and a Flax pipeline to perform OCR and read table structure from PDF files (TODO: add more info about deployment of this module).
Output:
Document: text plus metadata that identifies whether the content is a table, with the following fields:
- "source": source file name
- "type": "table" or "text"
- "table_origin": original table in markdown format (to be fed to an LLM or visualized using external tools)
- "page_label": page number in the original PDF document
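
To illustrate how the metadata above might be consumed downstream, here is a minimal sketch. It does not use the project's actual `Document` class: the `Document` dataclass, the `split_by_type` helper, the file name `report.pdf`, and the sample values are all hypothetical stand-ins, and storing `page_label` as a string is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    """Hypothetical stand-in for the project's Document class (illustration only)."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_by_type(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Separate loader output into table documents and plain-text documents."""
    tables = [d for d in docs if d.metadata.get("type") == "table"]
    texts = [d for d in docs if d.metadata.get("type") == "text"]
    return tables, texts


# Sample loader output, built by hand using the metadata keys listed above.
docs = [
    Document(
        text="Quarterly revenue grew in the North region.",
        metadata={"source": "report.pdf", "type": "text", "page_label": "1"},
    ),
    Document(
        text="Region: North, Revenue: 10",
        metadata={
            "source": "report.pdf",
            "type": "table",
            # Original table kept as markdown so it can be passed to an LLM or rendered.
            "table_origin": "| Region | Revenue |\n|---|---|\n| North | 10 |",
            "page_label": "2",
        },
    ),
]

tables, texts = split_by_type(docs)
for table in tables:
    print(table.metadata["source"], table.metadata["page_label"])
    print(table.metadata["table_origin"])
```

Filtering on `"type"` lets a pipeline route table documents separately, for example by sending `"table_origin"` to an LLM while indexing the plain text as usual.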