# Docugami Loader
```bash
pip install llama-index-readers-docugami
```
This loader takes in IDs of PDF, DOCX, or DOC files processed by Docugami and returns nodes in a Document XML Knowledge Graph for each document. This is a rich representation that captures the semantic and structural characteristics of the various chunks in the document as an XML tree. Entire sets of documents are processed, resulting in forests of XML semantic trees.
To use this loader, you simply need to pass in a Docugami Doc Set ID, and optionally an array of Document IDs (by default, all documents in the Doc Set are loaded).
```python
from llama_index.readers.docugami import DocugamiReader

docset_id = "tjwrr2ekqkc3"
document_ids = ["ui7pkriyckwi", "1be3o7ch10iy"]

loader = DocugamiReader()
documents = loader.load_data(docset_id=docset_id, document_ids=document_ids)
```
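Each item returned by `load_data` is a standard LlamaIndex `Document` with a `text` string and a `metadata` dict, so downstream code can inspect or filter the chunks before indexing. A minimal sketch of grouping chunks by their source document (the `Doc` stand-in and the `"id"` metadata key are illustrative assumptions, not Docugami's actual schema):

```python
# Sketch: group loaded chunks by source document ID.
# `Doc` is a stand-in with the same shape as a LlamaIndex Document
# (a `text` string plus a `metadata` dict); the "id" metadata key
# is a hypothetical example, not Docugami's actual metadata schema.
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)


documents = [
    Doc("Section 1 text", {"id": "ui7pkriyckwi"}),
    Doc("Section 2 text", {"id": "ui7pkriyckwi"}),
    Doc("Intro text", {"id": "1be3o7ch10iy"}),
]

by_source = defaultdict(list)
for doc in documents:
    by_source[doc.metadata["id"]].append(doc.text)

print(sorted(by_source))  # ['1be3o7ch10iy', 'ui7pkriyckwi']
```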
This loader is designed to be used as a way to load data into LlamaIndex.
See more information about how to use Docugami with LangChain in the LangChain docs.
Appropriate chunking of your documents is critical for accurate retrieval. Many chunking techniques exist, including simple ones that rely on whitespace and recursive splitting based on character length. Docugami offers a different approach:
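For context, the simple character-length technique mentioned above (not Docugami's approach) can be sketched as a recursive splitter that tries coarse separators first and falls back to finer ones:

```python
# Simplified recursive character splitter: split on the coarsest separator
# present, recurse into oversized pieces, then greedily merge adjacent
# pieces back together while they fit under max_len. An illustration of
# the whitespace/character-length techniques mentioned above, not
# Docugami's semantic chunking.
def recursive_split(text, max_len=40, separators=("\n\n", "\n", " ")):
    if len(text) <= max_len:
        return [text]
    sep = next((s for s in separators if s in text), "")
    if sep == "":
        # No separator available: hard-split by character count.
        return [text[i : i + max_len] for i in range(0, len(text), max_len)]
    parts = []
    for piece in text.split(sep):
        parts.extend(recursive_split(piece, max_len, separators))
    merged, current = [], ""
    for p in parts:
        candidate = (current + sep + p) if current else p
        if len(candidate) <= max_len:
            current = candidate
        else:
            merged.append(current)
            current = p
    if current:
        merged.append(current)
    return merged


chunks = recursive_split(
    "First paragraph.\n\nSecond, much longer paragraph that exceeds the limit."
)
```

Note how the splitter knows nothing about the document's semantics: chunk boundaries fall wherever the character budget runs out, which is exactly the limitation that a structure-aware representation like Docugami's XML tree is designed to avoid.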