docs/guides/dataset/advanced/extract_table_of_contents.md
Extract a PageIndex (a table of contents) from documents to support long-context RAG and improve retrieval.
During indexing, this technique uses an LLM to extract and generate chapter information, which is added to each chunk to provide sufficient global context. At the retrieval stage, it starts from the chunks matched by search, then supplements missing chunks based on the PageIndex (table of contents) structure. This addresses issues caused by chunk fragmentation and insufficient context, improving answer quality.
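The two stages described above can be illustrated with a minimal sketch. This is not the product's implementation: the `Chunk` type, the `section` field, and the `with_context` and `supplement` helpers are hypothetical names, and the sketch assumes that "supplementing" means pulling in unmatched chunks that share a table-of-contents section with a matched chunk.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: int
    section: str  # hypothetical: the table-of-contents entry this chunk falls under
    text: str


def with_context(chunk: Chunk) -> str:
    """Indexing stage: prefix the chunk text with its section title for global context."""
    return f"[{chunk.section}]\n{chunk.text}"


def supplement(matched: list[Chunk], all_chunks: list[Chunk]) -> list[Chunk]:
    """Retrieval stage: return the matched chunks plus any unmatched chunks
    from the same sections, in document order."""
    sections = {c.section for c in matched}
    return [c for c in all_chunks if c.section in sections]


# Toy document: four chunks spread over three sections.
chunks = [
    Chunk(0, "1 Overview", "What the tool does."),
    Chunk(1, "2 Setup", "Install the package."),
    Chunk(2, "2 Setup", "Then edit the config file."),
    Chunk(3, "3 Usage", "Run the command."),
]

# Suppose search matched only chunk 2; chunk 1 is added because it shares a section.
expanded = supplement([chunks[2]], chunks)
print([c.chunk_id for c in expanded])
print(with_context(chunks[1]))
```

In this toy example, a query that matches only the second half of the "2 Setup" section still retrieves the first half, which is the kind of fragmentation problem the technique targets.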
:::danger WARNING
Enabling PageIndex extraction requires significant memory, computational resources, and tokens.
:::
The system's default chat model is used to extract and generate the table of contents. Before proceeding, ensure that you have a chat model properly configured.
1. Navigate to the Configuration page.
2. Enable PageIndex.
To use this technique during retrieval, do either of the following:
PageIndex?

No. Only files parsed after you enable PageIndex will be searched using the directory enhancement feature. To apply this feature to files parsed before enabling PageIndex, you must reparse them.