For documents that cover diverse topics, vector-based semantic search can also be used to find the relevant documents. The procedure differs slightly from classic vector search: chunks are retrieved first, and their scores are then aggregated into a score per document.
Divide the documents into chunks, choose an embedding model to convert the chunks into vectors, and store each vector with its corresponding doc_id in a vector database.
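The indexing step can be sketched as follows. This is a minimal illustration, not a reference implementation: `chunk_text`, `embed`, `index_documents`, `ChunkRecord`, and `DIM` are hypothetical names, the hashed bag-of-words `embed` only stands in for a real embedding model, and a plain Python list stands in for the vector database.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 64  # size of the toy embedding (assumption; a real model fixes its own size)


@dataclass
class ChunkRecord:
    doc_id: str          # document the chunk came from
    text: str            # chunk content
    vector: list[float]  # embedding of the chunk


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def index_documents(docs: dict[str, str], max_words: int = 50) -> list[ChunkRecord]:
    """Chunk every document and keep each vector together with its doc_id."""
    store = []
    for doc_id, text in docs.items():
        for chunk in chunk_text(text, max_words):
            store.append(ChunkRecord(doc_id, chunk, embed(chunk)))
    return store


docs = {
    "doc-a": "solar panels convert sunlight into electricity " * 3,
    "doc-b": "sourdough bread needs flour water salt and time",
}
store = index_documents(docs, max_words=6)
```

The important design point is that every stored vector carries its doc_id, so chunk-level hits can later be mapped back to documents.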
For each query, run a vector search to retrieve the top-K chunks, together with the doc_id of the document each chunk belongs to.
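A brute-force version of this query step might look like the sketch below. It assumes cosine similarity as the relevance measure and an in-memory list of `(doc_id, chunk_id, vector)` tuples; `cosine` and `top_k_chunks` are hypothetical helper names, and a real vector database would typically replace the linear scan with an approximate nearest-neighbor index.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query_vec, store, k):
    """Return the k best (score, doc_id, chunk_id) hits, highest score first.

    store is a list of (doc_id, chunk_id, vector) tuples.
    """
    scored = [(cosine(query_vec, vec), doc_id, chunk_id) for doc_id, chunk_id, vec in store]
    return heapq.nlargest(k, scored, key=lambda hit: hit[0])


# Tiny 2-dimensional example: the query points along the first axis.
store = [
    ("doc-a", 0, [1.0, 0.0]),
    ("doc-a", 1, [0.8, 0.6]),
    ("doc-b", 0, [0.0, 1.0]),
    ("doc-b", 1, [0.6, 0.8]),
]
hits = top_k_chunks([1.0, 0.0], store, k=3)
```

Each hit keeps its doc_id, which is what the document-level scoring needs.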
For each document, calculate a relevance score. Let N be the number of retrieved chunks associated with the document, and let ChunkScore(n) be the relevance score of chunk n. The document score is computed as:
$$ \text{DocScore}=\frac{1}{\sqrt{N+1}}\sum_{n=1}^N \text{ChunkScore}(n) $$
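To make the formula concrete, here is a small sketch with made-up chunk scores; `doc_score` is a hypothetical helper name, and the numbers are illustrative only.

```python
import math


def doc_score(chunk_scores):
    """DocScore = sum of chunk scores divided by sqrt(N + 1), N = number of chunks."""
    n = len(chunk_scores)
    return sum(chunk_scores) / math.sqrt(n + 1)


one_strong = doc_score([0.9])              # 0.9 / sqrt(2), about 0.636
three_medium = doc_score([0.6, 0.6, 0.6])  # 1.8 / sqrt(4), about 0.9
no_chunks = doc_score([])                  # 0 / sqrt(1) = 0.0
```

A plain average would rank the single strong chunk (0.9) above the three medium chunks (0.6). Dividing by the square root of N + 1 instead lets the score grow with the number of matching chunks, but more slowly than a raw sum would, and the +1 keeps the expression defined when a document has no chunks.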
Select the documents with the highest DocScore, then use their doc_id to perform further retrieval via the PageIndex retrieval API.
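The selection step can be sketched by grouping chunk hits by doc_id, scoring each group, and keeping the best few. `rank_documents` and `top_m` are hypothetical names, the hit scores are made up, and the call to the PageIndex retrieval API is deliberately left out because its exact interface is not described here.

```python
import math
from collections import defaultdict


def rank_documents(hits, top_m):
    """Group (doc_id, chunk_score) hits by document and return the top_m
    (doc_id, DocScore) pairs, highest DocScore first."""
    by_doc = defaultdict(list)
    for doc_id, score in hits:
        by_doc[doc_id].append(score)
    scored = [
        (doc_id, sum(scores) / math.sqrt(len(scores) + 1))
        for doc_id, scores in by_doc.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_m]


# Illustrative top-K hits as (doc_id, ChunkScore) pairs.
hits = [
    ("doc-a", 0.9),
    ("doc-b", 0.6),
    ("doc-b", 0.6),
    ("doc-b", 0.6),
    ("doc-c", 0.3),
]
selected = rank_documents(hits, top_m=2)
selected_doc_ids = [doc_id for doc_id, _ in selected]
# selected_doc_ids is what would be handed to the retrieval step.
```

How many documents to keep (`top_m` here) is a tuning choice that depends on the use case.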
Contact us if you need any advice on conducting document searches for your use case.