examples/sparse_encoder/applications/retrieve_rerank/README.md
In Semantic Search we have shown how to use SparseEncoder to compute embeddings for queries, sentences, and paragraphs, and how to use these for semantic search. For complex search tasks, such as question answering retrieval, the search quality can be significantly improved by using Retrieve & Re-Rank. A detailed explanation of the same approach with dense embeddings produced by a Bi-Encoder is available here.
The Retrieve & Re-Rank approach consists of two stages:

1. Retrieval: a SparseEncoder (or another first-stage retriever) efficiently fetches a set of candidate documents from the full corpus.
2. Re-Ranking: a Cross Encoder scores each (query, candidate) pair and re-orders the candidates by relevance.

This approach combines the efficiency of first-stage retrieval with the accuracy of second-stage re-ranking, as the sketch below illustrates.
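The snippet below is a minimal sketch of the two stages put together. The model names are only examples (any SparseEncoder and CrossEncoder checkpoint will do), and a real application would retrieve the top-k candidates from a much larger corpus.

```python
from sentence_transformers import CrossEncoder, SparseEncoder

# Stage 1: sparse retrieval. Example checkpoints; swap in any SparseEncoder / CrossEncoder.
retriever = SparseEncoder("naver/splade-cocondenser-ensembledistil")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Python is a high-level programming language.",
    "The python is a large non-venomous snake.",
    "Java is a programming language and computing platform.",
]
query = "What is Python?"

corpus_embeddings = retriever.encode_document(corpus)
query_embedding = retriever.encode_query(query)

# Retrieve the top-k candidates by sparse similarity (a dot product for SPLADE-style models).
scores = retriever.similarity(query_embedding, corpus_embeddings)[0]
top_k = scores.topk(k=2).indices.tolist()

# Stage 2: re-rank the (query, candidate) pairs with the Cross Encoder.
pairs = [(query, corpus[idx]) for idx in top_k]
rerank_scores = reranker.predict(pairs)
for (_, doc), score in sorted(zip(pairs, rerank_scores), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}\t{doc}")
```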
File: retrieve_rerank_simple_wikipedia.ipynb [ Colab Version ]
This Jupyter notebook provides an interactive demonstration of retrieve & re-rank over Simple English Wikipedia as the corpus. The example allows you to enter your own queries, retrieve candidate passages with the SparseEncoder, and compare the results before and after Cross Encoder re-ranking.
File: hybrid_search.py
This script provides a complete evaluation pipeline comparing different retrieval and re-ranking approaches on a given dataset (NanoNFCorpus in this example). It includes sparse retrieval with a SparseEncoder, dense retrieval with a bi-encoder, hybrid retrieval via Reciprocal Rank Fusion (RRF), and Cross Encoder re-ranking on top of each, evaluated with standard IR metrics (NDCG@10, MRR@10, MAP).
Output: The script generates comprehensive metrics and saves results in the runs/ directory.
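The "Hybrid RRF" rows in the results below fuse the sparse and dense rankings with Reciprocal Rank Fusion. The helper below is an illustrative sketch of that fusion step, not the script's exact implementation; `k = 60` is simply the constant commonly used for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each list in `rankings` is ordered from most to least relevant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# Fuse a sparse and a dense ranking of document ids.
sparse_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d4", "d3", "d9"]
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))  # ['d1', 'd3', ...]
```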
Example results from running the hybrid search evaluation on NanoNFCorpus:
```
================================================================================
EVALUATION SUMMARY
================================================================================
METHOD                              NDCG@10     MRR@10        MAP
--------------------------------------------------------------------------------
Sparse Retrieval                      32.10      47.27      28.29
Dense Retrieval                       27.35      41.59      22.79
Sparse + Reranking                    37.35      57.19      32.12
Dense + Reranking                     37.56      58.27      31.93
Hybrid RRF                            32.62      49.63      22.51
Hybrid RRF + Reranking                36.16      55.77      26.99
================================================================================
```
Key Observations:

- Sparse retrieval clearly outperforms dense retrieval as a first stage on NanoNFCorpus (32.10 vs. 27.35 NDCG@10).
- Cross Encoder re-ranking gives a large boost to both: roughly +5 NDCG@10 points for sparse and +10 for dense retrieval.
- After re-ranking, sparse and dense retrieval end up very close (37.35 vs. 37.56 NDCG@10), with sparse keeping the best MAP.
- Hybrid RRF does not beat the stronger single retriever here, but it also benefits substantially from re-ranking.
The SparseEncoder produces embeddings independently for your paragraphs and for your search queries. You can use it like this:
```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

docs = [
    "My first paragraph. That contains information",
    "Python is a programming language.",
]
document_embeddings = model.encode_document(docs)

query = "What is Python?"
query_embedding = model.encode_query(query)
```
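These embeddings can then be scored against each other to rank the paragraphs for the query. A short continuation of the snippet above, using the model's similarity function (a dot product for SPLADE-style models):

```python
# Score the query against both documents; higher is more relevant.
similarities = model.similarity(query_embedding, document_embeddings)
print(similarities)  # tensor of shape (1, 2); the second paragraph should score highest for this query
```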
For pre-trained Sparse Encoder models, see: Pretrained Sparse-Encoders.
For pre-trained Cross Encoder models, see: MS MARCO Cross-Encoders.
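For the re-ranking stage, such a Cross Encoder can be applied directly to the retrieved documents, for example via its `rank` method. The checkpoint below is just one example from the MS MARCO collection:

```python
from sentence_transformers import CrossEncoder

# Example MS MARCO Cross Encoder checkpoint.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is Python?"
docs = [
    "My first paragraph. That contains information",
    "Python is a programming language.",
]

# rank() scores every (query, doc) pair and returns the documents sorted by relevance.
for hit in reranker.rank(query, docs, return_documents=True):
    print(f"{hit['score']:.3f}\t{hit['text']}")
```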