Back to Llama Index

load documents

docs/examples/node_postprocessor/SentenceTransformerRerank.ipynb

0.14.212.8 KB
Original Source

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_postprocessor/SentenceTransformerRerank.ipynb" target="_parent"></a>

Rerank can speed up an LLM query without sacrificing accuracy (and in fact, probably improving it). It does so by pruning away irrelevant nodes from the context.

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai
python
!pip install llama-index
python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

Download Data

python
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
python
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
python
# build index
index = VectorStoreIndex.from_documents(documents=documents)
python
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)

First, we try with reranking. We time the query to see how long it takes to process the output from the retrieved context.

python
from time import time
python
query_engine = index.as_query_engine(
    similarity_top_k=10, node_postprocessors=[rerank]
)

now = time()
response = query_engine.query(
    "Which grad schools did the author apply for and why?",
)
print(f"Elapsed: {round(time() - now, 2)}s")
python
print(response)
python
print(response.get_formatted_sources(length=200))

Next, we try without rerank

python
query_engine = index.as_query_engine(similarity_top_k=10)


now = time()
response = query_engine.query(
    "Which grad schools did the author apply for and why?",
)

print(f"Elapsed: {round(time() - now, 2)}s")
python
print(response)
python
print(response.get_formatted_sources(length=200))

As we can see, the query engine with reranking produced a much more concise output in much lower time (4s v.s. 28s). While both responses were essentially correct, the query engine without reranking included a lot of irrelevant information - a phenomenon we could attribute to "pollution of the context window".