load documents - Llama Index

Reranking can speed up an LLM query without sacrificing accuracy (and will probably improve it). It does so by pruning irrelevant nodes from the retrieved context before that context is passed to the LLM.
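To make the pruning step concrete, here is a minimal sketch of the "score, sort, keep the top N" idea in plain Python. It is not how LlamaIndex implements reranking: `overlap_score` is a hypothetical word-overlap function standing in for a real cross-encoder, and the passages are made-up strings used only for illustration.

```python
import re
from typing import Callable, List, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage.

    Hypothetical stand-in for a cross-encoder model; illustration only.
    """
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank_and_prune(
    query: str,
    passages: List[str],
    top_n: int,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[float, str]]:
    """Score every passage against the query and keep the top_n best."""
    scored = [(score_fn(query, passage), passage) for passage in passages]
    # Python's sort is stable, so tied passages keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Made-up passages standing in for retrieved nodes.
passages = [
    "The author applied to three grad schools.",
    "Painting still lifes requires patience.",
    "He chose grad schools known for AI research.",
    "The startup was funded in the summer.",
]

kept = rerank_and_prune("which grad schools", passages, top_n=2)
for score, passage in kept:
    print(f"{score:.2f}  {passage}")
```

Only the two passages that mention grad schools survive, so whatever consumes the result sees half as much text as was retrieved.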

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python

%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai

python

!pip install llama-index

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

Download Data

python

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

python

# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()

python

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

python

# build index
index = VectorStoreIndex.from_documents(documents=documents)

python

from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3
)

First, we try with reranking. We time the query to see how long the whole call takes: retrieving the context, reranking it, and generating the response.

python

from time import time

python

query_engine = index.as_query_engine(
    similarity_top_k=10, node_postprocessors=[rerank]
)

now = time()
response = query_engine.query(
    "Which grad schools did the author apply for and why?",
)
print(f"Elapsed: {round(time() - now, 2)}s")
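The timing pattern above can also be wrapped in a small reusable helper, so that both runs are measured the same way. The sketch below is one possible approach, not part of the original notebook: `timed` and `timings` are hypothetical names, and it uses `time.perf_counter`, a monotonic clock intended for measuring intervals.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        # Store the duration even if the enclosed block raises.
        results[label] = perf_counter() - start
        print(f"{label}: {results[label]:.2f}s")


timings = {}
with timed("dummy workload", timings):
    # Placeholder for a call such as query_engine.query(...).
    sum(range(100_000))
```

In the notebook, the body of the `with` block would be the `query_engine.query(...)` call, with one label per query engine.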

python

print(response)

python

print(response.get_formatted_sources(length=200))

Next, we try without reranking.

python

query_engine = index.as_query_engine(similarity_top_k=10)

now = time()
response = query_engine.query(
    "Which grad schools did the author apply for and why?",
)

print(f"Elapsed: {round(time() - now, 2)}s")

python

print(response)

python

print(response.get_formatted_sources(length=200))

As we can see, the query engine with reranking produced a much more concise answer in much less time (4s vs. 28s). Both responses were essentially correct, but the query engine without reranking included a lot of irrelevant information, a phenomenon we could attribute to "pollution of the context window".
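One rough way to see why the reranked run has less to process is to compare how much text each configuration hands to the LLM. The sketch below uses synthetic chunks of equal length and a hypothetical helper, `context_size`. Whitespace-separated words are only a crude proxy for tokens, and real chunks vary in length.

```python
from typing import List, Tuple


def context_size(chunks: List[str]) -> Tuple[int, int]:
    """Return (total characters, total whitespace-separated words)."""
    chars = sum(len(chunk) for chunk in chunks)
    words = sum(len(chunk.split()) for chunk in chunks)
    return chars, words


# Synthetic stand-ins for retrieved nodes: 10 chunks of 202 words each.
retrieved = [f"chunk {i} " + "word " * 200 for i in range(10)]
# Simulate top_n=3 by keeping three chunks (a real reranker picks by score).
reranked = retrieved[:3]

full_chars, full_words = context_size(retrieved)
pruned_chars, pruned_words = context_size(reranked)

print(f"without rerank: {full_words} words")
print(f"with rerank:    {pruned_words} words")
print(f"fraction kept:  {pruned_words / full_words:.0%}")
```

With real responses, the chunk texts could be collected from the response's source nodes, for example `[n.node.get_content() for n in response.source_nodes]`, assuming the installed LlamaIndex version exposes those attributes.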