docs/examples/node_postprocessor/FlagEmbeddingReranker.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_postprocessor/FlagEmbeddingReranker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
Reranking can speed up an LLM query without sacrificing accuracy (and in fact, likely improving it). It does so by pruning irrelevant nodes from the retrieved context before they are passed to the LLM.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai
%pip install llama-index-postprocessor-flag-embedding-reranker
!pip install llama-index
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
import os
OPENAI_API_KEY = "sk-"  # replace with your OpenAI API key
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = HuggingFaceEmbedding(
model_name="BAAI/bge-small-en-v1.5"
)
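As a quick optional sanity check (not part of the original notebook), we can confirm the embedding model loads and produces vectors of the expected dimensionality; `get_text_embedding` is the standard LlamaIndex embedding call:
# Optional sanity check: bge-small-en-v1.5 produces 384-dimensional vectors.
sample_embedding = Settings.embed_model.get_text_embedding("Hello, world!")
print(len(sample_embedding))  # expected: 384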
# build index
index = VectorStoreIndex.from_documents(documents=documents)
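Optionally, we can inspect how many nodes the essay was chunked into via the index's docstore (a quick check, not in the original notebook):
# Optional: count the nodes stored in the index.
print(len(index.docstore.docs))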
from llama_index.postprocessor.flag_embedding_reranker import (
FlagEmbeddingReranker,
)
rerank = FlagEmbeddingReranker(model="BAAI/bge-reranker-large", top_n=5)
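Before plugging the reranker into a query engine, we can optionally run it standalone on a set of retrieved nodes to see how it rescores and prunes them. This is a minimal sketch, not part of the original notebook; `postprocess_nodes` is the standard node-postprocessor entry point:
from llama_index.core import QueryBundle

# Retrieve the top-10 candidate nodes, then let the reranker keep the best 5.
retriever = index.as_retriever(similarity_top_k=10)
query = "Which grad schools did the author apply for and why?"
candidate_nodes = retriever.retrieve(query)
reranked_nodes = rerank.postprocess_nodes(candidate_nodes, QueryBundle(query))
for node in reranked_nodes:
    print(round(node.score, 3), node.node.get_content()[:80].replace("\n", " "))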
First, we query with reranking enabled. We time the query to measure how long it takes to generate the output from the retrieved context.
from time import time
query_engine = index.as_query_engine(
similarity_top_k=10, node_postprocessors=[rerank]
)
now = time()
response = query_engine.query(
"Which grad schools did the author apply for and why?",
)
print(f"Elapsed: {round(time() - now, 2)}s")
print(response)
print(response.get_formatted_sources(length=200))
Next, we run the same query without reranking.
query_engine = index.as_query_engine(similarity_top_k=10)
now = time()
response = query_engine.query(
"Which grad schools did the author apply for and why?",
)
print(f"Elapsed: {round(time() - now, 2)}s")
print(response)
print(response.get_formatted_sources(length=200))
As we can see, the query engine with reranking produced a much more concise output in less time (6s vs. 10s). While both responses were essentially correct, the query engine without reranking included a lot of irrelevant information, a phenomenon we could attribute to "pollution of the context window".