BEIR Out-of-Domain Benchmark

About BEIR:

BEIR is a heterogeneous benchmark that covers a diverse set of information retrieval (IR) tasks. It also provides a common, easy-to-use framework for evaluating your retrieval methods on the datasets in the benchmark.

Refer to the BEIR repository for the full list of supported datasets.

Here, we test the BAAI/bge-small-en-v1.5 embedding model, loaded through HuggingFaceEmbedding. We set the retriever's top_k (similarity_top_k) to 30, evaluate on the nfcorpus dataset, and report metrics at cutoffs of 3, 10, and 30.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python

%pip install llama-index-embeddings-huggingface

python

%pip install llama-index

python

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.evaluation.benchmarks import BeirEvaluator
from llama_index.core import VectorStoreIndex


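def create_retriever(documents):
    # Embed the corpus with a small Hugging Face model and build a vector index.
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
    index = VectorStoreIndex.from_documents(
        documents, embed_model=embed_model, show_progress=True
    )
    # Return the 30 most similar documents for each query.
    return index.as_retriever(similarity_top_k=30)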


BeirEvaluator().run(
    create_retriever, datasets=["nfcorpus"], metrics_k_values=[3, 10, 30]
)
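
Note that the largest value in metrics_k_values matches similarity_top_k. A metric at cutoff k can only reflect documents the retriever actually returned, so a cutoff above 30 would not look at any extra results here. The helper below is a minimal sketch of that sanity check; `unsupported_cutoffs` is a hypothetical name for illustration and is not part of LlamaIndex or BEIR.

```python
def unsupported_cutoffs(similarity_top_k, metrics_k_values):
    """Return the metric cutoffs that exceed the number of retrieved results.

    Hypothetical helper for illustration; not a LlamaIndex or BEIR function.
    """
    return [k for k in metrics_k_values if k > similarity_top_k]


# The configuration used in this notebook: every cutoff is covered.
print(unsupported_cutoffs(30, [3, 10, 30]))

# A cutoff of 100 would go beyond the 30 documents the retriever returns.
print(unsupported_cutoffs(30, [3, 10, 100]))
```

If you want to report metrics at a larger cutoff, raise similarity_top_k in create_retriever to at least that value.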

Higher is better for all the evaluation metrics.

This Towards Data Science article covers NDCG (normalized discounted cumulative gain), MAP (mean average precision), and MRR (mean reciprocal rank) in greater depth.
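
To give a feel for why higher is better, here is a minimal sketch of NDCG@k and the reciprocal rank for a single query, computed on made-up relevance labels. It is not the code that BEIR or LlamaIndex runs. It assumes linear gain, a log2 rank discount, and that the ranked list already contains every relevant document for the query, so the ideal ordering can be obtained by sorting that list. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering.

    Assumption: `relevances` holds the labels of all relevant documents,
    so sorting it yields the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


# Made-up graded relevance labels, listed in the order results were returned.
good_ranking = [2, 1, 0, 0]  # relevant documents ranked first
bad_ranking = [0, 0, 1, 2]   # relevant documents ranked last

for name, ranking in [("good", good_ranking), ("bad", bad_ranking)]:
    print(name, round(ndcg_at_k(ranking, 3), 3), round(reciprocal_rank(ranking), 3))
```

The ranking that places relevant documents first scores higher on both measures. MRR is the mean of the reciprocal rank over all queries, and the benchmark likewise reports the other metrics averaged over queries.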