docs/examples/node_postprocessor/openvino_rerank.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_postprocessor/openvino_rerank.ipynb" target="_parent">Open In Colab</a>
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. The OpenVINO™ Runtime supports various hardware devices including x86 and ARM CPUs, and Intel GPUs. It can help boost deep learning performance in Computer Vision, Automatic Speech Recognition, Natural Language Processing, and other common tasks.
Hugging Face rerank models can be run with OpenVINO through the `OpenVINORerank` class.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-postprocessor-openvino-rerank
%pip install llama-index-embeddings-openvino
!pip install llama-index
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
from llama_index.embeddings.huggingface_openvino import OpenVINOEmbedding
Export the embedding model to OpenVINO format and save it locally:
OpenVINOEmbedding.create_and_save_openvino_model(
    "BAAI/bge-small-en-v1.5", "./embedding_ov"
)
from llama_index.postprocessor.openvino_rerank import OpenVINORerank
Export the rerank model to OpenVINO format and save it locally:
OpenVINORerank.create_and_save_openvino_model(
    "BAAI/bge-reranker-large", "./rerank_ov"
)
Export the LLM to OpenVINO IR format with int4 weight compression using the Optimum Intel CLI:
!optimum-cli export openvino --model HuggingFaceH4/zephyr-7b-beta --weight-format int4 llm_ov
from llama_index.postprocessor.openvino_rerank import OpenVINORerank
from llama_index.llms.openvino import OpenVINOLLM
from llama_index.core import Settings
Settings.embed_model = OpenVINOEmbedding(model_id_or_path="./embedding_ov")
Settings.llm = OpenVINOLLM(model_id_or_path="./llm_ov")
ov_rerank = OpenVINORerank(
    model_id_or_path="./rerank_ov", device="cpu", top_n=2
)
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[ov_rerank],
)
response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
print(response)
print(response.get_formatted_sources(length=200))
For comparison, retrieve the top 2 most similar nodes directly, without the reranker:
query_engine = index.as_query_engine(
    similarity_top_k=2,
)
response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
Without the reranker, the retrieved context is irrelevant and the response is hallucinated.
print(response)
print(response.get_formatted_sources(length=200))
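The contrast between the two query engines above can be sketched in plain Python. This is a toy illustration of the retrieve-then-rerank pattern, not the OpenVINO models: the first stage selects a broad candidate set with a cheap word-overlap score, and a hypothetical second-stage "reranker" rescores those candidates with a different heuristic and keeps only `top_n`.

```python
# Two-stage retrieval sketch. Both scoring functions are toy stand-ins for a
# real embedding retriever and a real cross-encoder reranker.

def retrieve(query, corpus, top_k):
    # First stage: cheap lexical-overlap score over the whole corpus.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:top_k]

def rerank(query, candidates, top_n):
    # Second stage: rescore only the candidates with a different heuristic
    # (overlap normalized by document length), then keep the best top_n.
    def score(doc):
        terms = set(query.lower().split())
        hits = sum(1 for w in doc.lower().split() if w in terms)
        return hits / (1 + len(doc.split()))
    return sorted(candidates, key=score, reverse=True)[:top_n]

corpus = [
    "Paul Graham wrote essays about startups",
    "Sam Altman became the president of Y Combinator",
    "The essay discusses programming in Lisp",
    "Y Combinator funds early stage startups",
]

candidates = retrieve("What did Sam Altman do", corpus, top_k=3)
top = rerank("What did Sam Altman do", candidates, top_n=1)
print(top[0])
```

The key point is that the reranker only sees the small candidate set, so an expensive model (such as the OpenVINO cross-encoder above) stays affordable even over a large corpus.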
For more information, refer to the OpenVINO documentation.