docs/examples/vector_stores/qdrant_bm42.ipynb
Qdrant recently released a new lightweight approach to sparse embeddings, BM42.
In this notebook, we walk through how to use BM42 with llama-index for efficient hybrid search.
First, we need a few packages:
- llama-index
- llama-index-vector-stores-qdrant
- fastembed or fastembed-gpu
llama-index will automatically run fastembed models on GPU if the required libraries are installed. Check out fastembed's full installation guide.
%pip install llama-index llama-index-vector-stores-qdrant fastembed
To confirm the installation worked (and to confirm GPU usage, if applicable), we can run the following code.
This will first download (and cache) the model locally, and then use it to embed some sample text.
from fastembed import SparseTextEmbedding
model = SparseTextEmbedding(
model_name="Qdrant/bm42-all-minilm-l6-v2-attentions",
# if using fastembed-gpu with cuda+onnx installed
# providers=["CudaExecutionProvider"],
)
embeddings = model.embed(["hello world", "goodbye world"])

# each sparse embedding exposes parallel arrays of token indices and weights
indices, values = zip(
    *[
        (embedding.indices.tolist(), embedding.values.tolist())
        for embedding in embeddings
    ]
)
print(indices[0], values[0])
In llama-index, we can construct a hybrid index in just a few lines of code.
If you've tried hybrid search in the past with SPLADE, you will notice that this is much faster!
Here, we use llama-parse to read in the Llama2 paper! Using the JSON result mode, we can get detailed data about each page, including layout and images. For now, we will use the page numbers and text.
You can get a free API key for llama-parse by visiting https://cloud.llamaindex.ai
!mkdir -p 'data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
import nest_asyncio

# llama-parse uses async APIs under the hood; nest_asyncio lets them run in a notebook
nest_asyncio.apply()
from llama_parse import LlamaParse
from llama_index.core import Document
parser = LlamaParse(result_type="text", api_key="llx-...")
# get per-page results, along with detailed layout info and metadata
json_data = parser.get_json_result("data/llama2.pdf")
documents = []
for document_json in json_data:
for page in document_json["pages"]:
documents.append(
Document(text=page["text"], metadata={"page_number": page["page"]})
)
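As a quick sanity check, we can peek at what the parser produced before indexing (the exact text will vary with the parser's settings).

# inspect the parsed pages before building the index
print(len(documents))
print(documents[0].metadata)
print(documents[0].text[:200])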
With our documents, we can construct our index with Qdrant and BM42!
In this case, Qdrant is being hosted in a Docker container.
You can pull the latest image:
docker pull qdrant/qdrant
And then to launch:
docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
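Once the container is up, a quick request to Qdrant's REST endpoint confirms it's reachable (a minimal check, assuming the default port mapping used above).

import requests

# the root endpoint returns basic build info (name and version) when the server is up
print(requests.get("http://localhost:6333").json())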
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient("http://localhost:6333")
aclient = qdrant_client.AsyncQdrantClient("http://localhost:6333")
# delete collection if it exists
if client.collection_exists("llama2_bm42"):
client.delete_collection("llama2_bm42")
vector_store = QdrantVectorStore(
collection_name="llama2_bm42",
client=client,
aclient=aclient,
fastembed_sparse_model="Qdrant/bm42-all-minilm-l6-v2-attentions",
)
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents,
# our dense embedding model
embed_model=OpenAIEmbedding(
model_name="text-embedding-3-small", api_key="sk-proj-..."
),
storage_context=storage_context,
)
As we can see, both the dense and sparse embeddings were generated super quickly!
Even though the sparse model is running locally on CPU, it's very small and fast.
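Before wiring up a chat engine, we can sanity-check hybrid retrieval directly. This is a minimal sketch against the index built above; the top-k values are illustrative, not tuned.

retriever = index.as_retriever(
    vector_store_query_mode="hybrid",  # fuse dense and sparse (BM42) results
    similarity_top_k=2,  # dense hits to return (illustrative)
    sparse_top_k=10,  # sparse hits to fetch before fusion (illustrative)
)

nodes = retriever.retrieve("What training hardware was used for Llama2?")
for node in nodes:
    print(node.metadata["page_number"], node.score)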
Using the power of sparse embeddings, we can query for some very specific facts and get the correct data.
from llama_index.llms.openai import OpenAI
chat_engine = index.as_chat_engine(
chat_mode="condense_plus_context",
llm=OpenAI(model="gpt-4o", api_key="sk-proj-..."),
)
response = chat_engine.chat("What training hardware was used for Llama2?")
print(str(response))
response = chat_engine.chat("What is the main idea of Llama2?")
print(str(response))
response = chat_engine.chat("What was Llama2 evaluated and compared against?")
print(str(response))
With our vector index created, we can easily connect back to it!
import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient("http://localhost:6333")
aclient = qdrant_client.AsyncQdrantClient("http://localhost:6333")
vector_store = QdrantVectorStore(
collection_name="llama2_bm42",
client=client,
aclient=aclient,
fastembed_sparse_model="Qdrant/bm42-all-minilm-l6-v2-attentions",
)
loaded_index = VectorStoreIndex.from_vector_store(
vector_store,
embed_model=OpenAIEmbedding(
model="text-embedding-3-small", api_key="sk-proj-..."
),
)
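To confirm the reconnect worked, we can run a quick retrieval against the loaded index (the query is just an example).

# retrieve from the reloaded index to confirm the data is still there
retriever = loaded_index.as_retriever(
    vector_store_query_mode="hybrid", similarity_top_k=2
)
nodes = retriever.retrieve("What training hardware was used for Llama2?")
print([node.metadata["page_number"] for node in nodes])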