docs/examples/vector_stores/LanceDBIndexDemo.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/LanceDBIndexDemo.ipynb" target="_parent">Open In Colab</a>
In this notebook we show how to use LanceDB to perform vector search in LlamaIndex.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index llama-index-vector-stores-lancedb
%pip install lancedb==0.6.13  # Only required if the above cell installs an older version of lancedb (the PyPI package may not be released yet)
# Refresh vector store URI if restarting or re-using the same notebook
! rm -rf ./lancedb
import logging
import sys
# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.lancedb import LanceDBVectorStore
import textwrap
The first step is to configure the OpenAI key. It will be used to create embeddings for the documents loaded into the index.
import openai
openai.api_key = "sk-"
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Load the documents stored in the data/paul_graham/ directory using the SimpleDirectoryReader.
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print("Document ID:", documents[0].doc_id, "Document Hash:", documents[0].hash)
Here we create an index backed by LanceDB using the documents loaded previously. LanceDBVectorStore takes a few arguments.
uri (str, required): Location where LanceDB will store its files.
table_name (str, optional): The table name where the embeddings will be stored. Defaults to "vectors".
nprobes (int, optional): The number of probes used. A higher number makes search more accurate but also slower. Defaults to 20.
refine_factor (int, optional): Refine the results by reading extra elements and re-ranking them in memory. Defaults to None.
More details can be found in the LanceDB docs.
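As a sketch, the tuning parameters above might be collected like this (the values here are illustrative examples, not recommendations):

```python
# Illustrative keyword arguments for LanceDBVectorStore; only `uri`
# is required, the rest are the documented optional tunables.
lancedb_kwargs = dict(
    uri="./lancedb",       # where LanceDB stores its files
    table_name="vectors",  # table holding the embeddings (the default)
    nprobes=20,            # higher = more accurate but slower search
    refine_factor=10,      # re-read and re-rank extra candidates in memory
)
# vector_store = LanceDBVectorStore(**lancedb_kwargs)
```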
For LanceDB Cloud:
vector_store = LanceDBVectorStore(
    uri="db://db_name",  # your remote DB URI
    api_key="sk_..",  # lancedb cloud api key
    region="your-region",  # the region you configured
    ...
)
For a local setup:
vector_store = LanceDBVectorStore(
uri="./lancedb", mode="overwrite", query_type="hybrid"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
We can now ask questions using our index. We can filter results via MetadataFilters or use a native Lance where clause.
from llama_index.core.vector_stores import (
MetadataFilters,
FilterOperator,
FilterCondition,
MetadataFilter,
)
from datetime import datetime
query_filters = MetadataFilters(
filters=[
MetadataFilter(
key="creation_date",
operator=FilterOperator.EQ,
value=datetime.now().strftime("%Y-%m-%d"),
),
MetadataFilter(
key="file_size", value=75040, operator=FilterOperator.GT
),
],
condition=FilterCondition.AND,
)
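Purely as an illustration, the AND-combined filters above admit a node only when both conditions hold on its metadata; a plain-Python predicate mimicking that selection (the function name is hypothetical, but the metadata keys are the ones SimpleDirectoryReader populates):

```python
from datetime import datetime

# Illustrative only: mimics what the AND-combined MetadataFilters
# above would select. Not part of LlamaIndex.
def passes_filters(metadata: dict) -> bool:
    today = datetime.now().strftime("%Y-%m-%d")
    return (
        metadata.get("creation_date") == today  # FilterOperator.EQ
        and metadata.get("file_size", 0) > 75040  # FilterOperator.GT
    )

print(passes_filters({"creation_date": datetime.now().strftime("%Y-%m-%d"), "file_size": 80000}))
```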
LanceDB offers hybrid search with reranking capabilities. For complete documentation, refer to the LanceDB docs.
This example uses the ColBERT reranker. The following cell installs the necessary dependencies. If you choose a different reranker, adjust the dependencies accordingly.
! pip install -U torch transformers tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985
If you want to add a reranker at vector store initialization, you can pass it in the arguments as shown below:
from lancedb.rerankers import ColbertReranker
reranker = ColbertReranker()
vector_store = LanceDBVectorStore(uri="./lancedb", reranker=reranker, mode="overwrite")
import lancedb
from lancedb.rerankers import ColbertReranker
reranker = ColbertReranker()
vector_store._add_reranker(reranker)
query_engine = index.as_query_engine(
filters=query_filters,
# vector_store_kwargs={
# "query_type": "fts",
# },
)
response = query_engine.query("How much did Viaweb charge per month?")
print(response)
print("metadata -", response.metadata)
Using a native Lance where clause:
lance_filter = "metadata.file_name = 'paul_graham_essay.txt' "
retriever = index.as_retriever(vector_store_kwargs={"where": lance_filter})
response = retriever.retrieve("What did the author do growing up?")
print(response[0].get_content())
print("metadata -", response[0].metadata)
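Lance where clauses are plain SQL predicates over the table's columns, so filter strings like the one above can be composed programmatically. A hypothetical helper (not part of LlamaIndex or LanceDB) sketching this for simple equality conditions:

```python
# Hypothetical helper that composes a Lance SQL-style where clause
# from keyword arguments, matching the filter string used above.
def lance_where(**conditions: str) -> str:
    return " AND ".join(
        f"metadata.{key} = '{value}'" for key, value in conditions.items()
    )

print(lance_where(file_name="paul_graham_essay.txt"))
```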
You can also add data to an existing index
nodes = [node.node for node in response]
del index
index = VectorStoreIndex.from_documents(
[Document(text="The sky is purple in Portland, Maine")],
uri="/tmp/new_dataset",
)
index.insert_nodes(nodes)
query_engine = index.as_query_engine()
response = query_engine.query("Where is the sky purple?")
print(textwrap.fill(str(response), 100))
You can also create an index from an existing table
del index
vec_store = LanceDBVectorStore.from_table(vector_store._table)
index = VectorStoreIndex.from_vector_store(vec_store)
query_engine = index.as_query_engine()
response = query_engine.query("What companies did the author start?")
print(textwrap.fill(str(response), 100))