LanceDB Vector Store

This notebook shows how to use LanceDB to perform vector searches in LlamaIndex.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index llama-index-vector-stores-lancedb
python
# Only required if the cell above installs an older version of lancedb
# (the PyPI package may not have been released yet)
%pip install lancedb==0.6.13
python
# Refresh vector store URI if restarting or re-using the same notebook
! rm -rf ./lancedb
python
import logging
import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.lancedb import LanceDBVectorStore
import textwrap

Setup OpenAI

The first step is to configure the OpenAI API key. It will be used to create embeddings for the documents loaded into the index.

python
import openai

openai.api_key = "sk-"
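
Hardcoding a key in a notebook makes it easy to leak. A minimal sketch of one alternative, assuming the key has been exported in the OPENAI_API_KEY environment variable; read_openai_key is a hypothetical helper, not part of openai or LlamaIndex:

```python
import os


def read_openai_key(env=None):
    """Return the OpenAI API key from the environment, or raise if it is missing.

    Hypothetical helper for this notebook; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running this notebook")
    return key


# Usage in the cell above (commented out so this sketch runs without a key):
# openai.api_key = read_openai_key()
```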

Download Data

python
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Loading documents

Load the documents stored in data/paul_graham/ using SimpleDirectoryReader.

python
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print("Document ID:", documents[0].doc_id, "Document Hash:", documents[0].hash)

Create the index

Here we create an index backed by LanceDB using the documents loaded previously. LanceDBVectorStore takes a few arguments.

  • uri (str, required): Location where LanceDB will store its files.

  • table_name (str, optional): The table name where the embeddings will be stored. Defaults to "vectors".

  • nprobes (int, optional): The number of probes used. A higher number makes search more accurate but also slower. Defaults to 20.

  • refine_factor (int, optional): Refine the results by reading extra elements and re-ranking them in memory. Defaults to None.

  • More details can be found in the LanceDB docs.
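
To build intuition for nprobes, here is a toy, pure-Python sketch of partitioned (IVF-style) search. It is an illustration under simplifying assumptions (2-D points, hand-picked centroids, hypothetical helper names), not LanceDB's implementation: vectors are grouped by their nearest centroid, and a query scans only the nprobes partitions whose centroids are closest to it.

```python
import math


def build_partitions(points, centroids):
    """Assign each point to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(
            range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
        )
        partitions[nearest].append(p)
    return partitions


def search(query, centroids, partitions, nprobes, k=1):
    """Scan only the `nprobes` partitions closest to the query; return the top k."""
    probe_order = sorted(
        range(len(centroids)), key=lambda i: math.dist(query, centroids[i])
    )
    candidates = [p for i in probe_order[:nprobes] for p in partitions[i]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:k]


# Toy data: two partitions along the x-axis.
centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(0.0, 0.0), (4.9, 0.0), (6.0, 0.0), (10.0, 0.0)]
partitions = build_partitions(points, centroids)
query = (5.2, 0.0)

print(search(query, centroids, partitions, nprobes=1))  # scans one partition
print(search(query, centroids, partitions, nprobes=2))  # scans both partitions
```

With nprobes=1 the query misses (4.9, 0.0), its true nearest neighbor, because that point lives in the partition that was not scanned. With nprobes=2 it is found, at the cost of comparing against more points.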

For LanceDB Cloud:
python
vector_store = LanceDBVectorStore(
    uri="db://db_name",  # your remote DB URI
    api_key="sk_..",  # LanceDB Cloud API key
    region="your-region",  # the region you configured
    # ...
)

python
vector_store = LanceDBVectorStore(
    uri="./lancedb", mode="overwrite", query_type="hybrid"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Query the index

We can now ask questions using our index. Results can be filtered either with MetadataFilters or with a native Lance where clause.

python
from llama_index.core.vector_stores import (
    MetadataFilters,
    FilterOperator,
    FilterCondition,
    MetadataFilter,
)

from datetime import datetime


query_filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="creation_date",
            operator=FilterOperator.EQ,
            value=datetime.now().strftime("%Y-%m-%d"),
        ),
        MetadataFilter(
            key="file_size", value=75040, operator=FilterOperator.GT
        ),
    ],
    condition=FilterCondition.AND,
)
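
The semantics of the two filters above can be sketched in plain Python. This only illustrates what EQ, GT, and AND mean for a node's metadata dictionary; the matches helper and the sample metadata are hypothetical, and this is not how LlamaIndex or LanceDB evaluates filters internally.

```python
import operator
from datetime import datetime

# Comparison functions for the two operators used above.
OPERATORS = {"==": operator.eq, ">": operator.gt}


def matches(metadata, filters, condition="and"):
    """Return True if `metadata` satisfies the (key, op, value) filters."""
    results = [
        key in metadata and OPERATORS[op](metadata[key], value)
        for key, op, value in filters
    ]
    return all(results) if condition == "and" else any(results)


today = datetime.now().strftime("%Y-%m-%d")
filters = [("creation_date", "==", today), ("file_size", ">", 75040)]

# Hypothetical metadata for two nodes.
large_file = {"creation_date": today, "file_size": 75042}
small_file = {"creation_date": today, "file_size": 1024}

print(matches(large_file, filters))  # satisfies both filters
print(matches(small_file, filters))  # fails the file_size filter
```

Note that the creation_date filter compares against today's date, so it only matches files whose creation date is the day the notebook runs.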

LanceDB offers hybrid search with reranking capabilities. For complete documentation, refer to the LanceDB docs.
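
As a rough picture of what a hybrid reranker does, the sketch below merges a vector-search ranking and a full-text ranking with reciprocal rank fusion (RRF). RRF is swapped in here because it fits in a few lines; it is not the ColBERT reranker, and the document IDs and the rrf helper are made up for illustration.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists of IDs; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search order
fts_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical full-text-search order

print(rrf([vector_hits, fts_hits]))
```

Here doc_b comes out on top because it sits near the top of both lists, while documents found by only one search fall to the bottom.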

This notebook uses the ColBERT reranker. The following cell installs the dependencies it needs. If you choose a different reranker, adjust the dependencies accordingly.

python
! pip install -U torch transformers tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985

If you want to add a reranker at vector store initialization, you can pass it as an argument:

python
from lancedb.rerankers import ColbertReranker

reranker = ColbertReranker()
vector_store = LanceDBVectorStore(
    uri="./lancedb", reranker=reranker, mode="overwrite"
)
python
import lancedb
python
from lancedb.rerankers import ColbertReranker

reranker = ColbertReranker()
vector_store._add_reranker(reranker)

query_engine = index.as_query_engine(
    filters=query_filters,
    # vector_store_kwargs={
    #     "query_type": "fts",
    # },
)

response = query_engine.query("How much did Viaweb charge per month?")
python
print(response)
print("metadata -", response.metadata)

Lance filters (SQL-like) can also be applied directly via the where clause:
python
lance_filter = "metadata.file_name = 'paul_graham_essay.txt' "
retriever = index.as_retriever(vector_store_kwargs={"where": lance_filter})
response = retriever.retrieve("What did the author do growing up?")
python
print(response[0].get_content())
print("metadata -", response[0].metadata)
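
When the filter value comes from a variable, it helps to build the where string in one place. A minimal sketch, assuming the where clause follows standard SQL string quoting (single quotes doubled inside a literal); eq_filter is a hypothetical helper, not a LanceDB or LlamaIndex function:

```python
def eq_filter(column, value):
    """Build a `column = 'value'` clause, doubling single quotes in the value."""
    escaped = str(value).replace("'", "''")
    return f"{column} = '{escaped}'"


lance_filter = eq_filter("metadata.file_name", "paul_graham_essay.txt")
print(lance_filter)

# The resulting string can then be passed as before:
# retriever = index.as_retriever(vector_store_kwargs={"where": lance_filter})
```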

Appending data

You can also add data to an existing index.

python
nodes = [node.node for node in response]
python
del index

index = VectorStoreIndex.from_documents(
    [Document(text="The sky is purple in Portland, Maine")],
    uri="/tmp/new_dataset",
)
python
index.insert_nodes(nodes)
python
query_engine = index.as_query_engine()
response = query_engine.query("Where is the sky purple?")
print(textwrap.fill(str(response), 100))

You can also create an index from an existing table.

python
del index

vec_store = LanceDBVectorStore.from_table(vector_store._table)
index = VectorStoreIndex.from_vector_store(vec_store)
python
query_engine = index.as_query_engine()
response = query_engine.query("What companies did the author start?")
print(textwrap.fill(str(response), 100))