DuckDB

DuckDB is a fast, in-process analytical database, released under the MIT license.

In this notebook we show how to use DuckDB as a vector store in LlamaIndex.

Install DuckDB with:

sh
pip install duckdb

Make sure to use the latest DuckDB version (>= 0.10.0).
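You can check the installed version from Python; the duckdb module exposes a __version__ attribute:

python
import duckdb

# DuckDBVectorStore expects DuckDB >= 0.10.0
print(duckdb.__version__)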

You can run DuckDB in different modes depending on persistence:

  • in-memory is the default mode: the database is created in memory. You can force this by setting database_name = ":memory:" when initializing the vector store.
  • persistent mode is selected by giving the database a file name, e.g. database_name = "my_vector_store.duckdb"; the database is then persisted to the default persist_dir or to the one you set. Both modes are sketched below.
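A minimal sketch of the two modes, using the same constructor arguments that appear in the persistence example later in this notebook:

python
from llama_index.vector_stores.duckdb import DuckDBVectorStore

# In-memory (the default): nothing is written to disk
in_memory_store = DuckDBVectorStore(database_name=":memory:")

# Persistent: the database file lands at ./persist/my_vector_store.duckdb
persistent_store = DuckDBVectorStore(
    database_name="my_vector_store.duckdb", persist_dir="./persist/"
)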

With the vector store created, you can use the following operations (a rough sketch of the interface follows the list):

  • .add
  • .get
  • .update
  • .upsert
  • .delete
  • .peek
  • .query to run a search
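As a rough sketch of this low-level interface: .add ingests nodes that already carry embeddings, and .query takes a VectorStoreQuery from LlamaIndex's core types. The 1536-dimensional dummy vector below is only a stand-in for a real embedding (1536 is the size of OpenAI's default embeddings):

python
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.vector_stores.duckdb import DuckDBVectorStore

store = DuckDBVectorStore()

# .add stores nodes (text + embedding) and returns their IDs
node = TextNode(
    text="DuckDB is an in-process database.", embedding=[0.1] * 1536
)
node_ids = store.add([node])

# .query runs a similarity search against a query embedding
result = store.query(
    VectorStoreQuery(query_embedding=[0.1] * 1536, similarity_top_k=1)
)
print(result.nodes, result.similarities)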

Basic example

In this basic example, we take the Paul Graham essay, split it into chunks, embed it, load it into DuckDBVectorStore, and then query it.

For the embedding model we will use OpenAI.

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

python
!pip install llama-index

Creating a DuckDB Index

python
!pip install duckdb
!pip install llama-index-vector-stores-duckdb
python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from llama_index.core import StorageContext

from IPython.display import Markdown, display
python
# Setup OpenAI API
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

Download and prepare the sample dataset

python
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
python
documents = SimpleDirectoryReader("data/paul_graham/").load_data()

vector_store = DuckDBVectorStore()
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
python
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

Persisting to disk example

Extending the previous example, if you want to save to disk, simply initialize the DuckDBVectorStore by specifying a database name and persist directory.

python
# Save to disk
documents = SimpleDirectoryReader("data/paul_graham/").load_data()

vector_store = DuckDBVectorStore("pg.duckdb", persist_dir="./persist/")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
python
# Load from disk
vector_store = DuckDBVectorStore.from_local("./persist/pg.duckdb")
index = VectorStoreIndex.from_vector_store(vector_store)
python
# Query Data
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

Metadata filter example

It is possible to narrow down the search space by filtering on metadata. Below is an example that shows this in practice.

python
from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        **{
            "text": "The Shawshank Redemption",
            "metadata": {
                "author": "Stephen King",
                "theme": "Friendship",
                "year": 1994,
                "ref_doc_id": "doc_1",
            },
        }
    ),
    TextNode(
        **{
            "text": "The Godfather",
            "metadata": {
                "director": "Francis Ford Coppola",
                "theme": "Mafia",
                "year": 1972,
                "ref_doc_id": "doc_1",
            },
        }
    ),
    TextNode(
        **{
            "text": "Inception",
            "metadata": {
                "director": "Christopher Nolan",
                "theme": "Sci-fi",
                "year": 2010,
                "ref_doc_id": "doc_2",
            },
        }
    ),
]

vector_store = DuckDBVectorStore()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)

Define the metadata filters.

python
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="theme", value="Mafia")]
)

Use the index as a retriever to apply the metadata filters.

python
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is inception about?")