ClickHouse Vector Store

In this notebook we are going to show a quick demo of using the ClickHouseVectorStore.

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

python

!pip install llama-index
!pip install clickhouse_connect

Creating a ClickHouse Client

python

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

python

from os import environ
import clickhouse_connect

environ["OPENAI_API_KEY"] = "sk-*"

# initialize client
client = clickhouse_connect.get_client(
    host="localhost",
    port=8123,
    username="default",
    password="",
)

Load documents, build and store the VectorStoreIndex with ClickHouseVectorStore

Here we will use a set of Paul Graham essays to provide the text to turn into embeddings, store in a ClickHouseVectorStore and query to find context for our LLM QnA loop.

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.clickhouse import ClickHouseVectorStore

python

# load documents
documents = SimpleDirectoryReader("../data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
print("Number of Documents: ", len(documents))

Download Data

python

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

You can process your files individually using SimpleDirectoryReader:

python

loader = SimpleDirectoryReader("./data/paul_graham/")
documents = loader.load_data()
for file in loader.input_files:
    print(file)
    # Here is where you would do any preprocessing

python

# initialize with metadata filter and store indexes
from llama_index.core import StorageContext

for document in documents:
    document.metadata = {"user_id": "123", "favorite_color": "blue"}
vector_store = ClickHouseVectorStore(clickhouse_client=client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Query Index

Now ClickHouse vector store supports filter search and hybrid search

You can learn more about query_engine and retriever.

python

import textwrap

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# set Logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(key="user_id", value="123"),
        ]
    ),
    similarity_top_k=2,
    vector_store_query_mode="hybrid",
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

Clear All Indexes

python

for document in documents:
    index.delete_ref_doc(document.doc_id)