Back to Llama Index

TiDB Vector Store

docs/examples/vector_stores/TiDBVector.ipynb

0.14.215.4 KB
Original Source

TiDB Vector Store

TiDB Cloud, is a comprehensive Database-as-a-Service (DBaaS) solution, that provides dedicated and serverless options. TiDB Serverless is now integrating a built-in vector search into the MySQL landscape. With this enhancement, you can seamlessly develop AI applications using TiDB Serverless without the need for a new database or additional technical stacks. Create a free TiDB Serverless cluster and start using the vector search feature at https://pingcap.com/ai.

This notebook provides a detailed guide on utilizing the tidb vector search in LlamaIndex.

Setting up environments

python
%pip install llama-index-vector-stores-tidbvector
%pip install llama-index
python
import textwrap

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.tidbvector import TiDBVectorStore

Configuring your OpenAI Key

python
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("Input your OpenAI API key:")

Configure TiDB connection setting that you will need. To connect to your TiDB Cloud Cluster, follow these steps:

  • Go to your TiDB Cloud cluster Console and navigate to the Connect page.
  • Select the option to connect using SQLAlchemy with PyMySQL, and copy the provided connection URL (without password).
  • Paste the connection URL into your code, replacing the tidb_connection_string_template variable.
  • Type your password.
python
# replace with your tidb connect string from tidb cloud console
tidb_connection_string_template = "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
# type your tidb password
tidb_password = getpass.getpass("Input your TiDB password:")
tidb_connection_url = tidb_connection_string_template.replace(
    "<PASSWORD>", tidb_password
)

Prepare data that used to show case

python
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
python
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
for index, document in enumerate(documents):
    document.metadata = {"book": "paul_graham"}

Create TiDB Vectore Store

The code snippet below creates a table named VECTOR_TABLE_NAME in TiDB, optimized for vector searching. Upon successful execution of this code, you will be able to view and access the VECTOR_TABLE_NAME table directly within your TiDB database environment

python
VECTOR_TABLE_NAME = "paul_graham_test"
tidbvec = TiDBVectorStore(
    connection_string=tidb_connection_url,
    table_name=VECTOR_TABLE_NAME,
    distance_strategy="cosine",
    vector_dimension=1536,
    drop_existing_table=False,
)

Create a query engine based on tidb vectore store

python
storage_context = StorageContext.from_defaults(vector_store=tidbvec)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)

Note: If you encounter errors during this process due to the MySQL protocol’s packet size limitation, such as when trying to insert a large number of vectors (e.g., 2000 rows) , you can mitigate this issue by splitting the insertion into smaller batches. For example, you can set the insert_batch_size parameter to a smaller value (e.g., 1000) to avoid exceeding the packet size limit, ensuring smooth insertion of your data into the TiDB vector store:

python
storage_context = StorageContext.from_defaults(vector_store=tidbvec)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, insert_batch_size=1000, show_progress=True
)

This section focus on vector search basics and refining results using metadata filters. Please note that tidb vector only supports Deafult VectorStoreQueryMode.

python
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))

Filter with metadata

perform searches using metadata filters to retrieve a specific number of nearest-neighbor results that align with the applied filters.

python
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="!="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

Query again

python
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="=="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))

Delete documents

python
tidbvec.delete(documents[0].doc_id)

Check whether the documents had been deleted

python
query_engine = index.as_query_engine()
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))