docs/mindsdb_sql/knowledge_bases/insert_data.mdx
Knowledge Bases (KBs) organize data across data sources, including databases, files, documents, webpages, enabling efficient search capabilities.
Here is what happens to data when it is inserted into the knowledge base.
<p align="center"> </p>Upon inserting data into the knowledge base, it is split into chunks, transformed into the embedding representation to enhance the search capabilities, and stored in a vector database.
INSERT INTO SyntaxHere is the syntax for inserting data into a knowledge base:
INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders;
Upon execution, it inserts data into a knowledge base, using the embedding model to embed it into vectors before inserting into an underlying vector database.
<Tip> The status of the `INSERT INTO` is logged in the `information_schema.queries` table with the timestamp when it was ran, and can be queried as follows:SELECT *
FROM information_schema.queries;
To enable default batch inserts for PGVector, set the DISABLE_PGVECTOR_AUTOBATCH environment variable or the knowledge_bases.disable_pgvector_autobatch configuration variable to false (it is set to true by default).
To disable default batch inserts, set the DISABLE_AUTOBATCH environment variable or the knowledge_bases.disable_autobatch configuration variable to true (it is set to false by default).
</Note>
Skip duplicate checking (kb_no_upsert)
INSERT INTO my_kb
SELECT *
FROM table_name
USING kb_no_upsert = true;
This skips all duplicate checking and directly inserts data. Use only when the knowledge base is empty (initial data load).
Skip existing items (kb_skip_existing)
INSERT INTO my_kb
SELECT *
FROM table_name
USING kb_skip_existing = true;
This checks for existing items and skips them entirely, including avoiding embedding calculation for existing content. More efficient than upsert when you only want to insert new items. </Info>
<Note> **Handling duplicate data while inserting into the knowledge base**Knowledge bases uniquely identify data rows using an ID column, which prevents from inserting duplicate data, as follows.
Case 1: Inserting data into the knowledge base without the id_column defined.
When users do not define the id_column during the creation of a knowledge base, MindsDB generates the ID for each row using a hash of the content columns, as explained here.
Example:
If two rows have exactly the same content in the content columns, their hash (and thus their generated ID) will be the same.
Note that duplicate rows are skipped and not inserted.
Since both rows in the below table have the same content, only one row will be inserted.
| name | age |
|---|---|
| Alice | 25 |
| Alice | 25 |
Case 2: Inserting data into the knowledge base with the id_column defined.
When users define the id_column during the creation of a knowledge base, then the knowledge base uses that column's values as the row ID.
Example:
If the id_column has duplicate values, the knowledge base skips the duplicate row(s) during the insert.
The second row in the below table has the same id as the first row, so only one of these rows is inserted.
| id | name | age |
|---|---|---|
| 1 | Alice | 25 |
| 1 | Bob | 30 |
Best practice
Ensure the id_column uniquely identifies each row to avoid unintentional data loss due to duplicate ID skipping.
Performance optimization for duplicate handling
For better performance when handling duplicates, you can use:
kb_skip_existing = true: Checks for existing IDs and skips them completely (no embedding calculation, more efficient)kb_no_upsert = true: Skips duplicate checking entirely (fastest, use only for initial load into empty KB)In order to update existing data in the knowledge base, insert data with the column ID that you want to update and the updated content.
Here is an example of usage. A knowledge base stores the following data.
+----------+-------------------+------------------------+
| order_id | product | notes |
+----------+-------------------+------------------------+
| A1B | Wireless Mouse | Request color: black |
| 3XZ | Bluetooth Speaker | Gift wrap requested |
| Q7P | Laptop Stand | Prefer aluminum finish |
+----------+-------------------+------------------------+
A user updated Laptop Stand to Aluminum Laptop Stand.
+----------+-----------------------+------------------------+
| order_id | product | notes |
+----------+-----------------------+------------------------+
| A1B | Wireless Mouse | Request color: black |
| 3XZ | Bluetooth Speaker | Gift wrap requested |
| Q7P | Aluminum Laptop Stand | Prefer aluminum finish |
+----------+-----------------------+------------------------+
Here is how to propagate this change into the knowledge base.
INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders
WHERE order_id = 'Q7P';
The knowledge base matches the ID value to the existing one and updates the data if required.
In order to optimize the performance of data insertion into the knowledge base, users can set up partitions and threads to insert batches of data in parallel. This also enables tracking the progress of data insertion process including cancelling and resuming it if required.
Here is an example.
INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders
USING
batch_size = 200,
track_column = order_id,
threads = 10,
error = 'skip';
The parameters include the following:
batch_size defines the number of rows fetched per iteration to optimize data extraction from the source. It defaults to 1000.
threads defines threads for running partitions. Note that if the ML task queue is enabled, threads are used automatically. The available values for threads are:
threads = 10,threads = true, or disable threads, setting threads = false.track_column defines the column used for sorting data before partitioning.
error defines the error processing options. The available values include raise, used to raise errors as they come, or skip, used to subside errors. It defaults to raise if not provided.
After executing the INSERT INTO statement with the above parameters, users can view the data insertion progress by querying the information_schema.queries table.
SELECT * FROM information_schema.queries;
Users can cancel the data insertion process using the process ID from the information_schema.queries table.
SELECT query_cancel(1);
Note that canceling the query will not remove the already inserted data.
Users can resume the data insertion process using the process ID from the information_schema.queries table.
SELECT query_resume(1);
Upon inserting data into the knowledge base, the data chunking is performed in order to optimize the storage and search of data.
Each chunk is identified by its chunk ID of the following format: <id>:<chunk_number>of<total_chunks>:<start_char_number>to<end_char_number>.
Users can opt for defining the chunking parameters when creating a knowledge base.
CREATE KNOWLEDGE_BASE my_kb
USING
...
preprocessing = {
"text_chunking_config" : {
"chunk_size": 2000,
"chunk_overlap": 200
}
},
...;
The chunk_size parameter defines the size of the chunk as the number of characters. And the chunk_overlap parameter defines the number of characters that should overlap between subsequent chunks.
Users can opt for defining the chunking parameters specifically for JSON data.
CREATE KNOWLEDGE_BASE my_kb
USING
...
preprocessing = {
"type": "json_chunking",
"json_chunking_config" : {
...
}
},
...;
When the type of chunking is set to json_chunking, users can configure it by setting the following parameter values in the json_chunking_config parameter:
flatten_nested
It is of the bool data type with the default value of True.
It defines whether to flatten nested JSON structures.
include_metadata
It is of the bool data type with the default value of True.
It defines whether to include original metadata in chunks.
chunk_by_object
It is of the bool data type with the default value of True.
It defines whether to chunk by top-level objects (True) or create a single document (False).
exclude_fields
It is of the List[str] data type with the default value of an empty list.
It defines the list of fields to exclude from chunking.
include_fields
It is of the List[str] data type with the default value of an empty list.
It defines the list of fields to include in chunking (if empty, all fields except excluded ones are included).
metadata_fields
It is of the List[str] data type with the default value of an empty list.
It defines the list of fields to extract into metadata for filtering (can include nested fields using dot notation). If empty, all primitive fields will be extracted (top-level fields if available, otherwise all primitive fields in the flattened structure).
extract_all_primitives
It is of the bool data type with the default value of False.
It defines whether to extract all primitive values (strings, numbers, booleans) into metadata.
nested_delimiter
It is of the str data type with the default value of ".".
It defines the delimiter for flattened nested field names.
content_column
It is of the str data type with the default value of "content".
It defines the name of the content column for chunk ID generation.
Each knowledge base has its underlying vector store that stores data inserted into the knowledge base in the form of embeddings.
Users can query the underlying vector store as follows.
SELECT id, content, metadata, embeddings
FROM <kb_name>_chromadb.storage_table;
SELECT id, content, metadata, embeddings
FROM <vector_store_connection_name>.<table_name>;
Here a sample knowledge base created in the previous Example section is inserted into.
INSERT INTO my_kb
SELECT order_id, product, notes
FROM sample_data.orders;
CREATE KNOWLEDGE_BASE my_kb
USING
...
id_column = 'order_id',
...
INSERT INTO my_kb
SELECT order_id, notes AS content
FROM sample_data.orders;
DELETE FROM SyntaxHere is the syntax for deleting from a knowledge base:
DELETE FROM my_kb
WHERE id = 'A1B';
CREATE INDEX ON KNOWLEDGE_BASE SyntaxUsers can create index on the knowledge base to speed up the search operations.
CREATE INDEX ON KNOWLEDGE_BASE my_kb;
Upon executing this statement, an index is created on the knowledge base's underlying vector store. This is essentially a database index created on the vector database.
Note that having an index on the knowledge base may reduce the speed of the insert operations. Therefore, it is recommended to insert bulk data into the knowledge base before creating an index. The index improves performance of querying the knowledge base, while it may slow down subsequent data inserts.