docs/design/2024-07-12-support-vector-index.md
A vector index is a time- and space-efficient data structure built over vector data, based on a mathematical quantization model, that efficiently retrieves the vectors most similar to a target vector. Currently, we plan to support the HNSW family of vector indexing methods.
This document proposes handling vector indexes in the same way as ordinary indexes. The specific functions are as follows:
The process of adding a vector index is similar to that of adding an ordinary index. However, since the actual vector index data is added to TiFlash, there is no process for populating the index data to TiKV. The following figure takes the add vector index operation as an example and briefly describes its execution process.
Syntax required to create a vector index:
CREATE TABLE foo (
id INT PRIMARY KEY,
data VECTOR(5),
data64 VECTOR64(10),
-- "WITH OPTION" is not supported in Phase 1
VECTOR INDEX idx_name USING HNSW ((VEC_COSINE_DISTANCE(data))) [WITH OPTION "m=16, ef_construction=64"]
);
IndexKeyTypeOpt adds the VECTOR index type, and IndexTypeOpt adds the HNSW option type.
Add a vector index after creating the table:
CREATE VECTOR INDEX idx_name USING HNSW ON foo ((VEC_COSINE_DISTANCE(data)))
-- Proposal 1, "WITH OPTION" is not supported in Phase 1
[WITH OPTION "m=16, ef_construction=64"];
-- Proposal 2, "VECTOR_INDEX_PARAM" is not supported in Phase 1
[VECTOR_INDEX_PARAM "m=16, ef_construction=64"];
ALTER TABLE foo ADD VECTOR INDEX idx_name USING HNSW ((VEC_COSINE_DISTANCE(data)))
-- "WITH OPTION" is not supported in Phase 1
[WITH OPTION "m=16, ef_construction=64"];
CREATE VECTOR INDEX idx ON t ((VEC_COSINE_DISTANCE(a))) USING HNSW;
CREATE VECTOR INDEX IF NOT EXISTS idx ON t ((VEC_COSINE_DISTANCE(a))) TYPE HNSW;
CREATE VECTOR INDEX ident ON db.t (ident, ident ASC ) TYPE HNSW;
ALTER TABLE t ADD VECTOR ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
ALTER TABLE t ADD VECTOR INDEX ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
ALTER TABLE t ADD VECTOR INDEX IF NOT EXISTS ((VEC_COSINE_DISTANCE(a))) USING HNSW COMMENT 'a';
USE INDEX hint.
SELECT *
FROM foo
ORDER BY VEC_COSINE_DISTANCE(data, '[3,1,2]')
LIMIT 5;
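For reference, the query above asks for the 5 rows nearest to the target vector. A vector index answers it approximately; the exact result it approximates can be sketched as a full scan followed by a sort. This is a minimal illustration, not TiDB or TiFlash code: `Row`, `cosineDistance`, and `exactTopK` are hypothetical names, and treating a zero-norm vector as infinitely distant is an assumption made only so the sort stays well defined.

```go
package main

import (
	"math"
	"sort"
)

// Row is a hypothetical in-memory stand-in for one row of table foo.
type Row struct {
	ID   int64
	Data []float32
}

// cosineDistance returns 1 - cos(a, b). It assumes len(a) == len(b).
// A zero-norm input yields +Inf so that it sorts last (sketch-only choice).
func cosineDistance(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return math.Inf(1)
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// exactTopK scans every row, orders the rows by distance to target,
// and returns the k nearest. The input slice is left unmodified.
func exactTopK(rows []Row, target []float32, k int) []Row {
	sorted := append([]Row(nil), rows...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return cosineDistance(sorted[i].Data, target) < cosineDistance(sorted[j].Data, target)
	})
	if k < 0 {
		k = 0
	}
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}
```

The full scan costs time linear in the table size per query, which is the cost an HNSW index is meant to avoid at the price of approximate results.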
Currently, queries can use the vector index on the master branch, and EXPLAIN is supported for them.
type ANNQueryInfo struct {
QueryType ANNQueryType `protobuf:"varint,1,opt,name=query_type,json=queryType,enum=tipb.ANNQueryType" json:"query_type"`
DistanceMetric VectorDistanceMetric `protobuf:"varint,2,opt,name=distance_metric,json=distanceMetric,enum=tipb.VectorDistanceMetric" json:"distance_metric"`
TopK uint32 `protobuf:"varint,3,opt,name=top_k,json=topK" json:"top_k"`
ColumnName string `protobuf:"bytes,4,opt,name=column_name,json=columnName" json:"column_name"`
ColumnId int64 `protobuf:"varint,5,opt,name=column_id,json=columnId" json:"column_id"`
RefVecF32 []byte `protobuf:"bytes,6,opt,name=ref_vec_f32,json=refVecF32" json:"ref_vec_f32,omitempty"`
MaxDistance float64 `protobuf:"fixed64,10,opt,name=max_distance,json=maxDistance" json:"max_distance"`
HnswEfSearch uint32 `protobuf:"varint,20,opt,name=hnsw_ef_search,json=hnswEfSearch" json:"hnsw_ef_search"`
// new fields
// The field number must differ from ColumnId's (5); the number shown here is a placeholder.
IndexId int64 `protobuf:"varint,5,opt,name=index_id,json=indexId" json:"index_id"`
}
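To make the role of TopK, ColumnName, and RefVecF32 concrete, the sketch below fills a simplified local mirror of the struct for the `ORDER BY VEC_COSINE_DISTANCE(data, '[3,1,2]') LIMIT 5` query. Everything here is hypothetical: `annQuery`, `encodeVecF32`, `decodeVecF32`, and `newOrderByQuery` are illustration-only names, the metric is a plain string rather than the tipb enum, and the byte layout (a little-endian uint32 element count followed by little-endian float32 values) is an assumption of this sketch rather than something this document specifies.

```go
package main

import (
	"encoding/binary"
	"math"
)

// annQuery is a hypothetical local mirror of a few ANNQueryInfo fields.
type annQuery struct {
	DistanceMetric string
	TopK           uint32
	ColumnName     string
	RefVecF32      []byte
}

// encodeVecF32 serializes v as an element count followed by the elements,
// all little-endian. This layout is an assumption of the sketch.
func encodeVecF32(v []float32) []byte {
	buf := make([]byte, 4+4*len(v))
	binary.LittleEndian.PutUint32(buf, uint32(len(v)))
	for i, x := range v {
		binary.LittleEndian.PutUint32(buf[4+4*i:], math.Float32bits(x))
	}
	return buf
}

// decodeVecF32 reverses encodeVecF32 and reports whether buf is well formed.
func decodeVecF32(buf []byte) ([]float32, bool) {
	if len(buf) < 4 || (len(buf)-4)%4 != 0 {
		return nil, false
	}
	n := binary.LittleEndian.Uint32(buf)
	if uint64(n) != uint64((len(buf)-4)/4) {
		return nil, false
	}
	v := make([]float32, n)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(buf[4+4*i:]))
	}
	return v, true
}

// newOrderByQuery builds the query description for
// ORDER BY VEC_COSINE_DISTANCE(column, target) LIMIT limit.
func newOrderByQuery(column string, target []float32, limit uint32) annQuery {
	return annQuery{
		DistanceMetric: "COSINE",
		TopK:           limit,
		ColumnName:     column,
		RefVecF32:      encodeVecF32(target),
	}
}
```

Calling `newOrderByQuery("data", []float32{3, 1, 2}, 5)` corresponds to the example query; the LIMIT value becomes TopK and the literal vector becomes the reference vector.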
The analyze statement is not supported at the moment.
distance functions
distance function will not exist
| Function | Phase 1 | Phase 2 | Phase 3 |
|---|---|---|---|
| VEC_L1_DISTANCE | TBD | | |
| VEC_L2_DISTANCE | v | | |
| VEC_NEGATIVE_INNER_PRODUCT (name to be determined) | v | | |
| VEC_COSINE_DISTANCE | v | | |
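The four metrics in the table follow their standard mathematical definitions, sketched below for two vectors of equal dimension. This is an illustration and not the TiDB implementation: the Go function names are hypothetical, inputs are assumed to have the same length, and returning NaN for a zero-norm vector in the cosine case is an assumption of the sketch.

```go
package main

import "math"

// l1Distance is the sum of absolute element-wise differences.
func l1Distance(a, b []float32) float64 {
	var sum float64
	for i := range a {
		sum += math.Abs(float64(a[i]) - float64(b[i]))
	}
	return sum
}

// l2Distance is the Euclidean distance.
func l2Distance(a, b []float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}

// negativeInnerProduct negates the dot product so that, as with the
// other metrics, a smaller value means "more similar".
func negativeInnerProduct(a, b []float32) float64 {
	var dot float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
	}
	return -dot
}

// cosineDistance is 1 minus the cosine similarity.
// A zero-norm input yields NaN (sketch-only choice).
func cosineDistance(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return math.NaN()
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}
```

All four return smaller values for closer vectors, which is why each can appear directly in an ascending `ORDER BY ... LIMIT k` clause.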
Add a VectorIndexInfo struct to describe and store vector index metadata. In addition, add a VectorInfo field to the existing IndexInfo so that each index records whether it is a vector index.
// VectorIndexInfo is the information on the vector index of a column.
type VectorIndexInfo struct {
// Kind is the kind of vector index. Currently, only HNSW is supported.
Kind VectorIndexKind `json:"kind"`
// Dimension is the dimension of the vector.
Dimension uint64 `json:"dimension"` // Set to 0 when initially parsed from comment. Will be assigned to flen later.
// DistanceMetric is the distance metric used by the index.
DistanceMetric DistanceMetric `json:"distance_metric"`
}
// IndexInfo provides meta data describing a DB index.
type IndexInfo struct {
ID int64 `json:"id"`
Name CIStr `json:"idx_name"` // Index name.
...
// VectorInfo is the vector index information.
VectorInfo *VectorIndexInfo `json:"is_vector"`
}
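One consequence of making VectorInfo a pointer is that ordinary indexes, and metadata written before the field existed, decode to nil, so a nil check is enough to tell the two kinds of index apart. The sketch below illustrates this with simplified local copies of the structs. It is not TiDB code: the lowercase type names, the `isVectorIndex` and `decodeIndexInfo` helpers, the use of plain strings for Kind, DistanceMetric, and Name (CIStr in the real struct), and the sample JSON values are all assumptions made for illustration.

```go
package main

import "encoding/json"

// vectorIndexInfo is a simplified local copy of VectorIndexInfo.
type vectorIndexInfo struct {
	Kind           string `json:"kind"`
	Dimension      uint64 `json:"dimension"`
	DistanceMetric string `json:"distance_metric"`
}

// indexInfo is a simplified local copy of IndexInfo with only three fields.
type indexInfo struct {
	ID         int64            `json:"id"`
	Name       string           `json:"idx_name"`
	VectorInfo *vectorIndexInfo `json:"is_vector"`
}

// isVectorIndex reports whether the index carries vector metadata.
func (idx *indexInfo) isVectorIndex() bool {
	return idx.VectorInfo != nil
}

// decodeIndexInfo parses stored index metadata. A missing or null
// "is_vector" key leaves VectorInfo nil.
func decodeIndexInfo(raw []byte) (*indexInfo, error) {
	var idx indexInfo
	if err := json.Unmarshal(raw, &idx); err != nil {
		return nil, err
	}
	return &idx, nil
}
```

Metadata such as `{"id":1,"idx_name":"idx_a"}` therefore decodes as an ordinary index without any migration step.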
Add the ActionAddVectorIndex DDL job type, which implements the add-vector-index operation through the existing DDL framework.
VectorIndexInfo to TiFlash.
ROWS_STABLE_NOT_INDEXED value of the table corresponding to the added vector index on the system.dt_local_indexes table on TiFlash is 0.
The IsRollbackable check is the same as a general index.
MySQL.gc_delete_range
The IsRollbackable check is the same as a general index.
No need to add a DDL job type, just like renaming a general index operation. We don't have to wait for TiFlash's synchronization operation, just return directly after the TiKV operation is completed.
admin show ddl to display the progress of DDL operation execution
ROWS_STABLE_INDEXED information of the corresponding table of the information_schema.tiflash_indexes table on TiFlash can be filled into the ROW_COUNT information of the DDL job.
show create table
The admin check table/index and admin repair table operations do not currently consider processing of vector indexes.
ActionAddVectorIndex is a new DDL job type, so the monitoring information for the add-vector-index operation needs to be handled in the same way as for adding a general index.
Adding an index mainly consists of two steps:
syncTableSchema handles schema synchronization. If new index information is found, it needs to be synchronized to the table.
VectorIndexHNSWBuilder to read column data and populate the corresponding index data (already supported).
syncTableSchema handles schema synchronization. It needs to remove the corresponding index.
Add or update relevant system table information.
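The progress reporting described for the add-vector-index job relies on two counters that TiFlash exposes in its index system tables. Under one reading of the notes above, ROWS_STABLE_INDEXED feeds the ROW_COUNT of the DDL job, and the build is finished once ROWS_STABLE_NOT_INDEXED reaches 0. A minimal sketch of that reading follows; `indexStatus` and `addIndexProgress` are hypothetical names, and how the counters are actually fetched or aggregated across TiFlash nodes is left out.

```go
package main

// indexStatus is a hypothetical snapshot of the two TiFlash counters
// for the table that owns the vector index being built.
type indexStatus struct {
	RowsStableIndexed    int64
	RowsStableNotIndexed int64
}

// addIndexProgress returns the value to report as ROW_COUNT and whether
// the index build can be considered complete.
func addIndexProgress(s indexStatus) (rowCount int64, done bool) {
	return s.RowsStableIndexed, s.RowsStableNotIndexed == 0
}
```

A DDL worker following this sketch would poll the counters, update ROW_COUNT on each poll, and finish the job when `done` becomes true.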
MySQL 9.0 adds vector support. Whether we need to consider compatibility with it will be decided in a follow-up.
It's used to ensure that the basic functions of the feature work as expected. Both integration tests and unit tests should be considered.
A checklist to test compatibility: