docs/en/table_design/indexes/vector_index.md
import Beta from '../../_assets/commonMarkdown/_beta.mdx'
This topic introduces the vector index feature of StarRocks and how to perform an approximate nearest neighbor search (ANNS) with it.
The vector index feature is only supported in shared-nothing clusters of v3.4 or later.
Currently, StarRocks supports vector indexing at the Segment file level. The index maps each search item to its row ID within Segment files, allowing fast data retrieval by directly locating the corresponding data rows without brute-force vector distance calculations. The system now provides two types of vector indexing: Inverted File with Product Quantization (IVFPQ) and Hierarchical Navigable Small World (HNSW), each with its own organizational structure.
Inverted File with Product Quantization (IVFPQ) is a method for approximate nearest neighbor search in large-scale high-dimensional vectors, commonly used in vector retrieval tasks within deep learning and machine learning. IVFPQ consists of two main components: the inverted file and product quantization.
By combining inverted files and product quantization, IVFPQ enables efficient approximate nearest neighbor search in large, high-dimensional datasets.
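For a rough sense of the compression this brings (an illustrative calculation with assumed numbers, not values prescribed by StarRocks), consider a 256-dimensional FLOAT vector that is split into 8 subvectors, each quantized to an 8-bit code (these correspond to the M_IVFPQ and nbits build parameters described later):

$$
\text{raw vector size} = 256 \times 4\ \text{bytes} = 1024\ \text{bytes},
\qquad
\text{PQ code size} = 8 \times \frac{8\ \text{bits}}{8\ \text{bits/byte}} = 8\ \text{bytes}
$$

Each vector is thus stored as a short quantized code plus an inverted-list entry, which is why the index can hold far more vectors in memory than the raw data would allow, at the cost of some approximation error.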
Hierarchical Navigable Small World (HNSW) is a graph-based algorithm for high-dimensional nearest neighbor search, also widely used in vector retrieval tasks.
HNSW builds a hierarchical graph structure in which each layer is a navigable small world (NSW) graph. In the graph, each vertex represents a data point, and edges indicate similarity between vertices. The higher layers of the graph contain fewer vertices with sparser connections for fast global search, while the lower layers include all vertices with denser connections for precise local search.
For a vector search, HNSW searches at the top layer first, quickly identifying an approximate nearest neighbor region, then moves down layer by layer to find the exact nearest neighbor at the bottom layer.
HNSW offers both efficiency and precision, making it adaptable to various data and query distributions.
Each table supports only one vector index.
Before creating vector indexes, you must enable the feature by setting the FE configuration item enable_experimental_vector to true.
Execute the following statement to enable it dynamically:
ADMIN SET FRONTEND CONFIG ("enable_experimental_vector" = "true");
To enable it permanently, you must add enable_experimental_vector = true to the FE configuration file fe.conf and restart FE.
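To confirm the current value of the item, you can inspect the FE configuration (a quick sanity check; it assumes your account has sufficient administrative privileges to run ADMIN statements):

```SQL
ADMIN SHOW FRONTEND CONFIG LIKE "enable_experimental_vector";
```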
This tutorial creates vector indexes while creating tables. You can also append vector indexes to an existing table. See Append vector index for detailed instructions.
The following example creates an HNSW vector index hnsw_vector on column vector in table hnsw.
CREATE TABLE hnsw (
id BIGINT(20) NOT NULL COMMENT "",
vector ARRAY<FLOAT> NOT NULL COMMENT "",
INDEX hnsw_vector (vector) USING VECTOR (
"index_type" = "hnsw",
"dim"="5",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"M" = "16",
"efconstruction" = "40"
)
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;
The following example creates an IVFPQ vector index ivfpq_vector on column vector in table ivfpq.
CREATE TABLE ivfpq (
id BIGINT(20) NOT NULL COMMENT "",
vector ARRAY<FLOAT> NOT NULL COMMENT "",
INDEX ivfpq_vector (vector) USING VECTOR (
"index_type" = "ivfpq",
"dim"="5",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"nbits" = "16",
"nlist" = "40"
)
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;
The vector index building parameters are as follows:

- index_type: The vector index algorithm. Valid values: hnsw and ivfpq.
- dim: The dimension of the indexed vectors. It must be an integer greater than or equal to 1. The vectors loaded into the indexed column must have exactly this number of dimensions.
- metric_type: The metric (distance function) of the vector index. Valid values:
  - l2_distance: Euclidean distance. The smaller the value, the higher the similarity.
  - cosine_similarity: Cosine similarity. The larger the value, the higher the similarity.
- is_vector_normed: Whether the vectors are normalized. Valid values: true and false. It takes effect only when metric_type is cosine_similarity. If the vectors are normalized, the calculated distances fall within [-1, 1]. The vectors must satisfy that the sum of the squares of their elements is 1; otherwise, an error is returned.
- M (HNSW only): The number of connections each vertex establishes with its nearest neighbors during graph construction. It must be an integer greater than or equal to 2. The value of M directly affects the efficiency and accuracy of graph construction and search. During graph construction, each vertex attempts to establish connections with its nearest M vertices. If a vertex already has M connections but finds a closer vertex, the farthest connection is deleted and a new connection is established with the closer vertex. A vector search starts from an entry point and finds the nearest neighbor along the vertices connected to it. Therefore, the larger the value of M, the larger the search scope for each vertex and the higher the search efficiency, but also the higher the cost of graph construction and storage.
- efconstruction (HNSW only): The size of the candidate list used during graph construction. It must be an integer greater than or equal to 1. It controls the search depth during graph construction: efconstruction defines the size of the search list (also known as the candidate list) that stores the neighbor candidates of each vertex while the graph is built. The larger the value of efconstruction, the more candidates are considered as neighbors of a vertex and, as a result, the better the quality (such as connectivity) of the graph, but also the higher the time consumption and computational complexity of graph construction.
- nbits (IVFPQ only): The precision of quantization, that is, how many binary digits each subvector is quantized into. It must be a multiple of 8. With IVFPQ, each vector is divided into multiple subvectors, and each subvector is then quantized. The larger the value of nbits, the higher the quantization precision, but also the higher the storage and computation costs.
- nlist (IVFPQ only): The number of clusters, each corresponding to an inverted list. It must be an integer greater than or equal to 1. With IVFPQ, the dataset is divided into clusters, and the centroid of each cluster corresponds to an inverted list. A vector search first finds the cluster centroid closest to the data point, and then retrieves the nearest neighbors in the corresponding inverted list. Therefore, the value of nlist affects the accuracy and efficiency of the search. The larger the value of nlist, the finer the granularity of clustering and thus the higher the search accuracy, but also the higher the search complexity.
- M_IVFPQ (IVFPQ only): The number of subvectors the original vector is split into. The IVFPQ index splits a dim-dimensional vector into M_IVFPQ equal-length subvectors, so M_IVFPQ must be a factor of the value of dim.

You can also add vector indexes to an existing table using CREATE INDEX or ALTER TABLE ADD INDEX.
Example:
CREATE INDEX ivfpq_vector
ON ivfpq (vector)
USING VECTOR (
"index_type" = "ivfpq",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"dim"="5",
"nlist" = "256",
"nbits"="10"
);
ALTER TABLE ivfpq
ADD INDEX ivfpq_vector (vector)
USING VECTOR (
"index_type" = "ivfpq",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"dim"="5",
"nlist" = "256",
"nbits"="10"
);
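Adding an index to an existing table runs as an asynchronous job. Assuming the vector index build goes through the same schema-change workflow as other index types (an assumption, not confirmed by this document), you can check its progress in the current database with:

```SQL
SHOW ALTER TABLE COLUMN;
```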
You can view the definition of a vector index using the SHOW CREATE TABLE statement:
Example:
mysql> SHOW CREATE TABLE hnsw \G
*************************** 1. row ***************************
Table: hnsw
Create Table: CREATE TABLE hnsw (
id bigint(20) NOT NULL COMMENT "",
vector array<float> NOT NULL COMMENT "",
INDEX hnsw_vector (vector) USING VECTOR("dim" = "5", "efconstruction" = "40", "index_type" = "hnsw", "is_vector_normed" = "false", "M" = "16", "metric_type" = "l2_distance") COMMENT ''
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES (
"compression" = "LZ4",
"fast_schema_evolution" = "true",
"replicated_storage" = "false",
"replication_num" = "3"
);
1 row in set (0.00 sec)
You can drop vector indexes using DROP INDEX or ALTER TABLE DROP INDEX.
DROP INDEX ivfpq_vector ON ivfpq;
ALTER TABLE ivfpq DROP INDEX ivfpq_vector;
Before running a vector search, make sure the FE configuration item enable_experimental_vector is set to true.
SELECT *, <vector_index_distance_func>(v1, [1,2,3]) as dis
FROM table_name
WHERE <vector_index_distance_func>(v1, [1,2,3]) <= 10
ORDER BY <vector_index_distance_func>(v1, [1,2,3])
LIMIT 10
To use the vector index in queries, all of the following requirements must be met:

- The ORDER BY clause must follow the format ORDER BY <vector_index_distance_func>(vector_column, constant_array) without including additional ORDER BY columns, and a LIMIT N clause is required.
  - Requirements for <vector_index_distance_func>:
    - If metric_type is l2_distance, the function name must be approx_l2_distance.
    - If metric_type is cosine_similarity, the function name must be approx_cosine_similarity.
  - Requirements for the parameters of <vector_index_distance_func>:
    - constant_array must be a constant ARRAY<FLOAT> with dimensions matching the vector index dim.
    - vector_column must be the column corresponding to the vector index.
  - Requirements for the order:
    - If metric_type is l2_distance, the order must be ASC.
    - If metric_type is cosine_similarity, the order must be DESC.
- The WHERE predicate, if present, must consist of <vector_index_distance_func> expressions combined using AND and comparison operators (> or <). The direction of the comparison operator must align with the ASC/DESC order. Specifically:
  - If metric_type is l2_distance, the predicate must be col_ref <= constant.
  - If metric_type is cosine_similarity, the predicate must be col_ref >= constant.
  - Here, col_ref refers to the result of <vector_index_distance_func>(vector_column, constant_array) and can be cast to FLOAT or DOUBLE types, for example:
    - approx_l2_distance(v1, [1,2,3])
    - CAST(approx_l2_distance(v1, [1,2,3]) AS FLOAT)
    - CAST(approx_l2_distance(v1, [1,2,3]) AS DOUBLE)
  - Multiple predicates must be combined using AND, with each child predicate meeting the requirements above.

Create tables with vector indexes and insert vector data into them:
CREATE TABLE test_hnsw (
id BIGINT(20) NOT NULL COMMENT "",
vector ARRAY<FLOAT> NOT NULL COMMENT "",
INDEX index_vector (vector) USING VECTOR (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"M" = "512",
"dim"="5")
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;
INSERT INTO test_hnsw VALUES
(1, [1,2,3,4,5]),
(2, [4,5,6,7,8]);
CREATE TABLE test_ivfpq (
id BIGINT(20) NOT NULL COMMENT "",
vector ARRAY<FLOAT> NOT NULL COMMENT "",
INDEX index_vector (vector) USING VECTOR (
"index_type" = "ivfpq",
"metric_type" = "l2_distance",
"is_vector_normed" = "false",
"nlist" = "256",
"nbits"="8",
"dim"="5",
"M_IVFPQ"="1")
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1;
INSERT INTO test_ivfpq VALUES
(1, [1,2,3,4,5]),
(2, [4,5,6,7,8]);
Approximate searches will hit the vector index and thereby accelerate the search process.
The following example searches for the top 1 approximate nearest neighbor of the vector [1,1,1,1,1].
SELECT id, approx_l2_distance([1,1,1,1,1], vector)
FROM test_hnsw
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 1;
You can combine scalar search with vector search.
SELECT id, approx_l2_distance([1,1,1,1,1], vector)
FROM test_hnsw
WHERE id = 1
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 1;
You can perform range search on vector data.
The following example pushes the score < 40 condition down to the index and filters vectors by score range.
SELECT * FROM (
SELECT id, approx_l2_distance([1,1,1,1,1], vector) score
FROM test_hnsw
) a
WHERE score < 40
ORDER BY score
LIMIT 1;
Precise calculations ignore the vector index and compute the distances between vectors directly to return exact results.
SELECT id, l2_distance([1,1,1,1,1], vector)
FROM test_hnsw WHERE id = 1
ORDER BY l2_distance([1,1,1,1,1], vector)
LIMIT 1;
NOTE
Different distance metric functions define "similarity" differently. For
l2_distance, smaller values indicate higher similarity; for cosine_similarity, larger values indicate higher similarity. Therefore, when calculating the topN results, the sorting (ORDER BY) direction should match the similarity direction of the metric. Use ORDER BY ... ASC LIMIT x for l2_distance, and ORDER BY ... DESC LIMIT x for cosine_similarity.
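As an illustration of the cosine_similarity direction, the following sketch assumes a hypothetical table test_cosine whose vector index was built with metric_type set to cosine_similarity and is_vector_normed set to true (so the query vector below is normalized):

```SQL
SELECT id, approx_cosine_similarity([0.4,0.4,0.4,0.4,0.6], vector)
FROM test_cosine
ORDER BY approx_cosine_similarity([0.4,0.4,0.4,0.4,0.6], vector) DESC
LIMIT 1;
```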
Parameter tuning is essential in vector search because it affects both performance and accuracy. It is recommended to tune the search parameters on a small dataset and move to large datasets only when the expected recall and latency are achieved.
Search parameters are passed through hints in SQL statements.
Before proceeding, make sure the vector column is built with the HNSW index.
SELECT
/*+ SET_VAR (ann_params='{efsearch=256}') */
id, approx_l2_distance([1,1,1,1,1], vector)
FROM test_hnsw
WHERE id = 1
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 1;
Parameter:
- efsearch: The larger the value of efsearch, the higher the accuracy, but the lower the speed.

Before proceeding, make sure the vector column is built with the IVFPQ index.
SELECT
/*+ SET_VAR (ann_params='{nprobe=256,max_codes=0,scan_table_threshold=0,polysemous_ht=0,range_search_confidence=0.1}') */
id, approx_l2_distance([1,1,1,1,1], vector)
FROM test_ivfpq
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 1;
Parameters:

- nprobe: The number of inverted lists probed during the search. The larger the value of nprobe, the higher the accuracy, but the lower the speed.
- range_search_confidence: The confidence of the approximate range search. Setting it to 1 produces the most accurate results.

You can calculate approximate recall by intersecting the topK elements from brute-force retrieval with those from approximate retrieval: Recall = TP / (TP + FN).
-- Approximate retrieval
SELECT id
FROM test_hnsw
ORDER BY approx_l2_distance([1,1,1,1,1], vector)
LIMIT 5;
8
9
7
5
1
-- Brute-force retrieval
SELECT id
FROM test_hnsw
ORDER BY l2_distance([1,1,1,1,1], vector)
LIMIT 5;
8
9
5
7
10
In the preceding example, approximate retrieval returns 8, 9, 7, 5, and 1, while brute-force retrieval returns 8, 9, 5, 7, and 10. Four of the five approximate results (8, 9, 7, and 5) are true nearest neighbors, so the recall is 4/5 = 80%.
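The same computation can be expressed in SQL. The following is a sketch (not part of the original example) that intersects the approximate and exact top-5 ID sets for test_hnsw and divides the overlap by K = 5:

```SQL
SELECT COUNT(*) / 5 AS recall
FROM (
    -- Approximate top-5, served by the vector index
    SELECT id FROM test_hnsw
    ORDER BY approx_l2_distance([1,1,1,1,1], vector)
    LIMIT 5
) approx_result
JOIN (
    -- Exact top-5, computed by brute force
    SELECT id FROM test_hnsw
    ORDER BY l2_distance([1,1,1,1,1], vector)
    LIMIT 5
) exact_result ON approx_result.id = exact_result.id;
```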
Execute EXPLAIN against the query statement. If OlapScanNode properties show VECTORINDEX: ON, it indicates that the vector index has been applied to approximate vector searches.
Example:
> EXPLAIN SELECT id FROM t_test_vector_table ORDER BY approx_l2_distance([1,1,1,1,1], vector) LIMIT 5;
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:1: id |
| PARTITION: UNPARTITIONED |
| |
| RESULT SINK |
| |
| 4:Project |
| | <slot 1> : 1: id |
| | limit: 5 |
| | |
| 3:MERGING-EXCHANGE |
| limit: 5 |
| |
| PLAN FRAGMENT 1 |
| OUTPUT EXPRS: |
| PARTITION: RANDOM |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 03 |
| UNPARTITIONED |
| |
| 2:TOP-N |
| | order by: <slot 3> 3: approx_l2_distance ASC |
| | offset: 0 |
| | limit: 5 |
| | |
| 1:Project |
| | <slot 1> : 1: id |
| | <slot 3> : 4: __vector_approx_l2_distance |
| | |
| 0:OlapScanNode |
| TABLE: t_test_vector_table |
| VECTORINDEX: ON |
| IVFPQ: OFF, Distance Column: <4:__vector_approx_l2_distance>, LimitK: 5, Order: ASC, Query Vector: [1, 1, 1, 1, 1], Predicate Range: -1.0 |
| PREAGGREGATION: ON |
| partitions=1/1 |
| rollup: t_test_vector_table |
| tabletRatio=1/1 |
| tabletList=11302 |
| cardinality=2 |
| avgRowSize=4.0 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------+