docs/design/2023-03-14-multi-valued-index.md
This document introduces the technical design of implementing 'Multi-valued index' in TiDB.
Multi-valued index is a new feature introduced in MySQL 8.0.17, which allows defining indexes on a JSON array and use the index via JSON functions, similar to MongoDB in Multikey Indexes. e.g
CREATE TABLE t1 (data JSON);
CREATE INDEX zips ON t1((CAST(data->'$.zip' AS UNSIGNED ARRAY)));
INSERT INTO t1 VALUES
('{"id":1, "zip": [0,111,333]}'),('{"id":2, "zip": [123,456,0]}'),
('{"id":3, "zip": [123,123,111]}'),
('{"id":4, "zip": [456,567,222]}'),
('{"id":5, "zip": []}');
mysql> SELECT * FROM t1 WHERE 123 MEMBER OF (data->'$.zip');
+-----------------------------------+
| data |
+-----------------------------------+
| {"id": 2, "zip": [123, 456, 0]} |
| {"id": 3, "zip": [123, 123, 111]} |
+-----------------------------------+
2 rows in set (0.01 sec)
That is: N index records point to one row record (N: 1) A common scenario involves having rows with associated tags and the need to efficiently query all data containing a specific tag.
cast(... as ... array) to define a multi-valued index, which is essentially an expression index whose virtual column type is "array with type". The index is encoded in the same way as normal secondary indexes.MEMBER OF, JSON_CONTAINS(subset), JSON_OVERLAPS(intersection) functions in where condition to using the multi-valued index.The encoding of each index record is identical to the normal secondary index(see TiDB Index Key/Value Format for more details).
For string types, the encoding result in TiDB is collation-aware, we could use binary collation for strings(in MySQL it is utf8mb4_0900_as_cs and behaves almost the same as binary).
row ('pk1', [1, 1, 2])
produces index records
1 -> 'pk1'
2 -> 'pk1'
row ('pk1', c1, [1, 1, 2], c2)
produces index records
(c1,1,c2) -> 'pk1'
(c1,2,c2) -> 'pk1'
New syntax: use cast(... as... array) to create the index. Add an Array field in FuncCastExpr indicate the use of this syntax.
type FuncCastExpr struct {
//...
Array bool
}
Use JSONBinary type as the return type of expression cast(... as... array). In the FieldType structure, add an array field indicates whether the type is an array, and tp represents the type of elements in the array.
type FieldType struct {
// tp is type of the array elements
tp byte
// array indicates whether the type is an array
array bool
}
Implement new built-in functions castAsTypedArrayFunctionSig, MEMBER OF and JSON_OVERLAPS.
Data changes cause index changes, which are handled in the same way as normal secondary indexes.
The column in the where condition will substituted with the corresponding expression in the index definition if the query meet the following 3 requirements
MEMBER OF/JSON_CONTAINS/JSON_OVERLAPS.For example, the index definition is create index idx on t((cast(data->'$.zip' as unsigned array))), the where condition is where 123 member of (data->'$.zip'), the column data->'$.zip' is substituted with cast(data->'$.zip' as unsigned array).
Index selection is the same as normal secondary index.
For any of the 3 functions, we can use IndexMerge operator to fetch the data:
IndexMerge
IndexRangeScan(<const>)
TableRowIDScan
IndexMerge(AND)
IndexRangeScan(<c1>)
IndexRangeScan(<c2>)
IndexRangeScan(<c3>)
...
TableRowIDScan
IndexMerge(OR)
IndexRangeScan(<c1>)
IndexRangeScan(<c2>)
IndexRangeScan(<c3>)
...
TableRowIDScan
Each IndexRangeScan is a PointGet like operator. It will fetch the row record by the index record. Since different indexes could match the same primary key, we need to use IndexMerge to filter the duplicated row records. For JSON_CONTAINS we should use AND type of IndexMerge, because only the primary key that contained in all the IndexRangeScan can be filtered. For JSON_OVERLAPS we should use OR to filter the row records.
If the multi-valued index is unique, it can be further optimized to PointGet.
MEMBER OF/JSON_CONTAINS/JSON_OVERLAPS. So even if SQL contains hint, force index, use index, etc., it is not necessarily possible to force multi-valued index.cast(... as ... array) can only appear once in the composite index definition, and the casted column must be a JSON column.-- Allowed:
INSERT INTO t1 VALUES('[1,1,2]');
INSERT INTO t1 VALUES('[3,3,3,4,4,4]');
-- Disallowed, report dup-key error:
INSERT INTO t1 VALUES('[1,2]');
INSERT INTO t1 VALUES('[2,3]');
not xxx cannot use the index, because empty array data cannot access through the index.
select * from t where pk not in (select pk from t where a member of (xxx)).BINARY, JSON, and YEAR.