docs/RFCS/20171020_inverted_indexes.md
Add infrastructure for inverted indexes in CockroachDB, which will permit indexing arbitrarily-nested primitives in JSON column types in the short term, array columns in the medium term, and full-text indexing of text columns and geospatial indexes in the long term.
We’ve committed to supporting JSON column types to make CockroachDB useful for
schema-less data, and for compatibility with Postgres applications that use the
JSONB column type. (n.b. this document discusses JSON columns and values, but
in CockroachDB they will be referred to as either JSON or JSONB depending
on the outcome from the JSON implementation RFC) Having JSON columns on their
own is useful, but their usefulness is severely curtailed without an efficient
way to index them, and CockroachDB currently only provides indexes that permit
searching based on comparison to a prefix of a column value. In order to
support JSON in a way that is useful to our customers, we need to be able to
index JSON documents in a more fine-grained way than comparison on value
prefix.
Specifically, the list of use cases that we’ve seen for indexing JSON columns includes:
foo at the top level”.foo at the
top level, and its value is equal to bar”.foo.bar.baz = qux”.While out of scope for immediate implementation, here are some other related questions that are not possible to efficiently answer with CockroachDB’s current indexing scheme:
None of the questions posed by these use cases are efficiently answerable by CockroachDB’s conventional indexes without full table scans, since they are not satisfiable by ordinary value prefix comparison operators. We’d like to be able to answer these questions and others like them in an efficient way; inverted indexes provide a way to do so.
All of the above questions share a common structure: they’re all filters on components of tokenizable data. Arrays are tokenized by element; text is tokenized by splitting on whitespace and stop words; JSON is tokenized by key-value pairs.
An inverted index exploits the tokenizable nature of a data type to permit efficient searches of the following form:
Give me all of the rows of table
twhose tokenizable columnccontains tokens that satisfy my constraints.
Being able to answer this general kind of question efficiently is extremely powerful, and allows us to satisfy the near-term use cases for JSON above.
Aside: Why is it called an inverted index? A common point of confusion around inverted indexes is their name. What makes an inverted index any more inverted than a secondary index, which points from a column value back to the primary key of table?
Originally, inverted indexes were designed for full text search over documents. In that domain, the forward index allows you to retrieve the list of words associated with a particular document ID. To implement full text search, you need to reverse the mapping: instead of pointing from document ID to a list of words, you would point from word to list of document IDs.
This inversion of the direction of the index is where the name inverted index comes from.
An ordinary primary or secondary index in CockroachDB maps one or several column values to an index key. In the KV layer this is represented as a key with the index key and maybe some other other column values. The KV layer is sorted, which makes retrieval of index values fast when you know the index key. Consider the following table:
CREATE TABLE t (
keyCol INT PRIMARY KEY,
jsonCol JSONB
);
In a simplified view of the KV layer, this table would look like the following:
| Key (table id, index id, value) | Value |
|---|---|
t/1/1 | {"x": "a", "y": 7} |
t/1/2 | {"x": "b", "z": {"a": true, "b": false}} |
t/1/3 | {"y": 7, "z": {"b": false}} |
In this case, the keys in the KV have a prefix containing the table name, the
index key and the name of the column for which the value is stored in the value
field. This makes it very fast to run queries where the index key is in the
where clause. For example:
SELECT jsonCol FROM t where keyCol = x;
An ordinary index works very well when you’re searching based on prefixes of sorted data. However, if we want to be able to query over tokenizable data like in the motivation section, this approach doesn’t save enough information about the individual components. We need something more sophisticated because the component of interest might not be a prefix on the whole data structure.
For example, suppose we wanted to query the table above for all rows where
json``Col->z->b = false. At a glance it’s easy to see that both rows 2 and
3 satisfy that constraint.
An ordinary secondary index built by storing the entire jsonCol as the key
and the keyCol as the value would not help us answer this question
efficiently, since it would be sorted lexicographically by the valueCol and
the JSON "z" key of the value is not necessarily the first JSON key of
the value. We could get around this if we had computed indexes, by creating an
index on valueCol->z, but in order to do that we would need to know that we
were indexing on z beforehand. Users of unstructured data types would like
their indexes to work on the whole space without specifying the keys to index
on.
At a high level without going into detail about the physical key encoding, an inverted index on the above data would look like the following:
| Key (json path + value) | Value (rows whose **valueCol** contains the key) |
|---|---|
x:a | 1 |
x:b | 2 |
y:7 | 1 |
y:7 | 3 |
z:a:true | 2 |
z:b:false | 2 |
z:b:false | 3 |
The question we posed above valueCol->z->b = false is now incredibly natural
to answer with this type of index. Now it’s a straightforward lookup into the
KV, just the way ordinary indexes are.
This general approach to indexing is usable on any tokenizable datatype, such as arrays, text, and even XML, although we won't be adding support for any of these datatypes in the near term.
CockroachDB's inverted index for JSON will encode the full JSON path for every
leaf value in a JSON document, similar to Postgres' GIN index with the
jsonb_path_ops modifier.
Postgres supports two flavors of inverted indexes for JSON jsonb_ops and
jsonb_path_ops. Roughly, the difference is that jsonb_ops collapses the
hierarchy of the JSON document, whereas jsonb_path_ops preserves it. The
result is that jsonb_ops produces a larger and less precise index more similar
to an ordinary full-text index.
jsonb_ops
The first method, jsonb_ops, works by adding every JSON key and value from
the indexed column into the index as a key. It’s essentially no different from
full-text search - keys and values are treated almost identically, and all of
the hierarchical information from the input JSON is discarded. Given the
following JSON:
{"x": "b", "z": {"a": true, "b": false}}
the keys produced would be:
Ka
Kb
Kx
Kz
Vb
Vtrue
Vfalse
To find the results for the following containment query,
SELECT * from t WHERE jsonCol @> {"x":"b", "z:"{"a":true}}
we would have to look up the index key for every key and value in the input to
get a set of primary keys per object, then take the intersection of those sets,
and finally re-check the containment condition for all of the rows in that
intersection to make sure that the structure is correct. This is inefficient in
terms of both time and space. In this example, it’s obvious that a very large
number of rows can have true as a value in their JSON column so we would have
to perform a join on a very large data set. The resultant index will also be
quite large on disk, since every individual key and value gets indexed.
jsonb_path_ops
The second type of JSON index that Postgres supports is called
jsonb_path_ops. This type of index encodes the full JSON path for every leaf
value into the index key. For the example above, the resultant index keys would
be the following:
x:b
z:a:true
z:b:false
This type of index produces significantly fewer keys, only one per leaf from original JSON, not one per token. Performing lookups on this encoding is very efficient for full path queries because it’s just one lookup.
From user feedback we’ve received, users are mostly interested in running queries supported by the second type. The second type is also more space and time efficient. Therefore, we propose implementing the second variety of index initially. Once we implement full text search, we will have the first variety for free using JSON keys and values as the tokens instead of words.
Encoding the index
Inverted indexes are traditionally encoded as a “postings list”, where each key points to a single compound value containing a list of matching primary index ids. Traditional RDBMS systems can encode this compound value efficiently by using their internal row id representations in the list. Because CockroachDB doesn’t have internal row ids, we would have to implement a traditional postings list by storing a list of all missing subcomponents of matching primary keys in the RocksDB value.
There are two disadvantages of doing it this way. Firstly, naive insertion and deletion would be very inefficient. When modifying the index for a JSON object we would have to read everything in the KV value, modify it, and then write it all back, for every single leaf in the JSON. This produces more MVCC garbage on that set of keys, is more expensive over the wire, and would cause contention on the set of highly-accessed and frequently written keys. Secondly, since there’s no bound on the number of rows that might contain a particular path, one individual key’s value could easily grow past the soft 64 megabyte value limit for RocksDB. The only way to get around that limit would be to add logic to split and merge values, adding unnecessary complexity.
Fortunately, CockroachDB has already solved this problem for non-unique secondary indexes where we encode missing primary key components as part of the index key. Inverted indexes can reuse this approach to avoid the disadvantages of the posting list. We can encode both the full JSON path and the missing primary key components into the index key.
For example, a row with primary key pk and JSON column {"a": "b", "c": "d"}
in table 1 would produce the following keys in an inverted index with id 2:
/1/2/a/b/pk
/1/2/c/d/pk
As an implementation note, Postgres deviates from this idea by storing a hash of all of values of the path components as its index key to save space. This space savings comes at the cost of losing a lot of index capabilities: sub-path matching, top-level-key matching, and range scans of path and leaf values. This approach isn’t useful for us, since all of our keys are stored in sorted order and RocksDB does prefix compression on adjacent keys, so we don’t have to sacrifice those neat capabilities to save space.
Physical encoding
JSON values contain two collection types:
We will encode these two structures into index keys in a very similar way. The full key schema looks like this:
/table id/index id/(json object key | arr)…/json leaf value/pk columns…/
where arr is a separator indicating that the next key is part of an array.
There’s only one leaf value per index key, so the start of the primary key
columns always follows the leaf value, which is detectable by its encoding. The
value component of an index KV is left unused. In the future, inverted indexes
could be extended by adding stored columns into the value component like
secondary indexes currently allow with STORING.
For example, a row with primary key pk, JSON column {"a": "b", "c": "d"} and
text column "foo" in table 1 would produce the following key-value pairs in an
inverted index with id 2 that's marked to store the text column:
/1/2/a/b/pk -> "foo"
/1/2/c/d/pk -> "foo"
Key encoding JSON objects
Encoding JSON objects into the index key is straightforward. Every key in a
path is written to the index key after the index id and is prefixed with a type
tag. Object keys don’t need ordinary type tags, because they can only be
strings. However, we still need to add some sort of tag to distinguish them
from leaf values, which will need to be encoded with a type tag. We’ll pick
NOTNULL, which is not used by the leaf value encoding.
JSON leaf values have four types, null, boolean, string or number. Since we
don’t know what they are when reading the values back from a key, we’ll need to
tag them with types. CockroachDB already uses type tags for its value encoding,
so we can reuse those here.
For example, if we had a table with id 1 with columns keyCol and jsonCol
containing the following values:
| keyCol | jsonCol |
|---|---|
pk1 | {"x": "b", "z": {"a": true, "b": false}} |
pk2 | {"x": "b", "y": null} |
and an inverted index on jsonCol with index id 2, we’d encode the following
index keys:
1/2/x/b/pk1
1/2/x/b/pk2
1/2/y/null/pk2
1/2/z/a/true/pk1
1/2/z/b/false/pk1
To reduce clutter, we've dropped the type tags from our examples. For completeness, the final index key above would look like the following with all type tags included:
1/2/NOTNULL/z/NOTNULL/b/BOOL/false/pk1
Key encoding JSON Arrays
JSON arrays are ordered lists of other JSON values. For example:
["a", 3, [4, 5, 4], [false, true], {foo:bar}]
To encode array membership into an index key, the ARRAY type tag is reused to
indicate that the subsequent key component is part of an array. With arr
representing the ARRAY type tag and assuming the primary key is pk1, the
example above would produce the following index keys:
1/2/arr/3/pk1
1/2/arr/a/pk1
1/2/arr/arr/4/pk1
1/2/arr/arr/5/pk1
1/2/arr/arr/false/pk1
1/2/arr/arr/true/pk1
1/2/arr/foo/bar/pk1
We don’t actually distinguish arrays with an index or name, just a level of nesting. To understand why, consider the following query:
SELECT * from t where jsonCol @> [[false]];
This should return the JSON above because there is indeed a value false
nested two deep in an array. However, if we included the indices of the array
values, we wouldn’t be able to disambiguate between the inner array [4,5] and
[false, true]. If we just add a prefix for array containment, we don’t have
this problem.
Supported Operators
Postgres provides the following indexable operators on JSON:
| @> | Does the left JSON value contain the right JSON path/value entries at the top level? |
|---|---|
| ? | Does the string exist as a top-level key within the JSON value? |
| ?& | Do all of these array strings exist as top-level keys? |
| ? |
Note that, unlike Postgres’ json_path_ops index, the proposed index will
support all of the operators above, not just the first one. The reason for this
extra support is that the proposed encoding is lossless with respect to the
structure of the input JSON, unlike Postgres’ lossy hash encoding; searching
for a top-level key is as simple as performing a range scan. For example, given
the column value {a: {c: 3}} and the resultant index key .../a/c/3/,
completing the query col ? a merely requires checking to see if any rows are
available in the keyspan /a -> /b.
Even though supporting this feature isn't in scope for initial implementation,
the proposed index key encoding permits range queries and partial path matches
that are not even expressible by Postgres’s @> operator semantics. For
example, the query SELECT * FROM j WHERE col->a->>x > 3, though inexpressible
via Postgres’ indexable JSON operators, is easily expressed via a scan over
keyspan /a/3 -> /b.
Index Creation
The syntax of creating an inverted index on a JSONB column type should match
Postgres, for compatibility with existing Postgres applications.
Therefore, we must support the following syntax even though we don’t support Postgres’ GIN indexes in general:
CREATE INDEX <optional name> ON <table> USING GIN(<column> <opt_ops_specifier>)
where <opt_ops_specifier> can be only jsonb_path_ops, since that’s the
index mode we’ve chosen to support.
For ergonomics, we might want to also support a syntax like the following:
CREATE INVERTED INDEX <optional name> ON <table> (<column> <opt_ops_specifier>)
A future extension to inverted indexes to permit stored columns could have the
STORING syntax that we currently implement for secondary indexes:
CREATE INVERTED INDEX <optional name> ON <table> (<column> <opt_ops_specifier>) STORING (<cols>)
The fact that an index is inverted will be stored in its index descriptor, which will inform the index encoder/decoders and index selection algorithms how to behave correctly.
Inverted indexes will be physically backfilled and updated in the same fashion as ordinary secondary indexes.
Index Selection
Once we have an inverted index created, when do we choose to use it?
If the query contains a constraint against an indexed JSON column that uses any of the above operators, we add the inverted index to the set of index candidates.
In general, the priority of indexes during index selection is determined by sorting the index constraints by selectivity, an estimate of how good the index will be at constraining the number of rows it returns given the operations that constrain it. At the time of this writing, selectivity is determined by a heuristic that takes into account the number of constrained index columns, whether or not the index is covering, and the order of the index. Notably, there’s no effort made to guess how many rows will be returned by a given constraint, since we don’t currently collect table statistics.
Since the inverted index might have a very large number of matching rows per index key, in general we have to assume that it has fairly low selectivity. It’s not exactly clear how to weight an inverted index versus an incompletely-constrained unique index on the same table, since we don’t have table statistics that inform us of the selectivity of a partial match of the other index. If a user is performing a containment query on an indexed column, though, we can probably make the assumption that they want to use the index. So, we’ll weight a JSON column constraint quite high - but lower than a fully-constrained unique index.
To give users recourse in the case where these heuristics cause an undesirable
index to be picked, the explicit index selection operator table@index will
also be supported for inverted indexes.
If we guessed that keys in JSON values roughly follow a uniform distribution, one might be able to guess the selectivity of a JSON inverted index constraint by counting the number of keys in the path to search, with the rationale that the longer the path constraint, the less likely it is that documents contain that path. However, it’s clear that JSON values aren’t in general uniformly distributed, since it’s very common to have "hot paths" in a particular JSON schema, so this estimation doesn’t seem all that useful.
Index usage
Unlike ordinary secondary indexes, inverted indexes behave differently when
they are constrained multiple times. The idea is that, since the index is
inverted and contains all of the primary keys that contain the value in the
constraint, multiple constraints are implemented by checking each constraint
individually and intersecting the resultant primary keys. For example, the
containment query col @> '3' AND col @> 'b' requires 2 index scans: one for
all the rows that contain 3, and one for all the rows that contain b. The
resultant row keys are then intersected to produce the final result.
OR constraints, on the other hand, result in the output sets being unioned as
normal.
Containment queries that search for objects that contain multiple leaf values
are similar - they can be treated as a conjunction of constraints that each
contain one full path to a leaf value. For example, the query
col @> '{"animal": "dog", "color": "green"}' requires 2 index scans: one for
all the rows that contain {"animal": "dog"}, and one for all the rows that
contain {"color": "green"}. The results are intersected like above.
Simple array containment queries are similar as well: the query
col @> '{"a": [1,2]}' produces 2 index scans, one for /a/1 and one for
/a/2/. The results are intersected as above.
Queries that contain multiple constraints within a single JSON object or array past the first level of nesting require special handling: for each result in the intersection of the index scans, an index join against the primary table must be performed to re-check the query’s index condition. To understand why, observe the following examples and adversarial inputs:
| query | adversarial input |
|---|---|
col @> '[{"a": 1, "b": 2}]' | [{"a": 1}, {"b": 2}] |
col @> '[[1,2]]' | [[1], [2]] |
In each of those cases, an intersection of index scans for each of the paths through the query object will both include rows with the adversarial input, but the adversarial input does not actually contain the query. To resolve this issue, each row in the intersection must be rechecked against the index condition.
Match intersection algorithm
As mentioned above, the fundamental new operation required for performing a search on an inverted index is intersecting 1 or more sets of primary keys and using that as the input to an index scan. This operation will be built into a new plan node that sits in between a scan node and an index join node to avoid polluting the pre-existing scan node implementations.
The inverted index intersection node will handle unioning the output of OR
constraints, and intersecting the output of AND constraints as described
above. Specifically, this node will retrieve all of the matching primary keys,
deduplicate/intersect/union them as necessary, and use the output to perform an
index join against the primary index.
Since the primary keys for each inverted index prefix (everything up to the primary key) are sorted, intersection and union can be performed with a cheap in-memory merge operation.
The sorting of these primary key sets could be used in a future implementation step to transform this naive index join into a merge join, but that won't be done as part of the initial implementation.
Inverted indexes are very large, since they generate an unbounded number of index keys for every value. Implementing ordinary secondary indexes over computed columns (which are not yet implemented as of the time of writing) would produce a smaller index, at the cost of requiring that the customers know ahead of time which keys they need an index on.
Rejected JSON Array Encoding For key encoding arrays we considered leaving out the array special character prefix altogether. For the JSON object in the example above, the following keys would be encoded:
table/jsonCol/3/pk
table/jsonCol/4/pk
table/jsonCol/5/pk
table/jsonCol/a/pk
table/jsonCol/false/pk
table/jsonCol/foo/bar/pk
table/jsonCol/true/pk
The advantage of this approach is a shorter key. The disadvantage is that we
would have to do more disambiguation for a given query because the nesting
information would be lost. For example, searching for containment of [[1]]
in an index on a table with values [1,2] and [[1,2]] would generate a
keyspan that selected both rows, since the nesting information was thrown away
on index creation.
This RFC leaves the door open for several exciting features:
WHERE v->a ? b with the proposed
index structure.@> operator doesn’t support this, but a filter like WHERE v->>a > 10
could utilize a range scan over the index since it’s sorted.Index weighting
How exactly should we weight the selectivity of constraints on inverted indexes versus ordinary indexes?
For example, the following query includes a partially constrained secondary index alongside a constrained inverted index.
CREATE TABLE a (a INT, b INT, c JSON, PRIMARY KEY (a, b));
CREATE INVERTED INDEX ON a(c);
SELECT * FROM a WHERE a = 3 AND c @> {"foo": "bar"};
It's not clear which filter has better selectivity, since we don't have table
statistics to inform us of the distribution of rows in a.
Use cases for path unaware inverted indexes
It's not clear to us whether there will be demand for path unaware inverted indexes on JSON columns.