architecture/design/ysql-gin-indexes.md
Tracking GitHub issue 7850
As of v2.6 and v2.7, Yugabyte supports only one type of index access method for DocDB-backed relations: lsm. This simply maps columns to the primary key of the base table.
Generalized inverted indexes map elements inside container columns to the primary key of the base table. To support GIN, we add a new access method ybgin and implement the access method API. A lot of the work can be borrowed from the upstream PostgreSQL gin access method.
Upstream PostgreSQL comes with the following gin opclasses:
tsvector_ops: map tsvector to textarray_ops: map anyarray to anyelementjsonb_ops: map jsonb to text (key or value in jsonb)jsonb_path_ops: map jsonb to int4 (hash of root-to-leaf path in jsonb)We can just duplicate these for the ybgin access method and reference the same opfamily.
ambuild and yb_ambackfill are used for CREATE INDEX with and without online schema changes. They both read from the base table and insert to the index table. Therefore, this is a superset of yb_aminsert.
yb_aminsert is used instead of aminsert for Yugabyte. To support index writes, extract the scan entries (reuse code from upstream) for each item and write
[scan entry, basectid]
For example, given ybgin index on a int[] and insert pk='foo', a={1,3,5}, write
[1, 'foo']
[3, 'foo']
[5, 'foo']
to the index and
['foo'] -> '{1,3,5}'
to the base table.
amgettuple is used for ybgin, unlike amgetbitmap for gin. Since gin uses bitmap scan but ybgin doesn't, the implementation here is most different.
Given a query, extract the scan keys and entries (reuse code from upstream). In the simplest case, only one scan key and one scan entry is given, so the query can be like
basectid := get_from_idx(scan entry)
return get_from_tab(basectid)
A more complicated case is one key and two entries where both are required. We can scan using one entry and recheck the condition afterward.
For the first iteration, rechecking the condition is always done, even if it may not be necessary.
yb_amdelete is used. Upstream postgres doesn't have a delete because it relies on vacuum, but we don't have that, so we need to explicitly write tombstone records. This is similar to insert.
[scan entry, basectid] -> tombstone
An update operation does a delete then insert. This may be inefficient for ybgin since updating one row can mean many deletes and inserts when only a few would have sufficed.
For regular indexes and tables, nulls are binary: an item is either null or not. For GIN, there's more to distinguish, so they are categorized as follows:
ARRAY[null])ARRAY[])nullDocDB does not support null categories, so we can add a new value type specifically for GIN nulls. For the first iteration, nulls won't be supported.
For regular indexes, there are scan flags such as SK_SEARCHISNULL and SK_SEARCHARRAY. For GIN, there are search modes:
GIN_CAT_EMPTY_ITEMGIN_CAT_NULL_ITEMMulticolumn will be disabled in the first iteration, and the design is forthcoming.
The upstream PostgreSQL "fastupdate" option writes rows to a buffer (called pending list) before flushing to disk for performance purposes. We're not implementing this in the first iteration. Moreover, since in a multi-node setup the list needs to be cached on all nodes, we may never support it.
Some extensions extend gin:
btree_gin: add opclasses to support ordinary types (no element extraction) so that they can be pseudo-included in gin indexes, which don't allow included columnshstore: add opclass to support hstore typeintarray: add opclass to support faster and more operators on _int4 (int4 array) type without nullspg_trgm: add opclass to support trigram text search on text typeA ybgin equivalent can be created for each. Creating the system objects is simple, but the underlying implementation may need to be rewritten.