# YSQL Tablegroups
Note: This is a new feature in Beta mode.
"In workloads that do very little IOPS and have a small data set, the bottleneck shifts from CPU/disk/network to the number of tablets one can host per node. Since each table by default requires at least one tablet per node, a YugabyteDB cluster with 5000 relations (tables, indexes) will result in 5000 tablets per node. There are practical limitations to the number of tablets that YugabyteDB can handle per node since each tablet adds some CPU, disk and network overhead. If most or all of the tables in YugabyteDB cluster are small tables, then having separate tablets for each table unnecessarily adds pressure on CPU, network and disk.
To help accomodate such relational tables and workloads, we've added support for colocating SQL tables. Colocating tables puts all of their data into a single tablet, called the colocation tablet. This can dramatically increase the number of relations (tables, indexes, etc) that can be supported per node while keeping the number of tablets per node low. Note that all the data in the colocation tablet is still replicated across 3 nodes (or whatever the replication factor is)." - From the existing design of colocated tables.
We have extended this concept to "tablegroups." A tablegroup is a group of tables that are colocated together on a single colocation tablet, specified at the language level in YSQL. This allows users to create multiple colocation tablets per database.
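As a sketch of the intended YSQL surface (the tablegroup and table names here are purely illustrative; the exact syntax and its restrictions are described in later sections):

```sql
-- Create a tablegroup, then colocate two small tables on its single tablet.
-- Names (app_tg, order_status, orders_meta) are hypothetical.
CREATE TABLEGROUP app_tg;

CREATE TABLE order_status (
    id    INT PRIMARY KEY,
    label TEXT
) TABLEGROUP app_tg;

CREATE TABLE orders_meta (
    k TEXT PRIMARY KEY,
    v TEXT
) TABLEGROUP app_tg;
```

Both tables now share the colocation tablet backing `app_tg`, instead of each getting tablets of their own.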
The motivation for having multiple groups of colocated tables within a database is similar to that for marking an entire database as colocated. Today, specifying a database as colocated generates a single tablet to which all future tables (that do not opt out of colocation) are added. This lets a user colocate closely related tables, avoiding both the cost of additional round trips when joining related data and the storage/compute overhead of creating a tablet for every relation, which improves performance.
However, the current implementation has a few pitfalls. There is sometimes a need to horizontally scale groups of tables together, especially as the colocation tablet containing them grows excessively large. There is also a need to colocate multiple distinct groups of tables (e.g. per schema) without requiring them all to live on one tablet. This can improve scalability while still benefiting from the latency improvements afforded by colocation.
This problem can be addressed in three ways, all under the concept of forming tablegroups. We have implemented the first (currently in beta) and are beginning a conversation around which of the latter two we will eventually support.
When a tablegroup is created, the catalog manager should create a parent table and parent tablet for that tablegroup, to be used for colocation. This will reuse code from the initial commit of database-level colocation to support multiple tables on a single tablet. The parent table is a dummy table, in the same way that the system tables use the sys catalog tablet (with `kSysCatalogTableId` as `primary_table_id`). These changes should be similar to the CREATE DATABASE changes found here.
The catalog manager will keep a per-namespace map from `tablegroup_id` to the `TabletInfo` for that tablegroup. The `tablegroup_id` is a 16-byte Yugabyte UUID derived from the tablegroup oid and the database oid, in the same way that a `PgsqlTableId` is derived.
Initially this will look like `map<tablegroup_id, tablet_id> tablegroup_tablets;` and `map<namespace_uuid, tablegroup_tablets> tablegroup_ids_map;`. Eventually, `tablegroup_tablets` will be changed to `map<tablegroup_uuid, vector<tablet_id>>` if interleaving or co-partitioning is implemented.
Catalog manager will also maintain some in-memory metadata for tablegroups.
In the event of a restart, `catalog_loaders` will be able to reload this information from the persisted table and tablet info.
There are two things that need to be done here:

1. Add a new `pg_tablegroup` system catalog.
2. Either add a new column to `pg_class` or re-use existing columns in `pg_class` to store the per-relation tablegroup information.

For the new system catalog we will do the following:
Add a `pg_tablegroup` system catalog that contains the `oid`, `grpname`, `grpacl`, `grpowner`, and `grpoptions` of each tablegroup. To ensure backwards compatibility, we will check whether the physical table exists in `postinit` and, if so, set a Postgres global. Elsewhere in the codebase, anywhere a tablegroup-related operation needs to be performed, this PG global will be checked to guard opening the relation.
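Illustratively, the catalog would have roughly the following shape. This is a sketch only: the column names come from the list above, but the types shown are assumptions modeled on other Postgres catalogs, and `pg_tablegroup` is created by initdb rather than by a user-issued DDL.

```sql
-- Sketch of the assumed pg_tablegroup layout (types are assumptions).
CREATE TABLE pg_tablegroup (
    oid        oid,        -- tablegroup OID
    grpname    name,       -- tablegroup name
    grpowner   oid,        -- OID of the owning role
    grpacl     aclitem[],  -- access privileges
    grpoptions text[]      -- per-tablegroup options
);
```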
For the per-relation information there are a couple of options:

- Use the `reltablespace` field of `pg_class`. In this column we could store the `grpoid` of the tablegroup the table belongs to. This is likely not the best long-term solution, especially since we intend to implement tablespaces to handle placement configurations and GDPR-related requirements.
- Add another column to `pg_class` (likely the simplest option to implement) that records the tablegroup a relation belongs to. This could introduce issues such as needing a default tablegroup and having to create this column for all production databases. Until we have an initdb upgrade path in place, we cannot introduce a `reltablegroup` column.
- Use `reloptions` to store the per-relation tablegroup ID. This is the option we have chosen: we have added a new reloption of type OID to store this information.
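With `reloptions`, tablegroup membership is visible directly from `pg_class`. Assuming the option is stored under a key such as `tablegroup` (the exact key name is an implementation detail, not confirmed here), a lookup could look like:

```sql
-- List relations carrying a tablegroup reloption (key name assumed).
SELECT relname, reloptions
FROM pg_class
WHERE reloptions::text LIKE '%tablegroup%';
```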
To allow for extensibility for co-partitioning / interleaving, the idea we had in mind is as follows.
To create a tablegroup, a user must either be a superuser or have CREATE privileges on the database in which the tablegroup is to be created. At tablegroup creation time, an OID is generated and a lookup against `pg_tablegroup` checks whether the tablegroup already exists. If it does not, a new tuple is inserted.
We have created a flow for CREATE TABLEGROUP along with corresponding new RPCs. On the catalog manager side, we create a parent table for the tablegroup with a single tablet, and the catalog manager adds entries to the appropriate maps. The parent table's name and ID are formed by prefixing the tablegroup ID and appending a constant tablegroup parent ID/name suffix. The colocated property is set on both the table and the tablet in order to take advantage of some code reuse.
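The privilege check described above means a non-superuser needs CREATE on the database before creating tablegroups in it. For example (role, database, and tablegroup names are hypothetical):

```sql
-- Allow app_user to create tablegroups in appdb.
GRANT CREATE ON DATABASE appdb TO app_user;

-- Then, connected to appdb as app_user:
CREATE TABLEGROUP app_tg;   -- inserts a tuple into pg_tablegroup
```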
A tablegroup must be explicitly dropped, and it must be empty (no relations associated with it) before it can be dropped. To verify emptiness, we scan `pg_class` and parse `reloptions` wherever they are non-empty.
On the catalog manager side, corresponding entries are removed from the maps and the parent table & tablet are deleted.
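A tablegroup drop therefore only succeeds once its last member relation is gone. A sketch, with hypothetical names:

```sql
DROP TABLEGROUP app_tg;   -- fails while any relation still belongs to it

DROP TABLE order_status;
DROP TABLE orders_meta;

DROP TABLEGROUP app_tg;   -- now succeeds; the parent table and tablet are deleted
```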
This will be supported; it amounts to changing the Postgres metadata after checking permissions on the tablegroup.
RBAC for the tablegroups themselves: `ACL_CREATE` privileges will be grantable for tablegroups.
If a table is created with `CREATE TABLE [...] TABLEGROUP tablegroup_name;`, then it will be colocated with the rest of the tables in that tablegroup.
To help avoid confusion between colocated databases and tablegroups, the two will not be allowed within the same namespace. In such a namespace, `COLOCATED=true/false` as part of the CREATE TABLE DDL will throw an error, and if the database was created with `COLOCATED=true`, any CREATE TABLE DDL with `TABLEGROUP` specified will throw an error. Over time we will deprecate `COLOCATED=true/false` from both CREATE TABLE and CREATE DATABASE and transition to using only `TABLEGROUP`.
We must also check that a SPLIT clause is not included, as we will not initially support co-partitioning/interleaving. With co-partitioning/interleaving we will need additional logic under the hood to ensure that the partition keys are properly specified; in the future we will need to validate that, for co-partitioning/interleaving, the partition keys match (in data types and number of columns).
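Under these restrictions, statements like the following would be rejected (table and tablegroup names are hypothetical, and the clause ordering is illustrative):

```sql
-- Mixing the colocation property with a tablegroup: error.
CREATE TABLE t1 (k INT PRIMARY KEY) WITH (COLOCATED = true) TABLEGROUP app_tg;

-- A SPLIT clause is incompatible with a single colocation tablet: error.
CREATE TABLE t2 (k INT PRIMARY KEY) SPLIT INTO 4 TABLETS TABLEGROUP app_tg;
```

Likewise, in a database created with `COLOCATED=true`, any `CREATE TABLE ... TABLEGROUP` statement errors out.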
Logical flow for CREATE TABLE will look like this:
There will be a couple of options available for indexes:

- `NO TABLEGROUP` syntax.
- `TABLEGROUP group_name` syntax.

Should be similar to this (and this) commit. We will need to remove the map entries from the catalog manager and the corresponding info from the Postgres system metadata. Physical removal of table data will happen during compactions. By setting the colocated property in table and tablet metadata, we should be able to take advantage of the earlier implementation here.
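The index syntax options described above would let an index opt into a specific tablegroup or out of colocation entirely. A sketch, with hypothetical names:

```sql
-- Colocate the index in a specific tablegroup.
CREATE INDEX idx_status ON order_status (label) TABLEGROUP app_tg;

-- Opt the index out of colocation entirely.
CREATE INDEX idx_status_solo ON order_status (label) NO TABLEGROUP;
```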
In the case of interleaving, there should be a way to specify whether a drop cascades to all children or just drops the table and re-interleaves its children with its parent.
Taking from the approach here, this will be a table-level tombstone. Refer to the commits above related to the drop-table flow for colocation; we need to make sure to use this alternative code path. Reads are already aware of this table-level tombstone.
This is currently in the works for the status quo implementation of colocation. We will not support this in v1 for tablegroups.
Here are a couple ideas on how to support this:
We modified the master Tables UI to display a table with the parent-table information and the YSQL OIDs of the tablegroups (only when tablegroups are present). Similarly, when tablegroups are present, a new column in the UI for user tables and index tables displays the parent OID, if any.
- Supporting the ability to alter the tablegroup of a table/index.
- ysql_dump / backup & restore with tablegroups present.
- Better load balancing of tablegroups / colocated tables.
- Supporting co-partitioning and/or interleaving.
- Automatic tablet splitting of tablegroup parent tablets.
- Ability to create a CDC stream per tablegroup.
- Initdb upgrade path (not specific to tablegroups, but relevant).