Feature Name: User-defined SQL logical schemas
Status: superseded by #48276
Start Date: 2018-10-03
Authors: (your name here!), knz
RFC PR: #30916, originally #13319
Cockroach Issue: #26443

Remember, you can submit a PR with your RFC before the text is complete. Refer to the README for details.

Summary

This RFC proposes to introduce an additional level in the namespace structure of stored objects (tables, views, sequences), called the schema namespace, in between the database and objects themselves, and enable SQL client to create their own schemas side-by-side inside a database.

This makes it possible to have two tables with the same name, say orders, in the same database, by making them live in separate schemas: for example, mydb.testschema.orders and mydb.prodschema.orders. This is useful for compatibility and to ease hosting of multiple users/apps on a single cluster.

Motivation

There are two motivations:

A. compatibility with PostgreSQL -- ORMs and client apps expect this to work. B. simplify the management of multi-user or multi-app clusters.

The scenario for the point B goes as follows:

suppose hosting company HostedInc, with customer TheUser.
TheUser staff wants to deploy multiple apps over time but does not want to inform HostedInc continuously about this.
Meanwhile HostedInc doesn't want to provide root access to TheUser, because that would cause risks to the stability of the hosted cluster.
The question thus arises: how to enable TheUser to deploy multiple apps without providing root access? Currently root access is required to create new databases.

The proposed solution, which incidentally is both industry-standard and already expected by users, goes as follows:

HostedInc creates a single database for TheUser.
HostedInc grants the "create schema" privilege to TheUser.
TheUser creates 1 schema per app in their database, and grants finer-tuned permissions on each schema for each app.
The app developers (or the apps themselves) create the tables/objects they need in their respective schemas.

Guide-level explanation

Currently CockroachDB supports a hierarchical system of two levels to organize stored objects (tables/views/sequences):

database
  |
  +---- object

For example:

bank
  |
  +---- users
  |
  +---- accounts

To access an object, CockroachDB currently allows queries to specify a name with 1, 2 or 3 components:

   > select * from users;
   > select * from bank.users;
   > select * from bank.public.users;

The 1-part form assumes that bank has been configured as "current database" with SET database = bank or USE bank.

In the 3-part form, the middle part (public in this example) is currently ignored [1].

This RFC proposes to change the hierarchical model for organizing stored objects as follows:

database
  |
  +---- schema
          |
          +---- object

For example:

bank
  |
  +---- production
  |        |
  |        +---- users
  |        |
  |        +---- accounts
  |
  +---- testing
           |
           +---- users
           |
           +---- accounts

With this structure, an app can access a table as follows:

using a simple 1-part name, e.g. users, both the "current database" (SET database / USE) and the current schema (SET search_path) decide which object is designated.
using a 3-part name, e.g. bank.production.users all the components are clearly specified.
using a 2-part name, we have to deal with two different cases:
- PostgreSQL-compatible applications will use e.g. production.users, where the first part specifies the schema and the second part the object.
- CockroachDB-specific applications developed pre-2.2 will use e.g. bank.users.
This ambiguity is already supported in CockroachDB since version 2.0 as follows:
- if the name has 2 components and the 1st component refers to a valid schema name, then it is understood to refer to that schema.
- otherwise if the 1st component refers to a valid database name, then use that database and the schema name public.
- otherwise an error is reported.

[1] the explanation above fully avoids discussing virtual schemas like pg_catalog and information_schema. These are not relevant to this section.

Reference-level explanation

See the previous section.

In addition, here one should understand that CockroachDB internally already has a very good understanding of PostgreSQL schemas, because that understanding is needed to properly support pg_catalog and information_schema (these are schemas).

So really, CockroachDB already has adequate schema support in that the name resolution rules in the SQL dialect are already suitable.

The missing block is the ability to create 1) additional 2) stored 3) user-defined schemas.

additional: currently just one stored schema is supported. It is anonymous in storage (system.namespace knows nothing about it) and called public in SQL. We want other schemas beside public.
stored: the schemas must continue to exist across cluster restarts, so they cannot be in-memory data structures only.
user-defined: client apps must be able to manage their own schemas with CREATE/ALTER/DROP SCHEMA, subject to syntax and permission checks compatible with PostgreSQL.

Detailed design

Technically the changes impact:

the format of table descriptors: the parent of a table is now a schema not a database.
the system.namespace table: this must support name resolution for schemas.
cluster initialization/migration: a stored schema named public must be created upon cluster initialization.
the SQL/KV name resolution code (sqlbase/resolver.go): this must resolve other schema names besides "public" and virtual schema names.
if new descriptors are added for schemas, we must add new leasing logic for schemas too like we have for tables.
the introspection facilities in pg_catalog, information_schema and crdb_internal: this must iterate over all stored schemas.
Logic for GRANT/REVOKE: this must support privileges on schemas in addition to databases/tables.
permission checking for CREATE (and perhaps SHOW): this must test the create privilege on the target schema not database.
cockroach dump: this must be extended to support dumping a single schema inside a database. It must also be extended to properly qualify schema names inside the dump data.
BACKUP/RESTORE: this must be extended to support backing up / restoring a single schema, and naming schemas when listing target objects.
IMPORT CSV: this must be extended to recognize a schema name in the import target.
IMPORT PGDUMP: this must be extended to disable the "flattening" of the schema information contained in the dump to "public", and instead preserve/use the schema information in the dump.
EXPORT: probably no change needed, but QA must verify that users can use EXPORT for single schemas, or tables across different schemas.
the web UI which presents the database objects in a cluster and the admin RPC endpoints that support these features in the web UI. This must reveal schemas between databases and tables.

Here maybe no technical changes are required (the web UI already knows about schemas, insofar it already knows about virtual schemas), but QA must verify user-defined schemas are surfaced properly.

Optionally:

add zone configuration propagation/inheritance from database to schema, then from schema to table. This is optional insofar that schemas don't have data of their own, so it could be possible to inherit directly from database to table even with schemas defined.

The ability to customize zones per schema would be a feature that eases the life of operators/developers.

(Ben notes: “I think this is worth doing at the same time that we introduce user-defined schemas.”)

Data structures

Relevant quote:

"Show me your flowcharts [code] and conceal your tables [data structures], and I shall continue to be mystified. Show me your tables [data structures], and I won’t usually need your flowcharts [code]; they’ll be obvious." -- Fred Brooks, Turing Awards recipient and author of the Mythical Man-Month.

Central to the new feature in this RFC is the question of how the name resolution works.

Current mechanism

Currently in CockroachDB we have the following connected components:

database descriptors, that know nothing about tables;
table descriptors, that have their own object ID, and a ParentID field that refers to the database descriptor.
the table system.namespace maps:
- IDs to names (for both databases and tables), with an index on names. This supports the name resolution name -> ID.
  
  select id from system.namespace where name = <requested>
  
  Note: the name resolution for databases? (TBD)
- IDs to ParentIDs. This supports listing "all tables of a given database".
  
  select id from system.namespace where parentID = <requested>
In system.namespace, the listed ParentID for a database descriptor is 0. This value is used to distinguish databases from tables and can be used to list all databases:

select id, name from system.namespace where parentID = 0

In this environment, renaming a table (even across databases) only requires rewriting entries in system.namespace.

This RFC proposes to extend this system in one of the ways suggested in the following sub-sections, subject to discussion and experimentation.

Alternative A: lightweight schemas

In this solution, 1) schemas do not have their own descriptors. 2) low-level schema IDs are global across all databases.

(I like the idea to avoid descriptors because that would ask complex questions about leases -- for reference, currently only table descriptors are treated transactionally. Database descriptors are not, and this was deemed acceptable because of some arbitrary perception that databases are only rarely created/dropped/renamed. When/if we implement user-defined schemas, we can't make the same assumption about schemas as we did about databases, and so we must add lease logic for schemas too. This may be complex.)

the field ParentID in TableDescriptor would then target schemas, not databases.
database descriptors would remain unchanged.
system.namespace does not change.
CREATE DATABASE and the migration for pre-20.1 clusters would create a public schema in every database and re-populate ParentID in every table descriptor to target its respective public schema.
CREATE SCHEMA would simply add a row to the system.namespace table.
DROP DATABASE, DROP SCHEMA would iterate over system.namespace accordingly.

Example:

bank
  |
  +---- production
  |        |
  |        +---- users
  |        |
  |        +---- accounts
  |
  +---- testing
           |
           +---- users
           |
           +---- accounts

In this solution system.namespace contains:

parentID	name	Id
0	bank	1
1	public	2
1	production	3
1	testing	4
3	users	5
3	accounts	6
4	users	7
4	accounts	8

The queries to access tables become:

to resolve the parent schema and database of a table:

   select n1.parentID as databaseID, n1.id as schemaID
     from system.namespace n1, system.namespace n2
  where n1.id = n2.parentID
      and n2.id = <requested>

to list all schemas inside a database:

select id from system.namespace where parentID = <requested>
to list all tables inside a schema:

select id from system.namespace where parentID = <requested>

to list all tables inside a database:

  select n1.id
    from system.namespace n1, system.namespace n2
 where n1.parentID = n2.id
     and n2.parentID = <requested>

In this solution, any additional metadata associated with a schema (for example, privileges and grants for creating new tables) would be stored in a separate system table. (to be defined by further experimentation - see also [#2939])

[#2939] https://github.com/cockroachdb/cockroach/issues/2939

Note that it is not necessary to verify schema-level privileges for non-DDL SQL queries, so the creation of a new separate table to hold these privileges would not add additional costs to the "hot path" of latency.

Alternative B: schemas with descriptors

Similar as above, but we create descriptors for schemas and put the privileges inside.

Question then remains what to do about leasing, range IDs, etc.

Alternative C: schemas inside database descriptors

This is also similar to A but here low-level schema IDs are local within 1 database.

the table descriptor is changed to have ParentDatabaseID and ParentSchemaID.
the database descriptor would be extended to contain a list of schemas descriptors. Each of these descriptors would be identified within the database descriptor with an ID starting at 0.
every new database descriptor is instantiated with 1 schema descriptor within with ID 0. This will become the public schema thanks to the next point.
2 new columns schemaID and schemaName are added to system.namespace, prefilled with default 0 and "public". These are used both:
- for databases, to list all schemas in the db.
- for tables, to tell which schema of its parent DB a table is contained in.
CREATE DATABASE and the migration for pre-20.1 clusters would add the 2 new columns to system.namespace with defaults 0 and public. This would ensure the public schema automatically exists in every DB, and that every existing table gets connected to it automatically.
CREATE SCHEMA would add a new schema descriptor inside the parent db descriptor, then add a row to the system.namespace table.
DROP DATABASE, DROP SCHEMA would iterate over system.namespace accordingly.

Example:

bank
  |
  +---- production
  |        |
  |        +---- users
  |        |
  |        +---- accounts
  |
  +---- testing
           |
           +---- users
           |
           +---- accounts

In this solution system.namespace contains:

parentID	name	Id	SchemaID	SchemaName
0	bank	1	0	public
0	bank	1	1	production
0	bank	1	2	testing
1	users	2	1
1	accounts	3	1
1	users	4	2
1	accounts	5	2

the queries to resolve names become:
- to resolve the parent schema and database of a table: look up the parent db descriptor from ParentDatabaseID, then look up the schema inside the db descriptor from ParentSchemaID.
  
  This is faster than for alternative A above.
- to list all schemas inside a database:
  - schema details: simply iterate over the schema descriptors inside the db descriptor
  - schema names:
    
    select schemaID, schemaName from system.namespace where id = <requested>
- to list all tables inside a schema:
  
  select id from system.namespace where parentID = <dbId> and schemaID = <schemaId>
- to list all tables inside a database:
```
select id from system.namespace where parentID = <requested>
```

In this solution, any additional metadata associated with a schema (for example, privileges and grants for creating new tables) would be stored in the parent db descriptor.

Alternative D: recycle the database descriptors

the DatabaseDescriptor gets a new field ParentID that indicates, when populated, that the desc is for a schema and the ParentID is its parent DB. The parent desc of a schema must have a ParentID of 0 (we're not proposing to support arbitrary depth in the hierarchy).
the field ParentID in TableDescriptor would remain as-is, and will be expected to refer to a db descriptor that's also a schema descriptor.

In the back-compat case where the parent dbdesc of a table is an actual db, not a schema desc (its own parentID is 0) then we'll consider the table to actually have the schema "public" as parent.

(Bob says: it's interesting to preserve this property "parent of table desc is actual db not schema => parent schema is public" because it enables not breaking old version nodes for existing table/dbs when new schemas become supported, in mixed-version clusters)

(knz/bob: however we found out that "public" is not special and must be droppable. If we preserve this property, "public" becomes special and cannot be dropped.)
system.namespace does not change.
CREATE DATABASE and the migration for pre-2.2 clusters would create a public schema in every database, and populate a system.namespace entry with public for every db.
CREATE SCHEMA would create a db descriptor with a suitable ParentID value, then populate system.namespace accordingly.
DROP DATABASE, DROP SCHEMA would iterate over system.namespace accordingly.

Example:

bank
  |
  +---- production
  |        |
  |        +---- users
  |        |
  |        +---- accounts
  |
  +---- testing
           |
           +---- users
           |
           +---- accounts

In this solution system.namespace contains:

parentID	name	Id
0	bank	1
1	public	2
1	production	3
1	testing	4
3	users	5
3	accounts	6
4	users	7
4	accounts	8

The queries to access tables become:

to resolve the parent schema and database of a table:

   select n1.parentID as databaseID, n1.id as schemaID
     from system.namespace n1, system.namespace n2
  where n1.id = n2.parentID
      and n2.id = <requested>

to list all schemas inside a database:

select id from system.namespace where parentID = <requested>
to list all tables inside a schema:

select id from system.namespace where parentID = <requested>

to list all tables inside a database:

  select n1.id
    from system.namespace n1, system.namespace n2
 where n1.parentID = n2.id
     and n2.parentID = <requested>

In this solution, any additional metadata associated with a schema (for example, privileges and grants for creating new tables) would be stored in the (db) desc, just like regular databases.

Note: after alternative D is implemented, in a mixed-version cluster, after CREATE SCHEMA db.foo; CREATE TABLE db.foo.t(...); CREATE TABLE db.public.u(...) is run in a 2.2 node, a 19.2 node that runs SHOW TABLES (or look at crdb_internal.tables) will see:

table name	database name
`u`	`db`
`t`	`foo`

Is that a problem? - suspicious because we can also have CREATE DATABASE foo; CREATE TABLE foo.public.t(...) and that will cause a duplicate row in crdb_internal.tables (and hence SHOW TABLES).

Bob's reaction: maybe we can gate CREATE SCHEMA/DROP SCHEMA upon bumping the cluster version.

Test scenario proposal

We propose to implement a validation test to evaluate the various strategies, that runs the following SQL scenario at different stages of a cluster lifecycle:

sql

-- 19.2 cluster

CREATE SCHEMA; -- error "unsupported"
DROP SCHEMA; -- ditto

-- 19.2+20.1 cluster, cluster version still 19.2

CREATE SCHEMA; -- error "disabled until version update"
DROP SCHEMA; -- ditto

CREATE DATABASE d21; CREATE TABLE d21.public.t21(x int); -- in node 19.2
CREATE DATABASE d22; CREATE TABLE d22.public.t22(x int); -- in node 20.1

CREATE TABLE d21.public.t22(x int); -- in node 20.1
CREATE TABLE d22.public.t21(x int); -- in node 19.2

ALTER DATABASE d21 RENAME TO d21_new; -- in node 19.2
ALTER DATABASE d22 RENAME TO d22_new; -- in node 19.2

SHOW DATABASES; -- in node 19.2, verify all is well;
SHOW DATABASES; -- in node 20.1, verify all is well;

SHOW SCHEMAS FROM d21_new; -- in node 19.2, verify all is well;
SHOW SCHEMAS FROM d22_new; -- in node 19.2, verify all is well;
SHOW SCHEMAS FROM d21_new; -- in node 20.1, verify all is well;
SHOW SCHEMAS FROM d22_new; -- in node 20.1, verify all is well;

SHOW TABLES FROM d21_new; -- in node 19.2, verify all is well
SHOW TABLES FROM d21_new; -- in node 20.1, verify all is well
SHOW TABLES FROM d22_new; -- in node 19.2, verify all is well
SHOW TABLES FROM d22_new; -- in node 20.1, verify all is well

DROP DATABASE d21 CASCADE; -- in node 19.2, verify no leftovers in system.namespace?
DROP DATABASE d22 CASCADE; -- in node 19.2, verify no leftovers in system.namespace?

-- FIXME: some GRANT tests here.

-- 20.1 cluster with version bumped

CREATE SCHEMA; -- OK
DROP SCHEMA; -- ditto

-- take the tests from above and verify the behavior is still OK

Drawbacks

Why should we not do this?

TBD

Rationale and Alternatives

Arguments from Ben about the choice between A, B, C:

I prefer option B.

I dislike option A because it creates such a difference between databases (which have descriptors) and schemas (which have some other to-be-determined place to store the things that are currently in the DB descriptor). Option A would be more appealing to me if we planned to move away from database descriptors too and make them use the same system as for schemas.

Option C is acceptable, but it feels hackier to me than option A or B. I think the concerns about database descriptor leases are more severe for option C (for option B, operations on schemas would not be substantially more common or more complex than operations on databases are today, so it doesn't make things much worse, but in option C the kinds of changes we'd make to database descriptors become more complex). Vivek's schema leases should take care of this no matter which alternative we choose here, though.

Discussion with Bob:

Q: can we not just store the schema of a table as an additional field in the table descriptor?

A: yes that is option A.
Q: can we reuse the DatabaseDescriptor as a schema descriptor, with perhaps some additional field that marks it as a "Schema" (and not a db) and indicates its db parent?

A: this would be a new alternative (D) - however today (CRDB 19.2) we do not have proper transactional semantics on database DDL; clients may be ok with lack of txn sems on DBs but the word is that they care much more about txn sems on schema DDL. If we do option D, we'll need to introduce txn semantics on db descs.

A: alternative D raises an interesting question about the status of schema "public" in mixed-version clusters. What happens when 20.1 nodes are started with some 19.2 nodes still active?
- with the "base" version of alt D, existing/non-modified tb descs refer to an actual db desc as parent, and this is interpreted as "public" however such a public schema cannot be edited (eg dropped).
  
  Q: Is that a problem? When can the restriction be lifted?
  
  A: presumably when the cluster version is bumped.
  
  Q: can other (non-"public") schemas be used in the meantime?
  
  A: not really, see remarks at end of section for alt D.
Q: does any of the alternatives support actually sane mixed-version behavior?

A: not definitive answer yet?
- Discussion above suggests not in alt D. What about the others?
- With alt B (separate descriptors for schemas) it appears that the 19.2 code would "skip over" descriptor types that it does not know about (the type casts from Descriptor to Table/DatabaseDescriptor are conditional). However we need to audit all such casts not just those used for SHOW / crdb_internal / pg_catalog / information_schema.
- With alt A presumably the schemas would be entirely invisible to the 19.2 nodes. However we need to ensure that any schema information populated by 20.1 nodes do not get destroyed by subsequent DDL in 19.2 nodes. (Would the new tb desc ParentID field be overwritten by DDL in 19.2?)
- With alt C fundamentally the same questions as A, in addition need to check that db DDL by 19.2 nodes do not destroy the schema data in db descs edited by 20.1 nodes.
Alts A and C raise uncomfortable questions because the corresponding investigation needed to establish confidence about mixed-ver compat is difficult. The investigation for alt B is also somewhat difficult.

Arguably, in order to keep the problem tractable, out of pragmaticism, we may choose to simply disallow schema DDL until cluster version is bumped, in which case we can skip concerns about mixed-ver compat from the decision making altogether.
Q: suppose we restrict access to schema DDL until cluster version is bumped. What of 20.1 nodes in a mixed-ver cluster, before the cluster version is bumped? Do we need two code paths and a switch on the cluster version?

A: depends.
- alt C: no, because of reasonable defaults
- alt D: completely fine because of reasonable defaults
- alt A: we need a migration when the first 20.1 node starts, to add the "public" schema entries in system.namespace. Then the code will work because of sane defaults (schema ID for "public" is 0).
  
  Q: what happens with CREATE DATABASE in 19.2 nodes after the 20.1 node starts? These will not get their "public" schema entry in system.namespace!
  
  A: TBD
- alt B: same questions as alt A
Q: what of the impl of alt D, in the intermediate situation (20.1 node active, cluster version not bumped yet)?

A: combination of multiple things:
- CREATE SCHEMA/DROP SCHEMA is disabled (until ver is bumped)
- SHOW TABLES must work; this is possible by an additional conditional (described in alt D section) that says if the parent of a tb is a "real" db, interpret this as schema "public" instead.
- GRANT of schema-level permissions are still disabled at that point, because we can't ensure that we have a valid desc for "public" (for compat with 19.2 nodes)
- migration upon bumping the cluster ver, to fix up all the table descs to refer to an actual "public" schema desc instead.
- after that migration has ran, schema DDL and GRANT on schemas become available.

Unresolved questions

Which alternative design above will be selected.
Whether or not schemas should have their own zone config.