docs/RFCS/20200331_enums.md
This RFC proposes adding enum types to CockroachDB.
Enum types are a class of user defined types where the values in the type are constrained to a fixed set of user specified values. The system then ensures type safety over operations on this type. This includes ensuring that only values that are members of the enum can be inserted into a column of the enum type, and that enums can only be compared to other values of the same enum type. For example, consider an application that needs to store events and the days of the week that they happen. This application could use an enum to represent the days of the week.
CREATE TYPE day ENUM AS ('monday', 'tuesday', 'wednesday'...);
CREATE TABLE events (id INT, dayofweek day);
INSERT INTO events VALUES (1, 'monday');
To implement enum types in CockroachDB, we have to touch many layers of the system. In particular, we need to introduce a way of storing metadata about enums durably in the database. We then need a way to cache this metadata so that lookups on this metadata is fast, as well as a way to invalidate this cache when enum metadata changes. When enum metadata changes, we need to ensure that these changes do not result in some nodes in the cluster entering a situation where they are unable to process enum values they find. Lastly, we need to define a physical layout for enums and integrate enums within the type system and SQL execution stack.
Enums themselves are a special case of user-defined types. In order
to lay the groundwork for future work in this area, we propose storing
metadata about an enum in a new descriptor called a TypeDescriptor.
This descriptor will be added to the descriptor union alongside table and
database descriptors. The descriptor will store metadata about the type,
including the parent database and schema IDs, a unique ID for the type, and
the name of the type. The descriptor will also include specific information
for the kind of type being stored in the descriptor (as of now there
would only be enums). For enums, this information would include the mapping
of the enum's values to their physical representations. A proposal of the
descriptor's contents is below:
message TypeDescriptor {
// Used by all kinds of user-defined types.
// Parent database and schema.
uint32 parent_id;
uint32 parent_schema_id;
// ID and Postgres compatible OID of the type.
uint32 id;
uint32 oid;
// Visible name of the type.
string name;
// Enum specific fields.
message enum_members {
byte[] physical_encoding;
string name;
};
enum_members[] members;
}
These descriptors
will be stored in the system.descriptor table and will use the leasing
and versioning system being built. There is ongoing work on unifying
the leasing interface so that components are easily shared across
different descriptor types, and we will take advantage of these
systems once they are available. The leasing system will enable caching
and cache invalidation of type descriptors. Until the leasing system
is ready for integration, we will first implement a prototype
that either doesn't use a cache or uses a simple incoherent cache for
TypeDescriptor access.
Enums are scoped within a database and a schema. In Postgres, enums cannot be accessed from other databases -- they can only be accessed from different schemas in the same database. However, there is no core reason that CockroachDB cannot support this. In fact, we might need to support references of types across databases to be in line with other cross database references that we currently support. The topic of cross database references has come up in discussion about user defined schemas as well. The direction that we take in allowing cross database references vs allowing only cross schema references will follow what has been decided in that context.
Table and type names exist within the same namespace in Postgres. This means
that it is possible to create a type and table of the same name within
the same schema. Additionally, tables in Postgres are types themselves
as a record type where each field is typed like the tables columns. Therefore,
we will store type namespace entries along with table namespace entries
in the system.namespace table. This allows namespace conflicts between
types and tables to be properly detected, as well as allowing us to reuse
a large amount of name resolution logic that exists for table name lookup.
This strategy also will allow the user defined types implementation to
adapt to new features like user defined schemas without extra work.
All user defined types will need a stable ID that they are uniquely addressable
by from within CockroachDB, as well as an OID that can be used for Postgres
compliant operations. Importantly, the OIDs cannot conflict
with existing type OIDs. Our strategy is to construct OIDs from the stable ID.
In particular, the OID of a user defined type is equal to
ID + oidext.CockroachPredefinedOIDMax. This strategy allows us to easily
map back and forth between OIDs and IDs, and avoid using multiple counters for
essentially the same information. The offset ensures that no user defined
types have OIDs that conflict with any preexisting OIDs. This approach will
naturally extend when we allow treating tables as types.
There are a few ways that enums can change over time.
In order to rename an enum or a value in an enum can be done with a write to the enum descriptor and then waiting for all nodes to agree on the new value. There are plans to lift operations on descriptor names off of the individual descriptors, because such operations are common to all of them. This work would involve moving the draining names off of descriptors as well. It's possible that this work would be part of or take advantage of this effort.
The case of adding a new enum element is more difficult. The key difficulty comes from ensuring that a node does not attempt to translate a physical layout that it does not know about yet into a user facing representation of the enum. If we naively just add the new enum value to the enum metadata, it is possible that another node reads a newly written enum from disk and is unsure how to decode it. Consider the following sequence of events:
foo to its enum descriptor and blocks on
WaitForOneVersionfoofoo before receiving the update to
its enum descriptor.In order to avoid these situations, we propose an extension of the strategy used for performing online schema changes. As a reminder, when we add a new schema object to a table, it moves through a series of states before becoming usable. As the object moves through these states, the types of operations that are allowed upon the object change. Between each state, we require that all nodes in the cluster agree on the new version of the schema object. For more details, refer to the online schema changes RFC. We propose a similar state progression to adding new elements to an enum type.
This process ensures that all nodes know about all potential enum values before they have a chance to be written. This approach has the drawback of not being able to add an enum value and then insert that value in the same transaction. This drawback is similar to our existing limitation of not being able to add a column and insert into it in the same transaction.
This enum schema change will be implemented with a new job, rather than trying to build off of the existing table schema changer. While conceptually similar to a table schema change, there is not much implementation to share. This new job will
A rollback of this job can just remove the "read-only" values.
Additionally, enums don't really need a concept of mutations like tables. The
members of an enum in the enum's TypeDescriptor can be tagged with whether
the member is "read only" or public.
In Postgres, if an enum is dropped without CASCADE, the operation will not succeed
if there are any tables that use the enum. If an enum is dropped with
CASCADE, all dependent columns are dropped as well. If the database
that an enum is created within is dropped, then the enum
is dropped as well. In order to maintain this information, the
descriptors that represent an enum need to hold back-references to
the tables and columns that use them. We expect the descriptor leasing
system being developed to manage invalidation of cached enums when enums
are destroyed in these cases.
At first, it may seem that a valid implementation of enum values is to map each to an integer, and then store these integers on disk. This implementation seems like it would supply all the ordering guarantees needed of enums. However, Postgres allows for adding new enums and specifying the order of the newly created enum with respect to an existing value of the enum. This looks like:
CREATE TYPE t ENUM AS ('v1', 'v2');
ALTER TYPE t ADD VALUE 'v1.5' AFTER 'v1'
This means add the value v1.5 to the enum t and order it
after the value v1. Using just integers as the backing value
for enums would not allow us to handle this sort of case.
Postgres implements this feature on enums by storing a sorting
order for enums as a float. When a new value is added like this,
Postgres takes the sort orders of the enums that the new enum is
being inserted in between, and creates a float that bisects the
range between the two orders. Concretely, if v1 had a sort order
of 1.5 and v2 had a sort order of 2.0, then v1.5 would be
inserted with a sort order of 1.75. However, once the floating
point precision limit has been reached, Postgres rewrites all
sort orders to integral values. Postgres can do this because it
doesn't require a stable disk encoding for enums. In our case,
we need to have a stable encoding to store data on disk if an enum
is used in an index, and cannot afford to rewrite all tables using an
enum if the enum sort order has changed.
We propose a different strategy that is related to this idea of
bisecting ranges, but doesn't suffer from problems due to floating
point arithmetic precision. The general idea is to use byte arrays
to hold the sort order of our enums, and reserve some bytes in the
arrays to create the ordering that we need. In particular we reserve
the minimum byte (0) and have a maximum allowed byte. In practice
this will be 255. An example of the encoding scheme is below.
Assume we started with 3 elements (a, b, c), and let the maximum byte value be 3.
The sort order byte arrays for each element would be:
a 1/
b 2/
c 3/
To add an element after b we can create a new key that sits in the middle of the range
between b and c.
a 1/
b 2/
d 2/2/
c 3/
Now lets add more values before d. The first one is easy:
a 1/
b 2/
e 2/1/
d 2/2/
c 3/
The tricky case is adding a value before e. Because we reserved the minimum byte, we can
append it and then bisect the range again.
a 1/
b 2/
f 2/0/2
e 2/1/
d 2/2/
c 3/
This strategy can be extended indefinitely as long as this pattern is followed to reserve the minimum byte. A prototype of the exact algorithm is included as part of the RFC PR.
This sort order byte array will be the physical layout and identifier of the enum. We expect that for small enums only a byte or two will be used to hold all the values, and that our compression strategies at the storage layer will compress this data well.
Since the common case of adding members to an enum is to add a member at the beginning or end of the set of values, we can adjust the algorithm slightly to better handle this case. When generating a new key byte where one of the endpoints is the min or max element, the algorithm can add or subtract a small constant from the existing key rather than bisecting the range. This allows for adding many more elements to the beginning or end of the range without increasing the number of bytes used to store the enum. The algorithm can be found implemented in this PR.
Currently, the CockroachDB grammar is not equipped to handle type names that are qualified due to changes made in the past that separated parsing of object and type identifiers. Some of these changes will have to be reverted/adapted in order to allow for types to have qualifications again. The work to allow the parser to recognize qualified names has been done in this PR.
The type system of CockroachDB currently makes an assumption that anywhere
a type is present in an AST, that type is statically known. In code, this
means that every AST object that holds a type (like a CastExpr or
ColumnDef) holds a *types.T, which is constructed at parse time.
As part of implementing user defined types, the type system must be taught
that all types are no longer statically known. The general idea is to change
the types in AST nodes to a new interface representing an unresolved type
reference. These type references can then be resolved into *types.T through
type resolution. Additionally, we must enforce that types are only attempted
to be accessed after type checking, when all type references have been resolved.
A prototype of this approach can be found in
this PR.
After the process of type resolution, enums need a types.T for interaction
with other components of the system. We will introduce a new family for enums,
and the types.T for an enums will contain the stable ID for the
TypeDescriptor that backs the type. The types.T will also contain extra
fields for an enum like the mapping of names to values. Importantly, these
extra fields will not be serialized as part of the proto. Instead, when a
type is resolved, the returned *types.T will be hydrated to populate these
fields.
A potential option was to avoid using
a TypeDescriptor and instead just extend the types.T proto to contain
necessary fields for user defined types. However, this is not feasible because
the types.T proto's are stored on disk in various descriptors. It is too
expensive to update all descriptors that contain a type every time the type
is altered.
A new Datum DEnum will be introduced to represent values of the
enums at runtime. A DEnum will store the physical representation of the
enum as well as the hydrated *types.T of its type. The extra fields in the
*types.T that hold information about enum values will be used for datum
operations without the need to thread ID resolution capabilities to evaluation
of operations on datums.
When a user-defined type is created in Postgres, Postgres will automatically
create an alias for an array of the new type. For example, if a user creates
a type days, the system would also create the type _days as an alias for
days[]. This type tracks changes made to the referenced type as it
moves through schemas and is dropped.
The optimizer will need to be taught about the check constraint implied by
a column being of an enum type. Additionally, it will need to be taught how
to convert enum values from their input string representation into their
Datum physical representation.
The Catalog that is used by the optimizer will need to be extended to support
resolution of types. The way that the catalog represents user defined types is
important for invalidation of cached plans. If a type is updated, all plans
containing data sources using the type need to be invalidated.
The gateway node that plans a SQL query has access to all resolved type
information for the query. Remote nodes that different parts of the query
are planned on need access this information in order to correctly execute
the query. In particular, these nodes need to hydrate their *types.T
containers with metadata and they need to parse and type check serialized
expressions. The hydration of *types.T objects can be done at operator
initialization. The trickier problem is type checking serialized expressions --
we don't want to pay the cost of name resolution again. Our strategy is to
serialize user defined type references with their OIDs similar to how column
references are serialized. All explicit references to user defined types (i.e.
in casts or user defined type value literals) will be serialized like @<OID>.
The expression initialization process will resolve these OID references to the
correct TypeDescriptor. To actually resolve these references, we access the set
of leased descriptors through a descs.Collection that is initialized for each
flow.
During discussion of the RFC, some alternatives were debated. In particular,
the ideas of using a separate namespace table for types and/or a separate
descriptor table for metadata storage. The benefit of a separate namespace
table is that it has the potential of making future work in allowing tables
to be interpreted as types more straightforward. However, using a separate
namespace table complicates existing name resolution and conflict detection
strategies. A separate descriptor table allows for scans over all tables or
types to not have to touch descriptors of different types, which is a
performance improvement for catalog table operations. However, this problem
is somewhat orthogonal to this work, and would be better solved by building
some sort of indexing structure on the system.descriptor table.
Using the existing namespace table allows most of the existing name resolution
code to be used directly, and using the same descriptor table allows for
leasing primitives to be built on only one system table.
One alternative approach to this physical layout was to store just an enum ID on disk, and store ordering and representation information in a separate lookup table. When operations like on enums would involve joining or rendering the enums, a join would be produced against this reference table. This allows for easy changing of enum data, but results in a variety of complexity during planning.
It is unclear what interactions will arise between this work and the planned/ongoing work with user defined schemas.