Back to Cockroach

Summary

docs/RFCS/20200331_enums.md

26.1.319.2 KB
Original Source
  • Feature Name: Enum Data Types in CockroachDB
  • Status: accepted
  • Start Date: 2020-03-31
  • Authors: Rohan Yadav, Lucy Zhang, Andrew Werner, Jordan Lewis
  • RFC PR: #47070
  • Cockroach Issue: #24873

Summary

This RFC proposes adding enum types to CockroachDB.

Background

Enum types are a class of user defined types where the values in the type are constrained to a fixed set of user specified values. The system then ensures type safety over operations on this type. This includes ensuring that only values that are members of the enum can be inserted into a column of the enum type, and that enums can only be compared to other values of the same enum type. For example, consider an application that needs to store events and the days of the week that they happen. This application could use an enum to represent the days of the week.

sql
CREATE TYPE day ENUM AS ('monday', 'tuesday', 'wednesday'...);
CREATE TABLE events (id INT, dayofweek day);
INSERT INTO events VALUES (1, 'monday');

Overview

To implement enum types in CockroachDB, we have to touch many layers of the system. In particular, we need to introduce a way of storing metadata about enums durably in the database. We then need a way to cache this metadata so that lookups on this metadata is fast, as well as a way to invalidate this cache when enum metadata changes. When enum metadata changes, we need to ensure that these changes do not result in some nodes in the cluster entering a situation where they are unable to process enum values they find. Lastly, we need to define a physical layout for enums and integrate enums within the type system and SQL execution stack.

Detailed Explanation

Metadata Storage

Enums themselves are a special case of user-defined types. In order to lay the groundwork for future work in this area, we propose storing metadata about an enum in a new descriptor called a TypeDescriptor. This descriptor will be added to the descriptor union alongside table and database descriptors. The descriptor will store metadata about the type, including the parent database and schema IDs, a unique ID for the type, and the name of the type. The descriptor will also include specific information for the kind of type being stored in the descriptor (as of now there would only be enums). For enums, this information would include the mapping of the enum's values to their physical representations. A proposal of the descriptor's contents is below:

proto
message TypeDescriptor {
    // Used by all kinds of user-defined types.
    // Parent database and schema.
    uint32 parent_id;
    uint32 parent_schema_id;
    // ID and Postgres compatible OID of the type.
    uint32 id;
    uint32 oid;
    // Visible name of the type.
    string name;
    
    // Enum specific fields.
    message enum_members {
        byte[] physical_encoding;
        string name;
    };
    enum_members[] members;
}

These descriptors will be stored in the system.descriptor table and will use the leasing and versioning system being built. There is ongoing work on unifying the leasing interface so that components are easily shared across different descriptor types, and we will take advantage of these systems once they are available. The leasing system will enable caching and cache invalidation of type descriptors. Until the leasing system is ready for integration, we will first implement a prototype that either doesn't use a cache or uses a simple incoherent cache for TypeDescriptor access.

Name Resolution

Enums are scoped within a database and a schema. In Postgres, enums cannot be accessed from other databases -- they can only be accessed from different schemas in the same database. However, there is no core reason that CockroachDB cannot support this. In fact, we might need to support references of types across databases to be in line with other cross database references that we currently support. The topic of cross database references has come up in discussion about user defined schemas as well. The direction that we take in allowing cross database references vs allowing only cross schema references will follow what has been decided in that context.

Table and type names exist within the same namespace in Postgres. This means that it is possible to create a type and table of the same name within the same schema. Additionally, tables in Postgres are types themselves as a record type where each field is typed like the tables columns. Therefore, we will store type namespace entries along with table namespace entries in the system.namespace table. This allows namespace conflicts between types and tables to be properly detected, as well as allowing us to reuse a large amount of name resolution logic that exists for table name lookup. This strategy also will allow the user defined types implementation to adapt to new features like user defined schemas without extra work.

ID's and OID's

All user defined types will need a stable ID that they are uniquely addressable by from within CockroachDB, as well as an OID that can be used for Postgres compliant operations. Importantly, the OIDs cannot conflict with existing type OIDs. Our strategy is to construct OIDs from the stable ID. In particular, the OID of a user defined type is equal to ID + oidext.CockroachPredefinedOIDMax. This strategy allows us to easily map back and forth between OIDs and IDs, and avoid using multiple counters for essentially the same information. The offset ensures that no user defined types have OIDs that conflict with any preexisting OIDs. This approach will naturally extend when we allow treating tables as types.

Changing Enum Definitions

There are a few ways that enums can change over time.

  • The name can change.
  • The schema the enum is in can change.
  • A new enum member can be added to the set of values.
  • A member in the enum can be renamed.
  • The enum can be dropped.

In order to rename an enum or a value in an enum can be done with a write to the enum descriptor and then waiting for all nodes to agree on the new value. There are plans to lift operations on descriptor names off of the individual descriptors, because such operations are common to all of them. This work would involve moving the draining names off of descriptors as well. It's possible that this work would be part of or take advantage of this effort.

The case of adding a new enum element is more difficult. The key difficulty comes from ensuring that a node does not attempt to translate a physical layout that it does not know about yet into a user facing representation of the enum. If we naively just add the new enum value to the enum metadata, it is possible that another node reads a newly written enum from disk and is unsure how to decode it. Consider the following sequence of events:

  • Node 1 receives a new enum element foo to its enum descriptor and blocks on WaitForOneVersion
  • Node 2 receives the new enum descriptor update and writes a value with foo
  • Node 3 tries to read the value of foo before receiving the update to its enum descriptor.

In order to avoid these situations, we propose an extension of the strategy used for performing online schema changes. As a reminder, when we add a new schema object to a table, it moves through a series of states before becoming usable. As the object moves through these states, the types of operations that are allowed upon the object change. Between each state, we require that all nodes in the cluster agree on the new version of the schema object. For more details, refer to the online schema changes RFC. We propose a similar state progression to adding new elements to an enum type.

  1. When a new value is added to an enum, it is instead placed into a "read only" state.
  2. After all nodes agree on the "read only" state, the new enum value is promoted into the set of writeable values in the enum.

This process ensures that all nodes know about all potential enum values before they have a chance to be written. This approach has the drawback of not being able to add an enum value and then insert that value in the same transaction. This drawback is similar to our existing limitation of not being able to add a column and insert into it in the same transaction.

This enum schema change will be implemented with a new job, rather than trying to build off of the existing table schema changer. While conceptually similar to a table schema change, there is not much implementation to share. This new job will

  1. Collect all "read only" enum values and wait for one version in the cluster.
  2. Transition these values to "public", and then wait for one version in the cluster.

A rollback of this job can just remove the "read-only" values. Additionally, enums don't really need a concept of mutations like tables. The members of an enum in the enum's TypeDescriptor can be tagged with whether the member is "read only" or public.

In Postgres, if an enum is dropped without CASCADE, the operation will not succeed if there are any tables that use the enum. If an enum is dropped with CASCADE, all dependent columns are dropped as well. If the database that an enum is created within is dropped, then the enum is dropped as well. In order to maintain this information, the descriptors that represent an enum need to hold back-references to the tables and columns that use them. We expect the descriptor leasing system being developed to manage invalidation of cached enums when enums are destroyed in these cases.

Physical Layout

At first, it may seem that a valid implementation of enum values is to map each to an integer, and then store these integers on disk. This implementation seems like it would supply all the ordering guarantees needed of enums. However, Postgres allows for adding new enums and specifying the order of the newly created enum with respect to an existing value of the enum. This looks like:

sql
CREATE TYPE t ENUM AS ('v1', 'v2');
ALTER TYPE t ADD VALUE 'v1.5' AFTER 'v1'

This means add the value v1.5 to the enum t and order it after the value v1. Using just integers as the backing value for enums would not allow us to handle this sort of case. Postgres implements this feature on enums by storing a sorting order for enums as a float. When a new value is added like this, Postgres takes the sort orders of the enums that the new enum is being inserted in between, and creates a float that bisects the range between the two orders. Concretely, if v1 had a sort order of 1.5 and v2 had a sort order of 2.0, then v1.5 would be inserted with a sort order of 1.75. However, once the floating point precision limit has been reached, Postgres rewrites all sort orders to integral values. Postgres can do this because it doesn't require a stable disk encoding for enums. In our case, we need to have a stable encoding to store data on disk if an enum is used in an index, and cannot afford to rewrite all tables using an enum if the enum sort order has changed.

We propose a different strategy that is related to this idea of bisecting ranges, but doesn't suffer from problems due to floating point arithmetic precision. The general idea is to use byte arrays to hold the sort order of our enums, and reserve some bytes in the arrays to create the ordering that we need. In particular we reserve the minimum byte (0) and have a maximum allowed byte. In practice this will be 255. An example of the encoding scheme is below.

Assume we started with 3 elements (a, b, c), and let the maximum byte value be 3. The sort order byte arrays for each element would be:

a 1/
b 2/
c 3/

To add an element after b we can create a new key that sits in the middle of the range between b and c.

a 1/
b 2/
d 2/2/
c 3/

Now lets add more values before d. The first one is easy:

a 1/
b 2/
e 2/1/
d 2/2/
c 3/

The tricky case is adding a value before e. Because we reserved the minimum byte, we can append it and then bisect the range again.

a 1/
b 2/
f 2/0/2
e 2/1/
d 2/2/
c 3/

This strategy can be extended indefinitely as long as this pattern is followed to reserve the minimum byte. A prototype of the exact algorithm is included as part of the RFC PR.

This sort order byte array will be the physical layout and identifier of the enum. We expect that for small enums only a byte or two will be used to hold all the values, and that our compression strategies at the storage layer will compress this data well.

Since the common case of adding members to an enum is to add a member at the beginning or end of the set of values, we can adjust the algorithm slightly to better handle this case. When generating a new key byte where one of the endpoints is the min or max element, the algorithm can add or subtract a small constant from the existing key rather than bisecting the range. This allows for adding many more elements to the beginning or end of the range without increasing the number of bytes used to store the enum. The algorithm can be found implemented in this PR.

Parsing

Currently, the CockroachDB grammar is not equipped to handle type names that are qualified due to changes made in the past that separated parsing of object and type identifiers. Some of these changes will have to be reverted/adapted in order to allow for types to have qualifications again. The work to allow the parser to recognize qualified names has been done in this PR.

Type System Changes

The type system of CockroachDB currently makes an assumption that anywhere a type is present in an AST, that type is statically known. In code, this means that every AST object that holds a type (like a CastExpr or ColumnDef) holds a *types.T, which is constructed at parse time. As part of implementing user defined types, the type system must be taught that all types are no longer statically known. The general idea is to change the types in AST nodes to a new interface representing an unresolved type reference. These type references can then be resolved into *types.T through type resolution. Additionally, we must enforce that types are only attempted to be accessed after type checking, when all type references have been resolved. A prototype of this approach can be found in this PR.

After the process of type resolution, enums need a types.T for interaction with other components of the system. We will introduce a new family for enums, and the types.T for an enums will contain the stable ID for the TypeDescriptor that backs the type. The types.T will also contain extra fields for an enum like the mapping of names to values. Importantly, these extra fields will not be serialized as part of the proto. Instead, when a type is resolved, the returned *types.T will be hydrated to populate these fields.

A potential option was to avoid using a TypeDescriptor and instead just extend the types.T proto to contain necessary fields for user defined types. However, this is not feasible because the types.T proto's are stored on disk in various descriptors. It is too expensive to update all descriptors that contain a type every time the type is altered.

A new Datum DEnum will be introduced to represent values of the enums at runtime. A DEnum will store the physical representation of the enum as well as the hydrated *types.T of its type. The extra fields in the *types.T that hold information about enum values will be used for datum operations without the need to thread ID resolution capabilities to evaluation of operations on datums.

When a user-defined type is created in Postgres, Postgres will automatically create an alias for an array of the new type. For example, if a user creates a type days, the system would also create the type _days as an alias for days[]. This type tracks changes made to the referenced type as it moves through schemas and is dropped.

Semantic Analysis Changes

The optimizer will need to be taught about the check constraint implied by a column being of an enum type. Additionally, it will need to be taught how to convert enum values from their input string representation into their Datum physical representation.

The Catalog that is used by the optimizer will need to be extended to support resolution of types. The way that the catalog represents user defined types is important for invalidation of cached plans. If a type is updated, all plans containing data sources using the type need to be invalidated.

DistSQL

The gateway node that plans a SQL query has access to all resolved type information for the query. Remote nodes that different parts of the query are planned on need access this information in order to correctly execute the query. In particular, these nodes need to hydrate their *types.T containers with metadata and they need to parse and type check serialized expressions. The hydration of *types.T objects can be done at operator initialization. The trickier problem is type checking serialized expressions -- we don't want to pay the cost of name resolution again. Our strategy is to serialize user defined type references with their OIDs similar to how column references are serialized. All explicit references to user defined types (i.e. in casts or user defined type value literals) will be serialized like @<OID>. The expression initialization process will resolve these OID references to the correct TypeDescriptor. To actually resolve these references, we access the set of leased descriptors through a descs.Collection that is initialized for each flow.

Alternatives

Namespacing and Metadata Storage

During discussion of the RFC, some alternatives were debated. In particular, the ideas of using a separate namespace table for types and/or a separate descriptor table for metadata storage. The benefit of a separate namespace table is that it has the potential of making future work in allowing tables to be interpreted as types more straightforward. However, using a separate namespace table complicates existing name resolution and conflict detection strategies. A separate descriptor table allows for scans over all tables or types to not have to touch descriptors of different types, which is a performance improvement for catalog table operations. However, this problem is somewhat orthogonal to this work, and would be better solved by building some sort of indexing structure on the system.descriptor table. Using the existing namespace table allows most of the existing name resolution code to be used directly, and using the same descriptor table allows for leasing primitives to be built on only one system table.

Overall Alternative

One alternative approach to this physical layout was to store just an enum ID on disk, and store ordering and representation information in a separate lookup table. When operations like on enums would involve joining or rendering the enums, a join would be produced against this reference table. This allows for easy changing of enum data, but results in a variety of complexity during planning.

Unresolved questions

It is unclear what interactions will arise between this work and the planned/ongoing work with user defined schemas.