docs/RFCS/20171102_sql_sequences.md

  • Feature Name: Sequences
  • Status: in-progress
  • Start Date: 2017-10-09
  • Authors: Pete Vilter
  • RFC PR: #19196
  • Cockroach Issue: #5811

Summary

This RFC proposes a design for implementing standard SQL sequences in Cockroach. Sequences allow users to have auto-incrementing integers in their tables (usually used as a primary key), which some in the community prefer to our serial type. The syntax proposed matches the Postgres syntax, which is similar to what Oracle and SQL Server support and to the SQL standard. Sequence metadata is stored on a table descriptor, and the sequence value is stored in a special KV pair which is updated atomically but outside of a SQL transaction.

Motivation

Many customers could probably make do with using our serial type instead of sequences for keys, but have existing applications which use sequences and/or prefer incrementing integers to the random-looking ones generated by our existing unique_rowid() function.

Guide-level explanation

A sequence is a named database object which exists in a schema alongside tables, views and other sequences, and is used to hold a BIGINT value which can be read and incremented atomically, usually for the purpose of giving out unique ids as rows are inserted into a table. In addition to their values, sequences have settings such as start value, amount to increment by, and max value. (See "Sequence settings details" below)

Example:

```sql
CREATE SEQUENCE blog_posts_id_seq;
CREATE TABLE blog_posts (
  id INT PRIMARY KEY DEFAULT nextval('blog_posts_id_seq'),
  -- ^ in Postgres, `id SERIAL` expands to the above
  --   except with int4 instead of our int, which is 8 bytes
  title text,
  body text
);
INSERT INTO blog_posts (title, body) VALUES ('Sequences', 'Whooo') RETURNING (id);
-- => id: 1
INSERT INTO blog_posts (title, body) VALUES ('They''re awesome', 'Yep') RETURNING (id);
-- => id: 2
-- etc.
```

A workaround for the lack of sequences is to use a row in a table to store the sequence value:

```sql
CREATE TABLE sequences (
  name text PRIMARY KEY,
  value int
);
INSERT INTO sequences VALUES ('blog_posts_id_seq', 0);
-- when you want a new value:
INSERT INTO sequences (name) VALUES ('blog_posts_id_seq')
ON CONFLICT (name) DO UPDATE SET value = sequences.value + 1
RETURNING value AS nextval;
```

This works, but has a significant drawback (besides incompatibility with applications which are set up to use builtin sequences): the update to the row in the sequences table locks that row until its transaction has committed. Thus, any transactions which create records in the table for which the sequence is being used contend on that one row in the sequence table.

The sequence implementations in Postgres, MySQL, Oracle, and SQL Server avoid this by taking place outside of a SQL transaction: updates to the sequence are atomic, but are never rolled back. As soon as a new sequence value is given out, the next transaction can access the sequence, regardless of whether the first transaction aborts or commits. This may create gaps in the table which is using the sequence, but the performance win seems worth it. (If not, customers can use the table strategy.) We propose to follow in their footsteps and implement this atomic but non-transactional behavior.
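
The atomic-but-non-transactional behavior can be illustrated with a toy model (not CockroachDB code): an aborted transaction does not return its sequence value, which is exactly what produces gaps in the table.

```python
# Toy model of the non-transactional sequence behavior described above:
# the increment happens outside the transaction and is never rolled back,
# so an aborted transaction leaves a gap.

class Sequence:
    """A counter incremented atomically, outside any SQL transaction."""
    def __init__(self, start=1, increment=1):
        self.value = start - increment  # so the first nextval() returns `start`
        self.increment = increment

    def nextval(self):
        self.value += self.increment  # atomic; never rolled back
        return self.value

seq = Sequence()
committed_ids = []

# Transaction 1 takes id 1, then aborts. The increment is NOT undone,
# and transaction 2 does not have to wait for transaction 1 to finish.
txn1_id = seq.nextval()
# ... txn1 aborts; its row is never written ...

# Transaction 2 takes the next value immediately.
txn2_id = seq.nextval()
committed_ids.append(txn2_id)

assert (txn1_id, txn2_id) == (1, 2)
assert committed_ids == [2]  # gap at id 1
```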

We propose leaving Cockroach's SERIAL type the way it is (using the unique_rowid function), since it offers better performance due to not requiring IO to the KV layer, and should work for most applications. Future work could increase the performance of sequences by implementing the CACHE setting, which pre-allocates batches of values to be handed out from memory; we could then switch SERIAL over to be backed by sequences. This work could be done later on top of the work proposed in this RFC, if we deem it necessary. (See "Rationale and Alternatives")

Sequences are part of the SQL standard, and are implemented with nearly the same syntax in Postgres, Oracle, and Microsoft SQL Server. The CREATE SEQUENCE statement is nearly identical between them, but the syntax for getting the next value differs. This proposal mirrors the Postgres approach closely. There are a few aspects of the Postgres implementation which are awkward or difficult to implement but may be worth the effort for compatibility; they're enumerated in the "Design / Scope Issues" section under "Unresolved Questions".

Some users may not care that sequences exist and happily continue using serial (which would be faster since it operates locally without contention or coordination over the network), but their existence may make migration to Cockroach easier for users who are used to them.

The feature is relatively low-impact for Cockroach internals, since it adds a few new SQL statements and functions, adds sequence information to table descriptors, and uses the existing KV store to store the sequence value.

Reference-level explanation

To support this feature, I propose the following changes:

SQL interface additions

  • Introduce new DDL statements:
    • CREATE SEQUENCE <sequence name> <sequence settings> (see section "Sequence settings details")
    • ALTER SEQUENCE <sequence name> <sequence settings>
    • DROP SEQUENCE <sequence name>
  • Introduce functions which access, increment, and set the sequence value: (Postgres Docs):
    • nextval(sequence_name)
    • currval(sequence_name)
    • lastval()
    • setval(sequence_name, value, is_called)
    • Note: Oracle and SQL Server implement these differently.
      • In Oracle, you get the next value of a sequence with SELECT my_sequence.nextval FROM dual (dual is a builtin one-row table in Oracle used when a table isn't needed).
      • In SQL Server, you use NEXT VALUE FOR my_sequence (This is what's specified in the SQL standard).
  • Make information_schema.sequences and pg_catalog.pg_class show the sequences (information_schema.sequences currently exists but is empty)
  • Record dependencies by columns on sequences (in the column and sequence descriptors), such that:
    • Deletion of a sequence is not allowed if a column depends on it.
    • Dropping a column which depends on a sequence (either by dropping the individual column or dropping its table) results in the sequence being deleted if no other columns depend on it.
    • DROP SEQUENCE <sequence name> CASCADE removes the DEFAULT expression from any columns using the sequence. (The default on DROP SEQUENCE is RESTRICT, which errors if there are any dependencies on the sequence.)
    • Note: These dependencies will have to be recorded on CREATE TABLE, ALTER TABLE, ALTER COLUMN, and ADD COLUMN. We already put dependency tracking information for views on TableDescriptors, but these dependencies will be recorded on the column descriptors of the columns which use sequences in their DEFAULT expressions.
    • Note: In Postgres, creating a table with a serial column automatically creates a sequence and records the dependency. Since our serial type is not based on sequences, our users will have to manually create sequences, with CREATE SEQUENCE, unless we add new syntax.
  • Add checks to disallow schema and data changes to sequences via INSERT, DELETE, UPDATE, ALTER TABLE, etc. They should be like views or virtual tables: their contents can only be affected by other means.
  • (Possibly, see open questions section): Add a new planNode which allows SELECT * FROM my_sequence to work. This allows users to introspect the sequence value and settings, and may be relied upon by PG tools.
  • Make cockroach dump and cockroach load work with sequences. I.e. dump dumps the current sequence value, and load sets the value so that nextval can pick up where it left off. Adding CREATE SEQUENCE my_seq START WITH <current value> to the dump would achieve this.
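
The session-level semantics of the four functions above (per the Postgres docs cited) can be sketched as follows. This models behavior only, not the proposed implementation, and simplifies in one place noted in a comment: here currval reads the sequence's latest value rather than tracking per-session nextval history as Postgres does.

```python
# Behavioral sketch of nextval/currval/lastval/setval, following the
# Postgres documentation referenced above. Simplified model, not the
# proposed CockroachDB implementation.

class SequenceSession:
    def __init__(self):
        self.values = {}       # sequence name -> (value, is_called)
        self.last_used = None  # which sequence lastval() refers to

    def create(self, name, start=1):
        self.values[name] = (start, False)

    def nextval(self, name):
        value, is_called = self.values[name]
        if is_called:
            value += 1         # assumes INCREMENT BY 1 for simplicity
        self.values[name] = (value, True)
        self.last_used = name
        return value

    def currval(self, name):
        # Simplification: reads the stored value; Postgres tracks the
        # last value returned by nextval in this session.
        value, is_called = self.values[name]
        if not is_called:
            raise Exception(f"currval of sequence {name!r} is not yet defined")
        return value

    def lastval(self):
        if self.last_used is None:
            raise Exception("lastval is not yet defined in this session")
        return self.currval(self.last_used)

    def setval(self, name, value, is_called=True):
        # With is_called=False, the next nextval() returns `value` itself
        # rather than value + increment.
        self.values[name] = (value, is_called)
        return value

s = SequenceSession()
s.create('blog_posts_id_seq')
assert s.nextval('blog_posts_id_seq') == 1
assert s.nextval('blog_posts_id_seq') == 2
assert s.currval('blog_posts_id_seq') == 2
assert s.lastval() == 2
s.setval('blog_posts_id_seq', 10, is_called=False)
assert s.nextval('blog_posts_id_seq') == 10
```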

Internal representation and operations

Sequence metadata

I propose that sequence metadata (name and settings) be represented internally as a type of TableDescriptor. Just as a TableDescriptor has fields which are populated only for views, it will have a field (sequence_settings or similar) which is only populated on table descriptors which describe sequences. The sequence_settings field on the table descriptor will include sequence settings such as increment, minvalue, maxvalue, start, and cycle. INSERTs, UPDATEs, and schema changes to the sequence will be disallowed based on the presence of the sequence_settings field.
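
A minimal sketch of this representation, with hypothetical field names (not the actual proto fields): the presence of the optional sequence-settings field is what marks a descriptor as a sequence and gates DML.

```python
# Hypothetical sketch of the chosen representation: a table descriptor
# with an optional sequence_settings field, set only for sequences, used
# to reject DML and schema changes. Field names are illustrative.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SequenceSettings:
    increment: int = 1
    minvalue: int = 1
    maxvalue: int = 2**63 - 1
    start: int = 1
    cycle: bool = False

@dataclass
class TableDescriptor:
    id: int
    name: str
    sequence_settings: Optional[SequenceSettings] = None  # only on sequences

    def is_sequence(self):
        return self.sequence_settings is not None

def check_dml_allowed(desc):
    # INSERT/UPDATE/DELETE and schema changes are disallowed on sequences,
    # based on the presence of the settings field.
    if desc.is_sequence():
        raise ValueError(f"cannot run DML on sequence {desc.name!r}")

table = TableDescriptor(id=51, name='blog_posts')
seq = TableDescriptor(id=52, name='blog_posts_id_seq',
                      sequence_settings=SequenceSettings())

check_dml_allowed(table)  # tables are fine
try:
    check_dml_allowed(seq)
    raised = False
except ValueError:
    raised = True
assert raised
```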

CREATE SEQUENCE will use the same machinery as table and view creation, including:

  • Allocating a descriptor ID by incrementing the descriptor ID allocation key.
  • Adding an entry to the system.namespace table or erroring if the name is already taken, since sequences exist in the same namespace as tables.
  • Writing table descriptor to the system.descriptor table.
  • Creating a new range.

Additionally, ALTER SEQUENCE will use machinery designed for table schema changes: all nodes will be notified of the change and read the new version of the descriptor using the lease-based descriptor caching system.

See the "Sequence metadata" section under "Rationale and Alternatives" for a discussion of alternate approaches.

Sequence values

Each sequence value will be stored in its own range, with a key in this format:

/Table/<DescriptorID>/1/0/1

Where the numbers are dummy values for the index ID, primary key value, and column family ID, respectively. This mimics the structure of a normal table key, so backup/restore can work without modification.

Reads and writes to the sequence value (via the functions nextval, currval, etc.) will be implemented by direct calls to the KV layer's Get, Inc, and Put operations.

Storing sequence values in their own ranges means that inserts to tables which use sequences will always touch two ranges. However, since the sequence update takes place outside of the SQL transaction, this should not trigger the 2PC commit protocol. Savvy users might question this; we should note it in our documentation.
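
The key layout and the increment-based update can be modeled as follows (the dict stands in for the KV layer; the descriptor ID is made up):

```python
# Illustrative model of the sequence-value key and the atomic update:
# the key mimics a normal table key (/Table/<id>/<index>/<pk>/<family>)
# with dummy values, and nextval is a single atomic increment at that key.

def sequence_value_key(descriptor_id):
    # index ID 1, primary key value 0, column family ID 1 (dummy values,
    # as described above)
    return f"/Table/{descriptor_id}/1/0/1"

class KV:
    """Stand-in for the KV layer; only the Inc operation is modeled."""
    def __init__(self):
        self.data = {}
    def inc(self, key, delta):
        # Atomic increment: one round trip, no Get followed by Put.
        self.data[key] = self.data.get(key, 0) + delta
        return self.data[key]

kv = KV()
key = sequence_value_key(52)   # 52 is a made-up descriptor ID
assert key == "/Table/52/1/0/1"
assert kv.inc(key, 1) == 1     # nextval
assert kv.inc(key, 1) == 2     # nextval again
```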

See the "Sequence values" section under "Rationale and Alternatives" for a discussion of alternate approaches.

Sequence settings details

The sequence functions must respect several optional settings (see Postgres docs):

  • AS <integer type>: What type the sequence value should be (default INT, 64 bits). We'll store them as 64 bits, but set the MINVALUE and MAXVALUE settings to limit the value to the specified type.
  • INCREMENT BY <increment>: how much to increment by. A negative number creates a descending sequence; a positive number creates an ascending sequence.
  • MINVALUE <minvalue> | NO MINVALUE (defaults apply if clause not specified or you say NO MINVALUE)
    • default for ascending: 1
    • default for descending: MIN_INT
  • MAXVALUE <maxvalue> | NO MAXVALUE
    • default for ascending: MAX_INT
    • default for descending: -1
  • START WITH <start>: default 1 for ascending; -1 for descending.
  • CACHE <cache>: how many values to allocate in memory, for faster access. I propose recognizing this syntax to avoid broken clients but implementing the actual functionality later, if customers are using sequences and demanding faster performance. See "Rationale and Alternatives".
  • CYCLE | NO CYCLE: whether or not to wrap around when the sequence value hits the max or min.
  • OWNED BY <table_name.column_name> | OWNED BY NONE: Records a dependency on this sequence from the column. We would also create this association if we see a call to nextval with a sequence name in the DEFAULT expression of a column.
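
How the direction-dependent defaults above resolve can be sketched as follows, assuming 64-bit storage (illustrative, not actual CockroachDB code):

```python
# Sketch of the MINVALUE/MAXVALUE/START defaults listed above, which
# depend on whether the sequence is ascending or descending.

MIN_INT, MAX_INT = -(2**63), 2**63 - 1

def resolve_defaults(increment=1, minvalue=None, maxvalue=None, start=None):
    ascending = increment > 0
    if minvalue is None:
        minvalue = 1 if ascending else MIN_INT
    if maxvalue is None:
        maxvalue = MAX_INT if ascending else -1
    if start is None:
        start = 1 if ascending else -1
    return minvalue, maxvalue, start

# CREATE SEQUENCE s;  -- ascending defaults
assert resolve_defaults() == (1, MAX_INT, 1)
# CREATE SEQUENCE s INCREMENT BY -1;  -- descending defaults
assert resolve_defaults(increment=-1) == (MIN_INT, -1, -1)
```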

Corner cases

  • Reducing max value setting as sequence value concurrently goes above it:
    • Scenario:
      • A sequence's max value setting is 10; current value is 5
      • User runs ALTER SEQUENCE my_sequence MAXVALUE 5. my_sequence is now at its maximum; any calls to nextval should error out or wrap around.
      • User runs nextval('my_sequence') on a different node
    • Problem: The node running nextval may not yet have received word of the schema change, since schema changes are scheduled and gossiped. Thus, a value higher than the sequence's maximum could be given out if nextval and the schema change are running concurrently.
    • Evaluation: This is not a big problem, since most users will leave the max value at its default (2^63 - 1, the maximum int64 value), and either error out when they hit it, or wrap around (according to their CYCLE setting).
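
The race above can be made concrete with a toy model: a node applying a stale cached MAXVALUE hands out a value past the new maximum.

```python
# Toy model of the corner case described above. A node that has not yet
# seen the ALTER SEQUENCE applies the old MAXVALUE and overshoots.
# Purely illustrative; not CockroachDB code.

def nextval(value, increment, maxvalue, cycle, minvalue=1):
    nxt = value + increment
    if nxt > maxvalue:
        if cycle:
            return minvalue
        raise OverflowError("nextval: reached maximum value of sequence")
    return nxt

current = 5
# Node A has the new setting (MAXVALUE 5): nextval errors, as it should.
try:
    nextval(current, 1, maxvalue=5, cycle=False)
    errored = False
except OverflowError:
    errored = True
assert errored

# Node B still has the stale setting (MAXVALUE 10): it hands out 6,
# which exceeds the sequence's new maximum.
assert nextval(current, 1, maxvalue=10, cycle=False) == 6
```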

Drawbacks

Users might not understand that sequences create more contention than the serial type, and be frustrated that their app is slow. Maybe we shouldn't give them this option, and direct everyone toward serial.

  • Mitigation: serial is still there to use. The documentation for sequences should just warn that serial is faster.
  • Mitigation: Rails's scaffold generator (don't know about other ORMs) creates a primary key column of type bigserial by default.

Rationale and Alternatives

SQL Syntax

The CREATE SEQUENCE statement, other DDL statements, and sequence manipulation functions mirror Postgres. Oracle and SQL Server have different syntax for getting sequence values (my_sequence.nextval and NEXT VALUE FOR my_sequence, respectively), but these don't have clear advantages over PG's syntax. MySQL is significantly different in that you can only mark a single column as AUTO_INCREMENT, and there is no separate sequence object and thus no CREATE SEQUENCE. Again this would likely be fine, but we should match Postgres, as we do elsewhere.

CACHE setting for performance

We propose that the additional engineering effort to support this be done in a later RFC, if multiple customers are using sequences and need better performance. It would involve nodes making requests to the KV layer to increment sequence values by the specified batch size, then keeping track of what values they have to hand out on a per-session or per-node basis.
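
The batching described above can be sketched as follows (illustrative model only; the class and the stand-in KV layer are made up for this sketch):

```python
# Sketch of the deferred CACHE optimization: each node reserves a batch
# of values with a single KV increment, then hands them out from memory,
# amortizing KV round trips across many nextval calls.

class KV:
    """Stand-in for the KV layer, counting round trips."""
    def __init__(self):
        self.value = 0
        self.round_trips = 0
    def inc(self, delta):
        self.round_trips += 1
        self.value += delta
        return self.value

class CachedSequence:
    def __init__(self, kv, cache_size):
        self.kv = kv
        self.cache_size = cache_size
        self.next = 0
        self.end = 0   # exclusive upper bound of the cached batch

    def nextval(self):
        if self.next >= self.end:
            # One KV round trip reserves a whole batch of values.
            self.end = self.kv.inc(self.cache_size) + 1
            self.next = self.end - self.cache_size
        value = self.next
        self.next += 1
        return value

kv = KV()
seq = CachedSequence(kv, cache_size=10)
values = [seq.nextval() for _ in range(10)]
assert values == list(range(1, 11))
assert kv.round_trips == 1   # ten values served with a single KV increment
```

A drawback this makes visible: values cached on a node that restarts are lost, producing larger gaps, which is part of why this is deferred to a later RFC.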

Sequence metadata

Requirements for sequence metadata storage are as follows:

  1. Must be available in-memory on all nodes, so that getting a new sequence value doesn't require two KV roundtrips (one to look up how much to increment by and another to do the increment).
  2. The in-memory copy should be kept up to date on each node when the sequence is altered via ALTER SEQUENCE, just as tables and views are cached and kept up to date.
  3. Must allow sequences to live in the same namespace as tables and views.

Options are:

  • New SequenceDescriptor type, added to the Descriptor proto
    • Advantages: Cleaner; i.e. doesn't add something that's not really a table to the TableDescriptor.
    • Disadvantages:
      • Hard to see how to keep the in-memory copy up to date. Refactoring the leasing system code to be generic for multiple types of descriptors which satisfy a common interface necessitates a large interface, and many changes to use it. After trying this (branch here), the refactored code didn't seem much cleaner.
  • Create separate proto messages for information specific to views, tables, and sequences, and put them on TableDescriptor as exclusive options in a oneof. (and rename TableDescriptor to something more generic like DatabaseObjectDescriptor)
    • Advantages: Seems very DRY and unambiguous: the proto would describe a named database object that could either be a table, view, or sequence (or more things in the future).
    • Disadvantages:
      • Requires migration of table descriptor protos.
      • For functions which should only ever be operating on table descriptors, how can we provide a type system guarantee that the DatabaseObjectDescriptor they're receiving is a table, not a view or sequence?
  • Optional field on TableDescriptor proto (chosen)
    • Advantages:
      • Satisfies all requirements with minimal additional code
    • Disadvantages:
      • Clutters TableDescriptor
      • Requires checks to be added so that sequences cannot be modified via DML. Checks like these already exist for VIEWs, and shouldn't be hard to add on to.

Sequence values

  • One range per sequence (chosen)
    • Advantages:
      • Spreads load out; no performance bottleneck or single point of failure for all sequences
      • Allows per-sequence zone config, allowing operators to keep the sequence value local to the table that's using it
      • Already gets created when a TableDescriptor is created
    • Disadvantages:
      • A lot of overhead to store a single integer
  • Alongside tables
    • Advantages: Would keep the KV operations for a sequence-using INSERT local to one range (until the table splits)
    • Disadvantages: Sequences can be used by multiple tables; in this situation we wouldn't know where to put it. We could allow users to tell us using a syntax extension to CREATE|ALTER SEQUENCE, but we should really be directing them toward unique_rowid.
  • All sequences in one range
    • Advantages:
      • Avoids overhead of one range per table.
      • Allows a sequence to be used from multiple tables.
    • Disadvantages:
      • Without load-based or manual splitting, this range would be a single point of failure and/or a performance bottleneck. It's unclear how operators would manually split the range. (zone configs? SQL?)
  • A system.sequences table
    • Advantages:
      • All those of the "all sequences in one range" approach
      • Allows easy introspection
    • Disadvantages:
      • To use the Inc KV operation (better than Get and then Put because it requires only one round trip), we'd have to teach Inc about SQL's integer encoding, which seems like a separation-of-concerns violation. The introspectability could be achieved by creating a virtual table. (The information_schema.sequences table, which we'll be implementing, gets you some of this: all the metadata, just not the value.)

Impact of not doing this

Users who prefer sequences or who have applications which use them will have an obstacle in moving to Cockroach.

Possible future work

  • Change our SERIAL type to be backed by sequences: I don't think we can do this until we make it faster by implementing CACHE (see below).

    We would also have to decide whether to make SERIAL int32. While more compatible with Postgres, this would be backwards-incompatible with Cockroach, since our current SERIAL is int64.

    Making SERIAL always sequence backed would also allow us to automatically create a sequence when a SERIAL is used in a CREATE TABLE, as Postgres does. With the implementation in this RFC, to use sequences users will have to explicitly type CREATE SEQUENCE and set their column to DEFAULT nextval('my_sequence').

  • Implement the CACHE setting: (pre-allocation on each node of batches of values, which can then be handed out quickly from memory) This would probably make sequences fast enough to be our default for the SERIAL type, but I think we should build the simple version, benchmark, and do CACHE in a later RFC/PR.

  • Expose the sequence values as a system.sequences table, for easy introspectability.

  • Make SELECT * FROM my_sequence work: In Postgres, it returns a single row containing the sequence value and settings:

    | sequence_name | last_value | start_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | foo | 2 | 1 | 1 | 9223372036854775807 | 1 | 1 | 31 | f | t |

    This is easy to do, and may be helpful for compatibility. It can be saved for the last PR of the change, but should probably be done.

Unresolved questions

Implementation issues

Currently, some functions are nondeterministic (e.g. random), but none write to the KV store. The EvalContext that function implementations receive only lets them run SQL (via its EvalPlanner member). This will have to be refactored to give the sequence functions the capabilities they need.