docs/RFCS/20171102_sql_sequences.md
This RFC proposes a design for implementing standard SQL sequences in Cockroach.
Sequences allow users to have auto-incrementing integers in their tables
(usually used as a primary key), which some in the community prefer to our
serial type. The syntax proposed matches the Postgres syntax, which is
similar to what Oracle and SQL Server support and to the SQL standard.
Sequence metadata is stored on a table descriptor, and the sequence value
is stored in a special KV pair which is updated atomically but outside of
a SQL transaction.
Many customers could probably make do with using our serial type instead of
sequences for keys, but have existing applications which use sequences and/or
prefer incrementing integers to the random-looking ones generated by our
existing unique_rowid() function.
A sequence is a named database object which exists in a schema alongside
tables, views and other sequences, and is used to hold a BIGINT value which
can be read and incremented atomically, usually for the purpose of giving out
unique ids as rows are inserted into a table. In addition to their values,
sequences have settings such as start value, amount to increment by, and max
value. (See "Sequence settings details" below)
Example:

```sql
CREATE SEQUENCE blog_posts_id_seq;

CREATE TABLE blog_posts (
  id INT PRIMARY KEY DEFAULT nextval('blog_posts_id_seq'),
  -- ^ in Postgres, `id SERIAL` expands to the above,
  -- except with int4 instead of our int, which is 8 bytes
  title text,
  body text
);

INSERT INTO blog_posts (title, body) VALUES ('Sequences', 'Whooo') RETURNING (id);
-- => id: 1
INSERT INTO blog_posts (title, body) VALUES ('They''re awesome', 'Yep') RETURNING (id);
-- => id: 2
-- etc
```
A workaround for the lack of sequences is to use a row in a table to store the sequence value:

```sql
CREATE TABLE sequences (
  name text PRIMARY KEY,
  value int
);

INSERT INTO sequences VALUES ('blog_posts_id_seq', 0);

-- when you want a new value:
INSERT INTO sequences (name) VALUES ('blog_posts_id_seq')
  ON CONFLICT (name) DO UPDATE SET value = sequences.value + 1
  RETURNING value AS nextval;
```
This works, but has a significant drawback (besides incompatibility with applications which are set up to use builtin sequences): the update to the row in the sequences table locks that row until its transaction has committed. Thus, any transactions which create records in the table for which the sequence is being used contend on that one row in the sequences table.
The sequence implementations in Postgres, MySQL, Oracle, and SQL Server avoid this by taking place outside of a SQL transaction: updates to the sequence are atomic, but are never rolled back. As soon as a new sequence value is given out, the next transaction can access the sequence, regardless of whether the first transaction aborts or commits. This may create gaps in the table which is using the sequence, but the performance win seems worth it. (If not, customers can use the table strategy.) We propose to follow in their footsteps and implement this atomic but non-transactional behavior.
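Under the Postgres-compatible semantics proposed here, this atomic-but-non-transactional behavior would look like the following sketch (values assume the defaults of START WITH 1 and INCREMENT BY 1):

```sql
CREATE SEQUENCE s;

BEGIN;
SELECT nextval('s');  -- returns 1
ROLLBACK;
-- The rollback does not restore the sequence value.

SELECT nextval('s');  -- returns 2: the aborted transaction left a gap
SELECT currval('s');  -- returns 2: the last value this session got from nextval

-- setval repositions the sequence; with is_called = false,
-- the next nextval returns the given value itself rather than the one after it.
SELECT setval('s', 10, false);
SELECT nextval('s');  -- returns 10
```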
We propose leaving Cockroach's SERIAL type the way it is (using the
unique_rowid function), since it offers better performance due to not
requiring IO to the KV layer, and should work for most applications.
Future work could increase the performance of sequences by implementing the
CACHE setting, which pre-allocates batches of values to be handed out from
memory; we could then switch SERIAL over to be backed by sequences. This work
could be done later on top of the work proposed in this RFC, if we deem it
necessary. (See "Rationale and Alternatives")
Sequences are part of the SQL standard, and are implemented with nearly the same
syntax in Postgres, Oracle, and Microsoft SQL Server. The CREATE SEQUENCE
statement is nearly identical between them, but the syntax for getting the next
value is different. This proposal mirrors the Postgres approach closely. There
are a few aspects of the Postgres implementation which are awkward or difficult
to implement but may be worth the effort for compatibility; they're enumerated
in the "Design / Scope Issues" section under "Unresolved Questions".
Some users may not care that sequences exist and happily continue using serial
(which would be faster since it operates locally without contention or
coordination over the network), but their existence may make migration to
Cockroach easier for users who are used to them.
The feature is relatively low-impact for Cockroach internals, since it adds a few new SQL statements and functions, adds sequence information to table descriptors, and uses the existing KV store to store the sequence value.
To support this feature, I propose the following changes:
New statements:

- `CREATE SEQUENCE <sequence name> <sequence settings>` (see section "Sequence settings details")
- `ALTER SEQUENCE <sequence name> <sequence settings>`
- `DROP SEQUENCE <sequence name>`

New builtin functions:

- `nextval(sequence_name)`
- `currval(sequence_name)`
- `lastval()`
- `setval(sequence_name, value, is_called)`

Syntaxes from other databases which we do not propose to support:

- `SELECT my_sequence.next_val FROM dual` (`dual` is a builtin one-row table in Oracle used when a table isn't needed).
- `NEXT VALUE FOR my_sequence` (This is what's specified in the SQL standard).

Other behavior:

- `information_schema.sequences` and `pg_catalog.pg_class` show the sequences (`information_schema.sequences` currently exists but is empty).
- `DROP SEQUENCE <sequence name> CASCADE` removes the DEFAULT expression from any columns using the sequence. (The default on `DROP SEQUENCE` is `RESTRICT`, which errors if there are any dependencies on the sequence.)
- Dependencies on sequences used in DEFAULT expressions are recorded during `CREATE TABLE`, `ALTER TABLE`, `ALTER COLUMN`, and `ADD COLUMN`. We already put dependency tracking information for views on TableDescriptors, but these dependencies will be recorded on the column descriptors of the columns which use sequences in their DEFAULT expressions.
- In Postgres, a `serial` column automatically creates a sequence and records the dependency. Since our `serial` type is not based on sequences, our users will have to manually create sequences with `CREATE SEQUENCE`, unless we add new syntax.
- Sequences cannot be modified directly with `INSERT`, `DELETE`, `UPDATE`, `ALTER TABLE`, etc. They should be like views or virtual tables: their contents can only be affected by other means.
- A `planNode` which allows `SELECT * FROM my_sequence` to work. This allows users to introspect the sequence value and settings, and may be relied upon by PG tools.
- `cockroach dump` and `cockroach load` work with sequences. I.e. dump dumps the current sequence value, and load sets the value so that `nextval` can pick up where it left off. Adding `CREATE SEQUENCE my_seq START WITH <current value>` to the dump would achieve this.

I propose that sequence metadata (name and settings) be represented internally
as a type of TableDescriptor. Just as a TableDescriptor has fields which are
populated only for views, it will have a field (sequence_settings or similar)
which is only populated on table descriptors which describe sequences. The
sequence_settings field on the table descriptor will include sequence settings
such as increment, minvalue, maxvalue, start, and cycle. INSERTs,
UPDATEs, and schema changes to the sequence will be disallowed based on the
presence of the sequence_settings field.
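The dependency tracking and DROP SEQUENCE behavior described above can be sketched as follows (the table and sequence names are illustrative):

```sql
CREATE SEQUENCE ids;
CREATE TABLE users (
  id INT PRIMARY KEY DEFAULT nextval('ids'),  -- records a dependency on ids
  name TEXT
);

DROP SEQUENCE ids;          -- errors: RESTRICT is the default, and users.id depends on ids
DROP SEQUENCE ids CASCADE;  -- succeeds, removing the DEFAULT expression from users.id
```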
CREATE SEQUENCE will use the same machinery as table and view creation,
including:
- Adding an entry to the `system.namespace` table, or erroring if the name is already taken, since sequences exist in the same namespace as tables.
- Writing the descriptor to the `system.descriptor` table.

Additionally, ALTER SEQUENCE will use machinery designed for table schema
changes: all nodes will be notified of the change and read the new version of
the descriptor using the lease-based descriptor caching system.
See the "Sequence metadata" section under "Rationale and Alternatives" for a discussion of alternate approaches.
Each sequence value will be stored in its own range, with a key in this format:
```
/Table/<DescriptorID>/1/0/1
```

Where the numbers are dummy values for the index ID, primary key value, and column family ID, respectively. This mimics the structure of a normal table key, so backup/restore can work without modification.
Reads and writes to the sequence value (via the functions nextval, currval,
etc.), will be implemented by direct calls to the KV layer's Get, Inc, and
Set functions.
Storing sequence values in their own ranges means that inserts to tables which use sequences will always touch two ranges. However, since the sequence update takes place outside of the SQL transaction, this should not trigger the 2PC commit protocol. Savvy users might question this; we should note it in our documentation.
See the "Sequence values" section under "Rationale and Alternatives" for a discussion of alternate approaches.
The sequence functions must respect several optional settings (see Postgres docs):

- `AS <integer type>`: What type the sequence value should be (default `INT`, 64 bits). We'll store them as 64 bits, but set the `MINVALUE` and `MAXVALUE` settings to limit the value to the specified type.
- `INCREMENT BY <increment>`: how much to increment by. A negative number creates a descending sequence; a positive number creates an ascending sequence.
- `MINVALUE <minvalue> | NO MINVALUE` (defaults apply if the clause is not specified or you say `NO MINVALUE`)
- `MAXVALUE <maxvalue> | NO MAXVALUE`
- `START WITH <start>`: default 1 for ascending; -1 for descending.
- `CACHE <cache>`: how many values to allocate in memory, for faster access. I propose recognizing this syntax to avoid broken clients but implementing the actual functionality later, if customers are using sequences and demanding faster performance. See "Rationale and Alternatives".
- `CYCLE | NO CYCLE`: whether or not to wrap around when the sequence value hits the max or min.
- `OWNED BY <table_name.column_name> | OWNED BY NONE`: Records a dependency on this sequence from the column. We would also create this association if we see a call to `nextval` with a sequence name in the DEFAULT expression of a column.

Note that there is a potential race between `nextval` and schema changes to these settings. Consider:

1. `ALTER SEQUENCE my_sequence MAXVALUE 5`. `my_sequence` is now at its maximum; any calls to `nextval` should error out, or wrap around (with the `CYCLE` setting).
2. `nextval('my_sequence')` runs on a different node. That node may not yet have received word of the schema change, since schema changes are scheduled and gossipped. Thus, a value higher than the sequence's maximum could be given out if `nextval` and the schema change are running concurrently.

Users might not understand that this type creates more contention than the
serial type, and be frustrated that their app is slow. Maybe we shouldn't give them this option, and direct everyone toward serial. On the other hand, serial is still there to use; the documentation for sequences should just warn that serial is faster. (Note that our serial is 8 bytes, i.e. effectively bigserial by default.)

The CREATE SEQUENCE statement, other DDL statements, and sequence manipulation
functions mirror Postgres. Oracle and SQL Server have different syntax for
getting sequence values (my_sequence.nextval and NEXT VALUE FOR my_sequence)
respectively, but these don't have clear advantages over PG's syntax. MySQL
is significantly different in that you can only mark a single column as
AUTO_INCREMENT, and there is no separate sequence object and thus no
CREATE SEQUENCE. Again this would likely be fine, but we should match
Postgres, as we do elsewhere.
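As an illustration of how the settings combine under the Postgres-compatible semantics proposed here (the sequence name is illustrative):

```sql
-- A descending sequence that counts 100, 95, 90, ... down to 5,
-- then wraps back to MAXVALUE because of CYCLE.
CREATE SEQUENCE countdown
  INCREMENT BY -5
  MINVALUE 5
  MAXVALUE 100
  START WITH 100
  CYCLE;

SELECT nextval('countdown');  -- 100
SELECT nextval('countdown');  -- 95
```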
CACHE setting for performance

We propose that the additional engineering effort to support this be done in a later RFC, if multiple customers are using sequences and need better performance. It would involve nodes making requests to the KV layer to increment sequence values by the specified batch size, then keeping track of what values they have to hand out on a per-session or per-node basis.
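If implemented, usage might look like the following sketch; the block values shown assume Postgres-style behavior in which each session (or node) caches its own block:

```sql
CREATE SEQUENCE order_ids CACHE 100;

-- Session A's first nextval grabs the block [1, 100] with one KV increment;
-- subsequent calls in that session are served from memory.
SELECT nextval('order_ids');  -- 1

-- Session B (possibly on another node) grabs the next block, [101, 200].
SELECT nextval('order_ids');  -- 101
```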
Requirements for sequence metadata storage are as follows:

- It must be kept up to date on all nodes on ALTER SEQUENCE, just as tables and views are cached and kept up to date.

Options are:

- A new `SequenceDescriptor` type, either added to the `Descriptor` proto alongside `TableDescriptor`, or combined with `TableDescriptor` as exclusive options in a `oneof` (and rename `TableDescriptor` to something more generic like `DatabaseObjectDescriptor`).
  - Downside: how do clients know whether the `DatabaseObjectDescriptor` they're receiving is a table, not a view or sequence?
- A field on the `TableDescriptor` proto (chosen).
  - This reuses the leasing and caching machinery that already exists for `TableDescriptor`, and works as soon as the `TableDescriptor` is created.

For storing the sequence value itself, alternatives to a standalone KV pair include:

- A key inside the table which uses the sequence, keeping an `INSERT` local to one range (until the table splits). The table to use could be specified with `CREATE|ALTER SEQUENCE`, but we should really be directing them toward `unique_rowid`.
- A row in a `system.sequences` table. To update it with the `Inc` KV operation (better than `Get` and then `Set` because it requires only one round trip), we'd have to teach `Inc` about SQL's integer encoding, which seems like a separation-of-concerns violation. The introspectability could be achieved by creating a virtual table. (The `information_schema.sequences` table, which we'll be implementing, gets you some of this: all the metadata, just not the value.)

If we don't do this, users who prefer sequences or who have applications which use them will have an obstacle in moving to Cockroach.
Change our SERIAL type to be backed by sequences: I don't think
we can do this until we make it faster by implementing CACHE (see below).
We would also have to decide whether to make SERIAL int32. While more
compatible with Postgres, this would be backwards-incompatible with Cockroach,
since our current SERIAL is int64.
Making SERIAL always sequence backed would also allow us to automatically
create a sequence when a SERIAL is used in a CREATE TABLE, as Postgres
does. With the implementation in this RFC, to use sequences users will have to
explicitly type CREATE SEQUENCE and set their column to
DEFAULT nextval('my_sequence').
Implement the CACHE setting: (pre-allocation on each node of
batches of values, which can then be handed out quickly from memory)
This would probably make sequences fast enough to be our default for the
SERIAL type, but I think we should build the simple version, benchmark,
and do CACHE in a later RFC/PR.
Expose the sequence values as a system.sequences table, for easy
introspectability.
Make SELECT * FROM my_sequence work: In Postgres, it returns a
single row containing the sequence value and settings:
| sequence_name | last_value | start_value | increment_by | max_value | min_value | cache_value | log_cnt | is_cycled | is_called |
|---|---|---|---|---|---|---|---|---|---|
| foo | 2 | 1 | 1 | 9223372036854775807 | 1 | 1 | 31 | f | t |
This is easy to do, and may be helpful for compatibility. It can be saved for the last PR of the change, but should probably be done.
Currently, some functions are nondeterministic (e.g. random), but none write
to the KV store. The EvalContext that function implementations receive only lets them
run SQL (via its EvalPlanner member). This will have to be refactored to give
the functions the capabilities they need.