docs/RFCS/20210115_new_schema_changer.md
We introduce a new framework for online schema changes which decouples the planning of DDL statements from their execution: It allows schema changes to be specified declaratively in terms of schema element states, state transitions, and state dependencies, and it generates schema change plans for transactions involving DDL statements that are independent of the details of how state transitions are executed.
This framework remedies long-standing problems with the current implementation of schema changes in CRDB, which was built around a limited, hard-coded set of state transitions and has proved insufficient for the range of DDL statements and their combinations that we support today. Its flexibility with regards to both execution context and specifying rules for planning schema changes is a prerequisite for transitioning from today's async schema change jobs to fully transactional schema changes.
A schema change is a sequence of state changes on a set of descriptors—or, at a more granular level, on a set of schema elements such as columns and indexes. This state-based abstraction, central to schema changes in CRDB, was adapted from Google's F1 schema change paper. Today, those state changes take place over multiple transactions, most of which run in an asynchronous job after the user transaction has committed. During a schema change, pairs of consecutive descriptor versions, which may be simultaneously in use in the cluster, must provide mutually compatible schemas for concurrent transactions accessing table data. It's also possible to initiate arbitrary combinations of schema changes within a transaction (with some limitations), which imposes dependencies between schema elements: For instance, new constraints cannot be validated until the columns they refer to have been backfilled.
Schema changes in CRDB have always been guided by the state-based abstraction, but their current implementation has serious deficiencies, some of which make the implementation unsuitable for the impending introduction of transactional schema changes. The current schema changer has hard-coded states and transitions that best accomodate a single column or index change, and is fundamentally limited in its ability to support other schema changes and their combinations. This has become untenable as the number of supported DDL statements has grown.
Problems with the schema changer which threaten the correctness and stability of schema changes, both current and future, include:
ALTER PRIMARY KEY, and ALTER COLUMN TYPE have all been added to the schema changer through introducing implicitly defined states and special cases outside the main column and index change track, and, in the latter two cases, by chaining two runs of the schema changer together.sql package and is heavily entangled with it, so it's difficult to isolate components in testing. As a result, schema changes are severely under-tested.Furthermore, major changes to schema changes, both in infrastructure and across many statements' specific execution logic, are necessary to support transactional schema changes.
The main infrastructural change needed is to decouple the specification of schema change states and transitions from the context in which they are executed. Schema changes should continue to run when split across the user transaction and async job, as they are today, and the schema changer should conceptually unify those execution phases. In the future, when schema changes are transactional, they should also be able to run in a series of child transactions within the user transaction. A state transition corresponding to a new descriptor version can, and should, be specified independently of actually writing an updated descriptor to disk.
Some schema changes also require reimplementation for a transactional world. Most significantly, we'll need to use the index backfiller to rewrite the primary index for all column changes, because an unchanged primary index must remain part of the table for concurrent transactions. The reimplementation will be much easier when states and rules can be declaratively specified and modified.
The new schema changer separates the planning from the execution of schema changes, somewhat analogously to the planning and execution of queries. Planning involves generating, from the DDL statements' AST nodes, an intermediate representation of all schema changes to be performed, and computing dependencies and generating declaratively specified operations ("ops") to be executed.
Ops can be descriptor updates, backfills, or constraint validation queries. Different specific implementations of the execution of these ops are used depending on context: For instance, we currently have a special "backfiller" which builds indexes on tables created earlier in the same transaction, and the new schema changer will allow us to unify its interface with that of the real index backfiller, plumbing in one or the other as appropriate. (This flexibility will also be useful in the future when implementing transactional schema changes.)
A single instance of a schema changer is responsible for all the schema changes initiated in a transaction, and is tied to the connExecutor state machine: Before the transaction commits, it will both incrementally accumulate state as part of planning and execute descriptor changes as necessary for each statement, though the details vary depending on whether the schema change is executing mostly in an async job or within the user transaction itself. In this RFC, most examples will focus on the former case, where most of the work of execution occurs after the transaction in a separate job.
Note: Throughout the RFC, we use "schema" as in "table schema" and "schema change", not as in "user-defined schema."
DescriptorMutation states, like delete-only and write-only, as well as newly formalized ones like backfilled.a is being added, there is an op edge from (a, delete-only) to (a, write-only).c is being added on a new column a, there is a dep edge from (c, validated) to (a, backfilled).a, delete-only) to (a, write-only) example above, the op would be a descriptor update.We'll consider the ALTER TABLE statement that follows:
CREATE TABLE t (a INT PRIMARY KEY);
ALTER TABLE t ADD COLUMN b INT DEFAULT 0;
We'll assume the column is added by writing its data to an entirely new primary index and then swapping the old and new primary indexes, akin to the implementation of ALTER PRIMARY KEY. Therefore, three targets will be needed in all: One to add the column, one to add the new primary index (which behaves like a secondary index while still being added and in a non-public state), and one to drop the old primary index.
During planning for the ALTER TABLE statement, the targets are built from the AST node and an immutable (read-only) instance of the table descriptor. Each element in each target contains the metadata for the relevant column or index to be added or dropped, in the form of an embedded sqlbase.ColumnDescriptor or sqlbase.IndexDescriptor, along with other relevant metadata for the schema change.
The following is a simplified definition of the Column element. (Elements are defined as protobuf messages so they can be serialized in the schema change job record.)
message Column {
uint32 table_id = 1;
uint32 family_id = 2;
string family_name = 3;
cockroach.sql.sqlbase.ColumnDescriptor column = 4;
}
The index elements and targets are defined similarly. For the schema change above, the generated "add column" target is as follows:
Target{
ElementProto: ElementProto{
Column: &Column{
TableID: 52,
FamilyID: 0,
FamilyName: "primary",
Column: descpb.ColumnDescriptor{
ID: 2,
Name: "b",
Nullable: true,
Type: <type information>,
},
},
},
Direction: Target_ADD,
}
The execution of the schema change is split across multiple phases: statement execution, pre-commit, and post-commit (i.e., in the schema change job). The goal of planning is to produce, for each execution phase, a sequence of stages to be executed in order. Each stage contains a set of ops to be executed. Roughly speaking, the states of the targets at the conclusion of a phase serve as the starting point for the following phase.
For our targets, the new column and new primary index begin in the absent state, and the old primary index begins in the public state. Recall that in the old schema changer, a column being added would move to delete-only during statement execution. As part of the reimplementation of ADD COLUMN, we will (somewhat arbitrarily) defer that step to the pre-commit phase in the user transaction. Therefore, in this example, the statement execution phase is a no-op; only in the pre-commit phase do the new column and new primary index move to the delete-only state, at which point the updated descriptor is committed, and the table is ready to undergo the remainder of the schema change in the async job. (This distinction between the statement execution and post-commit phases is unimportant for an implicit transaction, but becomes significant for transactions with multiple DDL statements, where the visibility of earlier schema changes matters.)
The post-commit phase involves the most complex planning. The following diagram (generated by EXPLAIN (DDL, DEPS), a feature of the new schema changer), demonstrates the graph of nodes and their op edges and dependency edges:
Figure 1: Output of EXPLAIN (DDL, DEPS)
Targets (elements and their add/drop directions) are shown at the top. The op edges for each target are indicated by black arrows, and the dependency edges between targets by red ones. The dependency rules that apply to this example are as follows (stated informally):
write-only until the column does.public until the column does.public simultaneously with the old one becoming write-only.The dependency edges determine the partitioning of ops into stages. A stage is defined as follows, and is both a transition between sets of Nodes and a specification of ops to execute to effect that transition:
type Stage struct {
Before, After []*scpb.Node
Ops scop.Ops
}
(Note that some targets may undergo no change in a stage, in which case their corresponding nodes in Before and After will be equal.)
Stages are generated in order starting at the initial nodes for each target, and each successive stage is generated by taking all possible ops that don't conflict with any dependency edges. The following diagram, generated by EXPLAIN (DDL), excludes the dependency edges but instead demonstrates the generated stages. (The present implementation doesn't recognize different phases, but is more than adequate for the present example. Stage 0 corresponds to the single stage of descriptor changes needed in the pre-commit phase, and all following stages occur in the async job.)
Link to Figure 2: Output of EXPLAIN (DDL)
Op edges are again represented by arrows with solid black lines. Each set of op edges in a stage correspond exactly to the set of ops to be executed in that stage. Targets which do not undergo any change within a stage, because they are "waiting" for dependencies to be satisfied in a later stage, have arrows with dotted lines within that stage. Note that the generated stages are compatible with each dependency edge indicated in the output of EXPLAIN (DDL, DEPS) (Figure 1).
Ultimately, for this example, planning produces one stage for the pre-commit phase and six stages for the post-commit phase, including a backfill for the default value. We turn our attention to the execution of these stages and their ops.
Ops are executed via the schema change executor, which is an abstraction over the various dependencies needed to perform descriptor updates, index backfills, constraint and unique index validation, and job creation. (In the case of dropping indexes, as in the present example, making an index absent requires queuing a GC job.) Since each op is a complete specification of some unit of work required, the executor serves as a thin translation layer.
Different executors can be initialized with different implementations of descriptor updates, backfills, and validation, the details being mostly independent of the schema change steps themselves. Today's schema changes requiring async jobs will, in general, have two executors: one tied to the user transaction's connExecutor state machine, and another in the job resumer.
In the interim before transactional schema changes, a newly introduced job resumer will run the post-commit phase of the schema change. Its graph nodes (which deterministically determine the execution stages) will be stored in the job progress, which represents a significant conceptual change from the old schema change job: As with the earlier execution phases, the source of truth of the schema change's progress is no longer the descriptors themselves, but rather the nodes stored by the job. Backfill checkpointing will work similarly to the old schema changer job.
This section outlines how the schema changer interacts with the connExecutor state machine and statement execution over the course of a transaction, in the most general case of an explicit transaction containing one or more DDL statements, possibly together with non-DDL statements/queries. This discussion is specific to the non-transactional, asynchronous schema changes of the present, but the concepts involved for the transactional schema changes of the future will not fundamentally differ.
The new schema changer has a single unified planning implementation and planNode for all DDL statements, with all specialization occurring in the builder itself. The present implementation, when new-style schema changes are enabled, intercepts all DDL statements during planning but falls back to the existing implementation if the statement is unsupported.
Figure 3 shows all three phases of a general schema change, with a focus on what occurs within the user transaction. We first examine the planning and execution of each DDL statement, followed by what happens when the transaction commits.
Figure 3:
The main additional state introduced is in extraTxnState, which stores a set of nodes representing the "current" targets and their states, accumulated from all the DDL statements issued in the transaction so far.
For each DDL statement issued, represented by the "Statement planning/execution" box in Figure 3, the stored extraTxnState nodes will, in general, undergo two changes: The first occurs during statement planning, when targets are added or modified by the builder to represent the schema change, and the second occurs during statement execution, when ops are actually executed to produce state changes and the nodes are updated accordingly.
(Note: Although constructing stages and ops is conceptually part of the planning of a schema change, in the present implementation it occurs during the planNode's execution (i.e., in startExec) instead of during its planning. Since the stages and ops for a schema change are basically a pure function of the input nodes and the execution phase, it doesn't matter specifically when we construct them. For the rest of this document, we'll continue to refer to constructing stages as part of "planning" a schema change.)
The process of adding targets for an ADD COLUMN schema change was illustrated in the previous section. In general, an incoming DDL statement may also modify the existing set of targets in addition to adding new ones. For example, when an ADD COLUMN statement is issued following another earlier one in the same transaction, the preexisting targets to change the primary index must be updated to add the second new column to the new index's stored columns.
Stages are then generated from the updated targets produced by the builder, and the resulting ops are all executed within the transaction. For explicit transactions with multiple DDL statements, the distinction between the statement execution and pre-commit phases becomes relevant. As mentioned earlier, we defer some descriptor updates to the pre-commit phase to avoid writing new schema elements that may need further updates, enabled by the property that newly added, non-public schema elements are not visible to reads and writes later in the transaction. However, some schema changes do have immediately visible effects, such as DROP COLUMN, as well as all of today's schema changes that do not require descriptor mutations. For these schema changes, a non-empty list of stages will be generated for the statement execution phase.
Special handling is needed when the schema change operates on tables created earlier in the same transaction. Since other transactions cannot read or write to the table, there is usually no multi-version, asynchronous schema change required (with an exception being the treatment of foreign key backreferences—though note that this case is not treated correctly in the old schema changer), and it is usually possible to complete the schema change entirely within statement execution. We currently have an separate shortcut implementation for these "in-transaction" schema changes, but the new schema changer implementation will unify the in-transaction schema change path with the usual one: We'll simply generate a list of stages for the statement execution phase indicating that the schema change should be run to completion. The executor, accordingly, will use the special backfiller implementation that performs transactional KV writes to the index.
Note that targets only change as part of statement planning. After all DDL statements within the transaction have been executed, the targets become stable, and only their associated states (nodes) will change, resulting from op execution, from that point on.
Finally, when the connExecutor is about to commit the transaction, we construct stages and execute the ops for the pre-commit phase, with the input nodes being the nodes stored in extraTxnState as a result of all the DDL statements in the transaction. This is a "planning" and "execution" cycle for the schema changer, though it is not tied to the planning or execution of any particular statement. At this point, we also create the job, with the nodes at the conclusion of the pre-commit phase marshalled in the job record. After the transaction commits, the remainder of the schema change can proceed in the job.
Schema elements include the following:
Recall that a target is simply an element together with a direction (either add or drop).
For any given DDL statement, the logic of building its targets will be similar to its existing execution logic (for ALTER TABLE, CREATE INDEX, etc.). The crucial difference is that instead of accumulating changes on a set of mutable descriptors which are directly read from and written to the store, we only interact with immutable descriptors while building, and instead accumulate changes to the targets themselves.
Our strategy is to reimplement all existing schema changes in parallel, only sharing code with the existing implementation if it requires minimal refactoring (which, in practice, means that there are still plenty of opportunities). This strategy presents a risk of introducing divergences between the two implementations, but our existing SQL tests (both specifically for DDL statements and otherwise) should reduce the risk.
A new problem presented by the target-based approach is that when adding new targets, the relevant preexisting schema elements on the descriptor are spread across both the descriptor and any preexisting targets from earlier DDL statements in the transaction. (We may safely assume there are no concurrent schema changes started in other transactions, as described in a later section.)
One solution is to take advantage of the duality between descriptor states and state changes, and decompose the descriptor itself into a set of completed targets: A table with a column is simply an table with an "empty" schema together with a column target in the public state. This amounts to creating an abstraction for descriptors purely in terms of their schema elements, and allows for unifying "completed" and in-progress schema elements in a much more principled way than what our current mutable descriptors and APIs provide. It also suggests a natural way to implement CREATE statements, by starting with an "empty" descriptor and accumulating targets generated from the statement.
A graph, which is a directed graph of nodes, op edges, and dependency edges, is an intermediate representation used in constructing stages and their ops from a given set of input nodes. Constructing a graph is equivalent to this two-step process:
The rules which specify the successive states and op edges to generate for a given node, and the rules which specify which pairs of nodes have dependencies, are independent of any specific instance of a schema change; rather, they encode the semantics of classes of schema changes in general. In the current implementation, the rules are specified declaratively in a complicated Go data structure, in terms of predicates on nodes and pairs of nodes. Future iterations may use code generation.
A valid list of stages produced from a graph is an ordered partition of all its op edges (not its nodes!), such that all ops within the same stage have the same type (descriptor mutation, backfill, or constraint validation), and the order respects both the op edges and the dependency edges: The op edges for each target must be in order, and if node x depends on node y, then the order of ops must not cause x to be reached before y.
Stages are constructed successively and greedily: From the nodes produced as a result of the previous stage, we simply take all the op edges directed from those nodes which do not cause a dependency edge constraint to be violated. This works because once a constraint specified by a dependency edge is satisfied, it will be satisfied for all later states of the target being depended on.
In some cases, a target may be able to skip some intermediate states during a schema change while preserving correctness. Examples include:
Both of these optimizations are equivalent to deleting some nodes from the graph, and redirecting inbound and outbound dependency edges to adjacent nodes as appropriate. More generally, an optimization can be conceived of as a transformation of the graph according to some rewrite rules. Optimization passes would occur after the basic graph is constructed, and the resulting optimized graph would still be subject to the same process of constructing stages.
In the second example, the optimization relies on public schema elements (the preexisting unique index) which are not necessarily involved in the schema change. This poses a problem if targets and nodes are limited to only the schema elements undergoing changes. One potential solution, previously discussed in the context of building targets, is to synthesize targets from all preexisting public schema elements. These targets would undergo no state changes within the schema change graph, and would effectively serve only as inputs to the optimizer.
Schema change jobs can fail during normal execution in ways that do not reflect a failure or fault in the database: Backfills can fail when evaluating expressions or writing unique indexes, and constraint validation can fail if the constraint is not valid. Canceling long-running schema change jobs (before a certain point) will also be supported. But schema changes can also run into transient failures, and we need to be defensive about bugs. In all these cases, we'll attempt to revert the schema change job.
Schema changes being reverted are planned by simply reversing the directions in each of the original targets, and then executed as normal. This will be done in OnFailOrCancel.
Reverting is not possible in all states of a schema change: When dropping some schema elements, such as columns, table data will become unrecoverable once the element reaches the delete-only state. When such a state is reached, the job payload will be marked as non-cancelable. If the job still fails after that point, which should be a rare occurrence since all backfills and validation will have been completed, reverting will not be attempted.
The existing schema changer has an allowlist of retryable errors which are expected to be transient. Our current plan is to forego such a list in the new implementation, and then add back allowed errors when deemed necessary. It may also separately be useful to be more explicit about expected and unexpected errors from the backfill and from constraint validation, though this will involve some work, especially on the backfill side.
The new schema changer forbids multiple concurrent async schema changes which (roughly speaking) require descriptor mutations. New schema changes cannot even be "queued" in the mutations list, as with today's schema changes, until the list is empty. Queuing a mutation is equivalent to partially executing a schema change, since the elements have moved to a new non-public state, and we've decided to cut all instances of concurrent execution from the state space to make schema changes easier to reason about. This is compatible with our plans to lock descriptors in the context of transaction schema changes.
To preserve some compatibility with existing behavior, if a schema change is attempted when other schema changes on the same descriptor are running, we must poll and wait until the other schema changes are finished before committing the transaction. In general, within an explicit transaction, we'll need to poll using child transactions and then attempt to push the transaction timestamp. For implicit transactions with a single statement, it suffices to simply restart the transaction, which is similar to the process of waiting for old descriptor leases to drain.
An attempted formalization of mutual exclusion for schema changes is as follows: Within a single set of changes managed by a single schema changer, some modified descriptors are "locked" and undergo "locking" changes, while others undergo non-locking changes. Locking changes roughly correspond to schema changes requiring mutations, and are subject to mutual exclusion; non-locking changes, which are anticipated to be single-transaction changes such as adding foreign key backreferences, can occur even on locked tables. These concepts need more clarification for schema changes that do not require descriptor mutations.
We plan to gradually cut over to the new schema changer, enabling it for statements by default as they become supported. There will be at least one release in which new and old schema change jobs must coexist. (One reason is that although version gates provide guarantees about when jobs can start, schema change jobs from past versions can continue running past the version gate.)
Our current solution, which is relatively simple, is to have new-style schema changes wait for either new or old schema changes to finish before committing, and for old-style schema changes to simply return an error if there are ongoing new-style schema changes. There are certainly more user-friendly ways to deal with this.
Both old- and new-style schema changes can perform "non-locking" updates on tables currently undergoing the other kind of schema change. This seems fine in principle, but may deserve closer examination.
Making it easier to write high-quality schema change tests, especially tests which cover more combinations of schema changes or more intermediate states compared to the status quo, was an explicit goal of the design.
Since targets and stages are declaratively specified and decoupled from execution, we can test the planning components separately using data-driven tests, with no need for a SQL server. The input to the tests consists of DDL statements (or lists of DDL statements grouped in the same "transaction"), and the output consists of either targets or stages.
We now also have a uniform interface, the executor, through which all side effects and state changes occur, which allows for the introduction of testing knobs which also have a uniform interface. We plan to put a better abstraction over the testing knobs in place for end-to-end tests which need to execute queries, etc., in specific states.
We also intend to reuse our existing tests, including the SQL logic tests and the randomized schema changer workload tests. The latter, in particular, is useful because it potentially allows for testing combinations of both old- and new-style schema changes running concurrently.
Previously we had considered the idea of hard-coding a different set of states in the schema changer simply to avoid the problem of permanently losing data after a rollback. This would not have solved the other problems listed at the beginning of this document, and was not seriously pursued once we got started with transactional schema change planning.
DROP TABLE, and its relative TRUNCATE TABLE, in the presence of ongoing schema changes. Part of the difficulty is that the state of the schema change, which would presumably need to be updated, is now stored in its job.