docs/RFCS/20191014_savepoints.md
This RFC proposes to implement SQL savepoints as supported by PostgreSQL.
Savepoints enable a client to partially roll back a transaction.
This is a feature that is often requested by users, and used prominently in the test suites of 3rd party tools, in particular ORMs. We want to enable these test suites so as to increase our compatibility coverage.
The addition of SQL savepoints is enabled by recent changes to the KV layer, whereby write intents now preserve the history of sequence numbers that wrote to a key throughout a transaction.
SQL savepoints are prominently used by 3rd party tools and frameworks. They are key to implement nested transactions, which is a common programming idiom from Java and other object-oriented languages.
A savepoint is a special mark inside a transaction that allows all commands that are executed after it was established to be rolled back, restoring the transaction state to what it was at the time of the savepoint.
See also: https://www.postgresql.org/docs/current/sql-savepoint.html
- To establish a savepoint inside a transaction: `SAVEPOINT savepoint_name`.
  The usual PostgreSQL identifier rules apply: `SAVEPOINT foo` and
  `SAVEPOINT Foo` define the same savepoint, whereas `SAVEPOINT "Foo"`
  defines another.
- To roll back a transaction partially to a previously established
  savepoint: `ROLLBACK TO SAVEPOINT savepoint_name`.
- To forget a savepoint, and keep the effects of statements executed
  after the savepoint was established: `RELEASE SAVEPOINT savepoint_name`.
For example:
```sql
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
ROLLBACK TO SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (3);
COMMIT;
```
The above transaction will insert the values 1 and 3, but not 2.
Savepoints can be nested.
For example:
```sql
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
SAVEPOINT my_savepoint2;
INSERT INTO table1 VALUES (3);
ROLLBACK TO SAVEPOINT my_savepoint2;
INSERT INTO table1 VALUES (4);
RELEASE my_savepoint;
COMMIT;
```
This inserts values 1, 2 and 4 but not 3.
Changes partially committed by a savepoint release can be rolled back by an outer savepoint.
For example:
```sql
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
SAVEPOINT my_savepoint2;
INSERT INTO table1 VALUES (3);
RELEASE SAVEPOINT my_savepoint2;
ROLLBACK TO SAVEPOINT my_savepoint;
COMMIT;
```
This inserts only value 1. The value 3 is rolled back alongside 2.
As an extension to the SQL standard, PostgreSQL allows a
SAVEPOINT statement to shadow an earlier savepoint with the same
name. The name refers to the new savepoint until released/rolled back,
after which the name reverts to referring to the previous savepoint.
For example:
```sql
BEGIN;
INSERT INTO table1 VALUES (1);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (2);
SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (3);
ROLLBACK TO SAVEPOINT my_savepoint;
INSERT INTO table1 VALUES (4);
RELEASE SAVEPOINT my_savepoint;
COMMIT;
```
This inserts values 1, 2 and 4, but not 3.
RELEASE SAVEPOINT and ROLLBACK TO SAVEPOINT can refer to
a savepoint "higher" in the nesting hierarchy. When this occurs, all
the savepoints "under" the nesting are automatically released/rolled
back too.
For example:
```sql
BEGIN;
SAVEPOINT foo;
INSERT INTO table1 VALUES (1);
SAVEPOINT bar;
INSERT INTO table1 VALUES (2);
RELEASE SAVEPOINT foo;
COMMIT;
```
This inserts both 1 and 2.
```sql
BEGIN;
SAVEPOINT foo;
INSERT INTO table1 VALUES (1);
SAVEPOINT bar;
INSERT INTO table1 VALUES (2);
ROLLBACK TO SAVEPOINT foo;
COMMIT;
```
This inserts nothing: both inserts are rolled back.
```sql
BEGIN;
SAVEPOINT foo;
SAVEPOINT bar;
ROLLBACK TO SAVEPOINT foo;
RELEASE SAVEPOINT bar; -- error: savepoint "bar" does not exist
COMMIT;
```
This demonstrates that the name "bar" is not visible after it was rolled back over.
If a SQL error occurs "under" a savepoint, it is possible to recover an open, "healthy" txn by rolling back the savepoint without rolling back the txn. (An exception to this is discussed below.)
For example:
```
kena=> create table u(x int unique);
CREATE TABLE
kena=> insert into u(x) values(1);
INSERT 0 1
kena=> begin; \
savepoint foo; \
insert into u(x) values (1); \
rollback to savepoint foo; \
insert into u(x) values (2); \
commit;
BEGIN
SAVEPOINT
ERROR:  duplicate key value violates unique constraint "u_x_key"
DETAIL:  Key (x)=(1) already exists.
ROLLBACK
INSERT 0 1
COMMIT
kena=> select * from u;
 x
---
 1
 2
(2 rows)
```
(i.e. the second insert succeeds even though the first insert encountered an error)
In the first implementation, an exception to this will be the handling of retry errors. These will not be cancellable with a savepoint rollback. Instead, a client-side retry loop must be implemented using the regular mechanism (see next section).
CockroachDB supports exclusive row locks. (They come in two variants: distributed as write intents and, since v20.1, also non-distributed via SELECT FOR UPDATE. This distinction is not material here.)
Unlike PostgreSQL, CockroachDB does not release row locks upon ROLLBACK TO SAVEPOINT. This is an architectural difference in v20.1 that may or may not be lifted in a later CockroachDB version.
Client code that relies on row locks must be reviewed and possibly modified to account for this difference. In particular, if an application relies on ROLLBACK TO SAVEPOINT to release row locks and allow a concurrent txn touching the same rows to proceed, this behavior will not work with CockroachDB.
The introduction of regular SQL savepoints reveals a difference between two situations that were previously somewhat indistinguishable:
- retry errors (txn conflicts) that are detected / occur in the middle
  of a transaction, as a result of some SQL statement.
  Some (most) of these can now be recovered from using ROLLBACK TO
  SAVEPOINT with a regular savepoint name.
- retry errors (txn conflicts) that are detected at the end of a
  transaction, during COMMIT.
  These remain detected during COMMIT, or during RELEASE
  cockroach_restart (more below). These can only be recovered from using
  ROLLBACK TO SAVEPOINT cockroach_restart and a full txn replay.
From a user perspective, this translates as follows.
The newly introduced support for SQL savepoints still considers
the name cockroach_restart special. A savepoint defined with the
name cockroach_restart is a "restart savepoint" and has separate
semantics:
- it must be opened immediately when the transaction starts; it is not
  supported/allowed to open a restart savepoint after some other
  statements have been executed. In contrast, regular savepoints can be
  opened after some other statements have been executed already.
- after a successful RELEASE, a restart savepoint does not allow
  further use of the transaction: COMMIT must immediately follow the
  RELEASE. In contrast, other statements can be executed after the
  RELEASE of a regular savepoint.
- restart savepoints cannot be nested. Issuing SAVEPOINT
  cockroach_restart two times in a row only creates a single savepoint
  marker (this can be seen with SHOW SAVEPOINT STATUS). Issuing
  SAVEPOINT cockroach_restart after ROLLBACK TO SAVEPOINT
  cockroach_restart reuses the marker instead of creating a new one. In
  contrast, two SAVEPOINT statements with a regular savepoint name, or
  SAVEPOINT immediately after ROLLBACK, create two distinct savepoint
  markers.
Note: the session setting force_savepoint_restart still works
and causes every savepoint name to become equivalent to
cockroach_restart with the special semantics described above.
Note: this section is not yet implemented in 20.1 and is planned for a later version.
Schema changes and other DDL are supported "under" savepoints and can be partially rolled back without rolling back the entire transaction.
For example:
```sql
BEGIN;
CREATE TABLE u(x INT);
SAVEPOINT foo;
CREATE TABLE t(x INT);
INSERT INTO t(x) VALUES (1);
ROLLBACK TO SAVEPOINT foo;
INSERT INTO u(x) VALUES(1);
SAVEPOINT bar;
CREATE TABLE t(x TEXT);
RELEASE SAVEPOINT foo;
INSERT INTO t(x) VALUES ('a');
COMMIT;
```
This inserts 1 into u and 'a' into t. The table t with an INT column
does not exist after the transaction commits.
Note that the ordering of DDL vs DML statements remains restricted as before; none of the known limitations are lifted by this work.
The overall design can be understood as follows:
at the SQL/KV interface, KV operations are associated with sequence numbers (seqnums); rolling back a savepoint marks a range of seqnums as rolled back, to be ignored during reads.
a savepoint is defined by the current write seqnum at the time it is established, together with the minimal additional state needed to partially roll back the txn (detailed below).
conceptually, savepoints define a stack-like structure. Any savepoint has at most one savepoint open "under" it. When releasing or rolling back to a savepoint higher in the stack, all the savepoints in-between are also released or rolled back. In the SQL executor, this is managed using a stack-like data structure.
to introduce savepoint semantics in schema updates / DDL, the descriptor caches are extended to store the seqnum alongside the cached descs. Cache entries are not considered/evicted if the seqnum has been rolled back.
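To illustrate the cache check, here is a minimal Go sketch (all type and field names are hypothetical stand-ins, not the actual descriptor cache types): an entry stored alongside its seqnum is rejected when that seqnum falls inside a rolled-back range.

```go
package main

import "fmt"

// seqRange is an inclusive range of rolled-back seqnums.
type seqRange struct{ start, end int32 }

// cachedDesc is a hypothetical cache entry: a descriptor name together
// with the seqnum at which it entered the cache.
type cachedDesc struct {
	name   string
	seqNum int32
}

// isValid reports whether the entry can still be used: entries cached
// at a seqnum that has since been rolled back must not be considered.
func (d cachedDesc) isValid(ignored []seqRange) bool {
	for _, r := range ignored {
		if d.seqNum >= r.start && d.seqNum <= r.end {
			return false
		}
	}
	return true
}

func main() {
	ignored := []seqRange{{5, 8}}                    // seqnums 5..8 were rolled back
	fmt.Println(cachedDesc{"t", 6}.isValid(ignored)) // false: evict/ignore
	fmt.Println(cachedDesc{"u", 3}.isValid(ignored)) // true: still usable
}
```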
Initial analysis, pre-implementation, suggests the following work to be performed:
a `Step()` method (e.g. on the `txnPipeliner` interceptor), to be called where relevant.

To support nesting and shadowing, the SQL executor maintains a naming environment: a list of mappings from names to savepoints.
In Go, using `[]struct{name string; sp savepoint}` or equivalent.
(We do not use a simple map because savepoints form a stack.)
The most recent, innermost savepoint information is at the end of the list.
When defining a savepoint, a new entry is appended. When releasing or rolling back a savepoint, the latest mapping for that savepoint name is removed.
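The stack-like naming environment can be sketched in Go as follows (names are illustrative, not the actual executor types). The sketch keeps a rolled-back-to savepoint established, consistent with the ROLLBACK TO examples earlier, and drops everything opened after it, so a shadowed name reverts to its previous definition:

```go
package main

import "fmt"

// savepoint is a placeholder for the executor's savepoint token.
type savepoint struct{ seqNum int32 }

// entry maps a savepoint name to its token. The environment is a slice
// rather than a map because savepoints form a stack and a name can
// shadow an earlier savepoint with the same name.
type entry struct {
	name string
	sp   savepoint
}

type env []entry

// define appends a new savepoint, possibly shadowing an earlier one.
func (e *env) define(name string, sp savepoint) {
	*e = append(*e, entry{name, sp})
}

// find returns the index of the innermost (most recent) savepoint with
// the given name, or -1 if there is none.
func (e env) find(name string) int {
	for i := len(e) - 1; i >= 0; i-- {
		if e[i].name == name {
			return i
		}
	}
	return -1
}

// release drops the named savepoint and all savepoints opened after it.
func (e *env) release(name string) bool {
	i := e.find(name)
	if i < 0 {
		return false // savepoint does not exist
	}
	*e = (*e)[:i]
	return true
}

// rollbackTo drops all savepoints opened after the named one, keeping
// the named savepoint itself established.
func (e *env) rollbackTo(name string) bool {
	i := e.find(name)
	if i < 0 {
		return false // savepoint does not exist
	}
	*e = (*e)[:i+1]
	return true
}

func main() {
	var e env
	e.define("foo", savepoint{1})
	e.define("bar", savepoint{2})
	e.rollbackTo("foo")
	fmt.Println(e.release("bar")) // false: "bar" was rolled back over
	fmt.Println(e.release("foo")) // true
}
```

This reproduces the earlier example where RELEASE SAVEPOINT bar errors after ROLLBACK TO SAVEPOINT foo.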
For the special name cockroach_restart, the executor must first verify
that the stack of savepoints is empty. If it is not, a
usage/syntax/unimplemented error is produced.
To implement SQL savepoints:
the descriptor caches must be invalidated when a savepoint is rolled back. To achieve this:
we let the further processing of DDL statements proceed using KV operations as usual. We expect that the rest of the KV/SQL semantics will make schema changes "just work":
(This will be validated by experimentation in a prototype).
Background:
We need to keep both values around even without savepoints because we may reorder KV reads and writes. For example, in the sequence write1-read1-write2, if the read1 operation happens to be processed after write2, we still want it to see only write1 that is logically in its past.
The way this works today is that each KV operation also has a metadata field which tells it "at which seqnum" it should read. The MVCC read logic, when it encounters an intent, scans the entries in the intent from last to first, and ignores all writes performed at a later seqnum.
For savepoints, this needs to be extended as follows:
the txn metadata, which is available in the context of every MVCC read, is extended by a new field "list of ignored seqnum ranges".
when an MVCC read finds an intent and scans to find the first value that's no later than the current read seqnum, it should also skip over any value written at a seqnum in the ignore list.
Savepoint rollbacks are implemented by populating the ignore list with the range of seqnums generated from the point the savepoint was last established, to the point of the rollback.
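A simplified Go sketch of this read-side logic (names are illustrative; the real code lives in the MVCC layer and operates on intent history entries, not flat slices):

```go
package main

import "fmt"

// seqRange is an inclusive range of rolled-back seqnums carried in the
// txn metadata.
type seqRange struct{ start, end int32 }

func isIgnored(seq int32, ignored []seqRange) bool {
	for _, r := range ignored {
		if seq >= r.start && seq <= r.end {
			return true
		}
	}
	return false
}

// value is one entry in an intent's write history, ordered by seqnum.
type value struct {
	seq int32
	val string
}

// readIntent scans the history from last to first and returns the
// newest value written at or before readSeq whose seqnum has not been
// rolled back, mirroring the proposed MVCC read logic.
func readIntent(history []value, readSeq int32, ignored []seqRange) (string, bool) {
	for i := len(history) - 1; i >= 0; i-- {
		v := history[i]
		if v.seq > readSeq || isIgnored(v.seq, ignored) {
			continue
		}
		return v.val, true
	}
	return "", false
}

func main() {
	history := []value{{1, "one"}, {2, "two"}, {3, "three"}}
	// Seqnums 2-3 were rolled back by a savepoint rollback.
	v, ok := readIntent(history, 3, []seqRange{{2, 3}})
	fmt.Println(v, ok) // one true
}
```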
This storage-specific part of the work is described in this issue: https://github.com/cockroachdb/cockroach/issues/41612
This RFC assumes understanding of the TxnCoordSender tech note in PR #42116 (available in-tree once that PR merges).
On the "all-is-well" path (no errors) we want to associate SQL savepoints with seqnums, and clean up the read-your-own-writes semantics of SQL mutations.
The code changes will introduce the following principles for SQL transactions (non-SQL txns are unaffected):
seqnums will not reset to 0 when a txn epoch is incremented so that seqnums can continue to increase monotonically across txn epochs and there is no seqnum reuse across epochs.
a SQL savepoint token is a copy of the current write seqnum, together with the minimal additional state sufficient to partially roll back the txn (this is discussed further below).
a SQL savepoint release checks and reports any currently deferred error (see tech note for definition, e.g. txn push or WriteTooOld).
a SQL savepoint rollback is implemented as a SavepointRollback() method
in the TCS, which takes as argument the SQL savepoint token
to roll back to, computes the range of seqnums from the target savepoint
to the current last generated seqnum, and then populates the current txn
object to mark this entire range as rolled back. (This
information will live in the transaction record and be available
through TxnMeta to every KV op in MVCC, which needs it.)
there cannot be any in-flight LeafTxn active when a savepoint is rolled back, so that the list of ignored seqnum ranges can never change "under" a LeafTxn concurrently.
The SQL executor is responsible for organizing SQL execution so as to
prevent LeafTxns existing concurrently between
SavepointRollback() operations.
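As an illustration of how a rollback could populate the ignored list, here is a hedged Go sketch with hypothetical names (the real logic lives in the TCS and stores the ranges in the txn record): the rolled-back range runs from just after the token's seqnum to the last generated seqnum, and a later rollback to an earlier savepoint merges with the previous range.

```go
package main

import "fmt"

// seqRange is an inclusive range of rolled-back seqnums.
type seqRange struct{ start, end int32 }

// rollbackToSavepoint marks all seqnums written after the savepoint
// token was taken as rolled back, appending to (or merging with) the
// last entry of the txn's ignored list.
func rollbackToSavepoint(ignored []seqRange, tokenSeq, curSeq int32) []seqRange {
	if curSeq <= tokenSeq {
		return ignored // nothing was written since the savepoint
	}
	r := seqRange{start: tokenSeq + 1, end: curSeq}
	// A rollback to an earlier savepoint subsumes previously ignored
	// ranges; merge with the last entry if they overlap or touch.
	if n := len(ignored); n > 0 && ignored[n-1].end >= r.start-1 {
		if ignored[n-1].start < r.start {
			r.start = ignored[n-1].start
		}
		ignored = ignored[:n-1]
	}
	return append(ignored, r)
}

func main() {
	ig := rollbackToSavepoint(nil, 5, 9) // savepoint at seq 5, now at seq 9
	fmt.Println(ig)                      // [{6 9}]
	ig = rollbackToSavepoint(ig, 2, 10)  // later rollback to an earlier savepoint
	fmt.Println(ig)                      // [{3 10}]
}
```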
See the corresponding section of the tech note for background information.
| Error kind | Prior situation | New (proposed) situation |
|---|---|---|
| recoverable errors with in-place recovery | Auto-retry/adjust internal to TCS, txn object remains live, no client error | (unchanged) |
| recoverable errors with txn restart | txn object re-init with epoch bump, retry error | unchanged, but see (2) below |
| deferred retry errors (eg WriteTooOld) | error stored, temporarily hidden from client, re-reported during commit | unchanged, but see (3) below |
| transient processing errors | TCS trashed + txn aborted | TCS + txn remain active, no txn state change, see (1) below |
| transaction aborts | TCS trashed + txn aborted | (unchanged) |
| unhandled errors | TCS trashed + txn aborted | (unchanged) |
avoid trashing the TCS when innocuous KV errors happen
The first and main change is to relax the notion of "unrecoverable error" (see tech note for a definition).
Today, transient processing errors like reading from a historical ts that's been GCed, an invalid CPut condition, etc., all cause the TCS to move to the "txnError" state, after which no operation is ever possible any more. The RFC proposes to change this so that only internal assertion errors cause a TCS to become fully invalidated.
Instead, KV errors like CPut condition errors will simply generate an error object (as they already do) and nothing else, and this error will flow back to SQL where it can be dealt with as usual.
The new behavior is that it will be possible to continue issuing KV requests via the TCS after such an error occurs.
The main change from the Seqnums and savepoints section above decouples the seqnum increments from epoch increments. This means that the "internal" error handling performed by TCS on retry errors will not invalidate seqnum ranges (and rolled back seqnum ranges after SQL savepoint rollbacks).
The new behavior is that it becomes possible to recover from a retry error (other than WriteTooOld, see point 3 below) using a SQL savepoint rollback.
any deferred error (currently, just WriteTooOld) is deferred only
until the next savepoint RELEASE. If present, it is reported then.
If a client opens a new savepoint while there is a pending WriteTooOld
error, that error state is preserved in the savepoint token and
restored when the savepoint is rolled back.
(TBD: whether this is correct/desirable. There is discussion about whether WriteTooOld is still handled this way at all.)
From the perspective of SQL clients:
| Error type | Prior situation | New (proposed) situation |
|---|---|---|
| transaction aborts | no recovery possible | (unchanged) |
| transient processing errors | no recovery possible | can be recovered using savepoint rollback |
| unhandled errors | no recovery possible | (unchanged) |
| recoverable errors with in-place recovery | automatic recovery, invisible to client | (unchanged) |
| recoverable errors with txn restart | retry error, must start from beginning of txn | can be recovered using savepoint rollback |
| deferred retry errors (eg WriteTooOld) | error reported during commit, entire txn can be retried | can be recovered using savepoint rollback |
GetSavepoint() (SavepointToken, error) method on TCS and
*client.Txn, returns a SavepointToken, to be used exclusively on
RootTxns without LeafTxns active.
1) new `GetSavepoint()` method on TCS.
Initially:
```go
type SavepointToken struct {
	SeqNum                   enginepb.TxnSeq
	Timestamp                hlc.Timestamp
	RefreshedTimestamp       hlc.Timestamp
	OrigTimestampWasObserved bool
	InFlightWriteIndex       int   // see explanation below
	Epoch                    int32 // (may be removed entirely)
}
```
2) conceptually asks all interceptors to "build a savepoint", although the
first implementation will be as simple as getting the current
counter value in the `txnSeqNumAllocator`.
3) new `GetSavepoint()` on `TxnSender` interface
4) new `GetSavepoint()` on `client.Txn`, forwards to the `TxnSender`.
RollbackToSavepoint(SavepointToken) error method on TCS
and *client.Txn, to be used exclusively on RootTxns
without LeafTxns active.
1) new `RollbackToSavepoint()` method on TCS:
performs the seqnum invalidation in the txn record, as described
above, and restores the additional state. In particular, the in-flight
write slice (`InFlightWrites` in the `roachpb.Transaction`
record) is truncated to the position indicated by
`InFlightWriteIndex`.
2) new `RollbackToSavepoint()` method on the `TxnSender` interface.
3) new `RollbackToSavepoint()` method on `client.Txn`, forwards to the `TxnSender`.
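A toy Go sketch of the in-flight write truncation described above (the types here are illustrative stand-ins for the real roachpb/enginepb types; only the truncation step is shown, not the full state restore):

```go
package main

import "fmt"

// savepointToken stands in for the SavepointToken described above.
type savepointToken struct {
	seqNum           int32
	inFlightWriteIdx int // length of InFlightWrites when the token was taken
}

// txnRecord stands in for the roachpb.Transaction record.
type txnRecord struct {
	inFlightWrites []string // keys of writes still being pipelined
}

// rollbackToSavepoint restores the parts of the txn record captured by
// the token; only the in-flight write slice truncation is shown here.
func (t *txnRecord) rollbackToSavepoint(tok savepointToken) {
	t.inFlightWrites = t.inFlightWrites[:tok.inFlightWriteIdx]
}

func main() {
	txn := &txnRecord{inFlightWrites: []string{"a", "b", "c"}}
	tok := savepointToken{seqNum: 3, inFlightWriteIdx: 1} // taken after "a"
	txn.rollbackToSavepoint(tok)
	fmt.Println(txn.inFlightWrites) // [a]
}
```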
The current code in CockroachDB preserves row locks during savepoint rollbacks. This was already the case with the "restart savepoint" mechanism introduced in v1.0, and continues to be the case with the more general savepoints proposed here.
This behavior is a divergence from PostgreSQL which releases locks upon rollbacks. This may have user-visible impact (see guide-level section).
Releasing locks on a savepoint rollback is not trivial because we'd have to track what sequence number each lock was acquired at, and we'd have to maintain full fidelity for this tracking (preventing us from collapsing different spans in the footprint to save memory).
One option is to not eagerly release locks, but to update the transaction record upon a rollback with a list of ignored seq nums. Then concurrent pusher txns could be allowed to push locks corresponding to ignored seq nums. This would mean that the kvserver would have to maintain full fidelity on the locks, however.
There's also a case where we benefit from holding on to locks - when rolling back to an initial savepoint after a retriable error. In that case, chances are the same locks will be requested again, so holding on to them ensures the transaction can acquire them on the retry.
All in all we need some experience with savepoints being used in the wild to fully understand the trade-offs and make better informed decisions. Until then, the previous/current behavior and pg divergence will remain and needs to be documented.
This feature introduces more complexity in the SQL executor.
(to be populated)
There are two general design directions:
1. mark certain ranges of sequence numbers as "rolled back" in the transaction record itself, to be ignored during MVCC reads.
   - Pros: rollbacks are cheaper.
   - Cons: reads are more expensive.
   - Cons: after a rollback the intents are "still there" and create contention with concurrent txns that touch the rolled-back intents.
2. proactively iterate through all intents generated for the current txn and remove the rolled-back sequence numbers from them, including removing an intent entirely if the rolled-back sequence numbers are the only ones remaining.
   - Pros: reads are cheaper.
   - Pros: rolled-back intents also roll back the contention.
   - Pros: may simplify the row-locking story (although Tobias found that PostgreSQL is pretty bad on this, so we don't need this pro to be at least as good as pg).
   - Cons: rollbacks are more expensive.
Nathan recommends approach 1, sensing that it requires less work.
(none at this time)