docs/RFCS/20151111_txn_gc.md
- {Response->Sequence}Cache, Transaction{Table->Cache}.
- Txn record on (successful) EndTransaction.
- EndTransaction call after successful resolution of (possibly) outstanding intents.
- ResolveIntent{,Range} which are carried out as a consequence of client's EndTransaction (successful or not).
- Push during ResolveIntent{,Range}, preventing the anomaly in #2231 (own writes vanishing).
Both transaction and sequence cache records should be deleted when they're no longer useful, ideally without introducing extra work. The procedure outlined here accomplishes that in the vast majority of cases (including all non-abandoned transactions).
Refresher:
- BeginTransaction).
- Range mutated by the txn with a sequence counter (mutation with non-increasing sequence number triggers a txn restart).
- Push always aborts the transaction.
The above means that we can always garbage collect aborted transactions with only a best-effort attempt to clean up their intents (but we'll do it only after the client's EndTransaction or, if that never happens, the "slow way" via the GC queue; see below).
For committed transactions, we must guarantee that no open intents exist before deleting the entry (we already synchronously resolve all intents local to the transaction record and GC the record right away if no external intents exist). The straightforward solution is to have EndTransaction persist the external intents on the transaction record and let the goroutine which resolves them asynchronously do a little more work: after successfully carrying out the batch worth of ResolveIntent, it can delete the corresponding txn record.
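The resolve-then-delete step described above can be sketched as follows. This is a minimal illustration and not the actual implementation: `TxnStore`, `TxnRecord`, `Intent`, `ErrUnresolved`, and `ResolveAndGC` are hypothetical names, and an in-memory map stands in for the persisted transaction records.

```go
package main

import (
	"errors"
	"sync"
)

// ErrUnresolved is a hypothetical sentinel a resolver can return when an
// intent could not be resolved.
var ErrUnresolved = errors.New("intent could not be resolved")

// Intent is a hypothetical stand-in for a key with an unresolved write.
type Intent string

// TxnRecord is a hypothetical committed transaction record that carries the
// external intents persisted by EndTransaction.
type TxnRecord struct {
	ID              string
	ExternalIntents []Intent
}

// TxnStore is a toy in-memory replacement for the persisted txn records.
type TxnStore struct {
	mu      sync.Mutex
	records map[string]TxnRecord
}

func NewTxnStore() *TxnStore {
	return &TxnStore{records: make(map[string]TxnRecord)}
}

func (s *TxnStore) Put(rec TxnRecord) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records[rec.ID] = rec
}

func (s *TxnStore) Get(id string) (TxnRecord, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.records[id]
	return rec, ok
}

func (s *TxnStore) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.records, id)
}

// ResolveAndGC resolves every external intent of the record and deletes the
// record only if all of them succeeded. On failure the record is kept so a
// later pass (for example the GC queue) can retry.
func ResolveAndGC(s *TxnStore, id string, resolve func(Intent) error) error {
	rec, ok := s.Get(id)
	if !ok {
		return nil // already collected; calling again is harmless
	}
	for _, in := range rec.ExternalIntents {
		if err := resolve(in); err != nil {
			return err
		}
	}
	s.Delete(id)
	return nil
}
```

Since the text has this work done asynchronously, a caller would typically launch it as `go ResolveAndGC(store, id, resolver)`. It is kept synchronous here only so the failure and success paths are easy to follow.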
All of this is best effort: we're still going to have a GC queue which walks over old transaction entries, poking old transactions and retrying their intent resolution for the 0.0001% of transactions which are left hanging.
EndTransaction gets executed (regardless of its outcome), but not when a concurrent transaction manages to abort it by means of a Push.
ResolveIntent there anyway, so we simply make clearing (idempotently) the sequence cache entry a side effect of a ResolveIntent{,Range} (when it's carried out as part of EndTransaction).
on ResolveIntent triggered through an aborting Push, we can actually deal with #2231 nicely. The issue there is that a running transaction may not know that it's been aborted already, which leads to anomalies related to the fact that its intents may be gone (so it may not read what it wrote). The key, again, is ResolveIntent{,Range}:
- (epoch, seq) < (epoch', seq') iff epoch < epoch' || (epoch == epoch' && seq < seq').
- Push, we simply poison the sequence cache on that range (setting sequence=math.MaxInt64). Assuming that we check the sequence cache on every batch (not only for writes), we trigger a transaction restart should the transaction come back to the Range. If checking the sequence cache on reads shows up in performance considerations, there are going to be ways to avoid disk I/O in most cases.
The retry increases the epoch, so when the txn comes back, it will be able to perform normally.
On both Split and Merge we'll copy the entry (keeping the larger one on collision).
The slow path to sequence cache GC takes place in the following situations:
- Split to a Range not touched by ResolveIntent{,Range} for its transaction.
In the same queue which grooms the transaction cache, we'll also groom the local sequence cache with the goal of finding "inactive" entries, pinging their transaction and removing according to the outcome. To be able to do that, we need to persist more information into the response cache key:
Some of the additional overhead could be avoided if transaction IDs encoded some of that information. For example, instead of UUID4 transaction IDs we could adopt the scheme <hlc_wall=64bit,hlc_logical=32bit><random=32bit>, but the entropy is considerably lower. This is out of scope for this RFC.
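For illustration only, the alternative ID layout mentioned above could be packed into 16 bytes as sketched below. `TxnID`, `NewTxnID`, and `Timestamp` are hypothetical names, and the big-endian field order (wall time first, so IDs sort by timestamp) is an assumption, not something the RFC specifies.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
)

// TxnID is a hypothetical 128-bit transaction ID laid out as
// <hlc_wall=64bit, hlc_logical=32bit><random=32bit>.
type TxnID [16]byte

// NewTxnID packs an HLC timestamp and 32 random bits into a TxnID.
func NewTxnID(wall int64, logical int32) (TxnID, error) {
	var id TxnID
	binary.BigEndian.PutUint64(id[0:8], uint64(wall))
	binary.BigEndian.PutUint32(id[8:12], uint32(logical))
	if _, err := rand.Read(id[12:16]); err != nil {
		return TxnID{}, err
	}
	return id, nil
}

// Timestamp recovers the HLC wall time and logical counter from the ID.
func (id TxnID) Timestamp() (wall int64, logical int32) {
	wall = int64(binary.BigEndian.Uint64(id[0:8]))
	logical = int32(binary.BigEndian.Uint32(id[8:12]))
	return wall, logical
}
```

Only the last 4 bytes are random, which makes concrete the entropy trade-off noted above compared with a UUID4.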
Checking the sequence cache on reads may show up in performance tuning (though this is not necessarily expected), in which case some extra caching should do the trick to avoid I/O.
Likewise, deleting the txn entries may need some batching up for performance (to save Raft proposals; again straightforward to do).
The original design proposed keeping track of the cluster-wide oldest intent's timestamp, which would allow all txn entries older than that timestamp to be GC'ed. The mechanism with its global characteristics doesn't seem preferable to the one outlined above (especially since little complexity and no significant performance hits or new RPCs are introduced there) and does not immediately solve #2231.