docs/RFCS/20150731_local_intent_resolution.md
Asynchronous intent resolution upon Transaction completion should be moved from client gateway to Range. An EndTransaction call should contain a definitive list of intents belonging to the Transaction, and the Range executing the EndTransaction call (i.e. the base Range for the Transaction) can in most cases resolve most (if not all) intents synchronously and often garbage-collect the record right away. In all other cases, the Range will perform the asynchronous resolution of the remaining intents, and a TransactionGCQueue can periodically scrub those Transaction's records.
Currently, the client gateways (TxnCoordSender) involved in executing a Transaction (Txn) keep individual records of the intents written throughout the Txn. Upon completion (that is, when EndTransaction either commits or aborts the Txn), the client gateway executing that command asynchronously resolves the intents tracked by it. There are some issues with that:
Txn's record.Txn carried out over multiple coordinators will not have all of its intents resolved, only those on the coordinator which executes EndTransaction.Txn records: it's not clear whether there are any open intents somewhere else in the system.Not resolving intents is not a correctness issue, but can significantly reduce performance in the presence of contention as conflict resolution requires a lot of round-trip delays and potential backoff. Not garbage-collecting Txn records is also not a correctness issue, but necessary at some point.
At the Range level, we can deal with most of these issues, and adding a bit more data to the Transaction records which can not be gc'ed immediately allows speedy cleanup of the rarer cases as well.
Dealing with multiple coordinators handling a single Txn is tricky: It's hard to keep clients from using more than one gateway reliably without additional complexity, but the natural path to take with the suggested changes is to declare all intents which are not contained in the transaction record illegal. This could mean that parts of a client's Transaction might get lost upon commit when written through another gateway, unless we take precautions.
An easy and reliable method (for correctly operating clients) is to require that the request which starts the Txn on the coordinator comes with a bare Txn header.
This provides a simple method for clients to stick to one coordinator, and gives us an authoritative list of intents. Any intent not in the list is considered aborted; that way ill-behaved clients may lose writes, but otherwise we would not be able to gc Txn records ever.
The logic for resolving intents on EndTransaction is changed: Instead of trying to resolve the intents, their keys are bundled up and sent along with the EndTransaction call. Once that call has returned, the coordinator's work for that Txn is done.
Heartbeating a Transaction which just got committed could recreate the freshly-gc'ed record; some care needs to be taken that only the first heartbeat can create a record (all others should CPut), and that the heartbeat goroutine ends before the gateway passes on the EndTransaction call. This isn't a correctness issue, though.
The logic for EndTransaction is updated to reflect the following enhancements:
EndTransaction batchBatch asynchronously. Once resolved, the Txn record should be updated (as with any intent resolution; the functionality can be factored out into a single location). If the Txn is without intents after the update, it can be removed.Transaction records of each Range and re-trigger the resolution process described above for all records which have open intents (which will remove them when done). This should only ever have to do actual work in the event of node crashes, etc.All in all, this should be fairly straightforward and reduces some complexity and shared responsibility regarding intent resolution.
Clients will have their Txn aborted if switching gateways: This could be a long-term concern for long-running Txns or clients which talk to a non-sticky load-balancer. Via @bdarnell:
At Google, mapreduces would originally have to restart if their (single) master task failed. This was enough of a problem that they needed to implement master failover (eventually - it was tolerated for a long time).
This isn't enough of an issue now to consider, but good to keep in mind. We can always soften the single-coordinator constraint later, but will need to put more complexity into the resolution process in that case (for instance, the client could track the intents), and may have to reconsider heartbeating (with many coordinators, a Txn would otherwise consume many goroutines).