docs/rfcs/013-term-history.md
Currently, apart from WAL safekeeper persistently stores only two logical clock
counter (aka term) values, sourced from the same sequence. The first is bumped
whenever safekeeper gives vote to proposer (or acknowledges already elected one)
and e.g. prevents electing two proposers with the same term -- it is actually
called term in the code. The second, called epoch, reflects progress of log
receival and this might lag behind term; safekeeper switches to epoch n when
it has received all committed log records from all < n terms. This roughly
corresponds to proposed in
https://github.com/neondatabase/rfcs/pull/3/files
This makes our biggest our difference from Raft. In Raft, every log record is
stamped with term in which it was generated; while we essentially store in
epoch only the term of the highest record on this safekeeper -- when we know
it -- because during recovery generally we don't, and epoch is bumped directly
to the term of the proposer who performs the recovery when it is finished. It is
not immediately obvious that this simplification is safe. I thought and I still
think it is; model checking confirmed that. However, some details now make me
believe it is better to keep full term switching history (which is equivalent to
knowing term of each record).
Without knowing full history (list of <term, LSN> pairs) of terms it is hard to determine the exact divergence point, and if we don't perform truncation at that point safety becomes questionable. Consider the following history, with safekeepers A, B, C, D, E. n_m means record created by proposer in term n with LSN m; (t=x, e=y) means safekeeper currently has term x and epoch y.
Now, A gets back and P3 starts recovering it. How it should proceed? There are two options.
...start sending WAL conservatively since the horizon (1.1), and truncate obsolete part of WAL only when recovery is finished, i.e. epochStartLsn (4) is reached, i.e. 2.3 transferred -- that's what https://github.com/neondatabase/neon/pull/505 proposes.
Then the following is possible:
Now log of A is basically corrupted. Moreover, since ABE are all in epoch 1 and A's log is the longest one, they can elect P4 who will commit such log.
Note that this particular history couldn't happen if we forbid to create new records in term n until majority of safekeepers switch to it. It would force CDE to switch to 2 before 2.2 is created, and A could never become donor while his log is corrupted. Generally with this additional barrier I believe the algorithm becomes safe, but
Then step 4 would delete 1.3 1.4 on A, and we are ok. The question is, how do we do that? Without term switching history we have to resort to sending again since the horizon and memcmp'ing records, which is inefficient and ugly. Or we can maintain full history and determine truncation point by comparing 'wrong' and 'right' histories -- much like pg_rewind does -- and perform truncation + start streaming right there.
epoch). We also send end of wal as we do now to determine the donor.Truncates wrong part of WAL on safekeeper (divergence point is already calculated at proposer, but can be cross-verified here).
Communicates the 'right' history of its term (taken from donor). Seems better to immediately put the history in the controlfile, though safekeeper might not have full WAL for previous terms in it -- this way is simpler, and we can't update WAL and controlfile atomically anyway.
This also constitutes analogue of current epoch bump for those safekeepers which don't need recovery, which is important for sync-safekeepers (bump epoch without waiting records from new term).
pros/cons: