docs/rfcs/20200713_ghost_batches_recovery.md
The Raft log maintained by the consensus algorithm is gapless. This assumption allows nodes to reason about log correctness by checking only the last log entry, and it holds for non-compacted Redpanda log instances. The compaction algorithm, however, introduces intentional gaps in the log to remove outdated values. Additionally, compaction is not coordinated across nodes and may be in a different state at each replica.

Gaps left by compacted record batches make Raft recovery impossible, as a follower is unable to check whether its log is the same as the leader's. Moreover, in Redpanda offsets are assigned to batches when they are appended to disk. Gaps in the log would therefore lead to incorrect offset assignment: even though there are gaps in the leader's offsets, the follower would assign consecutive offsets.
```
Leader log:   [0,10][11,20]<--gap from 21 to 32-->[33,40]
Follower log: [0,10]
```
When batches are sent to the follower, offset assignment happens in its log_appender instance. Offsets are assigned monotonically, starting from 0, without gaps. During the normal recovery procedure the follower would therefore end up with the following offsets assigned:

```
[0,10][11,20][21,28]
```

This is incorrect, as the batch stored as [21,28] has different offsets, [33,40], at the leader.
```
[base_offset,last_offset] - single batch
<--                   --> - gap
```
The proposal is to use 'ghost batches' to indicate intentional gaps in the log. Ghost batches will not be persisted to disk but will be used by the Raft recovery mechanism to force correct offset assignment and provide a gapless log for Raft follower nodes.

The proposed solution supports Raft recovery for compacted logs without compromising the Raft protocol's consistency guarantees. It introduces no additional storage cost, only a small overhead in the append entries request (57 bytes per gap).

The expected outcome of the proposed solution is a mechanism that allows Raft to work with compacted logs without compromising any of its safety guarantees.
The proposed solution defines a new model::record_batch_type representing a 'ghost' batch, which denotes an intentional gap left after compacting the log.
### raft::recovery_stm

In the Raft recovery state machine, record batches read for follower recovery are materialized in memory as ss::circular_buffer<model::record_batch>. The recovery STM iterates over the materialized batch buffer and inserts storage::ghost_record_batch_type batches wherever a gap in the log occurs.
```cpp
// fill batches collection with gaps
auto gap_filled_batches = details::fill_gaps(
  start_offset, std::move(batches));
_base_batch_offset = gap_filled_batches.begin()->base_offset();
_last_batch_offset = gap_filled_batches.back().last_offset();
auto f_reader = model::make_foreign_record_batch_reader(
  model::make_memory_record_batch_reader(
    std::move(gap_filled_batches)));
return replicate(std::move(f_reader));
```
### storage::disk_log_appender

The disk log appender uses ghost record batches for correct offset accounting but does not persist them to disk.
```cpp
ss::future<ss::stop_iteration>
disk_log_appender::append_batch_to_segment(const model::record_batch& batch) {
    // ghost batch handling; it doesn't happen often so we can use unlikely
    if (unlikely(batch.header().type == storage::ghost_record_batch_type)) {
        _idx = batch.last_offset() + model::offset(1); // next base offset
        _last_offset = batch.last_offset();
        return ss::make_ready_future<ss::stop_iteration>(
          ss::stop_iteration::no);
    }
    return _seg->append(batch).then([this](append_result r) {
        _idx = r.last_offset + model::offset(1); // next base offset
        _byte_size += r.byte_size;
        // do not track base_offset, only the last one
        _last_offset = r.last_offset;
        auto& p = _log.get_probe();
        p.add_bytes_written(r.byte_size);
        p.batch_written();
        // subtract the bytes from the append
        // take the min because _bytes_left_in_segment is optimistic
        _bytes_left_in_segment -= std::min(_bytes_left_in_segment, r.byte_size);
        return ss::stop_iteration::no;
    });
}
```
A ghost batch may fill a gap that spans batches coming from different terms. Consider an example:

```
            |- term 1 --|                       |- term 3 ...
Leader log: [0,10][11,20]<--gap from 21 to 32-->[33,40]
```
In this example batches in range [0,20] come from term 1 while batches in range [33,...] come from term 3. We are unable to determine the term of the batches that were removed during the compaction process.

To guarantee correctness we assume that a gap has the term of the batch that immediately follows it in the log (in the current example the ghost batch in range [21,32] would have term 3). An alternative would be to use the term of the batch immediately preceding the gap; however, a ghost batch may be the first batch in a range of batches, in which case its term could not be established. This problem does not exist for the proposed approach, as a gap will never be the last element in a series of batches delivered for recovery.
The main drawback of the solution is the overhead introduced by the gap-filling batches, i.e. 57 bytes per gap. In real-life scenarios the overhead will be small, since a single ghost batch can represent an arbitrarily large gap, even one where many batches were compacted away.

A more complicated alternative would be to force the follower to overwrite its assigned offsets with those coming from the leader. The main flaw of this solution is that it is hard to reason correctly about its impact on Raft protocol safety, as it diverges from the originally published Raft algorithm.