docs/rfcs/030-vectored-timeline-get.md
Created on: 2024-01-02 Author: Christian Schwarz
A brief RFC / GitHub Epic describing a vectored version of the Timeline::get method that is at the heart of Pageserver.
EDIT: the implementation of this feature is described in Vlad's (internal) tech talk.
During basebackup, we issue many Timeline::get calls for SLRU pages that are adjacent in key space.
For an example, see
https://github.com/neondatabase/neon/blob/5c88213eaf1b1e29c610a078d0b380f69ed49a7e/pageserver/src/basebackup.rs#L281-L302.
Each of these Timeline::get calls must traverse the layer map to gather reconstruct data (Timeline::get_reconstruct_data) for the requested page number (blknum in the example).
For each layer visited by layer map traversal, we do a DiskBtree point lookup.
If it's negative (no entry), we resume layer map traversal.
If it's positive, we collect the result in our reconstruct data bag.
If the reconstruct data bag contents suffice to reconstruct the page, we're done with get_reconstruct_data and move on to walredo.
Otherwise, we resume layer map traversal.
Doing this many Timeline::get calls is quite inefficient because:
We should have a vectored aka batched aka scatter-gather style alternative API for Timeline::get. Having such an API unlocks:
There is a new variant of Timeline::get, called Timeline::get_vectored.
It takes as arguments an lsn: Lsn and a src: &[KeyVec] where struct KeyVec { base: Key, count: usize }.
It is up to the implementor to figure out a suitable and efficient way to return the reconstructed page images.
It is sufficient to simply return a Vec<Bytes>, but, likely more efficient solutions can be found after studying all the callers of Timeline::get.
Functionally, the behavior of Timeline::get_vectored is equivalent to
let mut keys_iter: impl Iterator<Item=Key>
= src.map(|KeyVec{ base, count }| (base..base+count)).flatten();
let mut out = Vec::new();
for key in keys_iter {
let data = Timeline::get(key, lsn)?;
out.push(data);
}
return out;
However, unlike above, an ideal solution will
struct Layer at most once.Layer::get_value_reconstruct_data at most once.
DiskBtree page at most once.Each of these items above represents a significant amount of work.
Ideally, the base performance of a vectored get of a single page should be identical to the current Timeline::get.
A reasonable constant overhead over current Timeline::get is acceptable.
The performance improvement for the vectored use case is demonstrated in some way, e.g., using the pagebench basebackup benchmark against a tenant with a lot of SLRU segments.
High-level set of tasks / changes to be made:
Timeline::get_vectored implementation & adopt it across pageserver.Vec<Bytes> vs impl Stream).LayerMap::search (take 1 LSN and N Keys instead of just 1 LSN and 1 Key)Timeline::get_reconstruct_data to hold & return state for N Keys instead of 1
cont_lsn after we've found some reconstruct data for some keys
but need more.
Likely we'll need to keep track of cont_lsn per key and continue next iteration at max(cont_lsn) of all keys that still need data.Layer::get_value_reconstruct_data / DiskBtree
DiskBtreeReader::visit() to collect the (offset,len) pairs for delta record blobs to load.DiskBtreeReader::get to get the offset of the image blob to load. Underneath, that's just a ::visit() call.DiskBtree::visit()?
KeyVec instead of a single Key as argument, i.e., take a single contiguous key range to visit.KeyVec's key range&[KeyVec], sort it;KeyVec range to determine whether we need to descend or back out.KeyVec".DiskBtree::visit produces a set of offsets which we then read from a VirtualFile here
PageCache and `VirtualFile here (not great under pressure)blob_io interface and then the VirtualFile API.VirtualFile API, which sits underneath blob_io, is being touched by ongoing io_uring workPageCache, but there's controversy around the future of PageCache.
PageCache.PageCache as an extra hop in the I/O chain, rather than as an integral part of buffer management.Let's see how we can improve by doing the first three items in above list first, then revisit.
No feature flags are required for this epic.
At the end of this epic, Timeline::get forwards to Timeline::get_vectored, i.e., it's an all-or-nothing type of change.
It is encouraged to deliver this feature incrementally, i.e., do many small PRs over multiple weeks. That will help isolate performance regressions across weekly releases.
Sharding splits up the key space, see functions is_key_local / key_to_shard_number.
Just as with Timeline::get, callers of Timeline::get_vectored are responsible for ensuring that they only ask for blocks of the given struct Timeline's shard.
Given that this is already the case, there shouldn't be significant interaction/interference with sharding.
However, let's have a safety check for this constraint (error or assertion) because there are currently few affordances at the higher layers of Pageserver for sharding<=>keyspace interaction.
For example, KeySpace is not broken up by shard stripe, so if someone naively converted the compaction code to issue a vectored get for a keyspace range it would violate this constraint.