docs/rfcs/2025-04-30-direct-io-for-pageserver.md
Date: Apr 30, 2025
This document is a retroactive RFC: it was written after the work it describes had been implemented and shipped.
The initial proposal that kicked off the work can be found in this closed GitHub PR.
People primarily involved in this project were:
For posterity, here is the rough timeline of the development work that got us to where we are today.
- tokio-epoll-uring along with owned buffers API
- tokio-epoll-uring enabled in all regions in buffered IO mode

Kernel page cache: the Linux kernel's page cache is a write-back cache for filesystem contents. The cached unit is memory-page-sized & aligned chunks of the files that are being cached (typically 4k). The cache lives in kernel memory and is not directly accessible from userspace.
Buffered IO: an application's read/write system calls go through the kernel page cache.
For example, a 10 byte sized read or write to offset 5000 in a file will load the file contents
at offset [4096,8192) into a free page in the kernel page cache. If necessary, it will evict
a page to make room (cf eviction). Then, the kernel performs a memory-to-memory copy of 10 bytes
from/to offset 904 (5000 = 4096 + 904) within the cached page. If it's a write, the kernel keeps
track of the fact that the page is now "dirty" in some ancillary structure.
Writeback: a buffered read/write syscall returns after the memory-to-memory copy. The modifications
made by e.g. write system calls are not even issued to disk, let alone durable. Instead, the kernel
asynchronously writes back dirtied pages based on a variety of conditions. For us, the most relevant
ones are a) explicit request by userspace (fsync) and b) memory pressure.
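To make the write/writeback split concrete, here is a minimal illustration in Rust (the path is an arbitrary example):

```rust
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut f = File::create("/tmp/buffered-io-demo")?;
    // Returns after a memory-to-memory copy into the kernel page cache;
    // the data is neither issued to disk nor durable yet.
    f.write_all(b"hello")?;
    // fsync: explicitly request writeback and wait for it to complete.
    f.sync_all()?;
    Ok(())
}
```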
Memory pressure: the kernel page cache is a best effort service and a user of spare memory capacity.
If there is no free memory, the kernel page allocator will take pages used by page cache to satisfy allocations.
Before reusing a page like that, the page has to be written back (writeback, see above).
The far-reaching consequence of this is that any allocation of anonymous memory can do IO if the only
way to get that memory is by eviction & re-using a dirty page cache page.
Notably, this includes a simple malloc in userspace, because eventually that boils down to mmap(..., MAP_ANON, ...).
I refer to this effect as the "malloc latency backscatter" caused by buffered IO.
Direct IO allows application's read/write system calls to bypass the kernel page cache. The filesystem
is still involved because it is ultimately in charge of mapping the concept of files & offsets within them
to sectors on block devices. Typically, the filesystem imposes size and alignment requirements on memory buffers
and file offsets (statx Dio_mem_align / Dio_offset_align); see this gist.
The IO operations will fail at runtime with EINVAL if the alignment requirements are not met.
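As a minimal sketch of what that looks like (assumptions: Linux >= 6.1 for STATX_DIOALIGN, the libc crate, an arbitrary example path, abbreviated error handling):

```rust
use std::fs::OpenOptions;
use std::os::unix::fs::OpenOptionsExt;

fn main() -> std::io::Result<()> {
    // Discover the filesystem/device alignment requirements dynamically.
    let path = std::ffi::CString::new("/tmp/some-file").unwrap();
    let mut stx: libc::statx = unsafe { std::mem::zeroed() };
    let rc = unsafe {
        libc::statx(libc::AT_FDCWD, path.as_ptr(), 0, libc::STATX_DIOALIGN, &mut stx)
    };
    assert_eq!(rc, 0);
    println!(
        "buffer alignment: {}, offset alignment: {}",
        stx.stx_dio_mem_align, stx.stx_dio_offset_align
    );

    // Open with O_DIRECT: IO now bypasses the kernel page cache, and any
    // read/write with an unaligned buffer/offset/length fails with EINVAL.
    let _file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open("/tmp/some-file")?;
    Ok(())
}
```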
"buffered" vs "direct": the central distinction between buffered and direct IO is about who allocates and fills the IO buffers, and who controls when exactly the IOs are issued. In buffered IO, it's the syscall handlers, kernel page cache, and memory management subsystems (cf "writeback"). In direct IO, all of it is done by the application. It takes more effort by the application to program with direct instead of buffered IO. The return is precise control over and a clear distinction between consumption/modification of memory vs disk.
Pageserver PageCache: Pageserver has an additional PageCache (referred to as PS PageCache from here on, as opposed to "kernel page cache").
Its caching unit is 8KiB blocks of the layer files written by Pageserver.
A miss in PageCache is filled by reading from the filesystem, through the VirtualFile abstraction layer.
The default size is tiny (64MiB), very much like Postgres's shared_buffers.
We ran production at 128MiB for a long time but gradually moved it up to 2GiB over the past ~year.
VirtualFile is Pageserver's abstraction for file IO, very similar to the facility in Postgres that bears the same name.
Its historical purpose appears to be working around open file descriptor limitations, which are practically irrelevant on Linux.
However, the facility in Pageserver is useful as an intermediary layer for metrics and abstracts over the different kinds of
IO engines that Pageserver supports (std-fs vs tokio-epoll-uring).
For multiple years, Pageserver's PageCache was on the path of all read and write IO.
It performed write-back to the kernel using buffered IO.
We converted it into a read-only cache of immutable data in PR 4994.
The introduction of tokio-epoll-uring required converting the code base to use owned IO buffers.
The PageCache pages are usable as owned IO buffers.
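For context, the owned-buffers style looks roughly like the following sketch; the trait and signature are illustrative of the pattern (as in tokio-uring), not the actual tokio-epoll-uring API:

```rust
use std::io;

// The kernel may still be writing into the buffer when a caller drops its
// future mid-IO, so borrowing (&mut [u8]) is unsound with io_uring. The
// owned-buffers style moves the buffer into the operation and hands it
// back together with the result.
pub trait IoBufMut: Send + 'static {
    fn as_mut_ptr(&mut self) -> *mut u8;
    fn capacity(&self) -> usize;
}

pub struct File; // stand-in for the real file type

impl File {
    pub async fn read_at<B: IoBufMut>(&self, buf: B, _offset: u64) -> (B, io::Result<usize>) {
        // ... submit to io_uring, await completion, then return the buffer ...
        (buf, Ok(0)) // stub result; a real implementation fills `buf`
    }
}
```

A caller then writes `let (buf, res) = file.read_at(buf, offset).await;`, getting the buffer back regardless of the outcome.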
We then started bypassing PageCache for user data blocks.
Data blocks are the 8k blocks of data in layer files that hold the Values, as opposed to the disk btree index blocks that tell us which values exist in a file and at what offsets.
The disk btree embedded in delta & image layers remains PageCache'd.
Epics for that work were:
Timeline::get (cf RFC 30) skipped delta and image layer data block PageCacheing outright.

The outcome of the above:
- Data block reads go through the VirtualFile APIs, hitting the kernel buffered read path (=> kernel page cache).
- Index blocks continue to be cached in the PS PageCache.

In production we size the PS PageCache to be 2GiB.
This drives the hit rate up to ~99.95% and the eviction/replacement rate down to less than 200/second (1-minute average) on the busiest machines.
High baseline replacement rates are treated as a signal of resource exhaustion (page cache insufficient to host working set of the PS).
The response to this is to migrate tenants away, or increase PS PageCache size.
It is currently manual but could be automated, e.g., in Storage Controller.
In the future, we may eliminate the PageCache even for indirect blocks.
For example, with an LRU cache whose unit is the entire disk btree content
instead of individual blocks.
So, before work on this project started, all data block reads and the entire write path of Pageserver were using kernel-buffered IO, i.e., the kernel page cache. We now want to get the kernel page cache out of the picture by using direct IO for all interaction with the filesystem. This achieves the following system properties:
- Predictable VirtualFile latencies
- Explicitness & Tangibility of resource usage
- CPU Efficiency
The trade-off is that we no longer get the theoretical benefits of the kernel page cache. These are:
We are happy to make this trade-off:
The desired end state of the project is as follows, and with some asterisks, we have achieved it.
All IOs of the Pageserver data path use direct IO, thereby bypassing the kernel page cache.
In particular, the "data path" includes
- the Timeline::get / Timeline::get_vectored path.

The production Pageserver config is tuned such that virtually all non-data blocks are cached in the PS PageCache. Hit rate target is 99.95%.
There are no regressions to ingest latency.
The total "wait-for-disk time" contribution to random getpage request latency is O(1 read IOP latency).
We accomplish that by having a near-100% PS PageCache hit rate so that layer index traversal effectively never needs to wait for IO.
Thereby, it can issue reads for all the data blocks as it traverses the index, and only wait at the end (concurrent IO).
The amortized "wait-for-disk time" contribution of this direct IO proposal to a series of sequential getpage requests is 1/32 * read IOP latency for each getpage request.
We accomplish this by server-side batching of up to 32 reads into a single Timeline::get_vectored call.
(This is an ideal world where our batches are full - that's not the case in prod today because of lack of queue depth).
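For illustration (the latency figure is an assumed round number, not a measurement): at ~100µs per NVMe read IOP, a full batch of 32 getpage requests pays the read latency once, i.e., roughly 100µs / 32 ≈ 3µs of wait-for-disk time per request.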
A lot of prerequisite work had to happen to enable use of direct IO.
To meet the "wait-for-disk time" requirements from the DoD, we implement for the read path:
- server-side batching (page_service_pipelining)
- concurrent IO (get_vectored_concurrent_io)
The work for both of these was tracked in the epic.
Server-side batching will likely be obsoleted by the #proj-compute-communicator.
The Concurrent IO work is described in retroactive RFC 2025-04-30-pageserver-concurrent-io-on-read-path.md.
The implementation is relatively brittle and needs further investment, see the Future Work section in that RFC.

For the write path, and especially WAL ingest, we need to hide write latency.
We accomplish this by implementing a BufferedWriter type that does double-buffering: flushes of the filled
buffer happen in a sidecar tokio task while new writes fill a new buffer.
We refactor InMemoryLayer as well as BlobWriter (=> delta and image layer writers) to use this new BufferedWriter.
The most comprehensive write-up of this work is in the PR description.
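The core mechanism, reduced to a self-contained sketch (this is not the actual BufferedWriter; the names and channel-based hand-off are illustrative):

```rust
use tokio::sync::mpsc;

const CAPACITY: usize = 8192;

struct DoubleBufferedWriter {
    mutable: Vec<u8>,
    to_flush: mpsc::Sender<Vec<u8>>, // full buffers -> flush task
    reuse: mpsc::Receiver<Vec<u8>>,  // empty buffers <- flush task
}

impl DoubleBufferedWriter {
    fn new() -> Self {
        let (to_flush, mut flush_rx) = mpsc::channel::<Vec<u8>>(1);
        let (reuse_tx, reuse) = mpsc::channel::<Vec<u8>>(1);
        // Pre-seed one spare buffer so the first swap does not wait.
        reuse_tx.try_send(Vec::with_capacity(CAPACITY)).unwrap();
        // Sidecar flush task: in the real code this performs the (direct)
        // IO write of exactly CAPACITY bytes at offset start + N * CAPACITY.
        let _ = tokio::spawn(async move {
            while let Some(mut buf) = flush_rx.recv().await {
                // ... write `buf` to disk here ...
                buf.clear();
                if reuse_tx.send(buf).await.is_err() {
                    break; // writer dropped
                }
            }
        });
        Self {
            mutable: Vec::with_capacity(CAPACITY),
            to_flush,
            reuse,
        }
    }

    async fn write_all(&mut self, mut data: &[u8]) {
        while !data.is_empty() {
            let room = CAPACITY - self.mutable.len();
            let n = room.min(data.len());
            self.mutable.extend_from_slice(&data[..n]);
            data = &data[n..];
            if self.mutable.len() == CAPACITY {
                // Swap buffers; we only wait here if the flush task has
                // not finished writing out the previous buffer yet.
                let empty = self.reuse.recv().await.expect("flush task alive");
                let full = std::mem::replace(&mut self.mutable, empty);
                self.to_flush.send(full).await.expect("flush task alive");
            }
        }
    }
}
```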
Direct IO puts requirements on memory buffer alignment, file offset alignment, and IO size multiples.
The requirements are specific to a combination of filesystem/block-device/architecture(hardware page size!).
In Neon production environments we currently use ext4 with Linux 6.1.X on AWS and Azure storage-optimized instances (locally attached NVMe).
Instead of dynamic discovery using statx, we statically hard-code 512 bytes as the buffer/offset alignment and size-multiple.
We made this decision because:
- 512-byte granularity avoids read amplification when the Values that need to be read are far apart.

This was discussed here.
The new IoBufAligned / IoBufAlignedMut marker traits indicate that a given buffer meets memory alignment requirements.
All VirtualFile APIs and several software layers built on top of them only accept buffers that implement those traits.
Implementors of the marker traits are:
- IoBuffer / IoBufferMut: used for most reads and writes
- PageWriteGuardBuf: for filling PS PageCache pages (index blocks!)

The alignment requirement is infectious; it permeates bottom-up throughout the code base.
We stop the infection at roughly the same layers in the code base where we stopped permeating the
use of owned-buffers-style APIs for tokio-epoll-uring. The way the stopping works is by introducing
a memory-to-memory copy from/to some unaligned memory location on the stack or heap.
The places where we currently stop permeating are sort of arbitrary. For example, it would probably
make sense to replace more usage of Bytes that we know holds 8k pages with 8k-sized IoBuffers.
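A hedged sketch of the marker-trait idea (the names match the RFC, but the real definitions in the Pageserver code base differ): alignment is established once at allocation time and then carried in the type, so IO APIs can demand it at compile time.

```rust
use std::alloc::{alloc, dealloc, Layout};

const ALIGN: usize = 512; // hard-coded, see above

/// Marker trait: implementors guarantee an ALIGN-aligned allocation.
/// Unsafe to implement because the IO code relies on the guarantee.
unsafe trait IoBufAligned {}

struct IoBufferMut {
    ptr: *mut u8,
    cap: usize,
}

impl IoBufferMut {
    fn with_capacity(cap: usize) -> Self {
        assert_eq!(cap % ALIGN, 0);
        let layout = Layout::from_size_align(cap, ALIGN).unwrap();
        let ptr = unsafe { alloc(layout) };
        assert!(!ptr.is_null());
        IoBufferMut { ptr, cap }
    }
}

// Sound: the allocation above is ALIGN-aligned by construction.
unsafe impl IoBufAligned for IoBufferMut {}

impl Drop for IoBufferMut {
    fn drop(&mut self) {
        let layout = Layout::from_size_align(self.cap, ALIGN).unwrap();
        unsafe { dealloc(self.ptr, layout) };
    }
}

// A VirtualFile-style API can then require the marker, e.g.:
// async fn read_exact_at<B: IoBufAligned>(&self, buf: B, offset: u64) -> ...
```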
The IoBufAligned / IoBufAlignedMut types do not protect us from the following types of runtime errors:
The following higher-level constructs ensure we meet the requirements:
- ChunkedVectoredReadBuilder and mod vectored_dio_read ensure reads happen at aligned offsets and in appropriate size multiples.
- BufferedWriter only writes in multiples of the capacity, at offsets that are start_offset + N * capacity; see its doc comment.

Note that these types are always used, regardless of whether direct IO is enabled or not. There are some cases where this adds unnecessary overhead to buffered IO (e.g. all memcpy's inflated to multiples of 512). But we could not identify meaningful impact in practice when we shipped these changes while we were still using buffered IO.
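As an illustration of the offset arithmetic such a builder must perform (this is a sketch, not the actual ChunkedVectoredReadBuilder code): an arbitrary byte range is expanded to the enclosing 512-aligned range.

```rust
const ALIGN: u64 = 512;

fn round_down(x: u64) -> u64 {
    x / ALIGN * ALIGN
}

fn round_up(x: u64) -> u64 {
    (x + ALIGN - 1) / ALIGN * ALIGN
}

fn main() {
    // A logical read of bytes [5000, 5010) must be issued as the
    // aligned read [4608, 5120) when O_DIRECT is enabled.
    let (start, end) = (round_down(5000), round_up(5010));
    assert_eq!((start, end), (4608, 5120));
}
```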
In the previous section we described how all users of VirtualFile were changed to always adhere to direct IO alignment and size-multiple requirements.
To actually enable direct IO, all we need to do is set the O_DIRECT flag in open syscalls / io_uring operations.
We set O_DIRECT based on:
- the virtual_file_io_mode configuration flag
- the OpenOptions read and/or write flags.

The VirtualFile APIs suffixed with _v2 are the only ones that may open with O_DIRECT, depending on the two factors in the above list.
Other APIs never use O_DIRECT.
(The name is bad and should really be _maybe_direct_io.)
The reason for having new APIs is that all code used VirtualFile, but implementation and rollout happened in consecutive phases (read path, InMemoryLayer, write path). At the VirtualFile level, there is no context on whether a given instance is on the read path, in InMemoryLayer, or on the write path.
The _v2 APIs then make the decision to set O_DIRECT based on the virtual_file_io_mode flag and the OpenOptions read/write flags.
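In pseudocode, the decision is roughly the following (the enum and function names are hypothetical; the behavior mirrors the table below):

```rust
#[derive(Clone, Copy)]
enum IoMode {
    Buffered, // =buffered
    Direct,   // =direct: O_DIRECT only for read-only opens (read path)
    DirectRw, // =direct-rw: O_DIRECT for reads and writes
}

fn set_o_direct(mode: IoMode, read: bool, write: bool) -> bool {
    match mode {
        IoMode::Buffered => false,
        // =direct only covers the read path: read-only opens get O_DIRECT.
        IoMode::Direct => read && !write,
        IoMode::DirectRw => true,
    }
}
```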
The result is the following runtime behavior:
| what | OpenOptions | v_f_io_mode=buffered | v_f_io_mode=direct | v_f_io_mode=direct-rw |
|-|-|-|-|-|
| DeltaLayerInner | read | () | O_DIRECT | O_DIRECT |
| ImageLayerInner | read | () | O_DIRECT | O_DIRECT |
| InMemoryLayer | read + write | () | ()* | O_DIRECT |
| DeltaLayerWriter | write | () | () | O_DIRECT |
| ImageLayerWriter | write | () | () | O_DIRECT |
| download_layer_file | write | () | () | O_DIRECT |
The InMemoryLayer is marked with * because there was a period when it did use O_DIRECT under =direct.
That period was when we implemented and shipped the first version of BufferedWriter.
We used it in InMemoryLayer and download_layer_file but it was only sensitive to v_f_io_mode in InMemoryLayer.
The introduction of =direct-rw, and the switch of the remaining write path to BufferedWriter, happened later,
in https://github.com/neondatabase/neon/pull/11558.
Note that this way of feature flagging inside VirtualFile makes it less and less a general purpose POSIX file access abstraction.
For example, with =direct-rw enabled, it is no longer possible to open a VirtualFile without O_DIRECT. It'll always be set.
The correctness risks with this project were:

- Memory unsafety in the IoBuffer / IoBufferMut implementation.
  These types expose an API that is largely identical to that of the bytes crate and/or Vec.
- Non-adherence to the alignment/size-multiple requirements, which surfaces only at runtime (EINVAL).

We sadly do not have infrastructure to run pageserver under cargo miri.
So for memory safety issues, we relied on careful peer review.
We do assert the production-like alignment requirements in testing builds.
However, these asserts were added retroactively.
The actual validation before rollout happened in staging and pre-prod.
We eventually enabled =direct/=direct-rw for Rust unit tests and the regression test suite.
I cannot recall a single instance of staging/pre-prod/production errors caused by non-adherence to alignment/size-multiple requirements.
Evidently developer testing was good enough.
The read path went through a lot of iterations of benchmarking in staging and pre-prod. The benchmarks in those environments demonstrated performance regressions early in the implementation. It was actually this performance testing that made us implement batching and concurrent IO to avoid unacceptable regressions.
The write path was much quicker to validate because bench_ingest covered all of the (less numerous) access patterns.
There is minor and major follow-up work that can be considered in the future. Check the (soon-to-be-closed) Epic https://github.com/neondatabase/neon/issues/8130's "Follow-Ups" section for a current list.
Read Path:
Write Path:
- TempVirtualFile introduced as part of this project could internalize more of the common usage pattern: https://github.com/neondatabase/neon/issues/11692
- virtual_file_io_mode: https://github.com/neondatabase/neon/issues/11676

Both:
Misc:
- VirtualFile::crashsafe_overwrite and VirtualFile::read_to_string are good entrypoints for cleanup: https://github.com/neondatabase/neon/issues/11809

In the Motivation section, we stated:
- The kernel page cache is ineffective at high tenant density anyway (#tenants/pageserver instance).
The reason is that the workload Computes send to Pageserver is whatever misses the Compute's caches.
That's either sequential scans or random reads.
A random read workload simply causes cache thrashing because a packed Pageserver NVMe drive (im4gn.2xlarge) has ~100x more capacity than DRAM available.
It is a complete waste to have the kernel page cache cache data blocks in this case.
Sequential read workloads can benefit iff those pages have been updated recently (=no image layer yet) and together in time/LSN space.
In such cases, the WAL records of those updates likely sit on the same delta layer block.
When Compute does a sequential scan, it sends a series of single-page requests for these individual pages.
When Pageserver processes the second request in such a series, it goes to the same delta layer block and gets a kernel page cache hit.
This dependence on kernel page cache for sequential scan performance is significant, but the solution is at a higher level than generic data block caching.
We could either add a small per-connection LRU cache for such delta layer blocks,
or merge those sequential requests into a larger vectored get request, which is designed to never read a block twice.
This amortizes the read latency for our delta layer block across the vectored get batch size (which currently is up to 32).
There are Pageserver-internal workloads that do sequential access (compaction, image layer generation), but these are not subject to the page_service protocol constraints that force single-page requests.