rfcs/2021-10-14-9477-buffer-improvements.md
Vector currently provides two simplistic modes of buffering -- the storing of events as they move between components -- depending on the reliability requirements of the user. Over time, many subtle issues with performance and correctness have been discovered with these buffer implementations. As well, users have asked for more exotic buffering solutions that better fit their architecture and operational processes.
As such, this RFC charts a path to improving the performance and reliability of Vector’s buffering while simultaneously supporting the more advanced use cases that our users have requested.
Vector users often switch to disk buffering to provide a level of reliability when downstream sinks experience temporary issues, or when the machine running Vector, or Vector itself, encounters a problem that could cause a crash.
Effectively, these users turn to disk buffering to increase the reliability of their Vector-based observability pipelines. However, while disk buffering can help if Vector encounters issues, it does not always help if the disk or the machine itself experiences problems.
Likewise, when disk buffers are enabled, we currently push all events through them, which introduces a performance penalty. Even if the source and sink could otherwise talk to each other without bottlenecks, users pay the cost of writing every event to LevelDB and then reading it back. Whether an event actually reaches the disk at that point is up to LevelDB, so writes and reads are sometimes fast and sometimes, when they must touch the disk, slow, which introduces further performance variability.
Buffering would be separated and documented as consisting of multiple facets.
If external buffering is utilized, multiple Vector processes with an identical configuration can participate in processing the events stored in the external buffer. Crucially, the headline new feature would be the ability to configure a buffer in “overflow” mode: an in-memory channel that, when facing backpressure, writes events to either a disk buffer or an external buffer.
We would maintain on-the-wire compatibility with the Vector source/sink by using Protocol Buffers, as we do today for disk buffering. This would allow us to maintain a common format for events that would be usable even between different versions of Vector.
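To make the round trip concrete, here is a minimal sketch using `prost`; the `EventRecord` message and its fields are placeholders for illustration, not Vector's actual event schema:

```rust
use prost::Message;

// Placeholder message; Vector's real event schema is more involved.
#[derive(Clone, PartialEq, Message)]
pub struct EventRecord {
    #[prost(string, tag = "1")]
    pub message: String,
    #[prost(int64, tag = "2")]
    pub timestamp: i64,
}

fn roundtrip(event: &EventRecord) -> Result<EventRecord, prost::DecodeError> {
    // Encode to the shared wire format, then decode it back: any Vector
    // version that shares the schema can read what another version wrote.
    let mut buf = Vec::new();
    event
        .encode(&mut buf)
        .expect("encoding to a Vec never runs out of capacity");
    EventRecord::decode(buf.as_slice())
}
```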
In the spirit of `tower::Service`, all buffers would represent themselves via simple MPSC channels. Two new types, for sending and receiving, would be created as the public-facing side of a buffer. These types would provide a `Sink` and `Stream` implementation, respectively:
```rust
use tokio_util::sync::PollSender;

struct BufferSender {
    // The "primary" buffer channel for this instance.
    base: PollSender<Event>,
    base_ready: bool,
    // Optional next stage to cascade into when `base` is not ready.
    // Boxed to give the recursive type a finite size.
    overflow: Option<Box<BufferSender>>,
    overflow_ready: bool,
}

struct BufferReceiver {
    // `PollReceiver` is the receive-side analogue of `PollSender`: a
    // polling wrapper over the underlying channel's receiver.
    base: PollReceiver<Event>,
    overflow: Option<Box<BufferReceiver>>,
}
```
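As a sketch of how a purely in-memory buffer might be constructed over a bounded tokio channel (the `PollReceiver::new` constructor is an assumption here, since the RFC does not define it):

```rust
use tokio::sync::mpsc;
use tokio_util::sync::PollSender;

// Sketch: a single-stage, in-memory buffer with no overflow configured.
fn memory_buffer(capacity: usize) -> (BufferSender, BufferReceiver) {
    let (tx, rx) = mpsc::channel::<Event>(capacity);

    let sender = BufferSender {
        base: PollSender::new(tx),
        base_ready: false,
        overflow: None,
        overflow_ready: false,
    };

    let receiver = BufferReceiver {
        // Assumed constructor for the hypothetical receive-side wrapper.
        base: PollReceiver::new(rx),
        overflow: None,
    };

    (sender, receiver)
}
```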
The buffer wrapper types would coordinate the internal buffer channels so that they are used in a cascading fashion:
```rust
use futures::Sink;
use std::pin::Pin;
use std::task::{Context, Poll};

impl Sink<Event> for BufferSender {
    type Error = &'static str;

    fn poll_ready(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        // All fields are Unpin, so we can work with a plain mutable reference.
        let this = self.get_mut();

        // Figure out if our base sender is ready, and if not, and we have an
        // overflow sender configured, see if _they're_ ready.
        match Pin::new(&mut this.base).poll_ready(cx) {
            Poll::Ready(Ok(())) => this.base_ready = true,
            _ => {
                if let Some(overflow) = this.overflow.as_mut() {
                    if let Poll::Ready(Ok(())) = Pin::new(overflow.as_mut()).poll_ready(cx) {
                        this.overflow_ready = true;
                    }
                }
            }
        }

        // Some logic here to handle dropping the event or blocking.

        // Either our base sender or our overflow sender is ready, so we can proceed.
        if this.base_ready || this.overflow_ready {
            Poll::Ready(Ok(()))
        } else {
            Poll::Pending
        }
    }

    fn start_send(self: Pin<&mut Self>, item: Event) -> Result<(), Self::Error> {
        let this = self.get_mut();

        let result = if this.base_ready {
            Pin::new(&mut this.base)
                .start_send(item)
                .map_err(|_| "base channel closed")
        } else if this.overflow_ready {
            match this.overflow.as_mut() {
                Some(overflow) => Pin::new(overflow.as_mut()).start_send(item),
                None => Err("overflow ready but no overflow configured"),
            }
        } else {
            Err("called start_send without ready")
        };

        this.base_ready = false;
        this.overflow_ready = false;
        result
    }

    // `poll_flush` and `poll_close` are elided in this sketch; they would
    // delegate to the base sender and any configured overflow.
}
```
Thus, `BufferSender` and `BufferReceiver` would contain a "base" sender/receiver that represents the "primary" buffer channel for that instance, with the ability to "overflow" to another instance of `BufferSender`/`BufferReceiver`. While `BufferSender` would try the base sender first, and then the overflow sender, `BufferReceiver` would operate in reverse, trying the overflow receiver first, then the base receiver.
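For completeness, here is a minimal sketch of what the receiving side might look like, assuming `PollReceiver` exposes a `poll_recv` method equivalent to tokio's `mpsc::Receiver::poll_recv` (the RFC does not spell this implementation out):

```rust
use futures::Stream;
use std::pin::Pin;
use std::task::{Context, Poll};

impl Stream for BufferReceiver {
    type Item = Event;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Event>> {
        let this = self.get_mut();

        // Mirror image of the sender: try the overflow receiver first...
        if let Some(overflow) = this.overflow.as_mut() {
            if let Poll::Ready(Some(event)) = Pin::new(overflow.as_mut()).poll_next(cx) {
                return Poll::Ready(Some(event));
            }
        }

        // ...then fall back to the base receiver.
        this.base.poll_recv(cx)
    }
}
```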
This design isn’t necessarily novel, but, designed right, it allows us to arbitrarily augment the in-memory channel with an overflow strategy, potentially nested even further: in-memory overflowing to external, which in turn overflows to disk, and so on.
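As a rough sketch of how such nesting could be wired up (the `new` constructor and the per-stage senders are hypothetical, purely for illustration):

```rust
// Hypothetical three-stage cascade: in-memory -> external -> disk.
// `BufferSender::new(base, overflow)` is an assumed constructor, not
// something defined by this RFC.
fn build_cascading_sender(
    memory: PollSender<Event>,
    external: PollSender<Event>,
    disk: PollSender<Event>,
) -> BufferSender {
    let disk_stage = BufferSender::new(disk, None);
    let external_stage = BufferSender::new(external, Some(Box::new(disk_stage)));
    BufferSender::new(memory, Some(Box::new(external_stage)))
}
```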
To tie it back to `tower::Service`: in this sense, all implementations would share a common interface that allows them to be layered/wrapped in a generic way.
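For instance, a hypothetical instrumentation wrapper could layer over any buffer stage in the same way a tower `Layer` wraps any `Service` (the type below is illustrative, not part of the proposal):

```rust
use futures::Sink;
use std::pin::Pin;
use std::task::{Context, Poll};

// Counts events as they pass into the wrapped stage; any type implementing
// Sink<Event> can be wrapped, whether in-memory, disk, or external.
struct InstrumentedSender<S> {
    inner: S,
    sent: u64,
}

impl<S: Sink<Event> + Unpin> Sink<Event> for InstrumentedSender<S> {
    type Error = S::Error;

    fn poll_ready(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Pin::new(&mut self.inner).poll_ready(cx)
    }

    fn start_send(mut self: Pin<&mut Self>, item: Event) -> Result<(), Self::Error> {
        self.sent += 1; // the cross-cutting concern lives in the wrapper
        Pin::new(&mut self.inner).start_send(item)
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Pin::new(&mut self.inner).poll_flush(cx)
    }

    fn poll_close(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Pin::new(&mut self.inner).poll_close(cx)
    }
}
```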
Vector’s raison d'être is that it provides both reliability and performance compared to other observability solutions. Vector’s current capabilities for handling errors -- regardless of whether they’re in Vector itself or at a lower level like the operating system or hardware -- do not meet the bar of being both reliable and performant. To that end, improving buffers is simply table stakes for meeting our reliability and performance goals.
If we did not do this, we could certainly still meet our performance goals, but users would not be able to confidently use Vector in scenarios that required high reliability around collecting and shipping observability events. We already know empirically that many users are waiting for these types of improvements before they will consider deploying, and depending on, Vector in their production environments.
Practically speaking, buffering always imposes a performance overhead, even if minimal. Disks can have inexplicably slow performance. Talking to external services over the network can encounter transient but harmful latency spikes. While we will be increasing the reliability of buffering, we’ll also be introducing another potential source of unexpected latency in the overall processing of events.
Of course, we can instrument this new code ahead of time and attempt to ensure we have sufficient telemetry to diagnose problems if and when they occur. Still, we are branching out into an area we do not yet know well, and there is likely to be a learning curve spent debugging potential issues as this code is put through its paces in customer environments.
Many of the alternatives to Vector offer some form of what we call buffering.
Generally speaking, I believe our intended approach is the best synthesis of the various approaches taken by alternative projects, and there is no specific reason to directly copy or tweak our approach based on how they have done it.
In terms of implementing this RFC, only one crate comes fairly close to our desired disk-based reader/writer channel, and that is hopper. However, hopper itself is not a direct fit for our use case.
Due to this, we would likely treat hopper simply as a guide for how to structure and rewrite the disk buffer, as well as a source of ideas for testing it and validating its performance and reliability.
The simplest alternative approach overall would be for users to run Vector such that their observability events are written to external storage: anything from Kafka to an object store like S3. They would then run another Vector process, or a separate pipeline within the same Vector process, that reads those events back. Utilizing end-to-end acknowledgements, they could be sure that events were written to the external storage, and then, upon processing those events on the other side (again with end-to-end acknowledgements), that they reached their destination.
The primary drawback with this approach is that it requires writing all events to external storage, rather than just ones that have overflowed, which significantly affects performance and increases cost.
- Serde supports `Serializer`/`Deserializer` implementations that can generate varying outputs based on the value of a given field, so we will add a new field, tentatively called `advanced`, that can be set to `true` to unlock support for defining a series of buffer types that get nested within one another (see the sketch after this list).
- `io_uring` on Linux (likely via `tokio-uring`) to drive the disk buffer I/O for lower overhead, higher efficiency, higher performance, etc. Potentially usable for Windows as well if `tokio-uring` gained support for Windows' new I/O Rings feature.
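A minimal sketch of how the configuration side might look, assuming serde's untagged-enum support is used to switch on the presence of the `advanced` field (all names and fields below are illustrative, not a defined schema):

```rust
use serde::Deserialize;

// Illustrative only: the real buffer configuration schema is not defined
// by this sketch.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum BufferConfig {
    // `advanced = true` unlocks an ordered list of nested buffer stages.
    Advanced {
        advanced: bool,
        stages: Vec<BufferStage>,
    },
    // Otherwise, the existing single-stage configuration applies.
    Simple(BufferStage),
}

#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum BufferStage {
    Memory { max_events: usize },
    Disk { max_size: u64 },
}
```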