# RFC 3480 - 2020-10-15 - File source rework
The file source is one of Vector's oldest and most widely used components.
While it began with a relatively simple interface and implementation, it has
collected a large number of knobs, warts, and general issues over time. To
ensure that this important source continues to serve users well, we should
invest effort into making it easier to use, extend, and debug.
This RFC will cover a general overhaul of the file source internals and
user-facing config. The focus will be on migrating the existing implementation
to a better structure, not rewriting from scratch.
There are three pages of issues tagged `source: file` in the Vector repo. Below
I've tried to collect some of the more notable ones and categorize them:
- File source retains files on disk (#763)
- Checksums are confusing (#828)
- `start_at_beginning` is confusing (#1020)
- Only warn about small files, not empty ones (#1065)
- Periodically read a non-file file (#1198)
- Clean up file checkpoint files (#1427)
- File source slows down with many completed files (#1466)
- Add path-based fingerprinting (#1948)
- Log permissions issues (#2420)
- Stop after reading files (backfill mode) (#3216)
- Tailing limited to a single core (#3379)
- Not releasing descriptors due to falling behind (#3440)
- `ignore_older` is confusing (#3567)
- Ignore dangling symlinks (optionally?) (#3662)
- Make it easier to reuse internally (#4048)
- Batch mode (#4271)
- Tail only new data (#4382)
- Slow with millions of files on EBS (#4434)
The goal of this RFC is not to fix all of these issues in one fell swoop, but to suggest a reorganization that renders the relevant behaviors as orthogonal as possible and enables addressing them easily and independently.
In order to achieve our goals, we need to break down the mostly-monolithic
file source into smaller components that can be understood, configured, and
extended in isolation, and then assembled smoothly into a cohesive unit.
To start, let's examine the current top-level structure of the file source.
This doesn't cover every interesting thing we want to address, but it's useful
context. In very rough pseudocode, it does the following:
```rust
let checkpoints = load_checkpoints();

// Find files we're configured to watch
let file_list = look_for_files();

// Prioritize files that exist on startup
sort(file_list);

loop {
    // Do these things occasionally to avoid burning CPU
    if its_time() {
        checkpoints.persist();
        let current_file_list = look_for_files();
        reconcile(&mut file_list, current_file_list);
    }

    for file in file_list {
        // Don't check inactive files as often
        if !should_read(file) {
            continue;
        }

        // Try to read new data from the files
        while let Some(line) = file.read_line() {
            output.push(line);
            // But not an infinite amount at one time
            if limit_reached { break }
        }

        // If configured, rm files when we're done with them
        maybe_rm(file);

        // Either continue to read the next file, or break to start back at the
        // beginning of the (prioritized) list
        if should_not_read_next_file() {
            break;
        }
    }

    // Drop handles of deleted files that we've finished reading
    unwatch_dead(&mut file_list);

    // Send the gathered data downstream
    emit(output);

    // If we're not seeing any new data, back off to avoid burning CPU
    maybe_backoff();

    // If Vector is shutting down, stop processing
    maybe_shutdown();
}
```
Excepting observability and performance, the primary user-facing concerns in the issues above are roughly as follows:
These concerns are a good guide for determining which parts of the implementation tend to change together. Our first goal is to isolate those areas of change from one another and construct subcomponents that effectively contain complexity. With those seams in place, it should become simpler to improve the internals of each.
Based on the user-level concerns we collected from the list of issues, let's break down what each respective subcomponent should look like.
In the pseudocode above, read scheduling is controlled by `sort`, `should_read`,
`limit_reached`, `should_not_read_next_file`, and `maybe_backoff`. A component
would need to answer the following questions:
These questions would then be answered by a combination of configuration, file metadata, and gathered statistics.
There are (at least) two approaches we could take to building this component. The most obvious would be to implement a struct that contains configuration and exposes methods very much like the ones mentioned above. This would maintain the current structure of one main read loop with multiple points of control. An alternative would be to introduce a trait representing the logic of the read loop. This may lead to some duplication, but would allow simpler separation of different use cases and require that readers of the code understand the interaction between fewer subtle points of control.
I would propose that we begin with the simpler method of consolidating logic from the existing structure. This involves fewer design decisions, and we may find, given the other simplifications we're planning to make, that further simplifying the read loop is not worthwhile. That being said, once related features (like the batch mode discussed later) are complete, we can reevaluate this decision.
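To make the struct-based option concrete, here is a minimal sketch of what such a scheduling component might expose. All names and policies (e.g. the 10-tick interval for inactive files) are illustrative, not Vector's actual API:

```rust
use std::time::Duration;

/// Hypothetical consolidation of the scheduling knobs from the pseudocode
/// (`should_read`, `limit_reached`, `should_not_read_next_file`).
pub struct ReadScheduler {
    /// Maximum bytes to pull from one file before yielding to the next.
    pub max_read_bytes: usize,
    /// Files idle longer than this are polled less often.
    pub inactive_threshold: Duration,
    /// When true, fully drain the current file before moving on
    /// (the `oldest_first` behavior).
    pub drain_current_first: bool,
}

impl ReadScheduler {
    /// Should we attempt a read on a file that was last active `idle` ago?
    pub fn should_read(&self, idle: Duration, ticks_since_last_check: u32) -> bool {
        // Active files are checked every tick; inactive ones only every 10 ticks.
        idle < self.inactive_threshold || ticks_since_last_check >= 10
    }

    /// Have we read enough from this file for now?
    pub fn limit_reached(&self, bytes_read: usize) -> bool {
        bytes_read >= self.max_read_bytes
    }

    /// After reading one file, continue to the next or restart the loop?
    pub fn should_read_next_file(&self, current_file_hit_limit: bool) -> bool {
        // In drain-first mode, stay on a file that still has data.
        !(self.drain_current_first && current_file_hit_limit)
    }
}
```

Collecting these decisions behind one type would let the read loop shrink to a series of calls into the scheduler, which is the "single point of control" property the struct approach is meant to provide.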
Identity is an area where we already have the beginning of modularity with the
`Fingerprinter` abstraction, but it's incomplete. In our pseudocode, identity is
involved in both `look_for_files` and `reconcile`. It's not just a way to figure
out a magic identifier, but also the logic to update our list of watched files
based on those identifiers. It needs to answer the following:
This logic is mostly in one place in the current implementation, so it should
not be terribly difficult to extract it. The use of `Fingerprinter` should
likely become purely internal to the new component.
In addition to simply consolidating the logic, we can expand and make the use of
`Fingerprinter` more intelligent. We currently have three ways it can work:
I'd first propose that we drop device and inode fingerprinting and add path-based fingerprinting in its place. This gives users the option to do the simplest possible thing for use cases that don't need to worry about traditional rotation.
Next, I suggest we unify the two checksumming strategies. Neither is perfectly general, but we should be able to come up with a simple algorithm that combines the best of both. As a prerequisite, we should unify the read path such that these checksums handle compressed files correctly (discussed later). With that in place, the algorithm can look something like the following:
- Checksum up to `max_line_length` bytes from the file, starting at `ignored_header_bytes`

This should give a good balance between usability and flexibility for the default strategy. As we implement it, we should evolve the current representation of fingerprints to one that maintains information about how it was determined. This will allow us more flexibility to evolve and/or combine strategies in the future.
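The windowing step above can be sketched as a small pure function. FNV-1a stands in here for whatever checksum the real implementation would use, and the parameter names mirror the proposal but are illustrative:

```rust
/// Sketch of the unified checksum fingerprint: hash up to `max_line_length`
/// bytes of content, skipping the first `ignored_header_bytes`.
pub fn fingerprint(
    contents: &[u8],
    ignored_header_bytes: usize,
    max_line_length: usize,
) -> Option<u64> {
    let start = ignored_header_bytes.min(contents.len());
    let end = (start + max_line_length).min(contents.len());
    let window = &contents[start..end];
    if window.is_empty() {
        // Not enough data yet to identify the file.
        return None;
    }
    // FNV-1a over the window (stand-in checksum).
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in window {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    Some(hash)
}
```

Note that the function operates on already-decoded bytes, which is why unifying the read path for compressed files (discussed later) is a prerequisite: the same window of logical content should fingerprint identically before and after rotation-with-compression.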
This logic is one layer below what's represented in the pseudocode above, but
there is just as much complexity. When we look_for_files and build watchers
for them, we need to make a decision about where to start reading in that file.
That decision should be based on any stored checkpoint, file metadata (e.g.
`mtime`), whether the file was found at startup or at runtime, and how the source is
configured.
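That decision can be sketched as a small pure function. `ReadFrom`, `StartPosition`, and the argument names are hypothetical, and the mtime-based `skip_older_than` check is omitted for brevity:

```rust
/// Where to begin reading a newly-watched file.
#[derive(Debug, PartialEq)]
pub enum StartPosition {
    Beginning,
    End,
    Checkpoint(u64),
}

/// User-facing preference for files without a usable checkpoint.
pub enum ReadFrom {
    Beginning,
    End,
}

pub fn starting_position(
    checkpoint: Option<u64>,
    ignore_checkpoints: bool,
    read_from: ReadFrom,
    found_at_startup: bool,
) -> StartPosition {
    match checkpoint {
        // A stored checkpoint wins unless the user told us to ignore it.
        Some(offset) if !ignore_checkpoints => StartPosition::Checkpoint(offset),
        _ => match (read_from, found_at_startup) {
            // Files discovered at runtime are new to us; read them in full.
            (_, false) => StartPosition::Beginning,
            (ReadFrom::Beginning, true) => StartPosition::Beginning,
            (ReadFrom::End, true) => StartPosition::End,
        },
    }
}
```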
Since this decision really only happens in one place, the challenge is more about providing an understandable config UI than designing the right interface. This should be driven by real-world use cases.
I would suggest a config like the following:
- `ignore_checkpoints = true|false` (still write, but don't read, existing checkpoints)
- `read_from = beginning|end` (where to start if there's no checkpoint or they're being ignored)
- `skip_older_than = duration` (relevant when `read_from = beginning`; seek to the end based on mtime)
- … (`mv` from a create and write, so don't rely on seeing an empty file first to get all the data)

If we adopt this, we can also implement a solution where we don't hold open file
handles for files that match `skip_older_than`, since this has caused some
issues. Naively, we need the open handle to attempt reads in case new data is
written to the file, but we could also implement that as a new state of
`FileWatcher` that stashes the initial size and periodically checks the file
metadata (mtime and size) to see if it should start reading at what was the end.
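A rough sketch of that new state, with hypothetical names (the real `FileWatcher` would hold more than this):

```rust
use std::time::SystemTime;

/// Hypothetical `FileWatcher` state for files matching `skip_older_than`:
/// instead of holding an open handle, stash the size we observed and only
/// promote the file to active reading when its metadata changes.
#[derive(Debug, PartialEq)]
pub enum WatchState {
    /// No handle held; remember the size at which we skipped the file.
    Passive { stashed_size: u64, mtime: SystemTime },
    /// Handle open, reading from `offset`.
    Active { offset: u64 },
}

impl WatchState {
    /// Periodic metadata check: start reading at what used to be the end
    /// if the file has grown or been touched since we stashed it.
    pub fn check(self, current_size: u64, current_mtime: SystemTime) -> WatchState {
        match self {
            WatchState::Passive { stashed_size, mtime }
                if current_size > stashed_size || current_mtime > mtime =>
            {
                WatchState::Active { offset: stashed_size }
            }
            other => other,
        }
    }
}
```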
In addition to extracting these various subcomponents, there are some other relatively simple changes we can make to help address the issues we discussed.
We currently use the outdated `glob_minimum_cooldown` config option to determine
how often to do both of these tasks. We should switch them to their own
independent config options and allow them to be disabled (e.g. a batch use case
does not need to continuously look for new files).
We should also move them into their own periodic tasks, outside of the main read loop. This should give us better performance and help avoid bad behavior in situations where either is expensive (e.g. discovering millions of files on EBS).
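As a minimal sketch of independently configurable periodic tasks, an interval gate like the following could drive each one; names are illustrative, and a real implementation would likely use proper timer tasks rather than polling:

```rust
use std::time::{Duration, Instant};

/// Minimal interval gate for the discovery and checkpoint-persistence tasks,
/// replacing the shared `glob_minimum_cooldown` timing. `interval = None`
/// disables the task entirely (e.g. discovery in a batch use case).
pub struct Periodic {
    interval: Option<Duration>,
    last_run: Option<Instant>,
}

impl Periodic {
    pub fn new(interval: Option<Duration>) -> Self {
        Periodic { interval, last_run: None }
    }

    /// Returns true (and records the run) when the task is due.
    pub fn due(&mut self, now: Instant) -> bool {
        let Some(interval) = self.interval else { return false };
        match self.last_run {
            Some(last) if now.duration_since(last) < interval => false,
            _ => {
                self.last_run = Some(now);
                true
            }
        }
    }
}
```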
With most of the other concerns pulled out, we can afford to adjust the main read loop to allow for some new capabilities. The most interesting is concurrent reads, but it will require a bit of experimentation before we're able to determine if it's worthwhile. There are a few possible approaches:
- `io_uring`

The first two both introduce the questions of sizing and the ability of the underlying file system to enable concurrent access in a way that actually adds performance. We would need to test a wide variety of scenarios to evaluate the best path for either.
Much more interesting, but also more limited, would be to build on top of `io_uring`.
Given that it's only available on modern Linux, it could not be the only
implementation. But since modern Linux accounts for the majority of Vector usage, it could
be enough to cover the most demanding use cases. This approach would let us
avoid any questions of thread counts and defer concurrency to the kernel, which
is much better equipped to make use of the available hardware.
Given these options, I would currently propose that we wait. There are other
changes in this RFC that should impact performance positively, and we'll have
a better view of the potential cost/benefit of these approaches once those have
landed. When we reach that phase, I would suggest starting with the tokio
filesystem interface, as future improvements like `io_uring` are likely to better
match the async interface.
As it stands, both `Fingerprinter` and `FileWatcher` instances read data from
files, one to checksum that data and the other to return lines. The problem is
that only `FileWatcher` handles compression, so our fingerprints can get
confused if a file is rotated and compressed.
To address that, we should evolve `FileWatcher` into a more general wrapper
struct for file handles. This will allow us to encapsulate all direct file
access within the struct, where we can more easily ensure that concerns like
compression are handled uniformly.
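One piece of that unified read path can be sketched concretely: classifying a file by its leading bytes so both fingerprinting and line reading agree on how to decode it. Only detection is shown here; the actual gzip decoding would be layered in by the wrapper (e.g. via a `flate2`-style decoder), and the names are illustrative:

```rust
/// How the wrapper should decode a file's bytes.
#[derive(Debug, PartialEq)]
pub enum Encoding {
    Plain,
    Gzip,
}

/// Peek at the first bytes of a file to classify it.
pub fn detect_encoding(header: &[u8]) -> Encoding {
    // Gzip streams start with the magic bytes 0x1f 0x8b.
    if header.len() >= 2 && header[0] == 0x1f && header[1] == 0x8b {
        Encoding::Gzip
    } else {
        Encoding::Plain
    }
}
```

With detection living in one place, a rotated-and-compressed file can be fingerprinted over its decompressed content and still match its pre-rotation identity.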
In addition to our current shutdown logic, we should add a configuration option and corresponding conditional for exiting the source once all files have reached EOF. This, along with disabled file discovery, would neatly and simply implement the oft-requested batch mode.
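The exit condition itself is tiny; a hedged sketch (option and argument names are hypothetical):

```rust
/// Batch-mode exit check: with the new option enabled and discovery disabled,
/// shut down once every watched file has reached EOF.
pub fn should_exit(
    exit_after_read: bool,
    discovery_enabled: bool,
    files_at_eof: &[bool],
) -> bool {
    // Note: an empty file list also satisfies `all`, which is the desired
    // behavior for a batch run that found nothing to read.
    exit_after_read && !discovery_enabled && files_at_eof.iter().all(|&eof| eof)
}
```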
There's no grand design here, but we should do the work to go through and address all of the relevant issues. Some examples:
For the last item, it would be particularly helpful to evolve how we store checkpoints. Instead of the strange filename-based system we have now, we should migrate towards a JSON file-based approach as laid out in #1779.
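As a purely illustrative sketch (the field names here are hypothetical, not the exact format proposed in #1779), such a checkpoint file might look like:

```json
{
  "version": 1,
  "checkpoints": [
    {
      "fingerprint": { "strategy": "checksum", "value": 1234567890 },
      "position": 4096
    },
    {
      "fingerprint": { "strategy": "path", "value": "/var/log/app.log" },
      "position": 0
    }
  ]
}
```

Storing the strategy alongside the value echoes the earlier point about fingerprints carrying information on how they were determined, which would let us migrate or combine strategies without invalidating existing checkpoints.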
- Rename `ignore_older` to `skip_older_than`
- Replace `start_from_beginning` with `ignore_checkpoints` and `read_from` (see the earlier section)
- Replace `oldest_first` with `mode = tail|tail_sequential|batch`
- Rename `glob_minimum_cooldown` to `discovery_interval`, and disable it when `mode = batch`

The file source has dramatically overgrown its original design and is causing pain for both users and developers of Vector. It warrants spending some time and effort to improve both usability and maintainability. Otherwise, we risk losing a large and increasing amount of our time to its maintenance and user support.
By focusing on modularity and simple improvements over a rewrite, we will be able to maintain the good parts of the source's long history (accumulated bug fixes). This also reduces the risk of introducing new bugs that would be inevitable with a more aggressive rewrite.
Modularity also positions us well for the future by encouraging a split between user-facing config and implementation-level config. This matches well with our config composition RFC and should enable simple config facades for specific file-based use cases.
This is a backwards-incompatible change to the config of one of our most widely used sources. Changing the interface will likely inconvenience a large number of current users. There is also always a risk that bugs are introduced as part of refactoring, even though our approach is designed to minimize that risk.
One alternative would be to rewrite the source from scratch. While this would likely result in more maintainable code, it would risk losing much of the accumulated knowledge present in the existing implementation. It would also likely take much longer and present a more difficult transition plan than the evolution proposed here.
Another alternative would be to leave the implementation largely alone and focus on improving our documentation and config UI alone. While this would likely yield strong benefits, it would leave a number of important issues unaddressed and do little for our ability to extend the source in the future.
The work with the highest investment/payoff ratio is likely around checksumming, so I would suggest we attack that first:
- Evolve `FileWatcher` into a general-purpose file wrapper with fingerprinting

The next most valuable is removing extraneous work from the read loop, hopefully improving performance considerably in some edge cases:
With the relatively low hanging fruit taken care of, we can move on to the more general reorganization tasks that set the stage for configuration improvements:
- … (`FileWatcher` work)
- … (`FileWatcher` work)

Followed by the new and improved configuration itself:
- Introduce `version = 2` of the file source config with a deprecation warning
- Rename `ignore_older` to `skip_older_than`
- Replace `start_from_beginning` with `ignore_checkpoints` and `read_from`
- Replace `oldest_first` with `mode = tail|tail_sequential`
- Rename `glob_minimum_cooldown` to `discovery_interval`

Which in turn enables the new batch mode to be implemented and exposed:
- Add `batch` to `mode`

Any time after the new file wrapper work is done, we can improve it to stop holding unnecessary file handles:
And finally, or in parallel with any of the above, go through and do the work to smooth out observability warts: