rfcs/2023-05-03-data-volume-metrics.md
Vector needs to be able to emit accurate metrics that can be usefully queried to give users insights into the volume of data moving through the system.
This RFC covers adding tags to the following metrics:

- `component_received_event_bytes_total`
- `component_sent_event_bytes_total`
- `component_received_event_total`
- `component_sent_event_total`

The new tags identify the `service` and the source of the data. `service` is a new concept
within Vector and represents the application that generated the log,
metric, or trace.

The `component_sent_bytes_total` and `component_received_bytes_total` metrics
that indicate network bytes sent and received by Vector are not considered here.

Currently it is difficult to accurately gauge the volume of data that is moving through Vector, and it is difficult to query where data being sent out has come from.
Global config options will be provided to indicate that the service tag and the
source tag should be sent. For example:
```yaml
telemetry:
  tags:
    service: true
    source_id: true
```
This will cause Vector to emit a metric like the following (note the last two tags):

```text
vector_component_sent_event_bytes_total{component_id="out",component_kind="sink",component_name="out",component_type="console",host="machine",service="potato",source_id="stdin"} 123
```
The default will be to not emit these tags.
- `service` - To attach the service, we need to add a new meaning to Vector:
  `service`. Any source that receives data that could potentially
  be considered a service will need to indicate which field means
  service. This work has largely already been done with the
  LogNamespacing work, so it will be trivial to add this new field.
  Not all sources will be able to specify a specific field to
  indicate the service. In time it will be possible for this to
  be accomplished through VRL.
- `source_id` - A new field will be added to the Event metadata,
  `Arc<OutputId>`, that will indicate the source of the event.
  `OutputId` will need to be serializable so it can be stored in
  the disk buffer. Since this field is just an identifier, it can
  still be used even if the source no longer exists when the event
  is consumed by a sink.
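To make the metadata change concrete, here is a minimal sketch. The types are simplified stand-ins (the real `OutputId` and `EventMetadata` live in Vector's codebase and the real `OutputId` must also be serializable for the disk buffer); only the shape of the idea is shown.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for Vector's `OutputId`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OutputId {
    pub component: String,
}

// Simplified event metadata carrying the originating source's id. Because
// the field is only an identifier, a sink can still read it even if the
// source component no longer exists when the event is consumed.
#[derive(Debug, Clone, Default)]
pub struct EventMetadata {
    pub source_id: Option<Arc<OutputId>>,
}

// Attaching the id to an event is just an `Arc` clone: a pointer copy,
// not a copy of the id itself.
pub fn with_source(id: &Arc<OutputId>) -> EventMetadata {
    EventMetadata { source_id: Some(Arc::clone(id)) }
}

fn main() {
    let id = Arc::new(OutputId { component: "stdin".into() });
    let a = with_source(&id);
    let b = with_source(&id);
    // Both events point at the same shared id.
    assert_eq!(a.source_id.as_deref(), b.source_id.as_deref());
    // One owner here plus one in each of the two events.
    assert_eq!(Arc::strong_count(&id), 3);
}
```

Storing the id behind an `Arc` keeps the per-event cost to a reference count bump, which matters given every event flowing through Vector carries this metadata.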
We will need to do an audit of all components to ensure the
bytes emitted for the `component_received_event_bytes_total` and
`component_sent_event_bytes_total` metrics are the estimated JSON size of the
event.
These tags will use the names configured in [User Experience](#user-experience).
The `reduce` and `aggregate` transforms combine multiple events into one. In this
case the source and service of the first event will be used.
If there is no source, a `source_id` of `-` will be emitted. The only way this can
happen is if the event was created by the `lua` transform.
If there is no service available, a `service` of `-` will be emitted.
Emitting `-` rather than omitting the tag entirely makes it clear that
there was no value, rather than the tag having been forgotten, and ensures it
is clear that the metric represents no service or source rather than the
aggregate value across all services.
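A tiny sketch of the fallback behaviour (the helper name is hypothetical, not Vector's actual API):

```rust
// Hypothetical helper: resolve the tag value for a metric, falling back to
// "-" when the event carries no service or source. Emitting "-" keeps
// "definitely no value" distinct from an aggregate across all services.
fn tag_or_dash(value: Option<&str>) -> String {
    value.unwrap_or("-").to_string()
}

fn main() {
    assert_eq!(tag_or_dash(Some("potato")), "potato");
    // e.g. an event created by the `lua` transform has no source:
    assert_eq!(tag_or_dash(None), "-");
}
```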
The Component Spec will need updating to indicate that these tags need to be included.
Performance - There is going to be a performance hit when emitting these metrics. Currently, for each batch, a single event is emitted containing the count and size of the entire batch. With this change it will be necessary to scan the entire batch to obtain the counts for each (source, service) combination before emitting them. This will involve additional allocations to maintain the counts, as well as the O(n) scan.
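The per-batch grouping described above can be sketched as follows. The `EventTelemetry` struct and field names are illustrative assumptions, not Vector's real types:

```rust
use std::collections::BTreeMap;

// Hypothetical per-event telemetry inputs.
struct EventTelemetry {
    source_id: String,
    service: String,
    json_size: usize,
}

// Instead of one count/size for the whole batch, group the batch by the
// (source, service) pair. This is the O(n) scan plus the extra allocations
// needed to hold the per-combination counts.
fn group_by_tags(batch: &[EventTelemetry]) -> BTreeMap<(String, String), (usize, usize)> {
    let mut counts = BTreeMap::new();
    for event in batch {
        let entry = counts
            .entry((event.source_id.clone(), event.service.clone()))
            .or_insert((0, 0));
        entry.0 += 1;               // event count
        entry.1 += event.json_size; // summed estimated JSON size
    }
    counts
}

fn main() {
    let batch = vec![
        EventTelemetry { source_id: "stdin".into(), service: "potato".into(), json_size: 100 },
        EventTelemetry { source_id: "stdin".into(), service: "potato".into(), json_size: 23 },
        EventTelemetry { source_id: "stdin".into(), service: "carrot".into(), json_size: 7 },
    ];
    let counts = group_by_tags(&batch);
    assert_eq!(counts[&("stdin".to_string(), "potato".to_string())], (2, 123));
    assert_eq!(counts.len(), 2);
}
```

One metric event would then be emitted per map entry rather than one per batch.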
`component_received_event_bytes_total`: This metric is emitted by the framework's source sender, so it looks like the only change needed is to add the `service` tag.
`component_sent_event_bytes_total`: For stream-based sinks this will typically be the byte value returned by
`DriverResponse::events_sent`.
Despite being in the Component Spec, not all sinks currently conform to this.
As an example, from a cursory glance over a couple of sinks:
- The `amqp` sink currently emits this value as the length of the binary
  data that is sent. By the time the data has reached the code where the
  `component_sent_event_bytes_total` event is emitted, the event has been
  encoded and the actual estimated JSON size has been lost. The sink will need
  to be updated so that when the event is encoded, the encoded event together
  with the pre-encoded JSON byte size is sent to the service where the event
  is emitted.
- The `kafka` sink also currently sends the binary size, but it looks like the estimated JSON byte size is easily accessible at the point of emitting, so it would not need much of a change.
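One way to carry the pre-encoding size alongside the encoded payload is a small wrapper, sketched below. The `EncodedEvent` name and shape are assumptions for illustration, not the sink framework's actual types:

```rust
// Hypothetical wrapper pairing the encoded payload with the estimated JSON
// size captured *before* encoding, so the layer that emits
// `component_sent_event_bytes_total` still has the correct value after
// the original event has been consumed by encoding.
struct EncodedEvent {
    payload: Vec<u8>,
    estimated_json_size: usize,
}

fn encode(estimated_json_size: usize, bytes: Vec<u8>) -> EncodedEvent {
    EncodedEvent { payload: bytes, estimated_json_size }
}

fn main() {
    // e.g. a 123-byte estimated JSON representation framed to 64 wire bytes.
    let encoded = encode(123, vec![0u8; 64]);
    // The metric must report the pre-encoding size, not the wire size:
    assert_eq!(encoded.estimated_json_size, 123);
    assert_ne!(encoded.payload.len(), encoded.estimated_json_size);
}
```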
To ensure that the correct metric is sent in a type-safe manner, we will wrap the estimated JSON size in a newtype:

```rust
pub struct JsonSize(usize);
```

The `EventsSent` metric will only accept this type.
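A runnable sketch of how the newtype gates the metric. The `EventsSent` fields here are simplified assumptions for illustration:

```rust
// Newtype so the compiler rejects passing a raw byte count (a plain
// `usize`, e.g. the wire size) where the estimated JSON size is required.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct JsonSize(usize);

impl JsonSize {
    pub fn new(size: usize) -> Self {
        JsonSize(size)
    }
    pub fn get(self) -> usize {
        self.0
    }
}

// Hypothetical, simplified `EventsSent`: only a `JsonSize` is accepted.
pub struct EventsSent {
    pub count: usize,
    pub byte_size: JsonSize,
}

fn main() {
    let event = EventsSent { count: 2, byte_size: JsonSize::new(123) };
    // `EventsSent { count: 2, byte_size: 123 }` would fail to compile.
    assert_eq!(event.byte_size.get(), 123);
}
```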
It is currently not possible to have dynamic tags with preregistered metrics.
Preregistering these metrics is essential to ensure that they don't expire.
The current mechanism to expire metrics is to check whether a handle to the given metric is being held; if it isn't, and nothing has updated that metric in the last cycle, the metric is dropped. If a metric is dropped, the next time an event is emitted with those tags the count starts at zero again.
We will need to introduce a registered event caching layer that will register and cache new events keyed on the tags that are sent to it.
Currently a registered metric is stored in a `Registered<EventsSent>`.
We will need a new struct that wraps this and is generic over the event and a tuple of
the tags for that event, e.g. `Cached<(String, String), EventsSent>`.
This struct will maintain a `BTreeMap` of tags -> `Registered`. Since this will
need to be shared across threads, the cache will need to be stored in an `RwLock`.
In pseudo-Rust:

```rust
struct Cached<Tags, Event> {
    cache: Arc<RwLock<BTreeMap<Tags, Registered<Event>>>>,
    register: fn(Tags) -> Registered<Event>,
}

impl<Tags, Event> Cached<Tags, Event> {
    fn emit(&mut self, tags: Tags, value: Event) {
        if let Some(event) = self.cache.read().unwrap().get(&tags) {
            event.emit(value);
        } else {
            let event = (self.register)(tags.clone());
            event.emit(value);
            self.cache.write().unwrap().insert(tags, event);
        }
    }
}
```
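The caching behaviour can be demonstrated with a runnable, simplified model. Here the cached `Registered<Event>` is replaced by a plain counter, and a `registrations` field stands in for the expensive registration path; both simplifications are assumptions for illustration:

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

// Simplified model of the caching layer: registration happens once per
// distinct (source_id, service) tag pair; later emits reuse the cached entry.
struct Cached {
    cache: Arc<RwLock<BTreeMap<(String, String), u64>>>,
    registrations: usize, // how many times the slow registration path ran
}

impl Cached {
    fn new() -> Self {
        Cached {
            cache: Arc::new(RwLock::new(BTreeMap::new())),
            registrations: 0,
        }
    }

    fn emit(&mut self, tags: (String, String), value: u64) {
        let mut cache = self.cache.write().unwrap();
        if let Some(counter) = cache.get_mut(&tags) {
            // Fast path: tags already registered, just update.
            *counter += value;
        } else {
            // Slow path: register once, then cache the registration.
            self.registrations += 1;
            cache.insert(tags, value);
        }
    }
}

fn main() {
    let mut cached = Cached::new();
    cached.emit(("stdin".to_string(), "potato".to_string()), 100);
    cached.emit(("stdin".to_string(), "potato".to_string()), 23);
    cached.emit(("stdin".to_string(), "carrot".to_string()), 7);
    // Two distinct tag pairs, so only two registrations despite three emits.
    assert_eq!(cached.registrations, 2);
    let cache = cached.cache.read().unwrap();
    assert_eq!(cache[&("stdin".to_string(), "potato".to_string())], 123);
}
```

Because the registration is held by the cache, the metric keeps a live handle and is not expired between emits.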
The ability to visualize data flowing through Vector will allow users to ascertain the effectiveness of the current use of Vector. This will enable users to optimise their configurations to make the best use of Vector's features.
The additional tags being added to the metrics will increase the cardinality of those metrics if they are enabled.
We could use an alternative metric instead of estimated JSON size.
Incremental steps to execute this change. These will be converted to issues after the RFC is approved:

- Add a `source` field to the Event metadata to indicate the source the event has come from.
- Introduce the `JsonSize` value and use the compiler to ensure all metrics emitted use it. The `EstimatedJsonEncodedSizeOf` trait will be updated to return a `JsonSize`.
- Add the `telemetry` configuration options and take them into account.
- Emit the new `source_id` and `service` tags.