rfcs/2020-08-26-3191-host-metrics.md
This RFC is to introduce a new metrics source to consume host-based metrics. The high level plan is to implement one (or more?) sources that collect CPU, disk, network, and memory metrics from Linux-based hosts.
This RFC will cover:
This RFC will not cover:
Users want to collect, transform, and forward metrics to better observe how their hosts are performing.
Build a single source called host_metrics (name to be confirmed) to collect host/system level metrics.
I've found a number of possible Rust-based solutions for implementing this collection, cross-platform.
The Heim crate has a useful comparison doc showing the available tools and their capabilities.
For this implementation we're recommending the Heim crate based on platform and feature coverage.
We'd use one of these to collect the following metrics:
host_cpu_seconds_total tagged with mode (idle, nice, system, user) and CPU. (counter)host_disk_read_bytes_total tagged with the disk (counter)host_disk_read_errors_total tagged with the disk (counter)host_disk_read_retries_total tagged with the disk (counter)host_disk_read_sectors_total tagged with the disk (counter)host_disk_read_time_seconds_total tagged with the disk (counter)host_disk_reads_completed_total tagged with the disk (counter)host_disk_write_errors_total tagged with the disk (counter)host_disk_write_retries_total tagged with the disk (counter)host_disk_write_time_seconds_total tagged with the disk (counter)host_disk_writes_completed_total tagged with the disk (counter)host_disk_written_bytes_total tagged with the disk (counter)host_disk_written_sectors_total tagged with the disk (counter)host_filesystem_avail_bytes tagged with the device, filesystem type, and mountpoint (gauge)host_filesystem_device_error tagged with the device, filesystem type, and mountpoint (gauge)host_filesystem_free_bytes tagged with the device, filesystem type, and mountpoint (gauge)host_filesystem_size_bytes tagged with the device, filesystem type, and mountpoint (gauge)host_filesystem_total_file_nodes tagged with the device, filesystem type, and mountpoint (gauge)host_filesystem_free_file_nodes tagged with the device, filesystem type, and mountpoint (gauge)host_load1 (gauge)host_load5 (gauge)host_load15 (gauge)host_memory_active_bytes (gauge)host_memory_compressed_bytes (gauge)host_memory_free_bytes (gauge)host_memory_inactive_bytes (gauge)host_memory_swap_total_bytes (gauge)host_memory_swap_used_bytes (gauge)host_memory_swapped_in_bytes_total (gauge)host_memory_swapped_out_bytes_total (gauge)host_memory_total_bytes (gauge)host_memory_wired_bytes (gauge)host_network_receive_bytes_total tagged with device (counter)host_network_receive_errs_total tagged with device (counter)host_network_receive_multicast_total tagged with device (counter)host_network_receive_packets_total tagged with device (counter)host_network_transmit_bytes_total tagged with device (counter)host_network_transmit_errs_total tagged with device (counter)host_network_transmit_multicast_total tagged with device (counter)host_network_transmit_packets_total tagged with device (counter)Users should be able to limit the collection of metrics to specific classes, here: cpu, memory, disk, filesystem, load, and network.
Metrics will also be tagged with:
host: the host name of the host being monitored.And collector for type of metric:
disk for disk based metricscpu for CPU based metricsfilesystem for filesystem based metricsmemory for memory based metrics.load for load based metrics.network for network based metrics.Specific explanation of some of the filesystem metrics:
host_filesystem_avail_bytes = Filesystem space available to non-root users in bytes (including reserved blocks).host_filesystem_free_bytes = Filesystem free space in bytes (excluding reserved blocks).host_filesystem_size_bytes = Filesystem size in bytes.The following additional source configuration will be added:
[sources.my_source_id]
type = "host_metrics" # required
collectors = [ "all"] # optional, defaults collecting all metrics.
filesystem.mountpoints = [ "*" ] # optional, defaults to all mountpoints.
disk.devices = [ "*" ] # optional, defaults to all to disk devices.
network.devices = [ "*" ] # optional, defaults to all to network devices.
scrape_interval_secs = 15 # optional, default, seconds
namespace = "host" # optional, default is "host", namespace to put metrics under
For collector (name open to discussion) we'd specify an array of metric collectors:
collectors = [ "cpu", "memory", "network" ]
For disk and network devices or filesystem mountpoints the default is to collect for all ("*") devices and mountpoints. Or you can configure Vector to only collect from specific devices, for example:
network.devices = [ "eth0" ]
Or, if we think its feasible, using globbing like so:
network.devices = [ "eth*" ]
And, if feasible, have syntax for exclusion of resources.
CPU, Memory, Disk, and Network are the basic building blocks of host-based monitoring. They are considered "table stakes" for most metrics-based monitoring of hosts. Additionally, if we do not support ingesting metrics from them, it is likely to push people to use another tool
As part of Vector's vision to be the "one tool" for ingesting and shipping observability data, it makes sense to add as many sources as possible to reduce the likelihood that a user will not be able to ingest metrics from their tools.
We could not add the source directly to Vector and instead instruct users to run Telegraf or Prometheus' node exporter and point Vector at the exposed Prometheus scrape endpoint. This would leverage the already supported inputs from those projects.
I decided against this as it would be in contrast with one of the listed principles of Vector:
One Tool. All Data. - One simple tool gets your logs, metrics, and traces (coming soon) from A to B.
On the same page, it is mentioned that Vector should be a replacement for Telegraf.
You SHOULD use Vector to replace Logstash, Fluent*, Telegraf, Beats, or similar tools.
If users are already running Telegraf or Node Exporter though, they could opt for this path.
host_metrics or cpu_metrics, mem_metrics, disk_metrics, or load_metrics?Incremental steps that execute this change. Generally this is in the form of: