rfcs/2021-09-01-8547-accept-metrics-in-datadog-agent-source.md
Currently the `datadog_agent` source only supports logs. This RFC proposes extending Vector to receive metrics from Datadog Agents and ingest them as metrics from a Vector perspective, so they can benefit from Vector's capabilities.
Some known issues are connected to the work described here: #7283, #8493 & #8626. These mostly concern the ability to store and manipulate distributions using sketches, and to send them to Datadog using the DDSketch representation. Other metrics sinks would likely benefit from having distributions stored internally as sketches, as this would provide better aggregation and accuracy.
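As background for why sketches aggregate so well, the core DDSketch idea can be illustrated with a minimal, self-contained example. This is only an illustration of logarithmic bucketing with a relative-accuracy guarantee, written under assumption; it is not the Agent's or Vector's actual implementation, and all names below are made up:

```rust
use std::collections::BTreeMap;

// Minimal illustration of DDSketch-style bucketing: values land in
// logarithmically sized buckets so any value can be recovered from its
// bucket index within a fixed relative error `alpha`.
struct MiniSketch {
    gamma: f64,                  // bucket growth factor, (1 + alpha) / (1 - alpha)
    buckets: BTreeMap<i32, u64>, // bucket index -> sample count
    count: u64,
}

impl MiniSketch {
    fn new(alpha: f64) -> Self {
        Self {
            gamma: (1.0 + alpha) / (1.0 - alpha),
            buckets: BTreeMap::new(),
            count: 0,
        }
    }

    // Insert a positive sample by bumping the count of its logarithmic bucket.
    fn insert(&mut self, value: f64) {
        let index = (value.ln() / self.gamma.ln()).ceil() as i32;
        *self.buckets.entry(index).or_insert(0) += 1;
        self.count += 1;
    }

    // Merging two sketches is lossless with respect to the accuracy
    // guarantee: just add bucket counts. This is why sketches aggregate
    // better than pre-computed percentiles.
    fn merge(&mut self, other: &MiniSketch) {
        for (&index, &n) in &other.buckets {
            *self.buckets.entry(index).or_insert(0) += n;
        }
        self.count += other.count;
    }

    // Approximate quantile: walk buckets in value order until the
    // cumulative count passes the requested rank, then return a
    // representative value for that bucket.
    fn quantile(&self, q: f64) -> Option<f64> {
        let rank = (q * self.count.saturating_sub(1) as f64) as u64;
        let mut seen = 0u64;
        for (&index, &n) in &self.buckets {
            seen += n;
            if seen > rank {
                return Some(2.0 * self.gamma.powi(index) / (1.0 + self.gamma));
            }
        }
        None
    }
}
```

With `alpha = 0.01`, any quantile estimate is within 1% of a true sample value, no matter how many sketches were merged along the way.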
Additional work on the `datadog_metrics` sink is required to reach a fully functional state, but that is not the goal of this RFC, which focuses on receiving metrics from Datadog Agents. In short, `(Datadog Agents) -> Vector -> Datadog` should just work. Regarding the Datadog Agent configuration, ideally it should only be a matter of setting `metrics_dd_url: https://vector.mycompany.tld` to forward metrics to a Vector deployment.
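On the Agent side, the intended configuration could look like the following sketch. This is illustrative only: `metrics_dd_url` is the setting this RFC proposes, and the surrounding keys follow the usual `datadog.yaml` layout:

```yaml
# datadog.yaml -- illustrative Agent configuration under this proposal.
api_key: <YOUR_API_KEY>

# Divert only metric submissions (/api/v1/series & /api/beta/sketches)
# to a Vector deployment; other payloads keep their standard routing.
metrics_dd_url: https://vector.mycompany.tld
```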
The current `dd_url` endpoint configuration has conditional behavior (also in the forwarder health check): if `dd_url` contains a known pattern (i.e. it has a suffix that matches a Datadog site), some extra hostname manipulation happens. Overall, the following paths are expected to be supported on the host behind `dd_url`:
- `/api/v1/validate` for API key validation
- `/api/v1/check_run` for check submission
- `/intake/` for events and metadata (possibly others)
- `/support/flare/` for support flares
- `/api/v1/series` & `/api/beta/sketches` for metrics submission

To ship only metrics, and let other payloads follow the standard path, the newly introduced Datadog Agent setting `metrics_dd_url` would have to be set to point to a Vector host with a `datadog_agent` source enabled. Requests targeted at `/api/v1/series` & `/api/beta/sketches` would then be diverted there, allowing Vector to process them further.
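On the Vector side, a minimal configuration sketch could look like this. Option names follow the current `datadog_agent` source and `datadog_metrics` sink, but may differ by Vector version and are given here as an assumption:

```toml
# vector.toml -- minimal sketch of the Vector side, assuming the
# `datadog_agent` source accepts metrics on the same listener it
# already uses for logs.
[sources.agents]
type = "datadog_agent"
address = "0.0.0.0:8080"   # would receive /api/v1/series & /api/beta/sketches

[sinks.datadog]
type = "datadog_metrics"
inputs = ["agents"]
default_api_key = "${DATADOG_API_KEY}"
```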
A few details about Datadog Agents & Datadog metrics:

Inside Datadog Agents, metric samples are represented as `MetricSample` and can be of several types (see the `MetricSample` structure in the dogstatsd enrich module). However, Datadog Agent metrics are transformed before being sent; ultimately, metrics account for two different kinds of payload:

- `/api/v1/series`, using the JSON schema officially documented, with a few undocumented additional fields; this aligns very well with the existing `datadog_metrics` sink.
- `/api/beta/sketches`, serialized as protobuf as shown in the Agent serializer (it ultimately lands in the `sketch_series` module). The public `.proto` definition can be found in the agent-payload proto.

Vector has a nice description of its metrics data model and a concise enum for representing it.
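To make the mapping concrete, here is a simplified, hypothetical rendition of a Vector-style metric value enum and how `/api/v1/series` entries could map onto it. The variant names only approximate Vector's real enum, the `Sketch` variant stands in for the support this RFC implies adding, and `from_series_type` is a made-up helper:

```rust
// Simplified, illustrative rendition of a Vector-style metric value enum.
// Variant names only approximate Vector's real data model.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum MetricValue {
    Counter { value: f64 },                    // /api/v1/series "count" & "rate"
    Gauge { value: f64 },                      // /api/v1/series "gauge"
    Distribution { samples: Vec<(f64, u32)> }, // (value, sample rate) pairs
    Sketch { bins: Vec<(i32, u64)> },          // /api/beta/sketches, once decoded
}

// Hypothetical mapping from a v1 series entry type to a metric value.
fn from_series_type(ty: &str, value: f64) -> Option<MetricValue> {
    match ty {
        "count" | "rate" => Some(MetricValue::Counter { value }),
        "gauge" => Some(MetricValue::Gauge { value }),
        _ => None, // unknown types would be rejected by the route
    }
}
```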
The implementation would then consist of:

- Adding a new Datadog Agent setting (`metrics_dd_url`) that would divert only requests to `/api/v1/series` & `/api/beta/sketches` to a specific endpoint.
- Implementing the `/api/v1/series` route (based on both the official API and the Datadog Agent itself) to cover every metric type handled by this endpoint (count, gauge and rate) and:
  - staying consistent with the existing `datadog_metrics` sink;
  - deferring support for multiple values per tag key (Datadog supports both `key:foo` & `key:bar` on the same metric, but Vector doesn't), which may be supported later if there is demand for it (see the note below).
- Implementing the `/api/beta/sketches` route in the `datadog_agent` source to support sketches/distributions encoded using protobuf. Once decoded, those sketches will require internal support in Vector:
  - the `datadog_metrics` sink would need to use sketches and the associated endpoint. This is a prerequisite to support end-to-end sketch forwarding: `(Agent Sketch) -> (Vector) -> (Datadog intake)`. This RFC focuses on ingesting sketches, not the rest of the flow.

Regarding the tagging issue: a possibly temporary workaround would be to store incoming tags with the complete
"key:value" string as the key and an empty value to store those in the existing map Vector uses to store
tags and slightly rework the
datadog_metrics sink not to append : if a tag key has the empty string as the corresponding value. However Datadog
best practices can be followed with the current Vector data model, so unless something unforeseen or unexpected demand
arise, Vector internal tag representation will not be changed following this RFC.
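The workaround described above can be sketched as follows. The helper names are hypothetical; the only assumption taken from the text is that tags live in a string-to-string map:

```rust
use std::collections::BTreeMap;

// Store an incoming Datadog tag (either "key:value" or a bare "key") by
// using the complete incoming string as the map key and the empty string
// as the value, per the workaround described above.
fn store_tag(tags: &mut BTreeMap<String, String>, incoming: &str) {
    tags.insert(incoming.to_string(), String::new());
}

// Sink-side rendering: only append `:` when the stored value is non-empty,
// so "env:prod" round-trips as-is instead of becoming "env:prod:".
fn render_tag(key: &str, value: &str) -> String {
    if value.is_empty() {
        key.to_string()
    } else {
        format!("{}:{}", key, value)
    }
}
```

Note that this scheme also happens to allow `key:foo` and `key:bar` to coexist in the map, since they are distinct keys.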
Users who want to use this feature will need to upgrade both Vector and the Agent. If a new metric route is added in a Datadog Agent upgrade, users will need to upgrade Vector as well.
There are a few existing metric aggregation solutions. The Datadog Agent is able to aggregate, to some extent, metrics coming over dogstatsd and from Go/Python code. It mostly aims at reducing the number of metric samples sent by the Agent.
Veneur offers an aggregation feature, but it does not support sketches/distributions per se. It requires what is called a central Veneur, which computes aggregated values and some percentiles for selected metrics. Some aspects of this solution could be seen as an alternative approach. However, this approach has two major drawbacks: it relies on a central service for aggregation, and it does not support sketches.
Using an alternate protocol between Datadog Agents and Vector (like Prometheus, StatsD, OpenTelemetry, or Vector's own protocol) could be envisioned. This would call for a significant addition, although one possible with the current Agent architecture; the changes would mostly be located in the forwarder and serializer logic. This would imply a huge chunk of work on the Agent side, require upgrades to use the feature, and probably also require some work on the Vector side. It does not align well with the purpose of the Datadog Agent, and it would also add a risk of losing information through protocol conversion.
For sketches, we could flatten them, compute the usual derived metrics (min/max/average/count/some percentiles), and send those as gauges/counts, but this would prevent (or at least impact) existing distribution/sketch use cases. Moreover, if only derived metrics are used instead of sketches, a lot of tagging flexibility is lost: by submitting tagged sketches to the Datadog intake, any tag selector can be used to compute a distribution based on the sketches that bear matching tags, and this cannot be done without sending sketches. Flattening sketches would, however, have the benefit of simplifying the implementation in Vector and removing the prerequisite of sketch support inside Vector.
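For comparison, the flattening alternative amounts to something like the sketch below, written over raw samples for simplicity (an assumption: with real DDSketches, min/max/count come from the payload and percentiles are approximate):

```rust
// Derived metrics that flattening a distribution would produce. Emitting
// only these as gauges/counts loses the ability to re-aggregate across
// arbitrary tag selections later.
struct Flattened {
    min: f64,
    max: f64,
    avg: f64,
    count: usize,
    p95: f64,
}

fn flatten(samples: &mut Vec<f64>) -> Option<Flattened> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let count = samples.len();
    let sum: f64 = samples.iter().sum();
    // Nearest-rank p95; a sketch would answer this only approximately,
    // but over *any* tag selection rather than a fixed one.
    let idx = ((0.95 * count as f64).ceil() as usize).saturating_sub(1);
    Some(Flattened {
        min: samples[0],
        max: samples[count - 1],
        avg: sum / count as f64,
        count,
        p95: samples[idx],
    })
}
```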
Instead of being done in the Agent, the request routing could be implemented either:
Note: proxying non-metric requests is not a completely discarded option, as it might still be useful in situations where proxying everything is explicitly wanted, or where proxying unknown payloads (for example, if the Agent is upgraded and comes with a new metric route not yet supported by Vector) would serve as a data-loss prevention mechanism and/or help maintain metric continuity.
None
- Implement the `metrics_dd_url` override in the Datadog Agent.
- Implement the `/api/v1/series` route in the `datadog_agent` source, implement complete support in the `datadog_metrics` sink for the undocumented fields, and store incoming tags inside Vector as the key only, with an empty string for their value. Validate the Agent->Vector->Datadog scenario for gauge, count & rate.
- Implement the `/api/beta/sketches` route, again in the `datadog_agent` source, and validate the Agent->Vector->Datadog scenario for sketches/distributions. This also requires internal sketch support in Vector, along with sending sketches from the `datadog_metrics` sink; that is not directly addressed by this RFC, but it is tracked in the following issues: #7283, #8493 & #8626.

The latter task depends on issue #9181.