rfcs/2021-11-03-9862-ingest-apm-stats-along-traces-in-dd-agent-source.md
datadog-agent source with consistent user experience

Datadog traces support in the datadog_agent source was initially documented in this RFC. However, the
Datadog trace-agent submits more data than just traces. It also sends statistics ("APM Stats") about the running time of
each instrumented resource (i.e. a given piece of code), aggregated over time and computed from 100% of the traces
received. These APM stats are very important, as they highlight code hot spots and ease aggregation. As a result, APM stats should be handled by Vector.
Traces are new in Vector, and initial support focuses on traces coming from the Datadog Agent. The RFC discussing traces describes how those traces will be handled in Vector and highlights the need to support APM stats.
APM stats are encoded using Protobuf; the current schema is available in the datadog-agent
repository. APM stats may be computed by the tracing library in some cases or, most of the time, by the trace-agent. Once
they are emitted by the trace-agent there is no difference, except for a boolean value that indicates
where they were computed. They are sent by the trace-agent to the same endpoint as trace payloads; only the paths differ,
which allows easy discrimination between trace & APM stats payloads.
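As an illustration of the path-based discrimination described above, here is a minimal Python sketch. The two paths (`/api/v0.2/traces` and `/api/v0.2/stats`) are assumptions made for the example and should be checked against the trace-agent version in use; `PayloadKind` and `classify_payload` are hypothetical names, not part of Vector or the Datadog Agent.

```python
from enum import Enum


class PayloadKind(Enum):
    TRACES = "traces"
    APM_STATS = "apm_stats"
    UNKNOWN = "unknown"


# Assumed request paths (not taken from this RFC); verify against the trace-agent.
TRACES_PATH = "/api/v0.2/traces"
STATS_PATH = "/api/v0.2/stats"


def classify_payload(path: str) -> PayloadKind:
    """Classify an incoming request using its URL path only."""
    # Drop any query string and trailing slash before comparing.
    normalized = path.split("?", 1)[0].rstrip("/")
    if normalized == TRACES_PATH:
        return PayloadKind.TRACES
    if normalized == STATS_PATH:
        return PayloadKind.APM_STATS
    return PayloadKind.UNKNOWN
```

Because the decision depends on the path alone, the payload body does not need to be decoded before choosing how to handle it.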
Those stats are computed by a component named concentrator in the trace-agent. There is a dedicated
path for APM stats that come directly from tracing libraries, but ultimately both flow
to Datadog through the same sending code.
A stats payload is an aggregate of ClientGroupedStats messages, each carrying the following fields:
string service = 1;
string name = 2;
string resource = 3;
uint32 HTTP_status_code = 4;
string type = 5;
string DB_type = 6; // db_type might be used in the future to help in the obfuscation step
uint64 hits = 7; // count of all spans aggregated in the groupedstats
uint64 errors = 8; // count of error spans aggregated in the groupedstats
uint64 duration = 9; // total duration in nanoseconds of spans aggregated in the bucket
bytes okSummary = 10; // ddsketch summary of ok spans latencies encoded in protobuf
bytes errorSummary = 11; // ddsketch summary of error spans latencies encoded in protobuf
bool synthetics = 12; // set to true on spans generated by synthetics traffic
uint64 topLevelHits = 13; // count of top level spans aggregated in the groupedstats
As you can see, it is a group of various metrics that can be represented in Vector as such (sketches are supported since this PR). In the proto definition, sketches are stored as unstructured byte slices, but those fields are filled with a protobuf-encoded ddsketch. Given that sketches in Vector are also heavily based on ddsketch, APM stats sketches can be converted to/from the Vector internal representation without incurring too much accuracy loss, although implementing that conversion would require significant work.
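To make the "group of metrics" idea concrete, the following Python sketch flattens one ClientGroupedStats into individual metrics whose dimensions travel as tags. It is only an illustration: the `Metric` class, the `apm_stats.` name prefix, the `to_metrics` helper and the `upper_level_tags` argument are hypothetical, and the two sketch fields are left out.

```python
from dataclasses import dataclass


@dataclass
class ClientGroupedStats:
    """Subset of the protobuf fields listed above (sketch fields omitted)."""
    service: str
    name: str
    resource: str
    http_status_code: int
    type: str
    hits: int
    errors: int
    duration: int  # total duration in nanoseconds
    top_level_hits: int


@dataclass
class Metric:
    """Hypothetical stand-in for a Vector metric event."""
    name: str
    value: int
    tags: dict


def to_metrics(stats: ClientGroupedStats, upper_level_tags: dict) -> list:
    """Emit one metric per numeric field, carrying every dimension as a tag."""
    tags = dict(upper_level_tags)
    tags.update({
        "service": stats.service,
        "name": stats.name,
        "resource": stats.resource,
        "http_status_code": str(stats.http_status_code),
        "type": stats.type,
    })
    values = {
        "hits": stats.hits,
        "errors": stats.errors,
        "duration": stats.duration,
        "top_level_hits": stats.top_level_hits,
    }
    # Each metric gets its own copy of the tags so it can flow independently.
    return [Metric(f"apm_stats.{key}", value, dict(tags)) for key, value in values.items()]
```

Since every metric carries the full set of tags, a consumer can later regroup the metrics that belong to the same ClientGroupedStats.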
This opens two very different paths for APM stats in Vector:

- Log events: each ClientGroupedStats would be mapped to
  one log event.
- Metric events: each value of a ClientGroupedStats would be
  emitted as a metric with all upper level information stored as tags. Each metric would flow independently from
  the others, and this would require significant re-aggregation logic in the datadog_traces sink.

Those two approaches can be mixed together if we introduce one of the following abilities to bundle multiple metrics into a single event:

- Allowing a log event to have metric fields, by adding a Metric type to the Value enum.
- Introducing a metric event that could hold multiple values.

This also raises the question of having a second, unrelated, metric stream coming out of the datadog_agent source (assuming that the datadog_agent source accepts Datadog Agent
metrics - RFC). The
Vector events representing ingested APM stats will have to be routed along with traces, and most often they will follow a
different path than other plain metrics/logs received from a core-agent. Thus it is suggested to re-arrange the
datadog_agent source. Many options are available (additional details on what exactly "Datadog Agents" are can be
found in the trace support RFC and may provide relevant context for the points below):
- Add a setting to the datadog_agent source:
  agent: <TYPE>, where <TYPE>
  could be core (which would support metrics & logs; we could optionally add logs & metrics values to allow only logs or only
  metrics, along with core that would allow both) or trace, and could be extended to support process, security and
  so on.
- Expose multiple outputs from the datadog_agent source, e.g. <SRC_ID>.metrics, <SRC_ID>.logs, <SRC_ID>.traces, <SRC_ID>.apm_stats, etc.
- Use one source per kind of Agent:
  - datadog_core, or the current datadog_agent source, that would receive the data sent by the Datadog "core" Agent
    (the agent that collects logs & metrics).
  - datadog_trace, that would support all data sent by the trace-agent.
  - datadog_process source for the Datadog process-agent.
  - datadog_security for the Datadog security Agent.

Transforms already have named_outputs, which lays the ground for the same feature on sources:
one PR has already been merged, while scheduled work is tracked in the named outputs improvements issue.

The Vector datadog_agent source would accept all supported data types, including APM stats (along with traces), and emit
Vector events (logs or metrics depending on the implementation) including all metadata as tags/fields, so filtering could
be done later in the topology on both APM stats and traces.
In order to avoid complex and unreliable route transforms to properly differentiate logs from traces (as the latter
will be represented as logs inside Vector), and plain metrics (received from the core Agent) from APM stats metrics
(received from the trace-agent), we can plan to extend the behaviour that was added to the remap
transform. This would translate to the following kind of config, which is easy to read and easy to adapt:
[sources.dd_agents]
type = "datadog_agent"
address = "[::]:8081"
[sinks.dd_traces]
type = "datadog_traces"
inputs = ["dd_agents.traces", "dd_agents.apm_stats"]
[sinks.dd_logs]
type = "datadog_logs"
inputs = ["dd_agents.logs"]
[sinks.dd_metrics]
type = "datadog_metrics"
inputs = ["dd_agents.metrics"]
[sinks.debug]
type = "console"
# Optionally the non-suffixed name could receive everything, this will be configurable
inputs = ["dd_agents"]
encoding.codec = "json"
The datadog_traces sink will receive those events (metrics and/or logs depending on the implementation) and do the
opposite conversion, provided that the expected tags are present.
Regarding the Datadog trace-agent config, the APM stats endpoint is the same as the trace one (apm_config.apm_dd_url
config key), so there will be nothing else to configure.
Finally, API key management will be the same as it is for other Datadog sources/sinks.
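The routing expressed by the config above can be sketched in a few lines of Python. This is not how Vector wires its topology; it only illustrates the intended semantics, under the assumption that a non-suffixed input receives every output (the text notes this would be configurable). The `route` function and the `OUTPUTS` tuple are hypothetical.

```python
from collections import defaultdict

# Output names taken from the <SRC_ID>.<OUTPUT_NAME> examples in this document.
OUTPUTS = ("logs", "metrics", "traces", "apm_stats")


def route(source_id, events, sink_inputs):
    """Deliver (output_name, event) pairs to sinks according to their inputs.

    `sink_inputs` maps a sink name to its `inputs` list. An input equal to
    `source_id` receives everything (assumption); `source_id.<output>`
    receives only that output.
    """
    delivered = defaultdict(list)
    for output, event in events:
        if output not in OUTPUTS:
            raise ValueError(f"unknown output: {output}")
        for sink, inputs in sink_inputs.items():
            if source_id in inputs or f"{source_id}.{output}" in inputs:
                delivered[sink].append(event)
    return dict(delivered)
```

With the inputs from the config above, the traces sink would receive both traces and APM stats, while the console sink listening on the bare source name would receive all four kinds of events.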
Each group is relatively independent:
datadog_agent source:

- Extend the named_outputs feature, which is available for transforms, to sources so they can expose multiple named
  outputs (<SRC_ID>.<OUTPUT_NAME>). The feature for transforms was initially added in this PR;
  subsequent work on it is tracked here.
- Use named_outputs in the datadog_agent source: <SRC_ID>.traces, <SRC_ID>.apm_stats,
  <SRC_ID>.metrics, <SRC_ID>.logs. Note that the non-suffixed output should have a predictable behaviour, so we
  could add a knob named top_level_output that would allow the user to choose which data to get out of the
  suffix-less output.

APM stats support would be done according to the following plan:
- Convert ClientGroupedStats into relevant metrics, with all possible metadata, to allow
  the lossless pass-through scenario and the same level of filtering/routing we can achieve for traces. APM stats
  sketches would then be converted to the Vector internal sketches. Vector internal sketches would then get
  parameterized gamma and maxbin that would still default to the agent sketch value.
- The datadog_traces sink would then receive incoming APM stats metrics along with traces. It will
  reaggregate those metrics according to the relevant dimensions, using the Partitioner trait, to rebuild APM stats
  payloads.
- At a certain point in time, when sketches-rs is production ready, the
  datadog_agent source and the datadog_metrics sink will have to handle conversion between Vector sketches
  (plain ddsketches) and the agent variation (note: it's on the crate roadmap to offer that kind of conversion).

[btime]: https://github.com/DataDog/datadog-agent/blob/dc2f202/pkg/trace/stats/concentrator.go#L148-L159
[csb-start]: https://github.com/DataDog/datadog-agent/blob/dc2f202/pkg/trace/pb/stats.proto#L47
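The re-aggregation step in the plan above can be pictured with a short Python sketch. It does not use or model Vector's Partitioner trait; it only shows the general idea of deriving a key from tags and summing counters that share it. The choice of key dimensions (`KEY_TAGS`) and the `(name, value, tags)` tuple shape are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical grouping dimensions, borrowed from the ClientGroupedStats fields.
KEY_TAGS = ("service", "name", "resource", "http_status_code", "type")


def partition_key(tags):
    """Build a hashable key from the tags that identify one stats group."""
    return tuple(tags.get(key, "") for key in KEY_TAGS)


def reaggregate(metrics):
    """Group flat (name, value, tags) metrics back into one record per key."""
    groups = defaultdict(dict)
    for name, value, tags in metrics:
        group = groups[partition_key(tags)]
        # Counters that share a key and a name are summed.
        group[name] = group.get(name, 0) + value
    return dict(groups)
```

Summing is only valid for counter-like values such as hits, errors and duration; sketches would need a proper merge operation instead.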
None identified so far.
N/A.
Regarding the fact that we could ignore/drop incoming APM stats:

- Drop APM stats on the trace-agent side and compute them in the datadog_traces sink. This could work, but
  it is a lot for an initial implementation (it would require plain ddsketch support on top of the computation logic),
  and to match the accuracy of current APM stats, Vector would have to receive 100% of traces, which may not always be
  possible. It would, however, pave the way for generic APM stats computation wherever the traces come from.

Regarding the internal representation, APM stats could alternatively be represented by a log event with some numerical fields. As stated above, a hybrid approach, like allowing a log event to have metric fields or introducing a metric event that could hold multiple values, could also be a solution.
Regarding sketches, those from APM stats are not exactly the same as the internal representation we have in Vector, so converting them to the internal representation will require some plumbing. This could be avoided by not decoding those sketches at all and keeping them as opaque data/raw byte slices inside Vector.
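To give a feel for why converting between sketch variants costs some accuracy, here is a minimal Python sketch of the bucket mapping used by a plain DDSketch, as generally described for that data structure. It does not model the agent variation or Vector's internal sketches; the function names and the chosen accuracies are hypothetical.

```python
import math


def gamma_for(relative_accuracy: float) -> float:
    """Bucket growth factor for a target relative accuracy (plain DDSketch)."""
    return (1 + relative_accuracy) / (1 - relative_accuracy)


def bucket_index(value: float, gamma: float) -> int:
    """Index of the bucket (gamma^(i-1), gamma^i] that contains `value`."""
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    return math.ceil(math.log(value) / math.log(gamma))


def bucket_value(index: int, gamma: float) -> float:
    """Representative value reported for every sample in a bucket."""
    return 2 * gamma ** index / (gamma + 1)


def rebucket(index: int, gamma_from: float, gamma_to: float) -> int:
    """Move a bucket to a sketch with another gamma via its representative value."""
    return bucket_index(bucket_value(index, gamma_from), gamma_to)
```

Each mapping keeps the relative error within the accuracy of its own sketch, so chaining two mappings compounds the two error bounds. Keeping the sketches as opaque bytes avoids that extra error entirely.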
About the source(s) reorganisation, an alternative that avoids the work of implementing <source_id>.<suffix>
would be to handle each Datadog Agent in a dedicated source:

- Either the datadog_agent source is adjusted to be configurable with a type setting (that could be set to logs,
  metrics or traces),
- Or three sources are provided: datadog_logs, datadog_metrics & datadog_traces (datadog_agent
  would probably become an alias for datadog_logs or datadog_metrics before being deprecated).

This would lead to the following config, functionally identical to the snippet above, a bit longer but still very straightforward and easily readable (note that having multiple binding addresses may translate to more parameters in later work around helm charts):
[sources.dd_in_logs]
type = "datadog_logs"
address = "[::]:8081"
[sources.dd_in_metrics]
type = "datadog_metrics"
address = "[::]:8082"
[sources.dd_in_traces]
type = "datadog_traces"
address = "[::]:8083"
[sinks.dd_traces]
type = "datadog_traces"
inputs = ["dd_in_traces"]
[sinks.dd_out_logs]
type = "datadog_logs"
inputs = ["dd_in_logs"]
[sinks.dd_out_metrics]
type = "datadog_metrics"
inputs = ["dd_in_metrics"]
[sinks.debug]
type = "console"
inputs = ["dd_in_*"]
encoding.codec = "json"
None.
datadog_traces sinks

Depending on the timeline of sketches-rs, this point is at the edge between this section and the next one:

datadog_traces sink for any trace format.