RFC - 2021-11-03 - Ingest APM stats in `datadog-agent` source with consistent user experience

Datadog traces support in the datadog_agent source was initially documented in this RFC. However, the Datadog trace-agent submits more than just traces. It also sends statistics ("APM stats") about the running time of each instrumented resource (i.e. a given piece of code), aggregated over time and computed from 100% of the traces received. APM stats are important because they highlight code hot spots and ease aggregation, so Vector should handle them.

Context

Traces are new in Vector, and initial support focuses on traces coming from the Datadog Agent. The RFC discussing traces describes how those traces will be handled in Vector and highlights the need to support APM stats.

APM stats are encoded using Protobuf; the current schema is available in the datadog-agent repository. APM stats may be computed by the tracing library in some cases, but most of the time they are computed by the trace-agent. Once they are emitted by the trace-agent there is no difference between the two cases, except for a boolean value that indicates where the stats were computed. The trace-agent sends them to the same endpoint as trace payloads; only the paths differ, which makes it easy to discriminate between trace and APM stats payloads.
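The path-based discrimination described above could look roughly like the sketch below. It is only an illustration: the two path constants are assumptions and are not taken from this RFC, and `TracePayloadKind` and `classify_path` are hypothetical names, not existing Vector APIs.

```rust
/// Kind of payload sent by the trace-agent, inferred from the request path.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TracePayloadKind {
    Traces,
    ApmStats,
}

// ASSUMPTION: these paths are not stated in this RFC; verify them against the
// trace-agent before relying on them.
const TRACES_PATH: &str = "/api/v0.2/traces";
const STATS_PATH: &str = "/api/v0.2/stats";

/// Classify an incoming request by its path, ignoring any query string and
/// trailing slash. Returns `None` for paths the trace handling does not own.
pub fn classify_path(path: &str) -> Option<TracePayloadKind> {
    // Drop the query string, if any.
    let path = path.split('?').next().unwrap_or(path);
    match path.trim_end_matches('/') {
        TRACES_PATH => Some(TracePayloadKind::Traces),
        STATS_PATH => Some(TracePayloadKind::ApmStats),
        _ => None,
    }
}
```

Because both payload kinds arrive on the same listener, a single source can decode each request with the matching Protobuf schema and tag the resulting events accordingly.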

Those stats are computed by a trace-agent component named the concentrator. APM stats that come directly from tracing libraries follow a dedicated path, but ultimately both flow to Datadog through the same sending code.

Below is an example of a stats payload; it is an aggregate of ClientGroupedStats:

```protobuf
message ClientGroupedStats {
  string service = 1;
  string name = 2;
  string resource = 3;
  uint32 HTTP_status_code = 4;
  string type = 5;
  string DB_type = 6; // db_type might be used in the future to help in the obfuscation step
  uint64 hits = 7; // count of all spans aggregated in the groupedstats
  uint64 errors = 8; // count of error spans aggregated in the groupedstats
  uint64 duration = 9; // total duration in nanoseconds of spans aggregated in the bucket
  bytes okSummary = 10; // ddsketch summary of ok spans latencies encoded in protobuf
  bytes errorSummary = 11; // ddsketch summary of error spans latencies encoded in protobuf
  bool synthetics = 12; // set to true on spans generated by synthetics traffic
  uint64 topLevelHits = 13; // count of top level spans aggregated in the groupedstats
}
```

As you can see, it is a group of various metrics that can be represented as such in Vector (sketches are supported since this PR). In the proto definition, sketches are stored as unstructured byte slices, but those fields are filled with a Protobuf-encoded ddsketch. Given that sketches in Vector are also heavily based on ddsketch, APM stats sketches can be converted to and from the Vector internal representation without incurring too much accuracy loss, although implementing that conversion would require significant work.

This opens two major, very different paths for APM stats in Vector:

Either the APM stats are emitted as logs: each ClientGroupedStats would be mapped to one log event.
Or they are emitted as metrics: each value from every ClientGroupedStats would be emitted as a metric, with all upper-level information stored as tags. Each metric would flow independently from the others, and this would require significant re-aggregation logic in the datadog_traces sink.
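To make the metrics option more concrete, here is a minimal sketch of how one ClientGroupedStats could be expanded into independent metrics that all carry the same tags. `GroupedStats` is a reduced, hypothetical mirror of the Protobuf message (sketches omitted), `StatMetric` is a simplified stand-in for a Vector metric event, and the `apm_stats.*` metric names and tag names are assumptions chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Reduced mirror of `ClientGroupedStats` (sketch fields omitted).
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Default)]
pub struct GroupedStats {
    pub service: String,
    pub name: String,
    pub resource: String,
    pub http_status_code: u32,
    pub span_type: String,
    pub synthetics: bool,
    pub hits: u64,
    pub errors: u64,
    pub duration: u64,
    pub top_level_hits: u64,
}

/// Simplified stand-in for a Vector metric event (not the real Vector type).
#[derive(Debug, Clone, PartialEq)]
pub struct StatMetric {
    pub name: String,
    pub value: u64,
    pub tags: BTreeMap<String, String>,
}

/// Expand one grouped stats entry into one metric per numeric value.
/// Every metric carries the full set of dimensions as tags, so a sink can
/// later regroup the metrics that belong to the same entry.
pub fn to_metrics(stats: &GroupedStats) -> Vec<StatMetric> {
    let mut tags = BTreeMap::new();
    tags.insert("service".to_string(), stats.service.clone());
    tags.insert("name".to_string(), stats.name.clone());
    tags.insert("resource".to_string(), stats.resource.clone());
    tags.insert("http_status_code".to_string(), stats.http_status_code.to_string());
    tags.insert("type".to_string(), stats.span_type.clone());
    tags.insert("synthetics".to_string(), stats.synthetics.to_string());

    // Metric name suffixes are assumptions, not an agreed naming scheme.
    let values = [
        ("hits", stats.hits),
        ("errors", stats.errors),
        ("duration", stats.duration),
        ("top_level_hits", stats.top_level_hits),
    ];
    values
        .iter()
        .map(|&(suffix, value)| StatMetric {
            name: format!("apm_stats.{}", suffix),
            value,
            tags: tags.clone(),
        })
        .collect()
}
```

The sketch shows why re-aggregation is needed on the sink side: the four metrics produced for one entry are only related through their identical tag sets.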

Those two approaches can be mixed together if we introduce one of the following abilities to bundle multiple metrics into a single event:

Allow multiple metric samples in a single metric event
Allow log events to embed metrics, which would mean adding a Metric type to the Value enum

This also raises the question of having a second, unrelated metric stream coming out of the datadog_agent source (assuming that the datadog_agent source accepts Datadog Agent metrics - RFC). The Vector events representing APM stats will have to be routed along with traces, and most often they will follow a different path than the plain metrics/logs received from a core agent. It is therefore suggested to rearrange the datadog_agent source. Several options are available (additional details on what exactly "Datadog Agents" are can be found in the trace support RFC and may provide relevant context for the points below):

Keeping a single datadog_agent source:
- and add a setting to switch between agent kinds: agent: <TYPE>, where <TYPE> could be core (which would support metrics & logs; we could optionally add logs & metrics values to only allow logs or metrics, along with core that would allow both) or trace, and could be extended to support process, security and so on
- or introduce the ability for a source to have multiple outputs for the datadog_agent source, e.g. <SRC_ID>.metrics, <SRC_ID>.logs, <SRC_ID>.traces, <SRC_ID>.apm_stats, etc.
Another solution would be to keep one Vector source per kind of Datadog Agent, and then we would have the following Vector sources:
- datadog_core, or the current datadog_agent source, that would receive the data sent by the Datadog "core" Agent (the agent that collects logs & metrics).
- datadog_trace that would support all data sent by the trace-agent
- And so on as the support list grows:
  - datadog_process source for the Datadog process-agent
  - datadog_security for the Datadog security Agent
  - Etc.

Cross cutting concerns

Ongoing work on transforms to add named_outputs is laying the ground for the same feature on sources. One PR has already been merged, while scheduled work is tracked in the named outputs improvements issue.
Ongoing work on schemas will ultimately offer a programmatic way of validating required fields and expressing constraints on incoming events for a given sink. Traces & APM stats are a good fit for that because they will be represented as standard Vector events, but the sinks handling those will expect some mandatory information.
An official crate for dd-sketches is being worked on.

Scope

In scope

Decode APM stats requests (protobuf) received from the trace-agent
Convert those into a Vector internal representation
Enable the passthrough use case (trace-agent -> Vector -> Datadog) in a lossless fashion

Out of scope

Computing APM stats in Vector, although this should be kept in mind as a valuable feature for third-party traces
Any support for other kinds of traces

Pain

Vector has no trace support
Datadog traces support without APM stats makes the whole APM product much less powerful.

Proposal

User Experience

The Vector datadog_agent source would accept all supported data types, including APM stats (along with traces), and emit Vector events (logs or metrics depending on the implementation) that include all metadata as tags/fields, so that filtering can be done later in the topology on both APM stats and traces.

In order to avoid complex and unreliable route transforms to properly differentiate logs from traces (as the latter will be represented as logs inside Vector), and plain metrics (received from the core Agent) from APM stats metrics (received from the trace-agent), we can plan to extend the behaviour that was added to the remap transform. This would translate to the following kind of config, which is easy to read and easy to adapt:

```toml
[sources.dd_agents]
  type = "datadog_agent"
  address = "[::]:8081"

[sinks.dd_traces]
  type = "datadog_traces"
  inputs = ["dd_agents.traces", "dd_agents.apm_stats"]

[sinks.dd_logs]
  type = "datadog_logs"
  inputs = ["dd_agents.logs"]

[sinks.dd_metrics]
  type = "datadog_metrics"
  inputs = ["dd_agents.metrics"]

[sinks.debug]
  type = "console"
  # Optionally the non-suffixed name could receive everything, this will be configurable
  inputs = ["dd_agents"]
  encoding.codec = "json"
```

The datadog_traces sink will receive those events (metrics and/or logs depending on the implementation) and perform the reverse conversion, provided that the expected tags are present.

Regarding the Datadog trace-agent config, the APM stats endpoint is the same as the trace one (apm_config.apm_dd_url config key), so there is nothing else to configure.

Finally, API key management will be the same as it is for other Datadog sources/sinks.

Implementation

Each group is relatively independent:

Reorganise the datadog_agent source:
- Extend the named_outputs feature, which is available for transforms, to sources so they can expose multiple named outputs (<SRC_ID>.<OUTPUT_NAME>). The feature for transforms was initially added in this PR; subsequent work on it is tracked [here][named-outputs-improvements].
- Add the following named_outputs in the datadog_agent source: <SRC_ID>.traces, <SRC_ID>.apm_stats, <SRC_ID>.metrics, <SRC_ID>.logs. Note that the non-suffixed output should have a predictable behaviour, so we could add a knob named top_level_output that would allow the user to choose which data comes out of the suffix-less output.
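The routing implied by these named outputs and by the proposed top_level_output knob can be sketched as follows. This is a hypothetical illustration, not the named-outputs implementation: `DataKind`, `TopLevelOutput` and `outputs_for` are made-up names, and the exact semantics of the knob are still open.

```rust
/// Kinds of data the `datadog_agent` source would emit.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataKind {
    Logs,
    Metrics,
    Traces,
    ApmStats,
}

impl DataKind {
    /// Suffix used in `<SRC_ID>.<OUTPUT_NAME>`.
    pub fn output_suffix(self) -> &'static str {
        match self {
            DataKind::Logs => "logs",
            DataKind::Metrics => "metrics",
            DataKind::Traces => "traces",
            DataKind::ApmStats => "apm_stats",
        }
    }
}

/// One possible shape for the `top_level_output` knob (an assumption).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TopLevelOutput {
    /// The suffix-less output receives every event.
    All,
    /// The suffix-less output receives nothing.
    Nothing,
    /// The suffix-less output receives only the listed kinds.
    Only(Vec<DataKind>),
}

/// Names of the outputs an event of the given kind is sent to.
pub fn outputs_for(source_id: &str, kind: DataKind, top_level: &TopLevelOutput) -> Vec<String> {
    // The suffixed output always receives the event.
    let mut outputs = vec![format!("{}.{}", source_id, kind.output_suffix())];
    let on_top_level = match top_level {
        TopLevelOutput::All => true,
        TopLevelOutput::Nothing => false,
        TopLevelOutput::Only(kinds) => kinds.contains(&kind),
    };
    if on_top_level {
        outputs.push(source_id.to_string());
    }
    outputs
}
```

With a source named dd_agents and the knob set to send everything, a log event would go to both dd_agents.logs and dd_agents, which matches the debug sink in the config above.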

APM stats support would be done according to the following plan:

Import all APM stats as standard Vector metrics
- Turn each ClientGroupedStats into relevant metrics with all possible metadata, to allow the lossless pass-through scenario and the same level of filtering/routing we can achieve for traces. APM stats sketches would then be converted to the Vector internal sketches. Vector internal sketches would then get parameterized gamma and maxbin that would still default to the agent sketch value.
- The upcoming datadog_traces sink would then receive incoming APM stats metrics along with traces. It will reaggregate those metrics according to the relevant dimensions using the Partitioner trait to rebuild APM stats payloads:
  - Incoming metrics will be buffered and will populate a struct matching the APM stats base object. Those structs will be stored in a map according to the very same kind of [keys][trace-stats-agg-key] used by the trace-agent.
  - Every 10 seconds (the sending interval of the trace-agent), the sink would serialize and flush those to Datadog. To account for late metrics, the sink would have to keep 2 or 3 buckets in the past and delay flushing accordingly. This would rely on the [bucket timestamp][btime] kept by the trace-agent and [stored in APM stats payload][csb-start].
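The buffering and delayed flushing described in the plan above can be sketched with a map of time buckets. This is a minimal illustration under stated assumptions: `AggKey` is a reduced, hypothetical key (the trace-agent key has more dimensions), `Counts` only covers three of the values, and `StatsAggregator` is not an existing Vector type. Only the 10 second bucket size and the idea of keeping a few past buckets come from the text.

```rust
use std::collections::BTreeMap;

/// Bucket size in nanoseconds: 10 seconds, the trace-agent sending interval.
pub const BUCKET_NS: u64 = 10_000_000_000;

/// Reduced, hypothetical aggregation key.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct AggKey {
    pub service: String,
    pub name: String,
    pub resource: String,
}

/// Values accumulated per key (sketches omitted).
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Counts {
    pub hits: u64,
    pub errors: u64,
    pub duration: u64,
}

/// Buffers incoming values per time bucket and per key.
#[derive(Debug, Default)]
pub struct StatsAggregator {
    // bucket start (ns) -> aggregation key -> accumulated counts
    buckets: BTreeMap<u64, BTreeMap<AggKey, Counts>>,
}

impl StatsAggregator {
    /// Add values to the bucket containing `timestamp_ns`.
    pub fn record(&mut self, timestamp_ns: u64, key: AggKey, counts: Counts) {
        // Align the timestamp on a bucket boundary.
        let start = timestamp_ns - timestamp_ns % BUCKET_NS;
        let entry = self.buckets.entry(start).or_default().entry(key).or_default();
        entry.hits += counts.hits;
        entry.errors += counts.errors;
        entry.duration += counts.duration;
    }

    /// Remove and return the buckets that are old enough to be flushed.
    /// The current bucket and the `keep` buckets before it stay buffered so
    /// that late metrics can still be merged into them.
    pub fn flush(&mut self, now_ns: u64, keep: u64) -> Vec<(u64, BTreeMap<AggKey, Counts>)> {
        let current = now_ns - now_ns % BUCKET_NS;
        let cutoff = current.saturating_sub(keep.saturating_mul(BUCKET_NS));
        // `split_off` returns the entries with a start >= cutoff: those are retained.
        let retained = self.buckets.split_off(&cutoff);
        let ready = std::mem::replace(&mut self.buckets, retained);
        ready.into_iter().collect()
    }
}
```

A real sink would call `flush` on a 10 second timer and serialize each returned bucket into an APM stats payload. The trade-off of a larger `keep` is more tolerance for late metrics at the cost of a longer delay before stats reach Datadog.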

Later, once sketches-rs is production ready:

Switch Vector sketches to use the sketches-rs crate instead of the Agent-based implementation
Then relocate the conversion logic: sketch conversion will no longer be needed in traces handling, but the datadog_agent source and the datadog_metrics sink will then have to handle conversion between Vector sketches (plain ddsketches) and the agent variation (note: it's on the crate roadmap to offer that kind of conversion)

[btime]: https://github.com/DataDog/datadog-agent/blob/dc2f202/pkg/trace/stats/concentrator.go#L148-L159
[csb-start]: https://github.com/DataDog/datadog-agent/blob/dc2f202/pkg/trace/pb/stats.proto#L47

Rationale

We should keep valuable metrics that are relevant to end users.
Dropping APM stats would cause current users to lose some insight into execution time.

Drawbacks

None identified so far.

Prior Art

N/A.

Alternatives

Regarding the fact that we could ignore/drop incoming APM stats:

Either completely drop APM stats, but this is not really an option as it would degrade the user experience
Or disable sampling on the trace-agent side and compute APM stats in the datadog_traces sink. This could work, but it is a lot for an initial implementation (it would require plain ddsketch support on top of the computation logic), and to match the accuracy of current APM stats Vector would have to receive 100% of traces, which may not always be possible. However, this would pave the way for generic APM stats computation wherever the traces come from.

Regarding the internal representation, APM stats could alternatively be represented by a log event with some numerical fields. As stated above, a hybrid approach, such as allowing a log event to have metric fields or introducing a metric event that can hold multiple values, could also be a solution.

Regarding sketches, those from APM stats are not exactly the same as the internal representation we have in Vector, so converting them to the internal representation will require some plumbing. This could be avoided by not decoding those sketches at all and keeping them as opaque data/raw byte slices inside Vector.

About the source(s) reorganisation, an alternative that avoids the work of implementing <source_id>.<suffix> outputs would be to handle each kind of Datadog Agent data in a dedicated source:

Either the datadog_agent source is adjusted to be configurable with a type setting (that could be set to logs, metrics or traces)
Or source types are mapped to Datadog types: datadog_logs, datadog_metrics & datadog_traces (datadog_agent would probably become an alias for datadog_logs or datadog_metrics before being deprecated)

This would lead to the following config, functionally identical to the snippet above, a bit longer but still very straightforward and easily readable (note that having multiple binding addresses may translate to more parameters in later work around Helm charts):

```toml
[sources.dd_in_logs]
  type = "datadog_logs"
  address = "[::]:8081"

[sources.dd_in_metrics]
  type = "datadog_metrics"
  address = "[::]:8082"

[sources.dd_in_traces]
  type = "datadog_traces"
  address = "[::]:8083"

[sinks.dd_traces]
  type = "datadog_traces"
  inputs = ["dd_in_traces"]

[sinks.dd_out_logs]
  type = "datadog_logs"
  inputs = ["dd_in_logs"]

[sinks.dd_out_metrics]
  type = "datadog_metrics"
  inputs = ["dd_in_metrics"]

[sinks.debug]
  type = "console"
  inputs = ["dd_in_*"]
  encoding.codec = "json"
```

Outstanding Questions

None.

Plan Of Attack

Implement the multiple outputs per source option
Implement APM stats decoding to Vector metrics
Add APM stats support to the datadog_traces sink

Depending on the timeline of sketches-rs, this point sits on the edge between this section and the next one:

Switch Vector to use sketches-rs internally instead of the Agent variant

Future Improvements

Compute APM stats in the datadog_traces sink for any trace format.
Overall, most improvements suggested in the Datadog trace RFC apply here, and having constraints on the schema (in the case where we represent APM stats as log events) would be particularly useful.
