rfcs/2021-10-15-9572-accept-datadog-traces.md
This RFC describes a change that would enable Vector to accept traces coming out of Datadog Agents. It is part of the global effort to enable Vector to ingest & process traffic from Datadog Agents; Vector internal tracing has its own RFC.
Official "Datadog Agent" bundles (rpm/deb/msi/container image) actually ship multiple binaries, collectively named "agents". Each of these "agents" is tasked with collecting some data. For example, "the core agent" (often shortened to "the agent", because it was the first of those agents to be released) is the one collecting metrics and logs and running checks. There are other agents like the process-agent, the security-agent, or the trace-agent, but all of those are part of the official "Datadog Agent" distribution, and they all come out of the datadog-agent codebase. Here we focus on the trace-agent, one of the several binaries shipped along with the other agents.
Traces are collected by this specific agent, which comes with many dedicated configuration settings (usually under the `apm_config` prefix), but it also shares some global options, like `site`, with the other agents to select the Datadog region data is sent to. It exposes a local API that tracing libraries use to submit traces & profiling data.
It has several communication channels to Datadog:

- `trace.<SITE>` (can be overridden by the `apm_config.apm_dd_url` config key), where processed traces are sent.
- `intake.profile.<SITE>` (can be overridden by the `apm_config.profiling_dd_url` config key); profiles are not processed by the trace-agent and are relayed directly to Datadog.
- `http-intake.logs.<SITE>` (can be overridden by the `apm_config.debugger_dd_url` config key); it is fairly [recent] and unused as of October 2021.

Aggregated stats are sent back to Datadog to the same host as processed traces (`trace.<SITE>`). Tracer-side stats are supported since Agent 7.25; APM stats computed by the trace-agent itself are not strictly mandatory, but they produce very useful statistics. Profiling and tracing are enabled independently on traced applications, but they can be correlated once ingested at Datadog, mainly to refine a span with profiling data.
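For context, the override keys mentioned above live in the Agent configuration; a hypothetical `datadog.yaml` excerpt (values are illustrative only):

```yaml
site: datadoghq.eu
apm_config:
  # Overrides where processed traces are sent (defaults to trace.<SITE>)
  apm_dd_url: "https://trace.datadoghq.eu"
  # Overrides where profiles are relayed (defaults to intake.profile.<SITE>)
  profiling_dd_url: "https://intake.profile.datadoghq.eu"
```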
The trace-agent encodes data using protobuf; the `.proto` files are located in the datadog-agent repository. Trace-agent requests to the trace endpoint contain two major kinds of data: traces themselves and APM events.
On-going work to support event schemas would make it possible to express some constraints on an event structure. In this case, this would allow formalizing a trace schema while keeping the underlying data as a standard Vector event. The trace sink would then expect events following this schema.
Users would then be able to use:

- the `datadog_agent` source
- the `datadog_trace` sink
- the `datadog_agent` source & `datadog_trace` sink together

Vector does not support traces at the moment (a full JSON representation may be ingested as a log event), yet traces are a key part of observability. Therefore, users cannot use Vector for business-level use cases on trace data, like cost control and reduction, redacting PII, routing, and more.
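To make the intended usage concrete, here is a hypothetical Vector configuration combining both components; the component names, option names, and the transform shown are illustrative assumptions, not settled design:

```toml
# Hypothetical sketch -- names and options are not final.
[sources.dd_agents]
type = "datadog_agent"
address = "0.0.0.0:8080"

[transforms.scrub]
type = "remap"
inputs = ["dd_agents"]
source = '''
# e.g. redact PII from an assumed span meta field
.meta.usr = "[REDACTED]"
'''

[sinks.dd_traces]
type = "datadog_traces"
inputs = ["scrub"]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.com"
```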
A typical topology would be: `datadog_agent` source -> some filtering/enrichment transform -> `datadog_trace` sink, with the trace-agent pointed at Vector through the `apm_config.apm_dd_url` config key.

To keep vector-core as generic as possible, the first implementation will decode Datadog traces as `LogEvent`s; the
resulting events will be deeper than usual, but this should not be a problem. In order to distinguish traces from logs, the `Event` enum will get a new `Trace` variant that will wrap `LogEvent`.
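A minimal sketch of that variant, using a simplified stand-in for vector-core's actual `LogEvent` and `Event` types (the field layout here is illustrative, not the real vector-core definition):

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for vector-core's `LogEvent`: a map of fields.
#[derive(Debug, Clone)]
pub struct LogEvent {
    pub fields: BTreeMap<String, String>,
}

/// The `Event` enum with the proposed `Trace` variant: a trace is carried
/// as a (deeply nested) `LogEvent`, but tagged so downstream components
/// can tell traces apart from plain logs.
#[derive(Debug, Clone)]
pub enum Event {
    Log(LogEvent),
    Trace(LogEvent),
}

impl Event {
    /// A trace sink would accept only the `Trace` variant.
    pub fn is_trace(&self) -> bool {
        matches!(self, Event::Trace(_))
    }
}
```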
Upcoming work on the ability to validate a `LogEvent` against a schema would provide a nice way (performance questions aside) of ensuring that a `datadog_trace` sink receives properly structured `LogEvent`s.
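Until such schema validation lands, sink-side sanity checks could look like the following sketch. The `LogEvent` stand-in and the required field names (`trace_id`, `span_id`, `service`) are assumptions for illustration, not a final schema:

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for vector-core's `LogEvent`.
pub struct LogEvent {
    pub fields: BTreeMap<String, String>,
}

/// Sanity-check a trace-shaped `LogEvent` before encoding it for Datadog.
pub fn looks_like_a_trace(event: &LogEvent) -> Result<(), String> {
    // Field names are illustrative assumptions, not a settled schema.
    for required in ["trace_id", "span_id", "service"] {
        if !event.fields.contains_key(required) {
            return Err(format!("missing required trace field `{}`", required));
        }
    }
    Ok(())
}
```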
Based on the aforementioned work, the following source & sink additions would have to be done:
- A `datadog_agent` source addition that decodes incoming gzip'ed protobuf over HTTP to `LogEvent`s; the `.proto` files are located in the datadog-agent repository.
- A `datadog_trace` sink that does the opposite conversion and sends traces to Datadog, to the relevant region according to the sink config.

The `datadog_agent` source addition would materialize as a new filter (like the one dedicated to receiving logs), ideally with the trace decoding logic colocated in its own source file (`./src/sources/datadog/traces.rs`). The filter would be attached to the warp server upon a new configuration flag; this way the trace-related code would be isolated. The new configuration flags would be three booleans, for logs, metrics, and traces, enabling/disabling each datatype. This way the user can multiplex all three datatypes over a single socket, or dedicate a socket to one or more datatypes, at their convenience.
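The three boolean flags could be sketched as follows; the flag names are illustrative assumptions, not final option names:

```toml
# One socket multiplexing all three datatypes:
[sources.dd_all]
type = "datadog_agent"
address = "0.0.0.0:8080"
logs = true
metrics = true
traces = true

# ...or a dedicated socket accepting traces only:
[sources.dd_traces_only]
type = "datadog_agent"
address = "0.0.0.0:8081"
logs = false
metrics = false
traces = true
```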
Datadog API key management would be the same as it is for Datadog logs & metrics.
Regarding APM stats: if we envision the `datadog_trace` sink as a universal sender for any kind of traces ingested by Vector, it shall ultimately support computing APM stats, even if the stats payload is a bit complex (it includes DDSketches), as these provide valuable insight into ingested traces. The Datadog OTLP traces exporter also computes those stats. How Vector will handle APM stats is discussed in its own RFC.
Using `LogEvent`s to represent traces implies that, until schemas are available, the format a trace sink expects cannot be simply expressed, and the sink will have to implement various sanity checks to ensure that received events are properly structured.

Instead of wrapping the `LogEvent` type, a new concrete `Trace` type could be added to the `Event` enum.
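For reference, a minimal sketch of what such a concrete type could look like; the field names are illustrative assumptions drawn from common span attributes, not a settled design:

```rust
use std::collections::BTreeMap;

/// Hypothetical concrete span representation -- an alternative to wrapping
/// `LogEvent`. Field names are illustrative, not final.
#[derive(Debug, Clone)]
pub struct Span {
    pub trace_id: u64,
    pub span_id: u64,
    pub parent_id: Option<u64>,
    pub service: String,
    pub resource: String,
    pub start_ns: i64,
    pub duration_ns: i64,
    pub meta: BTreeMap<String, String>,
}

/// A trace is a collection of spans sharing one trace id.
#[derive(Debug, Clone)]
pub struct Trace {
    pub spans: Vec<Span>,
}

impl Trace {
    /// All spans in a single trace share the same trace id.
    pub fn trace_id(&self) -> Option<u64> {
        self.spans.first().map(|s| s.trace_id)
    }
}
```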
None.
- Add a `Trace` variant to the `Event` enum.
- Add `datadog_agent` source support, emitting a `LogEvent` for each trace and each APM event. This will re-organize the source to isolate generic code from datatype-specific code. APM stats will be dropped at this point.
- Add a `datadog_trace` sink that turns relevant `LogEvent`s back into Datadog protobuf-encoded traces.
- Support the `vector.traces.url` & `vector.traces.enabled` Datadog Agent settings.