docs/design/proposals/tracing-proposals.md
This document summarises a set of proposals triggered by the tracing documentation PR.
This section explains some terminology required to understand the proposals. Further details can be found in the tracing documentation PR.
| Trace mode | Description | Use-case |
|---|---|---|
| Static | Trace agent from startup to shutdown | Entire lifespan |
| Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" |
| Trace type | Description | Use-case |
|---|---|---|
| isolated | traces all relate to single component | Observing lifespan |
| collated | traces "grouped" (runtime+agent) | Understanding component interaction |
| Lifespan | Trace mode | Trace type |
|---|---|---|
| short-lived | static | collated if possible, else isolated? |
| long-running | dynamic | collated? (to see interactions) |
Implement all trace types and trace modes for the agent.
Why?
Maximum flexibility.
Counterargument:
Because adding tracing is intrusive, we have learnt that landing small, incremental changes is simpler and quicker.
Compatibility with Kata 1.x tracing.
Counterargument:
Agent tracing in Kata 1.x was extremely awkward to set up (to the extent that it's unclear how many users actually used it).
This point, coupled with the new architecture for Kata 2.x, suggests that we may not need to supply the same set of tracing features (in fact, they may not make sense).
All tracing will be static.
Why?
Because dynamic tracing will always be "partial".
In fact, not only would it be only a "snapshot" of activity, it may not even be possible to create a complete "trace transaction". If this is true, the trace output would be partial and would appear "unstructured".
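To make the "partial trace" concern concrete, here is a minimal Python sketch. It assumes a heavily simplified span record: the `Span` type, its field names, the span names, and the `find_orphans` helper are all hypothetical and are not Kata or OpenTelemetry APIs. It shows how spans captured after tracing is toggled on mid-operation can refer to a parent span that was never recorded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not an OpenTelemetry type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace
    name: str


def find_orphans(spans: List[Span]) -> List[Span]:
    """Return spans whose parent span was never recorded."""
    recorded = {s.span_id for s in spans}
    return [
        s for s in spans
        if s.parent_id is not None and s.parent_id not in recorded
    ]


# Static tracing: the root span and both children are recorded.
# The span names are illustrative only.
complete = [
    Span("a", None, "create_container"),
    Span("b", "a", "setup_rootfs"),
    Span("c", "a", "start_process"),
]

# Dynamic tracing toggled on after the root span had already started:
# only the children are captured.
partial = complete[1:]

orphans_complete = find_orphans(complete)
orphans_partial = find_orphans(partial)
print(len(orphans_complete), len(orphans_partial))
```

In the partial case every captured span is an orphan, so a trace viewer has no root to hang them on, which is one way the output could end up looking "unstructured".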
Agent tracing will be "isolated" by default.
Agent tracing will be "collated" if runtime tracing is also enabled.
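The two rules above can be sketched as a small decision function. This is only an illustration of the proposed behaviour: the function name and its boolean parameters are hypothetical and do not correspond to actual Kata configuration options.

```python
from typing import Optional


def agent_trace_type(agent_tracing: bool, runtime_tracing: bool) -> Optional[str]:
    """Pick the agent trace type under the proposal (hypothetical helper).

    Returns None when agent tracing is disabled, "collated" when runtime
    tracing is also enabled, and "isolated" otherwise.
    """
    if not agent_tracing:
        return None
    return "collated" if runtime_tracing else "isolated"


for agent_on, runtime_on in [(False, False), (True, False), (True, True)]:
    print(agent_on, runtime_on, agent_trace_type(agent_on, runtime_on))
```

Under this sketch the user never selects the trace type directly; it follows from which components have tracing enabled.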
Why?
Are your containers long-running or short-lived?
Would you ever need to turn on tracing "briefly"?
If "yes", is a "partial trace" useful or useless?
Likely to be considered useless as it is a partial snapshot. Alternative tracing methods may be more appropriate than dynamic OpenTelemetry tracing.
Are you happy to stop a container to enable tracing? If "no", dynamic tracing may be required.
Would you ever want to trace the agent and the runtime "in isolation" at the same time?
If "yes", we need to fully implement trace_type=isolated.
This seems unlikely though.
The second set of proposals affects the way traces are collected.
Currently:
trace-forwarder component.

Kata agent tracing overview:

    +-------------------------------------------+
    | Host                                      |
    |                                           |
    | +-----------+                             |
    | | Trace     |                             |
    | | Collector |                             |
    | +-----+-----+                             |
    |       ^                  +--------------+ |
    |       | spans            | Kata VM      | |
    | +-----+-----+            |              | |
    | | Kata      |   spans    |     +-----+  | |
    | | Trace     |<-----------------|Kata |  | |
    | | Forwarder |   VSOCK    |     |Agent|  | |
    | +-----------+   Channel  |     +-----+  | |
    |                          +--------------+ |
    +-------------------------------------------+

Currently:
If agent tracing is enabled but the trace forwarder is not running, the agent will error.
If the trace forwarder is started but Jaeger is not running, the trace forwarder will error.
The runtime and agent should:
Kata should support more trace collection software or SaaS
(for example Zipkin, Datadog).
Trace collection should not block normal runtime/agent operations
(for example if vsock-exporter/Jaeger is not running, Kata Containers should work normally).
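One way to keep trace collection from blocking normal operation is to place a bounded, non-blocking buffer between the code that creates spans and the code that exports them. The Python sketch below illustrates that idea only: the class and method names are hypothetical, and the real runtime and agent are not written in Python.

```python
import queue


class NonBlockingSpanBuffer:
    """Bounded span buffer that drops spans instead of blocking (hypothetical)."""

    def __init__(self, max_spans: int = 1024):
        self._queue = queue.Queue(maxsize=max_spans)
        self.dropped = 0

    def submit(self, span) -> bool:
        """Called on the hot path: never blocks and never raises."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, export) -> int:
        """Called off the hot path: export failures only count as drops."""
        sent = 0
        while True:
            try:
                span = self._queue.get_nowait()
            except queue.Empty:
                return sent
            try:
                export(span)
                sent += 1
            except Exception:
                self.dropped += 1


def broken_export(span):
    # Stands in for a forwarder or collector that is not running.
    raise ConnectionError("collector unavailable")


buffer = NonBlockingSpanBuffer(max_spans=2)
accepted = [buffer.submit({"name": f"span-{i}"}) for i in range(3)]
sent = buffer.drain(broken_export)
print(accepted, sent, buffer.dropped)
```

With this shape, a missing collector costs dropped spans rather than a failed or stalled container operation; whether silently dropping spans is acceptable is a design decision for the proposal itself.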
The Kata runtime and agent both send spans to the trace forwarder, which acts as a tracing proxy
and sends all spans to a tracing back-end, such as Jaeger or Datadog.
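As a rough sketch of the proxy idea, the forwarder can hold a set of pluggable back-end exporters and fan each incoming span out to all of them, isolating failures per back-end. Everything in this Python example is hypothetical (the `TraceForwarder` class, the `register` and `forward` methods, the span dictionary layout); real Jaeger or Datadog exporters are represented by plain callables.

```python
from typing import Callable, Dict, List

# A back-end exporter is modelled as any callable that accepts one span.
Exporter = Callable[[dict], None]


class TraceForwarder:
    """Hypothetical tracing proxy that fans spans out to back-ends."""

    def __init__(self) -> None:
        self._backends: Dict[str, Exporter] = {}

    def register(self, name: str, exporter: Exporter) -> None:
        self._backends[name] = exporter

    def forward(self, span: dict) -> List[str]:
        """Send the span to every back-end; return the names that failed."""
        failed = []
        for name, exporter in self._backends.items():
            try:
                exporter(span)
            except Exception:
                # One unreachable back-end must not affect the others.
                failed.append(name)
        return failed


def unreachable(span: dict) -> None:
    # Stands in for a back-end that is currently down.
    raise ConnectionError("back-end down")


jaeger_spans: List[dict] = []  # stand-in for a Jaeger exporter
forwarder = TraceForwarder()
forwarder.register("jaeger", jaeger_spans.append)
forwarder.register("datadog", unreachable)

# Spans from the runtime and the agent take the same path.
failed = forwarder.forward({"source": "agent", "name": "example-span"})
print(failed, len(jaeger_spans))
```

In this arrangement the runtime and agent only need to know how to reach the forwarder, and adding a new back-end is a change to the forwarder alone.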
Pros:
Cons:
Send spans to the collector directly from the runtime/agent. This proposal requires network access to the collector.
Pros:
Cons: