Kata Tracing proposals

Overview

This document summarises a set of proposals triggered by the tracing documentation PR.

Required context

This section explains some terminology required to understand the proposals. Further details can be found in the tracing documentation PR.

Agent trace mode terminology

Trace mode	Description	Use-case
Static	Trace agent from startup to shutdown	Entire lifespan
Dynamic	Toggle tracing on/off as desired	On-demand "snapshot"

Agent trace type terminology

Trace type	Description	Use-case
isolated	traces all relate to single component	Observing lifespan
collated	traces "grouped" (runtime+agent)	Understanding component interaction

Container lifespan

Lifespan	Trace mode	Trace type
short-lived	static	collated if possible, else isolated?
long-running	dynamic	collated? (to see interactions)

Original plan for agent

Implement all trace types and trace modes for agent.
Why?
- Maximum flexibility.
  
  Counterargument:
  
  Due to the intrusive nature of adding tracing, we have learnt that landing small incremental changes is simpler and quicker!
- Compatibility with Kata 1.x tracing.
  
  Counterargument:
  
  Agent tracing in Kata 1.x was extremely awkward to set up (to the extent that it's unclear how many users actually used it).
  
  This point, coupled with the new architecture for Kata 2.x, suggests that we may not need to supply the same set of tracing features (in fact, they may not make sense).

Agent tracing proposals

Agent tracing proposal 1: Don't implement dynamic trace mode

All tracing will be static.
Why?
- Because dynamic tracing will always be "partial"
  
  In fact, not only would it be only a "snapshot" of activity, it may not even be possible to create a complete "trace transaction". If this is true, the trace output would be partial and would appear "unstructured".
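
  To illustrate why a dynamic trace may appear "unstructured", here is a minimal simulation. The `Recorder` class, the span names, and the IDs are hypothetical and are not part of Kata Containers or OpenTelemetry. For simplicity it records each span when it is handed to the recorder. It only shows that spans captured after tracing is toggled on can refer to a parent span that was never captured.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    span_id: int
    parent_id: Optional[int]


@dataclass
class Recorder:
    """Hypothetical span recorder that can be switched on and off."""
    enabled: bool = False
    spans: List[Span] = field(default_factory=list)

    def record(self, span: Span) -> None:
        # Spans seen while tracing is off are lost.
        if self.enabled:
            self.spans.append(span)


def orphans(spans: List[Span]) -> List[Span]:
    """Return spans whose parent was never recorded."""
    seen = {s.span_id for s in spans}
    return [s for s in spans if s.parent_id is not None and s.parent_id not in seen]


recorder = Recorder()
recorder.record(Span("create_container", span_id=1, parent_id=None))  # tracing off: lost
recorder.enabled = True  # dynamic tracing toggled on mid-operation
recorder.record(Span("mount_rootfs", span_id=2, parent_id=1))
recorder.record(Span("start_process", span_id=3, parent_id=1))

orphan_names = [s.name for s in orphans(recorder.spans)]
```

  In this simulation both recorded spans are orphans, because the root span that would tie them into one "trace transaction" was emitted before tracing was enabled.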

Agent tracing proposal 2: Simplify handling of trace type

Agent tracing will be "isolated" by default.
Agent tracing will be "collated" if runtime tracing is also enabled.
Why?
- Offers a graceful fallback for agent tracing if runtime tracing is disabled.
- Simpler code!
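
The selection rule in this proposal can be sketched as a single function. This is only an illustration: the function name and its boolean parameters are hypothetical and do not correspond to actual Kata configuration options.

```python
from typing import Optional


def agent_trace_type(agent_tracing: bool, runtime_tracing: bool) -> Optional[str]:
    """Return the trace type the agent would use, or None if agent tracing is off."""
    if not agent_tracing:
        return None
    # Collate only when the runtime is also producing spans to group with.
    return "collated" if runtime_tracing else "isolated"
```

With this rule the user never chooses a trace type explicitly: enabling agent tracing alone gives isolated traces, and also enabling runtime tracing gives collated traces.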

Questions to ask yourself (part 1)

Are your containers long-running or short-lived?
Would you ever need to turn on tracing "briefly"?
- If "yes", is a "partial trace" useful or useless?
  
  A partial trace is likely to be considered useless, as it is only a snapshot. Alternative tracing methods may be more appropriate than dynamic OpenTelemetry tracing.

Questions to ask yourself (part 2)

Are you happy to stop a container to enable tracing? If "no", dynamic tracing may be required.
Would you ever want to trace the agent and the runtime "in isolation" at the same time?
- If "yes", we need to fully implement trace_type=isolated
  
  This seems unlikely though.

Trace collection

The second set of proposals affects the way traces are collected.

Motivation

Currently:

The runtime sends trace spans to Jaeger directly.
The agent will send trace spans to the trace-forwarder component.
The trace forwarder will send trace spans to Jaeger.

Kata agent tracing overview:

+-------------------------------------------+
| Host                                      |
|                                           |
| +-----------+                             |
| | Trace     |                             |
| | Collector |                             |
| +-----+-----+                             |
|       ^                  +--------------+ |
|       | spans            | Kata VM      | |
| +-----+-----+            |              | |
| | Kata      |    spans   |     +-----+  | |
| | Trace     |<-----------------|Kata |  | |
| | Forwarder |    VSOCK   |     |Agent|  | |
| +-----------+    Channel |     +-----+  | |
|                          +--------------+ |
+-------------------------------------------+

Currently:

If agent tracing is enabled but the trace forwarder is not running, the agent will error.
If the trace forwarder is started but Jaeger is not running, the trace forwarder will error.

Goals

The runtime and agent should:
- Use the same trace collection implementation.
- Share as many configuration items as possible.
Kata should support more trace collection software or SaaS (for example Zipkin, Datadog).
Trace collection should not block normal runtime/agent operations (for example, if vsock-exporter/Jaeger is not running, Kata Containers should work normally).
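
One way to meet the non-blocking goal is to buffer spans in a bounded queue and drop them when the buffer is full or the back-end is unreachable. The sketch below is an assumption about how this could look, not the Kata implementation: `NonBlockingExporter`, `send`, and `max_pending` are hypothetical names, and in practice `flush` would run on a background thread.

```python
import queue
from typing import Callable


class NonBlockingExporter:
    """Hypothetical exporter: buffer spans in a bounded queue, drop on overflow."""

    def __init__(self, send: Callable[[dict], None], max_pending: int = 1024) -> None:
        self._send = send
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def export(self, span: dict) -> None:
        # Called on the hot path: never blocks and never raises.
        try:
            self._pending.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def flush(self) -> int:
        """Try to deliver pending spans; return how many were sent."""
        sent = 0
        while True:
            try:
                span = self._pending.get_nowait()
            except queue.Empty:
                return sent
            try:
                self._send(span)
                sent += 1
            except OSError:
                # Back-end unreachable: discard the span rather than fail.
                self.dropped += 1
```

The design choice here is to prefer losing spans over delaying or failing container operations, which matches the stated goal but means traces can be incomplete when the collector is down.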

Trace collection proposals

Trace collection proposal 1: Send all spans to the trace forwarder as a span proxy

The Kata runtime and agent both send spans to the trace forwarder, which acts as a tracing proxy and sends all spans to a tracing back-end such as Jaeger or Datadog.

Pros:

The runtime and agent will be simpler.
The trace collection target could be updated while Kata containers are running.

Cons:

Requires the trace forwarder component to be running, which adds an operational burden.
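
The proxy arrangement in this proposal can be sketched as follows. The `TraceForwarder` class and its methods are hypothetical and do not describe the existing trace forwarder's API; the back-ends are stood in for by plain Python lists so the example stays self-contained.

```python
from typing import Callable, List

Span = dict
Backend = Callable[[Span], None]


class TraceForwarder:
    """Hypothetical span proxy sitting between Kata components and a back-end."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def set_backend(self, backend: Backend) -> None:
        # Retarget collection without touching the runtime or agent.
        self._backend = backend

    def handle(self, span: Span) -> None:
        # Runtime and agent both call this single entry point.
        self._backend(span)


jaeger: List[Span] = []   # stand-in for a Jaeger collector
datadog: List[Span] = []  # stand-in for a Datadog collector

forwarder = TraceForwarder(jaeger.append)
forwarder.handle({"source": "runtime", "name": "create"})
forwarder.set_backend(datadog.append)  # switch target while containers run
forwarder.handle({"source": "agent", "name": "exec"})
```

Because only the forwarder knows about the back-end, the runtime and agent need a single export path, and the target can change without restarting containers.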

Trace collection proposal 2: Send spans to collector directly from runtime/agent

The runtime and agent send spans directly to the collector. This requires both to have network access to the collector.

Pros:

No additional trace forwarder component needed.

Cons:

Needs more code and configuration to support all trace collectors.
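
The extra code this proposal implies can be pictured as a per-collector exporter table that both the runtime and the agent would have to carry. The sketch below is illustrative only: the `collector` configuration key, the `EXPORTERS` table, and the placeholder exporters are all hypothetical, and real exporters would implement each collector's wire protocol.

```python
from typing import Callable, Dict

Span = dict
Exporter = Callable[[Span], str]

# Hypothetical per-collector exporters; real ones would speak each wire protocol.
EXPORTERS: Dict[str, Exporter] = {
    "jaeger": lambda span: f"jaeger <- {span['name']}",
    "zipkin": lambda span: f"zipkin <- {span['name']}",
}


def make_exporter(config: Dict[str, str]) -> Exporter:
    """Pick an exporter from a (hypothetical) `collector` configuration key."""
    name = config.get("collector", "jaeger")
    try:
        return EXPORTERS[name]
    except KeyError:
        raise ValueError(f"unsupported trace collector: {name}") from None
```

Each newly supported collector adds an entry to the table and, most likely, collector-specific configuration items, which is the cost listed above.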

Future work

We could add dynamic and fully isolated tracing at a later stage, if required.

Further details

Summary

Timeline

2021-07-01: A summary of the discussion was posted to the mailing list.
2021-06-22: These proposals were discussed in the Kata Architecture Committee meeting.
2021-06-18: These proposals were announced on the mailing list.

Outcome

Nobody opposed the agent proposals, so they are being implemented.
The trace collection proposals are still being considered.