docs/rfc/0016-otlp-native-logging-pipeline.md
test/integration/otel/, not the chart).
Streaming --follow is implemented (the kubernetes driver follows the pod stream, the loki driver opens the /tail WebSocket).
Control-plane OTLP log push is implemented (opt-in via OTEL_LOGS_ENABLED): a LoggerProvider + otlploggrpc exporter in pkg/utils/otel, and a zap→OTLP bridge in loggerfactory so control-plane logs (carrying trace_id) are pushed alongside traces.
InfluxDB is deprecated (the driver warns and is disabled by default).
Everything in this repo is now implemented; only the env-image per-line helpers remain, and those live in the separate fission/environments repo.
See "As implemented".

X-Fission-Request-ID this pipeline keys log correlation on), RFC-0017 (planned; the CLI surfaces --request-id log queries), RFC-0008 (streaming responses share the same access record).

Fission's logging stack is built on an end-of-life technology: InfluxDB v1.x (EOL December 2024), shipped via a Fluent-Bit DaemonSet whose only output is InfluxDB.
The query layer (pkg/fission-cli/logdb) hardcodes two drivers in a switch, --follow is one-second polling rather than streaming, and there is no per-invocation correlation: a raw user stdout line carries no request-ID, so fission function logs cannot answer "show me the logs for this request."
This RFC re-bases Fission logging on OpenTelemetry conventions: function and control-plane logs are emitted as structured records to stdout (the router gains a per-invocation access record), which an operator-run external collector ingests and routes to any backend (Loki, Elastic, Datadog, Cloud Logging) — Fission bundles no collector of its own.
fission function logs keeps working through a Loki reference adapter, gains request-ID / trace-ID / level filtering and real streaming, and the zero-dependency kubernetes driver remains the default so a bare install needs nothing new.
InfluxDB and Fluent-Bit are deprecated on a clear migration timeline, not removed out from under anyone.
- pkg/fission-cli/logdb/influxdb.go targets InfluxDB v1.x and even carries schema-evolution fallback-column shims (a TODO about fluentd→fluent-bit index renames). The backend is past end of life; v2.x is not a drop-in.
- GetLogDB (pkg/fission-cli/logdb/logdb.go:70) is a two-case switch over influxdb and kubernetes; the error string literally reads "now only support influxdb". Adding a backend means editing this function and shipping a new driver that Fission must maintain.
- LogFilter filters by pod and time window only; there is no request-ID, trace-ID, or level field. A user's print() line is a raw string with no invocation context.
- fission function logs --follow (pkg/fission-cli/cmd/function/log.go) re-queries every second, which imposes a latency floor and adds load.
- influxdb.enabled=false), so most users fall back to the kubernetes driver: direct Pods().GetLogs(), bounded by pod retention, single-pod, no search.

Every comparable platform gives structured, per-request, searchable logs (CloudWatch, Cloud Logging). The cloud-native answer is not "write more backend drivers"; it is to speak OTLP and let the OpenTelemetry Collector fan out to whatever the operator already runs.
fission function logs keeps working against one supported open backend.

GetLogDB into a registry.

--follow with real streaming.

X-Fission-Request-ID and trace ID.

pkg/utils/otel/provider.go: this RFC extends it with logs.

fission/environments repo); this RFC defines the contract they implement.

     CONTROL PLANE (fission-bundle)                   FUNCTION PODS (any namespace)
    ┌──────────────────────────────────────┐         ┌────────────────────────────────┐
    │ router → ACCESS RECORD (structured   │         │ env runtime: stdout/stderr     │
    │ stdout): request-id, trace-id, fn.*, │         │ (+ optional structured JSON    │
    │ backend, status, latency             │         │  carrying request-id)          │
    │ - DISPLAY_ACCESS_LOG, opt-in         │         └───────────────┬────────────────┘
    └──────────────────┬───────────────────┘                         │ pod stdout
                       │ pod stdout                                  │
                       └──────────────────┬──────────────────────────┘
                                          ▼
                   ┌──────────────────────────────────────────────┐
                   │ EXTERNAL collector - operator-run, NOT       │
                   │ bundled by Fission: OTel Collector (via the  │
                   │ OTel Operator) / Promtail / Vector / Fluent  │
                   │ Bit. Scrapes pod stdout (+ optional OTLP     │
                   │ push), labels by fission.function.* → backend│
                   └───────────────────┬──────────────────────────┘
                       ┌───────────────┼───────────────┐
                       ▼               ▼               ▼
                      Loki          Elastic         Datadog …
                       ▲
                       │ LogQL (query_range)
             ┌─────────┴──────────┐
             │ fission function   │  logdb registry → loki (reference) | kubernetes (zero-dep default)
             │ logs --request-id  │
             └────────────────────┘

Fission emits the data; the operator's existing pipeline collects it. There is no Fission-owned collector container.
Fission does not bundle a log collector. Every Fission component (and every function pod) writes structured records to stdout, where the operator's existing log pipeline already collects pod logs and ships them to the backend. This keeps the chart lean and lets operators own their observability stack (the RFC non-goal "operators own the backend").
Fission's contribution to collection is the data, not the pipeline:
stdout/stderr: collected like any pod's logs and labeled from the function pod's labels (functionUid, functionName, functionNamespace, environmentName, executorType), which a k8sattributes-style processor promotes to the fission.function.* labels the read-path adapter queries.

An operator wires their collector to scrape the fission/function-namespace pod logs (OTel filelog+k8sattributes, Promtail kubernetes_sd, …) and export to Loki (native OTLP endpoint or push), mapping the pod labels to fission_function_uid etc. so that fission function logs finds them.
The legacy in-chart Fluent-Bit→InfluxDB shipper and the pkg/logger symlink DaemonSet are retained behind influxdb.enabled (default false) but deprecated; they are NOT extended.
The router emits one structured access record per invocation — the one place that always sees the request-id, the trace-id, the function identity, and the chosen backend together — from functionHandler.logAccessRecord (called by collectFunctionMetric).
Fields: fission.request.id, trace_id, fission.function.{name,uid,namespace}, http.{method,path,status_code}, backend, retry, duration_ms.
It is opt-in via the existing DISPLAY_ACCESS_LOG flag (chart router.displayAccessLog, default false) — previously an orphaned env var, now wired — so it adds no per-request log volume unless an operator wants log-based correlation.
fission function logs --request-id <id> resolves an invocation to its function and time window via these records once the operator's collector ships them to the backend.
Fission keeps instrumenting metrics with the Prometheus client library (mature, pull-based, the Kubernetes-native standard the chart's ServiceMonitor/PodMonitor and shipped Grafana dashboards are built on) — it does not migrate the ~39 metric definitions to the OpenTelemetry metrics API.
That migration would be high-churn (every call site) and break backward compatibility: the OTel→Prometheus exporter renames series (otel_scope_* labels, target_info, unit/counter suffixes), which the dashboards and operators' alerts query by exact name.
The Prometheus client carries no maintenance burden here (plain counters/gauges/histograms; no custom collectors, native histograms, or exemplars), so there is little to gain and real risk to take.
Instead, the same operator-run collector unifies metrics into the OpenTelemetry pipeline at the collection layer: its prometheus receiver scrapes the existing /metrics:8080 endpoints (exactly the targets the ServiceMonitor already selects, by the svc: label) and converts them to OTLP, alongside the logs (filelog) and the access record.
This gives the vendor-neutral, single-pipeline benefit with zero instrumentation change and full backward compatibility — the Prometheus scrape path keeps working in parallel.
A reference config is in test/integration/otel/metrics-collector.reference.yaml.
A process-native OTel MeterProvider (for in-process exemplars linking metrics↔traces, or push) remains a possible future opt-in — gated like OTEL_LOGS_ENABLED, with name-preserving exporter options — not a wholesale migration.
By default control-plane components (router, executor, …) log structured records to stdout, which the operator's external collector scrapes like any pod — so no Fission code change is required for collection.
As an optional push transport (phase 4), Fission could add a log exporter and LoggerProvider to pkg/utils/otel/provider.go, mirroring the existing trace exporter (otlploggrpc + go.opentelemetry.io/otel/sdk/log, a go.mod bump), and bridge Fission's logr/zap loggers so logger.Info(...) also pushes OTLP records (carrying the trace_id from LoggerWithTraceID) to the operator's OTEL_EXPORTER_OTLP_ENDPOINT.
This is additive — the stdout path already works without it — and is deferred so collection stays infra-free by default.
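The status notes in this document describe the push path as opt-in: it requires OTEL_LOGS_ENABLED in addition to the OTLP endpoint, so that enabling traces alone does not start pushing logs. A minimal sketch of such a gate follows. The function name `logPushEnabled` and the use of `strconv.ParseBool` are assumptions for illustration, not the actual code in pkg/utils/otel.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// logPushEnabled reports whether the optional OTLP log push should start.
// Assumption: both variables must be set. How Fission actually parses
// OTEL_LOGS_ENABLED is not specified here; ParseBool is a guess.
func logPushEnabled(getenv func(string) string) bool {
	if getenv("OTEL_EXPORTER_OTLP_ENDPOINT") == "" {
		return false // no collector endpoint, so there is nothing to push to
	}
	enabled, err := strconv.ParseBool(getenv("OTEL_LOGS_ENABLED"))
	return err == nil && enabled
}

func main() {
	if logPushEnabled(os.Getenv) {
		fmt.Println("would install the OTLP LoggerProvider and the zap bridge")
		return
	}
	fmt.Println("log push disabled; stdout remains the only log transport")
}
```

Taking the environment lookup as a parameter keeps the gate testable without mutating the process environment.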
A user's raw print("hello") produces a line with no request-ID.
The honest options:
| Option | How | Pros | Cons |
|---|---|---|---|
| (a) Runtime-side structured logging | Env-image helper reads X-Fission-Request-ID/traceparent from the request and emits JSON {ts, level, msg, fission.request.id, trace_id}; the Collector's json_parser promotes fields | Exact per-line correlation; carries level | Needs env-image changes (separate repo); only correlates code that uses the helper — bare print() is not stamped; per-env work |
| (b) Collector-side join by pod + timestamp | Correlate raw lines to the router access log by pod + time window | No runtime change | Fragile: concurrent requests on one pod are ambiguous; clock skew; best-effort only |
| (c) Hybrid: router access record + best-effort enrichment | Router emits one structured access record per invocation (request-ID, trace-ID, fn name/uid/ns, status, latency, target pod); the query layer joins it to that pod's logs over the request window; env images additionally adopt (a) where feasible | Always-on correlation for every invocation with zero user/env change; canonical request→pod→time mapping; (a) layers per-line precision on top | Raw lines within a pod's window are pod+time-correlated (inherits (b)'s ambiguity under true concurrency); precise only where (a) is used |
Recommendation: (c), with (a) as the precision upgrade. (c) is the only option that yields a useful answer for arbitrary user code on day one with no env change, because the router is the one place that always sees the request-ID, the trace-ID, and the chosen backend pod together.
The emission point already exists: functionHandler.collectFunctionMetric runs once per completed request with start, status, function, and the resolved pod in hand.
We add a structured access record there:

    accessLog.Info("function invoked",
    	"fission.request.id", reqID, // X-Fission-Request-ID (RFC-0015)
    	"trace_id", traceID,
    	"fission.function.name", fn.Name,
    	"fission.function.uid", string(fn.UID),
    	"fission.function.namespace", fn.Namespace,
    	"fission.target.pod", targetPod,
    	"http.status_code", status,
    	"duration_ms", duration.Milliseconds())

fission function logs --request-id <id> then (1) queries the access record to resolve request-ID → {pod, ns, [start, end]}, and (2) fetches that pod's logs over the window.
Documented limit, stated not hidden: per-line precision for concurrent requests on one pod requires option (a); the access record covers everything else.
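The two-step resolution and its documented limit can be sketched in a few lines. The types `AccessRecord` and `PodLogLine` and the helpers `resolve`, `linesInWindow`, and `sample` are hypothetical stand-ins, not the real logdb types; the sample data deliberately contains two overlapping requests on one pod to show the ambiguity.

```go
package main

import (
	"fmt"
	"time"
)

// AccessRecord is a hypothetical, trimmed-down view of the router access record.
type AccessRecord struct {
	RequestID string
	Pod       string
	Namespace string
	Start     time.Time
	End       time.Time
}

// PodLogLine is a hypothetical raw log line as a backend might return it.
type PodLogLine struct {
	Pod     string
	Time    time.Time
	Message string
}

// resolve performs step (1): request-ID -> access record.
func resolve(records []AccessRecord, reqID string) (AccessRecord, bool) {
	for _, r := range records {
		if r.RequestID == reqID {
			return r, true
		}
	}
	return AccessRecord{}, false
}

// linesInWindow performs step (2): keep the lines from the resolved pod
// whose timestamp falls inside [Start, End].
func linesInWindow(rec AccessRecord, lines []PodLogLine) []PodLogLine {
	var out []PodLogLine
	for _, l := range lines {
		if l.Pod != rec.Pod || l.Time.Before(rec.Start) || l.Time.After(rec.End) {
			continue
		}
		out = append(out, l)
	}
	return out
}

// sample returns two overlapping requests served by the same pod.
func sample() ([]AccessRecord, []PodLogLine) {
	t0 := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	records := []AccessRecord{
		{RequestID: "req-a", Pod: "pod-1", Namespace: "default", Start: t0, End: t0.Add(2 * time.Second)},
		{RequestID: "req-b", Pod: "pod-1", Namespace: "default", Start: t0.Add(time.Second), End: t0.Add(3 * time.Second)},
	}
	lines := []PodLogLine{
		{Pod: "pod-1", Time: t0.Add(500 * time.Millisecond), Message: "only req-a is in flight"},
		{Pod: "pod-1", Time: t0.Add(1500 * time.Millisecond), Message: "req-a and req-b overlap here"},
		{Pod: "pod-2", Time: t0.Add(time.Second), Message: "different pod"},
	}
	return records, lines
}

func main() {
	records, lines := sample()
	if rec, ok := resolve(records, "req-b"); ok {
		for _, l := range linesInWindow(rec, lines) {
			fmt.Println(l.Message)
		}
	}
}
```

The line logged at 1.5 s is returned for both req-a and req-b. That is the pod-plus-time ambiguity that only per-line stamping (option (a)) removes.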
Prioritize (a) for the highest-traffic envs (node, python, go) in the fission/environments repo.
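As a rough illustration of what an option (a) helper could look like in a Go environment, the sketch below reads the two headers named in the table and writes one JSON object per line with the listed fields. The type `requestLogger` and its methods are hypothetical; the real helpers belong to the fission/environments repo and may look quite different.

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// traceIDFromTraceparent extracts the trace-id field from a W3C traceparent
// header (version 00: "version-traceid-parentid-flags").
// It returns "" when the header is missing or malformed.
func traceIDFromTraceparent(h string) string {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return ""
	}
	return parts[1]
}

// requestLogger is a hypothetical per-request logger for an env runtime.
type requestLogger struct {
	out       io.Writer
	requestID string
	traceID   string
}

// newRequestLogger captures the correlation IDs from the incoming request.
func newRequestLogger(out io.Writer, r *http.Request) *requestLogger {
	return &requestLogger{
		out:       out,
		requestID: r.Header.Get("X-Fission-Request-ID"),
		traceID:   traceIDFromTraceparent(r.Header.Get("traceparent")),
	}
}

// log writes one JSON object per line so a collector's JSON parser can
// promote the fields.
func (l *requestLogger) log(level, msg string) error {
	return json.NewEncoder(l.out).Encode(map[string]string{
		"ts":                 time.Now().UTC().Format(time.RFC3339Nano),
		"level":              level,
		"msg":                msg,
		"fission.request.id": l.requestID,
		"trace_id":           l.traceID,
	})
}

func main() {
	// Simulate an invocation as the router would forward it.
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Fission-Request-ID", "req-123")
	req.Header.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")

	logger := newRequestLogger(os.Stdout, req)
	_ = logger.log("info", "hello")
}
```

Only code that logs through such a helper is stamped; a bare print still produces an uncorrelated line, as noted in the table.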
Evolve pkg/fission-cli/logdb/logdb.go.
LogFilter gains additive fields, a streaming interface is added, and GetLogDB becomes a registry:

    type LogFilter struct {
    	// ...existing fields unchanged...
    	RequestID string // X-Fission-Request-ID
    	TraceID   string
    	Level     string // info|warn|error|debug
    	Follow    bool
    }

    // StreamingLogDatabase is implemented by drivers that can push entries as they arrive.
    type StreamingLogDatabase interface {
    	LogDatabase
    	StreamLogs(context.Context, LogFilter, chan<- LogEntry) error
    }

    // Factory constructs a driver; each driver registers one under its name.
    type Factory func(ctx context.Context, opts LogDBOptions) (LogDatabase, error)

    var registry = map[string]Factory{}

    func Register(name string, f Factory) { registry[name] = f }

    func GetLogDB(dbType string, ctx context.Context, opts LogDBOptions) (LogDatabase, error) {
    	f, ok := registry[dbType]
    	if !ok {
    		// keys is a helper (not shown) that returns the registered driver names.
    		return nil, fmt.Errorf("unknown log database %q; supported: %s", dbType, strings.Join(keys(registry), ", "))
    	}
    	return f(ctx, opts)
    }

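The registry snippet calls a `keys` helper that is not shown. A self-contained sketch of one possible helper, together with two drivers registering themselves in init(), is given below. The one-method `LogDatabase` interface, the empty `LogDBOptions`, the `stubDB` type, and the `lookup` wrapper are placeholders assumed for the example; sorting the names is a design choice so that the error message is deterministic.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
)

// LogDBOptions and LogDatabase are placeholders for the real logdb types.
type LogDBOptions struct{}

type LogDatabase interface{ Name() string }

type Factory func(ctx context.Context, opts LogDBOptions) (LogDatabase, error)

var registry = map[string]Factory{}

func Register(name string, f Factory) { registry[name] = f }

// keys returns the registered driver names in sorted order so that the
// "supported" list in the error message is stable across runs.
func keys(m map[string]Factory) []string {
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func GetLogDB(dbType string, ctx context.Context, opts LogDBOptions) (LogDatabase, error) {
	f, ok := registry[dbType]
	if !ok {
		return nil, fmt.Errorf("unknown log database %q; supported: %s", dbType, strings.Join(keys(registry), ", "))
	}
	return f(ctx, opts)
}

// stubDB stands in for a real driver implementation.
type stubDB struct{ name string }

func (s stubDB) Name() string { return s.name }

// In the real layout each driver file would carry its own init().
func init() {
	for _, name := range []string{"kubernetes", "loki"} {
		name := name // capture a per-iteration copy for the closure
		Register(name, func(context.Context, LogDBOptions) (LogDatabase, error) {
			return stubDB{name: name}, nil
		})
	}
}

// lookup resolves a driver by name and returns the name it reports.
func lookup(dbType string) (string, error) {
	db, err := GetLogDB(dbType, context.Background(), LogDBOptions{})
	if err != nil {
		return "", err
	}
	return db.Name(), nil
}

func main() {
	_, err := lookup("splunk")
	fmt.Println(err)
}
```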
Drivers self-register in init():

- kubernetes_log.go registers kubernetes, kept as the zero-dependency default.
- pkg/fission-cli/logdb/loki.go registers loki, the reference adapter.
- influxdb.go registers influxdb but prints a deprecation warning.

The Loki adapter implements both LogDatabase and StreamingLogDatabase over Loki's HTTP API, reusing util.SetupPortForward exactly as influxdb.go does today (port-forward to the Loki gateway when LOKI_URL is unset).
GetLogs builds a LogQL query from LogFilter:

    {fission_function_uid="<uid>", fission_function_namespace="<ns>"}
      | json
      | fission_request_id="<reqid>"   # when --request-id set
      | trace_id="<traceid>"           # when --trace-id set
      | level="<level>"                # when --level set

mapping Since/RecordLimit/Reverse to start/limit/direction on /loki/api/v1/query_range; StreamLogs uses /loki/api/v1/tail (WebSocket) for --follow.
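The query construction described above might be sketched as follows. The `LogFilter` fields `FunctionUID` and `Namespace` are assumed names (the existing fields are elided in this document), and quoting values with Go's `%q` is an assumption that happens to match LogQL's double-quoted string syntax for ordinary values. The sketch returns an error instead of emitting an empty matcher.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// LogFilter holds only the fields this sketch needs; FunctionUID and
// Namespace are assumed names, not the real struct fields.
type LogFilter struct {
	FunctionUID string
	Namespace   string
	RequestID   string
	TraceID     string
	Level       string
}

// buildLogQL renders a stream selector on the function labels and, when any
// correlation filter is set, a "| json" stage followed by one label filter
// per set field.
func buildLogQL(f LogFilter) (string, error) {
	if f.FunctionUID == "" || f.Namespace == "" {
		return "", errors.New("function UID and namespace are required for the stream selector")
	}
	var b strings.Builder
	fmt.Fprintf(&b, "{fission_function_uid=%q, fission_function_namespace=%q}", f.FunctionUID, f.Namespace)

	var stages []string
	if f.RequestID != "" {
		stages = append(stages, fmt.Sprintf("fission_request_id=%q", f.RequestID))
	}
	if f.TraceID != "" {
		stages = append(stages, fmt.Sprintf("trace_id=%q", f.TraceID))
	}
	if f.Level != "" {
		stages = append(stages, fmt.Sprintf("level=%q", f.Level))
	}
	if len(stages) > 0 {
		b.WriteString(" | json")
		for _, s := range stages {
			b.WriteString(" | " + s)
		}
	}
	return b.String(), nil
}

func main() {
	q, err := buildLogQL(LogFilter{FunctionUID: "abc", Namespace: "default", RequestID: "r1", Level: "error"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(q)
}
```

Skipping the `| json` stage when no correlation filter is set keeps the plain per-function query free of parsing work.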
Replace the one-second polling loop in pkg/fission-cli/cmd/function/log.go with a channel consumer.
The Loki driver streams via /tail; the kubernetes driver switches --follow to the Kubernetes PodLogOptions{Follow: true} streaming API.
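A minimal sketch of the channel consumer is shown below, assuming a `follow` helper that streams when the driver implements StreamingLogDatabase and otherwise falls back to polling. `follow`, `poll`, `fakeStream`, `demo`, and the trimmed `LogFilter`/`LogEntry` types are illustrative names, not the CLI's actual code.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"
	"time"
)

type LogFilter struct{ Pod string }
type LogEntry struct{ Message string }

type LogDatabase interface {
	GetLogs(ctx context.Context, f LogFilter) ([]LogEntry, error)
}

type StreamingLogDatabase interface {
	LogDatabase
	StreamLogs(ctx context.Context, f LogFilter, out chan<- LogEntry) error
}

// follow consumes a stream when the driver supports it, else polls.
func follow(ctx context.Context, db LogDatabase, f LogFilter, w io.Writer) error {
	s, ok := db.(StreamingLogDatabase)
	if !ok {
		return poll(ctx, db, f, w, time.Second)
	}
	entries := make(chan LogEntry)
	errc := make(chan error, 1) // buffered so the producer never blocks on it
	go func() {
		defer close(entries) // ends the range loop below
		errc <- s.StreamLogs(ctx, f, entries)
	}()
	for e := range entries {
		fmt.Fprintln(w, e.Message)
	}
	return <-errc
}

// poll is the fallback path. A real driver would advance a "since" cursor
// between queries; this sketch omits that and simply re-queries.
func poll(ctx context.Context, db LogDatabase, f LogFilter, w io.Writer, every time.Duration) error {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		entries, err := db.GetLogs(ctx, f)
		if err != nil {
			return err
		}
		for _, e := range entries {
			fmt.Fprintln(w, e.Message)
		}
		select {
		case <-ctx.Done():
			return nil // cancellation is a clean stop
		case <-t.C:
		}
	}
}

// fakeStream is a test double that streams a fixed set of lines.
type fakeStream struct{ lines []string }

func (fakeStream) GetLogs(context.Context, LogFilter) ([]LogEntry, error) { return nil, nil }

func (s fakeStream) StreamLogs(ctx context.Context, _ LogFilter, out chan<- LogEntry) error {
	for _, l := range s.lines {
		select {
		case out <- LogEntry{Message: l}:
		case <-ctx.Done():
			return nil
		}
	}
	return nil
}

// demo runs follow against the fake driver and returns what was written.
func demo() (string, error) {
	var b strings.Builder
	err := follow(context.Background(), fakeStream{lines: []string{"a", "b"}}, LogFilter{}, &b)
	return b.String(), err
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Closing the entries channel from the producer goroutine, after StreamLogs returns, lets the consumer use a plain range loop and still collect the driver's error afterwards.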
OpenTelemetry semantic conventions as the base, Fission keys namespaced under fission.*.
- service.name, k8s.pod.name, k8s.namespace.name, k8s.container.name, k8s.node.name, plus fission.function.{name,uid,namespace}, fission.environment.name, fission.executor.type (from pod labels via k8sattributes).
- fission.request.id, trace_id, span_id, severity_text/level, and on the access record http.status_code + duration_ms.

For Loki, low-cardinality keys (fission.function.uid, fission.function.namespace, level) become labels; fission.request.id is kept as structured metadata (Loki 3.x), queried via | json | fission_request_id=, not promoted to a label, to avoid a cardinality explosion.
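The label policy can be made concrete with a small sketch. In practice this split is configuration of the collector and Loki rather than Fission code; the function `splitForLoki` and the allowlist `lokiLabelKeys` are hypothetical and only illustrate which keys end up where, with dots mapped to underscores as in the LogQL examples in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// lokiLabelKeys lists the low-cardinality keys promoted to Loki labels.
var lokiLabelKeys = map[string]bool{
	"fission.function.uid":       true,
	"fission.function.namespace": true,
	"level":                      true,
}

// lokiName maps an attribute key to the underscore form used in LogQL.
func lokiName(key string) string { return strings.ReplaceAll(key, ".", "_") }

// splitForLoki separates attributes into index labels and structured
// metadata, so high-cardinality keys such as fission.request.id never
// become labels.
func splitForLoki(attrs map[string]string) (labels, metadata map[string]string) {
	labels, metadata = map[string]string{}, map[string]string{}
	for k, v := range attrs {
		if lokiLabelKeys[k] {
			labels[lokiName(k)] = v
		} else {
			metadata[lokiName(k)] = v
		}
	}
	return labels, metadata
}

func main() {
	labels, metadata := splitForLoki(map[string]string{
		"fission.function.uid":       "u1",
		"fission.function.namespace": "default",
		"level":                      "info",
		"fission.request.id":         "req-123",
		"trace_id":                   "4bf92f3577b34da6a3ce929d0e0e4736",
	})
	fmt.Println("labels:", labels)
	fmt.Println("structured metadata:", metadata)
}
```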
The read path (phase 1) and the router access record are implemented; collection is delegated to an operator-run external collector (no Fission-owned collector container). Concrete surface:
- pkg/fission-cli/logdb/logdb.go: GetLogDB's hardcoded two-driver switch is now a Register/Factory registry; drivers self-register in init(), so adding a backend is a new file, not a central edit. LogFilter gains additive RequestID / TraceID / Level correlation fields, and a shared writeLogEntry renders identical CLI output for every driver.
- pkg/fission-cli/logdb/loki.go (new): the Loki reference adapter. It builds a LogQL query_range from the filter (a stream selector on the function labels plus an optional | json | … pipeline for request-id/trace-id/level), reusing util.SetupPortForward when LOKI_URL is unset. buildLogQL errors rather than emit an empty matcher Loki would reject.
- pkg/fission-cli/logdb/{kubernetes_log,influxdb}.go: kubernetes (the zero-dependency default) and influxdb self-register; the influxdb driver prints an end-of-life deprecation warning when selected.
- pkg/fission-cli/cmd/function/log.go + flag: new --request-id / --trace-id / --level flags wire into the filter; a one-shot query now surfaces a backend error (bad query / auth / unreachable) instead of swallowing it; the CLI warns when a correlation filter is set against a backend that does not index it.
- pkg/router/metrics.go (functionHandler.logAccessRecord) + config.go: the per-invocation access record (section 1a), emitted to router stdout, opt-in via the pre-existing DISPLAY_ACCESS_LOG flag (router.displayAccessLog, default false), which was an orphaned env var until now. Threaded through the router config like ROUTER_STRUCTURED_ERRORS.
- --follow (phase 5): an optional LogStreamer interface (pkg/fission-cli/logdb). --follow streams live to stdout when the driver supports it, otherwise falls back to the one-second poll. The kubernetes driver follows the pod log stream (PodLogOptions.Follow, one goroutine per env container, line-buffered under a lock); the loki driver opens the /tail WebSocket (coder/websocket, already vendored) and renders frames through the shared lokiEntry mapper. writeLogEntry now takes io.Writer so a live stream renders without buffering; a cancelled context is a clean stop.
- pkg/utils/otel/provider.go InitProvider stands up a sdklog.LoggerProvider (batch processor + otlploggrpc exporter) and registers it globally, parallel to the TracerProvider; pkg/utils/loggerfactory tees the zap core to the contrib/bridges/otelzap bridge so each record is also emitted to that provider (carrying trace_id), gated to the console level. Opt-in: requires OTEL_LOGS_ENABLED=true in addition to the OTLP endpoint (Helm openTelemetry.logsEnabled, default false), so enabling traces alone does not start pushing logs on upgrade. Inert (global LoggerProvider stays no-op) when not enabled. go.mod adds the OTel log SDK/exporter/bridge at the v1.44.0 line (core unchanged).

The Loki adapter queries against the schema in "Structured-log standard", which the operator's external collector produces from the function pod labels + the access record, so it is immediately useful for a cluster already running a collector + Loki with that schema.
- GetLogDB to a registry; add loki.go (query_range); extend LogFilter; wire flags in pkg/fission-cli/flag. Pure CLI change, unit-testable against an httptest Loki stub; immediately useful where Loki + a collector already run.
- collectFunctionMetric, opt-in via DISPLAY_ACCESS_LOG. Consumes RFC-0015's request-ID.
- test/integration/otel/ + TestFunctionLogsLokiCorrelation): the Collector tails the router access record, hoists fission.function.* to resource attributes, and pushes to Loki's OTLP endpoint, which indexes them as the labels the read path queries. The manifests are CI-only (no Fission chart container) and double as operator guidance for pointing an OTel Collector / Promtail / Vector at the access record + function-namespace pod logs.
- pkg/utils/otel/provider.go (go.mod bump) so control-plane components can push logs to the operator's OTEL_EXPORTER_OTLP_ENDPOINT instead of relying on stdout scraping. Additive transport; the stdout access record already works without it.
- StreamingLogDatabase; implement Loki /tail and the kubernetes follow stream; rewrite the CLI loop.
- fission/environments repo): node/python/go helpers per the schema, so user print() lines carry the request-id for per-line correlation.

- --db-type value still resolves (kubernetes stays the default, influxdb works with a deprecation warning, loki is added). The new LogFilter fields and StreamingLogDatabase interface are additive: existing callers compile and behave unchanged.
- DISPLAY_ACCESS_LOG, default off). An existing install, including one with influxdb.enabled=true, keeps working untouched until the operator opts in.
- influxdb driver and the Fluent-Bit/InfluxDB config are deprecated now, keep working (with a console.Warn), and are scheduled for removal at least two minors out; the schema-fallback shims are dropped only at removal.
- influxdb.enabled: true → point your existing collector (OTel Collector / Promtail / Vector) at the function-namespace pod logs + the router access record (set router.displayAccessLog: true), export to Loki, and query with --db-type loki. Documented in the chart NOTES.txt and an upgrade note.
- --db-type default stays kubernetes, so no scripted CLI usage breaks.
- DISPLAY_ACCESS_LOG env / router.displayAccessLog value; no Fission-owned collector is added.

- loki.go LogQL construction and response parsing against an httptest.Server (no real Loki), mirroring the influxdb.go style; access-record field assertions via a test logr sink in functionHandler_test.go; log-exporter init covered like provider_test.go.
- DISPLAY_ACCESS_LOG in kind-ci and asserting the structured function access record appears in the router logs after an invocation (the kind-logs artifact).
A full Loki-backed round-trip — TestFunctionLogsLokiCorrelation — runs on one CI leg: a CI-only OpenTelemetry Collector + Loki (manifests in test/integration/otel/, not the chart) stand in for the operator's pipeline, the test invokes a function with a known X-Fission-Request-ID, and asserts fission function logs --dbtype loki --request-id <id> returns the access record for exactly that invocation.
It is gated on FISSION_TEST_LOKI so it skips on legs (and locally) without the stack, mirroring the Gateway-API / OCI-registry gates.
The kubernetes driver path needs no backend and stays the default smoke test.
- influxdb driver still registers and warns; assert the default --db-type remains kubernetes.

- fission function logs --request-id <id> returns the correct lines for at least one reference env.
- --follow latency below the current one-second poll granularity with true streaming.
- otlphttp/Elastic) with Helm-values changes only.
- logger.Info is captured.

- fission.request.id must be structured metadata, not a label; document a minimum Loki version.
- k8sattributes needs pod-read in the function namespaces; confirm the Collector ServiceAccount matches the per-namespace vs cluster-wide posture the fluentbit role uses today.