docs/internal-auth/00-design.md
Add a shared-secret HMAC-SHA256 application-layer authentication scheme to Fission's internal HTTP services.
The first delivery covers storagesvc /v1/archive; the same primitive is designed for reuse on every other Fission control-plane HTTP surface (in-pod fetcher, builder, executor, router internal listener) and is rolled out service-by-service.
The scheme is transport-agnostic, replay-resistant within a one-minute window with ±60s skew tolerance, and ships behind a Helm toggle (internalAuth.enabled) so existing installations continue to work during upgrade.
Per-service signing keys are derived from a single chart-managed master secret via HKDF-SHA256, so a leak of one channel's runtime memory cannot forge requests on a different channel. The operator manages exactly one Secret regardless of how many services are signed.
Two coordinated security advisories — GHSA-chf8-4hv6-8pg6 ("StorageSvc has no authentication on /v1/archive") and its duplicate GHSA-7g8g-g937-26g8 — describe the same root cause: any pod able to reach storagesvc over the cluster network can list, download, upload, and delete every Fission package archive.
Storagesvc currently relies on NetworkPolicy (added in #3365) and namespace isolation as its only access controls.
That is brittle:
A symmetric HMAC scheme is the smallest meaningful step up from "no auth" that still works in every cluster (no cert-manager dependency, no service-mesh requirement) and gives us replay resistance without a server-side nonce store.
The scheme defends against:
It does NOT defend against:
Secret/fission-internal-auth (e.g. cluster-admin, full RBAC compromise).
At that point the master rotates and the cluster needs full forensics, not application-layer auth.

- /healthz to bypass signing so kubelet probes pass.
- pkg/auth/hmac) so each new signed surface adds one server-side middleware registration and one client-side transport wrapper, no new chart wiring.
- ±60s between caller and verifier (typical kubelet drift).
- OldSecret accepted alongside the current Secret for a 5-minute overlap.
- internalAuth.enabled=true); opt-in for in-place upgrades from the previous minor.

Each row is a logical communication channel. The signer and verifier on a given row use the same per-service derived key (see "Per-service key derivation" below). Service identifiers are part of the HKDF info string and must remain stable across releases.
| Service ID | Server endpoint(s) | Caller(s) | Status |
|---|---|---|---|
| storagesvc | storagesvc /v1/archive (GET, POST, DELETE, HEAD) | in-pod cmd/fetcher (download + upload), buildermgr archive cleanup, fission CLI archive subcommands, pkg/fission-cli/cmd/package/util | Phase 1 (this PR) |
| fetcher | in-pod cmd/fetcher /fetch, /upload, /clean, /specialize (port 8000) | buildermgr (build → fetcher), executor (specialization → fetcher) | Phase 2 (PR-α) |
| builder | in-pod cmd/builder /build (port 8001) | buildermgr | Phase 2 (PR-α) |
| executor | pkg/executor HTTP API: /v2/getServiceForFunction, /v2/tap, /v2/error, etc. | router, kubewatcher, timer, mqt-fission-kafka, canaryconfig | Phase 2 (PR-β) |
| router-internal | router's internal listener that hosts /fission-function/<ns>/<name> | executor, kubewatcher, timer, mqt-fission-kafka, mqt-keda connectors | Advisory 4 (separate PR) |
Out of scope:
- /healthz, /metrics: kubelet / Prometheus probes have no signing path; bypass is mandatory.
- fission/kafka-http-connector images don't sign; they need an upstream image change or a deploy-time NetworkPolicy-only acceptance.

The HMAC input is:
<METHOD>\n
<REQUEST-URI>\n
<SHA256_HEX(BODY)>\n
<UNIX_MINUTE>
where:
- <REQUEST-URI> is the path plus the raw query string (r.URL.RequestURI() on the Go side) so query parameters like ?id=<archive-id> are bound to the signature. A captured GET /v1/archive?id=A cannot be replayed as ?id=B within the skew window.
- UNIX_MINUTE = floor(unix_seconds / 60) * 60. The minute granularity ensures the signature is stable for the duration of a typical request (sub-second) and tolerates retries within the same minute without re-signing.
- POST could swap the body.

Signed requests carry two headers:

- X-Fission-Auth-Timestamp: the caller's unix-seconds timestamp at request time.
- X-Fission-Auth-Signature: hex(HMAC-SHA256(derived_key, canonical)).

The timestamp is sent as the exact unix seconds the caller used, not the rounded minute, so the verifier can apply skew tolerance against its own clock.
The signature itself is computed over the rounded minute, so two otherwise identical requests issued 30 seconds apart within the same minute share a signature input but differ in X-Fission-Auth-Timestamp.
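The canonical string and signature described above can be sketched in a few lines of Go. This is a minimal illustration and not the pkg/auth/hmac source: the lowercase names canonical and sign and the placeholder key are assumptions made for the example.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// canonical builds the signed string: method, request URI, hex SHA-256 of
// the body, and the timestamp rounded down to the start of its minute.
func canonical(method, requestURI string, body []byte, ts int64) string {
	bodyHash := sha256.Sum256(body)
	minute := ts / 60 * 60 // equals floor(ts/60)*60 for non-negative ts
	return method + "\n" +
		requestURI + "\n" +
		hex.EncodeToString(bodyHash[:]) + "\n" +
		strconv.FormatInt(minute, 10)
}

// sign returns hex(HMAC-SHA256(key, canonical(...))).
func sign(key []byte, method, requestURI string, body []byte, ts int64) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(canonical(method, requestURI, body, ts)))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	key := []byte("example-derived-key") // placeholder, not a real secret
	const ts = 1700000000

	fmt.Printf("%q\n", canonical("GET", "/v1/archive?id=A", nil, ts))

	// Changing only the query string changes the signature.
	a := sign(key, "GET", "/v1/archive?id=A", nil, ts)
	b := sign(key, "GET", "/v1/archive?id=B", nil, ts)
	fmt.Println("query bound:", a != b)

	// Timestamps inside the same minute share a signature input.
	fmt.Println("same minute:", a == sign(key, "GET", "/v1/archive?id=A", nil, ts+30))
}
```

Because the signed value is the rounded minute, a retry inside the same minute produces the same signature, while the header still carries the exact second for the skew check.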
The chart distributes a single 32-byte master secret to every signed service. At runtime, each service derives its own key from the master via HKDF-SHA256:
derived_key = HKDF-SHA256(
ikm = master_secret,
salt = nil,
info = "fission-internal-v1:" + service_id,
length = 32 bytes,
)
The signer and verifier on a given channel both call this with the same service_id, so they end up with the same derived_key end-to-end.
The master never leaves the verifier/signer constructors at the boundary; only the per-service derived key is passed to the actual HMAC primitives.
This gives the operational simplicity of a single shared secret (one chart Secret, one rotation event) with the compromise-isolation properties of independent per-service secrets:
- derive(master, "storagesvc") lets the attacker forge storagesvc requests but not fetcher / builder / executor / router-internal.

The constant KeyVersion = "fission-internal-v1" is the wire-format version mixed into the HKDF info string.
Bumping KeyVersion invalidates every signature in flight; treat it as a breaking change requiring a coordinated rollout.
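To make the derivation concrete, here is a sketch of HKDF-SHA256 (RFC 5869) with a nil salt and a 32-byte output, written against the standard library only. It is hand-rolled for illustration; the actual package may well use an HKDF library instead, and the function name deriveServiceKey and the placeholder master value are assumptions.

```go
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// keyVersion mirrors the KeyVersion constant named in the design.
const keyVersion = "fission-internal-v1"

// deriveServiceKey is a hand-rolled HKDF-SHA256 (RFC 5869) restricted to one
// 32-byte output block; a maintained HKDF library would normally be used.
func deriveServiceKey(master []byte, serviceID string) []byte {
	// Extract: PRK = HMAC-SHA256(salt, IKM). A nil salt is defined as
	// HashLen zero bytes.
	extract := hmac.New(sha256.New, make([]byte, sha256.Size))
	extract.Write(master)
	prk := extract.Sum(nil)

	// Expand: T(1) = HMAC-SHA256(PRK, info || 0x01). One block is exactly
	// the 32 bytes requested, so no further iterations are needed.
	expand := hmac.New(sha256.New, prk)
	expand.Write([]byte(keyVersion + ":" + serviceID))
	expand.Write([]byte{0x01})
	return expand.Sum(nil)
}

func main() {
	master := bytes.Repeat([]byte{0x42}, 32) // placeholder master secret

	storage := deriveServiceKey(master, "storagesvc")
	fetcher := deriveServiceKey(master, "fetcher")

	fmt.Println("key length:", len(storage))
	fmt.Println("channels share a key:", bytes.Equal(storage, fetcher))
	fmt.Println("derivation is stable:", bytes.Equal(storage, deriveServiceKey(master, "storagesvc")))
}
```

The service identifier only enters through the info string, which is why renaming an identifier or bumping the version prefix changes every derived key.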
1. If Secret is empty → pass through (backwards-compat short-circuit).
2. If path ∈ Bypass set → pass through.
3. Read X-Fission-Auth-Timestamp; reject if missing or unparseable.
4. Read X-Fission-Auth-Signature; reject if missing.
5. abs(now - timestamp) > skew → reject ("stale timestamp"), BEFORE buffering body.
6. Slurp body (bounded by MaxBodyBytes via http.MaxBytesReader; over the limit → 413).
7. Recompute Sign(derived_key, method, request_uri, body, timestamp).
8. crypto/hmac.Equal(want, got) → pass; re-inject body for downstream handler.
9. Else if OldSecret-derived key set → repeat 7-8.
10. Else → 401.
Comparison uses crypto/hmac.Equal to avoid timing oracles.
The skew check happens before body buffering so a stale-timestamp request with a multi-MB body cannot force the verifier to allocate MaxBodyBytes before rejecting.
The body is read once with io.ReadAll, the hash is computed, and r.Body is re-injected as io.NopCloser(bytes.NewReader(body)) so downstream handlers (e.g. multipart parsers in uploadHandler) can re-read it.
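The ten verification steps can be sketched as a small middleware. This is an assumption-laden illustration, not the shipped Verifier: the opts struct, the demoOpts and try helpers, the 16-byte demo cap, and the choice to report every body read error as 413 are all simplifications made for the example.

```go
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"time"
)

// opts is a hypothetical stand-in for the design's VerifierOpts.
type opts struct {
	Key, OldKey  []byte          // per-service derived keys; an empty Key disables checks
	Bypass       map[string]bool // unsigned paths such as /healthz
	Skew         time.Duration
	MaxBodyBytes int64
	Now          func() time.Time
}

// sign returns hex(HMAC-SHA256(key, canonical string)) with ts floored to the minute.
func sign(key []byte, method, uri string, body []byte, ts int64) string {
	sum := sha256.Sum256(body)
	mac := hmac.New(sha256.New, key)
	fmt.Fprintf(mac, "%s\n%s\n%s\n%d", method, uri, hex.EncodeToString(sum[:]), ts/60*60)
	return hex.EncodeToString(mac.Sum(nil))
}

func verifier(o opts) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if len(o.Key) == 0 || o.Bypass[r.URL.Path] { // steps 1-2
				next.ServeHTTP(w, r)
				return
			}
			ts, err := strconv.ParseInt(r.Header.Get("X-Fission-Auth-Timestamp"), 10, 64)
			got := r.Header.Get("X-Fission-Auth-Signature")
			d := o.Now().Sub(time.Unix(ts, 0))
			if err != nil || got == "" || d > o.Skew || d < -o.Skew { // steps 3-5, before reading the body
				w.WriteHeader(http.StatusUnauthorized)
				return
			}
			body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, o.MaxBodyBytes)) // step 6
			if err != nil {
				// Simplification: every read error is reported as 413.
				w.WriteHeader(http.StatusRequestEntityTooLarge)
				return
			}
			for _, key := range [][]byte{o.Key, o.OldKey} { // steps 7-9
				if len(key) == 0 {
					continue
				}
				want := sign(key, r.Method, r.URL.RequestURI(), body, ts)
				if hmac.Equal([]byte(want), []byte(got)) {
					r.Body = io.NopCloser(bytes.NewReader(body)) // re-inject for downstream
					next.ServeHTTP(w, r)
					return
				}
			}
			w.WriteHeader(http.StatusUnauthorized) // step 10
		})
	}
}

// demoOpts pins the verifier clock to unix time now and caps bodies at 16 bytes.
func demoOpts(key, oldKey []byte, now int64) opts {
	return opts{Key: key, OldKey: oldKey, Bypass: map[string]bool{"/healthz": true},
		Skew: time.Minute, MaxBodyBytes: 16, Now: func() time.Time { return time.Unix(now, 0) }}
}

// try sends one request through the verifier in front of an echo handler and
// returns the status code and the body the echo handler read.
func try(o opts, method, target, body string, ts int64, sig string) (int, string) {
	echo := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { io.Copy(w, r.Body) })
	req := httptest.NewRequest(method, target, strings.NewReader(body))
	req.Header.Set("X-Fission-Auth-Timestamp", strconv.FormatInt(ts, 10))
	req.Header.Set("X-Fission-Auth-Signature", sig)
	rec := httptest.NewRecorder()
	verifier(o)(echo).ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	key, now := []byte("demo-derived-key"), int64(1700000000)
	sig := sign(key, "POST", "/v1/archive?id=A", []byte("payload"), now)
	fmt.Println(try(demoOpts(key, nil, now), "POST", "/v1/archive?id=A", "payload", now, sig))
	fmt.Println(try(demoOpts(key, nil, now), "POST", "/v1/archive?id=B", "payload", now, sig))
}
```

In the demo the first request passes and the echo handler can still read the body, while the second reuses the signature against a different query string and is rejected with 401.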
- /healthz: kubelet probes have no signing path; an unsigned 200 must remain available.

No other bypasses.
In particular /metrics is served on a different port (8080) and never reaches a service's signed mux.
The pkg/auth/hmac package exposes both the low-level primitives and the per-service convenience constructors. New services should always use the convenience constructors so the service identifier is bound at compile time.
// Primitives (each new signed service does NOT need to call these directly):
func Canonical(method, requestURI string, body []byte, ts int64) string
func Sign(key []byte, method, requestURI string, body []byte, ts int64) string
func Verify(key []byte, method, requestURI string, body []byte, ts int64, sig string) bool
func NewSigner(key []byte, rt http.RoundTripper, now func() time.Time) *Signer
func Verifier(opts VerifierOpts) func(http.Handler) http.Handler
// Per-service convenience (preferred entry points):
func DeriveServiceKey(master []byte, service Service) []byte
func ServiceSigner(master []byte, service Service, rt http.RoundTripper, now func() time.Time) *Signer
func ServiceVerifier(master, oldMaster []byte, service Service, opts VerifierOpts) func(http.Handler) http.Handler
// Service identifiers (extend this list when adding a new signed channel):
const (
ServiceStoragesvc Service = "storagesvc"
ServiceFetcher Service = "fetcher"
ServiceBuilder Service = "builder"
ServiceExecutor Service = "executor"
ServiceRouterInternal Service = "router-internal"
)
Adding a new signed surface is mechanical:
- Service constant.
- ServiceVerifier(master, oldMaster, ServiceXxx, opts) as middleware.
- ServiceSigner(master, ServiceXxx, rt, time.Now) when master is non-empty.
- FISSION_INTERNAL_AUTH_SECRET mounted (the chart's _helpers.tpl::internalAuth.envs partial covers all top-level deployments; pkg/fetcher/config/config.go::internalAuthEnvVars covers dynamically-created builder/function pods).

No new Helm Secret, no new env var, no chart change.
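On the client side, the transport wrapper can be pictured as an http.RoundTripper that buffers the body, signs it, and attaches the two headers. The sketch below is an assumption about the general shape, not the Signer type from pkg/auth/hmac: the signer and capture types, the signedHeaders helper, and the storagesvc.example hostname are all hypothetical, and the stub transport means nothing is sent over the network.

```go
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// sign returns hex(HMAC-SHA256(key, canonical string)) with ts floored to the minute.
func sign(key []byte, method, uri string, body []byte, ts int64) string {
	sum := sha256.Sum256(body)
	mac := hmac.New(sha256.New, key)
	fmt.Fprintf(mac, "%s\n%s\n%s\n%d", method, uri, hex.EncodeToString(sum[:]), ts/60*60)
	return hex.EncodeToString(mac.Sum(nil))
}

// signer is a hypothetical RoundTripper that signs every outgoing request.
type signer struct {
	key  []byte
	next http.RoundTripper
	now  func() time.Time
}

func (s *signer) RoundTrip(req *http.Request) (*http.Response, error) {
	out := req.Clone(req.Context()) // never mutate the caller's request
	var body []byte
	if req.Body != nil && req.Body != http.NoBody {
		b, err := io.ReadAll(req.Body)
		req.Body.Close()
		if err != nil {
			return nil, err
		}
		body = b
		out.Body = io.NopCloser(bytes.NewReader(body)) // replay the buffered bytes
		out.ContentLength = int64(len(body))
	}
	ts := s.now().Unix()
	out.Header.Set("X-Fission-Auth-Timestamp", strconv.FormatInt(ts, 10))
	out.Header.Set("X-Fission-Auth-Signature", sign(s.key, out.Method, out.URL.RequestURI(), body, ts))
	return s.next.RoundTrip(out)
}

// capture is a stub transport that records the request instead of sending it.
type capture struct{ last *http.Request }

func (c *capture) RoundTrip(r *http.Request) (*http.Response, error) {
	c.last = r
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: r}, nil
}

// signedHeaders pushes one request through a signer with a pinned clock and
// returns the timestamp and signature headers it attached.
func signedHeaders(key []byte, method, url, body string, now int64) (string, string, error) {
	c := &capture{}
	clock := func() time.Time { return time.Unix(now, 0) }
	client := &http.Client{Transport: &signer{key: key, next: c, now: clock}}
	req, err := http.NewRequest(method, url, strings.NewReader(body))
	if err != nil {
		return "", "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", "", err
	}
	resp.Body.Close()
	return c.last.Header.Get("X-Fission-Auth-Timestamp"), c.last.Header.Get("X-Fission-Auth-Signature"), nil
}

func main() {
	key := []byte("demo-derived-key")
	ts, sig, err := signedHeaders(key, "POST", "http://storagesvc.example/v1/archive?id=A", "payload", 1700000000)
	if err != nil {
		panic(err)
	}
	fmt.Println("timestamp header:", ts)
	fmt.Println("matches direct sign:", sig == sign(key, "POST", "/v1/archive?id=A", []byte("payload"), 1700000000))
}
```

Wrapping the transport rather than each call site keeps signing out of the individual client methods, which matches the "one client-side transport wrapper" goal stated earlier.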
The Helm chart materializes one Secret per Fission-using namespace, named fission-internal-auth, each with the same data.secret value (32-byte alphanumeric, b64-encoded).
Why per-namespace copies and not one cross-namespace reference: kubelet does not support cross-namespace secretKeyRef, and Fission builder / function pods are scheduled into user namespaces (e.g. default).
The chart iterates over .Release.Namespace plus .Values.defaultNamespace plus each .Values.additionalFissionNamespaces and renders one Secret per entry.
A single $secretValue is computed once at the top of the template via lookup (preserved across upgrades) so all copies share the same value.
The Secret is mounted as FISSION_INTERNAL_AUTH_SECRET (and the optional FISSION_INTERNAL_AUTH_SECRET_OLD) into:
- storagesvc, buildermgr, executor, router: via the _helpers.tpl::internalAuth.envs partial. Phase 2-β / Advisory 4 extend the same partial to kubewatcher, timer, mqt-fission-kafka, and mqt-keda once those services need to sign executor / router-internal calls; that chart change ships in the follow-up PRs, not in Phase 1.
- pkg/fetcher/config/config.go::internalAuthEnvVars, sourced from the per-namespace Secret with optional: true so installs with internalAuth.enabled=false still admit the pod.

To rotate the master secret:
1. secret value into internalAuth.oldSecret.
2. internalAuth.secret to a fresh 32-byte value.
3. helm upgrade rolls every Fission deployment. During the rollout the verifier accepts both keys (each per-service OldSecret is derived from the master oldSecret); signers use the new derived key as soon as the pod restarts.
4. helm upgrade with internalAuth.oldSecret="" to remove the old key.

Because all per-service keys are derived from the master, rotation is atomic across every signed channel.
Three layers of opt-out, in increasing priority:
1. ServiceVerifier produces an empty derived key; the underlying Verifier middleware short-circuits to pass-through. That makes it safe to deploy the Go side first; nothing breaks until the env var is set.
2. FISSION_INTERNAL_AUTH_SECRET defaults to empty; verifiers start up unguarded if unset. Signers likewise check the env and only sign when a master is present.
3. internalAuth.enabled=false skips the Secret resource and the env mounts entirely; the cluster behaves exactly as it does on main today.

For new installs internalAuth.enabled=true is the default.
For in-place upgrades from the previous minor we recommend leaving the default: the chart preserves the existing master on subsequent upgrades, and signed clients and verifiers are introduced atomically by a single helm upgrade.
Considered. Stronger guarantees (transport binding, identity attestation) but requires cert-manager as a hard dependency, which we don't currently mandate. We can add mTLS later as an additional layer; HMAC is not exclusive of it.
Considered and rejected. A static bearer is replayable forever; rotating it produces the same complexity as HMAC without the body binding.
Considered and rejected. Signing JWTs needs a JWKS endpoint, key rotation tooling, and clock-sync that we'd then have to operate. HMAC-SHA256 over a canonical string is a much smaller surface.
Already in place via #3365. We keep it; HMAC is layered on top. On its own, NetworkPolicy fails open the moment a cluster operator overrides the namespace selector or a CNI is installed in permissive mode.
Considered and rejected. Operationally equivalent to deriving (one chart Secret, one rotation), but a runtime memory leak on any one service exposes the key for every internal channel. HKDF derivation costs ~µs on first use and gives per-channel compromise isolation for free.
Considered and rejected. Per-channel isolation is identical to the derived-key approach but the operator carries 5 Secrets, 5 rotation events, and 5 chart values. The complexity is not justified when HKDF gives the same isolation from one master.
The scheme has known limitations operators should plan around.
Maximum body size.
The verifier reads the entire request body into memory before computing the signature so the body bytes can be re-injected for downstream handlers (multipart parsers, etc.).
That cost is bounded by VerifierOpts.MaxBodyBytes (default 256 MiB, set on each registration).
Bodies that exceed the cap are rejected with 413 Request Entity Too Large before signature verification — i.e. an unauthenticated attacker cannot use a giant unsigned body to DoS a signed service.
Operators that legitimately need to upload archives larger than 256 MiB should bump the cap rather than disable enforcement; the cap is the largest archive size we expect to see in practice.
Replay within the skew window.
A captured signed request can be replayed any number of times within the SkewSec window (default 60s) and will pass verification each time.
Adding nonce tracking would require a shared replay-cache store across all replicas (Redis, a distributed cache, or a Lease-CR scheme).
That is out of scope for this PR; the 60-second window is short enough that the practical attack surface is limited to a recently-captured packet on the cluster network — at which point an attacker capable of sniffing in-cluster traffic has bigger problems.
A future change may add a nonce store if a use case justifies the operational cost.
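The replay limitation can be seen directly by checking one captured signature against a verifier clock that keeps advancing. This is a sketch under the same assumptions as the canonical-string format above; the accepts helper and the demo key are hypothetical and stand in for the real middleware.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns hex(HMAC-SHA256(key, canonical string)) with ts floored to the minute.
func sign(key []byte, method, uri string, body []byte, ts int64) string {
	sum := sha256.Sum256(body)
	mac := hmac.New(sha256.New, key)
	fmt.Fprintf(mac, "%s\n%s\n%s\n%d", method, uri, hex.EncodeToString(sum[:]), ts/60*60)
	return hex.EncodeToString(mac.Sum(nil))
}

// accepts reports whether a verifier whose clock reads now would accept a
// request carrying timestamp ts and signature sig, with skew in seconds.
func accepts(key []byte, method, uri string, body []byte, ts int64, sig string, now, skew int64) bool {
	if d := now - ts; d > skew || d < -skew {
		return false // stale or future timestamp
	}
	return hmac.Equal([]byte(sign(key, method, uri, body, ts)), []byte(sig))
}

func main() {
	key := []byte("demo-derived-key")
	const sent = int64(1700000000)
	sig := sign(key, "DELETE", "/v1/archive?id=A", nil, sent)

	// Replay the identical request at increasing delays after it was signed.
	for _, delay := range []int64{0, 30, 60, 61} {
		ok := accepts(key, "DELETE", "/v1/archive?id=A", nil, sent, sig, sent+delay, 60)
		fmt.Printf("replay after %2ds accepted: %v\n", delay, ok)
	}
}
```

With no nonce store, nothing distinguishes the replays at 0, 30 and 60 seconds from the original request; only the skew check at 61 seconds rejects it.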
fission-cli archive subcommands.
The CLI's archive subcommands (fission archive list, fission archive get, etc.) talk to storagesvc directly via a port-forward.
They sign using HMACSecretFromCluster, which reads Secret/fission-internal-auth from the install namespace via the user's kubeconfig; this works as long as the operator has read access to that Secret.
Standard CLI flows that go through fission package and fission function are unaffected — those commands talk to the Kubernetes API server, not storagesvc directly.
MQT-KEDA connectors don't sign router-internal requests.
Upstream KEDA connector images (fission/kafka-http-connector etc.) call into /fission-function/<ns>/<name> on the router internal listener (advisory 4) and do not currently sign.
Until those images are upgraded, operators enabling the router-internal verifier should either rely on NetworkPolicy alone for KEDA traffic or build signing-aware connector images.
- pkg/auth/hmac/*_test.go cover canonical-string formatting, sign/verify round-trip, OldSecret fallback, skew tolerance, healthz bypass, body re-readability, body-cap enforcement, tampered-body / tampered-method / tampered-path rejection, the rejection-log emission contract, and the per-service derivation (TestDeriveServiceKey*, TestServiceSignerVerifier*).
- pkg/storagesvc/storagesvc_auth_test.go) exercises the storagesvc middleware chain end-to-end.
- helm template charts/fission-all -n fission --set additionalFissionNamespaces='{ns-a,ns-b}' renders one fission-internal-auth Secret per namespace, all carrying the same data.secret.
- fission fn run hello) survives an in-place upgrade with internalAuth.enabled=true.

1. Phase 1 (this PR): pkg/auth/hmac is introduced with primitives + per-service derivation. Storagesvc registers ServiceVerifier(..., ServiceStoragesvc, ...); in-pod fetcher / buildermgr / fission-cli sign with ServiceSigner(..., ServiceStoragesvc, ...). Chart materializes the master Secret in every Fission-using namespace. internalAuth.enabled defaults to true. Backwards compat carried by the empty-secret short-circuit.
2. Phase 2 (PR-α): cmd/fetcher/app/server.go registers ServiceVerifier(..., ServiceFetcher, ...). cmd/builder HTTP server registers ServiceVerifier(..., ServiceBuilder, ...). pkg/fetcher/client (used by buildermgr → fetcher and executor → fetcher) wraps its transport with ServiceSigner(..., ServiceFetcher, ...). buildermgr → builder client wraps with ServiceSigner(..., ServiceBuilder, ...).
3. Phase 2 (PR-β): pkg/executor HTTP server registers ServiceVerifier(..., ServiceExecutor, ...). Each caller (router, kubewatcher, timer, mqt-fission-kafka, canaryconfig) wraps its executor-client transport with ServiceSigner(..., ServiceExecutor, ...).
4. Advisory 4 (separate PR): ServiceRouterInternal was reserved up front so advisory 4 can drop into the same library without touching the chart.

- WWW-Authenticate header on 401? Current implementation returns a bare 401 with no body to minimize information leakage.
- fission_internal_auth_failures_total{reason="...", service="..."}? Useful for detecting mis-rotation; left out of Phase 1 to keep the change minimal.