This document provides an overview of the architecture and design decisions around Ztunnel, the node-proxy component in ambient mode.
Motivations to implement ztunnel generally came from two areas.
First, and most importantly, it serves as a means to implement the real goal: waypoints. For various reasons outside the scope of this document, there is a desire to move from a sidecar based architecture to a "remote proxy" architecture. However, this has one glaring issue: how do we get the traffic to the remote proxies, while maintaining the zero-trust properties that Istio is built upon?
A secondary goal was to enable a smoother on-ramp from "Zero" to "Getting some value". Historically, Istio had to really be consumed all-or-nothing for things to work as expected. In particular, an easy answer to "I just want to have mTLS everywhere, then I can think about adopting the rest of service mesh" was desired.
Other documents include:
Ztunnel should:
Ztunnel was not designed to be a feature-rich data plane. Quite the opposite - an aggressively small feature set is the key feature that makes ztunnel viable. It very intentionally does not offer L7 (HTTP) functionality, for instance, which would likely violate some of the goals above, without contributing to them. Instead, the rich functionality that service mesh is traditionally associated with is deferred to the waypoints. The ztunnel is primarily a mechanism to get traffic to the waypoints, securely.
In its initial implementations, the ztunnel was actually implemented in 3 different ways: a bespoke Rust implementation, a bespoke Go implementation, and in Envoy.
In the end, after evaluation, the decision was to move forward with a Rust implementation. This offered performance benefits that were too large to leave on the table, as well as opportunities to tune to our specific needs.
Ztunnel, of course, needs to be dynamically configured in order to decide how it should handle traffic. For this purpose, we chose the xDS transport protocol because of our expertise and existing infrastructure, and because the protocol is well suited to our needs.
However, while we chose to use the xDS transport protocol, we chose to not use the xDS resource types, such as Clusters and Listeners. In our experience and testing, these types force us to represent data in inefficient ways because they are general purpose. Ztunnel is not general purpose; it has an extremely tight goal. We can exploit this to make a more efficient protocol, which is critical to achieve our resource footprint goals.
For example, configuring Istio mTLS in Envoy takes roughly 50 lines of JSON (it is sent as Protobuf, of course, but the comparison still holds).
Because Ztunnel can have Istio semantics baked in, we do not need to encode all this information on the wire.
Instead, an Istio specific field like ExpectedTLSIdentity: spiffe://foo.bar can encode the same information, at a fraction of the cost.
In our testing, even the most generous representations give custom types a 10x edge (in size, allocations, and CPU time) over Envoy types.
In addition, they are more clear and strictly typed; using Envoy types would require us to put a lot of information in untyped metadata maps.
With this in mind, Ztunnel supports two xDS resources: Address and Authorization.
The primary configuration consumed by Ztunnel is the Address resource.
As the name suggests, an Address represents a particular IP Address.
This can be a Service or a Workload.
The address type has the following goals:
See the XDS Evolution document for more history and details.
The Workload aims to represent everything about a workload (generally a Pod or WorkloadEntry).
This includes things like its IP address, identity, metadata (name, namespace, app, version, etc), and whether it has a waypoint proxy associated.
The Service aims to represent everything about a service (generally a Service or ServiceEntry).
This includes things like its IP addresses, ports and an associated waypoint proxy if it has one.
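As a rough illustration of how the two shapes can sit behind a single Address lookup, the sketch below models them as Python dataclasses. The field names and the `index_by_ip` helper are assumptions chosen to mirror the prose above; they are not the actual xDS schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Workload:
    # Hypothetical fields mirroring the prose; not the real proto definition.
    ip: str
    identity: str
    name: str
    namespace: str
    waypoint: Optional[str] = None  # address of an associated waypoint, if any


@dataclass
class Service:
    # Hypothetical fields mirroring the prose; not the real proto definition.
    name: str
    namespace: str
    ips: List[str]
    ports: List[int] = field(default_factory=list)
    waypoint: Optional[str] = None


# An Address is either a Workload or a Service.
Address = Union[Workload, Service]


def index_by_ip(addresses: List[Address]) -> Dict[str, Address]:
    """Build an IP -> Address table so one lookup answers 'what is this IP?'."""
    table: Dict[str, Address] = {}
    for addr in addresses:
        ips = [addr.ip] if isinstance(addr, Workload) else addr.ips
        for ip in ips:
            table[ip] = addr
    return table
```

With a table like this, a proxy that only sees a destination IP can resolve it to a Workload or a Service in a single step.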
A secondary configuration consumed by Ztunnel is the Authorization resource.
Original Design Doc.
This resource aims to represent the relatively small set of Authorization policies that Ztunnel supports. Most notably, only L4 policies are covered.
Most of the API is fairly straightforward. However, one interesting aspect is how these policies associate with workloads. Istio's AuthorizationPolicy has label selectors. However, we intentionally do not send workload labels as part of the Workload API, in order to keep the size low.
The obvious solution to this is to put the list of selected workloads into the policy itself. However, this means anytime a workload changes (often), we need to update the policy.
Instead, the opposite was chosen: each workload will list the policies that select it. This works out to be more efficient in common cases where policies change much less often than workloads. This only applies for selector-based policies; namespaced and global policies can be handled without needing to list them out in the Workload API.
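The association described above can be sketched as follows. This is a minimal illustration, not the real API: the `Policy` and `Workload` shapes are hypothetical, policies are keyed by name only for brevity, and an empty namespace is used here to stand in for a global policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Policy:
    name: str
    namespace: str        # "" stands in for a global policy (an assumption)
    selector_based: bool  # True if the policy carries a label selector


@dataclass
class Workload:
    name: str
    namespace: str
    # Names of the selector-based policies that select this workload,
    # resolved upstream rather than by matching labels locally.
    policies: List[str] = field(default_factory=list)


def policies_for(workload: Workload, store: Dict[str, Policy]) -> List[str]:
    """Return the sorted names of every policy that applies to the workload."""
    applicable = set(workload.policies)
    for policy in store.values():
        if policy.selector_based:
            continue  # applies only when the workload lists it
        if policy.namespace in ("", workload.namespace):
            applicable.add(policy.name)
    return sorted(applicable)
```

In this arrangement, adding or removing a workload touches only that workload's own record; the policy objects stay unchanged.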
As ztunnel aims to transparently encrypt and route users' traffic, we need a mechanism to capture all traffic entering and leaving "mesh" pods. This is a security-critical task: if the ztunnel can be bypassed, authorization policies can be bypassed.
Redirection must meet these requirements:
TODO: fill in implementation details of how redirection is actually implemented.
Along with pass-through traffic, Ztunnel supports the "HBONE" (HTTP-Based Overlay Network) protocol. This is not so much a new protocol as a name we came up with to refer to the expectations of clients and servers communicating in the mesh.
HBONE is just a standard HTTP CONNECT tunnel, over mutual TLS with mesh (SPIFFE) certificates, on a well known port (15008).
The target destination address is set in the :authority header, and additional headers can be included as well.
Currently, only HTTP/2 is supported, though HTTP/1.1 and HTTP/3 are planned.
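To make the shape of such a request concrete, here is a small sketch that assembles the pieces described above. The `hbone_request` function and its return shape are hypothetical. The `forwarded` value follows the generic RFC 7239 syntax, and whether ztunnel emits exactly this form is an assumption.

```python
from typing import Dict, List, Optional, Tuple

HBONE_PORT = 15008  # well-known HBONE port


def hbone_request(dst_ip: str, dst_port: int,
                  source_ip: Optional[str] = None) -> Dict[str, object]:
    """Describe an HBONE request: where to dial, and which CONNECT headers to send."""
    # IPv6 literals are bracketed when written with a port.
    host = f"[{dst_ip}]" if ":" in dst_ip else dst_ip
    headers: List[Tuple[str, str]] = [
        (":method", "CONNECT"),
        (":authority", f"{host}:{dst_port}"),  # the target destination
    ]
    if source_ip is not None:
        # RFC 7239 requires IPv6 addresses to be bracketed and quoted.
        node = f'"[{source_ip}]"' if ":" in source_ip else source_ip
        headers.append(("forwarded", f"for={node}"))
    # The TCP/mTLS connection targets the destination on the HBONE port,
    # while :authority carries the port the application actually wants.
    return {"dial": (dst_ip, HBONE_PORT), "headers": headers}
```

Note that an HTTP/2 CONNECT request carries only `:method` and `:authority` as pseudo-headers; `:scheme` and `:path` are omitted.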
Currently, SNI is not set by Istio clients and ignored by Istio servers.
This makes identifying which certificate to use problematic for Ztunnel.
To handle this, requests to Ztunnel are sent to DestinationPod:15008 and redirected to ztunnel, rather than ZtunnelPod:15008.
The original destination is then extracted to determine which certificate to use.
SNI is not used because it is illegal to use IPs in SNI, and there is no other existing standard format to represent what we need.
Additionally, using the redirection mechanism reduces the need for clients to know the destination's ztunnel address.
Below shows an example outbound request. The "target" path is what the client sends, while the "actual" path is the real network flow after redirection.
graph LR
subgraph Client Node
Client
CZ["Ztunnel"]
end
subgraph Server Node
Server
SZ["Ztunnel"]
end
Client--Plain-->CZ
CZ-."HBONE (target)".->Server
CZ--"HBONE (actual)"-->SZ
SZ--Plain-->Server
User connections can be multiplexed over shared HBONE connections.
This is done through standard HTTP/2 pooling.
The pool is keyed on the tuple {source identity, destination identity, destination ip}.
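A minimal sketch of a pool keyed this way is shown below. The `HbonePool` class and its `dial` callback are hypothetical stand-ins; a real pool would also need to handle connection failure, stream limits, and eviction, none of which are modeled here.

```python
from typing import Callable, Dict, Tuple

# (source identity, destination identity, destination IP)
PoolKey = Tuple[str, str, str]


class HbonePool:
    """Reuse one tunnel connection per key; new user streams multiplex onto it."""

    def __init__(self, dial: Callable[[PoolKey], object]) -> None:
        self._dial = dial  # hypothetical factory that opens a tunnel connection
        self._conns: Dict[PoolKey, object] = {}

    def get(self, src_identity: str, dst_identity: str, dst_ip: str) -> object:
        key = (src_identity, dst_identity, dst_ip)
        conn = self._conns.get(key)
        if conn is None:
            # First user connection for this key: open the shared tunnel.
            conn = self._dial(key)
            self._conns[key] = conn
        return conn
```

Because the source identity is part of the key, two local workloads with different identities do not share a tunnel even when they talk to the same destination.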
Ztunnel uses the following well-known headers in HBONE:
| Header | Purpose |
|---|---|
| :authority | Required in CONNECT, this is the target destination |
| Forwarded | For outgoing requests, the original source IP. Note that since we spoof IPs in most cases, this usually is the same as the actual IP seen. For incoming requests, this is used only for traffic from waypoints (which are trusted and cannot spoof IPs) |
| Baggage | (Experimental, likely to be removed) This contains metadata about the source/destination workload for telemetry purposes |
| Traceparent | (Experimental) This maintains tracing information. Note this is tracing of connections, and is not correlated to tracing of the user's own HTTP requests. However, this is useful to follow a connection across ztunnels. |
Based on the three redirection paths, the ztunnel has three primary types of traffic it handles.
Requests leaving a pod go through the "outbound" code path on port 15001. This is where most of Ztunnel's logic lives.
For outbound traffic, we need to first determine where the traffic is destined to.
As Ztunnel operates at L4, we only have the destination IP/port (recovered via SO_ORIGINAL_DST).
This may be an IP of a Service, a Pod, or something outside the cluster.
Ztunnel will look up the destination from the addresses it is configured with.
For traffic to unknown addresses, or to workloads that are not a part of the mesh, the traffic will just be passed through as is.
To make ztunnel more transparent, the original source IP address will be spoofed.
Additionally, splice will be used to make this proxying more efficient when possible.
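The first two steps, recovering the destination and deciding whether to pass the traffic through, can be sketched as follows. This assumes Linux and IPv4; `classify_outbound` and its `mesh_ips` table are hypothetical simplifications of the address lookup, and the handling of in-mesh traffic is deliberately left out.

```python
import socket
import struct
from typing import Dict, Tuple

SO_ORIGINAL_DST = 80  # Linux netfilter socket option (linux/netfilter_ipv4.h)


def parse_original_dst(raw: bytes) -> Tuple[str, int]:
    """Decode the 16-byte sockaddr_in returned for SO_ORIGINAL_DST.

    On Linux the bytes would come from something like:
        raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    """
    # Layout: 2-byte family, 2-byte port (network order), 4-byte IPv4, 8 bytes padding.
    port, packed_ip = struct.unpack("!2xH4s8x", raw)
    return socket.inet_ntoa(packed_ip), port


def classify_outbound(dst_ip: str, mesh_ips: Dict[str, bool]) -> str:
    """Return 'passthrough' for unknown or non-mesh destinations, else 'mesh'."""
    in_mesh = mesh_ips.get(dst_ip, False)  # unknown addresses count as non-mesh
    return "mesh" if in_mesh else "passthrough"
```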
For traffic in the mesh, things are a bit more complex:
In all cases, we spoof the original source IP.
Traffic entering a pod that is not transmitted over HBONE (i.e. with a destination port != 15008) is handled by the "inbound passthrough" code path, on ztunnel's port 15006.
This is fairly straightforward.
First, we need to check that this traffic is allowed.
Traffic may be denied by RBAC policies (especially from a STRICT mode enforcement, which denies plaintext traffic).
If it is allowed, we will forward to the target destination.
If the destination has a waypoint, that waypoint must have been bypassed for the traffic to reach the inbound passthrough code path. How we handle this is under discussion.
Traffic entering a pod over HBONE will be handled by the "inbound" code path, on port 15008.
Incoming requests have multiple "layers": TLS wrapping HTTP CONNECT that is wrapping the user's connection.
To unwrap the first layer, we terminate TLS. As part of this, we need to pick the correct certificate to serve on behalf of the destination workload. As discussed in HBONE, this is based on the destination IP. Additionally, we enforce the peer has a valid mesh identity (but do not assert which identity, yet).
Next, we terminate the CONNECT. From the headers, we know the target destination. If the target destination has a waypoint, we enforce that the request is coming from that waypoint. Otherwise, the request is rejected. If there is no waypoint, ztunnel will enforce RBAC policies against the request.
If all checks pass, ztunnel will open a connection to the target. This will spoof the source IP (taken from the Forwarded header for traffic from waypoints, or the incoming IP otherwise).
Once the connection is established we return a 200 HTTP code, and bi-directionally copy data to/from the tunnel to the destination.
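The admission decision made after TLS termination can be sketched as below. The `Target` shape and the `rbac_allows` callback are hypothetical, and recognizing the waypoint by comparing the peer's identity is an assumption made for illustration; the text above does not specify how the waypoint is recognized.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Target:
    ip: str
    # Hypothetical: identity of the waypoint fronting this target, if any.
    waypoint_identity: Optional[str] = None


def admit_inbound(target: Target, peer_identity: str,
                  rbac_allows: Callable[[str, Target], bool]) -> bool:
    """Decide whether a CONNECT may proceed to the target.

    True means ztunnel would open the upstream connection and answer 200.
    """
    if target.waypoint_identity is not None:
        # Targets with a waypoint only accept traffic from that waypoint.
        return peer_identity == target.waypoint_identity
    # No waypoint: ztunnel itself enforces the L4 policies.
    return bool(rbac_allows(peer_identity, target))
```

In this sketch, policy is not evaluated for targets that have a waypoint; that branch only checks that the request arrived via the waypoint, following the order of checks described above.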
Ztunnel certificates are based on the standard Istio SPIFFE format: spiffe://<trust domain>/ns/<ns>/sa/<sa>.
However, the identities of the certificates will be of the actual user workloads, not Ztunnel's own identity. This means Ztunnel will have multiple distinct certificates at a time, one for each unique identity (service account) running on its node.
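A short sketch of working with identities in this format follows. The `Identity` type and helper names are illustrative only; the regular expression accepts exactly the `spiffe://<trust domain>/ns/<ns>/sa/<sa>` layout given above and nothing else.

```python
import re
from typing import Iterable, List, NamedTuple


class Identity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str

    def uri(self) -> str:
        return (f"spiffe://{self.trust_domain}"
                f"/ns/{self.namespace}/sa/{self.service_account}")


_SPIFFE = re.compile(r"^spiffe://([^/]+)/ns/([^/]+)/sa/([^/]+)$")


def parse_identity(uri: str) -> Identity:
    """Split an Istio SPIFFE URI into its three components."""
    match = _SPIFFE.match(uri)
    if match is None:
        raise ValueError(f"not an Istio SPIFFE identity: {uri!r}")
    return Identity(*match.groups())


def identities_on_node(workload_identities: Iterable[str]) -> List[str]:
    """Distinct identities across local workloads: one certificate per entry."""
    return sorted(set(workload_identities))
```

For example, three pods sharing one service account on a node contribute a single entry, and therefore a single certificate.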
When fetching certificates, ztunnel will authenticate to the CA with its own identity, but request the identity of another workload. Critically, the CA must enforce that the ztunnel has permission to request that identity. Requests for identities not running on the node are rejected. This is critical to ensure that a compromised node does not compromise the entire mesh.
This CA enforcement is done by Istio's CA, and is a requirement for any alternative CAs integrating with Ztunnel.
Note: Ztunnel authenticates to the CA with a Kubernetes Service Account JWT token. The token encodes the pod information, which is what makes this enforcement possible.
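The check a CA needs to perform can be sketched as a single predicate. This assumes the CA has already resolved the authenticated ztunnel to a node and has a view of which identities run on each node; both the `may_issue` name and the `identities_by_node` table are hypothetical.

```python
from typing import Dict, Set


def may_issue(requester_node: str, requested_identity: str,
              identities_by_node: Dict[str, Set[str]]) -> bool:
    """Allow issuance only if a workload with that identity runs on the requester's node."""
    return requested_identity in identities_by_node.get(requester_node, set())
```

A ztunnel on an unknown node, or one asking for an identity that only runs elsewhere, is refused, which bounds what a compromised node can impersonate.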
Ztunnel will request certificates for all identities on the node. It determines this based on the Workload xDS configuration it receives. When a new identity is discovered on the node, it will be enqueued for fetching at a low priority, as an optimization. However, if a request needs a certain identity that we have not fetched yet, it will be immediately requested.
Ztunnel additionally will handle the rotation of these certificates (typically 24hr expiration) as they approach expiry.
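The two fetch paths can be modeled with a small priority queue. This is a simplification rather than the actual scheduler: the `FetchQueue` class is hypothetical, "immediately requested" is modeled as the highest priority, and rotation is not covered.

```python
import heapq
import itertools
from typing import List, Optional, Set, Tuple

ON_DEMAND, BACKGROUND = 0, 1  # lower value is fetched first


class FetchQueue:
    """Order certificate fetches: on-demand requests jump ahead of prefetches."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority
        self._fetched: Set[str] = set()

    def discovered(self, identity: str) -> None:
        """A new identity appeared on the node: prefetch at low priority."""
        heapq.heappush(self._heap, (BACKGROUND, next(self._order), identity))

    def needed_now(self, identity: str) -> None:
        """A live request needs this identity: fetch ahead of any prefetch."""
        heapq.heappush(self._heap, (ON_DEMAND, next(self._order), identity))

    def next(self) -> Optional[str]:
        """Return the next identity to fetch, skipping ones already handled."""
        while self._heap:
            _, _, identity = heapq.heappop(self._heap)
            if identity not in self._fetched:
                self._fetched.add(identity)
                return identity
        return None
```

An identity that is both prefetched and requested on demand is fetched once, at the earlier of the two positions.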
Ztunnel emits all 4 of the TCP metrics defined in the Istio Standard Metrics.