.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io
.. _routing:
#######
Routing
#######
.. _arch_overlay:
.. _encapsulation:

Encapsulation
=============
When no configuration is provided, Cilium automatically runs in this mode as it is the mode with the fewest requirements on the underlying networking infrastructure.
In this mode, all cluster nodes form a mesh of tunnels using the UDP-based
encapsulation protocols :term:`VXLAN` or :term:`Geneve`. All traffic between
Cilium nodes is encapsulated.
Encapsulation relies on normal node to node connectivity. This means that if Cilium nodes can already reach each other, all routing requirements are already met.
The underlying network and firewalls must allow encapsulated packets:
================== =====================
Encapsulation Mode Port Range / Protocol
================== =====================
VXLAN (Default)    8472/UDP
Geneve             6081/UDP
================== =====================
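On a running node you can verify that the tunnel device has been created, for
example with the default VXLAN mode, where the interface is typically named
``cilium_vxlan``::

    ip -d link show cilium_vxlan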
Simplicity
  The network which connects the cluster nodes does not need to be made aware
  of the PodCIDRs. Cluster nodes can spawn multiple routing or link-layer
  domains. The topology of the underlying network is irrelevant as long as
  cluster nodes can reach each other using IP/UDP.

Addressing space
  Due to not depending on any underlying networking limitations, the available
  addressing space is potentially much larger and allows running any number of
  pods per node if the PodCIDR size is configured accordingly.

Auto-configuration
  When running together with an orchestration system such as Kubernetes, the
  list of all nodes in the cluster, including their associated allocation
  prefix, is made available to each agent automatically. New nodes joining the
  cluster are automatically incorporated into the mesh.

Identity context
  Encapsulation protocols allow for the carrying of metadata along with the
  network packet. Cilium makes use of this ability to transfer metadata such
  as the source security identity. The identity transfer is an optimization
  designed to avoid one identity lookup on the remote node.

MTU Overhead
  Due to adding encapsulation headers, the effective MTU available for payload
  is lower than with native-routing (50 bytes per network packet for VXLAN).
  This results in a lower maximum throughput rate for a particular network
  connection. This can be largely mitigated by enabling jumbo frames (50 bytes
  of overhead per 1500 bytes vs. 50 bytes of overhead per 9000 bytes).
The following options can be used to configure encapsulation:
* ``tunnel-protocol``: Set the encapsulation protocol to ``vxlan`` or
  ``geneve``, defaults to ``vxlan``.
* ``underlay-protocol``: Set the IP family for the underlay. Defaults to
  ``ipv4``. The underlying network must support that protocol.
* ``tunnel-port``: Set the port for the encapsulation protocol. Defaults
  to ``8472`` for ``vxlan`` and ``6081`` for ``geneve``.
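For illustration, a minimal sketch of these keys as they could appear in the
``cilium-config`` ConfigMap (the values shown are simply the documented
defaults):

.. code-block:: yaml

    tunnel-protocol: vxlan
    underlay-protocol: ipv4
    tunnel-port: "8472"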
.. _arch_direct_routing:
.. _native_routing:

Native-Routing
==============

The native routing datapath is enabled with ``routing-mode: native`` and
enables the native packet forwarding mode. The native packet forwarding mode
leverages the routing capabilities of the network Cilium runs on instead of
performing encapsulation.
.. image:: native_routing.png
    :align: center
In native routing mode, Cilium will delegate all packets which are not
addressed to another local endpoint to the routing subsystem of the Linux
kernel. This means that the packet will be routed as if a local process had
emitted the packet. As a result, the network connecting the cluster nodes must
be capable of routing PodCIDRs.
Cilium automatically enables IP forwarding in the Linux kernel when native routing is configured.
In order to run the native routing mode, the network connecting the hosts on
which Cilium is running must be capable of forwarding IP traffic using
addresses given to pods or other workloads.
The Linux kernel on the node must be aware of how to forward packets of pods
or other workloads of all nodes running Cilium. This can be achieved in two
ways:
1. The node itself does not know how to route all pod IPs but a router exists
   on the network that knows how to reach all other pods. In this scenario,
   the Linux node is configured to contain a default route to point to such a
   router. This model is used for cloud provider network integration. See
   :ref:`gke_datapath`, :ref:`aws_eni_datapath`, and :ref:`ipam_azure` for
   more details.

2. Each individual node is made aware of all pod IPs of all other nodes and
   routes are inserted into the Linux kernel routing table to represent this
   (a sketch of such routes is shown below). If all nodes share a single L2
   network, then this can be taken care of by enabling the option
   ``auto-direct-node-routes: true``. Otherwise, an additional system
   component such as a BGP daemon must be run to distribute the routes. See
   the guide :ref:`kube-router` on how to achieve this using the kube-router
   project.
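For example, a route representing a peer node's PodCIDR as it might appear in
the node's kernel routing table (the PodCIDR ``10.1.1.0/24``, the peer node IP
``192.168.1.12``, and the device ``eth0`` are hypothetical values chosen for
illustration)::

    10.1.1.0/24 via 192.168.1.12 dev eth0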
The following configuration options must be set to run the datapath in native routing mode:
* ``routing-mode: native``: Enable native routing mode.
* ``ipv4-native-routing-cidr: x.x.x.x/y``: Set the CIDR in which native
  routing can be performed.

The following configuration options are optional when running the datapath in
native routing mode:
* ``direct-routing-skip-unreachable``: If a BGP daemon is running and there
  are multiple native subnets to the cluster network,
  ``direct-routing-skip-unreachable: true`` can be added alongside
  ``auto-direct-node-routes`` to give each node L2 connectivity in each zone
  without traffic always needing to be routed by the BGP routers.
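Putting these together, a minimal sketch of a native routing configuration in
the ``cilium-config`` ConfigMap could look as follows (the CIDR is a
placeholder for your cluster's native routing range, and
``auto-direct-node-routes`` assumes all nodes share a single L2 network as
described above):

.. code-block:: yaml

    routing-mode: native
    ipv4-native-routing-cidr: 10.0.0.0/8
    auto-direct-node-routes: "true"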
.. _aws_eni_datapath:

AWS ENI
=======

The AWS ENI datapath is enabled when Cilium is run with the option
``--ipam=eni``. It is a special purpose datapath that is useful when running
Cilium in an AWS environment.
Pods are assigned ENI IPs which are directly routable in the AWS VPC. This simplifies communication of pod traffic within VPCs and avoids the need for SNAT.
Pod IPs are assigned a security group. The security groups for pods are
configured per node, which allows creating node pools and giving different
security group assignments to different pods. See section :ref:`ipam_eni` for
more details.
The number of ENI IPs is limited per instance. The limit depends on the EC2 instance type. This can become a problem when attempting to run a larger number of pods on very small instance types.
Allocation of ENIs and ENI IPs requires interaction with the EC2 API which is
subject to rate limiting. This is primarily mitigated via the operator design,
see section :ref:`ipam_eni` for more details.
Ingress
~~~~~~~
1. Traffic is received on one of the ENIs attached to the instance which is
   represented on the node as interface ``ethN``.

2. An IP routing rule ensures that traffic to all local pod IPs is done using
   the main routing table::

       20:     from all to 192.168.105.44 lookup main

3. The main routing table contains an exact match route to steer traffic into
   a veth pair which is hooked into the pod::

       192.168.105.44 dev lxc5a4def8d96c5

4. All traffic passing ``lxc5a4def8d96c5`` on the way into the pod is subject
   to Cilium's eBPF program to enforce network policies, provide service
   reverse load-balancing, and visibility.
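Both the routing rules and the main table can be inspected directly on the
node using the standard iproute2 tools (the rule priorities and addresses
shown above will differ between nodes and pods)::

    ip rule show
    ip route show table main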
Egress
~~~~~~
1. The pod's network namespace contains a default route which points to the
   node's router IP via the veth pair which is named ``eth0`` inside of the
   pod and ``lxcXXXXXX`` in the host namespace. The router IP is allocated
   from the ENI space, allowing for sending of ICMP errors from the router IP
   for Path MTU purposes.

2. After passing through the veth pair and before reaching the Linux routing
   layer, all traffic is subject to Cilium's eBPF program to enforce network
   policies, implement load-balancing and provide networking features.

3. An IP routing rule ensures that traffic from individual endpoints is using
   a routing table specific to the ENI from which the endpoint IP was
   allocated::

       30:     from 192.168.105.44 to 192.168.0.0/16 lookup 92

4. The ENI specific routing table contains a default route which redirects
   to the router of the VPC via the ENI interface::

       default via 192.168.0.1 dev eth2
       192.168.0.1 dev eth2
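The ENI specific routing table can be inspected the same way once its number
is known from ``ip rule show`` (``92`` is just the example table number used
above)::

    ip route show table 92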
Configuration
-------------
The AWS ENI datapath is enabled by setting the following option:
.. code-block:: yaml

    ipam: eni
    enable-endpoint-routes: "true"
    auto-create-cilium-node-resource: "true"
    egress-masquerade-interfaces: eth+
* ``ipam: eni`` Enables the ENI specific IPAM backend and indicates to the
datapath that ENI IPs will be used.
* ``enable-endpoint-routes: "true"`` enables direct routing to the ENI
  veth pairs without requiring routing via the ``cilium_host`` interface.
* ``auto-create-cilium-node-resource: "true"`` enables the automatic creation of
the ``CiliumNode`` custom resource with all required ENI parameters. It is
possible to disable this and provide the custom resource manually.
* ``egress-masquerade-interfaces: eth+`` is the interface selector of all
interfaces which are subject to masquerading. Masquerading can be disabled
entirely with ``enable-ipv4-masquerade: "false"``.
See the section :ref:`ipam_eni` for details on how to configure ENI IPAM
specific parameters.
.. _gke_datapath:
Google Cloud
============
When running Cilium on Google Cloud via either Google Kubernetes Engine (GKE)
or self-managed, it is possible to utilize `Google Cloud's networking layer
<https://cloud.google.com/products/networking>`_ with Cilium running in a
:ref:`native_routing` configuration. This provides native networking
performance while benefiting from many additional Cilium features such as
policy enforcement, load-balancing with DSR, efficient
NodePort/ExternalIP/HostPort implementation, extensive visibility features, and
so on.
.. image:: gke_datapath.png
    :align: center
Addressing
  Cilium will assign IPs to pods out of the PodCIDR assigned to the specific
  Kubernetes node. By using `Alias IP ranges
  <https://cloud.google.com/vpc/docs/alias-ip>`_, these IPs are natively
  routable on Google Cloud's network without additional encapsulation or
  route distribution.

Masquerading
  All traffic not staying within the ``ipv4-native-routing-cidr`` (defaults to
  the Cluster CIDR) will be masqueraded to the node's IP address to become
  publicly routable.

Load-balancing
  ClusterIP load-balancing will be performed using eBPF for all versions of
  GKE.

Policy enforcement & visibility
  All NetworkPolicy enforcement and visibility is provided using eBPF.
Configuration
-------------
The following configuration options must be set to run the datapath on GKE:
* ``gke.enabled: true``: Enables the Google Kubernetes Engine (GKE) datapath.
  Setting this to ``true`` will enable the following options:

  * ``ipam: kubernetes``: Enable :ref:`k8s_hostscope` IPAM
  * ``routing-mode: native``: Enable native routing mode
  * ``enable-endpoint-routes: true``: Enable per-endpoint routing on the node
    (automatically disables the local node route).

* ``ipv4-native-routing-cidr: x.x.x.x/y``: Set the CIDR in which native
  routing is supported.
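As an illustration only, a minimal Helm values sketch for this datapath could
look as follows. ``gke.enabled`` is taken from the list above; the
``ipv4NativeRoutingCIDR`` key name and the example CIDR are assumptions that
should be verified against your chart version:

.. code-block:: yaml

    # Hypothetical values.yaml sketch; verify key names against the
    # Cilium Helm chart version in use.
    gke:
      enabled: true
    ipv4NativeRoutingCIDR: 10.0.0.0/8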
See the getting started guide :ref:`k8s_install_quick` to install Cilium on
Google Kubernetes Engine (GKE).