design/gracefulrestart-bgp.md
This is not a design document but additional documentation for the Graceful Restart feature.
BGP Graceful Restart (GR), defined in RFC-4724, is a mechanism that allows BGP routers to continue forwarding data packets along known routes while routing protocol information is being restored. GR can be applied when the control plane is independent of the forwarding plane, so that a restart of the control plane does not affect forwarding. This is the case for most Kubernetes clusters, where the control plane is a host-networked process (FRR) and the forwarding plane is the primary network CNI (Calico, Cilium, OVNK, etc.). MetalLB implements this feature to minimize network disruptions during planned pod restarts caused by upgrades.
GR can be applied per BGP neighbor by setting the field enableGracefulRestart
to true.
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: example
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64512
  peerAddress: 172.30.0.3
  enableGracefulRestart: true
GR is a capability that can only be negotiated in the OPEN message during the
BGP handshake. This is defined by the BGP protocol and cannot be changed.
MetalLB has no user-facing mechanism to reset a BGP session, nor does it reset
the peering internally. The options were therefore either to let the
configuration change pass through and warn the user to reset the BGP peering
externally (e.g. by executing a BGP command on the external router), or to make
the field immutable so that the user must delete and re-create the peer. The
latter option was preferred.
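
To illustrate what immutability means for an update request, here is a minimal Python sketch. It is not MetalLB's actual validation code: the function name `validate_update` and the dictionary representation of the spec are assumptions made for this example only.

```python
def validate_update(old_spec: dict, new_spec: dict) -> None:
    """Reject an update that changes enableGracefulRestart on an existing peer.

    Hypothetical helper for illustration; not MetalLB's real webhook logic.
    """
    old_gr = old_spec.get("enableGracefulRestart", False)
    new_gr = new_spec.get("enableGracefulRestart", False)
    if old_gr != new_gr:
        raise ValueError(
            "enableGracefulRestart is immutable; delete and re-create the BGPPeer"
        )


peer = {
    "myASN": 64512,
    "peerASN": 64512,
    "peerAddress": "172.30.0.3",
    "enableGracefulRestart": True,
}

# An update that leaves the flag untouched passes this particular check.
validate_update(peer, {**peer, "peerAddress": "172.30.0.4"})

# Flipping the flag on an existing peer is rejected.
try:
    validate_update(peer, {**peer, "enableGracefulRestart": False})
    rejected = False
except ValueError:
    rejected = True
```

The design choice described above means the capability is only ever sent in a fresh OPEN message, because a re-created peer always starts a new BGP session.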
{{% notice info %}} BGP GR requires both ends to be configured correctly, so it is recommended to verify on the external peer that GR is enabled for the BGP session:
$show bgp neighbor <peer>
...
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families by peer:
IPv4 Unicast(preserved)
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Local GR Mode: Helper*
Remote GR Mode: Restart
...
{{% /notice %}}
GR has a number of internal parameters. The traditional FRR defaults are kept,
with one exception (the last item, which is set by MetalLB):

* bgp graceful-restart restart-time is 120 seconds (by FRR).
* no bgp graceful-restart notifications: no GR support for BGP NOTIFICATION messages (by FRR).
* no bgp hard-administrative-reset (by FRR).
* bgp long-lived-graceful-restart stale-time 0: BGP long-lived graceful restart (LLGR/RFC-9494) is disabled (by FRR).
* bgp graceful-restart preserve-fw-state: the Forwarding State (F) bit is set (by MetalLB).

In the context of Kubernetes, stopping the pod that runs the FRR process means
that the API server asks the kubelet to stop the pod, and the kubelet does so by
having SIGTERM sent to the main process (PID 1) of each container in the pod.
When the FRR process receives the signal, it closes the TCP connection, and that
event starts the GR timer on the external peer. This is the desired behavior
when our DaemonSet pod stops for an update, but it may be less desirable if the
user reduces the set of nodes the pod can run on, or removes the MetalLB
instance when using the operator. Nevertheless, the impact should be low because
the dataplane continues to work. No other case has been identified in which GR is triggered.
{{% notice info %}} When no TCP packet is lost, the external peer should start the GR timer. Nevertheless, this needs to be validated on the specific vendor device. Example logs from FRR:
$ cat frr.log
...
2024/11/18 10:01:50.140 BGP: [NJ2F2-2W769] 172.18.0.2 [Event] BGP connection closed fd 23
2024/11/18 10:01:50.140 BGP: [NTX3S-9Q8YV] 172.18.0.2 [Event] BGP error 5 on fd 23
2024/11/18 10:01:50.140 BGP: [ZWCSR-M7FG9] 172.18.0.2 [FSM] TCP_connection_closed (Established->Clearing), fd 23
2024/11/18 10:01:50.140 BGP: [RPZW2-39GTY] 172.18.0.2(frr-k8s-control-plane) graceful restart timer started for 120 sec
2024/11/18 10:01:50.140 BGP: [TK2B6-ZF4MR] 172.18.0.2(frr-k8s-control-plane) graceful restart stalepath timer started for 360 sec
...
{{% /notice %}}
When GR is enabled and a cluster admin drains a node, the BGP peering towards that
node remains, because DaemonSet pods continue to run on unschedulable nodes
(--ignore-daemonsets). If there is a service with traffic policy "Cluster", that
node might continue to forward traffic. If a node reboot or shutdown follows,
traffic will be blackholed until the GR timer (2 min) expires on the peer and
the peer removes the route towards the stopped node.
{{% notice info %}} If the service has traffic policy "Local", the routes will be removed, because every pod that is an endpoint of the service will be removed from the node. {{% /notice %}}
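
The timing of the drain-then-shutdown case can be made concrete with a small Python sketch. It assumes the peer starts its GR timer at the moment the session closes and that the node does not come back before the timer expires; the function names are hypothetical and only the 120 second value comes from this document.

```python
RESTART_TIME_S = 120  # FRR default restart-time mentioned in this document


def blackhole_window(session_closed_at, restart_time=RESTART_TIME_S):
    """Return (start, end) of the interval in which the peer keeps stale routes.

    Assumption: the peer starts its GR timer when the session closes and
    the node does not come back before the timer expires.
    """
    return (session_closed_at, session_closed_at + restart_time)


def may_be_blackholed(t, session_closed_at, restart_time=RESTART_TIME_S):
    """True if traffic sent towards the stopped node at time t may be dropped."""
    start, end = blackhole_window(session_closed_at, restart_time)
    return start <= t < end


# Node shuts down 10 s into the observation; the peer keeps the route until 130 s.
print(blackhole_window(10))          # (10, 130)
print(may_be_blackholed(60, 10))     # True
print(may_be_blackholed(130, 10))    # False
```

Only the share of traffic that the peer routes to the stopped node is affected, and only for services with traffic policy "Cluster".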
According to RFC-5882, section "BFD Shares Fate with the Control Plane":

> If BFD shares fate with the control plane on either system (the "C" bit is clear in either direction), a BFD session failure cannot be disentangled from other events taking place in the control plane. In many cases, the BFD session will fail as a side effect of the restart taking place. As such, it would be best to avoid aborting any Graceful Restart taking place, if possible (since otherwise BFD and Graceful Restart cannot coexist).

GR and BFD can therefore work together, and the helper router should ignore BFD events while the GR timer is running (the green box below).
{{<mermaid align="center">}}
sequenceDiagram
    participant kubelet
    participant A as K8S FRR
    participant B as External Peer
    A-->>B: Sends Graceful Restart Capability in BGP OPEN message
    kubelet->>A: SIGTERM
    A->>+B: TCP Close
    rect rgb(192,255,193)
    Note right of B: graceful restart timer started for 120 sec
    Note right of B: BFD events are ignored
    B-->>B: neigh went from Established to Clearing
    Note right of B: Routes are stale
    A->>A: Restarting
    A->>+B: BGP Peering
    B-->>B: A went from OpenConfirm to Established
    end
    Note over A,B: BGP Established
    Note right of B: graceful restart timer stops
    Note right of B: BFD events are NOT ignored
    A->>+B: Update/End-Of Rib
    Note right of B: Routes are NOT stale
{{< /mermaid >}}
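
The helper side of the sequence diagram above can be sketched as a toy state machine in Python. This is a simplified model, not FRR's or any vendor's implementation: the class `HelperPeer`, its method names, and its state strings are assumptions, and the separate stalepath timer seen in the FRR logs is deliberately left out.

```python
class HelperPeer:
    """Toy model of the external (helper) router's view of one neighbor."""

    def __init__(self, restart_time=120):
        self.restart_time = restart_time
        self.state = "Established"
        self.routes_present = True
        self.routes_stale = False
        self.gr_deadline = None  # time at which the GR timer would expire

    def in_gr_window(self, now):
        return self.gr_deadline is not None and now < self.gr_deadline

    def on_tcp_close(self, now):
        # Session drops: keep the routes, mark them stale, start the GR timer.
        self.state = "Clearing"
        self.routes_stale = True
        self.gr_deadline = now + self.restart_time

    def on_bfd_down(self, now):
        # Inside the GR window (the green box) BFD failures are ignored.
        if self.in_gr_window(now):
            return "ignored"
        self._drop_routes()
        return "routes removed"

    def on_timer_check(self, now):
        # If the neighbor did not come back in time, the stale routes go away.
        if self.gr_deadline is not None and now >= self.gr_deadline:
            self._drop_routes()

    def on_established(self):
        # Peering is back: the GR timer stops, BFD is honored again.
        self.state = "Established"
        self.gr_deadline = None

    def on_end_of_rib(self):
        # The neighbor has re-sent its routes, so nothing is stale any more.
        self.routes_stale = False

    def _drop_routes(self):
        self.routes_present = False
        self.routes_stale = False
        self.state = "Idle"
        self.gr_deadline = None


peer = HelperPeer()
peer.on_tcp_close(now=0)
during_restart = peer.on_bfd_down(now=5)   # falls inside the GR window
peer.on_established()
peer.on_end_of_rib()
after_restart = peer.on_bfd_down(now=40)   # GR timer already stopped
```

In this model `during_restart` is `"ignored"` and `after_restart` is `"routes removed"`, which mirrors the two BFD notes in the diagram.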
{{% notice warning %}} Whether BFD and GR can be used together is implementation specific. It depends on the vendor's recommendation and needs to be tested. For example, Juniper suggests not combining them.
One consideration is that running GR/BFD between routers in the middle of a large BGP network is different from running GR/BFD between a server and ToR/DCGW routers. {{% /notice %}}
There are two known issues that have been fixed upstream, but the fixes are not yet used by MetalLB.