design/bgp-bfd.md
The purpose of this enhancement is to enable BFD to be used in conjunction with BGP.
BFD will be available only in the upcoming FRR-based implementation.
Without BFD support, the BGP hold-time parameter can be lowered to speed up failure detection, but its minimum value is 3 seconds.
BFD detects path failures faster than BGP, and using the two together lets users provide a better service.
BFD for BGP is supported by FRR out of the box.
As a cluster administrator, I want to declare a BGP session to be backed up by BFD.
As a cluster administrator, I want to be able to set the BFD parameters related to a BGP session.
The idea is to leverage FRR to enable BFD on a BGP session.
When declaring a BGP peer, all it takes to enable BFD is to add a bfd property on the neighbour:
neighbor <A.B.C.D|X:X::X:X|WORD> bfd profile BFDPROF
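As an illustration of how the FRR configuration generator could emit this statement, here is a minimal Go sketch. The function name `neighborBFDLine` is hypothetical, and the sketch assumes the peer is identified by an IP address only (the FRR syntax above also accepts a WORD, which is not handled here).

```go
package main

import (
	"fmt"
	"net"
)

// neighborBFDLine builds the FRR statement that attaches a BFD profile to a
// BGP neighbor. Hypothetical helper: it only accepts IPv4/IPv6 literals.
func neighborBFDLine(peerAddr, profile string) (string, error) {
	if net.ParseIP(peerAddr) == nil {
		return "", fmt.Errorf("invalid peer address %q", peerAddr)
	}
	if profile == "" {
		return "", fmt.Errorf("empty BFD profile name")
	}
	return fmt.Sprintf("neighbor %s bfd profile %s", peerAddr, profile), nil
}

func main() {
	line, err := neighborBFDLine("10.0.0.1", "bfdprofile1")
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```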
The proposal here is to define a bfd profile section in the config structure that looks like:
bfd-profiles:
- name: bfdprofile1
  receive-interval: 150
  transmit-interval: 150
  detect-multiplier: 10
  echo-receive-interval: 20
  echo-transmit-interval: 20
  echo-mode: true
  passive-mode: true
  minimum-ttl: 5
When a property of the profile is not set, MetalLB will honor FRR default values.
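One possible way to honor FRR defaults is to model every optional property as a pointer and emit a configuration line only when the pointer is set. The sketch below is illustrative, not the actual MetalLB implementation: the `BFDProfile` type, `renderProfile`, and the `u32` helper are hypothetical names, and the FRR keyword spellings (in particular `echo receive-interval` and `echo transmit-interval`) are assumptions that should be checked against the FRR version in use.

```go
package main

import (
	"fmt"
	"strings"
)

// BFDProfile mirrors the proposed bfd-profiles entry. A nil pointer means
// the property was not set, so nothing is emitted and FRR keeps its default.
type BFDProfile struct {
	Name                 string
	ReceiveInterval      *uint32
	TransmitInterval     *uint32
	DetectMultiplier     *uint32
	EchoReceiveInterval  *uint32
	EchoTransmitInterval *uint32
	EchoMode             *bool
	PassiveMode          *bool
	MinimumTTL           *uint32
}

// u32 is a small helper to take the address of a literal.
func u32(v uint32) *uint32 { return &v }

// renderProfile returns an FRR "bfd / profile" stanza containing only the
// properties that were explicitly set. Keyword names are assumed.
func renderProfile(p BFDProfile) string {
	var b strings.Builder
	fmt.Fprintf(&b, "bfd\n profile %s\n", p.Name)
	num := func(keyword string, v *uint32) {
		if v != nil {
			fmt.Fprintf(&b, "  %s %d\n", keyword, *v)
		}
	}
	// An explicit false is treated like "unset" in this sketch.
	flag := func(keyword string, v *bool) {
		if v != nil && *v {
			fmt.Fprintf(&b, "  %s\n", keyword)
		}
	}
	num("receive-interval", p.ReceiveInterval)
	num("transmit-interval", p.TransmitInterval)
	num("detect-multiplier", p.DetectMultiplier)
	num("echo receive-interval", p.EchoReceiveInterval)
	num("echo transmit-interval", p.EchoTransmitInterval)
	flag("echo-mode", p.EchoMode)
	flag("passive-mode", p.PassiveMode)
	num("minimum-ttl", p.MinimumTTL)
	return b.String()
}

func main() {
	// Only detect-multiplier is set; every other property is left to FRR.
	fmt.Print(renderProfile(BFDProfile{Name: "bfdprofile1", DetectMultiplier: u32(10)}))
}
```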
When setting a BGP peer, an optional bfd-profile property will enable BFD:
peers:
- peer-address: 10.0.0.1
  peer-asn: 64501
  my-asn: 64500
  bfd-profile: bfdprofile1
Configuring BFD sessions while running in legacy mode will cause the configuration file to be rejected with an error.
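The validation rules described above could be sketched in Go as follows. All type and function names (`Config`, `Peer`, `validateBFD`) are hypothetical, and two details are assumptions rather than part of the proposal: that merely defining a profile in legacy mode is enough to reject the file, and that a peer referencing an undefined profile is also an error.

```go
package main

import (
	"errors"
	"fmt"
)

// Peer is a reduced view of a BGP peer entry. An empty BFDProfile means
// that BFD was not requested for the session.
type Peer struct {
	Address    string
	BFDProfile string
}

// Config is a reduced view of the parsed configuration file.
type Config struct {
	BFDProfiles []string // names of the declared profiles
	Peers       []Peer
}

// validateBFD rejects BFD settings in legacy mode and, in FRR mode, checks
// that every referenced profile is declared exactly once.
func validateBFD(cfg Config, frrMode bool) error {
	if !frrMode {
		if len(cfg.BFDProfiles) > 0 {
			return errors.New("bfd-profiles are not supported in legacy mode")
		}
		for _, p := range cfg.Peers {
			if p.BFDProfile != "" {
				return fmt.Errorf("peer %s: bfd-profile is not supported in legacy mode", p.Address)
			}
		}
		return nil
	}
	known := make(map[string]bool, len(cfg.BFDProfiles))
	for _, name := range cfg.BFDProfiles {
		if known[name] {
			return fmt.Errorf("duplicate BFD profile %q", name)
		}
		known[name] = true
	}
	for _, p := range cfg.Peers {
		if p.BFDProfile != "" && !known[p.BFDProfile] {
			return fmt.Errorf("peer %s: unknown BFD profile %q", p.Address, p.BFDProfile)
		}
	}
	return nil
}

func main() {
	cfg := Config{
		BFDProfiles: []string{"bfdprofile1"},
		Peers:       []Peer{{Address: "10.0.0.1", BFDProfile: "bfdprofile1"}},
	}
	fmt.Println("FRR mode:", validateBFD(cfg, true))
	fmt.Println("legacy mode:", validateBFD(cfg, false))
}
```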
Keeping the operator as the guinea pig for the CRD implementation, a new BFDProfile CRD will be introduced, with the form of:
apiVersion: metallb.io/v1alpha1
kind: BFDProfile
metadata:
  name: profile
  namespace: metallb-system
spec:
  receive-interval: 150
  transmit-interval: 150
  detect-multiplier: 10
  echo-receive-interval: 20
  echo-transmit-interval: 20
  echo-mode: true
  passive-mode: true
  minimum-ttl: 5
Similarly, the BGPPeer CRD that is being introduced will gain a new optional bfdProfile field.
Ideally, if/when the CRDs are moved back to MetalLB, it will be possible to enrich the BGPPeer CRD status with information containing the status of the BFD session.
Metrics describing the status (and the health) of the BFD session between two peers will be produced. FRR reports the status of a given session in the form of
frr# show bfd peers
BFD Peers:
    peer 192.168.0.1
        ID: 1
        Remote ID: 1
        Status: up
        Uptime: 1 minute(s), 51 second(s)
        Diagnostics: ok
        Remote diagnostics: ok
        Peer Type: dynamic
        Local timers:
            Detect-multiplier: 3
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo receive interval: 50ms
            Echo transmission interval: disabled
        Remote timers:
            Detect-multiplier: 3
            Receive interval: 300ms
            Transmission interval: 300ms
            Echo receive interval: 50ms
but also provides indicators on the health of a given session:
frr# show bfd peer 192.168.0.1 counters
    peer 192.168.0.1
        Control packet input: 126 packets
        Control packet output: 247 packets
        Echo packet input: 2409 packets
        Echo packet output: 2410 packets
        Session up events: 1
        Session down events: 0
        Zebra notifications: 4
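To turn this output into metrics, the counters have to be extracted first. The following Go sketch parses the plain-text form shown above into a map keyed by counter name. The function name `parseBFDCounters` is hypothetical, and the parser assumes the exact `name: value [unit]` layout of the sample. If the FRR version in use offers a JSON variant of the command, consuming that would likely be more robust than parsing text.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseBFDCounters extracts "name: <number> [unit]" pairs from the text
// output of "show bfd peer <addr> counters". Lines without a numeric value
// (the prompt, the "peer ..." header) are skipped.
func parseBFDCounters(out string) (map[string]uint64, error) {
	counters := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		key, rest, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		n, err := strconv.ParseUint(fields[0], 10, 64)
		if err != nil {
			continue // not a counter line
		}
		counters[key] = n
	}
	return counters, sc.Err()
}

func main() {
	sample := `frr# show bfd peer 192.168.0.1 counters
    peer 192.168.0.1
        Control packet input: 126 packets
        Control packet output: 247 packets
        Session up events: 1
        Session down events: 0`
	counters, err := parseBFDCounters(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(counters["Control packet input"], counters["Session down events"])
}
```

Each map entry could then back one metric per peer, for example a counter for session down events.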
e2e tests: E2E tests will be expanded to cover BFD, using external container(s) running FRR. The tests will need to cover the cases where a node is dropped, verifying that the broken route is detected by BFD.
unit tests: Unit tests will be added for any additional code in the MetalLB repository.
The alternative is not to implement the feature and to rely on a separate instance of FRR to cover BFD. However, the integration is straightforward and would be a nice addition on top of BGP.
The only constraint for this enhancement is the dependency on the FRR integration.
1 - FRR integration is complete, and available for use.
2 - The BFD feature is added together with e2e tests.
3 - The BFD feature is added to the Documentation and to the Operator as specific CRDs.