design/0001-frr.md
The purpose of this enhancement is to use Free Range Routing (FRR) as an alternative BGP implementation in MetalLB. When directed to, MetalLB will publish prefixes via FRR rather than MetalLB’s current built-in BGP implementation.
The motivation for this enhancement is to provide an alternative production-ready BGP implementation for use in MetalLB. Overall, this should reduce the effort required to add features to the MetalLB project. For example, a number of issues in the current backlog may be addressed by using FRR. Notably:
FRR is a mature Linux Foundation routing protocol suite based on Quagga that has been used in many production deployments. As such, it has been proven in terms of its maturity, flexibility (as can be seen by the broad range of features it supports), scalability, security, reliability and performance. It also provides detailed logging features to aid debugging.
As a developer, I want to be able to run a comprehensive set of BGP system tests, preferably based on Kubernetes e2e tests, in order to ensure the MetalLB BGP implementation works as expected.
As a cluster administrator, I want to be able to specify a configuration option when deploying MetalLB in order to select the MetalLB BGP implementation to be used by the MetalLB speaker.
As a developer, I want to be able to configure FRR from MetalLB using FRR Go bindings in order to improve performance, reliability and debugging capabilities.
FRR does not provide documented northbound Go bindings to allow configuration of BGP. There is an experimental gRPC interface. This interface may need to be productionized through the FRR community. When this has been satisfactorily achieved, we can start Story 3.
Until that time, FRR can be configured declaratively by modifying its config file and reloading it when the configuration changes.
The current design assumes one BGP implementation. This proposal allows the BGP implementation type to be selected via a global configuration option that is passed at initialization of the speaker application. This BGP implementation type would be used for all BGP connections in the cluster. This should allow for the integration of the FRR BGP implementation and potentially other BGP implementations (if required by other organizations). The intention is to modify MetalLB primarily at or below the session interface (some small changes may be required above that interface) to minimize the impact of this integration and maximize reuse of common infrastructure.
In the current implementation, new BGP sessions are established by calling newBGP() in the main package which calls the bgp.New() function which in turn returns a session interface. This would be modified to create sessions based on BGP implementation type, essentially creating a session factory that would return the correct session type based on the configured BGP implementation.
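The session factory described above could look roughly like the following sketch. It is illustrative only: the `Session` interface is a simplified stand-in for MetalLB's session interface, and `BGPImpl`, `ImplNative`, `ImplFRR`, `nativeSession`, `frrSession` and `NewSession` are hypothetical names that do not come from the MetalLB code base.

```go
package main

import (
	"errors"
	"fmt"
)

// Session is a simplified, hypothetical stand-in for the session interface.
type Session interface {
	Set(prefixes ...string) error // advertise exactly these prefixes
	Close() error                 // tear down the session
}

// BGPImpl names the BGP implementation selected at speaker start-up (assumed type).
type BGPImpl string

const (
	ImplNative BGPImpl = "native" // built-in MetalLB BGP implementation
	ImplFRR    BGPImpl = "frr"    // FRR-backed implementation
)

// ErrUnknownImpl is returned when the configured implementation is not recognised.
var ErrUnknownImpl = errors.New("unknown BGP implementation")

// nativeSession and frrSession only record state; real sessions would talk
// to a peer or to the FRR container respectively.
type nativeSession struct {
	peer       string
	advertised []string
}

func (s *nativeSession) Set(p ...string) error { s.advertised = p; return nil }
func (s *nativeSession) Close() error          { return nil }

type frrSession struct {
	peer       string
	advertised []string
}

func (s *frrSession) Set(p ...string) error { s.advertised = p; return nil }
func (s *frrSession) Close() error          { return nil }

// NewSession is the factory: it returns the session type that matches the
// globally configured implementation.
func NewSession(impl BGPImpl, peer string) (Session, error) {
	switch impl {
	case ImplNative:
		return &nativeSession{peer: peer}, nil
	case ImplFRR:
		return &frrSession{peer: peer}, nil
	default:
		return nil, fmt.Errorf("%w: %q", ErrUnknownImpl, impl)
	}
}

func main() {
	s, err := NewSession(ImplFRR, "10.0.0.1:179")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer s.Close()
	if err := s.Set("192.168.10.0/24"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("created %T\n", s)
}
```

Because callers only depend on the interface, the BGP controller would not need to know which implementation is behind a session.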
┌───────────────┐
│               │
│    config     ├──┐
│ <<ConfigMap>> │  │                                         ┌───────────────┐
│               │  │reconcile                                │               │
└───────────────┘  │                                       ┌─┤  speakerlist  │
                   │                                       │ │  <<package>>  │
                   │                                       │ │               │
┌───────────────┐  │ ┌───────────────┐  ┌───────────────┐  │ └───────────────┘
│               │  │ │               │  │               │  │
│   services    ├──┼─┤      k8s      ├──┤     main      ├──┤
│  <<Service>>  │  │ │  <<package>>  │  │  <<package>>  │  │
│               │  │ │               │  │               │  │
└───────────────┘  │ └───────────────┘  └───────┬───────┘  │
                   │                    <<use>> │          │ ┌───────────────┐
                   │                            │          │ │               │
┌───────────────┐  │reconcile           ┌───────▼───────┐  │ │    config     │
│               │  │                    │               │  └─┤  <<package>>  │
│     nodes     │  │                    │   Protocol    │    │               │
│   <<Node>>    ├──┘                    │ <<interface>> │    └───────────────┘
│               │                       │               │
└───────────────┘                       └───────▲───────┘
                                                │<<implements>>
                                     ┌──────────┴─────────┐
                                     │                    │
                          ┌──────────┴────────┐  ┌────────┴──────┐
                          │      main::       │  │    main::     │
                          │ layer2_controller │  │ bgp_controller│
                          │     <<class>>     │  │   <<class>>   │
                          │                   │  │               │
                          └───────────────────┘  └───────┬───────┘
                                                         │
                                                         │ <<use>>
                                                 ┌───────▼───────┐
                                                 │               │
                                                 │    session    │
                                                 │ <<interface>> │
                                                 │               │
                                                 └───────▲───────┘
                                                         │ <<implement>>
                                             ┌───────────┴─────────────┐
                                             │                         │
                                     ┌───────┴───────┐         ┌───────┴───────┐
                                     │               │         │               │
                                     │  bgp_metallb  │         │    bgp_frr    │
                                     │  <<package>>  │         │  <<package>>  │
                                     │               │         │               │
                                     └───────────────┘         └───────┬───────┘
                                                                       │
                                                                       │
                                                               ┌───────┴───────┐
                                                               │               │
                                                               │      frr      │
                                                               │ <<container>> │
                                                               │               │
                                                               └───────────────┘
FRR will be deployed in a container as part of the speaker Pod. This will simplify the deployment for end users as MetalLB will not need to manage another FRR Pod or deal with inter-Pod communication.
Deploying FRR as a separate Pod was also considered. It would offer some advantages, such as removing the requirement for host networking from the speaker Pod and separating the FRR and speaker Pod lifecycles, but it would add complexity for the end user.
Initially, control of the FRR container will be achieved by declaratively editing and reloading the FRR configuration file.
This configuration interface will be used by MetalLB to implement bgp.New() and bgp.Close() for session creation and bgp.Set() for prefix advertisement.
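As a rough illustration of the declarative approach, the sketch below renders a minimal FRR BGP configuration from a desired state using Go's `text/template`. The `FRRConfig` and `Neighbor` types, the `RenderFRRConfig` function and the template itself are assumptions made for this example, not part of MetalLB; the generated statements (`router bgp`, `neighbor ... remote-as`, `network`) are standard FRR configuration syntax, but a real configuration would likely need more (timers, passwords, route maps, and so on).

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Neighbor describes one BGP peer (hypothetical type for this sketch).
type Neighbor struct {
	Addr string
	ASN  uint32
}

// FRRConfig is the desired state to render (hypothetical type for this sketch).
type FRRConfig struct {
	LocalASN  uint32
	RouterID  string
	Neighbors []Neighbor
	Prefixes  []string // IPv4 prefixes to advertise
}

// frrTemplate produces a minimal BGP stanza; "{{-" trims the preceding newline
// so that empty lists leave no blank lines behind.
const frrTemplate = `router bgp {{.LocalASN}}
 bgp router-id {{.RouterID}}
{{- range .Neighbors}}
 neighbor {{.Addr}} remote-as {{.ASN}}
{{- end}}
 address-family ipv4 unicast
{{- range .Prefixes}}
  network {{.}}
{{- end}}
 exit-address-family
`

// RenderFRRConfig turns the desired state into configuration file contents.
func RenderFRRConfig(c FRRConfig) (string, error) {
	t, err := template.New("frr").Parse(frrTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, c); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := RenderFRRConfig(FRRConfig{
		LocalASN:  64512,
		RouterID:  "10.0.0.2",
		Neighbors: []Neighbor{{Addr: "10.0.0.1", ASN: 64513}},
		Prefixes:  []string{"192.168.10.0/24"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

In this model, creating or closing a session adds or removes a neighbor from the desired state, and setting advertisements replaces the prefix list. The rendered file would then be written out and FRR asked to reload it; how the reload is triggered is left open here.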
There may be some common functionality in the bgp package that could be reused between BGP implementations. For example, the new bgp package will require integration with the “metrics” struct which updates Prometheus in response to BGP events. It may be necessary to refactor these types of functionality into a separate package for reuse amongst BGP implementations.
We will allow the FRR component to be upgraded independently in order to resolve specific FRR bugs without upgrading MetalLB. MetalLB will ship with a default FRR version, but it will be possible to configure the version (e.g. through Helm); the exact mechanism is still to be decided.
Speaker Pods can be restarted for upgrades. However, the Layer 2 memberlist code will see a node leave and rejoin the cluster. In BGP mode, similar disruptions can occur on restart (in particular "Connection reset by peer") and can be mitigated.
Upgrade/downgrade will remove a node by unlabelling the node. At this point the ‘speaker’ component on the node can be stopped and restarted with the desired version of FRR. It should be noted that as FRR only peers with peers outside the cluster, there is no requirement that each node within the cluster maintains the same version or a compatible version of FRR. However, it is required that each version of FRR is compatible with the BGP peer with which it is peering. After the ‘speaker’ component is restarted, the node can be labelled again.
- e2e tests: The intention is to expand on the MetalLB ‘dev-env’ KIND environment. After deploying this environment, end-to-end tests will be run against this test cluster. Further investigation will be required as part of Story 1 in order to determine the possibility of reusing some of the code from the Kubernetes e2e tests. This work has begun here.
- unit tests: Unit tests will be added to any additional code in the MetalLB repository.
- upgrade: We may need to add tests to deal with an upgrade to a newer FRR version.
Version skew between versions of FRR should not be a concern to Kubernetes or MetalLB as long as FRR presents a stable interface to the MetalLB ‘speaker’ component. This is because FRR instances will only peer with BGP peers outside of the cluster and not with each other.
A number of alternative open-source routing stacks were considered (FRR, BIRD, GoBGP) as a first target for integration. They were evaluated across a number of categories. GoBGP was discounted due to its relatively limited feature set. For example, it does not support BFD. FRR and BIRD are well-known and mature stacks which have been deployed in production and have active development communities. FRR was selected as the first target for integration for the following reasons:
dplane and FPM interfaces. This may ease integration with Kubernetes networking providers. There seem to be no equivalents in BIRD, which makes BIRD less extensible. This extensibility would enable FRR to integrate with dataplanes other than the Linux kernel. Examples: 1) hardware dataplanes, 2) an OpenFlow-configured dataplane such as Open vSwitch, 3) a DPDK dataplane.

It should be noted that, although this enhancement deals with FRR, it will provide a template and a standard interface to ease integration of other implementations (such as BIRD) in the future.
As this enhancement leads to the eventual retirement of the current native BGP implementation in MetalLB, a phased implementation plan consisting of 4 releases is proposed:
The current e2e tests set up a BGP peer for all the nodes, running inside an FRR container, and then try to reach the service.
In order to guarantee parity between the two implementations, the coverage of the e2e tests must be extended.
Every time we test an exposed service, we must verify that:
Every time the configuration is changed before testing a new scenario, we may need to wait an arbitrary amount of time or run the checks in an Eventually loop, since there is no feedback on which configuration is currently being consumed.
Another possible option is to remove the configuration completely and verify that hitting the service does not work anymore. This should ensure that it's possible to apply a new configuration.
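The "eventually" style of check mentioned above can be sketched as a small polling helper. This is a hand-rolled illustration, not the helper used by the MetalLB e2e tests (test suites commonly use a library such as Gomega for this); the `eventually` function and the `errNotReady` value are hypothetical names, and the check in `main` simulates a service that becomes reachable on the third attempt.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is what the simulated check returns until it succeeds (hypothetical).
var errNotReady = errors.New("service not reachable yet")

// eventually re-runs check every interval until it returns nil or the timeout
// expires. On timeout it returns the last error, wrapped with context.
func eventually(timeout, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := check()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	// Stand-in for "hit the service and verify the advertised routes".
	err := eventually(2*time.Second, 10*time.Millisecond, func() error {
		attempts++
		if attempts < 3 {
			return errNotReady
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Polling with a bounded timeout keeps the tests from depending on a fixed sleep, at the cost of a slower failure when the configuration is never applied.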
We must cover the following scenarios:
We need to ensure that the BGP parameters are correctly received by the peer.
The following parameters must be covered:
The current set of metrics must be covered by tests, in order to ensure that the new implementation does not regress from that perspective.