design/splitfrr-proposal.md
MetalLB relies on FRR under the hood, which offers far more than "just" announcing prefixes the way MetalLB does. Here we propose to split MetalLB and create a new, possibly standalone, FRR daemonset with its own API, which can be fed both by MetalLB and by other actors.
There are users who need to run FRR (or an alternative implementation) on the nodes for other purposes, receiving routes from their routers being the most popular one.
They require receiving routes via BGP for multiple reasons:
MetalLB is meant only to announce routes (in particular, to reach Services of type LoadBalancer), so it is definitely not the right place to implement a broader FRR configuration.
At the same time, the approach of having a single FRR instance is optimal both for performance and for limiting the number of open sessions (see the Alternatives section).
As a cluster administrator, I want to continue using MetalLB with the current allowed API.
As a cluster administrator, I want to allow FRR to receive routes, but only for a specific prefix.
As a cluster administrator, I want to deploy only FRR to receive routes, connecting to different peers depending on the nodes.
The idea is to have a daemonset running a pod on each node, where each pod has the same structure the speaker pod has today (frr container, reloader, metrics, etc.).
The speaker will be provided with a new frrk8s BGP mode, which will translate the MetalLB API into the new controller's API.
Given that the new API allows setting the configuration on a per-node basis, each speaker will take care of configuring its own node.
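As an illustrative sketch, the per-node selection boils down to a label-subset match. The helper below is hypothetical and not part of the proposed API; a real implementation would more likely rely on the standard Kubernetes label-selector machinery:

```go
package main

import "fmt"

// matchesNode is a sketch: a configuration applies to a node when every
// key/value pair of its nodeSelector is present in the node's labels.
// An empty selector matches every node.
func matchesNode(nodeSelector, nodeLabels map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	labels := map[string]string{
		"kubernetes.io/hostname": "kind-worker",
		"kubernetes.io/os":       "linux",
	}
	selector := map[string]string{"kubernetes.io/hostname": "kind-worker"}
	fmt.Println(matchesNode(selector, labels)) // true
	fmt.Println(matchesNode(map[string]string{"zone": "a"}, labels)) // false
}
```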
Before describing the options, we are going to list the properties we want from the API exposed by the FRR Daemonset:
Additionally, we may want to enable the following scenarios:
In order to provide an abstraction of the FRR Configuration, we need to represent the following entities:
The router:
For each router, we must specify its neighbors, with all the details of a session:
For each neighbor, we must specify the route-map entries, and in particular:
If we split the configuration into multiple sub-entities (router, neighbor for a router, allowed IPs for a neighbor), the configuration of a single node becomes spread across multiple instances of multiple entities.
Because of this, a single CRD with substructures is going to provide better readability and be easier to reason about.
The spec side of a FRRConfiguration would look like (note that this might be subject to changes while proceeding with the implementation):
```go
type FRRConfigurationSpec struct {
	BGP          BGPConfig         `json:"bgp,omitempty"`
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	RawConfig    string            `json:"raw,omitempty"`
}

type BGPConfig struct {
	Routers     []Router     `json:"routers"`
	BFDProfiles []BFDProfile `json:"bfdProfiles,omitempty"`
}

// Router represents a BGP router instance to configure in FRR, together
// with the neighbors to connect to and the prefixes to advertise.
type Router struct {
	ASN          uint32     `json:"asn"`
	ID           string     `json:"id,omitempty"`
	VRF          string     `json:"vrf,omitempty"`
	Neighbors    []Neighbor `json:"neighbors,omitempty"`
	PrefixesIPV4 []string   `json:"prefixesIpV4,omitempty"`
	PrefixesIPV6 []string   `json:"prefixesIpV6,omitempty"`
}

type Neighbor struct {
	ASN            uint32             `json:"asn"`
	Address        string             `json:"address"`
	Port           uint16             `json:"port,omitempty"`
	PasswordSecret v1.SecretReference `json:"password,omitempty"`
	HoldTime       metav1.Duration    `json:"holdTime,omitempty"`
	KeepaliveTime  metav1.Duration    `json:"keepaliveTime,omitempty"`
	EBGPMultiHop   bool               `json:"ebgpMultiHop,omitempty"`
	BFDProfile     string             `json:"bfdProfile,omitempty"`
	ToAdvertise    Advertise          `json:"toAdvertise,omitempty"`
	ToReceive      Receive            `json:"toReceive,omitempty"`
}

type Advertise struct {
	AllowedPrefixes       `json:"allowed,omitempty"`
	PrefixesWithLocalPref []LocalPrefPrefixes `json:"withLocalPref,omitempty"`
	PrefixesWithCommunity []CommunityPrefixes `json:"withCommunity,omitempty"`
}

type Receive struct {
	AllowedPrefixes `json:"allowed,omitempty"`
}

type AllowMode string

const (
	AllowAll        AllowMode = "all"
	AllowRestricted AllowMode = "filtered"
)

type AllowedPrefixes struct {
	Prefixes []string `json:"prefixes,omitempty"`
	// Mode defaults to "filtered". When "all" is specified, all the
	// prefixes configured for the given router are advertised,
	// regardless of the content of Prefixes.
	Mode AllowMode `json:"mode,omitempty"`
}

type LocalPrefPrefixes struct {
	Prefixes  []string `json:"prefixes,omitempty"`
	LocalPref int      `json:"localPref,omitempty"`
}

type CommunityPrefixes struct {
	Prefixes  []string `json:"prefixes,omitempty"`
	Community string   `json:"community,omitempty"`
}

type BFDProfile struct {
	Name             string  `json:"name"`
	ReceiveInterval  *uint32 `json:"receiveInterval,omitempty"`
	TransmitInterval *uint32 `json:"transmitInterval,omitempty"`
	DetectMultiplier *uint32 `json:"detectMultiplier,omitempty"`
	EchoInterval     *uint32 `json:"echoInterval,omitempty"`
	EchoMode         *bool   `json:"echoMode,omitempty"`
	PassiveMode      *bool   `json:"passiveMode,omitempty"`
	MinimumTTL       *uint32 `json:"minimumTtl,omitempty"`
}
```
This doesn't declare which fields are optional and which are not, but it is a good approximation of the API.
Note: the "all" mode is broader and takes precedence over any explicit set of prefixes.
Multiple actors may add configurations selecting the same node. There are configurations that may conflict between themselves, leading to errors.
These include, for example:
When the daemon finds an invalid configuration state for a given node, it will report the configuration as invalid and wipe the FRR configuration.
Incompatibilities aside, merging is straightforward, because all the elements of the configuration are incremental.
The merging process is always the union of all the configurations.
For example:
In general, any broader configuration takes precedence over narrower ones.
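A minimal sketch of that union rule for the allowed-prefix part (simplified local types and a hypothetical helper, not the proposed implementation):

```go
package main

import "fmt"

// AllowedPrefixes is a simplified local copy of the API type:
// a mode ("filtered" or "all") plus an explicit prefix list.
type AllowedPrefixes struct {
	Mode     string
	Prefixes []string
}

// merge sketches the union semantics described above: the broader "all"
// mode wins over any explicit list; otherwise the result is the
// deduplicated union of the two prefix lists.
func merge(a, b AllowedPrefixes) AllowedPrefixes {
	if a.Mode == "all" || b.Mode == "all" {
		return AllowedPrefixes{Mode: "all"}
	}
	seen := map[string]bool{}
	res := AllowedPrefixes{Mode: "filtered"}
	for _, p := range append(a.Prefixes, b.Prefixes...) {
		if !seen[p] {
			seen[p] = true
			res.Prefixes = append(res.Prefixes, p)
		}
	}
	return res
}

func main() {
	x := AllowedPrefixes{Mode: "filtered", Prefixes: []string{"192.168.10.0/32"}}
	y := AllowedPrefixes{Mode: "filtered", Prefixes: []string{"192.168.11.0/32", "192.168.10.0/32"}}
	fmt.Println(merge(x, y).Prefixes)                               // union, deduplicated
	fmt.Println(merge(x, AllowedPrefixes{Mode: "all"}).Mode)        // "all": the broader mode wins
}
```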
Each MetalLB speaker will generate the configuration related to the node it's running on.
For example, starting from a MetalLB configuration that looks like:
```yaml
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: peer1
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64512
  peerAddress: 172.18.0.5
  peerPort: 179
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: peer2
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64512
  peerAddress: 172.18.0.6
  peerPort: 179
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: advertisement
  namespace: metallb-system
spec:
  peers:
  - peer1
```
and a service like:
```
NAME    TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
nginx   LoadBalancer   10.96.54.101   192.168.10.0   80:31906/TCP   19h
```
MetalLB will generate a configuration like:
```yaml
apiVersion: frrk8s.metallb.io/v1alpha1
kind: FRRConfiguration
metadata:
  name: metallb-kind-worker
  namespace: metallb-system
spec:
  nodeSelector:
    kubernetes.io/hostname: kind-worker
  bgp:
    routers:
    - asn: 64512
      prefixesIpV4:
      - 192.168.10.0/32
      neighbors:
      - address: 172.18.0.5
        asn: 64512
        port: 179
      - address: 172.18.0.6
        asn: 64512
        port: 179
        toAdvertise:
          allowed:
            prefixes:
            - 192.168.10.0/32
```
Given the constraint of having the same ASN for a given FRR instance (per VRF), every user must agree on the ASN assigned to the FRR instance. Because of this, conflicting configurations will be marked as invalid.
Examples of invalid configurations are:
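For instance, two configurations selecting the same node but declaring a different ASN for the same router/VRF cannot be merged (names and values below are hypothetical):

```yaml
# First actor configures the default-VRF router with ASN 64512 ...
apiVersion: frrk8s.metallb.io/v1alpha1
kind: FRRConfiguration
metadata:
  name: config-a
  namespace: metallb-system
spec:
  bgp:
    routers:
    - asn: 64512
---
# ... while a second actor asks for ASN 64513 on the same router:
# the two configurations conflict and will be marked as invalid.
apiVersion: frrk8s.metallb.io/v1alpha1
kind: FRRConfiguration
metadata:
  name: config-b
  namespace: metallb-system
spec:
  bgp:
    routers:
    - asn: 64513
```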
Under this premise, a configuration that allows incoming routes from a given peer would look like:
```yaml
apiVersion: frrk8s.metallb.io/v1alpha1
kind: FRRConfiguration
metadata:
  name: accept-all-router1
  namespace: metallb-system
spec:
  bgp:
    routers:
    - asn: 64512
      neighbors:
      - address: 172.18.0.6
        asn: 64512
        port: 179
        toReceive:
          allowed:
            mode: all
```
At the same time, another user / controller might decide to advertise another IP to the same neighbor:
```yaml
apiVersion: frrk8s.metallb.io/v1alpha1
kind: FRRConfiguration
metadata:
  name: advertise-extra-router1
  namespace: metallb-system
spec:
  bgp:
    routers:
    - asn: 64512
      prefixesIpV4:
      - 192.168.11.0/32
      neighbors:
      - address: 172.18.0.6
        asn: 64512
        port: 179
        toAdvertise:
          allowed:
            prefixes:
            - 192.168.11.0/32
```
The resulting configuration will be something like:
```yaml
bgp:
  routers:
  - asn: 64512
    prefixesIpV4:
    - 192.168.10.0/32
    - 192.168.11.0/32
    neighbors:
    - address: 172.18.0.5
      asn: 64512
      port: 179
    - address: 172.18.0.6
      asn: 64512
      port: 179
      toAdvertise:
        allowed:
          prefixes:
          - 192.168.10.0/32
          - 192.168.11.0/32
      toReceive:
        allowed:
          mode: all
```
The FRR configuration rendering mechanism is going to be the same as the one used today by MetalLB. The idea is to reuse the same packages with a layer that translates this API to the internal one.
Based on the API, the configuration will be rendered through a template, and the reload-frr.py script will be invoked via a signal (the very same mechanism we have in MetalLB).
The structure of the FRR configuration file is going to be something like:
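While the exact rendering is an implementation detail, a sketch of the rendered file for a single router with one neighbor might look like the following (route-map and prefix-list names are illustrative, not the actual output):

```
router bgp 64512
 neighbor 172.18.0.5 remote-as 64512
 neighbor 172.18.0.5 port 179
 address-family ipv4 unicast
  neighbor 172.18.0.5 activate
  neighbor 172.18.0.5 route-map 172.18.0.5-in in
  neighbor 172.18.0.5 route-map 172.18.0.5-out out
  network 192.168.10.0/32
 exit-address-family

ip prefix-list 172.18.0.5-out-pl permit 192.168.10.0/32
route-map 172.18.0.5-out permit 10
 match ip address prefix-list 172.18.0.5-out-pl
route-map 172.18.0.5-in deny 20
```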
We should be able to expose:
A prototype that covers the first two items might look like:
```go
type FRRNodeStatus struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status ConfigurationStatus `json:"status,omitempty"`
}

// ConfigurationStatus reports both the configuration we want to apply
// and the one currently running, plus the reconciliation progress.
type ConfigurationStatus struct {
	DesiredConfiguration string
	RunningConfiguration string
	Progress             Status
}

type Status struct {
	Phase   string
	Message string
}
```
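For illustration only, an instance of such a status object might render as follows (field names and values are hypothetical):

```yaml
apiVersion: frrk8s.metallb.io/v1alpha1
kind: FRRNodeStatus
metadata:
  name: kind-worker
  namespace: metallb-system
status:
  desiredConfiguration: |
    router bgp 64512
    ...
  runningConfiguration: |
    router bgp 64512
    ...
  progress:
    phase: Applied
    message: ""
```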
As for the session status, it is basically covered by the MetalLB status proposal.
Additionally, we might consider either allowing MetalLB to probe the FRR instance directly, or exposing the same set of metrics that MetalLB currently exposes.
Applying the wrong configuration / fetching the wrong routes might break the node. We should have a mechanism in place to avoid receiving routes that overlap the pod / service CIDRs, or any other route the cluster requires to work properly.
Moreover, we should explore the possibility of having sanity checks for the node, suspending the FRR configuration when they fail.
This new FRR daemon will live in a separate repo, and ideally it's going to be deployable on its own.
On the other hand, the MetalLB Operator will be changed so it can deploy this new component too. All-in-one manifests will be provided to deploy both MetalLB and the FRR daemon.
We must also provide a mechanism to ensure that the version of MetalLB and this new daemonset are compatible.
We will have unit tests for each module that requires it, plus
We need a comprehensive CI which leverages the same mechanisms as the MetalLB one, in order to ensure the new daemon behaves correctly.
Moreover, we might need a lane that verifies that MetalLB keeps working after changing the FRR daemon. A tradeoff must be made between coverage and the number of lanes we can run together.
Before coming to this version, a few alternatives were considered.
This version still leverages the integration between MetalLB and the FRR daemon via the API, but the API is a raw string where the full FRR configuration can be added.
```go
type FRRConfigurationSpec struct {
	NodeSelector map[string]string
	RawFRR       string
}
```
Merging is problematic and the only way to solve it is by concatenating all the configurations of a given node sequentially.
This has two consequences:
For example, in order to accept all the routes from a given neighbor, a user must be aware that MetalLB sets a deny rule with a given index and a given name:

```
route-map 10.2.2.254-in permit 10
```
Although at first sight this version seems to offer all the flexibility of FRR, the configuration is actually trickier for the reasons mentioned above.
This potentially allows users to do what they want without worrying about what MetalLB does.
However:
In this scenario, we would run two instances of FRR, one of them inside a container. The connection with the external routers would be handled by the instance running on the host, which would be under the direct control of the user.
This would give the user the freedom to apply whatever configuration they want to the FRR instance running under their control, while MetalLB would be configured to peer with the process running on the host.
The general issue is that MetalLB would see one single peer (the local one) to announce the routes to, whereas all the rest of the configuration required to relay those routes would have to be done manually.
This includes:
Additionally, the problem of how to configure and manage the "local" FRR in a scalable manner still remains unaddressed.
The ultimate goal is to have the MetalLB CI pass when running with the new "BGP mode".
The phases will look like: