design/pool-configuration.md
Following @johananl's initiative in this issue, this enhancement tries to collect and classify the various configuration enhancement requests raised over time.
The way advanced users consume MetalLB is evolving and has exposed the need for more complex configuration. For example, the need for isolation and multi-tenancy led to requests to advertise specific address pools only from a subset of the nodes, or to allocate them only to a specific namespace, to mention a few.
As part of the discussion over an initial version of this proposal we also decided to rethink the MetalLB data model, in order to separate the different concerns handled by MetalLB. This separation is intended to be implemented by dropping ConfigMap support and using Kubernetes CRDs.
Before addressing the new concerns, we will propose a new scheme for configuring MetalLB, which will be used as a starting point in the following parts of the proposal. For this reason, it makes sense to have it described in this proposal instead of a separate one.
The new data model will be expressed in terms of Kubernetes CRDs, deprecating the configuration via MetalLB's ConfigMap.
In the new data model, the following entities will be defined (more details in the implementation section):
- IPAddressPool, responsible for the IP allocation to services
- L2Advertisement, describing the intention to advertise an IP coming from an IPAddressPool via L2
- BGPAdvertisement, describing the intention to advertise a prefix coming from an IPAddressPool via BGP
- BGPPeer, representing an external BGP peer willing to interact with MetalLB

One of the goals is to maintain backward compatibility between the CRDs currently implemented in the operator and the new version.
With that in mind, the current v1beta1.AddressPool would be mapped to an IPAddressPool and an L2Advertisement or a BGPAdvertisement, depending
on the protocol. Because of this 1-to-many relationship, conversion webhooks can't be used, and we need to handle both the old and the new versions
in the controller. In order to achieve that, the two versions must have different names.
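As an illustration of this 1-to-many mapping, a legacy layer2 AddressPool would roughly correspond to an IPAddressPool plus an L2Advertisement. The conversion below is a sketch, not the final implementation; names and addresses are illustrative:

```yaml
# Legacy operator CRD (current API)
apiVersion: metallb.io/v1beta1
kind: AddressPool
metadata:
  name: pool1
  namespace: metallb-system
spec:
  protocol: layer2
  addresses:
  - 192.168.10.0/24
---
# Roughly equivalent resources in the new data model
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: pool1
  namespace: metallb-system
spec:
  addresses:
  - 192.168.10.0/24
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: pool1
  namespace: metallb-system
spec:
  ipaddresspools:
  - pool1
```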
For all the IPs belonging to an address pool, we need to address three main configuration degrees:
The most common request is being able to associate a given pool to a subset of nodes (which translates to selecting a subset of speakers so that those speakers, and only those, announce the IP).
This covers use cases where different sets of nodes are exposed to different subnets, and the load balancer IP can be reached only via a subset of the nodes.
This was raised in the following issues / PRs:
With the exception of the BGP peer node selector, an IP assigned to a Service is currently advertised to all the configured BGP peers.
This is described in:
Given an IP, here we want to choose which BGP peers it must be announced to:
On top of that, different nodes must announce the IP to a different set of BGP peers:
This is a degenerate case, suggested by this discussion, where only the IP allocation side of MetalLB is used and the routes to those IPs are set statically.
On top of that, Calico's official documentation suggests deploying MetalLB without the speakers, to leverage only the allocation feature exposed by the controller.
Currently, the options to make sure a given service is using a specific IP from a specific pool are:
- the metallb.universe.tf/address-pool annotation on the service
- the auto-assign: false parameter on the pool, combined with statically assigning the IP to a given service

From this point of view, we can enhance the allocation logic by allowing users to reserve a given pool for a set of namespaces or for a set of services. This is more flexible and moves the focus to the cluster administrator, who sets the MetalLB configuration in a multi-tenant scenario.
This, in combination with the other configuration degrees above, will provide a way to have per-service (or per-namespace) behavior (e.g. specifying the communities for a given service).
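For reference, the existing annotation-based option mentioned above looks like this on the Service side (the service name, pool name and labels are illustrative; the annotation is the one MetalLB already supports):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    metallb.universe.tf/address-pool: pool1
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
```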
The issues related to this topic:
The e2e test suite is wide enough to ensure we have good coverage of MetalLB's main features. Given that the new data model will be functionally equivalent to the old one, converting the test suite should give us enough confidence in that regard.
Deprecating the ConfigMap will impact the users' current configurations. To mitigate this, we may consider having a controller that converts the ConfigMap into the CRDs, or providing an offline tool in charge of that (the latter option is more desirable, as it removes the need to reconcile changes made to the ConfigMap).
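As a sketch of what such a tool would consume, a minimal layer2 ConfigMap entry like the following would be translated into an IPAddressPool plus an L2Advertisement, similar to the mapping example shown earlier (names and addresses are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config
  namespace: metallb-system
data:
  config: |
    address-pools:
    - name: pool1
      protocol: layer2
      addresses:
      - 192.168.10.0/24
```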
The operator already exposes CRDs, mapped directly to the current ConfigMap entities. Changing the CRD layout will impact them; to mitigate this, conversion webhooks will be implemented to support both the old and the new versions of the API, until the old version is deprecated.
When a user specifies both a pool to be allocated to a specific service and a pool with no selector, the service may unintentionally receive an IP from the pool that has no selector. We need to implement an easy-to-understand and deterministic behaviour (see design details below).
Having an IP with no nodes advertising it is a risk. Some scenarios are:
It should be possible to detect non-overlapping configurations and raise an event when they happen. The MetalLB operator introduces a validation webhook that can be used to prevent those configurations from being created in the first place.
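As a sketch of the kind of check such a webhook (or the event-raising logic) could perform, the following hypothetical helper verifies that an advertisement's node selector matches at least one node; the function name and surrounding wiring are illustrative, not part of the actual operator:

```go
package validation

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// selectorMatchesAnyNode returns true if the given node selector matches at
// least one of the provided nodes. A nil selector matches all nodes.
// An advertisement whose selector matches no node would leave the related
// pools allocated but never announced, which is the risk described above.
func selectorMatchesAnyNode(sel *metav1.LabelSelector, nodes []corev1.Node) (bool, error) {
	if sel == nil {
		return true, nil
	}
	s, err := metav1.LabelSelectorAsSelector(sel)
	if err != nil {
		return false, err
	}
	for _, n := range nodes {
		if s.Matches(labels.Set(n.Labels)) {
			return true, nil
		}
	}
	return false, nil
}
```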
The peer discovery feature currently being implemented in this PR will introduce a way to fetch peers from the node's annotations.
With this new structure, the PR will need to be adjusted to extend the BGPPeer entity in order to consume the node's annotations instead of hardcoded values.
With the introduction of CRDs, external entities (such as a Cloud Controller Manager) would be able to create MetalLB CRs instead of enriching Node objects. In that scenario, the autodiscovery effort would not be needed.
All the BGP changes will happen from a "behavioural" point of view. Where needed, the changes will be reflected in both the native and the new implementation.
The risk of not covering all possible cases / permutations is still there.
That aside, this proposal covers the majority of the issues that were raised by the users (multiple times), and it won't prevent adding a finer level of configuration if the need ever arises. Moreover, providing a straightforward way to separate concerns will make it easier to accommodate new changes, and it will be clearer where those new changes must be applied.
Here we describe the new proposed data model in the form of Go structures and non-comprehensive sample CRDs.
The IPAddressPool is the entity responsible for the IP allocation part, regardless of how (and if) the IP is going to be advertised.
Hence, we keep the same structure but remove the bgpAdvertisement section and the Protocol field.
```go
type IPAddressPool struct {
    Name          string
    Addresses     []string
    AvoidBuggyIPs bool
    AutoAssign    *bool
}
```
Example CR:
```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ipaddresspool-sample3
  namespace: metallb-system
spec:
  addresses:
  - 2002:2:2::1-2002:2:2::100
  - 172.20.0.100/24
  autoAssign: false
  avoidBuggyIPs: true
```
The BGPAdvertisement entity represents the intention to advertise a given IP via the BGP protocol. It contains properties related to the type of advertisement and a reference to the address pools we want to advertise. This initial version matches the current behaviour, where an IP is advertised to all the BGPPeers configured in MetalLB.
```go
type bgpAdvertisement struct {
    Name                  string
    IPAddressPools        []string
    IPAddressPoolSelector *metav1.LabelSelector
    AggregationLength     *int
    LocalPref             *uint32
    Communities           []string
}
```
Example CR:
```yaml
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgpadvertisement
  namespace: metallb-system
spec:
  ipaddresspools:
  - pool1
  - pool2
  communities:
  - 65535:65282
  aggregationLength: 32
  localPref: 100
```
A BGPAdvertisement with no pool selector is intended to select all the available IPAddressPools.
The L2Advertisement entity represents the intention to advertise a given IP via L2. Currently, there are no knobs related to L2 advertisement.
```go
type l2Advertisement struct {
    Name                  string
    IPAddressPools        []string
    IPAddressPoolSelector *metav1.LabelSelector
}
```
Example CR:
```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2advertisement
  namespace: metallb-system
spec:
  ipaddresspools:
  - pool1
  - pool2
```
An L2Advertisement with no pool selector is intended to select all the available IPAddressPools.
The BgpPeer entity represents an external BGP peer we want to establish a session with. It contains the properties of the session, and it's a 1:1 mapping of the current BGP peer structure without the node selector. We also add the bfdProfile structure here, as it is part of the session definition.
Note that the node selector field represents the intention to connect to a given peer only from a subset of the nodes.
```go
type bgpPeer struct {
    Name           string
    MyASN          uint32
    ASN            uint32
    Addr           string
    SrcAddr        string
    Port           uint16
    HoldTime       string
    KeepaliveTime  string
    RouterID       string
    Password       string
    BFDProfile     string
    EBGPMultiHop   bool
    NodeSelector   *metav1.LabelSelector
    SecretPassword corev1.SecretReference
}
```
```go
type bfdProfile struct {
    Name             string
    ReceiveInterval  *uint32
    TransmitInterval *uint32
    DetectMultiplier *uint32
    EchoInterval     *uint32
    EchoMode         bool
    PassiveMode      bool
    MinimumTTL       *uint32
}
```
Example CRs:
```yaml
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: peer-bfd
  namespace: metallb-system
spec:
  peerAddress: 10.0.0.2
  peerASN: 64501
  myASN: 64500
  routerID: 10.10.10.10
  peerPort: 1
  holdTime: "10s"
  sourceAddress: 1.1.1.1
  bfdProfile: bfdprofilefull
```

```yaml
apiVersion: metallb.io/v1beta1
kind: BFDProfile
metadata:
  name: bfdprofilefull
  namespace: metallb-system
spec:
  receiveInterval: 35
  transmitInterval: 35
  detectMultiplier: 37
  echoReceiveInterval: 38
  echoTransmitInterval: 39
  echoMode: true
  passiveMode: true
  minimumTtl: 10
```
The SecretPassword field is mutually exclusive with the Password field, and contains a reference to a secret containing the
password to use. Setting both fields will cause the configuration to fail parsing.
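A possible shape for this, assuming the SecretPassword field is serialized as secretPassword and references a Secret by name and namespace (both the field name and the Secret layout are illustrative, not the final API):

```yaml
apiVersion: metallb.io/v1beta1
kind: BGPPeer
metadata:
  name: peer-with-secret
  namespace: metallb-system
spec:
  peerAddress: 10.0.0.2
  peerASN: 64501
  myASN: 64500
  secretPassword:            # illustrative serialization of the SecretPassword field
    name: peer-bgp-password
    namespace: metallb-system
---
apiVersion: v1
kind: Secret
metadata:
  name: peer-bgp-password
  namespace: metallb-system
type: Opaque
stringData:
  password: changeme
```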
The two Advertisement entities are extended with a selector of the nodes to announce from:
A configuration would look like:
```go
type bgpAdvertisement struct {
    Name                  string
    IPAddressPools        []string
    IPAddressPoolSelector *metav1.LabelSelector
    AggregationLength     *int
    LocalPref             *uint32
    Communities           []string
    NodeSelector          *metav1.LabelSelector
}

type l2Advertisement struct {
    Name         string
    NodeSelector *metav1.LabelSelector
}
```
```yaml
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgpadvertisement
  namespace: metallb-system
spec:
  ipaddresspools:
  - pool1
  - pool2
  nodeSelector:
  - matchExpressions:
    - key: kubernetes.io/hostname
      operator: In
      values: [hostA, hostB]
```
```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2advertisement
  namespace: metallb-system
spec:
  ipaddresspools:
  - pool1
  - pool2
  nodeSelectors:
  - matchExpressions:
    - key: kubernetes.io/hostname
      operator: In
      values: [hostA, hostB]
```
This concern is now addressed by adding a peers selector to the BGPAdvertisement entity.
By using both the pools and the peers selector, it will be possible to declare the intent of announcing the IPs of a given IPAddressPool only to a subset of BGP peers,
and only from a subset of nodes.
```go
type bgpAdvertisement struct {
    Name                  string
    IPAddressPools        []string
    Peers                 []string
    IPAddressPoolSelector *metav1.LabelSelector
    PeerSelector          *metav1.LabelSelector
    AggregationLength     *int
    LocalPref             *uint32
    Communities           []string
}
```
Example CR:
```yaml
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgpadvertisement
  namespace: metallb-system
spec:
  ipaddresspools:
  - pool1
  - pool2
  peers:
  - peer1
  - peer2
  communities:
  - 65535:65282
  aggregationLength: 32
  localPref: 100
```
This concern will be addressed by defining an IPAddressPool not referenced by any Advertisement instance.
The idea is to be able to reduce the scope of a particular IPAddressPool to a set of namespaces and services, by adding an optional namespace / service selector.
```go
type IPAddressPool struct {
    Name          string
    Addresses     []string
    AvoidBuggyIPs bool
    AutoAssign    *bool
    AllocateTo    serviceAllocation
}

type serviceAllocation struct {
    Priority          int
    Namespaces        []string
    NamespaceSelector *metav1.LabelSelector
    ServiceSelector   *metav1.LabelSelector
}
```
Example CR:
```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ipaddresspool-sample3
  namespace: metallb-system
spec:
  addresses:
  - 2002:2:2::1-2002:2:2::100
  - 172.20.0.100/24
  autoAssign: false
  avoidBuggyIPs: true
  allocateTo:
    priority: 50
    namespaces:
    - namespaceA
    - namespaceB
    serviceSelector:
      matchLabels:
        tenant: foo
```
In case of multiple matches, the controller will use the pool with the lowest priority value. This is deterministic and makes the behaviour clearer to the users.
A pool with no priority set will be used only if the pools with a priority can't be used.
Given a Service, if multiple matching IPAddressPools are available, MetalLB will check for the availability of IPs by iterating over the matching IPAddressPools sorted by priority, from the highest priority (lowest value) to the lowest. If multiple IPAddressPools have the same priority, the choice among them will be random.
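A minimal sketch of this ordering (not the actual allocator code; the pool type and the priority encoding are illustrative, with 0 standing for "no priority set"):

```go
package allocation

import (
	"math/rand"
	"sort"
)

// pool is a simplified stand-in for a matching IPAddressPool.
type pool struct {
	Name     string
	Priority int // 0 means no priority set
}

// sortByPriority orders pools from highest priority (lowest value) to lowest,
// keeping pools without a priority at the end. Shuffling first means pools
// with the same priority end up in random relative order, matching the
// behaviour described above.
func sortByPriority(pools []pool) {
	rand.Shuffle(len(pools), func(i, j int) { pools[i], pools[j] = pools[j], pools[i] })
	sort.SliceStable(pools, func(i, j int) bool {
		pi, pj := pools[i].Priority, pools[j].Priority
		switch {
		case pi == 0:
			return false // unset priority sorts last
		case pj == 0:
			return true
		default:
			return pi < pj // lower value = higher priority
		}
	})
}
```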
In order to ensure no regressions in terms of functionality, all the current E2E tests must be converted and adapted to the new data model.
Every new configuration must be reflected in additional e2e tests that ensure MetalLB is handling the new configuration properly.
The code must be covered by unit tests, especially the selection logic, in order to have broader coverage and to verify that the logic works as intended.
This new logic may add complexity to MetalLB, which is well known for working out of the box.
That said, all the additional selectors are optional, and the bare minimum configuration will continue to work. Power users will be able to leverage this extra configurability for their more complex scenarios.
We expect two phases: first moving the CRDs as they are from the MetalLB operator to MetalLB, then changing the data model.
The CRDs will be moved from the MetalLB operator to MetalLB as they are. The tests already have a "CRD" mode that will make sure there are no (or at least limited) regressions.
The configmap will be deprecated.
The CRDs are removed from the operator, which is in charge of deploying MetalLB only.
During the new data model implementation (described in the "New Data Model" section), the e2e tests will be adjusted to use the new data model, and eventually a converter (like the one for ConfigMap mode / CRD mode) will be implemented to guarantee that both the old and the new versions keep working.
Conversion webhooks will be put in place to guarantee backward compatibility for the current users of the MetalLB Operator.
The remaining configuration degrees now seem to be independent:
For this reason, from a development point of view they can be implemented in parallel or one at a time.
In order to declare a given configuration option fully supported, the following items must be covered: