design/unnumbered-bgp.md
BGP unnumbered allows routers to peer with each other when directly connected, without the need for IP addresses (the BGP configuration uses interface names). Once the peering is established, IPv4 or IPv6 prefixes can be exchanged. MetalLB can use this feature to simplify configuration both on the network fabric and on the cluster side.
BGP unnumbered peering dynamically discovers IPv6 neighbors, so the network administrator no longer has to configure IPv4 addresses on every interface of the network fabric or of the cluster nodes just for the BGP peering. With BGP unnumbered, the cluster administrator does not have to specify the address of each neighbor. Unnumbered BGP uses the IPv6 link-local address to automatically decide which peer to connect to.
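To make the link-local address concrete, the following is a minimal sketch of the classic EUI-64 derivation of an fe80::/64 address from a MAC address. It assumes the interface uses EUI-64 addressing; kernels configured for stable-privacy addresses generate different values. The helper name linkLocalFromMAC is hypothetical and not part of MetalLB or FRR. The sample MAC is the one shown for eth0 in the ip output later in this document.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// linkLocalFromMAC is a hypothetical helper: it derives the EUI-64 based
// IPv6 link-local address (fe80::/64) from a 48-bit MAC address.
func linkLocalFromMAC(mac string) (netip.Addr, error) {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return netip.Addr{}, err
	}
	if len(hw) != 6 {
		return netip.Addr{}, fmt.Errorf("expected a 48-bit MAC, got %d bytes", len(hw))
	}
	var b [16]byte
	b[0], b[1] = 0xfe, 0x80 // fe80::/64 prefix
	b[8] = hw[0] ^ 0x02     // flip the universal/local bit
	b[9], b[10] = hw[1], hw[2]
	b[11], b[12] = 0xff, 0xfe // EUI-64 filler bytes
	b[13], b[14], b[15] = hw[3], hw[4], hw[5]
	return netip.AddrFrom16(b), nil
}

func main() {
	addr, err := linkLocalFromMAC("02:42:ac:14:14:04")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr)
}
```

For the eth0 MAC 02:42:ac:14:14:04 this yields fe80::42:acff:fe14:1404, which matches the link-local address listed for eth0 in the node output.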
As a cluster administrator, I want to configure MetalLB without configuring IP addresses.
As a network administrator, I want to avoid configuring IP addresses on point-to-point connections just for the BGP peering.
As a network administrator, I want to reduce the size of the configuration in the ToR router.
A new field, Interface string, will be introduced. If it is defined, unnumbered
BGP peering takes place. In that case, Address must NOT have a value. The API
doc will look like:
Interface, if defined, instructs MetalLB to set up unnumbered BGP peering on the
interface, which means that the peering configuration only includes the
interface and does not include an IP address. Address and Interface are mutually
exclusive and one of them must be specified.
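type Neighbor struct {
	// Address is the peer IP; it must be empty when Interface is set.
	Address string `json:"address"`
	// Interface is the local interface used for unnumbered peering.
	Interface string `json:"interface"`
}

type BGPPeerSpec struct {
	// Address is the peer IP; it must be empty when Interface is set.
	Address string `json:"peerAddress"`
	// Interface is the local interface used for unnumbered peering.
	Interface string `json:"interface"`
}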
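The mutual-exclusion rule stated in the API doc could be enforced with a check along these lines. This is a sketch only: validateNeighbor is a hypothetical name, the struct is reduced to the two fields discussed here, and the real validation in MetalLB may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// Neighbor is reduced to the two fields relevant to this proposal.
type Neighbor struct {
	Address   string
	Interface string
}

// validateNeighbor is a hypothetical helper that enforces
// "Address and Interface are mutually exclusive and one must be set".
func validateNeighbor(n Neighbor) error {
	switch {
	case n.Address != "" && n.Interface != "":
		return errors.New("address and interface are mutually exclusive")
	case n.Address == "" && n.Interface == "":
		return errors.New("one of address or interface must be specified")
	case n.Address != "":
		// Numbered peering: the address must be a valid IP.
		if _, err := netip.ParseAddr(n.Address); err != nil {
			return fmt.Errorf("invalid address %q: %w", n.Address, err)
		}
	}
	return nil
}

func main() {
	for _, n := range []Neighbor{
		{Interface: "net0"},
		{Address: "192.168.10.10"},
		{Address: "192.168.10.10", Interface: "net0"},
		{},
	} {
		fmt.Printf("%+v -> %v\n", n, validateNeighbor(n))
	}
}
```

Keeping the two fields separate lets an invalid IP be rejected at validation time instead of being mistaken for an interface name.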
Example CR
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: peer-unnumber
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 645000
  interface: net0
We can re-use Address string to hold either an IP or an interface name; if the
value is not a valid IP, it indicates unnumbered BGP peering. This looks like
the minimal API, but it does not have a clean error path. E.g. if the user
input is a mistyped IP, the implementation treats it as an interface name,
shadows all the existing validation errors, and fails only at runtime.
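The error-path problem of this alternative can be illustrated with a short sketch. The function peerKind is hypothetical and only mimics the "not a valid IP means interface" rule described above; it is not MetalLB code.

```go
package main

import (
	"fmt"
	"net/netip"
)

// peerKind is a hypothetical classifier for the overloaded Address field:
// anything that does not parse as an IP is assumed to be an interface name.
func peerKind(address string) string {
	if _, err := netip.ParseAddr(address); err == nil {
		return "numbered"
	}
	return "unnumbered"
}

func main() {
	// The last value is a typo (letter O instead of digit 0).
	for _, a := range []string{"192.168.10.10", "net0", "192.168.10.1O"} {
		fmt.Printf("%-15s -> %s\n", a, peerKind(a))
	}
}
```

The mistyped IP is silently classified as an interface name, so the mistake surfaces only when FRR fails to find an interface with that name.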
We can introduce a new field, Unnumber bool, which if true indicates
unnumbered BGP peering. If true, the Address field holds the interface name.
The bool option makes the configuration unambiguous, because the user declares
explicitly that they want unnumbered peering and there is no "guessing" about
what MetalLB does. The unnumbered case is disconnected from the LLA support case.
To make unnumbered BGP work, we need to modify the neighbor session defined in the neighborsession.tmpl template and make the following changes:
before: neighbor 192.168.10.10 timers 180 540
after: neighbor net0 timers 180 540
The remote-as directive must be replaced:
before: neighbor 192.168.10.10 remote-as 65004
after: neighbor net0 interface remote-as 65004
neighbor {{.neighbor.Addr}} disable-connected-check
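The before/after lines above could be produced from a single template, as in the following sketch. The template text, the neighbor type, and the ID method are assumptions made for illustration; they do not reproduce the actual contents of neighborsession.tmpl.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// neighbor is a hypothetical, reduced view of a peer for this sketch.
type neighbor struct {
	Addr  string // set for numbered peering
	Iface string // set for unnumbered peering
	ASN   uint32
}

// ID returns the identifier used in the FRR neighbor directives:
// the interface name for unnumbered peers, the IP otherwise.
func (n neighbor) ID() string {
	if n.Iface != "" {
		return n.Iface
	}
	return n.Addr
}

// Illustrative template, not the real neighborsession.tmpl.
const neighborTmpl = "neighbor {{.ID}}{{if .Iface}} interface{{end}} remote-as {{.ASN}}\n" +
	"neighbor {{.ID}} timers 180 540\n"

// render executes the template for one neighbor.
func render(n neighbor) (string, error) {
	t, err := template.New("neighbor").Parse(neighborTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, n); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	for _, n := range []neighbor{
		{Addr: "192.168.10.10", ASN: 65004},
		{Iface: "net0", ASN: 65004},
	} {
		out, err := render(n)
		if err != nil {
			panic(err)
		}
		fmt.Print(out)
	}
}
```

Using one identifier for every neighbor directive keeps the numbered and unnumbered cases in the same template, with the interface keyword being the only conditional part.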
According to the design of unnumbered BGP peering, router advertisements (RAs) must be enabled. By observation, the FRR zebra component seems to enable RAs automatically when the unnumbered BGP session is configured. This is the commit. Therefore the MetalLB implementation does not need to enable RAs explicitly in the FRR config unless we have other reasons (e.g. a different RA interval).
The interface does not need an IP for the BGP session to be established. This is the core of unnumbered BGP. Nevertheless, when doing BGP peering between a non-router component and a router, an IP might be needed for the dataplane traffic to work.
Note: the rest of the section is outside the scope of MetalLB and is related to the CNI that the cluster is using. We discuss it because it might be important.
When an external client (outside the peer router) talks to an LB address and there is BGP route learning (that is, the local k8s node adds routes received from the peer), there is no need for an IP:
# packet in/out of k8s node
net0 In IP 200.100.100.1.40524 > 5.5.5.5.80: Flags [S], seq 1203290349,
veth3e626b76 Out IP 200.100.100.1.40524 > 10.244.2.4.80: Flags [S], seq 1203290349,
veth3e626b76 In IP 10.244.2.4.80 > 200.100.100.1.40524: Flags [S.], seq 3181148864, ack 1203290350,
net0 Out IP 5.5.5.5.80 > 200.100.100.1.40524: Flags [S.], seq 3181148864, ack 1203290350,
# k8s node
kind-worker:/# ip --br add show
lo UNKNOWN 127.0.0.1/8 ::1/128
veth2c551257@if2 UP 10.244.2.1/32 fe80::1498:dbff:fe25:6e1b/64
eth0@if325 UP 172.20.20.4/24 2001:172:20:20::4/64 fe80::42:acff:fe14:1404/64
eth1@if330 UP 172.18.0.3/16 fc00:f853:ccd:e791::3/64 fe80::42:acff:fe12:3/64
net0@if3 UP fe80::dcad:beff:feff:1161/64
kind-worker:/# ip route
default via 172.20.20.1 dev eth0
10.244.0.0/24 via 172.20.20.3 dev eth0
10.244.1.0/24 via 172.20.20.2 dev eth0
10.244.2.3 dev veth2c551257 scope host
172.18.0.0/16 dev eth1 proto kernel scope link src 172.18.0.3
172.20.20.0/24 dev eth0 proto kernel scope link src 172.20.20.4
172.30.0.0/16 via 172.20.20.6 dev eth0
200.100.100.0/24 nhid 68 via inet6 fe80::2c5f:eff:fec4:cf7b dev net0 proto bgp metric 20
kind-worker:/# ip route get 200.100.100.1
200.100.100.1 via inet6 fe80::2c5f:eff:fec4:cf7b dev net0 src 172.20.20.4 uid 0
cache
That works as long as the NAT/masquerade rule properly sets the source IP back to the LB address, which ultimately depends on the CNI and on whether kube-proxy is used.
When there is no BGP route learning, this does NOT work without an IP on the
interface. Even if a route is configured through the interface
(ip route add 200.100.100.0/24 dev net0), the traffic breaks due to ARP.
net0 In IP 200.100.100.1.43198 > 5.5.5.5.80: Flags [S], seq 3169002768,
veth3e626b76 Out IP 200.100.100.1.43198 > 10.244.2.4.80: Flags [S], seq 3169002768,
veth3e626b76 In IP 10.244.2.4.80 > 200.100.100.1.43198: Flags [S.], seq 490740861, ack 3169002769,
net0 Out ARP, Request who-has 200.100.100.1 tell 172.20.20.4, length 28
net0 Out ARP, Request who-has 200.100.100.1 tell 172.20.20.4, length 28
RFC 8950 explains why ARP is not needed in the previous case.
When a pod on a node that has the routing entry learned through BGP peering
kind-worker:/# ip route
200.100.100.0/24 nhid 68 via inet6 fe80::2c5f:eff:fec4:cf7b dev net0 proto bgp metric 20
kind-worker:/# ip route get 200.100.100.1
200.100.100.1 via inet6 fe80::2c5f:eff:fec4:cf7b dev net0 src 172.20.20.4 uid 0
kind-worker:/# ip a s eth0
324: eth0@if325: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:14:14:04 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.20.20.4/24 brd 172.20.20.255 scope global eth0
valid_lft forever preferred_lft forever
initiates a connection to the outside, the source IP of the packet is taken from one of the other interfaces. The selection algorithm is unclear; the kernel documentation could answer that. The important point is that if an IP address is added to the loopback interface, then that IP address is used. There is no need for an IP on the peer interface.
kind-worker:/# ip a s lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet 12.12.12.12/32 scope global lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
kind-worker:/# ip route get 200.100.100.1
200.100.100.1 via inet6 fe80::cc61:c8ff:fefc:2dce dev net0 src 12.12.12.12 uid 0
cache
So the answer is that an IP address is needed, but configuring an IP address on the loopback is enough to control it.
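The source address that the kernel picks for a destination can also be checked programmatically, in the same spirit as ip route get. The sketch below relies on the fact that connecting a UDP socket sends no packet but makes the kernel perform the route lookup and source selection. The function name sourceFor and the port number are arbitrary choices for this example.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// sourceFor returns the local IP the kernel selects to reach dst.
// Connecting a UDP socket does not send traffic; it only triggers
// the route lookup and source address selection.
func sourceFor(dst string) (net.IP, error) {
	conn, err := net.Dial("udp", net.JoinHostPort(dst, "80"))
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	return conn.LocalAddr().(*net.UDPAddr).IP, nil
}

func main() {
	dst := "200.100.100.1" // destination used in the examples above
	if len(os.Args) > 1 {
		dst = os.Args[1]
	}
	src, err := sourceFor(dst)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s src %s\n", dst, src)
}
```

Run on a node, the printed source should correspond to the src value reported by ip route get for the same destination.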
The interface needs an IP address to set up the source IP address and routing.
graph TD
subgraph Rack0
direction LR
subgraph ToR0
direction TB
A1[eth0] -.- A{Routing}
A -.- A3[eth00]
A -.- A4[eth01]
BGP(BGP)
end
%% Top Right Subgraph
subgraph Node1
direction TB
B1 --- B2(cni)
B2 --- B3
subgraph Pod1[pod]
B3[eth0]
end
BGP-N1(BGP process)
end
%% Bottom Right Subgraph
subgraph Node0
direction TB
C1 --- C2(cni)
C2 --- C3
subgraph Pod2[pod]
C3[eth0]
end
BGP-N0(BGP process)
end
%% Connections between subgraphs
A3 -- point-to-point ------ B1[eth0]
A4 -- point-to-point ------ C1[eth0]
end
%% Define styles for subgraphs and nodes
classDef node fill:#f0f0f0,stroke:#000000,stroke-width:1px;
class Node0 node;
class Node1 node;
There is no straightforward way to test unnumbered peering on the primary node interface. The testing setup (kind creation) creates a bridge and connects the containers to it. Ideally we would set up a kind cluster and replace the bridge with a container that runs FRR on the other side of each veth and at the same time provides the switch/upstream. Until we find a way to automate that infra setup, we will test that scenario manually.
graph TD
subgraph Rack0
direction LR
style Rack0 fill:#ffffff,stroke:#000000,stroke-width:2px
subgraph ToR0
style ToR0 fill:#ffffff,stroke:#000000,stroke-width:2px
direction TB
subgraph Default
direction LR
A1[net0] -.- A{Routing}
A -.- A3[eth00]
A -.- A4[eth01]
end
subgraph RED VRF*
direction LR
AA1[net00] -.- AA{Routing}
AA -.- AA3[eth00.vlanRED]
AA -.- AA4[eth01.vlanRED]
BGP-T0(BGP configured on RED)
end
end
subgraph Node0
direction TB
B1 --- B2(cni)
B2 --- B3
BB1
subgraph Pod1[pod]
B3[eth0]
end
BGP-N0(BGP process)
end
subgraph Node1
direction TB
C1 --- C2(cni)
C2 --- C3
CC1
subgraph Pod2[pod]
C3[eth0]
end
BGP-N1(BGP process)
end
%% Connections between subgraphs
A3 ------- B1[eth0]
AA3 -- BGP Peering ----- BB1[net0]
A4 ------- C1[eth0]
AA4 -- BGP Peering ----- CC1[net0]
end
%% Define styles for subgraphs and nodes
classDef node fill:#f0f0f0,stroke:#000000,stroke-width:1px;
class Node0 node;
class Node1 node;
We should modify the testing infra to include point-to-point links, which means connecting containers using veth pairs and not through a bridge. We need the NET_ADMIN capability (or sudo) when creating the veth pairs.
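The point-to-point wiring could be scripted roughly as follows. This sketch only builds the ip(8) command lines for one veth pair and prints them (a dry run); the function name vethCommands, the interface names, and the container PIDs are placeholders, and bringing the links up inside each namespace is left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vethCommands is a hypothetical helper returning the ip(8) invocations that
// create a veth pair and move each end into a container network namespace,
// identified here by the PID of the container's main process.
func vethCommands(nodeIf, peerIf string, nodePID, peerPID int) [][]string {
	return [][]string{
		{"ip", "link", "add", nodeIf, "type", "veth", "peer", "name", peerIf},
		{"ip", "link", "set", nodeIf, "netns", strconv.Itoa(nodePID)},
		{"ip", "link", "set", peerIf, "netns", strconv.Itoa(peerPID)},
	}
}

func main() {
	// PIDs are placeholders; in practice they would come from the container runtime.
	for _, cmd := range vethCommands("net0", "eth00", 1234, 5678) {
		fmt.Println(strings.Join(cmd, " "))
	}
}
```

Executing these commands requires NET_ADMIN (or sudo), which is why the sketch stops at printing them.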
We could have an isolated test file that creates and deletes the peer for those specific tests. With that option we do not take resources from the runners outside the runtime of the test. This new peering setup needs a different configuration (interface names, not IPs), so an isolated file avoids extensive(?) refactoring of the helper functions.
The benefit of that would be that we cover all the features/scenarios (e.g. uncordon node). Reasons not to do it are the extra time, the many skip rules needed (e.g. if Multihop, skip that peer), and the extensive(?) refactoring of the test functions.
This is how the testing is set up, and how we emulate a ToR switch/router:
frr defaults datacenter
hostname tor
no ipv6 forwarding
!
interface eth00
ipv6 nd ra-interval 10
no ipv6 nd suppress-ra
exit
!
interface eth01
ipv6 nd ra-interval 10
no ipv6 nd suppress-ra
exit
!
! loopback address used for testing
interface lo
ip address 200.100.100.1/24
exit
!
router bgp 65001
bgp router-id 11.11.11.254
neighbor MTLB peer-group
neighbor MTLB passive
neighbor MTLB remote-as external
neighbor MTLB description LEAF-MTLB
neighbor eth00 interface peer-group MTLB
neighbor eth01 interface peer-group MTLB
!
address-family ipv4 unicast
redistribute connected
neighbor MTLB activate
exit-address-family
!
address-family ipv6 unicast
redistribute connected
neighbor MTLB activate
exit-address-family
exit
> docker exec -it unnumbered-p2p-peer vtysh -c 'show bgp summary'
IPv4 Unicast Summary (VRF default):
BGP router identifier 11.11.11.254, local AS number 65004 vrf-id 0
BGP table version 4
RIB entries 3, using 288 bytes of memory
Peers 3, using 39 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
kind-control-plane(eth00) 4 65000 510 510 4 0 0 00:24:43 1 2 k8s-node
kind-worker(eth01) 4 65000 510 510 4 0 0 00:24:43 1 2 k8s-node
kind-worker2(eth02) 4 65000 510 510 4 0 0 00:24:43 1 2 k8s-node
Total number of neighbors 3
IPv6 Unicast Summary (VRF default):
BGP router identifier 11.11.11.254, local AS number 65004 vrf-id 0
BGP table version 3
RIB entries 1, using 96 bytes of memory
Peers 3, using 39 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
kind-control-plane(eth00) 4 65000 510 510 3 0 0 00:24:43 1 1 k8s-node
kind-worker(eth01) 4 65000 510 510 3 0 0 00:24:43 1 1 k8s-node
kind-worker2(eth02) 4 65000 510 510 3 0 0 00:24:43 1 1 k8s-node
> k -n metallb-system exec -it -c frr frr-k8s-daemon-kzpz2 -- vtysh -c 'show bgp summary'
IPv4 Unicast Summary (VRF default):
BGP router identifier 172.19.0.3, local AS number 65000 vrf-id 0
BGP table version 2
RIB entries 3, using 288 bytes of memory
Peers 1, using 13 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
unnumbered-p2p-peer(net0) 4 65004 512 512 2 0 0 00:25:15 1 2 TOR
Total number of neighbors 1
IPv6 Unicast Summary (VRF default):
BGP router identifier 172.19.0.3, local AS number 65000 vrf-id 0
BGP table version 1
RIB entries 1, using 96 bytes of memory
Peers 1, using 13 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
unnumbered-p2p-peer(net0) 4 65004 512 512 1 0 0 00:25:15 0 1 TOR
Total number of neighbors 1
> docker exec -it unnumbered-p2p-peer vtysh -c 'show bgp ipv4'
BGP table version is 3, local router ID is 11.11.11.254, vrf id 0
Default local pref 100, local AS 65004
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*= 5.5.5.5/32 eth02 0 0 65000 i
*= eth01 0 0 65000 i
*> eth00 0 0 65000 i
*> 200.100.100.0/24 0.0.0.0(unnumbered-p2p-peer)
0 32768 ?
Displayed 2 routes and 4 total paths
> docker exec -it unnumbered-p2p-peer ip route
5.5.5.5 nhid 21 proto bgp metric 20
nexthop via inet6 fe80::dcad:beff:feff:1160 dev eth00 weight 1
nexthop via inet6 fe80::dcad:beff:feff:1161 dev eth01 weight 1
nexthop via inet6 fe80::dcad:beff:feff:1162 dev eth02 weight 1
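An automated test could assert on the show bgp summary output shown above instead of reading it by eye. The sketch below parses the plain-text table and assumes, based on the sample output, that the tenth column (State/PfxRcd) is a number for an established session and a state name otherwise. The function name peerStates is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// peerStates is a hypothetical parser for the text output of
// "show bgp summary". It maps each neighbor to true when its
// State/PfxRcd column is numeric in every address family listed.
func peerStates(summary string) map[string]bool {
	states := make(map[string]bool)
	for _, line := range strings.Split(summary, "\n") {
		f := strings.Fields(line)
		// Neighbor rows have at least 11 columns and BGP version 4 in column 2.
		if len(f) < 11 || f[1] != "4" {
			continue
		}
		_, err := strconv.Atoi(f[9])
		established := err == nil
		if prev, seen := states[f[0]]; seen {
			established = established && prev
		}
		states[f[0]] = established
	}
	return states
}

func main() {
	sample := `Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
kind-control-plane(eth00) 4 65000 510 510 4 0 0 00:24:43 1 2 k8s-node
kind-worker(eth01) 4 65000 510 510 4 0 0 00:24:43 1 2 k8s-node`
	for name, up := range peerStates(sample) {
		fmt.Printf("%s established=%v\n", name, up)
	}
}
```

A more robust check would likely use a structured output format if one is available; the text parsing here is only meant to mirror the output captured in this document.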