design/preferred-node-l2.md
A new preferredNodeSelectors field on L2Advertisement lets administrators express soft node
preferences for L2 leader election. The speakers try preferred nodes first when electing the
announcer for a LoadBalancer IP, falling back to any eligible node when preferred nodes are
unavailable. The API mirrors the Kubernetes PreferredSchedulingTerm pattern and builds on the existing
hash-based election algorithm.
Related: #2797 (feature request for this design doc), PR #1804 (rejected predecessor, annotation-based, single node).
Depends on: PR #3014, which refactors L2
election so serviceSelectors drives the candidate node set. The scoring step added here
slots into the post-#3014 ShouldAnnounce flow.
MetalLB L2 mode elects the announcing node for each LoadBalancer IP using
sha256(nodeName + "#" + ipString). The hash gives deterministic, even distribution, but sometimes you need control over which node announces without losing failover.
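The hash ordering described above can be sketched in a few lines of Go. This is a minimal illustration, not the MetalLB source: the helper names `hashKey` and `hashOrder` and the node names are assumptions made for the example, and comparing the digests bytewise is assumed to be the ordering rule.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// hashKey returns sha256(nodeName + "#" + ipString), the per-node key
// described in the text. The function name is hypothetical.
func hashKey(nodeName, ipString string) [sha256.Size]byte {
	return sha256.Sum256([]byte(nodeName + "#" + ipString))
}

// hashOrder returns a copy of nodes sorted by ascending hash key for the
// given IP. The first element is the node that would announce that IP.
func hashOrder(nodes []string, ipString string) []string {
	ordered := append([]string(nil), nodes...)
	sort.Slice(ordered, func(i, j int) bool {
		hi := hashKey(ordered[i], ipString)
		hj := hashKey(ordered[j], ipString)
		return bytes.Compare(hi[:], hj[:]) < 0
	})
	return ordered
}

func main() {
	// Illustrative node names and IPs only.
	nodes := []string{"worker-1", "worker-2", "worker-3", "edge-1", "edge-2"}
	for _, ip := range []string{"192.168.10.1", "192.168.10.2"} {
		fmt.Println(ip, "->", hashOrder(nodes, ip))
	}
}
```

Because the key mixes the node name with the IP, different IPs generally get different orderings, which is what spreads announcements across nodes. Every speaker that runs the same sort over the same node set arrives at the same winner without coordination.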
A cluster has three worker nodes and two "edge" nodes.
L2 traffic should land on the edge nodes when possible, but the workers are needed as failover targets when both edge nodes are down. Today, using nodeSelectors to
restrict to edge nodes means zero failover if both go down.
A bare-metal cluster spans two server rooms (zone-a, zone-b). Clients connect through a switch in zone-a. The L2 announcer should be in zone-a to minimize cross-zone hops, but zone-b should take over during zone-a maintenance windows.
Cluster upgrades often take nodes out of rotation, for example by draining and rebooting them in a rolling fashion. With announcements spread across many nodes, IP assignments get shuffled each time an announcing node goes down, and a full upgrade cycle can shuffle them repeatedly. Pinning announcements to a small set of "anchor" nodes (upgraded first or last) keeps IPs stable for the majority of the upgrade window. The remaining nodes still serve as failover targets when the anchor nodes themselves are patched.
- L2Advertisements without preferredNodeSelectors behave identically to today.
- The API follows an established Kubernetes pattern (PreferredSchedulingTerm).

One new field on L2AdvertisementSpec and one new type.
type L2AdvertisementSpec struct {
	// ... existing fields unchanged ...

	// PreferredNodeSelectors allows specifying soft node preferences for L2
	// leader election. Nodes matching these selectors receive a higher score
	// and are preferred as the announcing node. If no preferred node is
	// available, any eligible node (per NodeSelectors) can announce.
	// Modeled after Kubernetes PreferredSchedulingTerm.
	// +optional
	PreferredNodeSelectors []PreferredNodeSelector `json:"preferredNodeSelectors,omitempty"`
}

// PreferredNodeSelector expresses a weighted soft preference for nodes.
// This follows the Kubernetes PreferredSchedulingTerm pattern where Weight
// controls relative priority and Preference selects matching nodes.
type PreferredNodeSelector struct {
	// Weight associated with matching the corresponding preference,
	// in the range 1-100.
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=100
	Weight int32 `json:"weight"`

	// A node selector that applies to this weight.
	Preference metav1.LabelSelector `json:"preference"`
}
The naming and structure follow Kubernetes' PreferredSchedulingTerm:
- preference matches PreferredSchedulingTerm.Preference to signal this is a soft selector.
- metav1.LabelSelector is consistent with existing MetalLB selector fields (NodeSelectors, IPAddressPoolSelectors, etc.). We use this instead of NodeSelectorTerm, consistent with the rest of the MetalLB API.
- int32 weight, range 1-100, required (no omitempty), matching PreferredSchedulingTerm.Weight type and validation.

The field is optional and added to v1beta1. New controllers reading old CRDs see nil, identical
to today's behavior.
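Safe upgrade order:
1. Upgrade the CRDs and all MetalLB components to a version that understands preferredNodeSelectors. Without this, older components can drop the field on write.
2. Only then create or update L2Advertisements that set preferredNodeSelectors.

Only nodes with role: lb are eligible. Among those, prefer edge nodes: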
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: prefer-edge
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
  nodeSelectors:
    - matchLabels:
        role: lb
  preferredNodeSelectors:
    - weight: 100
      preference:
        matchLabels:
          node-role: edge
All nodes are eligible. Prefer edge nodes. Any node can announce as fallback:
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: prefer-edge-all-eligible
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
  preferredNodeSelectors:
    - weight: 100
      preference:
        matchLabels:
          node-role: edge
Minimal configuration, backward compatible (no preferences):
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
nodeSelectors is the hard eligibility filter and determines which nodes can announce.
preferredNodeSelectors orders nodes within that eligible set. A node must pass
nodeSelectors (via some ad in the pool) before preference scores apply.
all cluster nodes
  --> per ad at config-parse time:
        selectedNodes(ad.nodeSelectors)                    = ad.Nodes (per ad)
  --> for each service, filter by serviceSelectors only:
        l2AdsForService(pool.L2Advertisements, svc)        = adsForService
  --> speakers from memberlist UsableSpeakers
  --> speakersForAds via adsMatchNodeL2(adsForService, s):
        node covered by at least one service-matching ad   = candidate nodes
  --> filtered by ETP:Local / healthy endpoints            = available nodes
  --> sum ad.PreferredNodes across adsForService           = scored available nodes
  --> sort: score DESC, sha256(node+"#"+ipString) ASC      = election order
  --> availableNodes[0] == myNode ? announce
The local participation guard uses the same adsForService: if no ad in that set covers
myNode (adsMatchNodeL2(adsForService, myNode) is false), the speaker returns
noMatchingAdvertisement and sits out. Every speaker reaches this guard with the same
adsForService (no node argument on the filter), so all speakers that pass the guard see the
same candidate set and the same score map.
Scoring is ad-scoped. Within a single L2Advertisement, preferences apply only to nodes
matching that ad's own nodeSelectors. An ad with no nodeSelectors scores every node. An ad
never raises the score of a node outside its own eligible set.
Weights within one ad add up. A node matching both a weight-60 and a weight-50 selector in the same ad scores 110 from that ad.
For a given service, the final score sums per-ad scores across every L2Advertisement that
targets the service's pool and matches the service via serviceSelectors. Ads that don't match
the service (by pool or by serviceSelectors) contribute nothing. Nodes matching no preference
in any applicable ad score 0.
The scoring step slots in between availableNodes and the existing sort.Slice
call in ShouldAnnounce, reusing the adsForService slice already computed
earlier in the same function:
adsForService = l2AdsForService(pool.L2Advertisements, svc)  // already computed in ShouldAnnounce
scores = map[nodeName]int64
for each ad in adsForService:
    for node, weight in ad.PreferredNodes:
        scores[node] += weight

sort availableNodes by:
    1. scores[node] descending                  (higher weight wins)
    2. sha256(node + "#" + ipString) ascending  (deterministic tie-break)
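The pseudocode above translates fairly directly into Go. The following is a sketch under stated assumptions rather than the actual ShouldAnnounce code: `l2Advertisement` is a trimmed stand-in for the internal config struct, and `electionOrder` is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// l2Advertisement is a hypothetical stand-in for the internal config struct;
// only the field used by the scoring step is shown.
type l2Advertisement struct {
	PreferredNodes map[string]int64 // node name -> weight from this ad
}

// electionOrder sums each ad's PreferredNodes into a per-node score, then
// sorts by score descending with the sha256 hash as ascending tie-break.
func electionOrder(available []string, adsForService []l2Advertisement, ipString string) []string {
	scores := map[string]int64{}
	for _, ad := range adsForService {
		for node, weight := range ad.PreferredNodes {
			scores[node] += weight
		}
	}
	ordered := append([]string(nil), available...)
	sort.Slice(ordered, func(i, j int) bool {
		si, sj := scores[ordered[i]], scores[ordered[j]]
		if si != sj {
			return si > sj // higher score wins
		}
		hi := sha256.Sum256([]byte(ordered[i] + "#" + ipString))
		hj := sha256.Sum256([]byte(ordered[j] + "#" + ipString))
		return bytes.Compare(hi[:], hj[:]) < 0
	})
	return ordered
}

func main() {
	// Illustrative data: two edge nodes preferred with weight 100.
	ads := []l2Advertisement{
		{PreferredNodes: map[string]int64{"edge-1": 100, "edge-2": 100}},
	}
	nodes := []string{"worker-1", "worker-2", "edge-1", "edge-2"}
	// The two edge nodes sort ahead of the workers; the hash orders each tier.
	fmt.Println(electionOrder(nodes, ads, "192.168.10.1"))
}
```

With no preferences configured, every score is 0 and the comparator reduces to the hash comparison alone, so the ordering matches the existing behavior.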
A pool can have multiple L2Advertisement objects, and MetalLB already aggregates fields
across them. adsMatchNodeL2 ORs node eligibility across the service-matching ads at
election time, and ipAdvertisementFor unions advertised interfaces per service at
announcement time. preferredNodeSelectors aggregates per ad. Each ad scores only its own
eligible nodes, and a service's final score is the sum across the ads matching that service.
Example - preferences bound to the ad's own eligible set:
Ad1: nodeSelectors: [{role: lb}]    → nodes A, B eligible for Ad1
Ad2: nodeSelectors: [{role: edge}]  → nodes C, D eligible for Ad2
Ad3: nodeSelectors: [{role: gpu}]   → node E eligible for Ad3
     preferredNodeSelectors: [{weight: 100, preference: {zone: primary}}]
Combined eligible set: A, B, C, D, E (OR across all three ads)
If node C matches zone=primary → C still scores 0. Ad3's preference only applies to Ad3's
own eligible nodes (just E), so C is not lifted by a preference from an ad that does not
target it.
Example - two ads both targeting the pool with overlapping eligible sets, both contributing preferences to the shared nodes:
Ad1: nodeSelectors: [{role: lb}]
     preferredNodeSelectors: [{weight: 70, preference: {zone: primary}}]
Ad2: nodeSelectors: [{role: lb}]
     preferredNodeSelectors: [{weight: 30, preference: {gpu: "true"}}]
Node X (role=lb, zone=primary, gpu=true) → 70 (Ad1) + 30 (Ad2) = 100
Node Y (role=lb, zone=primary) → 70 (Ad1) + 0 = 70
Node Z (role=lb) → 0 + 0 = 0
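The two examples above can be reproduced with a small Go sketch. All types and function names here are hypothetical stand-ins, selectors are simplified to matchLabels only, and multiple nodeSelectors within one ad are assumed to be ORed.

```go
package main

import "fmt"

// labels is a label set. All types in this sketch are hypothetical.
type labels map[string]string

// preference is a simplified weighted selector supporting matchLabels only.
type preference struct {
	Weight      int64
	MatchLabels labels
}

// ad models one L2Advertisement: a hard eligibility filter plus preferences.
// An empty NodeSelectors slice means every node is eligible for this ad.
type ad struct {
	NodeSelectors []labels
	Preferences   []preference
}

// matches reports whether node carries every key/value pair in sel.
func matches(node, sel labels) bool {
	for k, want := range sel {
		if got, ok := node[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// eligible reports whether the node passes the ad's own nodeSelectors.
func eligible(a ad, node labels) bool {
	if len(a.NodeSelectors) == 0 {
		return true
	}
	for _, sel := range a.NodeSelectors {
		if matches(node, sel) {
			return true
		}
	}
	return false
}

// score sums matching preference weights across ads, counting an ad only
// when the node is inside that ad's own eligible set.
func score(ads []ad, node labels) int64 {
	var total int64
	for _, a := range ads {
		if !eligible(a, node) {
			continue
		}
		for _, p := range a.Preferences {
			if matches(node, p.MatchLabels) {
				total += p.Weight
			}
		}
	}
	return total
}

func main() {
	overlapping := []ad{
		{NodeSelectors: []labels{{"role": "lb"}},
			Preferences: []preference{{Weight: 70, MatchLabels: labels{"zone": "primary"}}}},
		{NodeSelectors: []labels{{"role": "lb"}},
			Preferences: []preference{{Weight: 30, MatchLabels: labels{"gpu": "true"}}}},
	}
	fmt.Println(score(overlapping, labels{"role": "lb", "zone": "primary", "gpu": "true"})) // Node X: 100
	fmt.Println(score(overlapping, labels{"role": "lb", "zone": "primary"}))                // Node Y: 70
	fmt.Println(score(overlapping, labels{"role": "lb"}))                                   // Node Z: 0

	scoped := []ad{
		{NodeSelectors: []labels{{"role": "edge"}}},
		{NodeSelectors: []labels{{"role": "gpu"}},
			Preferences: []preference{{Weight: 100, MatchLabels: labels{"zone": "primary"}}}},
	}
	// Node C is eligible via the edge ad only, so the gpu ad's preference does not apply.
	fmt.Println(score(scoped, labels{"role": "edge", "zone": "primary"})) // Node C: 0
}
```

The eligibility check inside the loop is what keeps scoring ad-scoped: dropping it would let any ad's preference lift any node in the pool.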
The L2Advertisement validating webhook passes all existing L2Advertisements and IPAddressPools
through config.For(), so it supports both single-object and cross-object checks.
New validation rules:
- preference label selector must be valid. Enforced by metav1.LabelSelectorAsSelector() during config parsing.

No cross-object validation is needed. You can combine preferredNodeSelectors with
serviceSelectors and with other L2Advertisements targeting the same pool. See
Sort Algorithm and
Multiple L2Advertisements Per Pool.
A new PreferredNodes map[string]int64 field on the internal L2Advertisement config struct
holds per-ad scores. l2AdvertisementFromCR fills it alongside the existing selectedNodes
call.
type L2Advertisement struct {
	Nodes          map[string]bool
	Interfaces     []string
	AllInterfaces  bool
	PreferredNodes map[string]int64 // new: node name -> aggregated weight, ad-scoped
}
PreferredNodes keys are a subset of Nodes. A missing key means the node scored 0 under
this ad, not that it is ineligible. Nodes remains the eligibility map.
When preferredNodeSelectors is nil or empty, PreferredNodes is nil, all scores are 0, and
the sort falls through to pure hash-based ordering. Behavior is identical to today's release.
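The nil case needs no special handling in Go, because indexing a nil map is valid and yields the zero value. The sketch below illustrates this with a trimmed, hypothetical copy of the struct; `scoreFor` is a name invented for the example.

```go
package main

import "fmt"

// l2Advertisement is a trimmed, hypothetical copy of the internal struct.
type l2Advertisement struct {
	Nodes          map[string]bool
	PreferredNodes map[string]int64
}

// scoreFor returns the node's score under one ad. Indexing a nil map is
// valid in Go and yields the zero value, so no nil check is required.
func scoreFor(ad l2Advertisement, node string) int64 {
	return ad.PreferredNodes[node]
}

func main() {
	noPrefs := l2Advertisement{Nodes: map[string]bool{"node-a": true}}
	withPrefs := l2Advertisement{
		Nodes:          map[string]bool{"node-a": true, "node-b": true},
		PreferredNodes: map[string]int64{"node-a": 100},
	}
	fmt.Println(scoreFor(noPrefs, "node-a"))   // 0: PreferredNodes is nil
	fmt.Println(scoreFor(withPrefs, "node-a")) // 100
	fmt.Println(scoreFor(withPrefs, "node-b")) // 0: eligible, but no preference matched
}
```

Ranging over a nil map likewise performs zero iterations, so a summing loop over ad.PreferredNodes is also safe when the field is unset.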
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: external-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.0/24
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: prefer-edge
  namespace: metallb-system
spec:
  ipAddressPools:
    - external-pool
  preferredNodeSelectors:
    - weight: 100
      preference:
        matchLabels:
          node-role: edge
Nodes labeled node-role: edge score 100 and are tried first. All other nodes score 0 and
serve as failover. If both edge nodes go down, the hash tie-break decides which worker announces.
When an edge node recovers, the speakers re-elect it on the next reconciliation cycle.
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: tiered-zones
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
  preferredNodeSelectors:
    - weight: 100
      preference:
        matchLabels:
          zone: primary
    - weight: 50
      preference:
        matchLabels:
          zone: secondary
A node in zone: primary scores 100. A node in zone: secondary scores 50. A node in neither
zone scores 0. Within each tier, the hash provides deterministic ordering.
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: cumulative-example
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
  preferredNodeSelectors:
    - weight: 60
      preference:
        matchLabels:
          zone: primary
    - weight: 50
      preference:
        matchLabels:
          gpu: "true"
A node with both zone: primary and gpu: "true" scores 110. A node with only
zone: primary scores 60. A node with only gpu: "true" scores 50. The node scoring 110
is preferred over a node scoring 100 from a single high-weight selector.
A PreferredNodeSelector with an empty preference (no matchLabels or matchExpressions)
matches every node in the ad's own ad.Nodes set. If the ad has no nodeSelectors, every
cluster node gets the same weight bump and election order is unchanged relative to pure hash
order. If the ad has nodeSelectors, only nodes inside that eligible set get the bump, which
does lift them above nodes eligible only via other ads in the pool.
Volatile labels cause re-elections with nodeSelectors. preferredNodeSelectors increases the risk:
Mitigations:
You can combine preferredNodeSelectors with serviceSelectors. Election runs per-service and
considers only ads that both target the service's pool and match the service via
serviceSelectors (see #3014). Scoring runs on
that filtered set, so a service-scoped ad's preferences only influence election for services it
matches.
Per-service preference maps to a single CR:
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: zone-preference-for-frontend
  namespace: metallb-system
spec:
  ipAddressPools:
    - shared-pool
  serviceSelectors:
    - matchLabels:
        app: frontend
  preferredNodeSelectors:
    - weight: 100
      preference:
        matchLabels:
          zone: primary
Services not matching app: frontend ignore this advertisement.
With the rule from #3025
disallowing allow-shared-ip + serviceSelectors on L2Advertisements,
shared-IP siblings always see the same ad set and converge on the same
announcer.
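The service-scoped behavior described in this section can be sketched as a filter followed by a sum. This is an illustration only: the `advertisement` type, `adsForService`, and `nodeScore` are hypothetical names, selectors are reduced to matchLabels, and treating an empty serviceSelectors list as "matches every service" is an assumption of the sketch.

```go
package main

import "fmt"

// advertisement is a hypothetical, simplified L2Advertisement. Each entry in
// ServiceSelectors is a matchLabels map; an empty slice is assumed to match
// every service.
type advertisement struct {
	ServiceSelectors []map[string]string
	PreferredNodes   map[string]int64
}

// matchesLabels reports whether have contains every key/value pair in want.
func matchesLabels(have, want map[string]string) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// adsForService keeps the ads whose serviceSelectors match the service.
// It takes no node argument, so every speaker computes the same result.
func adsForService(ads []advertisement, svcLabels map[string]string) []advertisement {
	var out []advertisement
	for _, ad := range ads {
		if len(ad.ServiceSelectors) == 0 {
			out = append(out, ad)
			continue
		}
		for _, sel := range ad.ServiceSelectors {
			if matchesLabels(svcLabels, sel) {
				out = append(out, ad)
				break
			}
		}
	}
	return out
}

// nodeScore sums the node's preference weight over the service-matching ads.
func nodeScore(ads []advertisement, svcLabels map[string]string, node string) int64 {
	var total int64
	for _, ad := range adsForService(ads, svcLabels) {
		total += ad.PreferredNodes[node]
	}
	return total
}

func main() {
	ads := []advertisement{{
		ServiceSelectors: []map[string]string{{"app": "frontend"}},
		PreferredNodes:   map[string]int64{"node-primary": 100},
	}}
	fmt.Println(nodeScore(ads, map[string]string{"app": "frontend"}, "node-primary")) // 100
	fmt.Println(nodeScore(ads, map[string]string{"app": "backend"}, "node-primary"))  // 0
}
```

Two services with identical labels always receive the same filtered ad set and therefore the same scores, which is the property the shared-IP rule relies on.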
Operators need to understand weighted scoring. Misconfigured preferences can cause unexpected IP migration or reduce failover capacity.
preferredNodeSelectors:
  - matchLabels:
      zone: primary
A simpler version where preferred nodes are just a label selector with no weight. Matching nodes are preferred. Non-matching nodes are fallback.
Cannot express multi-tier prioritization (primary zone > secondary zone > everything else).
nodeSelectorPriority:
  - priority: 10
    nodeSelector:
      matchLabels:
        zone: primary
  - priority: 20
    nodeSelector:
      matchLabels:
        zone: secondary
An explicit priority number (lower wins) similar to IPAddressPool.serviceAllocation.priority.
Consistent with pool priority, but invents new naming rather than following the Kubernetes PreferredSchedulingTerm pattern.
prioritizedNodeSelectors:
  - weight: 100
    preference:
      matchLabels:
        node-role: edge
  - weight: 0  # sentinel: eligible only as fallback
    preference:
      matchLabels:
        role: lb
A single field replaces nodeSelectors and folds eligibility and preference into one list.
Any positive weight marks a preferred node. Zero marks a fallback-only eligible node.
Operators reason about one field instead of two.
This shape diverges from the Kubernetes required / preferred scheduling split that MetalLB
already follows via nodeSelectors, overloads one field with two distinct semantics, and
forces an API break or a compatibility shim on nodeSelectors. A sibling
preferredNodeSelectors field keeps the current shape and avoids a migration.
Scenarios to cover:
- Preferences apply only to nodes matching the ad's own nodeSelectors. A preference-only ad (no nodeSelectors) scores every node
- serviceSelectors + preferredNodeSelectors: preference applies only to services matching the ad
- nodeSelectors filtering, externalTrafficPolicy: Local
- Nil or empty preferredNodeSelectors produces identical behavior to today
- PreferredNodes map population, invalid label selectors

Scenarios to cover:
- nodeSelectors (required + preferred)
- serviceSelectors + preferredNodeSelectors scoped to a subset of services