design/crd-configstate.md
As a cluster user, I want to verify that MetalLB CRs accepted by the API server are also successfully processed by the MetalLB components. This CR surfaces configuration errors that are currently hidden in logs, making them visible via standard kubectl commands.
We define configuration errors as errors caused solely by user input that can only be resolved by correcting that input. This is in contrast to infrastructure errors (such as missing RBAC permissions to watch resources) which can be fixed by administrators through deployment changes or code bug fixes.
Note: While webhooks cover most configuration errors, they cannot be solely relied upon.
In corner cases (e.g., many concurrent API calls), two conflicting CRs could be stored in the
API. Another common case is that the webhook explicitly ignores transient errors related to
resource ordering by resetting fields that could cause such errors (via the
`resetTransientErrorsFields` function).
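To make the second case concrete, here is a minimal sketch of the field-resetting idea, assuming the v1beta2 BGPPeer API; the real `resetTransientErrorsFields` in MetalLB's webhook code may differ in shape:

```go
package webhook

import (
	corev1 "k8s.io/api/core/v1"

	"go.universe.tf/metallb/api/v1beta2"
)

// resetTransientErrorsFields clears the fields whose validation depends on
// other resources existing, so the webhook does not reject CRs based on
// creation order. Sketch only.
func resetTransientErrorsFields(peer v1beta2.BGPPeer) v1beta2.BGPPeer {
	peer.Spec.BFDProfile = ""                           // may reference a BFDProfile created later
	peer.Spec.PasswordSecret = corev1.SecretReference{} // may reference a Secret created later
	return peer
}
```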
Note: This CRD specifically does not cover errors that arise when the configuration
applied by the speaker to the frr-k8s daemon conflicts with other external configurations
within that daemon.
One CRD definition, multiple namespaced CRs (one per component instance):
```bash
$ kubectl get configurationstates -n metallb-system
NAME                   RESULT   LASTERROR   AGE
controller             Valid                5m
speaker-kind-worker    Valid                5m
speaker-kind-worker2   Valid                5m
...

# Query by component type using labels
$ kubectl get configurationstates -n metallb-system -l metallb.io/component-type=speaker

# Query by node using labels
$ kubectl get configurationstates -n metallb-system -l metallb.io/node-name=kind-worker
```
```go
// ConfigurationState is a status-only CRD that reports configuration validation
// results from MetalLB components. The type and node information are conveyed
// through labels rather than spec fields, following Kubernetes best practices
// for status resources.
//
// Labels:
//   - metallb.io/component-type: "controller" or "speaker"
//   - metallb.io/node-name: node name (only for speaker)

// ConfigurationStateStatus defines the observed state of ConfigurationState.
type ConfigurationStateStatus struct {
	// Result indicates the configuration validation result.
	// Possible values:
	//   - "Valid": Configuration is successfully validated.
	//   - "Invalid": Configuration has errors.
	//   - "Unknown": Component has not reported state (e.g., during
	//     initialization or after a crash).
	// +optional
	// +kubebuilder:validation:Enum=Valid;Invalid;Unknown
	Result string `json:"result,omitempty"`

	// LastError contains the error message from the last reconciliation failure.
	// This field is empty when Result is "Valid".
	// +optional
	LastError string `json:"lastError,omitempty"`
}
```
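For completeness, the enclosing object would carry no spec at all; a sketch, assuming the usual kubebuilder layout with `metav1` imported from `k8s.io/apimachinery/pkg/apis/meta/v1`:

```go
// ConfigurationState has no spec: components only write status, and identity
// (component type, node) is conveyed via labels.
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type ConfigurationState struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status ConfigurationStateStatus `json:"status,omitempty"`
}
```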
For the controller (Deployment, single instance) there are three reconcilers, all reporting into a single ConfigurationState CR:
```yaml
apiVersion: metallb.io/v1beta1
kind: ConfigurationState
metadata:
  name: controller
  namespace: metallb-system
  labels:
    metallb.io/component-type: controller
status:
  result: "Valid"
  lastError: ""
```

```yaml
# Failed example
apiVersion: metallb.io/v1beta1
kind: ConfigurationState
metadata:
  name: controller
  namespace: metallb-system
  labels:
    metallb.io/component-type: controller
status:
  result: "Invalid"
  lastError: "failed to parse configuration: CIDR \"192.168.10.100/32\" in pool \"client2-pool\" overlaps with already defined CIDR \"192.168.10.0/24\""
```
For the speaker (DaemonSet, one pod per node) there is one CR per node:
```yaml
apiVersion: metallb.io/v1beta1
kind: ConfigurationState
metadata:
  name: speaker-kind-worker
  namespace: metallb-system
  labels:
    metallb.io/component-type: speaker
    metallb.io/node-name: kind-worker
status:
  result: "Valid"
  lastError: ""
```

```yaml
# Failed example
apiVersion: metallb.io/v1beta1
kind: ConfigurationState
metadata:
  name: speaker-kind-worker2
  namespace: metallb-system
  labels:
    metallb.io/component-type: speaker
    metallb.io/node-name: kind-worker2
status:
  result: "Invalid"
  lastError: "peer peer1 referencing non existing bfd profile my-bfd-profile"
```
Ideally this would follow the FRRNodeState pattern from the metallb/frr-k8s repo, but that pattern might not fit here because the ConfigurationStateReconciler needs access to the in-memory results of ConfigReconciler and NodeReconciler. An alternative implementation based on a condition instead of a channel can be evaluated; TBD, to be discussed during the implementation PR.
The `res := r.Handler(r.Logger, cfg)` call will not be refactored to return an error. When the handler returns a SyncStateNoRetry result, that will be reported in the ConfigurationState status.
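A rough sketch of how that could look, assuming a controller-runtime client; `reportState` and its wiring are hypothetical, and the final shape is TBD per the note above:

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"go.universe.tf/metallb/api/v1beta1"
)

// reportState is a hypothetical helper translating a handler result into the
// component's ConfigurationState status. SyncStateNoRetry is the existing
// sync result for errors that retrying cannot fix.
func reportState(ctx context.Context, cli client.Client, name string, res SyncState, lastErr error) error {
	state := &v1beta1.ConfigurationState{}
	key := types.NamespacedName{Namespace: "metallb-system", Name: name}
	if err := cli.Get(ctx, key, state); err != nil {
		return err
	}
	state.Status.Result = "Valid"
	state.Status.LastError = ""
	if res == SyncStateNoRetry && lastErr != nil {
		state.Status.Result = "Invalid"
		state.Status.LastError = lastErr.Error()
	}
	return cli.Status().Update(ctx, state)
}
```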
E2E tests for transient errors can run without disabling webhooks. Tests for other validation errors might require disabling the webhook.
Transient errors occur due to interdependencies between CRDs where resources reference other resources that don't exist yet. These errors are temporary and resolve automatically when the missing resource is created. The webhook strips fields that can cause transient errors to avoid making assumptions based on object creation ordering.
The known transient error cases and their messages:

- BFD Profile Reference: `peer %s referencing non existing bfd profile %s`
- Password Secret Reference: `secret ref not found for peer config %q/%q`
- Community Alias Reference
Given a MetalLB installation in namespace metallb-system
When the user applies a BGPPeer that references a non-existent BFD profile:

```yaml
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: peer1
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64513
  peerAddress: 192.168.1.1
  bfdProfile: my-bfd-profile # References a profile that doesn't exist
```
Then the speaker's ConfigReconciler fails to load the configuration:

```bash
kubectl apply -f bgppeer.yaml
# Output: bgppeer.metallb.io/peer1 created

kubectl logs -n metallb-system -l component=speaker -c speaker --tail=50 | grep "error"
# Output: {"caller":"config_controller.go:140","controller":"ConfigReconciler","error":"peer peer1 referencing non existing bfd profile my-bfd-profile","level":"error"}
```
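A sketch of how this scenario could be covered end to end without disabling the webhook, assuming a Gomega-based harness; the helpers `applyBGPPeer`, `applyBFDProfile`, and `speakerStateFor` are hypothetical:

```go
package e2e

import (
	"testing"

	"github.com/onsi/gomega"
)

func TestTransientBFDProfileError(t *testing.T) {
	g := gomega.NewWithT(t)

	// Apply a BGPPeer referencing a BFDProfile that does not exist yet.
	// The webhook accepts it because it strips the bfdProfile field.
	applyBGPPeer(t, "peer1", "my-bfd-profile")

	// The speaker's ConfigurationState should report Invalid.
	g.Eventually(func() string {
		return speakerStateFor(t, "kind-worker").Status.Result
	}, "2m", "5s").Should(gomega.Equal("Invalid"))

	// Creating the missing profile resolves the error automatically.
	applyBFDProfile(t, "my-bfd-profile")
	g.Eventually(func() string {
		return speakerStateFor(t, "kind-worker").Status.Result
	}, "2m", "5s").Should(gomega.Equal("Valid"))
}
```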
Other validation errors can occur, for example, when the secret exists but has incorrect content. The user must fix the secret to resolve these errors. The webhook strips the `passwordSecret` field to avoid accessing secret content during validation; the reconciler validates the actual secret content.
The known cases:

- Secret Type Mismatch (the secret type must be `kubernetes.io/basic-auth`)
- Password Field Missing (the secret must contain a `password` field)
- IPv6 Pool with BFD Echo
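A minimal sketch of the first two checks, assuming the speaker validates the referenced Secret roughly like this (function name hypothetical; the exact messages come from MetalLB's config parser, as in the log below):

```go
package config

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validatePasswordSecret is a hypothetical distillation of the content checks
// the reconciler performs on a BGP password secret.
func validatePasswordSecret(secret corev1.Secret) error {
	// The secret must be of type kubernetes.io/basic-auth.
	if secret.Type != corev1.SecretTypeBasicAuth {
		return fmt.Errorf("secret type mismatch on %q/%q, type %q is expected",
			secret.Namespace, secret.Name, corev1.SecretTypeBasicAuth)
	}
	// The secret must carry the password in the "password" field.
	if _, ok := secret.Data["password"]; !ok {
		return fmt.Errorf("password field not present in secret %q/%q",
			secret.Namespace, secret.Name)
	}
	return nil
}
```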
Given a MetalLB installation in namespace metallb-system
When the user applies a Secret with the wrong type and a BGPPeer that references it:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: bgp-password
  namespace: metallb-system
type: Opaque # Wrong type - should be kubernetes.io/basic-auth
stringData:
  password: "mypassword123"
---
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: peer-with-secret
  namespace: metallb-system
spec:
  myASN: 64512
  peerASN: 64513
  peerAddress: 192.168.1.2
  passwordSecret:
    name: bgp-password
```
Then the speaker's ConfigReconciler fails to load the configuration:

```bash
kubectl apply -f secret-and-peer.yaml
# Output:
# secret/bgp-password created
# bgppeer.metallb.io/peer-with-secret created

kubectl logs -n metallb-system -l component=speaker -c speaker --tail=50 | grep "error"
# Output: {"caller":"config_controller.go:140","controller":"ConfigReconciler","error":"parsing peer peer-with-secret secret type mismatch on \"metallb-system\"/\"bgp-password\", type \"kubernetes.io/basic-auth\" is expected \nfailed to parse peer peer-with-secret password secret","level":"error"}
```