Description
What happened:
In multiple standard load balancer mode, local services (ExternalTrafficPolicy=Local) can trigger backend pool updates from two independent paths:
- The main service reconciliation loop (EnsureLoadBalancer/UpdateLoadBalancer).
- The background backendPoolUpdater driven by EndpointSlice updates.
These paths do not share the same serialization mechanism, so they can concurrently call ARM BackendAddressPool CreateOrUpdate for the same backend pool. This can lead to intermittent ARM conflicts (e.g. ETag precondition failures / operation canceled), noisy warning events (LoadBalancerBackendPoolUpdateFailed), and slower convergence during EndpointSlice churn or service LB migration.
Code pointers:
- Updater write path: pkg/provider/azure_local_services.go (loadBalancerBackendPoolUpdater.process -> BackendAddressPoolClient.Get/CreateOrUpdate).
- Main reconcile write path: pkg/provider/azure_loadbalancer_backendpool.go (EnsureHostsInPool -> CreateOrUpdateLBBackendPool).
- Main reconcile serialization: pkg/provider/azure_loadbalancer.go (serviceReconcileLock) and optional azureResourceLocker (lease-based).
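The interleaving described above can be replayed in miniature. The following is a minimal sketch, not the provider's code: `fakeARM`, `pool`, `interleave`, and `errPreconditionFailed` are hypothetical names, and the integer ETag is an assumption standing in for whatever precondition ARM actually checks. It only shows why two read-modify-write sequences against the same backend pool can fail when both read before either writes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPreconditionFailed stands in for an ARM precondition failure (assumption).
var errPreconditionFailed = errors.New("precondition failed: stale ETag")

// pool is a hypothetical, simplified backend pool resource.
type pool struct {
	ETag int
	IPs  []string
}

// fakeARM is an in-memory stand-in for the backend pool API.
type fakeARM struct {
	mu   sync.Mutex
	pool pool
}

// Get returns a copy of the current pool, including its ETag.
func (a *fakeARM) Get() pool {
	a.mu.Lock()
	defer a.mu.Unlock()
	return pool{ETag: a.pool.ETag, IPs: append([]string(nil), a.pool.IPs...)}
}

// CreateOrUpdate accepts the write only if the caller's ETag is still current.
func (a *fakeARM) CreateOrUpdate(p pool) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if p.ETag != a.pool.ETag {
		return errPreconditionFailed
	}
	a.pool = pool{ETag: p.ETag + 1, IPs: append([]string(nil), p.IPs...)}
	return nil
}

// interleave replays the problematic ordering: both writers read the pool
// before either of them writes it back.
func interleave(a *fakeARM) (reconcileErr, updaterErr error) {
	fromReconcile := a.Get()
	fromUpdater := a.Get()
	fromReconcile.IPs = append(fromReconcile.IPs, "10.0.0.4")
	fromUpdater.IPs = append(fromUpdater.IPs, "10.0.0.5")
	reconcileErr = a.CreateOrUpdate(fromReconcile)
	updaterErr = a.CreateOrUpdate(fromUpdater)
	return reconcileErr, updaterErr
}

func main() {
	reconcileErr, updaterErr := interleave(&fakeARM{})
	fmt.Println("reconcile:", reconcileErr)
	fmt.Println("updater:", updaterErr)
}
```

In this sketch the first write succeeds and the second is rejected, and the rejected writer's change is absent from the pool until it retries.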
What you expected to happen:
Backend pool updates should be serialized with the main service reconciliation loop (e.g., share serviceReconcileLock and/or azureResourceLocker), or otherwise handle transient ARM conflicts without emitting failure events.
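One possible shape for that behavior is sketched below, under stated assumptions: `poolLocks`, `updateSerialized`, and `errConflict` are hypothetical names, not identifiers from cloud-provider-azure. Both write paths take the same per-pool lock, and a transient conflict is retried a bounded number of times before an error is surfaced.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConflict stands in for a transient ARM conflict (assumption).
var errConflict = errors.New("conflict")

// poolLocks hands out one mutex per backend pool ID (hypothetical helper).
type poolLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// forPool returns the mutex for poolID, creating it on first use.
func (p *poolLocks) forPool(poolID string) *sync.Mutex {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.locks == nil {
		p.locks = make(map[string]*sync.Mutex)
	}
	l, ok := p.locks[poolID]
	if !ok {
		l = &sync.Mutex{}
		p.locks[poolID] = l
	}
	return l
}

// updateSerialized runs write while holding the pool's lock. A conflict is
// retried until maxAttempts is reached; any other result is returned as-is.
// write is always called at least once.
func updateSerialized(locks *poolLocks, poolID string, maxAttempts int, write func() error) error {
	l := locks.forPool(poolID)
	l.Lock()
	defer l.Unlock()
	for attempt := 1; ; attempt++ {
		err := write()
		if !errors.Is(err, errConflict) || attempt >= maxAttempts {
			return err
		}
	}
}

func main() {
	locks := &poolLocks{}
	members := 0 // shared state that is only touched under the pool lock
	var wg sync.WaitGroup
	// Two goroutines stand in for the reconcile loop and the updater.
	for w := 0; w < 2; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 50; i++ {
				_ = updateSerialized(locks, "pool-a", 3, func() error {
					members++
					return nil
				})
			}
		}()
	}
	wg.Wait()
	fmt.Println(members) // prints 100
}
```

An in-process mutex like this only serializes writers inside a single CCM process. Coordinating across replicas would still need something like the lease-based azureResourceLocker mentioned above.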
How to reproduce it (as minimally and precisely as possible):
- Enable multiple standard load balancers in cloud-provider-azure.
- Create a LoadBalancer Service with ExternalTrafficPolicy=Local.
- Cause rapid EndpointSlice changes (e.g., scale a Deployment up/down) while the service is being reconciled (initial LB creation, or while moving the service between LBs).
- Observe intermittent backend pool update failures in CCM logs and/or Service events.
Anything else we need to know?:
- The updater currently performs ARM calls while holding its internal queue mutex; if serialization is added, it likely should dequeue under the queue mutex and then do ARM operations outside of it.
- The updater does not currently use azureResourceLocker, so in HA deployments multiple CCM replicas could also run the updater concurrently.
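The dequeue-then-release pattern from the first point could look roughly like the sketch below. It is an illustration only: `updater`, `operation`, `enqueue`, `drain`, and `process` here are simplified, hypothetical stand-ins and do not mirror the real loadBalancerBackendPoolUpdater. The queue mutex guards only the slice swap, and the slow remote call runs after the mutex is released.

```go
package main

import (
	"fmt"
	"sync"
)

// operation is a hypothetical queued backend pool change.
type operation struct {
	PoolID string
	IP     string
}

// updater is a simplified stand-in for the background backend pool updater.
type updater struct {
	mu    sync.Mutex
	queue []operation
}

// enqueue only touches the queue, so it holds the mutex briefly.
func (u *updater) enqueue(op operation) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.queue = append(u.queue, op)
}

// drain swaps the queue out under the mutex and returns the pending batch.
func (u *updater) drain() []operation {
	u.mu.Lock()
	defer u.mu.Unlock()
	batch := u.queue
	u.queue = nil
	return batch
}

// process applies a batch with the queue mutex released, so enqueue never
// waits on a slow remote call. Failed operations are reported, not requeued.
func (u *updater) process(apply func(operation) error) []error {
	var errs []error
	for _, op := range u.drain() {
		if err := apply(op); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	u := &updater{}
	u.enqueue(operation{PoolID: "pool-a", IP: "10.0.0.4"})
	errs := u.process(func(op operation) error {
		// Enqueueing here would deadlock if process still held u.mu.
		u.enqueue(operation{PoolID: op.PoolID, IP: "10.0.0.5"})
		return nil
	})
	fmt.Println(len(errs), len(u.drain()))
}
```

With the mutex released during apply, a serialization lock shared with the main reconcile loop can be taken inside apply without blocking EndpointSlice handlers that only need to enqueue.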
Environment:
- Kubernetes version (use kubectl version): N/A
- Cloud provider or hardware configuration: Azure (cloud-provider-azure), multiple standard load balancers enabled
- OS (e.g: cat /etc/os-release): N/A
- Kernel (e.g. uname -a): N/A
- Install tools: cloud-controller-manager + cloud-provider-azure
- Network plugin and version (if this is a network-related bug): N/A
- Others: Services with ExternalTrafficPolicy=Local (EndpointSlice-driven updates)