
Potential race: backendPoolUpdater concurrent with Service LB reconciliation for local services #9839

@nilo19

Description


What happened:

In multiple standard load balancer mode, local services (ExternalTrafficPolicy=Local) can trigger backend pool updates from two independent paths:

  1. The main service reconciliation loop (EnsureLoadBalancer / UpdateLoadBalancer).
  2. The background backendPoolUpdater driven by EndpointSlice updates.

These paths do not share a serialization mechanism, so they can concurrently call ARM BackendAddressPool CreateOrUpdate for the same backend pool. This can cause intermittent ARM conflicts (e.g., ETag precondition failures or canceled operations), noisy LoadBalancerBackendPoolUpdateFailed warning events, and slower convergence during EndpointSlice churn or while a service is being migrated between load balancers.

Code pointers:

  • Updater write path: pkg/provider/azure_local_services.go (loadBalancerBackendPoolUpdater.process -> BackendAddressPoolClient.Get/CreateOrUpdate).
  • Main reconcile write path: pkg/provider/azure_loadbalancer_backendpool.go (EnsureHostsInPool -> CreateOrUpdateLBBackendPool).
  • Main reconcile serialization: pkg/provider/azure_loadbalancer.go (serviceReconcileLock) and optional azureResourceLocker (lease-based).

What you expected to happen:

Backend pool updates should be serialized with the main service reconciliation loop (e.g., share serviceReconcileLock and/or azureResourceLocker), or otherwise handle transient ARM conflicts without emitting failure events.

How to reproduce it (as minimally and precisely as possible):

  1. Enable multiple standard load balancers in cloud-provider-azure.
  2. Create a LoadBalancer Service with ExternalTrafficPolicy=Local.
  3. Cause rapid EndpointSlice changes (e.g., scale a Deployment up/down) while the service is being reconciled (initial LB creation, or while moving the service between LBs).
  4. Observe intermittent backend pool update failures in CCM logs and/or Service events.

Anything else we need to know?:

  • The updater currently performs ARM calls while holding its internal queue mutex; if serialization is added, it likely should dequeue under the queue mutex and then do ARM operations outside of it.
  • The updater does not currently use azureResourceLocker, so in HA deployments multiple CCM replicas could also run the updater concurrently.

Environment:

  • Kubernetes version (use kubectl version): N/A
  • Cloud provider or hardware configuration: Azure (cloud-provider-azure), multiple standard load balancers enabled
  • OS (e.g: cat /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: cloud-controller-manager + cloud-provider-azure
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: Services with ExternalTrafficPolicy=Local (EndpointSlice-driven updates)

Labels: kind/bug