-
Notifications
You must be signed in to change notification settings - Fork 6.8k
fix: check namespace existence before fetching namespaced resources during sync #26148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix: check namespace existence before fetching namespaced resources during sync #26148
Conversation
… sync Signed-off-by: Patroklos Papapetrou <ppapapetrou76@gmail.com>
🔴 Preview Environment stopped on BunnyshellSee: Environment Details | Pipeline Logs Available commands (reply to this comment):
|
blakepettersson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me, but would be good to get the thoughts of e.g @leoluz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of the issue and why I think this fix resolves it:
The scenario in which Argo CD does not have RBAC to perform K8s operations on a resource, and relies on Kyverno to create the needed RBAC is a special scenario.
In this scenario, there is a logical deadlock Argo CD enters.
In order to create a ns during sync, the GetManagedLiveObjs must return successfully, but it does not since K8s API throws Forbidden error upon an attempt to get the resources in the (non-existing at this point) ns.
So the sync does not happen and the ns is not created. But in the user's scenario, the RBAC for Argo CD to create the other resources in that ns can only be provided AFTER the ns is created.
The core issue, in my understanding, is here:
argo-cd/gitops-engine/pkg/cache/cluster.go
Line 1438 in e3f616d
| if err != nil { |
When Argo CD has RBAC (regular cluster install with admin rolebinding/cluster rolebinding), the error upon getting a resource in a non-existent ns is
not found, which is treated as an OK scenario, allowing sync, but when Argo CD has no RBAC, the error upon getting a resource in a non-existent ns is forbidden which is treated as an error scenario and results in the sync not happening.
The proposed fix checks for the ns existence in the cache, and if it does not exist, the execution path of performing a kubectl.GetResource is avoided, so the sync continues normally and the ns would be created in both cases (user described case with argo cd ns scoped installation with partial RBAC and argo cd cluster scope installation with full RBAC).
One thing to note here is that since Kyverno is external and it's creation of the RBAC that Argo CD requires to create the resources in this new ns is async, this fix alone is not enough and the user is recommended to configure sync retries in addition to using this fix, otherwise the ns would be created successfully but the resources in it may not, till the next explicit sync.
reggie-k
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like my understanding was only partial, after further manual testing looks like there are additional execution paths that throw the forbidden error, not allowing the ns to be created.
The issue was that the
CompareAppStatefunction (specifically through the gitops-engineGetManagedLiveObjscluster cache method) would attempt to fetch namespaced resources if the namespace did not exist in the cluster. This could result inForbiddenerrors if the service account has no permission to access the non-existent namespace.This PR addresses the above by checking if a namespace exists in the cluster's cache before attempting a live fetch.
If the namespace is missing and the resource is namespaced, the function now assumes the resource is also missing and skips the fetch.
I also added a UT that fails without the introduced code and succeeds with the changes.
closes #26076
Checklist: