[GCE] fix unsafe webhook vpa-webhook-config #6428
Closed
Does this really matter, considering it runs with `failurePolicy: Ignore`?
Even with `failurePolicy: Ignore`, GKE will still create an error for you and display it in your cluster, so I guess it makes sense to ignore those namespaces. This is a breaking change, though: if for some reason people auto-scale stuff in `kube-system`, it would stop working for them. Therefore, I think we should only do this conditionally and hide it behind some configuration parameter.
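For illustration, excluding a system namespace from the webhook could look roughly like the sketch below. It uses the standard `namespaceSelector` field and the `kubernetes.io/metadata.name` label that Kubernetes sets on every namespace. The webhook name, service name, and service namespace are assumptions for the example, not taken from this PR, and `caBundle` is omitted.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config
webhooks:
  - name: vpa.k8s.io                 # assumed webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore            # admission proceeds if the webhook call fails
    clientConfig:
      service:
        name: vpa-webhook            # assumed service name
        namespace: kube-system       # assumed service namespace
    # Skip every object that lives in kube-system.
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system"]
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
```

With a selector like this, pods created in `kube-system` never reach the webhook, which is exactly why the change would be breaking for anyone who has VPA-managed workloads there.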
Yeah, but if GKE raises an error when it shouldn't, that is a problem with GKE, not with the components that happen to trigger that behavior. I think we should have better reasoning than "Cloudprovider UI displays an error".
Agreed with both of you.
It's breaking and should be flag-gated.
Not sure if it's about the UI, but I agree that if it's really a cloud-provider-specific UI thing, that shouldn't force us to fix it. I thought webhooks against system namespaces are considered "dangerous" either way.
Honestly, I'm surprised to see this selfRegistration thing; I was not aware that it's something people do. @alvaroaleman do you know if that's common?
I think a flag that lets you specify namespaces that should be ignored could be one option.
I'd like to make the default safer and prevent people from accidentally locking their clusters, but as the `failurePolicy` is `Ignore`, that should be fine, right?
It's uncommon for projects to provide this, IME. If it is provided, it is probably at least semi-common to use, because webhooks require you to generate certs, and not everyone has some other mechanism available in their cluster to do that or knows how to do it through helm and friends.
@alvaroaleman
Absolutely agreed! I guess in this specific case, it is hard to say "the UI is just wrong, so if you want to get rid of this error, go talk to your cloud provider". As @kwiesmueller pointed out, even with `failurePolicy: Ignore`, user-deployed webhooks in system namespaces should be avoided. In our internal k8s platform we do a very similar thing and flag these things for users.
@kwiesmueller
This is a double-edged feature. Users can fix validation errors like the one mentioned above themselves. However, at the same time it makes it easier/possible to configure something that leads to an endless eviction loop: if you only configure the webhook to ignore a certain namespace but happen to have objects under VPA control in that namespace, Updater will permanently evict your workload.
Ideally, we wouldn't make it easy or even possible for people to create these situations and shoot themselves in the foot. It reminds me a bit of the discussions we had around making the labelSelector configurable in #5660.
I imagine the best compromise here is probably a namespace denylist, similar to how the VPA components currently have the concept of an allowlist for namespaces they work in (parameter `vpa-object-namespace`). The configuration parameter is the same on all 3 components, and you could argue that a user setting this parameter on one component should take care of setting it on the other two components with the same values. Denylist and allowlist should be made mutually exclusive, and hopefully that gives us a good solution for everyone. WDYT?
Where did they point that out? And could you elaborate on the rationale behind that?
This is a quote from the GKE docs
The same is mentioned in the official k8s docs, but maybe with slightly softer wording
I guess that `failurePolicy: Ignore` will help in most cases, but there are still scenarios where a failing webhook can stop k8s from working properly, so that's probably why webhooks are still flagged as problematic even with this `failurePolicy`. However, I see that you were involved in the discussion around this, so you probably have a better understanding of this than I do.
So the webhook returning a valid HTTP response that is not parseable or such would be an issue. The other issue I am aware of, because it caused an outage for us, is validating anything related to leader election: the client-side timeouts there are way below the 30 seconds, so all of these requests can fail before the webhook times out, which can brick a cluster.
However, we are only validating create pod requests, so I think that should be fine.
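As a rough sketch of the timeout point: `timeoutSeconds` bounds how long the API server waits for the webhook before treating the call as failed, at which point `failurePolicy` applies. In `admissionregistration.k8s.io/v1` the value must be between 1 and 30 seconds and defaults to 10. The webhook name and the 5-second value below are illustrative assumptions, not something proposed in this PR.

```yaml
# Fragment: a single entry under `webhooks:`; other required fields omitted.
- name: vpa.k8s.io            # assumed webhook name
  failurePolicy: Ignore       # a timed-out call is ignored and admission proceeds
  timeoutSeconds: 5           # illustrative; keep it below the timeouts of clients that matter
  rules:
    - operations: ["CREATE"]  # only pod creation is intercepted
      apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["pods"]
```

A short timeout only helps if it is shorter than the client-side timeout of the affected requests; scoping `rules` to pod creation keeps leader-election traffic out of the webhook's path entirely.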