Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCE] fix unsafe webhook vpa-webhook-config #6428

Closed
wants to merge 4 commits into from
Closed
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 14 additions & 4 deletions vertical-pod-autoscaler/pkg/admission-controller/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,15 @@ func selfRegistration(clientset *kubernetes.Clientset, caCert []byte, namespace,
sideEffects := admissionregistration.SideEffectClassNone
failurePolicy := admissionregistration.Ignore
RegisterClientConfig.CABundle = caCert
namespaceSelector := metav1.LabelSelector{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really matter, considering it runs with failurePolicy: ignore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with failurePolicy: ignore, GKE will still create an error for you and display it in your cluster, so I guess it makes sense to ignore those namespaces. This is a breaking change, though, if for some reason people would auto-scale stuff in kube-system, it would stop working for them. Therefore, I think we should only do this conditionally and hide it behind some configuration parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but if GKE raises error when it shouldn't that is a problem with GKE, not with the components that happen to trigger that behavior. I think we should have better reasoning than "Cloudprovider UI displays an error".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with both of you.
It's breaking and should be flag gated.

Not sure if it's about the UI, but agree that if it's really a cloud provider specific UI thing that shouldn't force us to fix it. I thought webhooks against system namespaces are considered "dangerous" either way.
Honestly I'm surprised to see this selfRegistration thing, was not aware that it's something people do. @alvaroaleman do you know if that's common?

I think a flag that lets you specify namespaces that should be ignored could be one option.
I'd like to make the default safer, prevent people from accidentally locking their clusters, but as the failurePolicy is ignore. that should be fine, right?

Copy link
Member

@alvaroaleman alvaroaleman Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm surprised to see this selfRegistration thing, was not aware that it's something people do. @alvaroaleman do you know if that's common?

Its uncommon for projects to provide this IME. If it is provided, it is probably at least semi-common to use, because webhooks require you to generate certs and not everyone has some other mechanism in their cluster available to do that or knows how to do it through helm and friends.

Copy link
Contributor

@voelzmo voelzmo Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alvaroaleman

I think we should have better reasoning than "Cloudprovider UI displays an error".

Absolutely agreed! I guess in this specific case, it is hard to say "the UI is just wrong, so if you want to get rid of this error, go talk to your cloud provider". As @kwiesmueller pointed out, even with failurePolicy: ignore, user deployed webhooks in system namespaces should be avoided. In our internal k8s platform we do a very similar thing and flag these things for users.

@kwiesmueller

I think a flag that lets you specify namespaces that should be ignored could be one option.

This is a double-edged feature. Users can fix validation errors like mentioned above themselves. However, at the same time it makes it easier/possible to configure something that leads to an endless eviction loop: If you only configure the webhook to ignore a certain namespace but happen to have objects under VPA control in those namespaces, Updater will permanently evict your workload.
Ideally, we wouldn't make it easy or even possible for people to create these situations and shoot themselves in the foot. It reminds me a bit of the discussions we had around making the labelSelector configurable in #5660.

I imagine the best compromise here is probable a namespace denylist, similar to how the VPA components currently have the concept of an allowlist for namespaces they work in (parameter vpa-object-namespace). The configuration parameter is the the same on all 3 components and you could argue that a user setting this parameter on one component should take care of setting it on the other two components with the same values. Denylist and allowlist should be made exclusive to each other, and hopefully we found a good solution for everyone. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failurePolicy: ignore, user deployed webhooks in system namespaces should be avoided

Where did they point that out? And could you elaborate on the rationale behind that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a quote from the GKE docs

If a webhook is intercepting any resources in system-managed namespaces, or certain types of resources, GKE considers this unsafe and recommends that you update the webhooks to avoid intercepting these resources.

The same is mentioned in the official k8s docs, but maybe with a bit softer wording

If your admission webhooks don't intend to modify the behavior of the Kubernetes control plane, exclude the kube-system namespace from being intercepted using a namespaceSelector.

I guess that failurePolicy: ignore will help in most cases, but there are still scenarios where a failing webhook can stop k8s from working properly, so that's probably why even with this failurePolicy webhooks are still flagged as problematic. However, I see that you were involved in the discussion around this, so you probably have a better understanding of this than I do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the webhook returning a valid http response that is not parseable or such would be an issue. The other issue I am aware of as it caused an outage for us is if you validate anything related to leader election, because the client-side timeouts there are way below the 30 seconds, so all of these requests can fail before the webhook times out, which can brick a cluster.

However, we are only validating create pod requests, i think that should be fine.

MatchExpressions: []metav1.LabelSelectorRequirement{
{
Key: "kubernetes.io/metadata.name",
Operator: metav1.LabelSelectorOpNotIn,
Values: []string{"kube-system", "kube-node-lease"},
},
},
}
webhookConfig := &admissionregistration.MutatingWebhookConfiguration{
ObjectMeta: metav1.ObjectMeta{
Name: webhookConfigName,
Expand All @@ -91,10 +100,11 @@ func selfRegistration(clientset *kubernetes.Clientset, caCert []byte, namespace,
},
},
},
FailurePolicy: &failurePolicy,
ClientConfig: RegisterClientConfig,
SideEffects: &sideEffects,
TimeoutSeconds: &timeoutSeconds,
FailurePolicy: &failurePolicy,
ClientConfig: RegisterClientConfig,
SideEffects: &sideEffects,
TimeoutSeconds: &timeoutSeconds,
NamespaceSelector: &namespaceSelector,
},
},
}
Expand Down
Loading