Disambiguate the resource usage node removal eligibility messages #6223
Welcome @shapirus! |
/lgtm I see the use of different messaging for UX, but I don't see a need for the decimal to be more restrictive. |
@x13n, could you review this change for approval? |
I agree, this justifies having them. |
I can see how
If going the verbose way, might be useful to also extend it like this:
(in case when |
If we are ok to change it a little more, what about this? klog.V(4).Infof("Node %s unremovable: %s request allocation %.2f%% is above the threshold", node.Name, utilInfo.ResourceName, utilInfo.Utilization * 100)
This makes it a small self-contained change aiming to solve one particular problem. On to your second suggestion, regarding the DaemonSet logic. I think it should be a matter of another PR, because it's going to introduce new logic where this log message is produced (checking if daemonsets are skipped or not), thus it should not block the initially proposed change. If we speak about further improvements, it would be beneficial for the user's experience to include the actual value of the mentioned threshold in the message. |
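To make the last suggestion concrete, here is a minimal sketch of a message that also reports the threshold. The helper name `unremovableMessage`, its parameters, and the trailing threshold wording are assumptions for illustration only; they are not part of cluster-autoscaler or of this PR, and plain `fmt` stands in for `klog`.

```go
package main

import "fmt"

// unremovableMessage is a hypothetical helper (not cluster-autoscaler code).
// utilization and threshold are fractions in the range 0..1.
func unremovableMessage(nodeName, resourceName string, utilization, threshold float64) string {
	return fmt.Sprintf(
		"Node %s unremovable: %s request allocation %.2f%% is above the threshold %.2f%%",
		nodeName, resourceName, utilization*100, threshold*100)
}

func main() {
	// Illustrative values only.
	fmt.Println(unremovableMessage("node-a", "memory", 0.7033, 0.5))
}
```

With the illustrative values above, the reader sees both the measured figure and the limit it was compared against in a single line.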
I'm in favor of improving readability of logs, but I don't think introducing new concepts is the right approach. In Kubernetes, there's a number of well defined quantities: pod request, pod limit, node allocatable, etc. My main point here is that if the current message is not using these concepts (i.e. "request allocation" as a percentage). I don't have a strong opinion on percentage vs fraction or how many digits the number has. |
What about the following: klog.V(4).Infof("Node %s unremovable: requested %s (%.2f%% of allocatable) is above the scale-down utilization threshold", node.Name, utilInfo.ResourceName, utilInfo.Utilization * 100) ? this will produce e.g.:
|
@x13n, I do have a preference on how many digits. I don't think we should lower the accuracy. However, no preference on percentage vs fraction. :) |
(meanwhile, I force-pushed the commit to the latest version I ended up with) |
Force-pushed from f86706a to 112abe9.
...one more update: However I personally think that such precision makes at least two least-significant digits meaningless. There is no way of making any practical use of it. Such precision is way above the practical measurement accuracy. 4 digits (12.34%, 1.234%) should be more than enough for any practical use case and it will be more readable. @x13n what do you think of having 4 significant digits for it? |
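For comparison, the sketch below renders the same fraction with the three precisions discussed in this thread: the default `%f`, the `%.2f` from the PR description, and four significant digits. Using `%.4g` for the last one is an assumption about how it could be done in Go, not something the PR settled on; note that `%g` drops trailing zeros and switches to exponent notation for very small or very large values. The function name `renderings` is hypothetical.

```go
package main

import "fmt"

// renderings formats a fraction (0..1) as a percentage in three ways:
// default %f (six decimals), two decimals, and four significant digits.
func renderings(fraction float64) (full, twoDecimals, fourSig string) {
	pct := fraction * 100
	return fmt.Sprintf("%f%%", pct),
		fmt.Sprintf("%.2f%%", pct),
		fmt.Sprintf("%.4g%%", pct)
}

func main() {
	// Illustrative utilization values only.
	for _, u := range []float64{0.123456, 0.0123456} {
		full, two, four := renderings(u)
		fmt.Println(full, two, four)
	}
}
```

For utilization above 10% the two shorter forms agree; they differ only for small percentages, where `%.4g` keeps one more digit than `%.2f`.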
Yup. Looks good to me. With one caveat that it is actually not true when
Yeah, I struggle to find a use case for keeping 6 significant digits. @jayantjain93 - do you have a specific use case for when high precision would actually help? Does 4 sound reasonable to you? |
So we're talking about this code here: autoscaler/cluster-autoscaler/simulator/utilization/info.go Lines 84 to 127 in db5e83b
As far as I understand, they are excluded from both the numerator and denominator. Which essentially means that the message will still be correct, if it is understood that what it says implies something like "...where requests and allocatable are what they mean to CA in this context instead of raw values from k8s", but we obviously can't add these words as is. We could use something like "Node i-dfa1aa285b1240e88 unremovable: effective requested memory (70.33% of effective allocatable) is above the scale-down utilization threshold", but I think it will look ugly. To be strictly correct there, we could say That being said, I agree, it would be nice to indicate that the way that CA calculates utilization depends on certain settings, and then it'll be up to the user to understand what configuration options they set. If only we could make it simple and concise. Any ideas on possible wording to use without adding extra logic (that'll be partially duplicating what's already done in |
Well, all of that was neatly hidden behind the word "utilization" in the existing code. I understand it is easy to confuse with other notions of utilization, so maybe it would suffice to alter the term a bit? "Allocatable utilization" is almost correct, modulo the caveats discussed. How about a completely different term like "node efficiency"? It'd probably make sense to extend the FAQ to explain what that means, but it would definitely resolve the current confusion problem. Naming is hard... |
The sole purpose of the proposed change was to make the message clear enough for the user to understand at a glance what's going on, without opening a web browser to search for documentation or explanation (after looking at I believe it serves that purpose, while staying sufficiently correct for (arguably) most use cases. Even if the actual number reported may not match the number calculated outside of CA when it's affected by the daemonset-related condition, the message will still serve its main purpose and not confuse the user:
The difference caused by the different states of the skipDaemonSetPods attribute will come into play, if ever, only at a deeper level of debugging, at which point the user will very likely be reading the documentation anyway. (It must be noted here that a significant number of users run CA deployed with a default set of manifests, or even deployed, often by default, by tools like kops, with all the default settings, and never read the documentation.) The question is what our goal is here: provide a better user experience with some technical caveats (until someone comes up with a good wording to cover both), or be strictly correct in the terminology, but make the logs significantly more verbose and potentially more confusing? |
Ok, I think the current phrasing is strictly better in the confusion dimension, so I'm ok to merge as-is. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: shapirus, x13n The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Whoops, I actually don't think the hold is needed. /hold cancel |
Can you please elaborate on this? Do I need to change something? |
Yes - if you follow the |
@x13n that's fixed now. |
Thanks! /lgtm |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Improves user experience as described in #6222
Which issue(s) this PR fixes:
#6222
Special notes for your reviewer:
I would also change the %f to %.2f in the messages to remove the extra significant digits that provide no added value from the output. Please let me know in the comments if it's a good idea, I'll update my commit if needed.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: