Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enhancements/monitoring: add proposal for early-monitoring-config-validation #1716
base: master
Are you sure you want to change the base?
enhancements/monitoring: add proposal for early-monitoring-config-validation #1716
Changes from all commits
bfb53ec
309d93a
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this overlap with the on-going effort to move these configmaps to CRDs? Seems this work would be redundant once the migration to CRDs is complete. Would it not be better to focus on that effort rather than investing here?
What are the timelines for the migration project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, as I explained on Slack (I'll explicitly mention that in the proposal) the implementation of the change proposed here was already available when I started the proposal.
See the linked openshift/cluster-monitoring-operator#2490 (PR already merged now), I worked on that during the last shiftweek.
The changes (were meant to take) took less time than the CRD effort, as they only concerned CMO + they'll
prepare the way for it (CRD based config), provide a preview of what will happen with CRDs, educate users about it, and ease the migration.
+ CRD based config becoming GA may take some time and this would be helpful in the meantime.Also, as I mentioned
this proposal primarily serves an informational and documentary purpose for the various stakeholders.
, and of course, the reviews are intended to help us identify any overlooked side effects. If necessary, we can always revert the CMO PR.(I'll try to incorporate this into the proposal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice use of this, +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would a user want to skip the validation webhook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/openshift/enhancements/pull/1716/files#r1860762082
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These configmaps exist in HyperShift clusters right? Are they in the guest or management control plane layer? What is the impact going to be of adding a new webhook in HyperShift?
This section needs to be thought through
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think ther any special considerations are needed for Hypershift; The early validation could be used wherever CMO is deployed.
I have included additional details under
### Topology Considerations
.That being said, please feel free to notify anyone from Hypershift who you think should be directly informed about this feature. I will also try to reach out to them on Slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to explore the idea of ratcheting validation in this section perhaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see answer to https://github.com/openshift/enhancements/pull/1716/files#r1860755304, do you this I should mention that here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does authentication matter in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a risk under
Risks and Mitigations
, just stating that the webhook does not perform any client authentication (everyone on the cluster can use that path).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why you need ratcheting, don't break existing users, but allow them to fix themselves over time as they adjust fields within the configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see answer to https://github.com/openshift/enhancements/pull/1716/files#r1860755304, does that answer your question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All new features in openshift should be gated, have E2E added and prove stability in tech preview before they are promoted to default. Promotion from tech preview to GA can be within a single release (ie within 4.19 entirely), but you should start off by default to protect the payload stability while you are developing the feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this statement is a bit vague and not entirely accurate. I don't want to point fingers, but this isn't always strictly followed :) and sometimes it's better not to.
Allow me to explain why we're making this GA by default:
monitoringconfigmaps.openshift.io/skip-validate-webhook: true
label can be used to contain any issues.We believe this feature is well defined and its potential breakages can be easily managed. Thus, it is simpler and faster to proceed with this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could backport a tombstone resource to the previous release which would make the CVO remove this resource if it were to see it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I'll look into that.
(also, I think I'm a little bit pessimistic as the server would just respond with "I don't know nothing about /validate-webhook/monitoringconfigmaps" in way less than 5s... I'll give that a try.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we're interested in different metrics for user/platform instance too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean to identify the concerned configmap? platform or the UW one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No existing metrics from the API server provide such detailed information (as this would result in high cardinality, particularly for other webhooks that may be responsible for all configmaps in a cluster, for example), so the metrics would need to be added on CMO side.
In our case, even though only 2 configmaps concern us, don't you think the debug logs (shown below) are sufficient? It's true that for this, the issue should be reproducible, but wouldn't that be easy since, after all, we only have 2 configmaps to consider?