Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enhancements/monitoring: add proposal for early-monitoring-config-validation #1716
base: master
Are you sure you want to change the base?
enhancements/monitoring: add proposal for early-monitoring-config-validation #1716
Changes from 1 commit
bfb53ec
309d93a
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this overlap with the on-going effort to move these configmaps to CRDs? Seems this work would be redundant once the migration to CRDs is complete. Would it not be better to focus on that effort rather than investing here?
What are the timelines for the migration project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, as I explained on Slack (I'll explicitly mention that in the proposal) the implementation of the change proposed here was already available when I started the proposal.
See the linked openshift/cluster-monitoring-operator#2490 (PR already merged now), I worked on that during the last shiftweek.
The changes (were meant to take) took less time than the CRD effort, as they only concerned CMO + they'll
prepare the way for it (CRD based config), provide a preview of what will happen with CRDs, educate users about it, and ease the migration.
+ CRD based config becoming GA may take some time and this would be helpful in the meantime.Also, as I mentioned
this proposal primarily serves an informational and documentary purpose for the various stakeholders.
, and of course, the reviews are intended to help us identify any overlooked side effects. If necessary, we can always revert the CMO PR.(I'll try to incorporate this into the proposal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will need to make sure you employ a ratcheting validation technique to all updates, is that already part of the proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you think the mechanism explained in
Upgrade / Downgrade Strategy
is sufficient?It'll help ensure the existing configmaps are in a good shape before upgrading to
4.18
(that would ship the validation webhook).Also, only two configmaps are concerned by this, with the informative error messages, along with the schema provided here https://docs.openshift.com/container-platform/4.17/observability/monitoring/config-map-reference-for-the-cluster-monitoring-operator.html it shouldn't be too cumbersome to adjust the configmaps if anything slipped through the mechanism in
Upgrade / Downgrade Strategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice use of this, +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would a user want to skip the validation webhook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/openshift/enhancements/pull/1716/files#r1860762082
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These configmaps exist in HyperShift clusters right? Are they in the guest or management control plane layer? What is the impact going to be of adding a new webhook in HyperShift?
This section needs to be thought through
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think ther any special considerations are needed for Hypershift; The early validation could be used wherever CMO is deployed.
I have included additional details under
### Topology Considerations
.That being said, please feel free to notify anyone from Hypershift who you think should be directly informed about this feature. I will also try to reach out to them on Slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to explore the idea of ratcheting validation in this section perhaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see answer to https://github.com/openshift/enhancements/pull/1716/files#r1860755304, do you this I should mention that here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the code is the same, then won't this just mean that the operator then denies the config later? Skipping the validation in this case seems fruitless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the webhook runs a subset of the validations, but it also engages other code paths (server code, apiserver code, etc.). In the event of a bug in those paths, it's always better to have a fallback.
Also, the label is used to simulate and test scenarios where the webhook is skipped (e.g., CMO pod down)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does authentication matter in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a risk under
Risks and Mitigations
, just stating that the webhook does not perform any client authentication (everyone on the cluster can use that path).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why you need ratcheting, don't break existing users, but allow them to fix themselves over time as they adjust fields within the configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see answer to https://github.com/openshift/enhancements/pull/1716/files#r1860755304, does that answer your question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a handful of examples of things that would fail the new validation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All new features in openshift should be gated, have E2E added and prove stability in tech preview before they are promoted to default. Promotion from tech preview to GA can be within a single release (ie within 4.19 entirely), but you should start off by default to protect the payload stability while you are developing the feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this statement is a bit vague and not entirely accurate. I don't want to point fingers, but this isn't always strictly followed :) and sometimes it's better not to.
Allow me to explain why we're making this GA by default:
monitoringconfigmaps.openshift.io/skip-validate-webhook: true
label can be used to contain any issues.We believe this feature is well defined and its potential breakages can be easily managed. Thus, it is simpler and faster to proceed with this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that is true, the migration path for the cluster monitoring CRDs I believe is ambiguous, and, we don't know exactly when the support for configmaps will go away
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate? What I'm trying to say is that "once CMO no longer uses Configmaps, the webhook logic will become useless and will be removed"
I changed the wording, tell me if it's ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Version numbers probably need to be updated here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
z
in4.17.z
is now known5
and we're still aiming for 4.18There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's already existing bad config, and the operator is running the same checks that the webhook will run, why is the operator going degraded/not upgradeable not already a thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMO already goes degraded on bad config, see
user stories and non goals
, the problem is that the resulting signals show up late and can easily be missed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could backport a tombstone resource to the previous release which would make the CVO remove this resource if it were to see it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I'll look into that.
(also, I think I'm a little bit pessimistic as the server would just respond with "I don't know nothing about /validate-webhook/monitoringconfigmaps" in way less than 5s... I'll give that a try.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we're interested in different metrics for user/platform instance too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean to identify the concerned configmap? platform or the UW one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No existing metrics from the API server provide such detailed information (as this would result in high cardinality, particularly for other webhooks that may be responsible for all configmaps in a cluster, for example), so the metrics would need to be added on CMO side.
In our case, even though only 2 configmaps concern us, don't you think the debug logs (shown below) are sufficient? It's true that for this, the issue should be reproducible, but wouldn't that be easy since, after all, we only have 2 configmaps to consider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see more links to this alternative within the document (so those who aren't familiar can find the other EP), and, you should also expand on why this isn't the route we are taking, why has an alternative been dismissed. I've left questions on this earlier, and as an outsider, have no context on why we aren't doing this, explain it to me in this section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my answer to https://github.com/openshift/enhancements/pull/1716/files#r1860750216.
I'll add the links.