# enhancements/monitoring: add proposal for early-monitoring-config-validation #1716

Open. Wants to merge 2 commits into base: `master`.

238 additions, 0 deletions: `enhancements/monitoring/early-monitoring-config-validation.md`

---
title: early-monitoring-config-validation
authors:
  - "@machine424"
reviewers:
  - "@openshift/openshift-team-monitoring"
  - TBD
approvers:
  - "@openshift/openshift-team-monitoring"
  - TBD
api-approvers:
  - TBD
creation-date: 2024-11-13
last-updated: 2024-11-13
tracking-link:
  - TBD
---

# Early Monitoring Config Validation

## Summary

Introduce early validation for changes to monitoring configurations hosted in the
`openshift-monitoring/cluster-monitoring-config` and
`openshift-user-workload-monitoring/user-workload-monitoring-config` ConfigMaps to provide
shorter feedback loops and enhance user experience.
> **Reviewer (on lines +22 to +25):** How does this overlap with the on-going effort to move these configmaps to CRDs? Seems this work would be redundant once the migration to CRDs is complete. Would it not be better to focus on that effort rather than investing here?
>
> What are the timelines for the migration project?

> **Author:** Actually, as I explained on Slack (I'll explicitly mention that in the proposal), the implementation of the change proposed here was already available when I started the proposal. See the linked openshift/cluster-monitoring-operator#2490 (PR already merged now); I worked on that during the last shiftweek.
>
> The changes took less time than the CRD effort, as they only concern CMO. They also prepare the way for CRD-based config, provide a preview of what will happen with CRDs, educate users about it, and ease the migration. In addition, CRD-based config becoming GA may take some time, and this proposal would be helpful in the meantime.
>
> Also, as I mentioned, this proposal primarily serves an informational and documentary purpose for the various stakeholders, and of course the reviews are intended to help us identify any overlooked side effects. If necessary, we can always revert the CMO PR.
>
> (I'll try to incorporate this into the proposal.)


## Motivation

CMO currently uses ConfigMaps to store configurations for the Platform and User Workload monitoring stacks. Due to the limitations of this approach, a migration to CRD-based configs is planned ([OBSDA-212](https://issues.redhat.com/browse/OBSDA-212)). In the interim, enhancing the validation process for these ConfigMaps would be highly beneficial.

Insights show that in `2024`, there were more than `650` unique CMO failures related to parsing issues that lasted over `1h`, with some going unnoticed for over `215` days. The total duration of all failures exceeded `10` years.

### User Stories

As a user, if my configuration is invalid (malformed JSON/YAML, or containing invalid, no-longer-supported, or duplicated fields), I do not want to have to check the operator's status or logs, or wait for an alert, to be notified of the issue.

Such situations may lead me to incorrectly suspect other issues within the monitoring stack, causing me to solicit help from colleagues or support.

The existing signals take time to propagate and can easily be missed, resulting in a poor user experience. A shorter feedback loop, where invalid configurations are rejected when users try to push them into the ConfigMaps (as with CRDs), would be more user-friendly.

### Goals

- **Early Identification of Invalid Configurations**: Detect some invalid configurations (malformed YAML/JSON, unknown fields, duplicated fields, etc.) before CMO attempts to apply them.
- **Improved User Experience**: Empower users with more autonomy by enabling them to identify and correct errors earlier in the configuration process.

### Non-Goals

- The early validation will focus on detecting common errors and avoid computationally intensive deep checks that might impact performance or make the check itself fragile. This means it will not catch all issues that may only be detected when CMO tries to apply the config.
- This addition does not intend to replace or render obsolete the existing `UserWorkloadInvalidConfiguration`/`InvalidConfiguration` related signals in the operator status/logs/alerts.
- This proposal does not intend to prevent or postpone the planned transition to CRDs for enhanced validation capabilities. Instead, it will prepare the way for it, provide a preview of what will happen with CRDs, educate users about it, and ease the migration.
- Some ConfigMap changes may bypass the validation logic if CMO is down for some reason; these changes will not be validated (best-effort approach).
- ConfigMaps with invalid monitoring configurations deployed before the webhook is enabled (before upgrading to the version that enables the validation webhook on CMO) will not be flagged or adjusted. The webhook will only intervene on them during subsequent changes, if any.
> **Reviewer:** You will need to make sure you employ a ratcheting validation technique for all updates, is that already part of the proposal?

> **Author:** Don't you think the mechanism explained in Upgrade / Downgrade Strategy is sufficient? It'll help ensure the existing ConfigMaps are in good shape before upgrading to 4.18 (which would ship the validation webhook). Also, only two ConfigMaps are concerned by this; with the informative error messages, along with the schema provided at https://docs.openshift.com/container-platform/4.17/observability/monitoring/config-map-reference-for-the-cluster-monitoring-operator.html, it shouldn't be too cumbersome to adjust the ConfigMaps if anything slipped through the mechanism in Upgrade / Downgrade Strategy.


## Proposal

Implement and expose a validation webhook in CMO. This webhook will intercept `CREATE` and `UPDATE` actions on the platform and UW monitoring ConfigMaps. It will attempt to fetch the configuration within the ConfigMap, unmarshal/parse it, identify potential errors (such as malformed JSON/YAML, unknown field names, or duplicated fields), and reject the request if such issues are found.

### Workflow Description

The webhook will be enabled by default.

The `ValidatingWebhookConfiguration` will ensure the webhook only intervenes on changes to the two ConfigMaps: `openshift-monitoring/cluster-monitoring-config` and `openshift-user-workload-monitoring/user-workload-monitoring-config`.

```yaml
matchConditions:
- name: 'monitoringconfigmaps'
  expression: '(request.namespace == "openshift-monitoring" && request.name == "cluster-monitoring-config")
    || (request.namespace == "openshift-user-workload-monitoring" && request.name
    == "user-workload-monitoring-config")'
```

> **Reviewer (on lines +67 to +71):** Nice use of this, +1

The webhook will attempt to unmarshal/parse the config within these ConfigMaps.
If the unmarshalling fails, the action on the ConfigMap will be denied.
For example, if a field containing a subtle typo is set, the change will fail with:
```
$ kubectl edit configmap cluster-monitoring-config -n openshift-monitoring
error: configmaps "cluster-monitoring-config" could not be patched: admission webhook "monitoringconfigmaps.openshift.io" denied the request: failed to parse data at key \"config.yaml\": error unmarshaling JSON: while decoding JSON: json: unknown field \"telemeterCliennt\"
```
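
For illustration, here is a minimal sketch of the kind of strict parsing that produces such an error, assuming `sigs.k8s.io/yaml` is used (the struct and its single field are illustrative stand-ins, not CMO's actual configuration types):

```go
package validation

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// clusterMonitoringConfig is an illustrative stand-in for CMO's real
// configuration struct; only one field is shown.
type clusterMonitoringConfig struct {
	TelemeterClient *struct {
		Enabled *bool `json:"enabled,omitempty"`
	} `json:"telemeterClient,omitempty"`
}

// validateConfig rejects malformed YAML/JSON as well as unknown or
// duplicated fields by unmarshalling in strict mode.
func validateConfig(data []byte) error {
	var cfg clusterMonitoringConfig
	if err := yaml.UnmarshalStrict(data, &cfg); err != nil {
		return fmt.Errorf("failed to parse data at key %q: %w", "config.yaml", err)
	}
	return nil
}
```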
### API Extensions
The following `ValidatingWebhookConfiguration` will be added:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  ...
  name: monitoringconfigmaps.openshift.io
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: cluster-monitoring-operator
      namespace: openshift-monitoring
      path: /validate-webhook/monitoringconfigmaps
      port: 8443
  failurePolicy: Ignore
  name: monitoringconfigmaps.openshift.io
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: In
      values: ["openshift-monitoring","openshift-user-workload-monitoring"]
  matchConditions:
  - name: 'monitoringconfigmaps'
    expression: '(request.namespace == "openshift-monitoring" && request.name == "cluster-monitoring-config")
      || (request.namespace == "openshift-user-workload-monitoring" && request.name
      == "user-workload-monitoring-config")'
  - name: 'not-skipped'
    expression: '!has(object.metadata.labels)
      || !("monitoringconfigmaps.openshift.io/skip-validate-webhook" in object.metadata.labels)
      || object.metadata.labels["monitoringconfigmaps.openshift.io/skip-validate-webhook"] != "true"'
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations:
    - CREATE
    - UPDATE
    resources:
    - configmaps
    scope: Namespaced
  sideEffects: None
  timeoutSeconds: 5
```

> **Reviewer (on lines +115 to +118, the `not-skipped` condition):** Why would a user want to skip the validation webhook?

### Topology Considerations
#### Hypershift / Hosted Control Planes
> **Reviewer:** These configmaps exist in HyperShift clusters right? Are they in the guest or management control plane layer? What is the impact going to be of adding a new webhook in HyperShift?
>
> This section needs to be thought through.

> **Author (Nov 28, 2024):** I don't think any special considerations are needed for HyperShift; the early validation could be used wherever CMO is deployed.
>
> I have included additional details under "Topology Considerations".
>
> That being said, please feel free to notify anyone from HyperShift who you think should be directly informed about this feature. I will also try to reach out to them on Slack.

#### Standalone Clusters
#### Single-node Deployments or MicroShift
### Implementation Details/Notes/Constraints
> **Reviewer:** I think you need to explore the idea of ratcheting validation in this section perhaps.

> **Author:** See the answer to https://github.com/openshift/enhancements/pull/1716/files#r1860755304. Do you think I should mention that here as well?

To avoid any divergence (the validation webhook producing false positives), the webhook will run the same code (a subset of the checks) that CMO runs when loading and applying the config.
CMO will expose the webhook at `:8443/validate-webhook/monitoringconfigmaps`.
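
A rough sketch of what the handler could look like, assuming controller-runtime's `admission` package is used (CMO's actual wiring may differ); `validateConfig` is the hypothetical strict-parsing helper sketched earlier:

```go
package validation

import (
	"context"
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type configMapValidator struct{}

// Handle denies CREATE/UPDATE requests whose monitoring config fails the
// same strict parsing CMO applies when loading the configuration.
func (v *configMapValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	cm := &corev1.ConfigMap{}
	if err := json.Unmarshal(req.Object.Raw, cm); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	if data, ok := cm.Data["config.yaml"]; ok {
		if err := validateConfig([]byte(data)); err != nil {
			return admission.Denied(err.Error())
		}
	}
	return admission.Allowed("")
}
```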
### Risks and Mitigations
In the event that the validation webhook makes incorrect decisions (which we aim to keep rare, as the webhook will run the same code that CMO uses when applying the configuration), users will have the option to temporarily bypass the CMO webhook. This can be done by adding the label `monitoringconfigmaps.openshift.io/skip-validate-webhook: true` to the ConfigMaps.
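For example, the bypass label could be applied with:
```
$ kubectl label configmap cluster-monitoring-config -n openshift-monitoring monitoringconfigmaps.openshift.io/skip-validate-webhook=true
```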
> **Reviewer:** If the code is the same, then won't this just mean that the operator then denies the config later? Skipping the validation in this case seems fruitless.

> **Author:** Yes, the webhook runs a subset of the validations, but it also engages other code paths (server code, apiserver code, etc.). In the event of a bug in those paths, it's always better to have a fallback.
>
> Also, the label is used to simulate and test scenarios where the webhook is skipped (e.g., CMO pod down).

Additionally, the webhook endpoint will not perform client authentication on `/validate-webhook/monitoringconfigmaps`. Another proposal will be initiated to discuss how to facilitate easier identification of requests from the apiserver for webhooks in OCP.
> **Reviewer:** Why does authentication matter in this context?

> **Author:** It's a risk under Risks and Mitigations; I'm just stating that the webhook does not perform any client authentication (everyone on the cluster can use that path).

### Drawbacks
Some users who may have been relying on, or exploiting, the lack of pre-validation will need to adapt, as their invalid changes to the ConfigMaps will now be denied by the apiserver. Tightening the configurations now serves as a necessary preparatory step for the upcoming CRD-based configuration effort.
> **Reviewer:** This is why you need ratcheting: don't break existing users, but allow them to fix themselves over time as they adjust fields within the configuration.

> **Author:** See the answer to https://github.com/openshift/enhancements/pull/1716/files#r1860755304. Does that answer your question?

## Open Questions [optional]
## Test Plan
Since the webhook will be enabled by default, all existing tests that create or update the ConfigMaps holding the monitoring configuration are considered tests for the webhook itself.
Additionally, unit and e2e tests will be added (or adjusted) to better highlight invalid configuration scenarios.
> **Reviewer:** Can you provide a handful of examples of things that would fail the new validation?

> **Author:** Sure.
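
For illustration (these snippets are ours, not taken from the proposal), configurations like the following would be rejected; `telemeterClient` and `prometheusK8s` are existing `config.yaml` fields:

```yaml
# Unknown field: subtle typo in "telemeterClient".
telemeterCliennt:
  enabled: false
```

```yaml
# Duplicated field: "prometheusK8s" appears twice.
prometheusK8s:
  retention: 24h
prometheusK8s:
  retention: 48h
```

```yaml
# Malformed YAML: unterminated flow mapping.
prometheusK8s: {retention: 24h
```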

Invalid configuration related tests outside the CMO repository will also need to be adjusted accordingly.
## Graduation Criteria
The webhook is intended to go directly to `GA` and be enabled by default.
> **Reviewer:** All new features in openshift should be gated, have E2E added and prove stability in tech preview before they are promoted to default. Promotion from tech preview to GA can be within a single release (i.e. within 4.19 entirely), but you should start off gated by default to protect the payload stability while you are developing the feature.

> **Author:** "All new features in openshift should be gated"
>
> Well, this statement is a bit vague and not entirely accurate. I don't want to point fingers, but this isn't always strictly followed :) and sometimes it's better not to.
>
> Allow me to explain why we're making this GA by default:
>
> - We have ensured that the feature is thoroughly tested and passes all e2e and blocking payload tests, as well as many of the informing tests that we monitor.
> - Making it tech preview would just limit the clusters on which this feature could be tested.
> - This is not the first time we are using validation webhooks in the monitoring stack; we already have https://github.com/openshift/cluster-monitoring-operator/tree/master/assets/admission-webhook for some of the prometheus operator CRs.
> - We can easily revert the implementation PR if the tests or feedback suggest that this feature shouldn't be part of 4.18.0. Additionally, the `monitoringconfigmaps.openshift.io/skip-validate-webhook: true` label can be used to contain any issues.
>
> We believe this feature is well defined and its potential breakages can be easily managed. Thus, it is simpler and faster to proceed with this approach.

End users will be informed of this change via an entry in the release notes.
We'll wait for the first instance where the webhook needs to be skipped (via the `monitoringconfigmaps.openshift.io/skip-validate-webhook: true` label) to document the procedure, probably in a KCS article. We'll avoid mentioning the opt-out mechanism in the official documentation to prevent abuse, as we want users to tighten up their use of the monitoring ConfigMaps in preparation for the CRD-based configuration migration.
### Dev Preview -> Tech Preview
N/A
### Tech Preview -> GA
N/A
### Removing a deprecated feature
Once CRD-based configuration is GA, configuration via ConfigMaps will no longer be allowed, and the webhook logic will be removed.
> **Reviewer:** I'm not convinced that is true; the migration path for the cluster monitoring CRDs I believe is ambiguous, and we don't know exactly when the support for configmaps will go away.

> **Author:** Could you elaborate? What I'm trying to say is that once CMO no longer uses ConfigMaps, the webhook logic will become useless and will be removed.
>
> I changed the wording, tell me if it's ok.

## Upgrade / Downgrade Strategy
Even after CMO is upgraded to a version with the webhook enabled, as long as the existing monitoring config ConfigMaps are not updated, they will not be flagged by the webhook.
A change in `4.17.z` will make CMO report `upgradeable=false` if the existing configs contain malformed JSON/YAML, invalid fields, no longer supported fields, or duplicated fields. We will ensure clusters reach that version before being able to upgrade to `4.18`. This will help avoid blocking implicit or unplanned changes to ConfigMaps with invalid configs during the upgrade.
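A sketch of what the resulting ClusterOperator condition could look like (the `reason` and `message` wording here are assumptions, not the final strings):
```yaml
status:
  conditions:
  - type: Upgradeable
    status: "False"
    reason: InvalidConfiguration
    message: >-
      the openshift-monitoring/cluster-monitoring-config ConfigMap contains an
      invalid monitoring configuration; fix it before upgrading to 4.18
```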
> **Reviewer:** Version numbers probably need to be updated here.

> **Author:** The z in `4.17.z` is now known (it is 5) and we're still aiming for `4.18`.

> **Reviewer:** If there's already existing bad config, and the operator is running the same checks that the webhook will run, why is the operator going degraded/not upgradeable not already a thing?

> **Author:** CMO already goes degraded on bad config (see User Stories and Non-Goals); the problem is that the resulting signals show up late and can easily be missed.

Upgrades will be covered by existing upgrade tests.
In case of a rollback, the CVO-managed `monitoringconfigmaps.openshift.io` `ValidatingWebhookConfiguration` may need to be deleted to avoid the unnecessary `timeoutSeconds: 5` overhead on each change to the monitoring config ConfigMaps.
> **Reviewer:** You could backport a tombstone resource to the previous release, which would make the CVO remove this resource if it were to see it.

> **Author (Nov 28, 2024):** Good idea, I'll look into that. (Also, I think I'm being a little pessimistic, as the server would just respond with "I don't know anything about /validate-webhook/monitoringconfigmaps" in way less than 5s... I'll give that a try.)

## Version Skew Strategy
The `matchConditions` fields of `ValidatingWebhookConfiguration` are used to limit the webhook to only the 2 monitoring config ConfigMaps and to implement the opt-out mechanism via the label.
`matchConditions` are considered stable in Kubernetes `v1.30`, which has been used since OCP `4.17`. This means that even in the case of a partial upgrade of the apiserver or a downgrade, having that resource around shouldn't cause any issues.
## Operational Aspects of API Extensions
The webhook is configured with `failurePolicy: Ignore`, making it best effort and avoiding having the single CMO replica as a single point of failure. Another protection is added by setting `timeoutSeconds: 5` in case CMO is overwhelmed.
`timeoutSeconds: 5` means that the webhook may add up to `5 seconds` to the two monitoring config ConfigMaps `CREATE` and `UPDATE` requests.
In reality, even for a scenario of `5` ConfigMap updates per second (which is likely an overestimate of actual usage), the `99th` percentile of the processing latency of the admission webhook is expected to be less than `5ms`.
Fewer failures due to `UserWorkloadInvalidConfiguration`/`InvalidConfiguration` should start to be seen for the `monitoring` cluster operator, as some invalid configs will be caught earlier via the webhook now.
## Support Procedures
The `apiserver_admission_webhook_*` metrics should provide insights into the status of the webhook from the apiserver's perspective. For example:
```
histogram_quantile(0.99, rate(apiserver_admission_webhook_admission_duration_seconds_bucket{name="monitoringconfigmaps.openshift.io"}[5m]))
```

> **Reviewer (Member):** Perhaps we're interested in different metrics for the user/platform instance too?

> **Author:** You mean to identify the concerned ConfigMap, platform or the UW one?

> **Reviewer (Member):** Yes, correct.

> **Author:** No existing metrics from the API server provide such detailed information (as this would result in high cardinality, particularly for other webhooks that may be responsible for all ConfigMaps in a cluster, for example), so the metrics would need to be added on the CMO side.
>
> In our case, even though only 2 ConfigMaps concern us, don't you think the debug logs (shown below) are sufficient? It's true that for this the issue should be reproducible, but wouldn't that be easy since, after all, we only have 2 ConfigMaps to consider?

This allows us to monitor the processing latency.
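If the (alpha) `apiserver_admission_webhook_rejection_count` metric is available on the cluster, rejections issued by this webhook could also be tracked with, for example:
```
sum(rate(apiserver_admission_webhook_rejection_count{name="monitoringconfigmaps.openshift.io"}[5m]))
```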
From the CMO perspective, increasing the operator's log level (setting it to `-v=9`) reveals logs such as:
```
I1113 10:58:29.103668 1 handler.go:153] cluster-monitoring-operator: POST "/validate-webhook/monitoringconfigmaps" satisfied by nonGoRestful
I1113 10:58:29.103702 1 pathrecorder.go:243] cluster-monitoring-operator: "/validate-webhook/monitoringconfigmaps" satisfied by exact match
I1113 10:58:29.104042 1 http.go:117] "received request" logger="admission" object="openshift-monitoring/cluster-monitoring-config" namespace="openshift-monitoring" name="cluster-monitoring-config" resource={"group":"","version":"v1","resource":"configmaps"} user="system:admin" requestID="b154b96a-6fe6-4abd-a827-c662d8211719"
I1113 10:58:29.104687 1 http.go:163] "wrote response" logger="admission" code=403 reason="Forbidden" message="failed to parse data at key \"config.yaml\": error unmarshaling JSON: while decoding JSON: json: unknown field \"telemeterCliennt\"" requestID="b154b96a-6fe6-4abd-a827-c662d8211719" allowed=false
I1113 10:58:29.104762 1 httplog.go:134] "HTTP" verb="POST" URI="/validate-webhook/monitoringconfigmaps?timeout=5s" latency="1.42784ms" userAgent="kube-apiserver-admission" audit-ID="2570bcda-55eb-44f6-b319-5e29d58ad3f0" srcIP="10.128.0.2:48220" resp=200
```
Cross-referencing these with the apiserver logs should provide detailed insights on a per-request basis.
## Alternatives
Wait for CRD-based configs to be GA.
> **Reviewer:** I'd like to see more links to this alternative within the document (so those who aren't familiar can find the other EP), and you should also expand on why this isn't the route we are taking, why this alternative has been dismissed. I've left questions on this earlier and, as an outsider, have no context on why we aren't doing this; explain it to me in this section.

## Infrastructure Needed [optional]