Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export storageconsumer data as metrics and alerts #2227

Merged
merged 1 commit into from
Oct 31, 2023

Conversation

leelavg
Copy link
Contributor

@leelavg leelavg commented Oct 23, 2023

  • encode the versions from storageconsumer CR status to comparable numbers
  • export these numbers are metrics
  • create alerts for version incompatbility and client checkin via heartbeats

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 23, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 23, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@leelavg
Copy link
Contributor Author

leelavg commented Oct 23, 2023

only last commit is reviewable for now.

@leelavg
Copy link
Contributor Author

leelavg commented Oct 23, 2023

@aruniiird, @jmolmo could you pls review last commit in this PR?

for _, storageConsumer := range storageConsumers {
ch <- prometheus.MustNewConstMetric(c.StorageConsumerMetadata,
prometheus.GaugeValue, 1,
storageConsumer.Name,
string(storageConsumer.Status.State))

ch <- prometheus.MustNewConstMetric(c.LastHeartbeat,
prometheus.GaugeValue, float64(storageConsumer.Status.LastHeartbeat.Time.Unix()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a heartbeat based on a timestamp would be of type "counter" instead "gauge". Do you expect to have decrements on this metric?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There'll be no decrements, however _count and _sum derived metrics from a counter type doesn't mean anything in this context and so I used Gauge to represent the instant value with no correlation to previous values.

Copy link
Member

@jmolmo jmolmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably you need to replace the var in the "critical" alert

Please add documentation about the new alerts in the runbooks repo.
Here you have an example for CephClusterCriticallyFull alert

Besides that, Alerts in OpenShift will ship a link to the corresponding runbook in this repository, to make fixing problems even easier. Link should be passed as runbook_url field in alert annotations
See the example

metrics/mixin/alerts/storage-client.libsonnet Outdated Show resolved Hide resolved
@leelavg
Copy link
Contributor Author

leelavg commented Oct 25, 2023

Please add documentation about the new alerts in the runbooks repo.

  • @jmolmo, can this be deferred to a later point, ie, after 4.14 and before 4.14.z as we require the PR to be feature complate
  • And also we aren't merging to main as this will be a custom build until release and I'll definitely raise PRs for the runbooks after this PR gets merged as I don't want to update at two places in parallel if we require changes to the alerts wrt pr comments

@leelavg leelavg force-pushed the csi-metrics-1 branch 7 times, most recently from ade1461 to 02822e3 Compare October 30, 2023 09:47
Copy link
Contributor

@umangachapagain umangachapagain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of unrelated changes in the alert yaml files, please remove it from commit.
Looks good otherwise.

@leelavg
Copy link
Contributor Author

leelavg commented Oct 30, 2023

Lots of unrelated changes in the alert yaml files, please remove it from commit. Looks good otherwise.

@jmolmo
Copy link
Member

jmolmo commented Oct 30, 2023

Lots of unrelated changes in the alert yaml files, please remove it from commit. Looks good otherwise.

I think, everything will be more clear if we merge first:
#2225

@umangachapagain can you review it and approve if it is ok?

@leelavg
Copy link
Contributor Author

leelavg commented Oct 30, 2023

@jmolmo both the branches differ and for now if they are generated during build time then this PR isn't dependent on any other.

@jmolmo
Copy link
Member

jmolmo commented Oct 30, 2023

@jmolmo both the branches differ and for now if they are generated during build time then this PR isn't dependent on any other.

It seems that the modifications in the rules file are not generated during the build, you need to execute explicitly the command "make gen-latest-prometheus-rules-yamls":
#2225 is set because it was forgotten to execute it.
If you want to add your rules, you will need also to execute the make command an you will generate your modifications and also the modifications in #2225.
This is why I say that in order to make thing easier for ourselves in the future, probably #2225 should be merged first.

@leelavg leelavg changed the title export storageconsumer data as metrics export storageconsumer data as metrics and alerts Oct 31, 2023
@leelavg leelavg marked this pull request as ready for review October 31, 2023 11:40
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 31, 2023
@leelavg
Copy link
Contributor Author

leelavg commented Oct 31, 2023

It seems that the modifications in the rules file are not generated during the build

  • ack, got it. Here we are in a different situation, Ran pending make command and fix obc alerts issue #2225 was raised for a different branch, for the HCI effort pls consider the current base branch (fusion-hci-4.14) is entirely different
  • so I added the changes generated after following README to this PR itself
  • when we take in changes from main, these changes shouldn't conflict (mostly) as they are generated by same tooling

  • I've rebased on the base branch and this PR can receive your votes for the merge if you are fine with it @jmolmo @umangachapagain

thanks.

metrics/deploy/prometheus-ocs-rules.yaml Outdated Show resolved Hide resolved
metrics/deploy/prometheus-ocs-rules.yaml Outdated Show resolved Hide resolved
metrics/deploy/prometheus-ocs-rules.yaml Outdated Show resolved Hide resolved
metrics/deploy/prometheus-ocs-rules.yaml Outdated Show resolved Hide resolved
metrics/deploy/rbac.yaml Show resolved Hide resolved
- encode the versions from storageconsumer CR status to comparable numbers
- export these numbers are metrics
- create alerts for version incompatbility and client checkin via heartbeats

Signed-off-by: Leela Venkaiah G <[email protected]>
Comment on lines +78 to +83
resources:
- clusterversions
verbs:
- get
- list
- watch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are still not added to the other file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, already added to exporter-role.yaml

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: leelavg, umangachapagain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2023
@openshift-ci openshift-ci bot merged commit c4698e0 into red-hat-storage:fusion-hci-4.14 Oct 31, 2023
6 checks passed
@leelavg leelavg deleted the csi-metrics-1 branch December 11, 2024 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants