add MimirContinuousTestFailingOnWrites and MimirContinuousTestFailing… #1355
helm/prometheus-rules/templates/platform/atlas/alerting-rules/mimir.rules.yml
This is not a draft anymore, I repeat, this is not a draft anymore.
So, we're testing the rate of failures, which is fine.
But the continuous tests also provide counters of reads and writes. Should we alert on those as well?
For instance, if we can't write data at all, the failure count won't increase. But the "continuous test writes" counter won't increase either, which should probably raise an alert?
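A minimal sketch of the kind of rule this comment suggests, not taken from the PR. It assumes the continuous-test tool exposes a `mimir_continuous_test_writes_total` counter; the alert name, group name, time windows, and severity label are all hypothetical.

```yaml
groups:
  - name: mimir.continuous-test.sketch   # hypothetical group name
    rules:
      # Hypothetical alert: fires when the write counter stops increasing
      # or is not scraped at all, which a failure-rate alert cannot detect.
      - alert: MimirContinuousTestNotWriting   # hypothetical name
        expr: |
          sum(rate(mimir_continuous_test_writes_total[15m])) == 0
          or
          absent(mimir_continuous_test_writes_total)
        for: 30m            # assumption: tune to the test's write interval
        labels:
          severity: page    # assumption
```

The `absent(...)` branch covers the case where the series is missing entirely, since `rate(...) == 0` returns nothing when there are no samples.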
LGTM.
Note (to self): I'm not happy about the duplicate tests for CAPI. CAPI should now be the default target for tests, and vintage/aws the exception.
Towards: giantswarm/roadmap#3578
As discussed in the above issue, this PR adds 2 alerts based on the metrics of Mimir's continuous-test component, so that we get alerted when something is wrong in the read or write path. Those 2 alerts are taken directly from the upstream mixins, as explained in the comments.
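For illustration, a write-path failure alert of this kind could look roughly like the sketch below. This is not the rule from the PR: it assumes the `mimir_continuous_test_writes_failed_total` counter and the `cluster`, `namespace`, and `test` labels; the group name, `for` duration, and severity label are placeholders.

```yaml
groups:
  - name: mimir.continuous-test.sketch   # hypothetical group name
    rules:
      # Fires when the continuous test reports any failed writes
      # over a sustained period.
      - alert: MimirContinuousTestFailingOnWrites
        expr: |
          sum by (cluster, namespace, test) (
            rate(mimir_continuous_test_writes_failed_total[5m])
          ) > 0
        for: 1h             # assumption
        labels:
          severity: page    # assumption
```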
Before merging this PR, I still need to add unit tests and create a dedicated dashboard for the continuous-test component.
Checklist
oncall-kaas-cloud
GitHub group).