-
Notifications
You must be signed in to change notification settings - Fork 544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix cortex_querier_blocks_consistency_checks_failed_total metric #7752
Conversation
@@ -113,6 +113,8 @@ func (c BlocksConsistencyTracker) Check(queriedBlocks []ulid.ULID) (missingBlock | |||
return missingBlocks | |||
} | |||
|
|||
// Complete should be called once the request tracked by this BlocksConsistencyTracker has been completed. | |||
// This function should NOT be called if the request is canceled or interrupted due to any error. | |||
func (c BlocksConsistencyTracker) Complete() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: I changed the contract of this function. Complete()
should be called only if the request was completed, and not cancelled or interrupted because an error occurred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to move the logic for determining if the request completed to this method? Callers would pass the error to this method which would decide how to update metrics based on the result. This might make it easier to test this behavior with respect to the metrics emitted. I don't have a preference either way, up to you.
func (c BlocksConsistencyTracker) Complete(err error) {
// ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried it at first, but the resulting design was pretty weird. BlocksConsistencyTracker
has no knowledge of good/bad errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A few comments but nothing blocking, this seems good functionality wise.
} | ||
|
||
func (e storeConsistencyCheckFailedErr) Error() string { | ||
return fmt.Sprintf("%v. The failed blocks are: %s", globalerror.StoreConsistencyCheckFailed.Message("failed to fetch some blocks"), strings.Join(convertULIDsToString(e.remainingBlocks), " ")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Sprintf("%v. The failed blocks are: %s", globalerror.StoreConsistencyCheckFailed.Message("failed to fetch some blocks"), strings.Join(convertULIDsToString(e.remainingBlocks), " ")) | |
return fmt.Sprintf("%s. The failed blocks are: %s", globalerror.StoreConsistencyCheckFailed.Message("failed to fetch some blocks"), strings.Join(convertULIDsToString(e.remainingBlocks), " ")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 1093b93
@@ -113,6 +113,8 @@ func (c BlocksConsistencyTracker) Check(queriedBlocks []ulid.ULID) (missingBlock | |||
return missingBlocks | |||
} | |||
|
|||
// Complete should be called once the request tracked by this BlocksConsistencyTracker has been completed. | |||
// This function should NOT be called if the request is canceled or interrupted due to any error. | |||
func (c BlocksConsistencyTracker) Complete() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to move the logic for determining if the request completed to this method? Callers would pass the error to this method which would decide how to update metrics based on the result. This might make it easier to test this behavior with respect to the metrics emitted. I don't have a preference either way, up to you.
func (c BlocksConsistencyTracker) Complete(err error) {
// ...
}
// Sort the blocks, so it's easier to test the error strings. | ||
sort.Slice(remainingBlocks, func(i, j int) bool { | ||
return remainingBlocks[i].Compare(remainingBlocks[j]) < 0 | ||
}) | ||
return fmt.Errorf("%v. The failed blocks are: %s", globalerror.StoreConsistencyCheckFailed.Message("failed to fetch some blocks"), strings.Join(convertULIDsToString(remainingBlocks), " ")) | ||
|
||
return storeConsistencyCheckFailedErr{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not clear on why this is now returning the value instead of a pointer? This seems to require the extra check in Is()
for the pointer and value types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, improved in 1093b93
…failed_total metric if query has been canceled or interrued due to any error not related to blocks consistency check failed Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
fd3126b
to
1093b93
Compare
What this PR does
Earlier this week, during an incident I've noticed that
cortex_querier_blocks_consistency_checks_failed_total
metric was increased but no consistency check failed. While investigating, I've noticed that it gets tracked for any error querying the store-gateway (or if the query is cancelled), but it should just be tracked if the query fails because of blocks consistency check failed. In this PR I'm fixing it.Note to reviewers:
TestBlocksStoreQuerier_Select
. While doing it, I've noticed that some metrics are tracked differently whether chunks streaming is enabled or not in case of error (because some errors surface at a later stage when chunks streaming is enabled). For this reason, the expected value of some metrics are conditional based on whether chunks streaming is enabled or not.Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.