Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cortex_querier_blocks_consistency_checks_failed_total metric #7752

Merged
merged 2 commits into from
Apr 4, 2024

Conversation

pracucci
Copy link
Collaborator

What this PR does

Earlier this week, during an incident I've noticed that cortex_querier_blocks_consistency_checks_failed_total metric was increased but no consistency check failed. While investigating, I've noticed that it gets tracked for any error querying the store-gateway (or if the query is cancelled), but it should just be tracked if the query fails because of blocks consistency check failed. In this PR I'm fixing it.

Note to reviewers:

  • I've expanded the metrics assertion in TestBlocksStoreQuerier_Select. While doing it, I've noticed that some metrics are tracked differently whether chunks streaming is enabled or not in case of error (because some errors surface at a later stage when chunks streaming is enabled). For this reason, the expected value of some metrics are conditional based on whether chunks streaming is enabled or not.

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@@ -113,6 +113,8 @@ func (c BlocksConsistencyTracker) Check(queriedBlocks []ulid.ULID) (missingBlock
return missingBlocks
}

// Complete should be called once the request tracked by this BlocksConsistencyTracker has been completed.
// This function should NOT be called if the request is canceled or interrupted due to any error.
func (c BlocksConsistencyTracker) Complete() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: I changed the contract of this function. Complete() should be called only if the request was completed, and not cancelled or interrupted because an error occurred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move the logic for determining if the request completed to this method? Callers would pass the error to this method which would decide how to update metrics based on the result. This might make it easier to test this behavior with respect to the metrics emitted. I don't have a preference either way, up to you.

func (c BlocksConsistencyTracker) Complete(err error) {
    // ...
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it at first, but the resulting design was pretty weird. BlocksConsistencyTracker has no knowledge of good/bad errors.

@pracucci pracucci marked this pull request as ready for review March 29, 2024 12:06
@pracucci pracucci requested a review from a team as a code owner March 29, 2024 12:06
Copy link
Contributor

@56quarters 56quarters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A few comments but nothing blocking, this seems good functionality wise.

}

func (e storeConsistencyCheckFailedErr) Error() string {
return fmt.Sprintf("%v. The failed blocks are: %s", globalerror.StoreConsistencyCheckFailed.Message("failed to fetch some blocks"), strings.Join(convertULIDsToString(e.remainingBlocks), " "))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Sprintf("%v. The failed blocks are: %s", globalerror.StoreConsistencyCheckFailed.Message("failed to fetch some blocks"), strings.Join(convertULIDsToString(e.remainingBlocks), " "))
return fmt.Sprintf("%s. The failed blocks are: %s", globalerror.StoreConsistencyCheckFailed.Message("failed to fetch some blocks"), strings.Join(convertULIDsToString(e.remainingBlocks), " "))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1093b93

@@ -113,6 +113,8 @@ func (c BlocksConsistencyTracker) Check(queriedBlocks []ulid.ULID) (missingBlock
return missingBlocks
}

// Complete should be called once the request tracked by this BlocksConsistencyTracker has been completed.
// This function should NOT be called if the request is canceled or interrupted due to any error.
func (c BlocksConsistencyTracker) Complete() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move the logic for determining if the request completed to this method? Callers would pass the error to this method which would decide how to update metrics based on the result. This might make it easier to test this behavior with respect to the metrics emitted. I don't have a preference either way, up to you.

func (c BlocksConsistencyTracker) Complete(err error) {
    // ...
}

// Sort the blocks, so it's easier to test the error strings.
sort.Slice(remainingBlocks, func(i, j int) bool {
return remainingBlocks[i].Compare(remainingBlocks[j]) < 0
})
return fmt.Errorf("%v. The failed blocks are: %s", globalerror.StoreConsistencyCheckFailed.Message("failed to fetch some blocks"), strings.Join(convertULIDsToString(remainingBlocks), " "))

return storeConsistencyCheckFailedErr{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on why this is now returning the value instead of a pointer? This seems to require the extra check in Is() for the pointer and value types.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, improved in 1093b93

pracucci added 2 commits April 4, 2024 17:57
…failed_total metric if query has been canceled or interrued due to any error not related to blocks consistency check failed

Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci force-pushed the fix-consistency-check-metric branch from fd3126b to 1093b93 Compare April 4, 2024 16:05
@pracucci pracucci merged commit b896aed into main Apr 4, 2024
29 checks passed
@pracucci pracucci deleted the fix-consistency-check-metric branch April 4, 2024 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants