
Unify direct and caching RuleStores in ruler #9434

Merged · 1 commit merged into main on Oct 4, 2024

Conversation

@56quarters (Contributor) commented Sep 26, 2024

What this PR does

Instead of using two different RuleStore implementations within the Ruler, use a single caching implementation and selectively disable caching when required.

This change removes the "direct" RuleStore implementation from the Ruler's gRPC and HTTP API layers. Instead, the caching implementation is used for all calls. In cases where caching returning stale results would not be acceptable, the caching is disabled just for that call.

This allows rule group contents to be safely cached, with the understanding that they will be correctly invalidated when deleted or modified.

Which issue(s) this PR fixes or relates to

Part of #9386

Notes to reviewers

I had a bit of trouble getting my head around this change. I've found the following to be a good way of thinking about it:

  • We allow Bucket.Iter() calls to be cached already but use a non-caching bucket in some places where that's not acceptable.
  • This change removes the non-caching bucket and selectively disables caching where stale Bucket.Iter() results are not acceptable. The point of using a single bucket is to allow caches to be invalidated on mutations (uploads, deletes).
  • This allows us to add caching to rule group contents in the future but does not configure caching for them yet.
  • When we add caching to rule group contents, it will be unconditional because cached rule group contents can be easily invalidated when the rule group is deleted or modified. Invalidating Bucket.Iter() calls is not so easy which is why this change adds the ability to disable caching of them.
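The per-call cache-disable option described above can be sketched with Go's functional options pattern. This is a minimal illustration only: the names `Option`, `callConfig`, and the bodies of `WithCacheDisabled` and `ListAllUsers` are assumptions for the sketch, not Mimir's actual implementation.

```go
package main

import "fmt"

// Option configures a single RuleStore call.
type Option func(*callConfig)

type callConfig struct {
	cacheDisabled bool
}

// WithCacheDisabled makes a single call bypass the cache and read
// directly from object storage (e.g. when stale Iter() results are
// not acceptable).
func WithCacheDisabled() Option {
	return func(c *callConfig) { c.cacheDisabled = true }
}

// ListAllUsers applies per-call options before deciding whether
// cached Iter() results may be used.
func ListAllUsers(opts ...Option) string {
	cfg := &callConfig{}
	for _, o := range opts {
		o(cfg)
	}
	if cfg.cacheDisabled {
		return "listing users directly from object storage"
	}
	return "listing users via cached Iter() results"
}

func main() {
	fmt.Println(ListAllUsers())
	fmt.Println(ListAllUsers(WithCacheDisabled()))
}
```

The point of the pattern is that a single caching store serves every caller, and only the callers that need strong consistency pay the cost of going to object storage.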

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

opts = append(opts, rulestore.WithCacheDisabled())
}

users, err := r.store.ListAllUsers(ctx, opts...)
@56quarters (Contributor, author) commented on this line:

Passing the cache/don't-cache option here doesn't actually affect anything, since we specifically configure tenant listing calls not to be cached, but it doesn't hurt to make that explicit in case we change the config in the future.

@@ -294,7 +294,7 @@ std.manifestYamlDoc({

memcached:: {
memcached: {
image: 'memcached:1.6.19-alpine',
image: 'memcached:1.6.28-alpine',
@56quarters (Contributor, author) commented on this line:

Unrelated, but I needed to be able to watch deletions, which is only available in a newer memcached version; this also matches what we run in production.

@56quarters 56quarters force-pushed the 56quarters/ruler-caching branch 3 times, most recently from f6e69a2 to c1b4c85 Compare September 30, 2024 20:09
@56quarters 56quarters marked this pull request as ready for review September 30, 2024 20:09
@56quarters 56quarters requested review from a team as code owners September 30, 2024 20:09
@pracucci (Collaborator) left a comment:

I remember there was a reason why I picked two cache implementations instead of a single one with caching selectively disabled on a per-request basis (I remember considering that option but finding an issue with it). Unfortunately, I don't remember what the issue was. Try looking at nested calls. I don't want to block on this if you can't find a good reason to keep the split, but it's something I considered at the time.

@56quarters (Contributor, author) replied:
I remember there was a reason why I picked two cache implementations instead of a single one with caching selectively disabled on a per-request basis (I remember considering that option but finding an issue with it). Unfortunately, I don't remember what the issue was. Try looking at nested calls. I don't want to block on this if you can't find a good reason to keep the split, but it's something I considered at the time.

I have been looking at the original PR that split the rule store implementations (#4950) and reading some discussion between you and @pstibrany. From what I can tell, we ended up with the current design (split store implementations) because:

  • There was opposition to the original RuleStore.WithCache() design.
  • The ruler API is expected to be consistent and reflect changes made immediately.
  • The ruler itself is permitted to sometimes return stale results. We specifically allow stale rule group list calls during periodic syncs from object storage or syncs triggered by the ruler ring changing.
  • Cache invalidation is difficult and likely to have edge cases and so we wanted to avoid it.
  • GEM has a RuleStore implementation for self-monitoring which is a consideration.

I believe this design satisfies most of our requirements:

There was opposition to the original RuleStore.WithCache() design.

This change adds per-method options to disable caching when required, no .WithCache() needed.

The ruler API is expected to be consistent and reflect changes made immediately.

The ruler API will usually reflect changes immediately, due to a combination of disabling caching for rule group listing and the way the invalidation added in #9387 works: we invalidate the cache of rule group contents after successfully writing the updated rule group to object storage (or deleting it entirely), and we invalidate by deleting the cache entries rather than attempting to modify the cached contents.
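The write-then-delete invalidation order can be illustrated with a toy in-memory sketch (the `store` type and its methods here are illustrative only, not Mimir's actual `CachingBucket`):

```go
package main

import "fmt"

// store pairs a toy object store with a toy cache to show the
// invalidation order: persist to object storage first, then
// invalidate the cache entry by deleting it, never by rewriting
// the cached value.
type store struct {
	objects map[string]string
	cache   map[string]string
}

func (s *store) SetRuleGroup(key, content string) {
	s.objects[key] = content // 1. write to object storage
	delete(s.cache, key)     // 2. invalidate by deletion
}

func (s *store) GetRuleGroup(key string) string {
	if v, ok := s.cache[key]; ok {
		return v // cache hit
	}
	v := s.objects[key]
	s.cache[key] = v // fill cache on read
	return v
}

func main() {
	s := &store{objects: map[string]string{}, cache: map[string]string{}}
	s.SetRuleGroup("tenant/group1", "v1")
	fmt.Println(s.GetRuleGroup("tenant/group1")) // cache miss, reads v1
	s.SetRuleGroup("tenant/group1", "v2")        // cache entry deleted
	fmt.Println(s.GetRuleGroup("tenant/group1")) // reads fresh v2
}
```

Because the cache entry is deleted after every successful write, a subsequent read always refills from object storage and observes the update.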

The ruler itself is permitted to sometimes return stale results. We specifically allow stale rule group list calls during periodic syncs from object storage or syncs triggered by the ruler ring changing.

We preserve the ability to disable caching for some ruler syncs but not others.

Cache invalidation is difficult and likely to have edge cases and so we wanted to avoid it.

This PR does not start caching rule groups, but I plan to cache them in a future PR. It's true that there is a potential race condition when the same rule group is updated and read at the same time by different requests whose execution is interleaved: an object may be absent from the cache, fetched from object storage by request A, updated in object storage by request B, invalidated in the cache (a no-op) by request B, and then stored in the cache by request A (now stale). This will be mitigated to some extent by the short TTL used for the ruler cache. I do not have a perfect solution for this, but I think it is an acceptable tradeoff to reduce the number of object storage operations performed.
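The interleaving described above can be replayed with a toy in-memory cache. This is an illustration of the race, not real Mimir code; `raceOutcome` and its maps are invented for the sketch.

```go
package main

import "fmt"

// raceOutcome replays the interleaving: request A fetches from object
// storage, request B updates the object and invalidates the cache,
// then A fills the cache with its now-stale fetch. It returns the
// final storage and cache values.
func raceOutcome() (storage, cached string) {
	objects := map[string]string{"group": "v1"}
	cache := map[string]string{}

	fetched := objects["group"] // A: cache miss, fetches v1

	objects["group"] = "v2" // B: updates object storage
	delete(cache, "group")  // B: invalidates cache (no-op, nothing cached yet)

	cache["group"] = fetched // A: caches its stale fetch

	return objects["group"], cache["group"]
}

func main() {
	storage, cached := raceOutcome()
	fmt.Println("storage:", storage) // v2
	fmt.Println("cache:  ", cached)  // v1: stale until the TTL expires
}
```

The stale entry persists only until the cache TTL expires, which is why a short TTL bounds the damage.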

GEM has a RuleStore implementation for self-monitoring which is a consideration.

The RuleStore in GEM for self-monitoring is pretty simple and shouldn't be hard to adapt to work with these changes. The only issue in GEM is the usual issue of duplicate API code because of support for the remote write ruler.

@56quarters (Contributor, author) added:
Cache invalidation is difficult and likely to have edge cases and so we wanted to avoid it.

I do not have a perfect solution for this but I think this acceptable enough as a tradeoff to reduce the number of object storage operations performed.

Actually, I think this can be mostly mitigated with some clever use of the Memcached add command to prevent stale values from being cached immediately after an update to a rule group. With changes to dskit, this could be done entirely within the CachingBucket implementation. Since this requires dskit changes, I'd like to do it as a follow-up to this PR.
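One possible shape of that mitigation, sketched here under the assumption of a cache client exposing memcached's add semantics (store only if the key is absent). The `cache` type, the sentinel value, and the exact flow are illustrative guesses at the idea, not the actual dskit change:

```go
package main

import "fmt"

// cache is a toy client exposing memcached-style Set and Add.
type cache struct{ entries map[string]string }

// Add stores the value only if the key is not already present,
// mirroring memcached's add command. Returns true if stored.
func (c *cache) Add(key, value string) bool {
	if _, ok := c.entries[key]; ok {
		return false
	}
	c.entries[key] = value
	return true
}

// Set unconditionally stores the value (memcached's set command).
func (c *cache) Set(key, value string) { c.entries[key] = value }

func main() {
	c := &cache{entries: map[string]string{}}

	// Request A fetches "v1" from object storage (nothing cached yet).
	stale := "v1"

	// Request B updates the object to "v2" and, instead of only
	// deleting the cache entry, writes a short-TTL sentinel under it.
	c.Set("group", "sentinel:recently-updated")

	// Request A tries to cache its fetch with Add: the sentinel is
	// present, so the add fails and the stale value is never cached.
	fmt.Println(c.Add("group", stale)) // false: stale fill rejected
}
```

Because cache fills use add rather than set, a reader holding a pre-update value cannot clobber the sentinel left by the writer, closing the window described in the race above.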

56quarters added a commit that referenced this pull request Oct 3, 2024
Proof-of-concept for discussion as part of:

#9434

grafana/dskit#591

Signed-off-by: Nick Pillitteri <[email protected]>
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safely cached, with the understanding
that they will be correctly invalidated when deleted or modified.

Part of #9386

Signed-off-by: Nick Pillitteri <[email protected]>
@56quarters 56quarters force-pushed the 56quarters/ruler-caching branch from c1b4c85 to d6abaff Compare October 3, 2024 20:20
@56quarters (Contributor, author) commented Oct 3, 2024:

I've put a PR together that shows how the "locking" might work based on some changes to dskit:

Mimir: #9516

dskit: grafana/dskit#591

@pstibrany (Member) left a comment:

Looks good, thank you. From what I can see, this is a refactoring without any change in logic (right?).

@56quarters (Contributor, author) replied:
Looks good, thank you. From what I can see, this is a refactoring without any change in logic (right?).

Yes, there should not be any changes in logic. This change does not start caching rule groups; it just continues to cache .Iter() results, but disables that caching in all the places where we previously used the non-caching RuleStore.

@56quarters 56quarters merged commit 5c89d3e into main Oct 4, 2024
29 checks passed
@56quarters 56quarters deleted the 56quarters/ruler-caching branch October 4, 2024 15:24