
Unify direct and caching RuleStores in ruler #9434

Merged · 1 commit merged into main on Oct 4, 2024

Conversation

@56quarters (Contributor) commented Sep 26, 2024

What this PR does

Instead of using two different RuleStore implementations within the Ruler, use a single caching implementation and selectively disable caching when required.

This change removes the "direct" RuleStore implementation from the Ruler's gRPC and HTTP API layers. Instead, the caching implementation is used for all calls. In cases where caching returning stale results would not be acceptable, the caching is disabled just for that call.

This allows rule group contents to be safely cached, with the understanding that they will be correctly invalidated when deleted or modified.

Which issue(s) this PR fixes or relates to

Part of #9386

Notes to reviewers

I had a bit of trouble getting my head around this change. I've found the following to be a good way of thinking about it:

  • We allow Bucket.Iter() calls to be cached already but use a non-caching bucket in some places where that's not acceptable.
  • This change removes the non-caching bucket and selectively disables caching where stale Bucket.Iter() results are not acceptable. The point of using a single bucket is to allow caches to be invalidated on mutations (uploads, deletes).
  • This allows us to add caching to rule group contents in the future but does not configure caching for them yet.
  • When we add caching to rule group contents, it will be unconditional because cached rule group contents can be easily invalidated when the rule group is deleted or modified. Invalidating Bucket.Iter() calls is not so easy which is why this change adds the ability to disable caching of them.
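The per-call cache-disable option described above can be sketched with Go's functional options pattern. This is a minimal illustration only: the names `Option`, `callConfig`, and the bodies of `WithCacheDisabled` and `ListAllUsers` are assumptions for the sketch, not Mimir's actual implementation.

```go
package main

import "fmt"

// Option configures a single RuleStore call.
type Option func(*callConfig)

type callConfig struct {
	cacheDisabled bool
}

// WithCacheDisabled makes a single call bypass the cache and read
// directly from object storage (e.g. when stale Iter() results are
// not acceptable).
func WithCacheDisabled() Option {
	return func(c *callConfig) { c.cacheDisabled = true }
}

// ListAllUsers applies per-call options before deciding whether
// cached Iter() results may be used.
func ListAllUsers(opts ...Option) string {
	cfg := &callConfig{}
	for _, o := range opts {
		o(cfg)
	}
	if cfg.cacheDisabled {
		return "listing users directly from object storage"
	}
	return "listing users via cached Iter() results"
}

func main() {
	fmt.Println(ListAllUsers())
	fmt.Println(ListAllUsers(WithCacheDisabled()))
}
```

The point of the pattern is that a single caching store serves every caller, and only the callers that need strong consistency pay the cost of going to object storage.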

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

opts = append(opts, rulestore.WithCacheDisabled())
}

users, err := r.store.ListAllUsers(ctx, opts...)
@56quarters (Contributor, author) commented on this line:

Passing the cache/don't-cache option here doesn't actually affect anything, since we specifically configure tenant listing calls not to be cached, but it doesn't hurt to make that explicit in case we change the config in the future.

@@ -294,7 +294,7 @@ std.manifestYamlDoc({

memcached:: {
memcached: {
image: 'memcached:1.6.19-alpine',
image: 'memcached:1.6.28-alpine',
@56quarters (Contributor, author) commented on this line:

Unrelated, but I needed to be able to watch deletions, which is only available in a newer memcached version; this also matches what we run in production.

@56quarters 56quarters force-pushed the 56quarters/ruler-caching branch 3 times, most recently from f6e69a2 to c1b4c85 Compare September 30, 2024 20:09
@56quarters 56quarters marked this pull request as ready for review September 30, 2024 20:09
@56quarters 56quarters requested review from a team as code owners September 30, 2024 20:09
@pracucci (Collaborator) left a comment:

I remember there was a reason why I picked two cache implementations instead of a single one with caching selectively disabled on a per-request basis (I remember considering that option but finding an issue with it). Unfortunately, I don't remember what the issue was. Try looking at nested calls. I don't want to block on this if you can't find a good reason to keep the split, but it's something I considered at the time.

@56quarters (Contributor, author) replied:
I remember there was a reason why I picked two cache implementations instead of a single one with caching selectively disabled on a per-request basis (I remember considering that option but finding an issue with it). Unfortunately, I don't remember what the issue was. Try looking at nested calls. I don't want to block on this if you can't find a good reason to keep the split, but it's something I considered at the time.

I have been looking at the original PR that split the rule store implementations (#4950) and reading some discussion between you and @pstibrany. From what I can tell, we ended up with the current design (split store implementations) because:

  • There was opposition to the original RuleStore.WithCache() design.
  • The ruler API is expected to be consistent and reflect changes made immediately.
  • The ruler itself is permitted to sometimes return stale results. We specifically allow stale rule group list calls during periodic syncs from object storage or syncs triggered by the ruler ring changing.
  • Cache invalidation is difficult and likely to have edge cases and so we wanted to avoid it.
  • GEM has a RuleStore implementation for self-monitoring which is a consideration.

I believe this design satisfies most of our requirements:

There was opposition to the original RuleStore.WithCache() design.

This change adds per-method options to disable caching when required, no .WithCache() needed.

The ruler API is expected to be consistent and reflect changes made immediately.

The ruler API will usually reflect changes immediately, due to a combination of disabling caching for rule group listing and the way the invalidation added in #9387 works: we invalidate the cache of rule group contents after successfully writing the updated rule group to object storage (or deleting it entirely), and we invalidate by deleting the cache entries rather than attempting to modify the cached contents.
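The write-then-delete invalidation order can be illustrated with a toy in-memory sketch (the `store` type and its methods here are illustrative only, not Mimir's actual `CachingBucket`):

```go
package main

import "fmt"

// store pairs a toy object store with a toy cache to show the
// invalidation order: persist to object storage first, then
// invalidate the cache entry by deleting it, never by rewriting
// the cached value.
type store struct {
	objects map[string]string
	cache   map[string]string
}

func (s *store) SetRuleGroup(key, content string) {
	s.objects[key] = content // 1. write to object storage
	delete(s.cache, key)     // 2. invalidate by deletion
}

func (s *store) GetRuleGroup(key string) string {
	if v, ok := s.cache[key]; ok {
		return v // cache hit
	}
	v := s.objects[key]
	s.cache[key] = v // fill cache on read
	return v
}

func main() {
	s := &store{objects: map[string]string{}, cache: map[string]string{}}
	s.SetRuleGroup("tenant/group1", "v1")
	fmt.Println(s.GetRuleGroup("tenant/group1")) // cache miss, reads v1
	s.SetRuleGroup("tenant/group1", "v2")        // cache entry deleted
	fmt.Println(s.GetRuleGroup("tenant/group1")) // reads fresh v2
}
```

Because the cache entry is deleted after every successful write, a subsequent read always refills from object storage and observes the update.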

The ruler itself is permitted to sometimes return stale results. We specifically allow stale rule group list calls during periodic syncs from object storage or syncs triggered by the ruler ring changing.

We preserve the ability to disable caching for some ruler syncs but not others.

Cache invalidation is difficult and likely to have edge cases and so we wanted to avoid it.

This PR does not start caching rule groups, but I plan to cache them in a future PR. It's true that there is a potential race condition when the same rule group is updated and read at the same time by different requests whose execution is interleaved: an object may be absent from the cache, fetched from object storage by request A, updated in object storage by request B, invalidated in the cache (a no-op) by request B, and then stored in the cache by request A (now stale). This will be mitigated to some extent by the short TTL used for the ruler cache. I do not have a perfect solution for this, but I think it is an acceptable tradeoff to reduce the number of object storage operations performed.
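The interleaving described above can be replayed with a toy in-memory cache. This is an illustration of the race, not real Mimir code; `raceOutcome` and its maps are invented for the sketch.

```go
package main

import "fmt"

// raceOutcome replays the interleaving: request A fetches from object
// storage, request B updates the object and invalidates the cache,
// then A fills the cache with its now-stale fetch. It returns the
// final storage and cache values.
func raceOutcome() (storage, cached string) {
	objects := map[string]string{"group": "v1"}
	cache := map[string]string{}

	fetched := objects["group"] // A: cache miss, fetches v1

	objects["group"] = "v2" // B: updates object storage
	delete(cache, "group")  // B: invalidates cache (no-op, nothing cached yet)

	cache["group"] = fetched // A: caches its stale fetch

	return objects["group"], cache["group"]
}

func main() {
	storage, cached := raceOutcome()
	fmt.Println("storage:", storage) // v2
	fmt.Println("cache:  ", cached)  // v1: stale until the TTL expires
}
```

The stale entry persists only until the cache TTL expires, which is why a short TTL bounds the damage.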

GEM has a RuleStore implementation for self-monitoring which is a consideration.

The RuleStore in GEM for self-monitoring is pretty simple and shouldn't be hard to adapt to work with these changes. The only issue in GEM is the usual issue of duplicate API code because of support for the remote write ruler.

@56quarters (Contributor, author) added:
Cache invalidation is difficult and likely to have edge cases and so we wanted to avoid it.

I do not have a perfect solution for this but I think this acceptable enough as a tradeoff to reduce the number of object storage operations performed.

Actually, I think this can be mostly mitigated with some clever use of the Memcached add command to prevent stale values from being cached immediately after an update to a rule group. With changes to dskit, this could be done entirely within the CachingBucket implementation. Since this requires dskit changes, I'd like to do it as a follow-up to this PR.
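One possible shape of that mitigation, sketched here under the assumption of a cache client exposing memcached's add semantics (store only if the key is absent). The `cache` type, the sentinel value, and the exact flow are illustrative guesses at the idea, not the actual dskit change:

```go
package main

import "fmt"

// cache is a toy client exposing memcached-style Set and Add.
type cache struct{ entries map[string]string }

// Add stores the value only if the key is not already present,
// mirroring memcached's add command. Returns true if stored.
func (c *cache) Add(key, value string) bool {
	if _, ok := c.entries[key]; ok {
		return false
	}
	c.entries[key] = value
	return true
}

// Set unconditionally stores the value (memcached's set command).
func (c *cache) Set(key, value string) { c.entries[key] = value }

func main() {
	c := &cache{entries: map[string]string{}}

	// Request A fetches "v1" from object storage (nothing cached yet).
	stale := "v1"

	// Request B updates the object to "v2" and, instead of only
	// deleting the cache entry, writes a short-TTL sentinel under it.
	c.Set("group", "sentinel:recently-updated")

	// Request A tries to cache its fetch with Add: the sentinel is
	// present, so the add fails and the stale value is never cached.
	fmt.Println(c.Add("group", stale)) // false: stale fill rejected
}
```

Because cache fills use add rather than set, a reader holding a pre-update value cannot clobber the sentinel left by the writer, closing the window described in the race above.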

56quarters added a commit that referenced this pull request Oct 3, 2024
Proof-of-concept for discussion as part of:

#9434

grafana/dskit#591

Signed-off-by: Nick Pillitteri <[email protected]>
Instead of using two different `RuleStore` implementations within the Ruler,
use a single caching implementation and selectively disable caching when
required.

This change removes the "direct" `RuleStore` implementation from the Ruler's
gRPC and HTTP API layers. Instead, the caching implementation is used for all
calls. In cases where caching returning stale results would not be acceptable,
the caching is disabled _just_ for that call.

This allows rule group contents to be safely cached, with the understanding
that they will be correctly invalidated when deleted or modified.

Part of #9386

Signed-off-by: Nick Pillitteri <[email protected]>
@56quarters 56quarters force-pushed the 56quarters/ruler-caching branch from c1b4c85 to d6abaff Compare October 3, 2024 20:20
@56quarters (Contributor, author) commented Oct 3, 2024:

I've put a PR together that shows how the "locking" might work based on some changes to dskit:

Mimir: #9516

dskit: grafana/dskit#591

@pstibrany (Member) left a comment:

Looks good, thank you. From what I can see, this is a refactoring without any change in logic (right?).

@56quarters (Contributor, author) replied:
Looks good, thank you. From what I can see, this is a refactoring without any change in logic (right?).

Yes, there should not be any changes in logic. This change does not start caching rule groups; it just continues to cache .Iter() results, but disables that caching in all the places where we previously used the non-caching RuleStore.

@56quarters 56quarters merged commit 5c89d3e into main Oct 4, 2024
29 checks passed
@56quarters 56quarters deleted the 56quarters/ruler-caching branch October 4, 2024 15:24