Unify direct and caching RuleStores in ruler #9434
Conversation
```go
		opts = append(opts, rulestore.WithCacheDisabled())
	}

	users, err := r.store.ListAllUsers(ctx, opts...)
```
Passing the cache/don't cache option here doesn't actually affect anything since we specifically configure tenant listing calls to not be cached but it doesn't hurt to make that explicit here in case we change the config in the future.
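For readers less familiar with the pattern, here is a minimal sketch of how the store side could combine that per-call option with the existing configuration that exempts tenant listing from caching. Everything except the `WithCacheDisabled()` name is made up for illustration and does not reflect Mimir's actual types.

```go
package rulestore

import "context"

// Illustrative types only; Mimir's real option plumbing differs.
type options struct{ cacheDisabled bool }

// Option customises a single call.
type Option func(*options)

// WithCacheDisabled asks for this one call to bypass the cache.
func WithCacheDisabled() Option {
	return func(o *options) { o.cacheDisabled = true }
}

type lister interface {
	ListAllUsers(ctx context.Context) ([]string, error)
}

type cachingRuleStore struct {
	cached          lister // listing through the caching bucket
	direct          lister // listing straight from object storage
	cacheTenantList bool   // config flag: tenant listing is currently never cached
}

// ListAllUsers bypasses the cache if either the configuration says tenant
// listings are not cached (the current setup described above) or the caller
// explicitly opted out for this call.
func (s *cachingRuleStore) ListAllUsers(ctx context.Context, opts ...Option) ([]string, error) {
	var o options
	for _, opt := range opts {
		opt(&o)
	}
	if !s.cacheTenantList || o.cacheDisabled {
		return s.direct.ListAllUsers(ctx)
	}
	return s.cached.ListAllUsers(ctx)
}
```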
Force-pushed from 3e4ad55 to f7039db.
```diff
@@ -294,7 +294,7 @@ std.manifestYamlDoc({
   memcached:: {
     memcached: {
-      image: 'memcached:1.6.19-alpine',
+      image: 'memcached:1.6.28-alpine',
```
Unrelated, but I needed to be able to watch deletions, which was only available in a newer version, and this matches what we run in production.
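As a side note, the deletion events mentioned here are exposed through memcached's text-protocol `watch` command, with a `deletions` stream available in newer releases. A rough, hypothetical Go snippet for tailing it (assumes a memcached instance on localhost:11211 and omits real error handling):

```go
// Sketch: connect to memcached's text protocol and stream delete events.
// Assumes a memcached version new enough to support the "deletions" watcher,
// which is the reason for the image bump discussed above.
package main

import (
	"bufio"
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("tcp", "localhost:11211")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ask the logger to stream deletion events.
	fmt.Fprint(conn, "watch deletions\r\n")

	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		// The first line should be "OK", followed by one log line per delete.
		fmt.Println(scanner.Text())
	}
}
```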
Force-pushed from f6e69a2 to c1b4c85.
I remember there was a reason why I picked the two cache implementations instead of a single one with the cache selectively disabled on a per-request basis (I remember considering that as an option, but then I found an issue with it). Unfortunately I don't remember what the issue was. Try to look at nested calls. I don't want to block on this if you don't find a good reason why we should keep it disabled, but I remember it's something I considered back then.
I have been looking at the original PR that split the rule store implementations (#4950) and reading some discussion from you and @pstibrany. From what I can tell, we ended up with the current design (split store implementations) because
I believe this design satisfies most of our requirements:
- This change adds per-method options to disable caching when required, no
- The ruler API will usually reflect changes immediately, due to a combination of disabling caching for rule group listings and the way the invalidation added in #9387 works: we invalidate cached rule group contents after the updated rule group has been successfully written to object storage (or deleted completely), and we invalidate by deleting the cache entries rather than attempting to modify cache contents.
- We preserve the ability to disable caching for some ruler syncs but not others.
This PR does not start caching rule groups, but I plan to cache them in a future PR. It's true that there is a potential race condition when the same rule group is being updated and read at the same time by different requests whose execution is interleaved: it's possible that an object is not in the cache, is fetched from object storage by request A, updated in object storage by request B, invalidated in the cache (a no-op) by request B, and then stored in the cache by request A (now stale). This will be mitigated to some extent by the short TTL used for the ruler cache. I don't have a perfect solution for this, but I think it's an acceptable tradeoff to reduce the number of object storage operations performed.
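To make the described flow and race window concrete, here is a minimal sketch of the write-then-invalidate and cache-aside read paths. The interfaces and names are illustrative only, not Mimir's actual code.

```go
package rulestore

import "context"

// Minimal stand-ins for the object-storage bucket and the cache client; the
// real interfaces in Mimir differ, this only illustrates the ordering of calls.
type objectBucket interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Upload(ctx context.Context, key string, data []byte) error
}

type cacheClient interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Set(ctx context.Context, key string, data []byte)
	Delete(ctx context.Context, key string)
}

type cachingRuleStore struct {
	bucket objectBucket
	cache  cacheClient
}

// SetRuleGroup writes the rule group to object storage first and then
// invalidates the cache by deleting the entry; it never rewrites cache
// contents in place.
func (s *cachingRuleStore) SetRuleGroup(ctx context.Context, key string, data []byte) error {
	if err := s.bucket.Upload(ctx, key, data); err != nil {
		return err
	}
	s.cache.Delete(ctx, key) // no-op if the group was never cached
	return nil
}

// GetRuleGroup is a cache-aside read. The race happens when a concurrent
// SetRuleGroup runs between the bucket read and the cache.Set below: its
// Delete finds nothing to remove, and this reader then caches the stale copy,
// which survives only until the short ruler cache TTL expires.
func (s *cachingRuleStore) GetRuleGroup(ctx context.Context, key string) ([]byte, error) {
	if data, ok := s.cache.Get(ctx, key); ok {
		return data, nil
	}
	data, err := s.bucket.Get(ctx, key)
	if err != nil {
		return nil, err
	}
	s.cache.Set(ctx, key, data)
	return data, nil
}
```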
Actually, I think this can be mostly mitigated with some clever use of the Memcached
Force-pushed from c1b4c85 to d6abaff.
Proof-of-concept for discussion as part of: #9434 grafana/dskit#591 Signed-off-by: Nick Pillitteri <[email protected]>
I've put a PR together that shows how the "locking" might work, based on some changes to dskit: Mimir #9516, dskit grafana/dskit#591.
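For context on what such "locking" could look like: one common approach is to write a short-lived tombstone on invalidation and to populate the cache using Memcached-style `add` semantics, which fail if the key already exists, so a reader holding stale data cannot repopulate the cache during the window. The sketch below is purely conceptual with hypothetical names; it is not necessarily how the linked Mimir or dskit changes implement it.

```go
package rulestore

import (
	"bytes"
	"context"
	"time"
)

// Hypothetical cache client with Memcached-style "add" semantics: Add stores a
// value only if the key does not already exist.
type addCapableCache interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Add(ctx context.Context, key string, data []byte, ttl time.Duration) bool // false if the key already exists
	Set(ctx context.Context, key string, data []byte, ttl time.Duration)
}

// tombstone marks a key as recently invalidated.
var tombstone = []byte("invalidated")

// invalidate overwrites the cached entry with a short-lived tombstone instead
// of deleting it, so the key keeps existing for the duration of the race window.
func invalidate(ctx context.Context, c addCapableCache, key string) {
	c.Set(ctx, key, tombstone, 30*time.Second)
}

// lookup treats the tombstone as a cache miss.
func lookup(ctx context.Context, c addCapableCache, key string) ([]byte, bool) {
	data, ok := c.Get(ctx, key)
	if !ok || bytes.Equal(data, tombstone) {
		return nil, false
	}
	return data, true
}

// fillAfterMiss uses Add, which fails while the tombstone (or any other value)
// is present, so a reader holding stale data cannot repopulate the cache right
// after a concurrent invalidation.
func fillAfterMiss(ctx context.Context, c addCapableCache, key string, data []byte) {
	_ = c.Add(ctx, key, data, 5*time.Minute)
}
```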
Looks good, thank you. From what I can see, this is a refactoring without any change in logic (right?).
Yes, there should not be any changes in logic. This change does not start caching rule groups; it just continues to cache
What this PR does
Instead of using two different `RuleStore` implementations within the Ruler, use a single caching implementation and selectively disable caching when required.

This change removes the "direct" `RuleStore` implementation from the Ruler's gRPC and HTTP API layers. Instead, the caching implementation is used for all calls. In cases where returning stale cached results would not be acceptable, caching is disabled just for that call.

This allows rule group contents to be safely cached, with the understanding that it is safe to cache them because they will be correctly invalidated when deleted or modified.
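As a rough illustration of the resulting call pattern (not Mimir's real API; the interface and helper names below are invented for this sketch), every consumer shares the one caching store and opts out of caching only for the calls where stale results are unacceptable:

```go
package ruler

import "context"

// Option and the interface below are invented for this sketch; only the idea
// of per-call cache control comes from the change itself.
type options struct{ cacheDisabled bool }

type Option func(*options)

func WithCacheDisabled() Option {
	return func(o *options) { o.cacheDisabled = true }
}

type RuleStore interface {
	ListRuleGroupsForUser(ctx context.Context, userID string, opts ...Option) ([]string, error)
}

// The config API may be reading back rules a tenant has just written, so it
// disables caching for that one call instead of using a separate store.
func listForAPI(ctx context.Context, store RuleStore, userID string) ([]string, error) {
	return store.ListRuleGroupsForUser(ctx, userID, WithCacheDisabled())
}

// Periodic rule syncs can tolerate slightly stale listings, so they use the
// same store with caching left on.
func listForSync(ctx context.Context, store RuleStore, userID string) ([]string, error) {
	return store.ListRuleGroupsForUser(ctx, userID)
}
```

The point is that staleness tolerance becomes a property of the call site rather than of which store instance happens to be injected.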
Which issue(s) this PR fixes or relates to
Part of #9386
Notes to reviewers
I had a bit of trouble getting my head around this change. I've found the following to be a good way of thinking about it:

- `Bucket.Iter()` calls are already allowed to be cached, but a non-caching bucket is used in some places where that's not acceptable.
- Stale `Bucket.Iter()` results are not acceptable in those places. The point of using a single bucket is to allow caches to be invalidated on mutations (uploads, deletes).
- Invalidating cached `Bucket.Iter()` calls is not so easy, which is why this change adds the ability to disable caching of them.
Checklist

- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.