
Don't propagate cancel signal to the Prometheus rules manager context #6326

Merged

Conversation

rapphil
Contributor

@rapphil rapphil commented Nov 8, 2024

What this PR does:

  • Allow rules that are being evaluated to successfully complete or time out when rulers are shut down. Before this PR, we canceled the context right away, wasting all the work performed by the queries.

This is done by removing the cancel signal from the context passed to the Prometheus rules manager. Instead, we rely on the rules manager's shutdown mechanism and on the timeout of the queries executed by the rules.

In the unit tests I avoided adding sleeps; instead, I used channels to coordinate the execution of the queries and simulate the problem.

Checklist

  • Tests updated
  • Documentation added - N.A.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@dosubot dosubot bot added the component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. label Nov 8, 2024
@@ -341,11 +341,15 @@ func DefaultTenantManagerFactory(cfg Config, p Pusher, q storage.Queryable, engi
queryFunc = metricsQueryFunc
}

// We let the Prometheus rules manager control the context so that there is a chance
// for graceful shutdown of rules that are still in execution even in case the cortex context is canceled.
prometheusContext := user.InjectOrgID(context.WithoutCancel(ctx), userID)
Member
til! :D

Contributor

TIL

Contributor Author

`context.WithoutCancel` was added in Go 1.21.0

@alanprot
Member

alanprot commented Nov 8, 2024

LGTM!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Nov 8, 2024
This change allows rules that are still executing queries to complete
before Cortex is fully shut down.

Signed-off-by: Raphael Silva <[email protected]>
@rapphil rapphil force-pushed the rapphil-avoid-context-cancelation-ruler branch from 65bc9e8 to 6f53f67 Compare November 8, 2024 20:09
Use atomic counter to keep track of the successful queries

Signed-off-by: Raphael Silva <[email protected]>
@rapphil rapphil force-pushed the rapphil-avoid-context-cancelation-ruler branch from 69a93db to bd038d5 Compare November 8, 2024 23:57
@yeya24 yeya24 merged commit 10d1517 into cortexproject:master Nov 9, 2024
14 of 16 checks passed
@rapphil rapphil deleted the rapphil-avoid-context-cancelation-ruler branch November 9, 2024 00:27
CharlieTLe pushed a commit to CharlieTLe/cortex that referenced this pull request Dec 3, 2024
…cortexproject#6326)

* Don't propagate cancel signal to the Prometheus rules manager context

This change allows rules that are still executing queries to complete
before Cortex is fully shut down.

Signed-off-by: Raphael Silva <[email protected]>

* Make ruler unit tests run faster

Signed-off-by: Raphael Silva <[email protected]>

* Avoid test failures due to a race condition

Use atomic counter to keep track of the successful queries

Signed-off-by: Raphael Silva <[email protected]>

---------

Signed-off-by: Raphael Silva <[email protected]>
Labels
component/rules Bits & bobs todo with rules and alerts: the ruler, config service etc. lgtm This PR has been approved by a maintainer size/L
Projects
None yet
Development

4 participants