-
-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ratelimits: Auto pause zombie clients #7763
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looking good! Most of my comments are quite minor.
I do have one big overall question though: What do we expect to set the burst/count/period to in prod?
Fundamentally, this rate limit does not have the ability to measure the same thing that we measured by hand when doing the first rounds of manual pausing. It cannot measure "has this domain+acct been failing for X months". It can only measure "has this domain+acct failed X times in a row".
In order for the "reset when they succeed" mechanism to make sense, it seems like we need to have a very long period and a very high count. For example, if we were hoping to catch folks who fail 4 times per day, every day for six months, we could set the Period at 180d and the Burst and Count both at 720. But that also means that someone failing at the maximum rate allowed by the existing FailedAuthorizations limit (5 times per hour) would get paused after just 6 days. So what values do we intend to set to try to strike the right balance here?
Okay, I found the discussion of burst/count/period in #7738, and I chatted with Samantha, and now I have a concrete proposal for how this limit should be configured:
This maybe seems wild, but I think the math works out:
Those last two bullet points make it seem like we could set the count even lower (or equivalently, set the period higher), but I think "one freebie per day" is a reasonable place to start. I think this math is highly unintuitive, and should be documented in the code near where the limit is defined. |
I think this is perfectly reasonable. Here's a small table that expands on these estimates:
|
…o-pause-zombie-clients
…usingPerDomainPerAccount
3687ef5
to
97aebfc
Compare
@aarongable tests not running. Issue at limiter.go line 163. Didn't finish writing TestResetAccountPausingLimit test. |
|
I figured it out. I did another merge main and found the two tests that were failing in the github actions CI, but didn't exist on my local branch. |
CPS Compliance Review:
|
Should we increment a metric each time an account is paused? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really clean. I've just got a few comments, mostly around log line formatting and some code that I think can go away.
// AutomaticallyPauseZombieClients configures the RA to automatically track | ||
// limiter to be the authoritative source of rate limiting information for | ||
// automatically pausing clients who systemically fail every validation | ||
// attempt. When disabled, only manually paused accountID:identifier pairs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't parse this sentence. Specifically "to automatically track limiter to be the authoritative source ..." seems to be an editing error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird this looks like some kind of hybrid of what was there and what I put in my suggestion: #7763 (comment)
// countFailedValidation increments the failed authorizations per domain per | ||
// account rate limit. There is no reason to surface errors from this function | ||
// to the Subscriber, spends against this limit are best effort. | ||
func (ra *RegistrationAuthorityImpl) countFailedValidation(ctx context.Context, regId int64, name string) { | ||
// account rate limit. If the AutomaticallyPauseZombieClients feature has been | ||
// enabled, it also increments the failed authorizations for pausing per domain | ||
// per account rate limit. There is no reason to surface errors from this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// countFailedValidation increments the FailedAuthorizationsPerDomainPerAccount. If the AutomaticallyPauseZombieClients feature has been enabled, it also increments the FailedAuthorizationsForPausingPerDomainPerAccountTransaction rate limit
if features.Get().AutomaticallyPauseZombieClients { | ||
txn, err = ra.txnBuilder.FailedAuthorizationsForPausingPerDomainPerAccountTransaction(regId, ident.Value) | ||
if err != nil { | ||
ra.log.Warningf("building rate limit transaction for the %s rate limit: %s", ratelimits.FailedAuthorizationsForPausingPerDomainPerAccount, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this is a copy of some existing code, but I think both line 1833 and this line should instead return an error. I know this function tries to avoid returning errors to the caller, but a failure to build the rate limit transaction represents some sort of internal logic error, and that should become a 500 (which helps ensure it shows up in our metrics, and gets logged in the WFE with some useful context).
Also, now that there are two places within this function where we look for and discard Canceled
/ DeadlineExceeded
errors, it makes more sense to return those to the caller as well, and have the caller look for Canceled
/ DeadlineExceeded
(so we won't have to duplicate logic as much).
}, | ||
}) | ||
if err != nil { | ||
ra.log.Warningf("failed to pause %d/%q: %s", regId, ident.Value, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's another place where we should simply return the error, and let the caller filter out Canceled
/ DeadlineExceeded
if it wants to.
@@ -241,6 +242,12 @@ func NewRegistrationAuthorityImpl( | |||
}) | |||
stats.MustRegister(certCSRMismatch) | |||
|
|||
pauseCounter := prometheus.NewCounterVec(prometheus.CounterOpts{ | |||
Name: "paused_pairs", | |||
Help: "Number of times a pause operation is performed, labeled by paused=[bool], repaused=[bool], grace=[bool]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's too much for the help string here, but it would be good to document what repaused
and grace
mean. Perhaps in a comment either here or where they are incremented?
Alternately, we could remove paused=[bool], repaused=[bool], grace=[bool]
from the help string since that information is available direct from Prometheus, and use some of the saved space to explain the two less obvious labels.
Return an error and do logging in the caller. This adds early returns on a number of error conditions, which can prevent nil pointer dereference in those cases. Also update the description for AutomaticallyPauseZombieClients. Follows up #7763.
FailedAuthorizationsForPausingPerDomainPerAccount
which is incremented each time a client fails a validation.account:identifier
. Further validation attempts will be rejected by the WFE.AutomaticallyPauseZombieClients
, which enables automatic pausing of zombie clients in the RA.paused_pairs{"paused":[bool], "repaused":[bool], "grace":[bool]}
to monitor use of this new functionality.ra_test.go
initAuthorities
to allow accessing the*ratelimits.RedisSource
for checking that the new ratelimit functions as intended.Co-authored-by: @pgporada
Fixes #7738