Key-value based rate limiting #5545
I would recommend we split this ticket apart. Rate limit data is very well suited for key-value stores, which often have Redis-compatible command sets (we probably shouldn't use Redis itself when there are memory-safe alternatives that speak the Redis protocol). OCSP data can be represented reasonably well both as key-value and as tabular data. I think for performance reasons we should consider changes to how OCSP gets stored, too, but it's fundamentally different data, has a lot more tooling already around it that would have to change if we were to change its primary representation, and there's room for something Redis-like being used as a "cache" to unload the DBs without changing the primary source of truth. I would thus like to consider it separately.
As JC said, we should split this. OCSP data in Redis is now fully supported. So let's repurpose this ticket for @beautifulentropy's investigation into creating a new lookaside rate limit service, and the database system that will back that service.
This design seeks to reduce read pressure on our DB by moving rate limit tabulation to a key-value datastore. This PR provides the following:

- (README.md) a short guide to the schemas, formats, and concepts introduced in this PR
- (source.go) an interface for storing, retrieving, and resetting a subscriber bucket
- (name.go) an enumeration of all defined rate limits
- (limit.go) a schema for defining default limits and per-subscriber overrides
- (limiter.go) a high-level API for interacting with key-value rate limits
- (gcra.go) an implementation of the Generic Cell Rate Algorithm, a leaky-bucket-style scheduling algorithm, used to calculate the present or future capacity of a subscriber bucket using spend and refund operations (a rough sketch of the idea follows below)

Note: the included source implementation is test-only, currently a simple in-memory map protected by a mutex; implementations using Redis and potentially other data stores will follow.

Part of #5545
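Since gcra.go itself isn't reproduced in this thread, here is a minimal, hypothetical Go sketch of a GCRA-style spend decision to illustrate the leaky-bucket idea described above. The names (`gcraDecide`, `emissionInterval`, `burstOffset`) and the exact arithmetic are illustrative assumptions, not Boulder's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// gcraDecide applies a single GCRA ("leaky bucket as a meter") decision.
// tat is the bucket's stored theoretical arrival time, limit is the number
// of units allowed per period, and cost is how many units this request spends.
func gcraDecide(tat, now time.Time, limit int64, period time.Duration, cost int64) (allowed bool, newTAT time.Time) {
	emissionInterval := period / time.Duration(limit)      // time "cost" of one unit
	burstOffset := emissionInterval * time.Duration(limit) // full bucket capacity, expressed as time

	// A bucket that has been idle long enough is treated as full.
	if tat.Before(now) {
		tat = now
	}

	// Spending advances the theoretical arrival time.
	newTAT = tat.Add(emissionInterval * time.Duration(cost))

	// Allow the spend only if the advanced TAT stays within one burst of now.
	if newTAT.Sub(now) > burstOffset {
		return false, tat // insufficient capacity; leave the stored TAT unchanged
	}
	return true, newTAT
}

func main() {
	now := time.Now()
	tat := now // never-used bucket: full capacity
	for i := 1; i <= 7; i++ {
		ok, next := gcraDecide(tat, now, 5, time.Minute, 1)
		fmt.Printf("request %d allowed=%v\n", i, ok)
		if ok {
			tat = next
		}
	}
	// With a limit of 5 per minute and no elapsed time, the first 5 requests
	// are allowed and the remainder are rejected until capacity refills.
}
```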
I've just spent some time trying to concoct a query against the

I realise this ticket and the PR do not explicitly call for public APIs to query quota assignments or quota consumption, and I'm not trying to drive-by scope-hijack this ticket into doing so. But I do suggest that it's worth considering what a future public API might require when designing and testing the new quota system, so that it leaves room for the possibility as follow-on work.

Past posts about quota checking have tended to point people at

Relevant past threads on checking quota usage include:

and I recently started a

May I also suggest, as part of this, that messages relating to quota issues report the quota limit along with the quota-exceeded message? E.g. at https://github.com/letsencrypt/boulder/blob/7d66d67054616867121e822fdc8ae58b10c1d71a/ra/ra.go#L1442C3-L1442C3 the web-api front-end replies

which informs the caller that the relevant quota has been exceeded, and for which domain, but not what the numeric value of the quota for the domain is.
Regarding rate-limit data being ephemeral and not requiring consistency guarantees: while it can be derived from the main SA DB cert logs, it might still be expensive to compute from scratch on a service restart. So the ability to checkpoint it and re-compute it forwards from the last checkpoint is likely to be desirable.

I didn't see anything in the Boulder SA code that limits how far back in time the current LE storage engine will look to find a past issuance of the same FQDN set when it's deciding whether a given cert order is a renewal or not. That seems to put the entire LE cert history in scope for the quota-checker datastore, because it has to be able to check whether any given FQDN set was ever issued before. Even if I missed some limit on this, or if LE defines a look-back time limit as part of the new quota system work, it'll presumably have to be at least the 90-day cert validity window plus some reasonable slush margin. That's a lot of data to scan and rebuild a unique-FQDN-set cache from if the ephemeral store is lost to a crash, restart, etc. That could have a significant effect on backend load during reconstruction and/or time-to-availability.

Checkpointing the rate-limit state wouldn't have to be consistent or atomic; it could just err in favor of under-counting usage rather than over-counting, so it never falsely reported quota to be exceeded. Recording the latest-issued cert after saving a checkpoint of the rate-limit state would probably do; then it'd only have to scan forward from there. If some certs are not counted because they got issued after the snapshot started and before the last-issued position was recorded in the DB, who cares? It won't be many, and any error in the computed quota will age out of the quota window for the certs-per-domain and renewals quotas within 7 days anyway.

I don't understand how the store can be re-initialized with a starting state or efficiently "replayed" to recover past state based on the store in https://github.com/letsencrypt/boulder/blob/861161fc9f76f7b0cdb27c0b7e81d1572e4c5061/ratelimits/source.go. Similarly, if quota state recording and lookup is going for non-atomic "close enough" measurement, then erring on the side of under-counting usage would make sense.
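To make the scan-forward suggestion above concrete, here is a rough, hypothetical sketch of replaying issuance rows newer than a checkpointed high-water mark. The table and column names (`issuedCerts`, `registrationID`, `names`) and the `spend` callback are assumptions for illustration only; Boulder's actual schema and ratelimits API differ.

```go
package ratelimitsketch

import (
	"context"
	"database/sql"
	"strings"
)

// replayIssuanceSince re-applies rate-limit spends for certificates issued
// after the checkpointed lastSeenID and returns the new high-water mark.
// Certificates issued between the counter snapshot and the recording of
// lastSeenID are simply missed, which errs toward under-counting usage, as
// suggested in the comment above.
func replayIssuanceSince(ctx context.Context, db *sql.DB, lastSeenID int64,
	spend func(regID int64, names []string)) (int64, error) {

	rows, err := db.QueryContext(ctx,
		`SELECT id, registrationID, names FROM issuedCerts WHERE id > ? ORDER BY id`,
		lastSeenID)
	if err != nil {
		return lastSeenID, err
	}
	defer rows.Close()

	newLastSeenID := lastSeenID
	for rows.Next() {
		var (
			id, regID int64
			names     string
		)
		if err := rows.Scan(&id, &regID, &names); err != nil {
			return newLastSeenID, err
		}
		spend(regID, strings.Split(names, ","))
		newLastSeenID = id
	}
	return newLastSeenID, rows.Err()
}
```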
Hi @ringerc! A few notes in no particular order:
@aarongable That makes a lot of sense. The only issue I see with ephemerality is that LE quotas currently exclude "renewals" of certs for a domain, and the FQDN-set check used to detect renewals seems to look back in time forever. If that's lost, all incoming cert orders count against the new-domains quota, not the renewals quota, which could easily cause an org to exceed quota for what would normally be a way-lower-than-limits load.

There's an effectively unlimited number of domains that could be renewed within the quota window, so no amount of padding of the new-domains quota to compensate for "forgotten" renewals would guarantee correctness. A domain-account could've issued 50 certs 180 days ago, then renewed those 50 and issued another 50 certs 90 days ago, and now want to renew all 100 plus issue 50 more; it would usually be able to rely on doing this for an indefinite number of 90-day cert validity cycles, so there's no sensible quota to apply if knowledge of which FQDN sets are renewals is suddenly lost. That problem won't go away after one quota cycle, so enforcement of that quota can't just be turned off until one 7-day cycle has passed to solve the issue. I don't see an obvious solution for that with the current quota definitions unless the unique FQDN sets for past certs issued can be safely stored or reconstructed on loss.

Redefining the quotas themselves w.r.t. renewals is one option. For example, change the renewal quota rules to say that only renewals of certs for FQDN sets that were last issued or renewed in the past (say) 100 days are guaranteed to be exempted from the new-certs quota. Most people will only renew certs to maintain continuous rolling validity, so this should be the immense majority of renewals. Then, if unique-FQDN-set info for renewal detection is lost, increase the new-domains quota to 3x the usual value for the first 90-day cert validity window after loss of renewal history. That would ensure that even if every renewal is miscounted as a new domain, no previously-valid request would be rejected, and it wouldn't require any reconstruction of unique-FQDN-set info from the main SA DB. Or it could be reconstructed lazily after recovery, and the normal limits re-imposed once the FQDN-set info was reconstructed.
@ringerc Thanks for the feedback; I believe you're correct on these points. Our plan to account for this is to continue using our MariaDB database as the source of truth when determining whether a newOrder request is a renewal.
@beautifulentropy Makes sense. And the quota layer can be used as a write-through cache over the SA MariaDB FQDN-set tracking, so it can still offload a lot of the cost of the FQDN-set checks. Once a domain set is added to the FQDN-set store it's never removed, so the cache doesn't need any complicated invalidation logic and is easy to pre-warm on restart (see the rough sketch after this comment).

So the idea here is to add some kind of ephemeral, no-consistency-guarantees layer for storing and looking up the quotas that enforce limits on new orders, unique FQDN sets per TLD+1, and most others. The FQDN-set uniqueness checks used to detect renewals will continue to use the current SA code, either directly as now, or possibly via write-through caching through the quota store. Client visibility into quota status will be exposed via a rich set of quota-status headers on all responses and/or via dedicated quota-lookup API endpoints.

Reasonable summary of the intent as it stands?
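A rough sketch of the caching idea mentioned above, assuming a Redis-protocol cache via go-redis and an illustrative `fqdnSets` table; all names here are assumptions, not Boulder's actual code. At issuance time the same key would also be written alongside the database insert, which is what makes the cache write-through and easy to pre-warm.

```go
package fqdncachesketch

import (
	"context"
	"database/sql"

	"github.com/redis/go-redis/v9"
)

// fqdnSetSeen reports whether a given FQDN-set hash has ever been issued,
// consulting the key-value cache first and falling back to the SA database.
// Because FQDN sets are never removed once recorded, positive results can be
// cached indefinitely with no invalidation logic.
func fqdnSetSeen(ctx context.Context, cache *redis.Client, db *sql.DB, setHash string) (bool, error) {
	key := "fqdnSet:" + setHash

	// Cache hit: the set has definitely been issued before.
	if n, err := cache.Exists(ctx, key).Result(); err == nil && n == 1 {
		return true, nil
	}

	// Cache miss (or cache error): fall back to the authoritative database.
	var count int
	if err := db.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM fqdnSets WHERE setHash = ?`, setHash).Scan(&count); err != nil {
		return false, err
	}
	if count == 0 {
		return false, nil // negative results are not cached; they can change
	}

	// Best-effort cache fill; losing this write only costs a future DB lookup.
	_ = cache.Set(ctx, key, 1, 0).Err()
	return true, nil
}
```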
- Emit override utilization only when resource counts are under threshold.
- Override utilization accounts for anticipated issuance.
- Correct the limit metric label for `CertificatesPerName` and `CertificatesPerFQDNSet/Fast`.

Part of #5545
- Move default and override limits, and associated methods, out of the Limiter to a new limitRegistry struct, embedded in a new public TransactionBuilder.
- Export Transaction and add corresponding Transaction constructor methods for each limit Name, making Limiter and TransactionBuilder the API for interacting with the ratelimits package.
- Implement batched Spends and Refunds on the Limiter; the new methods accept a slice of Transactions.
- Add new boolean fields check and spend to Transaction to support more complicated cases that can arise in batches (see the illustrative sketch after this list):
  1. the InvalidAuthorizations limit is checked at New Order time in a batch with many other limits, but should only be spent when an Authorization is first considered invalid.
  2. the CertificatesPerDomain limit is overridden by CertificatesPerDomainPerAccount; when this is the case, spends of the CertificatesPerDomain limit should be "best-effort" but NOT deny the request if capacity is lacking.
- Modify the existing Spend/Refund methods to support Transaction.check/spend and 0-cost Transactions.
- Make bucketId private and add a constructor for each bucket key format supported by ratelimits.
- Move domainsForRateLimiting() from ra.go to ratelimits. This avoids a circular import issue in ra.go.

Part of #5545
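As an illustration of the check/spend split described in this PR summary, here is a hypothetical sketch; Boulder's real Transaction type, fields, and constructors differ, so treat every name below as an assumption.

```go
package txnsketch

// Transaction pairs a bucket key with a cost and flags controlling whether
// the limiter should evaluate capacity, deduct it, or both.
type Transaction struct {
	bucketKey string // e.g. "enum:regId:domain"
	cost      int64
	check     bool // evaluate capacity; a failed check denies the request
	spend     bool // deduct cost from the bucket
}

// checkOnly models case 1: InvalidAuthorizations is evaluated at New Order
// time alongside other limits, but its cost is only spent later, when an
// authorization is first considered invalid.
func checkOnly(key string) Transaction {
	return Transaction{bucketKey: key, cost: 1, check: true}
}

// spendOnly models case 2: a "best-effort" spend of CertificatesPerDomain
// when it is overridden by CertificatesPerDomainPerAccount; capacity is
// deducted if available, but a shortfall never denies the request.
func spendOnly(key string) Transaction {
	return Transaction{bucketKey: key, cost: 1, spend: true}
}
```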
- Update parsing of overrides with Ids formatted as 'fqdnSet' to produce a hexadecimal string.
- Update validation for Ids formatted as 'fqdnSet', when constructing a bucketKey for a transaction, to validate before identifier construction.
- Skip CertificatesPerDomain transactions when the limit is disabled.

Part of #5545
Add non-blocking checks of New Order limits to the WFE using the new key-value based rate limits package. Part of #5545
…7344)

- Update the failed authorizations limit to use 'enum:regId:domain' for transactions while maintaining 'enum:regId' for overrides.
- Modify the failed authorizations transaction builder to generate a transaction for each order name.
- Rename the `FailedAuthorizationsPerAccount` enum to `FailedAuthorizationsPerDomainPerAccount` to align with its corrected implementation. This change is possible because the limit isn't yet deployed in staging or production.

Blocks #7346
Part of #5545
…PerDomain (#7513)

- Rename `NewOrderRequest` field `LimitsExempt` to `IsARIRenewal`
- Introduce a new `NewOrderRequest` field, `IsRenewal`
- Introduce a new (temporary) feature flag, `CheckRenewalExemptionAtWFE`

WFE:
- Perform renewal detection in the WFE when `CheckRenewalExemptionAtWFE` is set
- Skip (key-value) `NewOrdersPerAccount` and `CertificatesPerDomain` limit checks when renewal detection indicates that the order is a renewal.

RA:
- Leave renewal detection in the RA intact
- Skip renewal detection and (legacy) `NewOrdersPerAccount` and `CertificatesPerDomain` limit checks when `CheckRenewalExemptionAtWFE` is set and the `NewOrderRequest` indicates that the order is a renewal.

Fixes #7508
Part of #5545
Default code paths that depended on this flag to be true. Part of #5545
MySQL-compatible databases are inefficient for some of Boulder's most demanding data storage needs:
OCSP responses are perfect for a simpler and more performant key-value data model. In order for Boulder to support large and growing environments, it would be nice for it to support Redis as a storage backend for these purposes. For OCSP responses, there will need to be some ability to insert/update data in both MySQL and Redis in parallel (e.g. during a transition).
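For the transition period mentioned above, a parallel insert/update might look roughly like the following sketch, assuming go-redis and an illustrative `certificateStatus` table; the names and schema are assumptions, and MySQL remains the source of truth until cutover.

```go
package ocspsketch

import (
	"context"
	"database/sql"
	"time"

	"github.com/redis/go-redis/v9"
)

// storeOCSPResponse writes a freshly signed OCSP response to both backends
// during the migration window.
func storeOCSPResponse(ctx context.Context, db *sql.DB, rdb *redis.Client,
	serial string, der []byte, ttl time.Duration) error {

	// Write to the authoritative store first.
	if _, err := db.ExecContext(ctx,
		`UPDATE certificateStatus SET ocspResponse = ?, ocspLastUpdated = ? WHERE serial = ?`,
		der, time.Now(), serial); err != nil {
		return err
	}

	// Mirror into the key-value store; a failure here should be surfaced
	// (or retried) but must not lose the MySQL write.
	return rdb.Set(ctx, "ocsp:"+serial, der, ttl).Err()
}
```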
---- (project tracking added by @beautifulentropy) ----
- `berrors` which indicate the specific limit violated, the period of time to wait, and provide links to the relevant documentation #7577
- `CheckRenewalExemptionAtWFE` feature flag #7511