Key-value based rate limiting #5545
I would recommend we split this ticket apart. Rate limit data is very well suited for key-value stores, which often have Redis-compatible command sets (we probably shouldn't use Redis itself when there are memory-safe alternatives that speak the Redis protocol). OCSP data can be represented reasonably well both as key-value and as tabular data. I think for performance reasons we should consider changes to how OCSP gets stored, too, but it's fundamentally different data, has a lot more tooling already around it that would have to change if we were to change its primary representation, and there's room for something Redis-like being used as a "cache" to unload the DBs without changing the primary source of truth. I would thus like to consider it separately.
As JC said, we should split this. OCSP data in Redis is now fully supported. So let's repurpose this ticket for @beautifulentropy's investigation into creating a new lookaside rate limit service, and the database system that will back that service.
This design seeks to reduce read pressure on our DB by moving rate limit tabulation to a key-value datastore. This PR provides the following:

- (README.md) a short guide to the schemas, formats, and concepts introduced in this PR
- (source.go) an interface for storing, retrieving, and resetting a subscriber bucket
- (name.go) an enumeration of all defined rate limits
- (limit.go) a schema for defining default limits and per-subscriber overrides
- (limiter.go) a high-level API for interacting with key-value rate limits
- (gcra.go) an implementation of the Generic Cell Rate Algorithm, a leaky-bucket-style scheduling algorithm, used to calculate the present or future capacity of a subscriber bucket using spend and refund operations (a rough sketch of the idea follows below)

Note: the included source implementation is test-only, currently a simple in-memory map protected by a mutex; implementations using Redis and potentially other data stores will follow.

Part of #5545
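Since gcra.go itself isn't reproduced in this thread, here is a minimal, hypothetical Go sketch of a GCRA-style spend decision to illustrate the leaky-bucket idea described above. The names (`gcraDecide`, `emissionInterval`, `burstOffset`) and the exact arithmetic are illustrative assumptions, not Boulder's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// gcraDecide applies a single GCRA ("leaky bucket as a meter") decision.
// tat is the bucket's stored theoretical arrival time, limit is the number
// of units allowed per period, and cost is how many units this request spends.
func gcraDecide(tat, now time.Time, limit int64, period time.Duration, cost int64) (allowed bool, newTAT time.Time) {
	emissionInterval := period / time.Duration(limit)      // time "cost" of one unit
	burstOffset := emissionInterval * time.Duration(limit) // full bucket capacity, expressed as time

	// A bucket that has been idle long enough is treated as full.
	if tat.Before(now) {
		tat = now
	}

	// Spending advances the theoretical arrival time.
	newTAT = tat.Add(emissionInterval * time.Duration(cost))

	// Allow the spend only if the advanced TAT stays within one burst of now.
	if newTAT.Sub(now) > burstOffset {
		return false, tat // insufficient capacity; leave the stored TAT unchanged
	}
	return true, newTAT
}

func main() {
	now := time.Now()
	tat := now // never-used bucket: full capacity
	for i := 1; i <= 7; i++ {
		ok, next := gcraDecide(tat, now, 5, time.Minute, 1)
		fmt.Printf("request %d allowed=%v\n", i, ok)
		if ok {
			tat = next
		}
	}
	// With a limit of 5 per minute and no elapsed time, the first 5 requests
	// are allowed and the remainder are rejected until capacity refills.
}
```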
I've just spent some time trying to concoct a query against the

I realise this ticket and the PR do not explicitly call for public APIs to query quota assignments or quota consumption, and I'm not trying to drive-by scope-hijack this ticket into doing so. But I do suggest that it's worth considering what a future public API might require when designing and testing the new quota system, so that it leaves room for the possibility as follow-on work.

Past posts about quota checking have tended to point people at

Relevant past threads on checking quota usage include:

and I recently started a

May I also suggest, as part of this, that messages relating to quota issues report the quota limit along with the quota-exceeded message? E.g. at https://github.com/letsencrypt/boulder/blob/7d66d67054616867121e822fdc8ae58b10c1d71a/ra/ra.go#L1442C3-L1442C3 the web-api front-end replies

which informs the caller that the relevant quota has been exceeded, and for which domain, but not what the numeric value of the quota for the domain is.
Regarding rate-limit data being ephemeral and not requiring consistency guarantees: while it can be derived from the main SA DB cert logs, it might still be expensive to compute from scratch on a service restart. So the ability to checkpoint it and re-compute it forwards from the last checkpoint is likely to be desirable.

I didn't see anything in the Boulder SA code that limits how far back in time the current LE storage engine will look to find a past issuance of the same FQDN set when it's deciding whether a given cert order is a renewal or not. That seems to put the entire LE cert history in scope for the quota-checker datastore, because it has to be able to check whether any given FQDN set was ever issued before. Even if I missed some limit on this, or if LE defines a look-back time limit as part of the new quota system work, it'll presumably have to be at least the 90-day cert validity window plus some reasonable slush margin. That's a lot of data to scan and rebuild a unique-FQDN-set cache from if the ephemeral store is lost to a crash, restart, etc. That could have a significant effect on backend load during reconstruction and/or time-to-availability.

Checkpointing the rate-limit state wouldn't have to be consistent or atomic; it could just err in favor of under-counting usage rather than over-counting, so it never falsely reported quota to be exceeded. Recording the latest-issued cert after saving a checkpoint of the rate-limit state would probably do; then it'd only have to scan forward from there. If some certs are not counted because they got issued after the snapshot started and before the last-issued position was recorded in the DB, who cares? It won't be many, and any error in the computed quota will age out of the quota window for the certs-per-domain and renewals quotas within 7 days anyway.

I don't understand how the store can be re-initialized with a starting state or efficiently "replayed" to recover past state based on the store in https://github.com/letsencrypt/boulder/blob/861161fc9f76f7b0cdb27c0b7e81d1572e4c5061/ratelimits/source.go. Similarly, if quota state recording and lookup is going for non-atomic "close enough" measurement, then erring on the side of under-counting usage would make sense.
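To make the scan-forward suggestion above concrete, here is a rough, hypothetical sketch of replaying issuance rows newer than a checkpointed high-water mark. The table and column names (`issuedCerts`, `registrationID`, `names`) and the `spend` callback are assumptions for illustration only; Boulder's actual schema and ratelimits API differ.

```go
package ratelimitsketch

import (
	"context"
	"database/sql"
	"strings"
)

// replayIssuanceSince re-applies rate-limit spends for certificates issued
// after the checkpointed lastSeenID and returns the new high-water mark.
// Certificates issued between the counter snapshot and the recording of
// lastSeenID are simply missed, which errs toward under-counting usage, as
// suggested in the comment above.
func replayIssuanceSince(ctx context.Context, db *sql.DB, lastSeenID int64,
	spend func(regID int64, names []string)) (int64, error) {

	rows, err := db.QueryContext(ctx,
		`SELECT id, registrationID, names FROM issuedCerts WHERE id > ? ORDER BY id`,
		lastSeenID)
	if err != nil {
		return lastSeenID, err
	}
	defer rows.Close()

	newLastSeenID := lastSeenID
	for rows.Next() {
		var (
			id, regID int64
			names     string
		)
		if err := rows.Scan(&id, &regID, &names); err != nil {
			return newLastSeenID, err
		}
		spend(regID, strings.Split(names, ","))
		newLastSeenID = id
	}
	return newLastSeenID, rows.Err()
}
```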
Hi @ringerc! A few notes in no particular order:
@aarongable That makes a lot of sense. The only issue I see with ephemerality is that LE quotas currently exclude "renewals" of certs for a domain, and the FQDN-set check used to detect renewals seems to look back in time forever. If that's lost, all incoming cert orders count against the new-domains quota, not the renewals quota, which could easily cause an org to exceed quota for what would normally be a way-lower-than-limits load.

There's an effectively unlimited number of domains that could be renewed within the quota window, so no amount of padding of the new-domains quota to compensate for "forgotten" renewals would guarantee correctness. A domain-account could've issued 50 certs 180 days ago, then renewed those 50 and issued another 50 certs 90 days ago, and now want to renew all 100 plus issue 50 more; it would usually be able to rely on doing this for an indefinite number of 90-day cert validity cycles, so there's no sensible quota to apply if knowledge of which FQDN sets are renewals is suddenly lost. That problem won't go away after one quota cycle, so enforcement of that quota can't just be turned off until one 7-day cycle has passed to solve the issue. I don't see an obvious solution for that with the current quota definitions unless the unique FQDN sets for past certs issued can be safely stored or reconstructed on loss.

Redefining the quotas themselves w.r.t. renewals is one option. For example, change the renewal quota rules to say that only renewals of certs for FQDN sets that were last issued or renewed in the past (say) 100 days are guaranteed to be exempted from the new-certs quota. Most people will only renew certs to maintain continuous rolling validity, so this should be the immense majority of renewals. Then, if unique-FQDN-set info for renewal detection is lost, increase the new-domains quota to 3x the usual value for the first 90-day cert validity window after loss of renewal history. That would ensure that even if every renewal is miscounted as a new domain, no previously-valid request would be rejected, and it wouldn't require any reconstruction of unique-FQDN-set info from the main SA DB. Or it could be reconstructed lazily after recovery, and the normal limits re-imposed once the FQDN-set info was reconstructed.
@ringerc Thanks for the feedback; I believe you're correct on these points. Our plan to account for this is to continue using our MariaDB database as the source of truth when determining whether a newOrder request is a renewal.
@beautifulentropy Makes sense. And the quota layer can be used as a write-through cache over the SA MariaDB FQDN-set tracking, so it can still offload a lot of the cost of the FQDN-set checks. Once a domain set is added to the FQDN-set store it's never removed, so the cache doesn't need any complicated invalidation logic and is easy to pre-warm on restart (see the rough sketch after this comment).

So the idea here is to add some kind of ephemeral, no-consistency-guarantees layer for storing and looking up the quotas that enforce limits on new orders, unique FQDN sets per TLD+1, and most others. The FQDN-set uniqueness checks used to detect renewals will continue to use the current SA code, either directly as now, or possibly via write-through caching through the quota store. Client visibility into quota status will be exposed via a rich set of quota-status headers on all responses and/or via dedicated quota-lookup API endpoints.

Reasonable summary of the intent as it stands?
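A rough sketch of the caching idea mentioned above, assuming a Redis-protocol cache via go-redis and an illustrative `fqdnSets` table; all names here are assumptions, not Boulder's actual code. At issuance time the same key would also be written alongside the database insert, which is what makes the cache write-through and easy to pre-warm.

```go
package fqdncachesketch

import (
	"context"
	"database/sql"

	"github.com/redis/go-redis/v9"
)

// fqdnSetSeen reports whether a given FQDN-set hash has ever been issued,
// consulting the key-value cache first and falling back to the SA database.
// Because FQDN sets are never removed once recorded, positive results can be
// cached indefinitely with no invalidation logic.
func fqdnSetSeen(ctx context.Context, cache *redis.Client, db *sql.DB, setHash string) (bool, error) {
	key := "fqdnSet:" + setHash

	// Cache hit: the set has definitely been issued before.
	if n, err := cache.Exists(ctx, key).Result(); err == nil && n == 1 {
		return true, nil
	}

	// Cache miss (or cache error): fall back to the authoritative database.
	var count int
	if err := db.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM fqdnSets WHERE setHash = ?`, setHash).Scan(&count); err != nil {
		return false, err
	}
	if count == 0 {
		return false, nil // negative results are not cached; they can change
	}

	// Best-effort cache fill; losing this write only costs a future DB lookup.
	_ = cache.Set(ctx, key, 1, 0).Err()
	return true, nil
}
```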
- Emit override utilization only when resource counts are under threshold.
- Override utilization accounts for anticipated issuance.
- Correct the limit metric label for `CertificatesPerName` and `CertificatesPerFQDNSet/Fast`.

Part of #5545
- Move default and override limits, and associated methods, out of the Limiter to a new limitRegistry struct, embedded in a new public TransactionBuilder.
- Export Transaction and add corresponding Transaction constructor methods for each limit Name, making Limiter and TransactionBuilder the API for interacting with the ratelimits package.
- Implement batched Spends and Refunds on the Limiter; the new methods accept a slice of Transactions.
- Add new boolean fields check and spend to Transaction to support more complicated cases that can arise in batches (see the illustrative sketch after this list):
  1. the InvalidAuthorizations limit is checked at New Order time in a batch with many other limits, but should only be spent when an Authorization is first considered invalid.
  2. the CertificatesPerDomain limit is overridden by CertificatesPerDomainPerAccount; when this is the case, spends of the CertificatesPerDomain limit should be "best-effort" but NOT deny the request if capacity is lacking.
- Modify the existing Spend/Refund methods to support Transaction.check/spend and 0-cost Transactions.
- Make bucketId private and add a constructor for each bucket key format supported by ratelimits.
- Move domainsForRateLimiting() from ra.go to ratelimits. This avoids a circular import issue in ra.go.

Part of #5545
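As an illustration of the check/spend split described in this PR summary, here is a hypothetical sketch; Boulder's real Transaction type, fields, and constructors differ, so treat every name below as an assumption.

```go
package txnsketch

// Transaction pairs a bucket key with a cost and flags controlling whether
// the limiter should evaluate capacity, deduct it, or both.
type Transaction struct {
	bucketKey string // e.g. "enum:regId:domain"
	cost      int64
	check     bool // evaluate capacity; a failed check denies the request
	spend     bool // deduct cost from the bucket
}

// checkOnly models case 1: InvalidAuthorizations is evaluated at New Order
// time alongside other limits, but its cost is only spent later, when an
// authorization is first considered invalid.
func checkOnly(key string) Transaction {
	return Transaction{bucketKey: key, cost: 1, check: true}
}

// spendOnly models case 2: a "best-effort" spend of CertificatesPerDomain
// when it is overridden by CertificatesPerDomainPerAccount; capacity is
// deducted if available, but a shortfall never denies the request.
func spendOnly(key string) Transaction {
	return Transaction{bucketKey: key, cost: 1, spend: true}
}
```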
- Update parsing of overrides with Ids formatted as 'fqdnSet' to produce a hexadecimal string.
- Update validation for Ids formatted as 'fqdnSet', when constructing a bucketKey for a transaction, to validate before identifier construction.
- Skip CertificatesPerDomain transactions when the limit is disabled.

Part of #5545
Add non-blocking checks of New Order limits to the WFE using the new key-value based rate limits package. Part of #5545
…7344)

- Update the failed authorizations limit to use 'enum:regId:domain' for transactions while maintaining 'enum:regId' for overrides.
- Modify the failed authorizations transaction builder to generate a transaction for each order name.
- Rename the `FailedAuthorizationsPerAccount` enum to `FailedAuthorizationsPerDomainPerAccount` to align with its corrected implementation. This change is possible because the limit isn't yet deployed in staging or production.

Blocks #7346
Part of #5545
…PerDomain (#7513)

- Rename `NewOrderRequest` field `LimitsExempt` to `IsARIRenewal`
- Introduce a new `NewOrderRequest` field, `IsRenewal`
- Introduce a new (temporary) feature flag, `CheckRenewalExemptionAtWFE`

WFE:
- Perform renewal detection in the WFE when `CheckRenewalExemptionAtWFE` is set
- Skip (key-value) `NewOrdersPerAccount` and `CertificatesPerDomain` limit checks when renewal detection indicates that the order is a renewal.

RA:
- Leave renewal detection in the RA intact
- Skip renewal detection and (legacy) `NewOrdersPerAccount` and `CertificatesPerDomain` limit checks when `CheckRenewalExemptionAtWFE` is set and the `NewOrderRequest` indicates that the order is a renewal.

Fixes #7508
Part of #5545
Default code paths that depended on this flag to be true. Part of #5545
MySQL-compatible databases are inefficient for some of Boulder's most demanding data storage needs:
OCSP responses are perfect for a simpler and more performant key-value data model. In order for Boulder to support large and growing environments, it would be nice for it to support Redis as a storage backend for these purposes. For OCSP responses, there will need to be some ability to insert/update data in both MySQL and Redis in parallel (e.g. during a transition).
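For the transition period mentioned above, a parallel insert/update might look roughly like the following sketch, assuming go-redis and an illustrative `certificateStatus` table; the names and schema are assumptions, and MySQL remains the source of truth until cutover.

```go
package ocspsketch

import (
	"context"
	"database/sql"
	"time"

	"github.com/redis/go-redis/v9"
)

// storeOCSPResponse writes a freshly signed OCSP response to both backends
// during the migration window.
func storeOCSPResponse(ctx context.Context, db *sql.DB, rdb *redis.Client,
	serial string, der []byte, ttl time.Duration) error {

	// Write to the authoritative store first.
	if _, err := db.ExecContext(ctx,
		`UPDATE certificateStatus SET ocspResponse = ?, ocspLastUpdated = ? WHERE serial = ?`,
		der, time.Now(), serial); err != nil {
		return err
	}

	// Mirror into the key-value store; a failure here should be surfaced
	// (or retried) but must not lose the MySQL write.
	return rdb.Set(ctx, "ocsp:"+serial, der, ttl).Err()
}
```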
---- (project tracking added by @beautifulentropy) ----
- `berrors` which indicate the specific limit violated, the period of time to wait, and provide links to the relevant documentation #7577
- `CheckRenewalExemptionAtWFE` feature flag #7511