
Key-value based rate limiting #5545

Open
16 of 18 tasks
jprenken opened this issue Jul 22, 2021 · 8 comments


@jprenken (Contributor) commented Jul 22, 2021

MySQL-compatible databases are inefficient for some of Boulder's most demanding data storage needs:

  • OCSP responses are perfect for a simpler and more performant key-value data model.
  • Rate limit data (a broad category) is ephemeral, and requires no ACID guarantees.

In order for Boulder to support large and growing environments, it would be nice for it to support Redis as a storage backend for these purposes. For OCSP responses, there will need to be some ability to insert/update data in both MySQL and Redis in parallel (e.g. during a transition).

---- (project tracking added by @beautifulentropy) ----

@jcjones (Contributor) commented Jul 27, 2021

I would recommend we split this ticket apart.

Rate limit data is very well suited to key-value stores, which often have Redis-compatible command sets (we probably shouldn't use Redis itself when there are memory-safe alternatives that speak the Redis protocol).

OCSP data can be represented reasonably well both as key-value and as tabular data. I think for performance reasons we should consider changes to how OCSP gets stored, too, but it's fundamentally different data, it has a lot more tooling already around it that would have to change if we changed its primary representation, and there's room for something Redis-like to be used as a "cache" that unloads the DBs without changing the primary source of truth. I would thus like to consider it separately.

@aarongable (Contributor):

As JC said, we should split this. OCSP data in Redis is now fully supported. So let's repurpose this ticket for @beautifulentropy's investigation into creating a new lookaside rate limit service, and the database system that will back that service.

@aarongable changed the title from "Support Redis for OCSP & rate limit data storage" to "Create new rate limit data storage system" on May 16, 2023
beautifulentropy added a commit that referenced this issue Jul 21, 2023
This design seeks to reduce read-pressure on our DB by moving rate limit
tabulation to a key-value datastore. This PR provides the following:

- (README.md) a short guide to the schemas, formats, and concepts
introduced in this PR
- (source.go) an interface for storing, retrieving, and resetting a
subscriber bucket
- (name.go) an enumeration of all defined rate limits
- (limit.go) a schema for defining default limits and per-subscriber
overrides
- (limiter.go) a high-level API for interacting with key-value rate
limits
- (gcra.go) an implementation of the Generic Cell Rate Algorithm, a
leaky bucket-style scheduling algorithm, used to calculate the present
or future capacity of a subscriber bucket using spend and refund
operations

Note: the included source implementation is test-only and is currently
backed by a simple in-memory map protected by a mutex; implementations
using Redis and potentially other data stores will follow.

Part of #5545
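
For context on the GCRA mentioned above, here is a minimal Go sketch of the spend decision. It is illustrative only: the names and parameters are invented and do not match the actual gcra.go in the ratelimits package.

```go
package main

import (
	"fmt"
	"time"
)

// bucket holds the only state GCRA needs: the theoretical arrival time (TAT),
// the instant at which the bucket will next be completely full.
type bucket struct {
	tat time.Time
}

// spend tries to deduct cost units from the bucket. The bucket refills at a
// rate of count units per period and holds at most burst units, so burst
// units can be spent instantly from a full bucket. It returns whether the
// spend is allowed and, if not, roughly how long to wait before retrying.
func (b *bucket) spend(cost, count, burst int64, period time.Duration, now time.Time) (bool, time.Duration) {
	emissionInterval := period / time.Duration(count)      // time to earn back one unit
	burstOffset := emissionInterval * time.Duration(burst) // how far ahead of now the TAT may run

	tat := b.tat
	if tat.Before(now) {
		tat = now // an idle bucket has fully refilled
	}

	newTAT := tat.Add(emissionInterval * time.Duration(cost))
	if newTAT.Sub(now) > burstOffset {
		// Not enough capacity: this spend would push the TAT too far into the future.
		return false, newTAT.Sub(now) - burstOffset
	}
	b.tat = newTAT
	return true, 0
}

func main() {
	// A limit of 20 per minute with a burst capacity of 20: the first 20
	// spends succeed immediately, then roughly one more every 3 seconds.
	b := &bucket{}
	for i := 1; i <= 22; i++ {
		ok, retryIn := b.spend(1, 20, 20, time.Minute, time.Now())
		fmt.Printf("request %2d: allowed=%v retryIn=%s\n", i, ok, retryIn)
	}
}
```

A refund can be sketched as the mirror image of this: subtract cost × emissionInterval from the stored TAT (clamping at now), which is roughly what makes the spend/refund pair in the commit message symmetric.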
@ringerc commented Jul 24, 2023

I've just spent some time trying to concoct a query against the crt.sh public certificate transparency database that will expose a reasonable estimate of current Let's Encrypt quota usage for a given domain, as a workaround for the unavailability of quota info in the public Let's Encrypt APIs, so I'm delighted to see this ticket and PoC PR.

I realise this ticket and the PR do not explicitly call for public APIs to query quota assignments or quota consumption, and I'm not trying to drive-by scope-hijack this ticket into doing so. But I do suggest that it's worth considering what a future public API might require when designing and testing the new quota system so that it leaves room for the possibility as follow-on work.

Past posts about quota checking have tended to point people at crt.sh - to use the public certificate transparency log and associated database maintained there to compute quota usage. But this method won't account for Let's Encrypt's FQDN-set renewal-detection logic, and it's tricky to get right, let alone in a performant way. Plus it's not ideal to shift load onto crt.sh for people trying to monitor LE quota. So it'd be ideal if this quota rework could serve as the foundation of future public quota-access APIs.

Relevant past threads on checking quota usage include:

I also recently started a crt.sh discussion thread where I posted a probably-horribly-wrong query against the crt.sh database for LE quota checking purposes.


May I also suggest, as part of this, that messages relating to quota issues report the numeric quota limit along with the quota-exceeded message? E.g. at https://github.com/letsencrypt/boulder/blob/7d66d67054616867121e822fdc8ae58b10c1d71a/ra/ra.go#L1442C3-L1442C3 the RA replies

		return berrors.RateLimitError(retryAfter, "too many certificates already issued for %q. Retry after %s", namesOutOfLimit[0], retryString)

which informs the caller that the relevant quota has been exceeded, and for which domain, but not what the numeric value of the quota for the domain is.
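
Purely to illustrate the shape I'm suggesting (the helper name, and the assumption that the threshold and window are available at that call site, are mine, not Boulder's):

```go
package main

import (
	"fmt"
	"time"
)

// rateLimitDetail is a hypothetical helper: the only point is that the message
// carries the numeric limit and its window alongside the offending name.
func rateLimitDetail(name string, threshold int, window time.Duration, retryAfter time.Time) string {
	return fmt.Sprintf(
		"too many certificates (limit: %d per %s) already issued for %q. Retry after %s",
		threshold, window, name, retryAfter.Format(time.RFC3339))
}

func main() {
	fmt.Println(rateLimitDetail("example.com", 50, 7*24*time.Hour, time.Now().Add(36*time.Hour)))
}
```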

@ringerc commented Jul 24, 2023

Regarding rate-limit data being ephemeral and not requiring consistency guarantees: while it can be derived from the main SA DB's certificate records, it might still be expensive to compute from scratch on a service restart. So the ability to checkpoint it and re-compute it forward from the last checkpoint is likely to be desirable.

I didn't see anywhere in the Boulder SA code that limits how far back in time the current LE storage engine code will look to find a past issue of the same FQDN-set when it's deciding whether a given cert order is a renewal or not. That seems to put the entire LE cert history in scope for the quota checker datastore because it has to be able to check if any given FQDN-set was ever issued before. Even if I missed some limit on this, or if LE defines a look-back time limit as part of the new quota system work, it'll presumably have to be at least the 90 day cert validity window plus some reasonable slush margin. That's a lot of data to scan and rebuild a unique-FQDN-set cache from if the ephemeral store is lost to a crash, restart etc. That could have a significant effect on backend load during reconstruction and/or time-to-availability.

Check-pointing the rate-limit state wouldn't have to be consistent or atomic; it could just err in favor of under-counting usage rather than over-counting so it never falsely reported quota to be exceeded. Recording the latest-issued cert after saving a checkpoint of the rate limit state would probably do; then it'd only have to scan-forward from there. If some certs issued are not counted because they got issued after the snapshot started and before the last-issued position was recorded in the DB, who cares? It won't be lots and any error in the computed quota will age out of the quota window for certs-per-domain and renewals quotas within 7 days anyway.

I don't understand how the store can be re-initialized with a starting state, or efficiently "replayed" to recover past state, given the interface in https://github.com/letsencrypt/boulder/blob/861161fc9f76f7b0cdb27c0b7e81d1572e4c5061/ratelimits/source.go .

Similarly, if quota state recording and lookup is going for non-atomic "close enough" measurement, then erring on the side of under-counting usage would make sense.

@aarongable (Contributor):

Hi @ringerc! A few notes in no particular order:

  • We're not committing to anything I say here; this whole message is just the current state of our thinking, which will certainly continue to evolve.
  • We do plan to expose rate limit info directly to clients. We're considering two avenues for doing so:
    • Including comprehensive rate limit headers in all responses: not just a Retry-After header, but also a "how many tokens do you have left" header, a "how long until you're back to max capacity" header, etc. (see the sketch after this list).
    • Providing a "tell me about all of my current rate limit quotas" endpoint that clients could query to receive a JSON document detailing their current state.
  • Don't think too hard about the way the current rate limits are implemented in the SA. The whole point of @beautifulentropy's PR linked above is that we plan to get rid of all of that: the rate limit data will be stored entirely outside of the database. The comment about ephemerality is basically saying that if the rate limits storage system gets dropped, that's not the end of the world. We'll get some extra traffic for a bit until the limits start kicking in again, but we won't fall over and we won't run into any compliance problems. So it's fundamentally okay if the rate limits data (or some subset of it) just vanishes, and we don't intend to go to any effort to reconstruct it from the database. We're considering various backing stores, based on a mix of criteria like ease of deployment, reliability, speed, and more.
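
Purely as a sketch of the header idea, and not a commitment to any particular names (these borrow from the IETF draft-ietf-httpapi-ratelimit-headers convention):

```go
package example

import (
	"net/http"
	"strconv"
	"time"
)

// setRateLimitHeaders is illustrative only: it attaches a capacity header, a
// remaining-tokens header, and a seconds-until-full header to a response, plus
// Retry-After when the client is out of capacity.
func setRateLimitHeaders(w http.ResponseWriter, limit, remaining int64, untilFull time.Duration) {
	h := w.Header()
	h.Set("RateLimit-Limit", strconv.FormatInt(limit, 10))                      // bucket capacity
	h.Set("RateLimit-Remaining", strconv.FormatInt(remaining, 10))              // tokens left right now
	h.Set("RateLimit-Reset", strconv.FormatInt(int64(untilFull.Seconds()), 10)) // seconds until back to full capacity
	if remaining <= 0 {
		// In practice this would be the time until one token is available,
		// not until the bucket is full; kept simple here.
		h.Set("Retry-After", strconv.FormatInt(int64(untilFull.Seconds()), 10))
	}
}
```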

@ringerc commented Jul 25, 2023

@aarongable That makes a lot of sense.

The only issue I see with ephemerality is that LE quotas currently exempt "renewals" for a domain, where the FQDN-set lookup used to detect renewals appears to look back in time indefinitely. If that's lost, all incoming cert orders count against the new-domains quota rather than the renewals quota, which could easily cause an org to exceed quota for what would normally be a way-lower-than-limits load.

There's an effectively unlimited number of domains that could be renewed within the quota window, so no amount of padding of the new-domains quota to compensate for "forgotten" renewals would guarantee correctness. A domain-account could've issued 50 certs 180 days ago, then renewed those 50 and issued another 50 certs 90 days ago, and now wants to renew all 100 and issue 50 more; it would usually be able to rely on doing this for an indefinite number of 90-day cert validity cycles, so there's no sensible quota to apply if knowledge of which FQDN-sets are renewals is suddenly lost.

That problem won't go away after one quota cycle, so enforcement of that quota can't simply be turned off until one 7-day cycle has passed.

I don't see an obvious solution for that with the current quota definitions unless the unique-FQDN-sets for past certs issued can be safely stored or reconstructed on loss.

Redefining the quotas themselves with respect to renewals is one option. For example, change the renewal quota rules to say that only renewals of certs for FQDN-sets that were last issued or renewed in the past (say) 100 days are guaranteed to be exempted from the new-certs quota. Most people only renew certs to maintain continuous rolling validity, so this should cover the vast majority of renewals. Then, if unique-FQDN-set info for renewal detection is lost, increase the new-domains quota to 3x the usual value for the first 90-day cert validity window after the loss of renewal history. That would ensure that even if every renewal were miscounted as a new domain, no previously-valid request would be rejected, and it wouldn't require any reconstruction of unique FQDN-set info from the main SA DB. Or it could be reconstructed lazily after recovery, and the normal limits re-imposed once the FQDN-set info was rebuilt.

@beautifulentropy (Member):

@ringerc Thanks for the feedback; I believe you're correct on these points. Our plan to account for this is to continue using our MariaDB database as the source of truth when determining whether a newOrder request is a renewal.

@ringerc commented Jul 25, 2023

@beautifulentropy Makes sense. And the quota layer can be used as a write-through cache over the SA MariaDB FQDN-set tracking, so it can still offload a lot of the cost of the FQDN-set checks. Once a domain set is added to the FQDN-set table it's never removed, so the cache doesn't need any complicated invalidation logic and is easy to pre-warm on restart.
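
For instance, a rough sketch of such an add-only cache (all names invented here, not actual Boulder code):

```go
package example

import "sync"

// fqdnSetCache sketches the add-only, write-through idea above. Because an
// FQDN set, once recorded, is never removed, a cache hit never needs
// invalidation; only misses fall through to the authoritative MariaDB table.
type fqdnSetCache struct {
	mu         sync.RWMutex
	seen       map[string]bool
	existsInDB func(setHash string) (bool, error) // stand-in for the SA query
}

// Exists answers "was this FQDN set ever issued before?" from the cache when
// possible, falling back to the database on a miss.
func (c *fqdnSetCache) Exists(setHash string) (bool, error) {
	c.mu.RLock()
	hit := c.seen[setHash]
	c.mu.RUnlock()
	if hit {
		return true, nil
	}
	exists, err := c.existsInDB(setHash)
	if err != nil {
		return false, err
	}
	if exists {
		c.mu.Lock()
		c.seen[setHash] = true
		c.mu.Unlock()
	}
	return exists, nil
}

// Record is called on the issuance path after the authoritative row has been
// written, keeping the cache write-through rather than a second source of truth.
func (c *fqdnSetCache) Record(setHash string) {
	c.mu.Lock()
	c.seen[setHash] = true
	c.mu.Unlock()
}
```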

So the idea here is to add some kind of ephemeral, no-consistency-guarantees layer for storing and looking up the quotas that enforce limits on new orders, unique FQDN-sets per TLD+1, and most others. The FQDN-set uniqueness checks used to detect renewals will continue to use the current SA code, either directly as now, or possibly through write-through caching via the quota store. Client visibility into quota status will be exposed via a rich set of quota status headers on all responses and/or via dedicated quota-lookup API endpoints.

Reasonable summary of the intent as it stands?

@aarongable removed this from the Sprint 2023-07-25 milestone on Aug 1, 2023
pgporada pushed a commit that referenced this issue Nov 20, 2023
- Emit override utilization only when resource counts are under
threshold.
- Override utilization accounts for anticipated issuance.
- Correct the limit metric label for `CertificatesPerName` and
`CertificatesPerFQDNSet/Fast`.

Part of #5545
beautifulentropy added a commit that referenced this issue Dec 7, 2023
- Move default and override limits, and associated methods, out of the
Limiter to new limitRegistry struct, embedded in a new public
TransactionBuilder.
- Export Transaction and add corresponding Transaction constructor
methods for each limit Name, making Limiter and TransactionBuilder the
API for interacting with the ratelimits package.
- Implement batched Spends and Refunds on the Limiter, the new methods
accept a slice of Transactions.
- Add new boolean fields check and spend to Transaction to support more
complicated cases that can arise in batches:
1. the InvalidAuthorizations limit is checked at New Order time in a
batch with many other limits, but should only be spent when an
Authorization is first considered invalid.
2. the CertificatesPerDomain limit is overridden by
CertificatesPerDomainPerAccount; when this is the case, spends of the
CertificatesPerDomain limit should be "best-effort" but NOT deny the
request if capacity is lacking.
- Modify the existing Spend/Refund methods to support
Transaction.check/spend and 0 cost Transactions.
- Make bucketId private and add a constructor for each bucket key format
supported by ratelimits.
- Move domainsForRateLimiting() from ra.go to ratelimits. This avoids a
circular import issue in ra.go.

Part of #5545
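
A rough sketch of the check/spend semantics described in this commit (the field names approximate the ratelimits package but may not match it exactly):

```go
package ratelimits

// Transaction, as sketched here, is a single limit evaluation: a bucket key,
// a cost, and two flags that control how it behaves inside a batch.
type Transaction struct {
	bucketKey string // e.g. "enum:regId" or "enum:regId:domain"
	cost      int64

	// check: the Transaction participates in the allow/deny decision.
	// spend: the Transaction deducts capacity from its bucket.
	//
	//   check && spend  -> normal limit: deny if over, deduct if allowed
	//   check && !spend -> check-only, e.g. InvalidAuthorizations at New
	//                      Order time; only spent later, when an authz is
	//                      actually marked invalid
	//   !check && spend -> best-effort spend, e.g. CertificatesPerDomain when
	//                      CertificatesPerDomainPerAccount overrides it:
	//                      deduct if capacity allows, but never deny the request
	check bool
	spend bool
}
```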
beautifulentropy added a commit that referenced this issue Jan 8, 2024
- Update parsing of overrides with Ids formatted as 'fqdnSet' to produce
a hexadecimal string.
- Update validation for Ids formatted as 'fqdnSet' when constructing a
bucketKey for a transaction to validate before identifier construction.
- Skip CertificatesPerDomain transactions when the limit is disabled.

Part of #5545
beautifulentropy added a commit that referenced this issue Jan 23, 2024
Make NewRegistration more consistent with the implementation in NewOrder
(#7201):
- Construct transactions just once,
- use batched spending instead of multiple spend calls, and
- do not attempt a refund for requests that fail due to RateLimit
errors.

Part of #5545
beautifulentropy added a commit that referenced this issue Jan 27, 2024
Add non-blocking checks of New Order limits to the WFE using the new
key-value based rate limits package.

Part of #5545
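
For illustration, "non-blocking" here means the check's outcome is observed but never used to reject the request. A sketch (all names invented; whether the real WFE does this inline or in a goroutine isn't specified by the commit message):

```go
package example

import (
	"context"
	"log"
	"time"
)

// checkFunc stands in for a batched check of the New Order limits against the
// key-value store; the real call lives in the ratelimits package.
type checkFunc func(ctx context.Context) error

// checkLimitsNonBlocking runs the check but only logs (or could meter) the
// result, so the new limits can be observed in production before enforcement.
func checkLimitsNonBlocking(check checkFunc) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
		defer cancel()
		if err := check(ctx); err != nil {
			log.Printf("key-value rate limit check failed (not enforced): %v", err)
		}
	}()
}
```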
@beautifulentropy changed the title from "Create new rate limit data storage system" to "Key-value based rate limiting" on Feb 27, 2024
beautifulentropy added a commit that referenced this issue Mar 6, 2024
…7344)

- Update the failed authorizations limit to use 'enum:regId:domain' for
transactions while maintaining 'enum:regId' for overrides.
- Modify the failed authorizations transaction builder to generate a
transaction for each order name.
- Rename the `FailedAuthorizationsPerAccount` enum to
`FailedAuthorizationsPerDomainPerAccount` to align with its corrected
implementation. This change is possible because the limit isn't yet
deployed in staging or production.

Blocks #7346
Part of #5545
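
A sketch of the two key shapes described above, one transaction key per order name versus a single per-account override key (illustrative only; the real key construction in the ratelimits package may differ in its details):

```go
package example

import "fmt"

// failedAuthzKeys builds 'enum:regId:domain' transaction keys, one per order
// name, while the override lookup stays at 'enum:regId' granularity.
func failedAuthzKeys(limitEnum int, regID int64, orderNames []string) (txKeys []string, overrideKey string) {
	overrideKey = fmt.Sprintf("%d:%d", limitEnum, regID)
	for _, name := range orderNames {
		txKeys = append(txKeys, fmt.Sprintf("%d:%d:%s", limitEnum, regID, name))
	}
	return txKeys, overrideKey
}
```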
@beautifulentropy removed this from the Sprint 2024-03-05 milestone on Mar 14, 2024
beautifulentropy added a commit that referenced this issue Jun 27, 2024
…PerDomain (#7513)

- Rename `NewOrderRequest` field `LimitsExempt` to `IsARIRenewal`
- Introduce a new `NewOrderRequest` field, `IsRenewal`
- Introduce a new (temporary) feature flag, `CheckRenewalExemptionAtWFE`

WFE:
- Perform renewal detection in the WFE when `CheckRenewalExemptionAtWFE`
is set
- Skip (key-value) `NewOrdersPerAccount` and `CertificatesPerDomain`
limit checks when renewal detection indicates that the order is a
renewal.

RA:
- Leave renewal detection in the RA intact
- Skip renewal detection and (legacy) `NewOrdersPerAccount` and
`CertificatesPerDomain` limit checks when `CheckRenewalExemptionAtWFE`
is set and the `NewOrderRequest` indicates that the order is a renewal.

Fixes #7508
Part of #5545
beautifulentropy added a commit that referenced this issue Oct 23, 2024
Default the code paths that depended on this flag to their flag-enabled (true) behavior.

Part of #5545