-
Notifications
You must be signed in to change notification settings - Fork 328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry Requests with HTTP Status 429 #4048
Conversation
@@ -207,7 +214,7 @@ extension HTTPClient: @unchecked Sendable {} | |||
|
|||
// MARK: - Private | |||
|
|||
private extension HTTPClient { | |||
internal extension HTTPClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified the access modified here so I could do some testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would using the @Testable import RevenueCat
thing have made this visible for testing? I can't exactly remember the correc usecase for it but figured I'd ask 😇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish! Unfortunately @testable
only grants you access to internal stuff, not private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly any time we're making private stuff internal for testing, it's a sign that it should likely be extracted into a separate internal entity instead. Not only for easier testing, but also to reduce the amount of stuff a single object (in this case, HTTPClient) is doing
} | ||
|
||
/// The number of times that we have retried the request | ||
var retryCount: UInt = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using a UInt
instead of an Int
here so there won't be any negative values 😎
} | ||
|
||
internal extension HTTPClient.ExponentialRetryOptions { | ||
static let `default` = HTTPClient.ExponentialRetryOptions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm totally open to changing these default values to something else if we want - this is just what came to the top of my head first!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would make sense to have different defaults per endpoint type... As in, maybe we don't need to retry some non-critical requests like diagnostics, product entitlement mapping,...
@@ -582,6 +624,87 @@ private extension HTTPClient { | |||
|
|||
} | |||
|
|||
// MARK: - Request Retry Logic | |||
extension HTTPClient { | |||
internal func retryRequestIfNeeded( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only function I wasn't able to directly test - let me know if y'all have any ideas! 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just double-checking that this method has been tested directly now and the comment is outdated, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I added some unit tests to test it directly, and it's also covered under the integration tests now!
For end-to-end testing, I'd suggest using some proxy server like this one that you can control and set stuff like "give me a 429 for the next request only" |
@@ -196,6 +202,7 @@ extension HTTPClient { | |||
case isLoadShedder = "X-RevenueCat-Fortress" | |||
case requestID = "X-Request-ID" | |||
case amazonTraceID = "X-Amzn-Trace-ID" | |||
case retryAfter = "Retry-After" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: confirm header name with backend devs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed!
// retryCount == self.retryOptions.maxNumberOfRetries | ||
guard request.retryCount > self.retryOptions.maxNumberOfRetries else { return } | ||
guard let httpURLResponse = httpURLResponse else { return } | ||
guard shouldRetryRequest(withStatusCode: httpURLResponse.httpStatusCode) else { return } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably make this the first check since conceptually it seems more important than the others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're always returning on these, maybe we could just chain them with &&
and ,
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then again we may wanna log, especially for the case where we've reached max retries on 429
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging here when we've reached the max retry count is a great idea - I'll add one in and combine the other two guard statements.
retryCount: UInt | ||
) -> TimeInterval { | ||
// If this is our first retry and there's no eTag header, ensure there's no delay | ||
if httpURLResponse.allHeaderFields[ResponseHeader.eTag] == nil && retryCount == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we just make the min backoff time zero we can skip this bit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd imagine that most of the time, if you hit a 429 and retry immediately, by the time you hit the backend again it'll already be solved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, removed!
internal func calculateDefaultExponentialBackoffTimeInterval(withRetryCount retryCount: UInt) -> TimeInterval { | ||
// Never wait for an original request | ||
guard retryCount > 0 else { return 0 } | ||
|
||
let baseBackoff = self.retryOptions.baseBackoff | ||
if baseBackoff < 1 { | ||
// Since base is less than 1, we need to calculate the backoff by multiplying the baseBackoff by | ||
// 2^(retryCount-1). This will result in: | ||
// | ||
// retryCount 1: backoff = baseBackoff * 2^0 = 0.25 | ||
// retryCount 2: backoff = baseBackoff * 2^1 = 0.5 | ||
// retryCount 3: backoff = baseBackoff * 2^2 = 1.0 | ||
return min(baseBackoff * pow(2.0, Double(retryCount - 1)), self.retryOptions.maxBackoff) | ||
} else { | ||
// If baseBackoff is greater than or equal to 1, just use baseBackoff^retryCount: | ||
// retryCount 1: backoff = baseBackoff^1 | ||
// retryCount 2: backoff = baseBackoff^2 | ||
// ... | ||
return min(pow(baseBackoff, Double(retryCount)), self.retryOptions.maxBackoff) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we only have 3 possible values in that we only retry up to three times, we might as well just explicitly set the values
).to(equal(1)) | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also use OHTTPStubs to fake responses from the backend and test.
The files named somethingBackendTests
have plenty of examples you can mostly reuse, especially BackendPostReceiptDataTests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like we should get at least one backend post receipt test in with the exact case we want to cover (first request is a 429, second goes through).
We can also use our BackendIntegrationTests suite to actually force a 429, remove the forced 429, post something to the backend, ensure it went through.
OfflineStoreKitIntegrationTests
has some examples of doing something similar
great work on this! |
let retriableStatusCodes: Set<HTTPStatusCode> | ||
let baseBackoff: TimeInterval | ||
let maxBackoff: TimeInterval | ||
let maxNumberOfRetries: UInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering, do we need both a maxBackoff
and maxNumberOfRetries
? One or the other could be computed of the other + baseBackoff right? Edit: Sorry I was reviewing this. I see it's different. So I think it's ok to leave both!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my updates to some of the other PR comments, I've hardcoded the backoff intervals. After that change was done, most of this struct was redundant so it's been removed!
@aboedo I've updated the PR with updates from your comments! Specifically, I've made the following changes:
|
case .timeInterval(let timeInterval): | ||
return max(timeInterval, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would this be 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of any instances where the timeInterval would be <0 in the scope of this PR, but this is just defensive programming in case someone ever tries to pass in a negative TimeInterval
as a delay in the future
TimeInterval(0.25), | ||
TimeInterval(0.5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't say that I have evidence to back this up, but I'd maybe have the backoff be a bit more aggressive, this does in the best case make three requests in the same second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be closer to 0.5 and 1 sec, or even 0.75 and 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me! I'll make the adjustments to use 0sec, 0.75sec, and 3sec. 👍
if !requestRetryScheduled { | ||
request.completionHandler?(response) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sending a retry attempt number header or something so we can debug these in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also maybe we can add this to SDK diagnostics (cc @tonidero )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, but this is an excellent thing to add. I'll look into adding the retry attempt number to the headers!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we are already tracking http requests response codes:
func trackHttpRequestPerformed(endpointName: String, |
We could add the retry count there. I'm a bit reticent to add another tag to the metric in Grafana (we are being mindful of increasing the cardinality of the data), but it might be useful especially if/when we move to using snowflake or similar for this data.
If we want to do this, we just need to append the new property as a valid property for the http_request_performed diagnostic event in the backend and we can start sending it from the SDK. Lmk if you want me to tackle this!
@@ -582,6 +624,87 @@ private extension HTTPClient { | |||
|
|||
} | |||
|
|||
// MARK: - Request Retry Logic | |||
extension HTTPClient { | |||
internal func retryRequestIfNeeded( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just double-checking that this method has been tested directly now and the comment is outdated, right?
/// - Parameter retryCount: The count of the retry attempt, starting from 1. | ||
/// - Returns: The backoff time interval for the given retry count. If the retry count exceeds | ||
/// the predefined list of backoff intervals, it returns 0. | ||
internal func defaultExponentialBackoffTimeInterval(withRetryCount retryCount: UInt) -> TimeInterval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a lot of stuff here is made internal for testing.
If we really need to test these directly, we should move to a different dedicated entity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[can be addressed as a follow-up]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also this is now simple enough that maybe it doesn't merit its own method - it just looks up a value in an array and has a default if it goes out of bounds.
it could even be simplified further like
guard retryCount >= 0 && retryCount < retryBackoffIntervals.count else { return retryBackoffIntervals.last ?? 0 }
return self.retryBackoffIntervals[backoffIntervalIndex]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the function!
@@ -108,6 +109,8 @@ private extension Delay { | |||
case .none: return 0 | |||
case .`default`: return 0 | |||
case .long: return Self.maxJitter | |||
case .timeInterval(let timeInterval): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't need the jitter, we might as well make another method in operation dispatcher that just takes an exact value as TimeInterval
rather as an overload to the one we're using instead of passing in a min and max value that are the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the argument can be made either way. If we were to make that change, we'd then have two similar dispatch functions that happen after a period of time:
func dispatchOnWorkerThread(delay: Delay)
func dispatchOnWorkerThread(after timeInterval: TimeInterval)
When I read these function signatures, it isn't clear which one I should choose because the function signatures signify the same thing.
If we were to go through with this change, I'd recommend renaming Delay
to something like JitterableDelay
to indicate the difference that one of these APIs includes a jitter and the other doesn't. I'm down to make this change - any objections to that renaming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I've gone ahead and renamed Delay
to JitterableDelay
and created the new func dispatchOnWorkerThread(after timeInterval: TimeInterval)
function!
self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.0 seconds.") | ||
self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.25 seconds.") | ||
self.logger.verifyMessageWasLogged("Queued request GET /v1/subscribers/identify for retry in 0.5 seconds.") | ||
self.logger.verifyMessageWasLogged("Request GET /v1/subscribers/identify failed all 3 retries.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are quite dangerous in that the minute we change the logs we mess up tests. That'd be fine if we were trying to specifically test that we're logging, but here we're testing "Performs all retries if always gets retryable status code", so we should instead make sure to test for that.
We test for request count to hit 4, maybe we could also mock the operation dispatcher as a separate test and ensure that it's getting the right delays in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this! I've moved the log assertions to their own tests and created separate test to verify that the correct delay times are being sent to the OperationDispatcher.
).to(equal(1)) | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like we should get at least one backend post receipt test in with the exact case we want to cover (first request is a 429, second goes through).
We can also use our BackendIntegrationTests suite to actually force a 429, remove the forced 429, post something to the backend, ensure it went through.
OfflineStoreKitIntegrationTests
has some examples of doing something similar
@@ -131,10 +134,25 @@ extension NetworkStrings: LogMessage { | |||
case let .request_handled_by_load_shedder(path): | |||
return "Request was handled by load shedder: \(path.relativePath)" | |||
|
|||
case let .api_request_queued_for_retry(httpMethod, path, backoffInterval): | |||
return "Queued request \(httpMethod) \(path) for retry in \(backoffInterval) seconds." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to log the retry number in this message as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this, will add it in!
@MarkVillacampa pointed out that the forcedError stuff in the integration tests can probably be accomplished much more easily and safely with OhHTTPStubs, so I'm going to look at doing that! :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! But might be good to get some other checks as well
@MarkVillacampa I've made the following changes from your review:
|
] | ||
} | ||
|
||
func testVerifyPurchaseDoesntGrantEntitlementsAfter429RetriesExhausted() async throws { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way we can test that exactly 3 retries were made here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed this with Mark in Slack - we added in verifications for the number of times that the request was failed before eventually succeeding with the backend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small nits/thoughts but otherwise looks good 👍
func dispatchOnWorkerThread(delay: JitterableDelay = .none, block: @escaping @Sendable () -> Void) { | ||
if delay.hasDelay { | ||
self.workerQueue.asyncAfter(deadline: .now() + delay.random(), execute: block) | ||
} else { | ||
self.workerQueue.async(execute: block) | ||
} | ||
} | ||
|
||
func dispatchOnWorkerThread(delay: Delay = .none, block: @escaping @Sendable () async -> Void) { | ||
func dispatchOnWorkerThread(after timeInterval: TimeInterval, block: @escaping @Sendable () -> Void) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a small Josh nit but delay
and after
seem like very similar argument names and I would probably be confused by trying to figure out the difference between these 😅
Not a thing that has to change in the PR but just a thought 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree with you! The only difference is that delay: JitterableDelay
doesn't let you specify the exact delay and includes a random jitter, while after timeInterval: TimeInterval
allows you to specify the exact waiting time without a random jitter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can rename the param to jitterableDelay:
It might be slightly confusing on first glance but at least you'd hesitate long enough to actually think of which one to use
@@ -207,7 +214,7 @@ extension HTTPClient: @unchecked Sendable {} | |||
|
|||
// MARK: - Private | |||
|
|||
private extension HTTPClient { | |||
internal extension HTTPClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would using the @Testable import RevenueCat
thing have made this visible for testing? I can't exactly remember the correc usecase for it but figured I'd ask 😇
import OHHTTPStubs | ||
import OHHTTPStubsSwift |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I know very little about what our standard is for organizing import but seems like these could be grouped without a new line inbetween 🤷♂️
func dispatchOnWorkerThread(delay: JitterableDelay = .none, block: @escaping @Sendable () -> Void) { | ||
if delay.hasDelay { | ||
self.workerQueue.asyncAfter(deadline: .now() + delay.random(), execute: block) | ||
} else { | ||
self.workerQueue.async(execute: block) | ||
} | ||
} | ||
|
||
func dispatchOnWorkerThread(delay: Delay = .none, block: @escaping @Sendable () async -> Void) { | ||
func dispatchOnWorkerThread(after timeInterval: TimeInterval, block: @escaping @Sendable () -> Void) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can rename the param to jitterableDelay:
It might be slightly confusing on first glance but at least you'd hesitate long enough to actually think of which one to use
@@ -207,7 +214,7 @@ extension HTTPClient: @unchecked Sendable {} | |||
|
|||
// MARK: - Private | |||
|
|||
private extension HTTPClient { | |||
internal extension HTTPClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
honestly any time we're making private stuff internal for testing, it's a sign that it should likely be extracted into a separate internal entity instead. Not only for easier testing, but also to reduce the amount of stuff a single object (in this case, HTTPClient) is doing
} | ||
|
||
// MARK: - Request Retry Logic | ||
extension HTTPClient { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This extension has some logic that might be better suited to its own file, such that the only stuff that's remaining in HTTPClient is retryRequestIfNeeded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't need to be addressed in this PR, but would be a good follow-up
Motivation
In the past, the SDK has ignored HTTP Status 429s. This PR introduces a retry ability with an exponential backoff policy to the SDK's HTTPClient so that 429s can be retried.
Description
Configuration
The implementation allows consumers of the HTTP client to specify which HTTP status codes to retry for. The exponential backoff policy will retry requests up to three times, with the following backoff times:
Server-Defined Backoff Durations
If the server returns the
Retry-After
header with an integer value, the SDK will use that integer as the number of milliseconds to wait before retrying again instead of computing a backoff duration.New
X-Retry-After
Header For Outbound Network RequestsThis PR introduces a new
X-Retry-After
header for all outbound network requests, whose value is an integer. For the initial request (with no retries), the value is 0. For the first retry, the value is 1. For the second retry, the value is 2, and so on.Etag Management
To keep our etag behavior consistent, the backoff duration for a first retry will always be 0 seconds.
Logging
A debug message in the format
"Queued request \(httpMethod) \(path) for retry number \(retryNumber) in \(backoffInterval) seconds."
is logged when a request is queued for retry. If a request fails all retries, a message with the format"Request \(httpMethod) \(path) failed all \(retryCount) retries."
will be logged.Testing
A good number of unit tests have been written, for both the individual parts of the HTTP client introduced in this PR and for testing the entire request flow through the HTTPClient.