Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retries to apple push notifications #128

Merged
merged 3 commits into from
Sep 24, 2024
Merged

Add retries to apple push notifications #128

merged 3 commits into from
Sep 24, 2024

Conversation

larkox
Copy link
Contributor

@larkox larkox commented Sep 17, 2024

Summary

Add retries to apple notifications. This may solve some RequestError errors we see in the metrics.

Ticket Link

https://mattermost.atlassian.net/browse/MM-60572

@larkox larkox added the 1: Dev Review Requires review by a core commiter label Sep 17, 2024
@larkox larkox requested review from marianunez and enahum September 17, 2024 15:22
Comment on lines 276 to 279
// Set the retry context timeout to a value that allow us to retry the notification
// the MAX RETRIES with exponential backoff. With default values, this timeout
// will be 5 seconds.
retryContextTimeout := me.sendTimeout / (MAX_RETRIES * 2)
Copy link
Member

@marianunez marianunez Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry value would be 60 secs? So the timeout for the context here is 10 secs ? Just wondering is that what you intended

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me.sendTimeout is by default 30s. MAX_RETRIES is 3.

30 / (3* 2) = 5 seconds

So... worst case scenario with the defaults...

  • 5 seconds trying the first time and fail because the context exceeded the deadline
  • Wait 1 second
  • 5 seconds trying the second time and fail because the context exceeded the deadline
  • Wait 2 seconds
  • 5 seconds trying the third time and fail because the context exceeded the deadline
  • Return error

A total of 18s, more than enough to fit in the 30s timeout.

I can do something more hardcoded, or tweak the numbers if we prefer. I don't have a strong opinion on these.

Copy link
Member

@marianunez marianunez Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we confident that 5 seconds is enough to set the timeout for the apple client call? It seems we are going from 30 secs per call to only 5 secs per call. In the dashboard we saw there were times it would take up to 10 secs.

Just trying to understand why we are trying to fit all the retry logic within the initial configuration of time out per send?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. The sendTimeout is the timeout for a single send. It should not be made to limit the time including retries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My worry is that 30 seconds 3 times is +90 seconds the server waiting for the push proxy response. The server problably has also a context deadline, so not sure if we want to wait that long.

But if you have a strong feeling we should just use the configured timeout for each retry, I am happy to do it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about semantics. The sendTimeout should be the timeout for a single request. If you want to have a separate timeout including all the retries, then that should be a separate config setting.

The server timeout is hardcoded to 30s currently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new config, and use both to properly handle this. I set it to 8 seconds, since it is the highest integer we can use that will fit comfortably in the 30 seconds (8*3 + 1 + 2 = 27s).

@enahum enahum requested a review from agnivade September 18, 2024 02:23
Copy link
Member

@agnivade agnivade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think retries are better implemented with a roundTripper as done here: https://github.com/hashicorp/go-retryablehttp/blob/main/roundtripper.go. Or perhaps, we can simply use that library to do the retry.

Comment on lines 276 to 279
// Set the retry context timeout to a value that allow us to retry the notification
// the MAX RETRIES with exponential backoff. With default values, this timeout
// will be 5 seconds.
retryContextTimeout := me.sendTimeout / (MAX_RETRIES * 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. The sendTimeout is the timeout for a single send. It should not be made to limit the time including retries.

@agnivade
Copy link
Member

Also, note that the server might return a 429: https://developer.apple.com/documentation/usernotifications/handling-notification-responses-from-apns#Interpret-header-responses, in which case we should slow down, and retrying further might worsen the condition. This will require usage of a rate limiter. Probably something to do as a future improvement.

@larkox
Copy link
Contributor Author

larkox commented Sep 19, 2024

@agnivade A 429 is handled by the library, and doesn't become an error, but a "not sent" response with a ReasonTooManyRequests reason. That would not get retried.

@larkox
Copy link
Contributor Author

larkox commented Sep 19, 2024

I think retries are better implemented with a roundTripper as done here: https://github.com/hashicorp/go-retryablehttp/blob/main/roundtripper.go. Or perhaps, we can simply use that library to do the retry.

My problem with this approach is that the library already does plenty of stuff that I don't want to mess with. So I don't need the retry at the http client level, but higher. My idea is mainly retry when there is a network hiccup. Most of the errors we have seen like this are context deadline errors, which I don't think are automatically retried in these circumstances either.

@agnivade
Copy link
Member

That's okay. Not a big deal.

Copy link
Member

@marianunez marianunez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @larkox!

Copy link
Member

@agnivade agnivade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small thing to verify in config.

if cfg.RetryTimeoutSec == 0 {
cfg.RetryTimeoutSec = 8
}

if cfg.EnableFileLog {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a check here to verify that the retryTimeout isn't greater than the sendTimeout.

start := time.Now()

retryContext, cancelRetryContext := context.WithTimeout(generalContext, me.retryTimeout)
defer cancelRetryContext()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor note: having defers in a for loop has a slight problem that it will keep getting accumulated until the whole function exits. But since this loop has a hardcoded upper bound, this is okay. Otherwise, it would be recommended to cancel as soon as one iteration finishes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability sake, I feel it is better to keep it as is. As you said, the loop is bound, so it should not be a big issue.

@larkox larkox requested a review from agnivade September 24, 2024 10:45
@larkox larkox added 2: Reviews Complete All reviewers have approved the pull request and removed 1: Dev Review Requires review by a core commiter labels Sep 24, 2024
@larkox larkox merged commit 681c1cb into master Sep 24, 2024
5 checks passed
@larkox larkox deleted the appleRetry branch September 24, 2024 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants