Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Background
In
send_notification_batch
, we start withself.connect()
to self recover from potential connection related errors:PyAPNs2/apns2/client.py
Lines 187 to 189 in 5e4a938
However, it is not enough, because the self recovery is done through:
APNsClient.connect()
->APNsClient._connection.connect()
-> https://github.com/python-hyper/hyper/blob/b77e758f472f00b098481e3aa8651b0808524d84/hyper/http20/connection.py#L331-L347, which is basically:In the cases where
self._sock
is in a weird state, this call sequence does not help with resetting the connection states, asself._sock
does still hold a value despite no longer being able to actually doing meaningful work. As a result, after callingAPNsClient.connect()
, we still end up using a broken socket. And that scenario is not recoverable.What we have observed in production was that the
send_notification_batch()
call just retries 3 times then raisesConnectionFailed
💀 .Approach
This PR is an attempt to fix that, by forcing a full connection reset from the second retry onwards. We are currently running with a hack in production, by basically doing:
And this did work for us. The unrecoverable
ConnectionFailed
never happened to us after this was deployed.Some thoughts
I still left the first try to be a "soft" connection reboot so that some other scenario might recover/reset faster, and also that keeps the first retry the same as the current code. But for sure we can change that to be a full reconnection too if you like 😃