Improve retry robustness #128

blakegong · 2021-04-26T15:52:00Z

Background

In send_notification_batch, we start by calling self.connect() so that the client can recover on its own from connection-related errors:

Lines 187 to 189 in 5e4a938

    # Make sure we're connected to APNs, so that we receive and process the server's SETTINGS
    # frame before starting to send notifications.
    self.connect()

However, this is not enough, because the recovery goes through the following call chain:
APNsClient.connect() -> APNsClient._connection.connect() -> https://github.com/python-hyper/hyper/blob/b77e758f472f00b098481e3aa8651b0808524d84/hyper/http20/connection.py#L331-L347, which is essentially:

def connect(self):
    with self._lock:
        if self._sock is not None:
            return

        # ... (omitted, the code proceeds to rebuild the connection otherwise)

When self._sock is in a broken state, this call sequence does not reset the connection: self._sock still holds a value even though it can no longer do any meaningful work, so connect() returns early. As a result, after calling APNsClient.connect(), we still end up using the broken socket, and that scenario is not recoverable.
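To make the failure mode concrete, here is a minimal sketch using a toy class. `FakeConnection` and `connect_count` are hypothetical names invented for illustration, and the class only mimics the early-return guard described above; it is not hyper's implementation. It also assumes that close() discards the socket, which is what the production workaround relies on.

```python
import threading


class FakeConnection:
    """Toy stand-in for an HTTP/2 connection; NOT hyper's real class."""

    def __init__(self):
        self._lock = threading.RLock()
        self._sock = None
        self.connect_count = 0  # how many times a socket was actually built

    def connect(self):
        with self._lock:
            if self._sock is not None:
                return  # early exit: an existing socket is assumed healthy
            self._sock = object()  # placeholder for a real socket
            self.connect_count += 1

    def close(self):
        with self._lock:
            self._sock = None  # forget the socket so connect() rebuilds it


conn = FakeConnection()
conn.connect()  # builds the first socket
conn.connect()  # no-op, even if that socket has silently gone bad
stale_socket_kept = conn.connect_count == 1
conn.close()
conn.connect()  # only now is a fresh socket created
rebuilt_after_close = conn.connect_count == 2
```

In this toy model, repeated connect() calls never replace the socket; only an explicit close() in between forces a rebuild.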

What we observed in production was that send_notification_batch() retried 3 times and then raised ConnectionFailed 💀.

Approach

This PR attempts to fix that by forcing a full connection reset from the second retry onwards. We are currently running a hack in production that basically does:

try:
    results = client.send_notification_batch(notifications, topic=bundle_id)
except ConnectionFailed:
    client._connection.close()  # the hack
    retry(...)

This worked for us: the unrecoverable ConnectionFailed has not occurred since the hack was deployed.
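For readers who want to apply the same workaround on the caller side, the hack can be wrapped in a small retry helper. This is only a sketch: `send_with_reset` and `max_attempts` are hypothetical names, `ConnectionFailed` is redefined locally as a stand-in for the library's exception, and the stub client exists purely so the example runs without a real APNs connection.

```python
class ConnectionFailed(Exception):
    """Local stand-in for the library's ConnectionFailed exception."""


def send_with_reset(client, notifications, topic, max_attempts=3):
    """Retry a batch send, closing the connection after each failure.

    `send_with_reset` and `max_attempts` are hypothetical names for this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return client.send_notification_batch(notifications, topic=topic)
        except ConnectionFailed:
            # The hack: drop the socket so the next connect() really reconnects.
            client._connection.close()
            if attempt == max_attempts:
                raise


class _StubConnection:
    def __init__(self):
        self.closed = 0

    def close(self):
        self.closed += 1


class _StubClient:
    """Fails once with ConnectionFailed, then succeeds; for demonstration only."""

    def __init__(self):
        self._connection = _StubConnection()
        self.calls = 0

    def send_notification_batch(self, notifications, topic=None):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionFailed()
        return {n: "Success" for n in notifications}


client = _StubClient()
results = send_with_reset(client, ["token-a", "token-b"], topic="com.example.app")
```

Reaching into `client._connection` touches a private attribute, which is why this is a hack rather than a proper fix.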

Some thoughts

I left the first try as a "soft" reconnect so that other scenarios can still recover/reset faster, and so that the first retry stays the same as in the current code. We can certainly change that to a full reconnection too if you like 😃
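The soft-first, hard-later policy can be sketched roughly as follows. This is not the PR's diff: `reconnect`, `hard_reset_from`, and `CountingConnection` are hypothetical names, and the attempt at which the switch to a full reset happens is an assumption made for illustration.

```python
class CountingConnection:
    """Toy connection that records hard resets; not the library's class."""

    def __init__(self):
        self._sock = None
        self.hard_resets = 0

    def connect(self):
        if self._sock is None:
            self._sock = object()  # placeholder for a real socket

    def close(self):
        self._sock = None
        self.hard_resets += 1


def reconnect(connection, attempt, hard_reset_from=2):
    """Soft reconnect first; force a full reset from `hard_reset_from` onwards.

    `attempt` is 1-based. Both parameter names are hypothetical.
    """
    if attempt >= hard_reset_from:
        connection.close()  # full reset: discard the possibly broken socket
    connection.connect()  # soft path: no-op if a socket is still held


conn = CountingConnection()
for attempt in (1, 2, 3):
    reconnect(conn, attempt)
```

Passing `hard_reset_from=1` in this sketch would make every attempt a full reconnection, which corresponds to the alternative offered above.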

Improve retry robustness

b152fef
