Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve retry robustness #128

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

blakegong
Copy link

Background

In send_notification_batch, we start with self.connect() to self recover from potential connection related errors:

PyAPNs2/apns2/client.py

Lines 187 to 189 in 5e4a938

# Make sure we're connected to APNs, so that we receive and process the server's SETTINGS
# frame before starting to send notifications.
self.connect()

However, it is not enough, because the self recovery is done through:
APNsClient.connect() -> APNsClient._connection.connect() -> https://github.com/python-hyper/hyper/blob/b77e758f472f00b098481e3aa8651b0808524d84/hyper/http20/connection.py#L331-L347, which is basically:

def connect():
    with self._lock:
        if self._sock is not None:
            return

        # ... (omitted, the code proceeds to rebuild the connection otherwise)

In the cases where self._sock is in a weird state, this call sequence does not help with resetting the connection states, as self._sock does still hold a value despite no longer being able to actually doing meaningful work. As a result, after calling APNsClient.connect(), we still end up using a broken socket. And that scenario is not recoverable.

What we have observed in production was that the send_notification_batch() call just retries 3 times then raises ConnectionFailed 💀 .

Approach

This PR is an attempt to fix that, by forcing a full connection reset from the second retry onwards. We are currently running with a hack in production, by basically doing:

try:
    results = client.send_notification_batch(notifications, topic=bundle_id)
except ConnectionFailed:
    client._connection.close()  # the hack
    retry(...)

And this did work for us. The unrecoverable ConnectionFailed never happened to us after this was deployed.

Some thoughts

I still left the first try to be a "soft" connection reboot so that some other scenario might recover/reset faster, and also that keeps the first retry the same as the current code. But for sure we can change that to be a full reconnection too if you like 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant