fix: attempt to fix dangling broadcasts #428

Merged
merged 6 commits into from
Mar 25, 2024

jmlee337
Contributor

@jmlee337 jmlee337 commented Mar 14, 2024

  • remove unnecessary newFirstEvent check in broadcast reconnect
  • add retry logic while connecting/reconnecting

#427

…try logic while connecting

this should fix client-side causes of dangling broadcast project-slippi#427

note that I had some problems with electronmon intermittently detecting file changes for seemingly no reason, so I tested this with `electronmon` changed to just `electron` in the `"start:main"` rule of `package.json`
tested with local changes to trigger retry logic
- also fix possible bug with cleanup on stop(), since it's possible to have `this.wsConnection` and not `this.broadcastId`

again, tested with local changes to trigger retry logic
@jmlee337 jmlee337 changed the title from "Attempt to fix client-side causes of dangling broadcast #427" to "Attempt to fix client-side causes of dangling broadcast" Mar 14, 2024
@jmlee337
Contributor Author

Good news! I've confirmed in a long-running test that the retry logic works for real! Looks like it was able to recover after some sort of server hiccup?

09:23:22.335 (broadcast.worker)             > Error: write ECONNRESET
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
09:23:22.335 (broadcast.worker)             > WS connection closed: 1006, Socket Error: write ECONNRESET
09:23:22.340                                > Starting broadcast
09:23:22.341 (broadcast.worker)             > Connecting to WS service
09:23:24.352 (broadcast.worker)             > Retrying connecting sub step: 1 after 2000ms
09:23:28.357 (broadcast.worker)             > Retrying connecting sub step: 1 after 4000ms
09:23:36.358 (broadcast.worker)             > Retrying connecting sub step: 1 after 8000ms
09:23:43.381 (broadcast.worker)             > WS failed to connect
09:23:43.381 (broadcast.worker)             > connect ETIMEDOUT 142.251.42.148:80
09:23:45.382 (broadcast.worker)             > WS failed to connect
09:23:45.382 (broadcast.worker)             > connect ETIMEDOUT 142.251.42.148:80
09:23:49.387 (broadcast.worker)             > WS failed to connect
09:23:49.387 (broadcast.worker)             > connect ETIMEDOUT 142.251.42.148:80
09:23:52.363 (broadcast.worker)             > Retrying connecting sub step: 1 after 16000ms
09:23:57.391 (broadcast.worker)             > WS failed to connect
09:23:57.392 (broadcast.worker)             > connect ETIMEDOUT 142.251.42.148:80
09:24:13.417 (broadcast.worker)             > WS failed to connect
09:24:13.417 (broadcast.worker)             > connect ETIMEDOUT 142.251.42.148:80
09:24:24.362 (broadcast.worker)             > Retrying connecting sub step: 1 after 32000ms
09:24:45.405 (broadcast.worker)             > WS failed to connect
09:24:45.406 (broadcast.worker)             > connect ETIMEDOUT 142.251.42.148:80
09:25:28.365 (broadcast.worker)             > Retrying connecting sub step: 1 after 64000ms
09:25:49.440 (broadcast.worker)             > WS failed to connect
09:25:49.441 (broadcast.worker)             > connect ETIMEDOUT 142.251.222.20:80
09:27:36.371 (broadcast.worker)             > Retrying connecting sub step: 1 after 128000ms
09:27:57.413 (broadcast.worker)             > WS failed to connect
09:27:57.414 (broadcast.worker)             > connect ETIMEDOUT 142.251.222.20:80
09:31:52.376 (broadcast.worker)             > Retrying connecting sub step: 1 after 256000ms
09:31:52.610 (broadcast.worker)             > WS connection successful

But I think I should hook into the connectFailed handler instead of blindly retrying. Will work on that and update

also properly dispose of ws client
@jmlee337
Contributor Author

updated! tested by pulling my ethernet cord lol

21:40:06.776 (broadcast.worker)             > Error: write ECONNRESET
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
21:40:06.776 (broadcast.worker)             > WS connection closed: 1006, Socket Error: write ECONNRESET
21:40:06.782                                > Starting broadcast
21:40:06.784 (broadcast.worker)             > Connecting to WS service
21:40:06.787 (broadcast.worker)             > WS failed to connect
21:40:06.788 (broadcast.worker)             > getaddrinfo ENOTFOUND broadcast-dot-slippi.uc.r.appspot.com
21:40:08.790 (broadcast.worker)             > Retrying connecting sub step: 1 after 2000ms
21:40:08.794 (broadcast.worker)             > WS failed to connect
21:40:08.795 (broadcast.worker)             > getaddrinfo ENOTFOUND broadcast-dot-slippi.uc.r.appspot.com
21:40:12.798 (broadcast.worker)             > Retrying connecting sub step: 1 after 4000ms
21:40:12.799 (broadcast.worker)             > WS failed to connect
21:40:12.801 (broadcast.worker)             > getaddrinfo ENOTFOUND broadcast-dot-slippi.uc.r.appspot.com
21:40:20.800 (broadcast.worker)             > Retrying connecting sub step: 1 after 8000ms
21:40:20.801 (broadcast.worker)             > WS failed to connect
21:40:20.801 (broadcast.worker)             > getaddrinfo ENOTFOUND broadcast-dot-slippi.uc.r.appspot.com
21:40:36.802 (broadcast.worker)             > Retrying connecting sub step: 1 after 16000ms
21:40:37.016 (broadcast.worker)             > WS connection successful
21:40:37.220 (broadcast.worker)             > {
  type: 'get-broadcasts-resp',
  broadcasts: [
    {
      id: 'wNn4p9p2LZbqPTKBZJuZHWSuELF2-pWqsMDWPg4SFofWvZtBim2',
      name: 'TEST#127',
      broadcaster: [Object]
    }
  ]
}
21:40:37.366 (broadcast.worker)             > {
  type: 'start-broadcast-resp',
  broadcastId: 'wNn4p9p2LZbqPTKBZJuZHWSuELF2-pWqsMDWPg4SFofWvZtBim2',
  recoveryGameCursor: 1968620
}
21:40:37.367 (broadcast.worker)             > Picking broadcast back up from 1968620. Last not sent: 1969772
21:40:37.368 (broadcast.worker)             > Backup events include range from: [1967972, 1969771]. Next cursor to be sent: 1968621
21:40:37.369 (broadcast.worker)             > Starting broadcast to: wNn4p9p2LZbqPTKBZJuZHWSuELF2-pWqsMDWPg4SFofWvZtBim2

Member

@NikhilNarayana NikhilNarayana left a comment

only two things stick out to me, otherwise this is looking pretty good

src/broadcast/broadcast_manager.ts (outdated)
@@ -305,10 +356,33 @@ export class BroadcastManager extends EventEmitter {
});

getBroadcasts().catch(console.warn);
const postSocketConnectingSubStepRetry = () => {
Member

this function is giving me pause atm. could you add some comments and explain why this needs to run constantly with setTimeout instead of only when getBroadcasts fails?

Contributor Author

getBroadcasts and startBroadcast only fail if something goes wrong client-side, that is, if we fail to send the message. We expect the server to respond to both of these messages, but because WebSockets simply deal with 'messages' or 'events' and don't use a request/response paradigm, there is no notion of the server failing to respond to a message. Setting a simple exponential backoff retry covers all possible cases:

  1. failure to send
  2. server sends an unexpected/invalid/error response
  3. server sends nothing
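
For illustration, the idea above can be sketched roughly as follows. This is not the code from this PR: `backoffDelayMs`, `sendWithRetry`, `Schedule`, and `Cancel` are hypothetical names, and the 2000 ms base delay with doubling is only inferred from the log output earlier in this thread. The timer is armed after every send attempt whether or not the send threw, and only the caller (on seeing the expected response) stops it, which is how a single mechanism covers all three cases.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the PR's code.
type Cancel = () => void;
type Schedule = (fn: () => void, delayMs: number) => Cancel;

// Delay before retry number `attempt` (0-based): 2000, 4000, 8000, ...
function backoffDelayMs(attempt: number, baseMs = 2000): number {
  return baseMs * 2 ** attempt;
}

// Default scheduler built on setTimeout; injectable so tests can fake time.
const timeoutSchedule: Schedule = (fn, delayMs) => {
  const handle = setTimeout(fn, delayMs);
  return () => clearTimeout(handle);
};

// Sends a message, then keeps resending it with exponential backoff until
// the returned callback is invoked (i.e. the expected response arrived).
function sendWithRetry(
  send: () => void,
  schedule: Schedule = timeoutSchedule,
  baseMs = 2000,
): Cancel {
  let attempt = 0;
  let finished = false;
  let cancelPending: Cancel = () => {};

  const trySend = () => {
    if (finished) return;
    try {
      send(); // case 1: may throw if the message cannot be sent
    } catch (err) {
      console.warn(err);
    }
    // Arm the timer regardless of the send outcome. Cases 2 and 3 (a bad
    // response, or no response) simply never call the cancel callback.
    cancelPending = schedule(trySend, backoffDelayMs(attempt, baseMs));
    attempt += 1;
  };

  trySend();

  return () => {
    finished = true;
    cancelPending();
  };
}
```

A caller would keep the returned callback and invoke it from the message handler once a valid response is seen, and also on `stop()`, so that no retry timer outlives the connection.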

Co-authored-by: Nikhil Narayana <[email protected]>
@NikhilNarayana NikhilNarayana changed the title from "Attempt to fix client-side causes of dangling broadcast" to "fix: attempt to fix dangling broadcasts" Mar 25, 2024
@NikhilNarayana NikhilNarayana merged commit c321aa9 into project-slippi:main Mar 25, 2024
4 checks passed
@jmlee337 jmlee337 deleted the dangling branch March 25, 2024 11:38