remote ros1 subscriber disconnect causes publisher to hang, doesn't publish to other subscribers that are still connected #206
Comments
…e still active subscribers- ought to collect the futures and await them separately, or restructure further to put every tcpstream write_all in a separate tokio task?
Spent some time looking at this yesterday and today.
Testing Rant
This really deserves a unit / integration test, but it ends up being remarkably hard to write one. Emulating a TCP socket disconnecting mid-test is proving challenging to replicate.
So I've not found a good way to fixture a unit / integration test around this.
Actual Solution
Looked into a few options, but tokio::sync::broadcast does look like exactly the correct solution for this. I went ahead and implemented that, and it seems to get the job done, but it introduced other complexities around clean-up / shutdown.
Alright: #208 is ready to rock with what I think is a full fix for this. I wasn't able to come up with a reasonable way to test this directly in CI, but I'm confident in the underlying re-work solving it. I'd love your eyes on the code changes @lucasw and if you have a chance to test that branch it would be appreciated.
I think this is just a specific case of #14
Modify ros1_talker to publish at a much higher rate in a loop instead of a 1 second delay for only 50 counts:
Start a roscore and talker on one computer:
Then on another computer on the same network:
After letting it run a while, disconnect the remote computer: close the laptop lid if that causes the computer to sleep, or disconnect via network manager, or pull the ethernet cable, or similar. Obviously the remote rostopic echo stops working immediately, but the issue is that the echo on the same computer as the publisher also stops working after a few seconds (9 seconds in one test, 900 messages after the disconnect given the 100 Hz update rate). Reconnecting usually recovers both echoes quickly.
Is this await blocking the loop to send data to the rest of the subscribers?
Maybe there are options to make it time out quickly? https://github.com/RosLibRust/roslibrust/blob/master/roslibrust/src/ros1/publisher.rs#L217
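To make the "time out quickly" idea concrete, here is a minimal sketch using only the standard library's blocking `TcpStream`, not the tokio stream the publisher actually uses. The helper names `write_with_timeout` and `is_timeout` are hypothetical and not part of roslibrust. In async code the closest equivalent would probably be wrapping the write in `tokio::time::timeout`, but that is an assumption about how a fix might look, not what the code does today.

```rust
use std::io::{self, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Hypothetical helper: write one message, but give up if the socket
/// stalls instead of blocking the publish loop indefinitely.
///
/// `limit` must be non-zero; `set_write_timeout` rejects a zero Duration.
/// The timeout applies to each underlying `write` call, so a large
/// message that needs several writes can take longer than `limit` overall.
pub fn write_with_timeout(
    stream: &mut TcpStream,
    msg: &[u8],
    limit: Duration,
) -> io::Result<()> {
    stream.set_write_timeout(Some(limit))?;
    stream.write_all(msg)
}

/// A stalled write is reported as `WouldBlock` or `TimedOut`,
/// depending on the platform.
pub fn is_timeout(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut
    )
}
```

After a timed-out `write_all` the message may have been partially written, so the simplest safe reaction is probably to drop that subscriber's connection rather than retry on the same stream.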
I can imagine a restructure where a broadcast channel with a limited queue size sends messages (ideally not cloned) to per-subscriber tokio tasks, so that if one hangs it won't bring down the others, but maybe there's a quicker fix.
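A rough sketch of that restructure, written with std bounded channels instead of tokio so it runs without extra crates. `FanOut` and its methods are hypothetical names, not roslibrust API. Each subscriber gets its own bounded queue that a separate writer thread (or task) would drain into its socket, and the message is shared through an `Arc` rather than cloned per subscriber.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Hypothetical fan-out: one bounded queue per subscriber.
pub struct FanOut {
    subscribers: Vec<SyncSender<Arc<Vec<u8>>>>,
    queue_size: usize,
}

impl FanOut {
    /// `queue_size` should be at least 1; a size of 0 makes a rendezvous
    /// channel, where `try_send` only succeeds if a receiver is waiting.
    pub fn new(queue_size: usize) -> Self {
        FanOut { subscribers: Vec::new(), queue_size }
    }

    /// Register a subscriber. The returned receiver is meant to be
    /// drained by that subscriber's own writer thread.
    pub fn subscribe(&mut self) -> Receiver<Arc<Vec<u8>>> {
        let (tx, rx) = sync_channel(self.queue_size);
        self.subscribers.push(tx);
        rx
    }

    pub fn subscriber_count(&self) -> usize {
        self.subscribers.len()
    }

    /// Never blocks. Returns how many subscribers accepted the message.
    pub fn publish(&mut self, msg: Vec<u8>) -> usize {
        let msg = Arc::new(msg);
        let mut delivered = 0;
        self.subscribers.retain(|tx| match tx.try_send(Arc::clone(&msg)) {
            Ok(()) => {
                delivered += 1;
                true
            }
            // Queue full: this subscriber is stalled, so it misses this
            // message, but the others are unaffected.
            Err(TrySendError::Full(_)) => true,
            // Receiver dropped: the writer exited, forget the subscriber.
            Err(TrySendError::Disconnected(_)) => false,
        });
        delivered
    }
}
```

The difference from `tokio::sync::broadcast` is where messages get dropped: here a stalled subscriber loses the newest messages once its queue is full, whereas a lagging broadcast receiver loses the oldest ones. Either way the publish loop itself never waits on a dead connection.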
This is all tested on Ubuntu 22.04/24.04.