
remote ros1 subscriber disconnect causes publisher to hang, doesn't publish to other subscribers that are still connected #206

Closed
lucasw opened this issue Nov 16, 2024 · 2 comments · Fixed by #208
Labels: bug


@lucasw
Contributor

lucasw commented Nov 16, 2024

I think this is just a specific case of #14

Modify ros1_talker to publish at a much higher rate in an endless loop, instead of only 50 counts with a 1 second delay:

    // for count in 0..50 {
    // Publish indefinitely at ~100 Hz instead of 50 messages at 1 Hz:
    let mut count = 0;
    loop {
        let mut msg = geometry_msgs::PointStamped::default();
        msg.header.seq = count;
        msg.point.x = count as f64;
        publisher.publish(&msg).await?;
        tokio::time::sleep(tokio::time::Duration::from_millis(10)).await;
        count += 1;
    }

Start a roscore and talker on one computer:

    roscore
    ROS_IP=master_ip cargo run --release --example ros1_talker
    rostopic echo /my_point

Then on another computer on the same network:

    ROS_MASTER_URI=http://master_ip:11311 ROS_IP=remote_computer_ip rostopic echo /my_point

After letting it run for a while, disconnect the remote computer: close the laptop lid if that puts it to sleep, disconnect via network manager, pull the ethernet cable, or similar. Obviously the remote rostopic echo stops working immediately, but the issue is that the echo on the same computer as the publisher also stops working after a few seconds (9 seconds in one test, i.e. about 900 messages after the disconnect given the 100 Hz publish rate). Reconnecting usually recovers both echoes quickly.

Is this await blocking the loop that sends data to the rest of the subscribers?
Maybe there are options to make it time out quickly? https://github.com/RosLibRust/roslibrust/blob/master/roslibrust/src/ros1/publisher.rs#L217
I can imagine a restructure where a broadcast channel with a limited queue size sends messages (ideally not cloned) to per-subscriber tokio tasks, so that if one subscriber hangs it doesn't bring down the others, but maybe there's a quicker fix. A rough sketch of that idea is below.
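To make the idea concrete, here is a minimal sketch of what such a restructure could look like, assuming tokio::sync::broadcast with a small bounded queue; handle_subscriber and the byte-level framing are hypothetical and not roslibrust's current API:

    use tokio::io::AsyncWriteExt;
    use tokio::net::TcpStream;
    use tokio::sync::broadcast;

    // One task per subscriber connection: a stalled or dead peer only stalls
    // this task, while the publish loop and the other subscribers keep going.
    async fn handle_subscriber(mut stream: TcpStream, mut rx: broadcast::Receiver<Vec<u8>>) {
        loop {
            match rx.recv().await {
                Ok(bytes) => {
                    if stream.write_all(&bytes).await.is_err() {
                        break; // peer is gone, drop this subscriber
                    }
                }
                // Bounded queue: a lagging subscriber loses old messages
                // instead of blocking the sender.
                Err(broadcast::error::RecvError::Lagged(_)) => continue,
                Err(broadcast::error::RecvError::Closed) => break,
            }
        }
    }

    // In the publish path, sending never waits on any individual TcpStream:
    // let (tx, _rx) = broadcast::channel::<Vec<u8>>(16);
    // let _ = tx.send(serialized_msg); // only errors if no receivers exist

Note that tokio's broadcast channel clones the value for each receiver, so avoiding copies entirely would mean sending something cheap to clone such as Arc<Vec<u8>> or bytes::Bytes.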

This is all tested on Ubuntu 22.04/24.04.

lucasw added a commit to lucasw/roslibrust that referenced this issue Nov 16, 2024
lucasw added a commit to lucasw/roslibrust that referenced this issue Nov 16, 2024
…e still active subscribers- ought to collect the futures and await them separately, or restructure further to put every tcpstream write_all in a separate tokio task?
Carter12s added the bug label Nov 24, 2024
@Carter12s
Collaborator

Spent some time looking at this yesterday and today.

Testing Rant

This really deserves a unit / integration test, but it ends up being remarkably hard to write one. Emulating a TCP socket disconnecting mid-test is proving challenging:

  • tc netem is hard to set up inside a Docker image; I couldn't figure out the kernel permissions here.
  • Tokio Turmoil seems like it would do exactly what we want, but we can't use it cleanly here because we want "real" communication with rosmaster while mocking the communication between nodes.
  • We could theoretically use Docker's networking controls to simulate this, but we'd have to switch CI to docker-in-docker, which is annoying to get set up.

So I've not found a good way to build a unit / integration test fixture around this.

Actual Solution

Looked into a few options, but tokio::broadcast does look like exactly the right solution for this. I went ahead and implemented it, and it seems to get the job done, but it introduced other complexities around clean-up / shutdown.
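For reference, the clean-up concern is roughly the following; this is a hypothetical illustration rather than the actual #208 code, assuming each per-subscriber task selects between the broadcast receiver and an explicit shutdown signal so the node can stop all tasks promptly (e.g. when the publisher is dropped or the topic is unadvertised):

    use tokio::sync::{broadcast, watch};

    async fn subscriber_loop(
        mut msgs: broadcast::Receiver<Vec<u8>>,
        mut shutdown: watch::Receiver<bool>,
    ) {
        loop {
            tokio::select! {
                msg = msgs.recv() => match msg {
                    Ok(_bytes) => { /* write to this subscriber's TcpStream here */ }
                    Err(broadcast::error::RecvError::Lagged(_)) => continue,
                    Err(broadcast::error::RecvError::Closed) => break, // publisher dropped
                },
                res = shutdown.changed() => {
                    // Explicit shutdown request (or the watch sender was dropped).
                    if res.is_err() || *shutdown.borrow() {
                        break;
                    }
                }
            }
        }
    }

The watch channel is only one option: dropping the broadcast sender already causes receivers to see Closed, but a dedicated signal keeps shutdown independent of the message channel's lifetime.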

@Carter12s
Collaborator

Alright: #208 is ready to rock with what I think is a full fix for this.

I wasn't able to come up with a reasonable way to test this directly in CI, but I'm confident in the underlying re-work solving it.

I'd love your eyes on the code changes, @lucasw, and if you have a chance to test that branch it would be appreciated.

lucasw added a commit to lucasw/roslibrust that referenced this issue Nov 29, 2024
lucasw added a commit to lucasw/roslibrust that referenced this issue Dec 2, 2024