Executor Segmentation fault when callback group is destroyed while spinning #2664

Open
msplr opened this issue Nov 7, 2024 · 20 comments
Labels: bug (Something isn't working)


msplr commented Nov 7, 2024

Bug report

Required Info:

  • Operating System: Ubuntu 24.04
  • Installation type: debian binaries
  • Version:
    • ros-jazzy-rclcpp Version: 28.1.5-1noble.20240922.085057
    • ros-rolling-rclcpp Version: 28.3.3-1noble.20240919.221549
  • DDS implementation: Fast-RTPS
  • Client library: rclcpp

Issue description

The rclcpp::Executor crashes when a timer with a non-default callback group is destroyed before the executor stops spinning.

The bug also appears if a callback group is created and destroyed without ever being added to a timer.

The bug does not manifest in the example below when:

  • The timer's callback group has a longer lifetime than the executor spin thread (e.g. declared in the outer scope; see the sketch after this list).
  • The timer uses the default callback group.
  • The callback group is moved to the timer factory.
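
To illustrate the first workaround, here is a minimal sketch (hypothetical, derived from the reproduction below and assuming the same node, log, and chrono-literals setup): the callback group is declared in the scope that owns the spin thread, so it outlives the executor.

// Sketch of the first workaround: cg outlives the executor spin thread.
auto cg = node->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);
{
  auto timer = node->create_timer(1s, [&log] { RCLCPP_INFO(log, "timer"); }, cg);
  std::this_thread::sleep_for(100ms);
}  // only the timer goes out of scope here; cg stays alive, so no SIGSEGV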

Steps to reproduce issue

See the full gist

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

using namespace std::chrono_literals;

int main(int argc, char** argv) {
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("test");
  auto log = node->get_logger();
  rclcpp::executors::SingleThreadedExecutor executor;
  auto spinThread = std::thread([&] {
    executor.add_node(node);
    executor.spin();
    executor.remove_node(node);
  });

  while (!executor.is_spinning()) {
    std::this_thread::sleep_for(1ms);
  }

  {
    RCLCPP_INFO(log, "BEGIN");
    auto cg = node->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);
    auto timer = node->create_timer(1s, [&log] { RCLCPP_INFO(log, "timer"); }, cg);  // SIGSEGV
    // auto timer = node->create_timer(1s, [&log] { RCLCPP_INFO(log, "timer"); }, std::move(cg)); // OK
    std::this_thread::sleep_for(100ms);  // wait a bit for the executor to catch up
    RCLCPP_INFO(log, "END");             // timer and cg go out of scope here
  }

  executor.cancel();
  if (spinThread.joinable()) {
    spinThread.join();
  }
  rclcpp::shutdown();
}

Expected behavior

The node runs and terminates without issue.

Actual behavior

The node fails with a Segmentation fault in rclcpp::Executor::spin().

Additional information

Backtrace:

Thread 16 "rclcpp_bug" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe90006c0 (LWP 66063)]
0x00007ffff72fd301 in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/jazzy/lib/librmw_fastrtps_shared_cpp.so
(gdb) bt
#0  0x00007ffff72fd301 in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/jazzy/lib/librmw_fastrtps_shared_cpp.so
#1  0x00007ffff735d8f8 in rmw_wait () from /opt/ros/jazzy/lib/librmw_fastrtps_cpp.so
#2  0x00007ffff7f6622c in rcl_wait () from /opt/ros/jazzy/lib/librcl.so
#3  0x00007ffff7de565f in ?? () from /opt/ros/jazzy/lib/librclcpp.so
#4  0x00007ffff7d08e00 in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/jazzy/lib/librclcpp.so
#5  0x00007ffff7d0962b in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/jazzy/lib/librclcpp.so
#6  0x00007ffff7d193a5 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/jazzy/lib/librclcpp.so
#7  0x0000555555703dc1 in operator() (__closure=0x555555b6e0a8) at /home/mspieler/work/brahma_ws/ros2_ws/src/rclcpp_bug/bug.cpp:16
#8  0x000055555570c934 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#9  0x000055555570c8f7 in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
#10 0x000055555570c8a4 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x555555b6e0a8) at /usr/include/c++/13/bits/std_thread.h:292
#11 0x000055555570c7ee in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x555555b6e0a8) at /usr/include/c++/13/bits/std_thread.h:299
#12 0x000055555570c4f0 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x555555b6e0a0) at /usr/include/c++/13/bits/std_thread.h:244
#13 0x00007ffff78ecdb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007ffff749ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#15 0x00007ffff7529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

msplr commented Nov 7, 2024

Possibly a duplicate of #2445


Barry-Xu-2018 commented Nov 8, 2024

I reproduced this issue in rolling.

GDB output

Thread 16 "rclcpp_2664" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe74006c0 (LWP 292621)]
0x00007ffff5b8de9d in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=0x7ffff6558140 "rmw_fastrtps_cpp", subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0, 
    clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0, wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:310
310	      if (!condition->get_trigger_value()) {
(gdb) bt
#0  0x00007ffff5b8de9d in rmw_fastrtps_shared_cpp::__rmw_wait (identifier=0x7ffff6558140 "rmw_fastrtps_cpp", subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0, 
    clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0, wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp:310
#1  0x00007ffff6548a5f in rmw_wait (subscriptions=0x555555919678, guard_conditions=0x555555919690, services=0x5555559196c0, clients=0x5555559196a8, events=0x5555559196d8, wait_set=0x5555559156a0, 
    wait_timeout=0x7fffe73fdc80) at /root/ros2_src_rolling/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_wait.cpp:33
#2  0x00007ffff69a4146 in rmw_wait (v7=0x555555919678, v6=0x555555919690, v5=0x5555559196c0, v4=0x5555559196a8, v3=0x5555559196d8, v2=0x5555559156a0, v1=0x7fffe73fdc80)
    at /root/ros2_src_rolling/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:578
#3  0x00007ffff6fbe161 in rcl_wait (wait_set=0x7fffffff2718, timeout=-1) at /root/ros2_src_rolling/src/ros2/rcl/rcl/src/rcl/wait.c:657
#4  0x00007ffff7b0fa8d in rclcpp::wait_set_policies::SequentialSynchronization::sync_wait<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > >(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, std::function<void ()>, std::function<rcl_wait_set_s& ()>, std::function<rclcpp::WaitResult<rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage> > (rclcpp::WaitResultKind)>) (this=0x7fffffff2718, time_to_wait_ns=std::chrono::duration = { -1ns }, rebuild_rcl_wait_set=..., 
    get_rcl_wait_set=..., create_wait_result=...) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/wait_set_policies/sequential_synchronization.hpp:280
#5  0x00007ffff7b08461 in rclcpp::WaitSetTemplate<rclcpp::wait_set_policies::SequentialSynchronization, rclcpp::wait_set_policies::DynamicStorage>::wait<long, std::ratio<1l, 1000000000l> > (this=0x7fffffff2718, 
    time_to_wait=std::chrono::duration = { -1ns }) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/wait_set_template.hpp:712
#6  0x00007ffff7af31bb in rclcpp::Executor::wait_for_work (this=0x7fffffff24b0, timeout=std::chrono::duration = { -1ns }) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:738
#7  0x00007ffff7af42a6 in rclcpp::Executor::get_next_executable (this=0x7fffffff24b0, any_executable=..., timeout=std::chrono::duration = { -1ns })
    at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executor.cpp:886
#8  0x00007ffff7b48d53 in rclcpp::executors::SingleThreadedExecutor::spin (this=0x7fffffff24b0) at /root/ros2_src_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/executors/single_threaded_executor.cpp:41
#9  0x00005555555639f4 in operator() (__closure=0x5555559155e8) at /root/ros2_src_rolling/src/bug_reproduce/rclcpp_bug/src/rclcpp_2664/rclcpp_2664.cpp:13
#10 0x0000555555565df0 in std::__invoke_impl<void, main(int, char**)::<lambda()> >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
#11 0x0000555555565db3 in std::__invoke<main(int, char**)::<lambda()> >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
#12 0x0000555555565d60 in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x5555559155e8) at /usr/include/c++/13/bits/std_thread.h:292
#13 0x0000555555565cee in std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > >::operator()(void) (this=0x5555559155e8) at /usr/include/c++/13/bits/std_thread.h:299
#14 0x0000555555565a6c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<main(int, char**)::<lambda()> > > >::_M_run(void) (this=0x5555559155e0) at /usr/include/c++/13/bits/std_thread.h:244
#15 0x00007ffff6d89bb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007ffff6afaa94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#17 0x00007ffff6b87c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

A guard condition that has already been freed is accessed.

It's currently suspected that the problem is related to the notify_guard_condition_ of the temporarily created callback group:

std::shared_ptr<rclcpp::GuardCondition> notify_guard_condition_ = nullptr;

The newly created callback group is stored in the node only as a weak pointer:

std::vector<rclcpp::CallbackGroup::WeakPtr> callback_groups_;
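
To illustrate the consequence, a standalone sketch (simplified stand-in types, not actual rclcpp code): because the node keeps only a weak reference, the user-held shared pointer is the sole owner, and the group's guard condition dies with it.

#include <cassert>
#include <memory>

struct GuardCondition {};

struct CallbackGroup {
  // mirrors notify_guard_condition_: owned by the group
  std::shared_ptr<GuardCondition> gc = std::make_shared<GuardCondition>();
};

int main() {
  std::weak_ptr<CallbackGroup> stored_in_node;  // mirrors callback_groups_
  {
    auto group = std::make_shared<CallbackGroup>();  // user-held owner
    stored_in_node = group;
    assert(stored_in_node.lock() != nullptr);  // alive while the user holds it
  }  // group and its guard condition are destroyed here
  assert(stored_in_node.lock() == nullptr);  // the node cannot keep it alive
}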

I will investigate further.

@mjcarroll (Member)

I was also able to reproduce this, but it does seem to happen exclusively with Fast-RTPS for me and not with the Cyclone middleware. Perhaps a sequencing issue? #2665

@fujitatomoya (Collaborator)

@mjcarroll @Barry-Xu-2018 I can only reproduce this issue with rmw_fastrtps, not with rmw_cyclonedds.

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Nov 8, 2024
@fujitatomoya (Collaborator)

Possibly a duplicate of #2445

According to the stack traces in #2664 (comment) and #2445 (comment), it looks like a different problem.


Barry-Xu-2018 commented Nov 11, 2024

The root cause:

The GuardCondition of the temporary callback group was added to Executor::notify_waitable_:

// Store node guard condition in map and add it to the notify waitable
auto group_guard_condition = group_ptr->get_notify_guard_condition();
weak_groups_to_guard_conditions_.insert({group_ptr, group_guard_condition});
this->notify_waitable_->add_guard_condition(group_guard_condition);

In Executor::collect_entities(), all guard conditions (including the interrupt, shutdown, node, and callback group guard conditions) in Executor::notify_waitable_ are added as a single waitable:

if (notify_waitable_) {
  current_notify_waitable_ = std::make_shared<rclcpp::executors::ExecutorNotifyWaitable>(
    *notify_waitable_);
  auto notify_waitable = std::static_pointer_cast<rclcpp::Waitable>(current_notify_waitable_);
  collection.waitables.insert({notify_waitable.get(), {notify_waitable, {}}});
}

In storage_rebuild_rcl_wait_set_with_sets(), this waitable is added to the wait set:

waitable_entry.waitable->add_to_wait_set(rcl_wait_set_);

Finally, ExecutorNotifyWaitable::add_to_wait_set() is called. You will find that it doesn't keep the guard condition locked. So when the callback group is freed, the related guard condition is freed as well, and if rmw_wait accesses the freed guard condition, this issue occurs.

for (auto weak_guard_condition : this->notify_guard_conditions_) {
  auto guard_condition = weak_guard_condition.lock();
  if (!guard_condition) {continue;}
  rcl_guard_condition_t * cond = &guard_condition->get_rcl_guard_condition();
  rcl_ret_t ret = rcl_wait_set_add_guard_condition(&wait_set, cond, NULL);
  if (RCL_RET_OK != ret) {
    rclcpp::exceptions::throw_from_rcl_error(
      ret, "failed to add guard condition to wait set");
  }
}

If the guard condition isn't freed, the executor is notified that the callback groups changed, and the guard condition is removed before the next call to rmw_wait(), so the issue doesn't occur.
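
Schematically, the unsafe pattern looks like this (a simplified standalone sketch with stand-in types, not the rclcpp code): the shared pointer is held only for the duration of add_to_wait_set(), while the raw pointer handed to the wait set outlives it.

#include <memory>
#include <vector>

struct GuardCondition { bool triggered = false; };

// Mimics the loop above: the weak_ptr is locked only for the duration of
// the call, but a raw pointer is stored in the wait set.
void add_to_wait_set(std::vector<GuardCondition *> & wait_set,
                     const std::weak_ptr<GuardCondition> & weak_gc)
{
  if (auto gc = weak_gc.lock()) {
    wait_set.push_back(gc.get());  // raw pointer outlives the shared_ptr
  }
}

int main() {
  std::vector<GuardCondition *> wait_set;
  auto gc = std::make_shared<GuardCondition>();  // owned by the callback group
  add_to_wait_set(wait_set, gc);
  gc.reset();  // callback group destroyed while the executor is still waiting
  // The rmw_wait() equivalent would now dereference wait_set[0]: this is the
  // use-after-free that crashes in rmw_fastrtps_shared_cpp::__rmw_wait().
}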

So the root cause is unrelated to the rmw implementation.
I haven't found a better solution yet (i.e. a suitable place in the code for locking and unlocking). Do you have any suggestions? @mjcarroll @fujitatomoya

@fujitatomoya (Collaborator)

@Barry-Xu-2018 thank you for checking on this issue. Could you also enlighten me as to why this problem does not happen with CycloneDDS, if the problem is in rclcpp?

@Barry-Xu-2018 (Collaborator)

@fujitatomoya

@Barry-Xu-2018 thank you for checking on this issue. Could you also enlighten me as to why this problem does not happen with CycloneDDS, if the problem is in rclcpp?

It is related to the implementation of rmw_cyclonedds.

https://github.com/ros2/rmw_cyclonedds/blob/0c2904d725dade83cc0b1ffde931b46f9d28269f/rmw_cyclonedds_cpp/src/rmw_node.cpp#L4485-L4498

    // Detach guard conditions
    if (gcs) {
      for (size_t i = 0; i < gcs->guard_condition_count; i++) {
        auto x = static_cast<CddsGuardCondition *>(gcs->guard_conditions[i]);
        if (ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems)) {
          bool dummy;
          dds_take_guardcondition(x->gcondh, &dummy);
          trig_idx++;
        } else {
          gcs->guard_conditions[i] = nullptr;
        }
        nelems++;
      }
    }

Based on the debugging results, when the guard condition is released, ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems) in rmw_wait() is not satisfied, so rmw_cyclonedds doesn't access the freed guard condition pointer.

@Barry-Xu-2018 (Collaborator)

this->wait_result_.emplace(wait_set_.wait(timeout));

There's no way to directly lock the guard conditions of the callback groups inside wait_set_.wait().
So I considered locking the callback groups before wait_set_.wait() and releasing them after the call returns.
Modified as follows:
Barry-Xu-2018@fbe602f#diff-e09e1b6262f7c9c86a3fc1a128c0da842729af46311955be86d1c97bf2558fbeR747

@DiogoRoseira

Possibly the same problem as ros2/demos#654

@Barry-Xu-2018 (Collaborator)

Possibly the same problem as ros2/demos#654

Not sure. After this issue is fixed, we can check whether that problem still occurs.

@Barry-Xu-2018 (Collaborator)

@mjcarroll @fujitatomoya @clalancette

The root cause has been found; please look at #2664 (comment).
As for the solution, I'd like to hear your opinions. I have one (#2664 (comment)), but I think there might be a better approach.

@DiogoRoseira

Possibly the same problem as ros2/demos#654

Not sure. After this issue is fixed, we can check whether that problem still occurs.

My code is not the same as in ros2/demos#654, but I tested not calling reset() on the callback group in on_cleanup(), and the program doesn't throw a segmentation fault anymore, so I think it is the same bug.

@fujitatomoya (Collaborator)

Based on the debugging results, when the guard condition is released, ws->trigs[trig_idx] == static_cast<dds_attach_t>(nelems) in rmw_wait() is not satisfied.

It looks like rmw_cyclonedds checks the waitset trigger index after the wait call; if the index does not match, it sets the guard_conditions entry to nullptr without accessing it.
On the other hand, rmw_fastrtps expects all of these entities to exist for the duration of the rmw_wait call. (I think rmw_connextdds has the same expectation.)

I haven't found a better solution yet (i.e. a suitable place in the code for locking and unlocking). Do you have any suggestions?

To be honest, I am not really sure either, including whether this approach is the appropriate path...

@mjcarroll @alsora @jmachowinski any thoughts?


jmachowinski commented Nov 22, 2024

This is a design failure in the ExecutorNotifyWaitable.
From my point of view this is clearly an rclcpp issue, not an RMW issue.

Before creating the rmw waitset, we acquire all shared pointers to make sure that they don't go out of scope during the rmw_wait. The BIG assumption here is that if I hold a valid shared pointer to anything inside the collection, none of the rcl primitives will go out of scope. Therefore all pointers inside the rmw waitset should be valid during the rmw_wait, and it is totally fine to access them during the rmw_wait.

The bug is in the ExecutorNotifyWaitable, as it only holds weak pointers to the guard conditions. Therefore the GuardConditions can go out of scope during the rmw_wait, causing the segfault.

As for a fix for jazzy, I would suggest something like this:

void
Executor::wait_for_work(std::chrono::nanoseconds timeout)
{
  TRACETOOLS_TRACEPOINT(rclcpp_executor_wait_for_work, timeout.count());

  // Clear any previous wait result
  this->wait_result_.reset();

  // we need to make sure that callback groups don't get out of scope
  // during the wait. As in jazzy, they are not covered by the DynamicStorage,
  // we explicitly hold them here as a bugfix
  std::vector<rclcpp::CallbackGroup::SharedPtr> cbgs;

  {
    std::lock_guard<std::mutex> guard(mutex_);

    if (this->entities_need_rebuild_.exchange(false) || current_collection_.empty()) {
      this->collect_entities();
    }

    auto callback_groups = this->collector_.get_all_callback_groups();
    // reserve (not resize): resize would fill the vector with empty
    // shared_ptrs before the push_backs below
    cbgs.reserve(callback_groups.size());
    for (const auto & w_ptr : callback_groups) {
      auto shr_ptr = w_ptr.lock();
      if (shr_ptr) {
        cbgs.push_back(std::move(shr_ptr));
      }
    }
  }
  this->wait_result_.emplace(wait_set_.wait(timeout));
  if (!this->wait_result_ || this->wait_result_->kind() == WaitResultKind::Empty) {
    RCUTILS_LOG_WARN_NAMED(
      "rclcpp",
      "empty wait set received in wait(). This should never happen.");
  } else {
    // drop references to the callback groups, before trying to execute anything
    cbgs.clear();
    if (this->wait_result_->kind() == WaitResultKind::Ready && current_notify_waitable_) {
      auto & rcl_wait_set = this->wait_result_->get_wait_set().get_rcl_wait_set();
      if (current_notify_waitable_->is_ready(rcl_wait_set)) {
        current_notify_waitable_->execute(current_notify_waitable_->take_data());
      }
    }
  }
}

@fujitatomoya (Collaborator)

@jmachowinski thanks!

@Barry-Xu-2018 can you try #2683 to see if that fixes the problem?

fujitatomoya added the bug label Nov 22, 2024
@Barry-Xu-2018 (Collaborator)

@fujitatomoya

@jmachowinski's solution is almost the same as what I had in mind (#2664 (comment)), and I've already verified that this method fixes the problem.
Currently, acquiring ownership is done in a single place:

// ensure the ownership of the entities in the wait set is shared for the duration of wait
this->storage_acquire_ownerships();

The fix, however, handles the callback groups separately in a different place, which is why I was wondering whether there might be a better approach.


jmachowinski commented Nov 25, 2024

@Barry-Xu-2018 Sorry, I missed your solution...

I think this was already written somewhere else, but just to sum up: my patch is just a quick way to fix the problem in rolling and jazzy.

For rolling, we should afterwards implement a cleaner solution, like locking the callback groups in the DynamicStorage.

@Barry-Xu-2018 (Collaborator)

@jmachowinski

It's okay. Our goal is to fix this problem. 😄

For rolling, we should afterwards implement a cleaner solution, like locking the callback groups in the DynamicStorage.

Totally agree. It is better to lock the callback groups in the DynamicStorage.
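
For reference, a rough sketch of what that cleaner solution might look like (hypothetical; the members weak_callback_groups_ and owned_callback_groups_ are assumptions for illustration, not the actual rclcpp API — only storage_acquire_ownerships()/storage_release_ownerships() are named in the thread above):

#include <memory>
#include <vector>
#include <rclcpp/callback_group.hpp>

// Hypothetical fragment: track callback groups in the storage policy and
// take shared ownership for the duration of the wait, like the other
// stored entities.
class DynamicStorage
{
  std::vector<rclcpp::CallbackGroup::WeakPtr> weak_callback_groups_;      // assumed member
  std::vector<rclcpp::CallbackGroup::SharedPtr> owned_callback_groups_;   // assumed member

  void storage_acquire_ownerships()
  {
    // ... existing shared-ownership acquisition for subscriptions, timers, etc. ...
    for (const auto & weak_group : weak_callback_groups_) {
      if (auto group = weak_group.lock()) {
        owned_callback_groups_.push_back(std::move(group));  // held across the wait
      }
    }
  }

  void storage_release_ownerships()
  {
    owned_callback_groups_.clear();  // drop shared ownership after the wait
    // ... existing releases ...
  }
};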
