Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rclcpp::exceptions::RCLError / rcl_wait() failed: error not set #1977

Closed
doisyg opened this issue Aug 1, 2022 · 21 comments
Closed

rclcpp::exceptions::RCLError / rcl_wait() failed: error not set #1977

doisyg opened this issue Aug 1, 2022 · 21 comments

Comments

@doisyg
Copy link

doisyg commented Aug 1, 2022

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • bin deb
  • Version or commit hash:
    • last Humble sync
  • DDS implementation:
    • Cyclone + localhost_only
  • Client library (if applicable):
    • rclcpp

Don't know if it is rclcpp or something nav2 specific, already posted here: ros-navigation/navigation2#3097

I get an rclcpp::exceptions::RCLError / rcl_wait() failed: error not set error randomly during endurance testing on a real robot on amcl, controller_server and some other navigation node. All were working fine under galactic.
Struggling to understand the origin of it.

Jul 27 02:14:09 bash[813572]: [controller_server-66] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
Jul 27 02:14:09 bash[813572]: [controller_server-66] what(): rcl_wait() failed: error not set
Jul 27 02:14:09 bash[813572]: [controller_server-66] Stack trace (most recent call last) in thread 816228:
Jul 27 02:14:09 bash[813572]: [controller_server-66] ros-navigation/navigation2#14 Object "", at 0xffffffffffffffff, in
Jul 27 02:14:09 bash[813572]: [controller_server-66] ros-navigation/navigation2#13 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc4eee369ff, in
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#12 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc4eeda4b42, in
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#11 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc4ef0342c2, in
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#10 Object "/opt/ros/humble/lib/librclcpp.so", at 0x7fc4ef44dc80, in rclcpp::executors::SingleThreadedExecutor::spin()
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#9 Object "/opt/ros/humble/lib/librclcpp.so", at 0x7fc4ef445972, in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#8 Object "/opt/ros/humble/lib/librclcpp.so", at 0x7fc4ef4435cd, in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#7 Object "/opt/ros/humble/lib/librclcpp.so", at 0x7fc4ef43f9c8, in rclcpp::exceptions::throw_from_rcl_error(int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, rcutils_error_state_s const*, void (*)())
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#6 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc4ef00627d, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc4ef0062f6, in std::terminate()
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#4 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc4ef00628b, in
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#3 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7fc4eeffabfd, in
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#2 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc4eed387f2, in abort
Jul 27 02:14:10 bash[813572]: [controller_server-66] ros-navigation/navigation2#1 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc4eed52475, in raise
Jul 27 02:14:10 bash[813572]: [controller_server-66] #0 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fc4eeda6a7c, in pthread_kill
Jul 27 02:14:10 bash[813572]: [controller_server-66] Aborted (Signal sent by tkill() 814467 1000)
Jul 27 02:14:10 bash[813572]: [ERROR] [controller_server-66]: process has died [pid 814467, exit code -6, cmd '/opt/wyca/elodie/nav2_controller/lib/nav2_controller/controller_server --ros-args --params-file /home/wyca/.wyca/nav2_params/controller_server_params.yaml --params-file /home/wyca/.wyca/nav2_params/local_costmap_params.yaml --params-file /home/wyca/.wyca/nav2_params/costmaps_plugins_params.yaml --params-file /tmp/launch_params_est2liq2 -r /cmd_vel:=/twist_mux_root/twist_nav'].

@liangyuwei
Copy link

Ran into a similar situation with the same exception thrown. Not sure why it happened, we simply use a workaround to spin with try-catch for RCLError.

@doisyg
Copy link
Author

doisyg commented Aug 6, 2022

@liangyuwei thanks for confirming it.

Not sure why it happened, we simply use a workaround to spin with try-catch for RCLError.

Where did you put this try-catch ? In a fork of rclcpp ?

@fujitatomoya
Copy link
Collaborator

@liangyuwei do you also use rmw_cyclonedds?

@doisyg this could be related to rmw_wait as implementation, so it would be worth to try rmw_fastrtps to see if the same problem can be observed. (see https://github.com/ros2/rmw_fastrtps#getting-started)

Where did you put this try-catch ? In a fork of rclcpp?

i think user application code to catch the exception from spin.

@doisyg
Copy link
Author

doisyg commented Aug 15, 2022

@doisyg this could be related to rmw_wait as implementation, so it would be worth to try rmw_fastrtps to see if the same problem can be observed. (see https://github.com/ros2/rmw_fastrtps#getting-started)

@fujitatomoya Currently cannot test with rmw_fastrtps as this related issue ros-navigation/navigation2#3014 is still causing our system to crash at startup.

@SteveMacenski
Copy link
Collaborator

rclcpp::Executor::wait_for_work does that give anyone some contextual ideas from the traceback?

@doisyg
Copy link
Author

doisyg commented Aug 18, 2022

@arslogavitabrevis also reported the issue on the nav2 ticket. Can you confirm OS, ROS version and DDS used ?

@arslogavitabrevis
Copy link

arslogavitabrevis commented Aug 19, 2022

We are using Cyclone DDS, Ubuntu 22.04 and Humble from debian binary. We also encontered this error during endurance tests on navigation2 nodes.

@iuhilnehc-ynos
Copy link
Collaborator

iuhilnehc-ynos commented Aug 19, 2022

There are no reproduced steps, I am not sure if the following path is the error case:

rcl_wait -> rcl_timer_get_time_until_next_call -> rcutils_atomic_load_bool (updated 5 months ago) -> return RCL_RET_TIMER_CANCELED without setting a error state. (The log showed failed: error not set)

https://github.com/ros2/rcl/blob/191b13bc2eb8578e40799eb6b50c74b885e7066e/rcl/src/rcl/wait.c#L573-L576

      ret = rcl_timer_get_time_until_next_call(wait_set->timers[i], &timer_timeout);
      if (ret != RCL_RET_OK) {
        return ret;  // The rcl error state should already be set.
      }

It seems we should deal with RCL_RET_TIMER_CANCELED.

@ladianchad
Copy link

I had the same error in planner_server. I'm using fast DDS, ubuntu 22.04, and humble.

[planner_server-3] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[planner_server-3]   what():  rcl_wait() failed: error not set
[ERROR] [planner_server-3]: process has died [pid 13889, exit code -6, cmd '/opt/floatic/workspace/dependency_ws/install/nav2_planner/lib/nav2_planner/planner_server --ros-args -r __node:=planner_server -r __ns:=/navigation --params-file /tmp/tmp7i39them --params-file /tmp/launch_params_06b_heu1 -r tf:=/tf -r tf_static:=/tf_static -r cmd_vel:=/nav_vel -r odom:=/odom -r goal_pose:=/goal_pose'].
[INFO] [planner_server-3]: process started with pid [27679]
[planner_server-3] [INFO] [1661392904.274648775] [navigation.planner_server]: 
[planner_server-3]      planner_server lifecycle node launched. 
[planner_server-3]      Waiting on external lifecycle transitions to activate
[planner_server-3]      See https://design.ros2.org/articles/node_lifecycle.html for more information.
[planner_server-3] [INFO] [1661392904.279793060] [navigation.planner_server]: Creating
[planner_server-3] [INFO] [1661392904.282890912] [navigation.global_costmap.global_costmap]: 
[planner_server-3]      global_costmap lifecycle node launched. 
[planner_server-3]      Waiting on external lifecycle transitions to activate
[planner_server-3]      See https://design.ros2.org/articles/node_lifecycle.html for more information.

@ladianchad
Copy link

I requested pr to solve this problem at this ticket ros2/rcl#1003

@iuhilnehc-ynos
Copy link
Collaborator

iuhilnehc-ynos commented Aug 25, 2022

@fujitatomoya cc: @ivanpauno

I believe that @ladianchad's PR can solve this issue, but I suspected that rolling back the ros2/rcl#963 is the correct way.
Adding https://github.com/ros2/rcl/blob/191b13bc2eb8578e40799eb6b50c74b885e7066e/rcl/src/rcl/timer.c#L318-L320 can not avoid calling rcl_clock_get_now with a canceled timer. If it could happen(there is a small time window to cancel the timer between the check and getting time), is it still reasonable to add the checking inside?

Even if using rcl_timer_get_time_until_next_call to get the time, it seems necessary to check whether the timer is canceled or not.

The later checking the flag, the more accurate to process the timer event.
I think that's probably why @wjwwood implemented the timer at the beginning.

@liangyuwei
Copy link

liangyuwei commented Sep 5, 2022

@liangyuwei thanks for confirming it.

Not sure why it happened, we simply use a workaround to spin with try-catch for RCLError.

Where did you put this try-catch ? In a fork of rclcpp ?

@doisyg Sorry for the late reply, we do this:

while (rclcpp::ok()) {
    try {
        executor.spin();
    } catch (rclcpp::exceptions::RCLError & e) {
        RCLCPP_ERROR_STREAM(
            rclcpp::get_logger("rclcpp"),
            "executor spin error: " << e.what());
    }
}

@fujitatomoya
Copy link
Collaborator

I believe that backport to humble/galactic/foxy required,

then we can close this issue.

@iuhilnehc-ynos
Copy link
Collaborator

iuhilnehc-ynos commented Sep 5, 2022

@fujitatomoya
Copy link
Collaborator

@iuhilnehc-ynos thanks for pointing that out, you are correct. we should not check that something never returns it.

@fujitatomoya
Copy link
Collaborator

@ladianchad thanks for the contribution, backport for humble has been completed.

@doisyg i will close this issue, please re-open the issue if you find the problem.

@ros-discourse
Copy link

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/preparing-for-humble-sync-2022-09-09/27240/2

@pepeRossRobotics
Copy link

Hello, I am having a similar issue when working with a custom action server. I am wondering if this is somehow related or if my issue is unrelated.

The error happens after accepting the goal. It happens 1 in 20-30 action server executions so it is not easily reproducable.

rclcpp_action::GoalResponse handleGoal(
    const rclcpp_action::GoalUUID &,
    std::shared_ptr<const typename GotoTrackWaypoint::Goal> goal)
  {
    RCLCPP_INFO(get_logger(), "Receiving new goal.");
    last_goal = goal;
    return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE;
  }

I am using Galactic.

[INFO] [1668222934.457141422] [mgs_action_server]: Receiving new goal.
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  error not set, at /tmp/binarydeb/ros-galactic-rcl-3.1.3/src/rcl/service.c:287

This is the offending line:
https://github.com/ros2/rcl/blob/47753122084d9762e570761af95c5e123ececf37/rcl/src/rcl/service.c#L287

I tried to add a try-catch like @liangyuwei but the output is:
terminate called without an active exception without any other information.

@pepeRossRobotics
Copy link

Just to add to this. The error happens more often when the send_goal request comes from a device on the same WAN than from within the localhost.

@Lannister-Xiaolin
Copy link

Lannister-Xiaolin commented Mar 16, 2023

@pepeRossRobotics
Did you solve this problem on Galactic, I'm having the same issue

@pepeRossRobotics
Copy link

I realised I was using an onld Docker of Galactic that have not been update/upgrade in awhile. The solved the problem for me. Since I have moved to Humble, and it seems it is more stable now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants