
Small Memory Leak with Controller Server, Planner Server, and AMCL #1889

Closed
Michael-Equi opened this issue Jul 27, 2020 · 20 comments

@Michael-Equi (Contributor) commented Jul 27, 2020

Bug report

  • Operating System:
    • Ubuntu 20.04
  • ROS2 Version:
    • ROS2 Foxy binaries
  • Version or commit hash:
    • Navigation2 built from source (c65e3aa)
  • DDS implementation:
    • Cyclone DDS (version 0.7.3)

Steps to reproduce issue

Run the Nav2 stack with the TB3 simulation for a long duration while recording memory usage using the system monitor.
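For a reproducible record instead of eyeballing the system monitor, resident memory can be sampled periodically from /proc (Linux-specific). This is only a sketch; the function name and the way you locate the server's PID (e.g. via pgrep) are illustrative, not part of the original report.

```shell
# sample_rss: print "<epoch-seconds> <VmRSS-kB>" for PID $1,
# every $2 seconds, $3 times. VmRSS is the resident set size in kB.
sample_rss() {
  pid="$1"; interval="$2"; samples="$3"
  i=0
  while [ "$i" -lt "$samples" ]; do
    # Stop sampling if the process has exited (e.g. after a crash).
    rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status") || break
    printf '%s %s\n' "$(date +%s)" "$rss"
    i=$((i + 1))
    sleep "$interval"
  done
}

# Example: sample the controller server once a minute for three hours:
# sample_rss "$(pgrep -f controller_server | head -n1)" 60 180 > rss.log
```

Plotting the two columns of the log makes a slow, steady slope (like the ~50 MB/h reported below) easy to distinguish from one-time warm-up growth.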

Expected behavior

Memory usage should reach a stable point at which it no longer increases on the controller, planner, and AMCL nodes, especially when the robot is not actively navigating or being moved around.

Actual behavior

  • Controller Server shows a consistent memory increase of about 55 MB/h (measured over 3 hours without moving the robot)
  • Planner Server shows a consistent memory increase of about 50 MB/h (measured over 3 hours without moving the robot)
  • AMCL shows a consistent memory increase of about 25 MB/h (measured over 3 hours without moving the robot)
  • gzserver and gzclient memory usage remains constant (perhaps this information helps in localizing the issue)

Additional information

I have been having issues with the controller and planner servers suddenly crashing after running for long periods of time, especially with limited robot motion or use. In exploring the cause, I noticed what appear to be minor memory leaks in the troublesome nodes. I tried getting the controller server to crash under gdb so I could get a more helpful segfault report, but have not been able to get it to crash under those conditions (really strange). Outside of gdb I have experienced frequent crashes of both the controller and planner servers after a couple of hours of limited use or idling.

While monitoring the controller server with gdb I did see some errors. They did not seem to result in a fatal segmentation fault, but they may help debug the issue:

>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:

  'guard_condition_handle not from this implementation, at /tmp/binarydeb/ros-foxy-rmw-cyclonedds-cpp-0.7.3/src/rmw_node.cpp:2713, at /tmp/binarydeb/ros-foxy-rcl-1.1.6/src/rcl/guard_condition.c:160'

with this new error message:

  'guard condition implementation is invalid, at /tmp/binarydeb/ros-foxy-rcl-1.1.6/src/rcl/guard_condition.c:171'

rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1595816288.567870974] [rcl]: Failed to get trigger guard condition in jump callback

and upon killing the controller server I got the error:

Thread 1 "controller_serv" received signal SIGINT, Interrupt.
futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555559aeae8) at ../sysdeps/nptl/futex-internal.h:183
183	../sysdeps/nptl/futex-internal.h: No such file or directory.

While running the node, I receive many repeated warnings such as the one below:
[planner_server-5] [INFO] [1595818100.434728681] [global_costmap.global_costmap_rclcpp_node]: Message Filter dropping message: frame 'scanner_link' at time 15239.380 for reason 'Unknown'

I am aware that the issues I am having could be caused by a multitude of factors, many of which may not be in the scope of navigation2. But since this seems to be a particularly "severe" issue with nav2-related software, I thought it would be best to start the issue here and see what similar experiences exist and what conditions are like on different computers/setups.

@SteveMacenski (Member) commented Jul 27, 2020

Interesting. I think step 1 would be to figure out whether it's even Navigation2 itself, or whether it's rclcpp or message filters that's leaking (which is my bet). Given that the 3 servers you identify are the only 3 servers that receive regular information even when static (in the form of sensor data), my guess is that rclcpp or message filters are leaking, not navigation2 itself.

Step 2 would be to replace the plugins (NavFn and DWB) with dummies that don't do anything, then see if it still happens. At that point you'd know whether the algorithms or the server is leaking. After that I'd go at it with a profiler, knowing where to look (algorithm, server, rclcpp). If it were just the controller and planner, I'd suspect costmap, since that is shared by both; with AMCL also leaking (and at a lower rate), I suspect ROS2 / message filters, since all three take in sensor data and AMCL takes in about half as much.

The rcutils warnings should be reported to the appropriate projects (not here) with a full traceback of where they were triggered from (to make sure it's a ROS2 error and not user error).

@SteveMacenski (Member):

Have you been able to look into this more?

@Michael-Equi (Contributor, Author):

I have not. I was able to ignore it through most of our testing but a solution is still in our backlog.

@mrunaljsarvaiya:

I ran into the same issue; I'll try to dig into this soon.
System configuration

  • Operating System:
    • Ubuntu Mate 20.04 on Rpi4
  • ROS2 Version:
    • ROS2 Foxy binaries
  • Version or commit hash:
    • Navigation2 built from source (fa1a64a)

@mrunaljsarvaiya commented Sep 17, 2020

This is the GDB backtrace on my system. I'm running ROS1 on a PC that runs the robot in Gazebo and supplies the relevant information to the nav stack, along with the clock topic, through the ros1_bridge. I've set the nav stack nodes' and ros1_bridge's use_sim_time param to True.

None of these seem to be caused by the nav stack; they're related to rclcpp and fastrtps in my case.

Backtrace for the planner server


#0  malloc_consolidate (av=av@entry=0xffffcc000020) at malloc.c:4475
#1  0x0000fffff7998c38 in _int_malloc (av=av@entry=0xffffcc000020, bytes=bytes@entry=2880) at malloc.c:3698
#2  0x0000fffff799a69c in __GI___libc_malloc (bytes=2880) at malloc.c:3066
#3  0x0000fffff7b539ac in operator new(unsigned long) () from /lib/aarch64-linux-gnu/libstdc++.so.6
#4  0x0000fffff09c45d8 in std::vector<float, std::allocator<float> >::_M_default_append(unsigned long) ()
   from /opt/ros/foxy/lib/libsensor_msgs__rosidl_typesupport_fastrtps_cpp.so
#5  0x0000fffff09c7f10 in sensor_msgs::msg::typesupport_fastrtps_cpp::cdr_deserialize(eprosima::fastcdr::Cdr&, sensor_msgs::msg::LaserScan_<std::allocator<void> >&) () from /opt/ros/foxy/lib/libsensor_msgs__rosidl_typesupport_fastrtps_cpp.so
#6  0x0000fffff70d0310 in ?? () from /opt/ros/foxy/lib/librmw_fastrtps_cpp.so
#7  0x0000fffff7088400 in rmw_fastrtps_shared_cpp::TypeSupport::deserialize(eprosima::fastrtps::rtps::SerializedPayload_t*, void*) ()
   from /opt/ros/foxy/lib/librmw_fastrtps_shared_cpp.so
#8  0x0000fffff6bebd5c in eprosima::fastrtps::SubscriberHistory::deserialize_change(eprosima::fastrtps::rtps::CacheChange_t*, unsigned int, void*, eprosima::fastrtps::SampleInfo_t*) () from /opt/ros/foxy/lib/libfastrtps.so.2
#9  0x0000fffff6befa28 in eprosima::fastrtps::SubscriberHistory::takeNextData(void*, eprosima::fastrtps::SampleInfo_t*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >&) () from /opt/ros/foxy/lib/libfastrtps.so.2
#10 0x0000fffff6be68f8 in eprosima::fastrtps::SubscriberImpl::takeNextData(void*, eprosima::fastrtps::SampleInfo_t*) ()
   from /opt/ros/foxy/lib/libfastrtps.so.2
#11 0x0000fffff7086434 in rmw_fastrtps_shared_cpp::_take(char const*, rmw_subscription_t const*, void*, bool*, rmw_message_info_t*, rmw_subscription_allocation_t*) () from /opt/ros/foxy/lib/librmw_fastrtps_shared_cpp.so
#12 0x0000fffff762ae3c in rcl_take () from /opt/ros/foxy/lib/librcl.so
#13 0x0000fffff7e4be94 in rclcpp::SubscriptionBase::take_type_erased(void*, rclcpp::MessageInfo&) () from /opt/ros/foxy/lib/librclcpp.so
#14 0x0000fffff7dc94ac in ?? () from /opt/ros/foxy/lib/librclcpp.so
#15 0x0000fffff7dca2c8 in rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) ()
   from /opt/ros/foxy/lib/librclcpp.so
#16 0x0000fffff7dcab14 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) () from /opt/ros/foxy/lib/librclcpp.so
#17 0x0000fffff7dcfa2c in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/foxy/lib/librclcpp.so
#18 0x0000fffff777703c in nav2_util::NodeThread::<lambda()>::operator() (__closure=<optimized out>)
    at /home/peanut/navigation2_ws/src/navigation2/nav2_util/src/node_thread.cpp:29
#19 std::__invoke_impl<void, nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > (
    __f=...) at /usr/include/c++/9/bits/invoke.h:60
#20 std::__invoke<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > (__fn=...)
    at /usr/include/c++/9/bits/invoke.h:95
#21 std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > >::_M_invoke<0> (this=<optimized out>) at /usr/include/c++/9/thread:244
#22 std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#23 std::thread::_State_impl<std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > > >::_M_run(void) (this=0xaaaaab398090) at /usr/include/c++/9/thread:195
#24 0x0000fffff7b8076c in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#25 0x0000fffff7c9e4fc in start_thread (arg=0xffffffffbf5f) at pthread_create.c:477
#26 0x0000fffff79f1f2c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78


Backtrace for controller_server

#0  0x0000000000000031 in ?? ()
#1  0x0000fffff7cfdddc in ?? () from /opt/ros/foxy/lib/librcl.so
#2  0x0000fffff7cfe9b8 in rcl_set_ros_time_override ()
   from /opt/ros/foxy/lib/librcl.so
#3  0x0000fffff7f53fd8 in rclcpp::TimeSource::set_clock(std::shared_ptr<builtin_interfaces::msg::Time_<std::allocator<void> > >, bool, std::shared_ptr<rclcpp::Clock>) () from /opt/ros/foxy/lib/librclcpp.so
#4  0x0000fffff7f55b20 in rclcpp::TimeSource::clock_cb(std::shared_ptr<rosgraph_msgs::msg::Clock_<std::allocator<void> > >) ()
   from /opt/ros/foxy/lib/librclcpp.so
#5  0x0000fffff7f5dd54 in std::_Function_handler<void (std::shared_ptr<rosgraph_msgs::msg::Clock_<std::allocator<void> > >), std::_Bind<void (rclcpp::TimeSource::*(rclcpp::TimeSource*, std::_Placeholder<1>))(std::shared_ptr<rosgraph_msgs::msg::Clock_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rosgraph_msgs::msg::Clock_<std::allocator<void> > >&&) ()
   from /opt/ros/foxy/lib/librclcpp.so
#6  0x0000fffff7f6bc10 in rclcpp::AnySubscriptionCallback<rosgraph_msgs::msg::Clock_<std::allocator<void> >, std::allocator<void> >::dispatch(std::shared_ptr<rosgraph_msgs::msg::Clock_<std::allocator<void> > >, rclcpp::MessageInfo const&)
    () from /opt/ros/foxy/lib/librclcpp.so
#7  0x0000fffff7f6f454 in rclcpp::Subscription<rosgraph_msgs::msg::Clock_<std::allocator<void> >, std::allocator<void>, rclcpp::message_memory_strategy::MessageMemoryStrategy<rosgraph_msgs::msg::Clock_<std::allocator<void> >, std::allocator<void> > >::handle_message(std::shared_ptr<void>&, rclcpp::MessageInfo const&) () from /opt/ros/foxy/lib/librclcpp.so
#8  0x0000fffff7ecd6c4 in ?? () from /opt/ros/foxy/lib/librclcpp.so
#9  0x0000fffff7ece2c8 in rclcpp::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) ()
   from /opt/ros/foxy/lib/librclcpp.so
#10 0x0000fffff7eceb14 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) ()
   from /opt/ros/foxy/lib/librclcpp.so
#11 0x0000fffff7ed3a2c in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/foxy/lib/librclcpp.so
#12 0x0000fffff7b6603c in nav2_util::NodeThread::<lambda()>::operator() (__closure=<optimized out>)
    at /home/peanut/navigation2_ws/src/navigation2/nav2_util/src/node_thread.cpp:29
#13 std::__invoke_impl<void, nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#14 std::__invoke<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > (
    __fn=...) at /usr/include/c++/9/bits/invoke.h:95
#15 std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > >::_M_invoke<0> (this=<optimized out>) at /usr/include/c++/9/thread:244
#16 std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#17 std::thread::_State_impl<std::thread::_Invoker<std::tuple<nav2_util::NodeThread::NodeThread(rclcpp::node_interfaces::NodeBaseInterface::SharedPtr)::<lambda()> > > >::_M_run(void) (this=0xaaaaab36c0f0) at /usr/include/c++/9/thread:195
#18 0x0000fffff787876c in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
#19 0x0000fffff79ab4fc in start_thread (arg=0xffffffffbeef) at pthread_create.c:477
#20 0x0000fffff763ef2c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

@SteveMacenski (Member):

I'm not sure what this has to do with the ticket? A memory leak doesn't result in a crash report via GDB.

CC @EduPonz @JaimeMartin see the fast-rtps crash report above.

@mrunaljsarvaiya:

I'm not sure what this has to do with the ticket? A memory leak doesn't result in a crash report via GDB.

CC @EduPonz @JaimeMartin see the fast-rtps crash report above.

I'm not too familiar with this, but if there is a memory leak, could that result in a process accessing memory that is being manipulated by another process (that caused the memory leak)?

@SteveMacenski (Member) commented Sep 17, 2020

No. That's not what this ticket is talking about. A memory leak is memory that is allocated but never freed, not an invalid access that would result in a buffer overflow or segmentation fault. As far as I'm aware, we've never had issues with either of those.

@maxlein (Contributor) commented Sep 18, 2020

We see similar behavior to @Michael-Equi's on Foxy deb packages.
[image]

We already saw similar behavior here, and also a seg fault issue here.

@SteveMacenski (Member) commented Oct 28, 2020

@maxlein is that from message filters, do you think? Did you capture any valgrind info when running these experiments, so we can home in on where the growth comes from?

@EduPonz commented Oct 29, 2020

I'm not sure what this has to do with the ticket? A memory leak doesn't result in a crash report via GDB.

CC @EduPonz @JaimeMartin see the fast-rtps crash report above.

Apologies, this went completely off my radar. The memory leak is reported for Cyclone DDS. I'll look into the crash right away!

@EduPonz commented Oct 29, 2020

@SteveMacenski @mrunaljsarvaiya Looking a bit closer at the GDB traces from here, I cannot see a Fast DDS related crash. From what I can see, there is an ongoing deserialization in the planner server at the time of the crash, which happens to be in the process of resizing a vector. The crash itself, however, happens in controller_server, more precisely when trying to access 0x0000000000000031:

#0  0x0000000000000031 in ?? ()
#1  0x0000fffff7cfdddc in ?? () from /opt/ros/foxy/lib/librcl.so
#2  0x0000fffff7cfe9b8 in rcl_set_ros_time_override ()

I don't know the details of rclcpp, but it seems the crash occurs while trying to set a clock.

@maxlein (Contributor) commented Oct 29, 2020

@maxlein is that then from message filters you think? Did you capture any valgrind info when running these experiments so we can hone in on where the growth is deriving from?

No, I didn't. Also, a core dump was sadly not generated...
I only know that we had such memory leaks before, they got fixed, and now they seem to be back, since the same nodes are affected again. So I only have a feeling that it may be related to tf2_ros::Buffer::waitForTransform and message filters.

@SteveMacenski (Member) commented Oct 29, 2020

@maxlein can you file a ticket in message filters about this? I believe you are correct: looking over the nodes that are growing, they all use message filters (the planner and controller via the costmap obstacle and voxel layers, and AMCL directly).

You could also easily verify that this is the case by modifying AMCL to not use message filters and show that it doesn't grow.

If that is the case, we should close this since the bug isn't in navigation2 but in message filters.

Edit: Here's the ticket: ros2/message_filters#49. @maxlein, please add any additional context or images like the one you showed above for them to debug with.


@EduPonz

#4  0x0000fffff09c45d8 in std::vector<float, std::allocator<float> >::_M_default_append(unsigned long) ()
   from /opt/ros/foxy/lib/libsensor_msgs__rosidl_typesupport_fastrtps_cpp.so
#5  0x0000fffff09c7f10 in sensor_msgs::msg::typesupport_fastrtps_cpp::cdr_deserialize(eprosima::fastcdr::Cdr&, sensor_msgs::msg::LaserScan_<std::allocator<void> >&) () from /opt/ros/foxy/lib/libsensor_msgs__rosidl_typesupport_fastrtps_cpp.so
#6  0x0000fffff70d0310 in ?? () from /opt/ros/foxy/lib/librmw_fastrtps_cpp.so
#7  0x0000fffff7088400 in rmw_fastrtps_shared_cpp::TypeSupport::deserialize(eprosima::fastrtps::rtps::SerializedPayload_t*, void*) ()
   from /opt/ros/foxy/lib/librmw_fastrtps_shared_cpp.so
#8  0x0000fffff6bebd5c in eprosima::fastrtps::SubscriberHistory::deserialize_change(eprosima::fastrtps::rtps::CacheChange_t*, unsigned int, void*, eprosima::fastrtps::SampleInfo_t*) () from /opt/ros/foxy/lib/libfastrtps.so.2
#9  0x0000fffff6befa28 in eprosima::fastrtps::SubscriberHistory::takeNextData(void*, eprosima::fastrtps::SampleInfo_t*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >&) () from /opt/ros/foxy/lib/libfastrtps.so.2
#10 0x0000fffff6be68f8 in eprosima::fastrtps::SubscriberImpl::takeNextData(void*, eprosima::fastrtps::SampleInfo_t*) ()
   from /opt/ros/foxy/lib/libfastrtps.so.2
#11 0x0000fffff7086434 in rmw_fastrtps_shared_cpp::_take(char const*, rmw_subscription_t const*, void*, bool*, rmw_message_info_t*, rmw_subscription_allocation_t*) () from /opt/ros/foxy/lib/librmw_fastrtps_shared_cpp.so
#12 0x0000fffff762ae3c in rcl_take () from /opt/ros/foxy/lib/librcl.so

There's an awful lot of fast-rtps references in there for it not to be an issue in fast-rtps.

@Jconn (Contributor) commented Oct 29, 2020

Hey, now that you mention it, I'm getting a crash very infrequently in a non-navigation node. I will file a ticket or add to an existing one in that repo. But I am using Cyclone, so I'm thinking this is unrelated to fastrtps.
[ERROR] [1603250140.055141528] [rclcpp]: Couldn't initialize rcl timer handle: failed to create guardcondition, at /home/mekaworks/foxy_ws/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2675, at /home/mekaworks/foxy_ws/src/ros2/rcl/rcl/src/rcl/guard_condition.c:90

@SteveMacenski (Member):

I'd file that with the rmw for cyclone

@SteveMacenski (Member):

@maxlein ros2/message_filters#49 (comment)

It would be interesting to retry the test with Rolling. We've fixed a few memory leaks in tf2_ros there, but we haven't been able to backport all of them to Foxy (since they break ABI; see ros2/geometry2#330 for an in-progress PR).

Is that something you could test? All you'd need to do is clone the master for geometry2 and run on that.
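Setting up that test might look like the sketch below: build tf2_ros from the geometry2 master branch in an overlay workspace and source it so the Nav2 nodes pick it up instead of the Foxy binaries. The workspace path is an assumption, and since the fixes break ABI, master may require rebuilding dependent packages as well.

```shell
# Overlay workspace for the patched tf2_ros (path is an assumption).
mkdir -p ~/geometry2_ws/src
git clone https://github.com/ros2/geometry2.git ~/geometry2_ws/src/geometry2

# Build against the Foxy underlay; master may need extra dependencies.
source /opt/ros/foxy/setup.bash
cd ~/geometry2_ws
colcon build --packages-up-to tf2_ros

# Source the overlay so it shadows the binary tf2_ros, then rerun the
# memory test with the same Nav2 launch as before.
source install/setup.bash
```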

@maxlein (Contributor) commented Nov 9, 2020

@SteveMacenski Yes, we will try that.
Hope we have time in the next few days.

@maxlein (Contributor) commented Nov 13, 2020

Seems like it's fixed in that PR:
[image]

@SteveMacenski (Member):

Sweetness, closing the ticket since it's not from Nav2, and the solution has been merged for all new distros and proposed for Foxy.
