"DROPPING PACKET" and "70% full, THROTTLING" #352

Open
James-R-Han opened this issue Aug 6, 2024 · 16 comments

@James-R-Han

James-R-Han commented Aug 6, 2024

Hello!

I am using:

  • OS0-128
  • ROS2 Humble (with CycloneDDS)
  • UDP Profile RNG15_RFL8_NIR8
  • Run configuration: ros2 launch ouster_ros sensor.launch.xml sensor_hostname:=192.168.131.18 use_system_default_qos:=true timestamp_mode:=TIME_FROM_ROS_TIME sensor_qos_profile:=reliable proc_mask:="IMG|PCL" viz:=false

I experience the following warnings:
[screenshot of the "DROPPING PACKET" and "70% full, THROTTLING" warning messages]

Any advice on how to handle the situation?

Thank you in advance!

@kavishmshah

Hi,
I see the same thing when I have multiple subscribers to the point cloud topic.
I'm using an NVIDIA Orin devkit and an OS0-128U. With 512x10 I don't see this issue occur, but with 1024x10 I see packet drops when RViz is open and I record a bag simultaneously.

What resolution are you using?

@James-R-Han
Author

Hey Kavish! I'm glad I'm not the only one, haha. I was using 1024x10 when I got the messages above.

@Samahu Samahu self-assigned this Aug 8, 2024
@Samahu
Contributor

Samahu commented Aug 8, 2024

Hi @James-R-Han and @kavishmshah thanks for sharing the feedback.

I noticed sensor_qos_profile:=reliable, which suggests that you are not using BEST_EFFORT, the reliability setting used by SensorDataQoS. I think choosing the RELIABLE QoS for the sensor will increase the holdup on the publisher queue, resulting in the throttling and dropped packets. I recommend that you use the default SensorDataQoS for live processing and only use RELIABLE when capturing data.
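For reference, a minimal sketch of that change, based on the launch command from the original post (assuming the driver falls back to its SensorDataQoS default when the QoS overrides are omitted):

```bash
# Drop the sensor_qos_profile:=reliable and use_system_default_qos:=true
# overrides so the publishers use the driver's default SensorDataQoS
# (BEST_EFFORT) profile for live processing.
ros2 launch ouster_ros sensor.launch.xml \
    sensor_hostname:=192.168.131.18 \
    timestamp_mode:=TIME_FROM_ROS_TIME \
    proc_mask:="IMG|PCL" \
    viz:=false
```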

In any case, as I noted in the merged fix #321, I have further TODOs that I want to implement soon, which should improve performance on the driver side.

So stay tuned 🤞

@James-R-Han
Author

Thanks @Samahu!

When I've tried a BEST_EFFORT publisher with a BEST_EFFORT subscriber (RViz), empirically I see the frame rate occasionally drop well below 10 Hz.

When I use a RELIABLE publisher with a BEST_EFFORT subscriber, sometimes RViz is just blank (see below).

If I use RELIABLE for both, I find the best result: the frame rate is consistently high; I'm able to wave my hand around and it's a smooth motion in the point cloud.

[screenshot of the blank RViz view]
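One way to sanity-check what each side ends up negotiating (a sketch; /ouster/points is assumed to be the point cloud topic name, adjust to your namespace):

```bash
# Show the QoS profile offered/requested by every publisher and subscriber
ros2 topic info --verbose /ouster/points

# Measure the frame rate actually achieved on the subscriber side
ros2 topic hz /ouster/points
```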

@Samahu
Contributor

Samahu commented Aug 9, 2024

This problem could be tied to the underlying RMW used. I don't see this problem on x86 platforms with CycloneDDS. I haven't measured how well the driver performs on NVIDIA devices, but I do intend to once I get more free time. In any case, I think Zenoh is official now and works with ROS2 (Rolling, Iron, Jazzy, ..); it offers an alternative to the DDS communication layer and I have been hearing good reviews about it, though I haven't tried it myself. You can give it a try.
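If you want to experiment with it, a rough sketch (assuming the rmw_zenoh_cpp package is installed for your distro; see its README for the authoritative setup):

```bash
# Start the Zenoh router in one terminal (required by rmw_zenoh)
ros2 run rmw_zenoh_cpp rmw_zenohd

# In another terminal, select the Zenoh middleware and launch the driver
export RMW_IMPLEMENTATION=rmw_zenoh_cpp
ros2 launch ouster_ros sensor.launch.xml sensor_hostname:=192.168.131.18
```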

This is of course besides the fact that we can do more optimization on the driver side.

@Limerzzz

I faced the same problem, but I fixed it by setting the proc_mask value to PCL in driver_params.yaml. I drive it from Python.
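The equivalent as a launch-time override, based on the command from the original post (a sketch only):

```bash
# Limit the driver to point cloud processing only, matching the
# proc_mask change described above.
ros2 launch ouster_ros sensor.launch.xml \
    sensor_hostname:=192.168.131.18 \
    proc_mask:="PCL" \
    viz:=false
```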

@kavishmshah

Hi @Samahu, @Limerzzz and @James-R-Han,

With QoS set to BEST_EFFORT, when parsing the ROS messages (after recording a bag file) we saw a lot of NaN values and decided to change it back to RELIABLE. Maybe this explains the blank screen in RViz, though I'm not entirely sure.
Setting use_system_default_qos: true gives you the system default settings, which are Reliable and Volatile. Link

We were able to resolve the packet drop issue by doing the following:

  1. Changing the DDS to CycloneDDS and tuning it with multicast and localhost settings.
  2. In addition to this, we also increased the memory buffer size.
  3. Setting the MTU to 9000 also helped (see the sketch below).
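For reference, the kernel-level part of such tuning typically looks like the following (a sketch only; the buffer sizes and the interface name eth0 are placeholders, not the exact settings used here):

```bash
# Enlarge the kernel UDP receive buffers (values are placeholders)
sudo sysctl -w net.core.rmem_max=2147483647
sudo sysctl -w net.core.rmem_default=2147483647

# Enable jumbo frames on the interface connected to the sensor
sudo ip link set dev eth0 mtu 9000
```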

We have tried the same on an NVIDIA devkit and a high-spec PC. On both devices, we didn't notice packet drops after the change. I'll post the commands/settings needed in a few days once we are able to replicate this on another set of PCs, just to make sure.

Thanks!

@Samahu
Contributor

Samahu commented Aug 22, 2024

Thanks @kavishmshah,

When you say you increased the memory buffer size, do you mean increasing the net.core.rmem_max and net.core.rmem_default values?

@kavishmshah

@Samahu yup, that's right. I believe I set them to 2 GB.
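To make those values persist across reboots, a sketch (2147483647 bytes is roughly the 2 GB mentioned above; the file name is arbitrary):

```bash
# Persist the receive-buffer sizes and reload the kernel settings
printf 'net.core.rmem_max=2147483647\nnet.core.rmem_default=2147483647\n' | \
    sudo tee /etc/sysctl.d/10-udp-buffers.conf
sudo sysctl --system
```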

@Samahu
Contributor

Samahu commented Sep 16, 2024

Hi @James-R-Han, I have implemented a few improvements to the point cloud generation as part of #369 which I believe should help with your situation. Part of this is improving the function that generates the point cloud so it can skip copying the fields when the sensor didn't have valid returns for a specific pixel, which should reduce the overhead. You can also reduce the effective range of the sensor via the two new launch file parameters min_range and max_range, though this depends on your specific use case. Additionally, users who have limited bandwidth can switch to a non-organized point cloud by setting the organized parameter to false. I do have further improvements in the pipeline, but this is what I'd like to merge short term. Please let me know whether the provided solution helps (or not) with your case.
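For example, assuming those parameters can be passed on the command line like the others, an invocation might look like this (the range values are purely illustrative):

```bash
# Clip the effective range and publish a non-organized (sparse) point cloud
ros2 launch ouster_ros sensor.launch.xml \
    sensor_hostname:=192.168.131.18 \
    min_range:=0.5 \
    max_range:=30.0 \
    organized:=false
```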

@outrider-jkoch

If I'm not mistaken, the driver collects a full scan and then batch processes it, correct? If so, while it wouldn't save processing time, could the driver instead do processing on each piece as it receives it? That way the work would be split into at most 128 (2048 / 16) small batches rather than a single large batch. It might give the CPU a little more time to complete the required work, but if the CPU is heavily loaded this could still be problematic. Perhaps this has been tried before and not proven effective.

@Samahu
Contributor

Samahu commented Sep 17, 2024

@outrider-jkoch

If I'm not mistaken the driver collects a full scan and then batch processes it correct?

Correct, that is currently how ouster-ros operates.

If so, while it wouldn't save processing time could the driver instead do processing on each piece as it receives it?

Yeah, it is possible to restructure the point cloud composition/generation such that the ScanBatching and the cartesian step are merged into one iteration. This would be a good step to take, as it spreads the workload evenly across every LidarPacket rather than performing the cartesian conversion in one go once a LidarScan has been completed (the cartesian conversion is one of the heftier operations). However, the downside is that you can't perform LidarScan-wide operations before generating the PointCloud object from it. This is why I didn't implement this optimization in this PR: I do have examples and uses for invoking operations (such as filters) on the LidarScan before doing the cartesian conversion.

As I mentioned, the PR does have two improvements in this regard: skipping invalid range values, and, if a non-organized point cloud is a viable option for the integration, reducing the overhead further due to lower bandwidth.

@Samahu
Contributor

Samahu commented Sep 17, 2024

Regarding the cartesian conversion being a hefty operation, I do have some planned optimizations that should significantly reduce its overhead, but these will have to wait until the next or a later release.

@Samahu
Contributor

Samahu commented Oct 8, 2024

@James-R-Han Could you please try out the latest release and let me know if it helps with your situation? Thanks.

@James-R-Han
Author

Hi @Samahu! Thanks for implementing those improvements! Shortly after my post, I swapped to a better computer (better CPU) and the problem went away, so I won't be able to exactly recreate the testing conditions from when I first raised this issue. Sorry about that!

@Samahu
Contributor

Samahu commented Oct 9, 2024

Thanks. I do have a low-end computer that exposed the same problem; I will see if I can still reproduce it before and after the change.
