GPUDirect RDMA Performance issue #9287
Comments
Hi, based on the osu_bw output, I think UCX did not switch from the eager to the rendezvous protocol. Could you please do the following to find out the cause of the issue:
Your command line should be changed to something like this:
It should be enough to collect logs for one iteration only and for only one message size, so please use these osu_bw parameters:
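A sketch of what such a log-collection run might look like (hostnames and the benchmark path are placeholders; `UCX_LOG_LEVEL` and `UCX_PROTO_INFO` are standard UCX environment variables, and `-i`/`-x`/`-m` are standard osu_bw options for iterations, warmup iterations, and message-size range):

```shell
# Hypothetical example: one iteration, no warmup, single 4 MB message size,
# device-to-device (D D) buffers, UCX debug logging captured to a file.
mpirun -np 2 -H host1,host2 \
    -x UCX_LOG_LEVEL=debug \
    -x UCX_PROTO_INFO=y \
    ./osu_bw -i 1 -x 0 -m 4194304:4194304 D D 2> ucx_log.txt
```

This keeps the log small enough to attach to the issue while still showing which protocol UCX selected for the 4194304-byte message.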
Thanks for confirming.
Thanks for the log files.
There are a couple of PRs related to the issue. However, the changes are in the release branch. It would also help if you could confirm whether the issue reproduces on the master branch.
@kzmymmt please share the output of the following commands:
I will check the master branch later. @ivankochin
@rakhmets
Can it be closed?
I'll check it out in future releases as well.
Describe the bug
I measured bandwidth over GPUDirect RDMA with the OSU Micro-Benchmarks (osu_bw D D).
Bandwidth was lower with UCX 1.15.0rc3 than with 1.13.1.
Previously, UCX 1.13.1 achieved 24756.14 MB/sec at message size 4194304.
This was close to the line rate of the NIC (IB NDR200, 200 Gbps), which was fine.
What logs should I collect to identify the problem?
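As a rough sanity check (decimal units, ignoring protocol overhead), a 200 Gbps link tops out around 25000 MB/s, so the 1.13.1 result was within about 1% of line rate:

```shell
# 200 Gbps -> MB/s: 200 * 1000 Mbit / 8 bits-per-byte = 25000 MB/s
echo $((200 * 1000 / 8))   # prints 25000
```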
Steps to Reproduce
ucx_info -v
Setup and versions
ibstat
Additional information (depending on the issue)
ucx_info -d
to show transports and devices recognized by UCX
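For a GPUDirect RDMA issue specifically, it can help to filter that output for the GPU- and InfiniBand-related entries. A hedged example (the exact transport names present depend on how UCX was built):

```shell
# List UCX transports/devices and keep the CUDA/GDR/Mellanox-related lines.
# Guarded so the command degrades gracefully when ucx_info is not installed.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | grep -iE 'cuda|gdr|mlx'
else
    echo "ucx_info not found in PATH"
fi
```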