Hang on UD-only MPI runs on some hosts #1767
Comments
Is it ConnectX-5? What is the FW version?
See also #1513.
Doesn't look related to #1513, since the hang is on the first sends: it doesn't start showing benchmark results (I attached gdb; all procs seem to wait on …). I'll post the FW & HCA versions.
It is ConnectX-5. Still waiting on the FW version.
Currently it works for me on that cluster:
@alex--m can we close this?
If OSU passes, I consider it solved.
I noticed the problem on the Thor cluster (Mellanox-internal). The problem did not reproduce on other hosts. The following command hangs, while changing `ud` to `rc` (or disabling UCX altogether) makes it pass:
Also, it hangs even w/o HCOLL, with `ud_x`, and for other async (i*) OSU collective benchmarks. With `osu_gatherv`, it hangs at around 4K (message size).
Is there any additional data you'd like me to provide?