mpich 3.4a3 with ch4:ofi:gni hangs at message size > 8192 on Cori #4720
Same behavior with 3.4b1.

NOTES: I thought #4811 should've fixed this.

Same issue with the 3.4 release.
3.4.1 on Cori using libfabric-1.11.0 seems even worse:

Not sure why osu_latency is not working beyond 0 bytes. 'cpi' is working for me, though, as are a few of the MPI-IO benchmarks that I tried.
Trying mpich/main with libfabric/master gni on Cori. I can reproduce the same issue: processes hang at 8K with osu_latency. Looking into the problem now.
To narrow down this bug, I let rank 0 perform only
The bug happens in the MPICH AM pipeline send/recv path with ofi/gni. Roughly speaking, it seems to be a bug in the ofi/gni provider code when using

Below is a note showing how the situation happens:
Cause of bug:

Will report this bug to the ofi/gni developers.
Btw, it looks like ofi/gni supports both
I can try. But as I said, it only hides the bug rather than really resolving it (i.e., it will no longer trigger the AM pipeline with ofi/gni).

I agree. We should keep pursuing the real issue, but that shouldn't be an excuse for us to leave it broken.
The issue is not completely resolved by #5085. We still need libfabric #6593, or changes in MPICH (e.g., passing a pointer to NULL rather than NULL as msg.desc).
Thanks @minsii! I'll cherry-pick this back to the 3.4.x branch as well.
ofiwg/libfabric#6593 was merged into libfabric, so I think we can close this and let users know to grab a recent libfabric in order to run MPICH. |
Some samples with `perf record -g --pid <pid>` of rank 0:

rank 1:
separate libfabric
I also built my own libfabric 1.10.1 with
MPICH 3.4a3 built against this has the same issue. If I instead use the verbs provider and verbs compatibility on Cray, I get bad performance, but all message sizes work:
libfabric only
The standalone build of libfabric appears to work with the expected performance.
https://github.com/ofi-cray/cray-tests/blob/master/performance/multi-node/rdm_pingpong.c with `MAX_MSG_SIZE (1<<22)`