RoCE problem with OMPI direct modex - UD assertion #1005
Comments
Not sure. I've seen such a fault with my UD error handling code, but it should not be relevant for master (the root cause there was a specific disconnect change). Needs to be analyzed.
I think we should drop duplicated packets before that assert. Will try to check whether that helps on Orion.
Noticed the following error from libibverbs while reproducing the issue: libibverbs: resolver: Destination unrechable (type 7)
Another ibverbs error with latest UCX:
Added a unique ID to all CREQ packets. The same CREQ is seen to be received by two different processes (on different hosts). Both of them then reply with CREP, and the fault appears.
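For anyone who wants to repeat this check offline, here is a minimal sketch. It assumes each process logs a (CREQ ID, receiver) pair for every CREQ it gets. The record format, the function name, and the host/rank labels are hypothetical and are not part of UCX or its logging.

```python
from collections import defaultdict


def find_misdelivered_creqs(records):
    """Return {creq_id: sorted receivers} for every CREQ ID that was
    logged by more than one distinct receiver.

    `records` is an iterable of (creq_id, receiver) pairs; this format
    is an assumption made for illustration only.
    """
    receivers_by_id = defaultdict(set)
    for creq_id, receiver in records:
        receivers_by_id[creq_id].add(receiver)
    return {
        creq_id: sorted(receivers)
        for creq_id, receivers in receivers_by_id.items()
        if len(receivers) > 1
    }


# Hypothetical log entries gathered from several processes.
log = [
    (101, "host1:rank5"),
    (102, "host2:rank9"),
    (101, "host3:rank17"),  # same CREQ ID seen on a different host
    (101, "host1:rank5"),   # repeat at the same receiver is not flagged
]

for creq_id, receivers in find_misdelivered_creqs(log).items():
    print(f"CREQ {creq_id} received by: {', '.join(receivers)}")
```

A repeat of the same CREQ at the same receiver is deliberately not reported, since only delivery to two different processes matches the behaviour described above.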
@amaslenn this is not the same issue, since it doesn't have the UD assertion described in this ticket.
Discussed with @artpol84 offline, closing this.
OMPI version open-mpi/ompi@917d96ba50efa8 (compiled without debug)
UCX version 69545a1 (default configuration)
On the RoCE adapters, running the command:
mpirun --mca pml ucx --mca bml '^r2' --mca mpi_add_procs_cutoff 0 --mca pmix_base_collect_data 0 --mca pmix_base_async_modex 1 --map-by core -np 3556 ./hello_c
I see the following backtrace: