assertion failure from ud #1462
The same assertion failure was reproduced with the IMB benchmark, on a smaller scale, with RoCE:
/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 168 -mca btl_openib_warn_default_gid_prefix 0 --debug-daemons --bind-to core --tag-output --timestamp-output --display-map -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca btl_openib_if_include mlx5_2:1 -mca coll_hcoll_enable 0 -x UCX_TLS=ud,sm -mca opal_pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170507_110520_17563_732699_clx-orion-012/installs/iR7B/tests/imb/imb/src/IMB-MPI1 -npmin 168 -iter 1000 -mem 0.9
The first part of the issue, for the RoCE configuration, is a known one in OFED:
internal issue number: #828609
- fix uninitialized UCT err handler in UCP - increased UD timeout, fix variable name
The command line to reproduce:
/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 2496 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output --display-map -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc_x,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170424_010351_9868_727186_clx-hercules-001/installs/SAZJ/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/probe-unexp
78 nodes, ppn=32.
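As a quick sanity check on the scale, the node count and processes-per-node above should multiply out to the `-np` value on the command line. A minimal sketch (the helper name `total_ranks` is hypothetical, used only for illustration):

```python
def total_ranks(nodes: int, ppn: int) -> int:
    """Return the total number of MPI ranks for a job with `ppn` processes on each node."""
    return nodes * ppn


# 78 nodes with 32 processes per node, as stated in the report
print(total_ranks(78, 32))  # matches -np 2496 in the mpirun command line
```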
http://e2e-gw.mellanox.com:4080/hpc/scrap/users/mtt/scratch/ucx_ompi/20170424_010351_9868_727186_clx-hercules-001/html/test_stdout_4ls83P.txt
From the comments in the test itself: