Hang in MPI_Finalize with UCX_TLS=rc[_x],sm on the bsend2 test #1513
A backtrace of the ranks during the hang is attached.

Looks like one of the endpoints was not connected during startup, so it tries to connect during finalize and ends up in a timeout state.
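For readers without access to the MTT tree, below is a minimal sketch approximating the buffered-send pattern the bsend2 test exercises; it is an assumption about the test's shape, not the actual MPICH source. The relevant point is that each rank talks to only a couple of peers, so with lazy wireup some endpoints may first be connected (and flushed) inside MPI_Finalize, which is where the timeout above is reported.

```c
/* Approximate sketch of a bsend2-style pattern (hypothetical, not the
 * actual MPICH test source): each rank does one buffered send to a
 * neighbor, so most endpoint pairs are never connected during the run
 * and any remaining wireup/flush work is deferred to MPI_Finalize. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, msg = 0, buf_size;
    void *attach_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Attach a buffer big enough for one int plus MPI's bookkeeping. */
    MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &buf_size);
    buf_size += MPI_BSEND_OVERHEAD;
    attach_buf = malloc(buf_size);
    MPI_Buffer_attach(attach_buf, buf_size);

    /* Buffered send to the next rank, receive from the previous one. */
    MPI_Bsend(&msg, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    MPI_Recv(&msg, 1, MPI_INT, (rank + size - 1) % size, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Buffer_detach(&attach_buf, &buf_size);
    free(attach_buf);

    /* At scale (512+ ranks here), endpoints that were never used are
     * wired up and flushed during finalize, which is where the
     * reported timeout occurs. */
    MPI_Finalize();
    return 0;
}
```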
- Fix hang in MPI_Finalize with UCX_TLS=rc[_x],sm
Reopening this ticket since the same test hangs in ompi_finalize. 16 hercules hosts, ppn=32:

/hpc/local/benchmarks/hpcx_install_Thursday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 512 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=ud,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170608_205432_16648_747832_clx-hercules-065/installs/YaaK/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/bsend2

and

/hpc/local/benchmarks/hpcx_install_Thursday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 512 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=ud,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170608_205432_16648_747832_clx-hercules-065/installs/YaaK/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/bsend3

(bsend2 and bsend3). Attaching a trace.
The original hang can be reproduced with UCX_TLS=rc only on older UCX versions (e.g. …).
Closing this one since it's not a UCX issue. |
The command line to reproduce:
/hpc/local/benchmarks/hpcx_install_Thursday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 2784 -mca btl_openib_warn_default_gid_prefix 0 --debug-daemons --bind-to core --tag-output --timestamp-output --display-map -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170512_123328_23697_735568_clx-hercules-001/installs/dZW5/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/bsend2
http://e2e-gw.mellanox.com:4080/hpc/scrap/users/mtt/scratch/ucx_ompi/20170512_123328_23697_735568_clx-hercules-001/html/test_stdout_7oPpOD.txt
All 2784 ranks are in ompi_mpi_finalize().
May be related to #1502 and #1512.