
Hang in MPI_Finalize with UCX_TLS=rc[_x],sm on the bsend2 test #1513

Closed

alinask opened this issue May 14, 2017 · 7 comments


alinask commented May 14, 2017

The command line to reproduce:

/hpc/local/benchmarks/hpcx_install_Thursday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 2784 -mca btl_openib_warn_default_gid_prefix 0 --debug-daemons --bind-to core --tag-output --timestamp-output --display-map -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170512_123328_23697_735568_clx-hercules-001/installs/dZW5/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/bsend2

http://e2e-gw.mellanox.com:4080/hpc/scrap/users/mtt/scratch/ucx_ompi/20170512_123328_23697_735568_clx-hercules-001/html/test_stdout_7oPpOD.txt

All 2784 ranks are stuck in ompi_mpi_finalize():

Thread 2 (Thread 0x7fffe9bb2700 (LWP 17578)):
#0  0x00007ffff76307a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007fffe9dffab5 in ucs_async_thread_func (arg=0xa0c6e0) at async/thread.c:93
#2  0x00007ffff7902dc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007ffff76301cd in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fb5740 (LWP 17211)):
#0  0x00007ffff707b513 in opal_sys_timer_get_cycles () at ../../../../opal/include/opal/sys/x86_64/timer.h:42
#1  opal_timer_linux_get_usec_sys_timer () at timer_linux_component.c:226
#2  0x00007ffff6feefa9 in opal_progress () at runtime/opal_progress.c:197
#3  0x00007fffea78704d in mca_pml_ucx_waitall (count_p=<synthetic pointer>, reqs=0xbcf500) at pml_ucx.c:340
#4  mca_pml_ucx_del_procs (procs=0xb130c0, nprocs=2784) at pml_ucx.c:396
#5  0x00007ffff7b5c11d in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:318
#6  0x0000000000402762 in main ()
Thread 2 (Thread 0x7fffe9bb2700 (LWP 17575)):
#0  0x00007ffff76307a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007fffe9dffab5 in ucs_async_thread_func (arg=0xa0c6e0) at async/thread.c:93
#2  0x00007ffff7902dc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007ffff76301cd in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fb5740 (LWP 17208)):
#0  0x00007ffff75f741d in nanosleep () from /usr/lib64/libc.so.6
#1  0x00007ffff7628014 in usleep () from /usr/lib64/libc.so.6
#2  0x00007ffff438359a in PMIx_Fence (procs=procs@entry=0x0, nprocs=nprocs@entry=0, info=info@entry=0x0, ninfo=0) at src/client/pmix_client_fence.c:101
#3  0x00007ffff43524b8 in pmix1_fence (procs=<optimized out>, collect_data=0) at pmix1_client.c:317
#4  0x00007fffea787112 in mca_pml_ucx_del_procs (procs=0xb130c0, nprocs=2784) at pml_ucx.c:403
#5  0x00007ffff7b5c11d in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:318
#6  0x0000000000402762 in main ()
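
For reference, the loop at pml_ucx.c:340 that every rank is spinning in (frames #2-#4 above) has roughly the following shape. This is a simplified sketch rather than the verbatim Open MPI source, and the completion check via ucp_request_is_completed() is an assumption about which UCX call the PML uses; the point is that mca_pml_ucx_del_procs() posts one disconnect request per peer and then busy-waits on all of them, so a single endpoint that never finishes disconnecting keeps the whole rank stuck in MPI_Finalize:

#include <stddef.h>                        /* size_t */
#include <ucp/api/ucp.h>                   /* ucp_request_is_completed, ucp_request_release */
#include "opal/runtime/opal_progress.h"    /* opal_progress */

/* Sketch of the disconnect wait loop; not the actual Open MPI code. */
static void pml_ucx_waitall_sketch(void **reqs, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i) {
        if (reqs[i] == NULL) {
            continue;                      /* disconnect completed immediately */
        }
        while (!ucp_request_is_completed(reqs[i])) {
            opal_progress();               /* pml_ucx.c:340 - drive UCX/OMPI progress */
        }
        ucp_request_release(reqs[i]);
        reqs[i] = NULL;
    }
}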

This may be related to #1502 and #1512.


alinask commented May 14, 2017

A backtrace of the ranks during the hang is attached.
trace_issue_1513.txt


yosefe commented May 14, 2017

It looks like one of the endpoints was not connected during startup, so it tries to connect during finalize and ends up in a timeout state; the disconnect request for that endpoint therefore never completes, which keeps the waitall loop spinning:

(gdb) p  ((ucp_stub_ep_t*)((ucp_request_t*)reqs[0]-1)->send.ep.uct_eps[0])->aux_ep->iface.ops
$28 = {iface_close = 0x222c3a0, iface_query = 0x220fcb0, iface_flush = 0x21ec570, iface_fence = 0x0, iface_wakeup_open = 0x2011730, iface_wakeup_get_fd = 0x1, 
  iface_wakeup_arm = 0xffffffffffff0001, iface_wakeup_wait = 0xff, iface_wakeup_signal = 0x0, iface_wakeup_close = 0x0, iface_tag_recv_zcopy = 0x0, 
  iface_tag_recv_cancel = 0x0, ep_create = 0x0, ep_create_connected = 0x0, ep_destroy = 0x7fffea716ae0 <uct_ep_failed_destroy>, ep_get_address = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_connect_to_ep = 0x7fffea210650 <ucs_empty_function_return_ep_timeout>, 
  iface_get_device_address = 0x281, iface_get_address = 0x7fffea97ada0 <ucp_wireup_msg_progress>, iface_is_reachable = 0x0, ep_put_short = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_put_bcopy = 0x7fffea210660 <ucs_empty_function_return_bc_ep_timeout>, ep_put_zcopy = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_get_bcopy = 0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_get_zcopy = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_am_short = 0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_am_bcopy = 
    0x7fffea210660 <ucs_empty_function_return_bc_ep_timeout>, ep_am_zcopy = 0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_atomic_add64 = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_atomic_fadd64 = 0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_atomic_swap64 = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_atomic_cswap64 = 0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_atomic_add32 = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_atomic_fadd32 = 0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_atomic_swap32 = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_atomic_cswap32 = 0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_pending_add = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_pending_purge = 0x7fffea716870 <uct_ep_failed_purge>, ep_flush = 
    0x7fffea210650 <ucs_empty_function_return_ep_timeout>, ep_fence = 0x0, ep_check = 0x0, ep_tag_eager_short = 0x0, ep_tag_eager_bcopy = 0x0, 
  ep_tag_eager_zcopy = 0x0, ep_tag_rndv_zcopy = 0x7fffea97ada0 <ucp_wireup_msg_progress>, ep_tag_rndv_cancel = 0x0, ep_tag_rndv_request = 0x0}
(gdb) bt
#0  0x00007ffff707b513 in opal_sys_timer_get_cycles () at ../../../../opal/include/opal/sys/x86_64/timer.h:42
#1  opal_timer_linux_get_usec_sys_timer () at timer_linux_component.c:226
#2  0x00007ffff6feefa9 in opal_progress () at runtime/opal_progress.c:197
#3  0x00007fffeab8904d in mca_pml_ucx_waitall (count_p=<synthetic pointer>, reqs=0x22efeb0) at pml_ucx.c:340
#4  mca_pml_ucx_del_procs (procs=0xb07320, nprocs=2784) at pml_ucx.c:396
#5  0x00007ffff7b5c11d in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:318
#6  0x0000000000402762 in main ()
(gdb) f 
#3  0x00007fffeab8904d in mca_pml_ucx_waitall (count_p=<synthetic pointer>, reqs=0x22efeb0) at pml_ucx.c:340
340	            opal_progress();
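
The ucs_empty_function_return_ep_timeout / ucs_empty_function_return_bc_ep_timeout symbols that fill this ops table are UCX's stubs for an endpoint that gave up on wireup. A rough sketch of what they presumably look like is below; only the symbol names come from the dump above, the bodies and the exact status code are an assumption. The consequence is that every operation issued on the failed endpoint, including the flush used by the disconnect path, immediately reports a timeout instead of ever completing, which matches the endless waitall spin shown earlier:

#include <sys/types.h>         /* ssize_t */
#include <ucs/type/status.h>   /* ucs_status_t, UCS_ERR_ENDPOINT_TIMEOUT */

/* Assumed bodies of the timeout stubs installed on a failed endpoint. */
static ucs_status_t ucs_empty_function_return_ep_timeout(void)
{
    return UCS_ERR_ENDPOINT_TIMEOUT;
}

/* bcopy operations return a byte count (ssize_t), hence a separate stub
 * that reports the same timeout status through a signed return value. */
static ssize_t ucs_empty_function_return_bc_ep_timeout(void)
{
    return UCS_ERR_ENDPOINT_TIMEOUT;
}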

@yosefe yosefe modified the milestone: v1.2 - release May 15, 2017
evgeny-leksikov added a commit to evgeny-leksikov/ucx that referenced this issue May 21, 2017
- Fix hang in MPI_Finalize with UCX_TLS=rc[_x],sm
evgeny-leksikov added a commit to evgeny-leksikov/ucx that referenced this issue May 23, 2017
- Fix hang in MPI_Finalize with UCX_TLS=rc[_x],sm
yosefe added a commit that referenced this issue May 24, 2017
yosefe added a commit that referenced this issue May 25, 2017
evgeny-leksikov commented:

Fixed in #1532 and #1541 (v1.2).

@yosefe yosefe closed this as completed May 28, 2017
@alinask alinask reopened this Jun 11, 2017

alinask commented Jun 11, 2017

Reopening this ticket since the same test hangs in ompi_mpi_finalize.

16 hercules hosts, ppn=32

/hpc/local/benchmarks/hpcx_install_Thursday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 512 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=ud,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170608_205432_16648_747832_clx-hercules-065/installs/YaaK/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/bsend2

and

/hpc/local/benchmarks/hpcx_install_Thursday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 512 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=ud,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170608_205432_16648_747832_clx-hercules-065/installs/YaaK/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/bsend3

(bsend2 and bsend3)

http://e2e-gw.mellanox.com:4080//hpc/scrap/users/mtt/scratch/ucx_ompi/20170608_205432_16648_747832_clx-hercules-065/html/test_stdout_URIaTh.txt

Attaching a trace.


alinask commented Jun 11, 2017

ucx_1513.txt

@yosefe yosefe modified the milestones: v1.2 - release, v1.3 Jul 1, 2017

yosefe commented Jul 24, 2017

The original hang can be reproduced with UCX_TLS=rc only on older UCX versions (e.g. 7e5973d). With UCX_TLS=ud it can also be reproduced with the latest UCX.


alinask commented Aug 1, 2017

Closing this one since it's not a UCX issue.
