There is a hang when running the one-sided/osu_fop_latency test on a RoCE port.
In the command lines below, only mlx5_2:1 is an Ethernet port.
For some reason, setting the HCA for the openib btl to a non-Ethernet port resolves the issue (while the HCA for UCX is unchanged and the rest of the command line is the same).
While it hangs, one rank is in MPI_Barrier and the other is here:
[alinas@clx-orion-052 ~]$ gstack 6864
Thread 4 (Thread 0x7ffff4349700 (LWP 6865)):
#0 0x00007ffff732e7a3 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00007ffff6d29bb3 in epoll_dispatch (base=0x6a6cb0, tv=<optimized out>) at epoll.c:407
#2 0x00007ffff6d2d600 in opal_libevent2022_event_base_loop (base=base@entry=0x6a6cb0, flags=flags@entry=1) at event.c:1630
#3 0x00007ffff437bbdd in progress_engine (obj=0x6a6cb0) at src/util/progress_threads.c:52
#4 0x00007ffff7600dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ffff732e1cd in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7ffff3b48700 (LWP 6866)):
#0 0x00007ffff732e7a3 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00007ffff6d29bb3 in epoll_dispatch (base=0x6a8190, tv=<optimized out>) at epoll.c:407
#2 0x00007ffff6d2d600 in opal_libevent2022_event_base_loop (base=0x6a8190, flags=flags@entry=1) at event.c:1630
#3 0x00007ffff6cf137e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4 0x00007ffff7600dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007ffff732e1cd in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffea813700 (LWP 6878)):
#0 0x00007ffff732e7a3 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00007fffeaa5fa55 in ucs_async_thread_func (arg=0x820170) at async/thread.c:93
#2 0x00007ffff7600dc5 in start_thread () from /usr/lib64/libpthread.so.0
#3 0x00007ffff732e1cd in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fb5740 (LWP 6864)):
#0 0x00007ffff6d79453 in opal_sys_timer_get_cycles () at ../../../../opal/include/opal/sys/x86_64/timer.h:42
#1 opal_timer_linux_get_usec_sys_timer () at timer_linux_component.c:226
#2 0x00007ffff6cecf49 in opal_progress () at runtime/opal_progress.c:197
#3 0x00007fffe0933abd in opal_condition_wait (m=0x8ca290, c=0x8ca2d0) at ../../../../opal/threads/condition.h:72
#4 ompi_osc_pt2pt_sync_wait_expected (sync=<optimized out>) at osc_pt2pt_sync.h:154
#5 ompi_osc_pt2pt_flush_lock (module=module@entry=0x8cd040, lock=0x8ca220, target=target@entry=1) at osc_pt2pt_passive_target.c:519
#6 0x00007fffe0935194 in ompi_osc_pt2pt_flush (target=1, win=<optimized out>) at osc_pt2pt_passive_target.c:561
#7 0x00007ffff7875316 in PMPI_Win_flush (rank=rank@entry=1, win=0x8ccd30) at pwin_flush.c:57
#8 0x0000000000401db6 in run_fop_with_flush (rank=0, type=<optimized out>) at osu_fop_latency.c:238
#9 0x0000000000401765 in main (argc=1, argv=0x7fffffffcf08) at osu_fop_latency.c:120
pml yalla does not have this problem; it runs on the RoCE port in both cases.
pml ob1 also fails with the RoCE port (mlx5_2:1).
Open MPI v2.1.1rc1
2 hosts, ppn=1.
hangs:
/hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 2 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --display-map -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca coll_hcoll_enable 0 -x UCX_TLS=ud,sm --map-by node -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 -mca btl_openib_if_include mlx5_2:1 /hpc/scrap/users/mtt/scratch/ucx_ompi/20170509_075908_18677_734136_clx-orion-017/installs/Ou7g/tests/osu_micro_benchmark/osu-micro-benchmarks-5.3.2/mpi/one-sided/osu_fop_latency
works:
/hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 2 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --display-map -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca coll_hcoll_enable 0 -x UCX_TLS=ud,sm --map-by node -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 -mca btl_openib_if_include mlx5_3:1 /hpc/scrap/users/mtt/scratch/ucx_ompi/20170509_075908_18677_734136_clx-orion-017/installs/Ou7g/tests/osu_micro_benchmark/osu-micro-benchmarks-5.3.2/mpi/one-sided/osu_fop_latency
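The only difference between the two commands is btl_openib_if_include (mlx5_2:1 vs. mlx5_3:1). One way to confirm which link layer each port uses (a sketch, assuming the ibstat tool from infiniband-diags is installed and these device names exist on the host):

```shell
# Print the link layer of each port: "Ethernet" means RoCE,
# "InfiniBand" means a plain IB port.
for dev in mlx5_2 mlx5_3; do
    echo -n "$dev:1  "
    ibstat "$dev" 1 | awk -F': ' '/Link layer/ {print $2}'
done
```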