Configuration
OMPI: 4.0.2a1
MOFED: MLNX_OFED_LINUX-4.5-1.0.1.0
Module: hpcx-gcc (2019-07-03)
Test module: none
Nodes: jazz x3 (ppn=28(x3), nodelist=jazz[01,06,08])
ucx-emulation-roce
MTT log: http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/html/test_stdout_zQgda0.txt
Looks similar to #1462 (which refers to internal issue 828609, now closed) and #1005.
Cmd: oshrun -np 4 --bind-to core -mca oshmem_proc_group_cache_size 10000 -mca sshmem ucx -mca atomic ucx -mca coll '^hcoll' -mca coll_hcoll_enable 0 -mca spml ucx -mca pml ucx -x UCX_TLS=ud_x -x UCX_NET_DEVICES=mlx5_3:1 -x UCX_UNIFIED_MODE=y --map-by node --mca scoll_basic_barrier_alg 3 --mca scoll_basic_broadcast_alg 1 --mca scoll_basic_collect_alg 2 --mca scoll_basic_reduce_alg 2 -x SHMEM_SYMMETRIC_HEAP_SIZE=128M /hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/installs/JrPz/tests/verifier/tests-mellanox.git/verifier/install/bin/oshmem_test exec --no-colour -o5 --task=coll
Output:
libibverbs: resolver: Neighbour doesn't have a hw addr
libibverbs: resolver: Unspecific failure
libibverbs: Neigh resolution process failed
[jazz08:17347:0:17347] ud_ep.c:494 Assertion `status == UCS_OK' failed
[1562230008.131645] [jazz08:17347:0] ib_device.c:961 UCX ERROR ibv_create_ah(dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:2.1.3.1 sgid_index=3 traffic_class=106) failed: Connection timed out

/hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/debug/assert.c: [ ucs_fatal_error_message() ]
      ...
      33 }
      34
      35     ucs_handle_error(message_buf);
==>   36     abort();
      37 }
      38
      39 void ucs_fatal_error_format(const char *file, unsigned line,

==== backtrace (tid: 17347) ====
 0 0x0000000000048a38 ucs_fatal_error_message() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/debug/assert.c:36
 1 0x0000000000048b99 ucs_fatal_error_format() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/debug/assert.c:52
 2 0x000000000003e6ff uct_ud_ep_create_passive() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/ud/base/ud_ep.c:494
 3 0x00000000000449fb uct_ud_mlx5_iface_poll_rx() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:420
 4 0x000000000001e2a2 ucs_callbackq_dispatch() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/callbackq.h:211
 5 0x0000000000003717 mca_pml_ucx_progress() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/mca/pml/ucx/pml_ucx.c:510
 6 0x000000000003717c opal_progress() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/opal/runtime/opal_progress.c:231
 7 0x000000000003360d ompi_request_wait_completion() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/../ompi/request/request.h:415
 8 0x000000000003538d ompi_comm_nextcid() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/communicator/comm_cid.c:293
 9 0x000000000003061b ompi_comm_dup_with_info() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/communicator/comm.c:1007
10 0x00000000000602a6 PMPI_Comm_dup() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/mpi/c/profile/pcomm_dup.c:63
11 0x000000000003828b oshmem_shmem_init() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/oshmem/runtime/oshmem_shmem_init.c:161
12 0x000000000003b002 _shmem_init() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/oshmem/shmem/c/profile/pshmem_init.c:77
13 0x000000000003b002 pstart_pes() /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/oshmem/shmem/c/profile/pshmem_init.c:57
14 0x0000000000406e09 __do_exec() /hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/installs/JrPz/tests/verifier/tests-mellanox.git/verifier/osh_exec.c:188
15 0x0000000000406e09 proc_mode_exec() /hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/installs/JrPz/tests/verifier/tests-mellanox.git/verifier/osh_exec.c:128
16 0x00000000004057ae main() /hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/installs/JrPz/tests/verifier/tests-mellanox.git/verifier/osh_main.c:138
17 0x0000000000021c05 __libc_start_main() ???:0
18 0x00000000004059c1 _start() ???:0
=================================
[jazz08:17347] *** Process received signal ***
Invalid IP address on jazz08: interface p2p2 is assigned 2.1.3.0, which is the network address of its /24 subnet.
[yosefe@jazz08 ucx]$ ifconfig p2p2
p2p2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 2.1.3.0  netmask 255.255.255.0  broadcast 2.1.3.255
        inet6 fe80::ee0d:9aff:fe46:9e35  prefixlen 64  scopeid 0x20<link>
        ether ec:0d:9a:46:9e:35  txqueuelen 1000  (Ethernet)
        RX packets 32  bytes 1920 (1.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 24905  bytes 1494468 (1.4 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
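A check like the following could catch this kind of misconfiguration before a test run. It is only a minimal sketch using Python's standard `ipaddress` module; the function name `classify_host_address` is hypothetical and is not part of UCX, Open MPI, or MTT. It takes the `inet` and `netmask` values as printed by `ifconfig` and reports whether the address is the network or broadcast address of its subnet.

```python
import ipaddress


def classify_host_address(addr: str, netmask: str) -> str:
    """Classify an IPv4 address within its subnet.

    Returns "network address", "broadcast address", or "ok".
    Hypothetical helper for illustration only.
    """
    iface = ipaddress.IPv4Interface(f"{addr}/{netmask}")
    net = iface.network
    # /31 and /32 subnets have no reserved network/broadcast addresses.
    if net.prefixlen >= 31:
        return "ok"
    if iface.ip == net.network_address:
        return "network address"
    if iface.ip == net.broadcast_address:
        return "broadcast address"
    return "ok"


if __name__ == "__main__":
    # Values taken from the ifconfig output for p2p2 on jazz08.
    print(classify_host_address("2.1.3.0", "255.255.255.0"))
```

With the values from jazz08 (`2.1.3.0`, `255.255.255.0`) the helper reports the address as the network address of 2.1.3.0/24, which matches the diagnosis above.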
Closing, as this is a setup issue; the test passed on another set of nodes.