[mtt] libibverbs: resolver: Neighbour doesn't have a hw addr #3826

Closed
amaslenn opened this issue Jul 4, 2019 · 2 comments

amaslenn commented Jul 4, 2019

Configuration

OMPI: 4.0.2a1
MOFED: MLNX_OFED_LINUX-4.5-1.0.1.0
Module: hpcx-gcc (2019-07-03)
Test module: none
Nodes: jazz x3 (ppn=28(x3), nodelist=jazz[01,06,08])
ucx-emulation-roce

MTT log: http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/html/test_stdout_zQgda0.txt

Looks similar to #1462 (which refers to internal issue 828609, now closed) and #1005.

Cmd:
oshrun -np 4 --bind-to core -mca oshmem_proc_group_cache_size 10000 -mca sshmem ucx -mca atomic ucx -mca coll '^hcoll' -mca coll_hcoll_enable 0 -mca spml ucx -mca pml ucx -x UCX_TLS=ud_x -x UCX_NET_DEVICES=mlx5_3:1 -x UCX_UNIFIED_MODE=y --map-by node --mca scoll_basic_barrier_alg 3 --mca scoll_basic_broadcast_alg 1 --mca scoll_basic_collect_alg 2 --mca scoll_basic_reduce_alg 2 -x SHMEM_SYMMETRIC_HEAP_SIZE=128M /hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/installs/JrPz/tests/verifier/tests-mellanox.git/verifier/install/bin/oshmem_test exec --no-colour -o5 --task=coll

Output:

libibverbs: resolver: Neighbour doesn't have a hw addr
libibverbs: resolver: Unspecific failurelibibverbs: Neigh resolution process failed
[jazz08:17347:0:17347]       ud_ep.c:494  Assertion `status == UCS_OK' failed
[1562230008.131645] [jazz08:17347:0]      ib_device.c:961  UCX  ERROR ibv_create_ah(dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:2.1.3.1 sgid_index=3 traffic_class=106) failed: Connection timed out

/hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/debug/assert.c: [ ucs_fatal_error_message() ]
      ...
       33     }
       34
       35     ucs_handle_error(message_buf);
==>    36     abort();
       37 }
       38
       39 void ucs_fatal_error_format(const char *file, unsigned line,

==== backtrace (tid:  17347) ====
 0 0x0000000000048a38 ucs_fatal_error_message()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/debug/assert.c:36
 1 0x0000000000048b99 ucs_fatal_error_format()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/debug/assert.c:52
 2 0x000000000003e6ff uct_ud_ep_create_passive()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/ud/base/ud_ep.c:494
 3 0x00000000000449fb uct_ud_mlx5_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:420
 4 0x000000000001e2a2 ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/callbackq.h:211
 5 0x0000000000003717 mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/mca/pml/ucx/pml_ucx.c:510
 6 0x000000000003717c opal_progress()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/opal/runtime/opal_progress.c:231
 7 0x000000000003360d ompi_request_wait_completion()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/../ompi/request/request.h:415
 8 0x000000000003538d ompi_comm_nextcid()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/communicator/comm_cid.c:293
 9 0x000000000003061b ompi_comm_dup_with_info()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/communicator/comm.c:1007
10 0x00000000000602a6 PMPI_Comm_dup()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/ompi/mpi/c/profile/pcomm_dup.c:63
11 0x000000000003828b oshmem_shmem_init()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/oshmem/runtime/oshmem_shmem_init.c:161
12 0x000000000003b002 _shmem_init()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/oshmem/shmem/c/profile/pshmem_init.c:77
13 0x000000000003b002 pstart_pes()  /hpc/local/benchmarks/hpcx_install_2019-07-03/src/hpcx-gcc-redhat7.4/ompi/oshmem/shmem/c/profile/pshmem_init.c:57
14 0x0000000000406e09 __do_exec()  /hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/installs/JrPz/tests/verifier/tests-mellanox.git/verifier/osh_exec.c:188
15 0x0000000000406e09 proc_mode_exec()  /hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/installs/JrPz/tests/verifier/tests-mellanox.git/verifier/osh_exec.c:128
16 0x00000000004057ae main()  /hpc/mtr_scrap/users/mtt/scratch/shmem/20190703_221405_135493_25572_jazz01/installs/JrPz/tests/verifier/tests-mellanox.git/verifier/osh_main.c:138
17 0x0000000000021c05 __libc_start_main()  ???:0
18 0x00000000004059c1 _start()  ???:0
=================================
[jazz08:17347] *** Process received signal ***
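
The `dgid=::ffff:2.1.3.1` in the `ibv_create_ah` error looks like an IPv4-mapped IPv6 address, which is how RoCE GIDs usually carry an IPv4 peer address. As a rough triage aid, the sketch below pulls the peer IPv4 address out of such a log line. The helper name `peer_ipv4_from_dgid` is hypothetical and not part of UCX or libibverbs; the log line is copied from the output above.

```python
import ipaddress
import re

# Error text copied from the UCX output above.
line = ("ibv_create_ah(dlid=0 sl=0 port=1 src_path_bits=0 "
        "dgid=::ffff:2.1.3.1 sgid_index=3 traffic_class=106) failed: "
        "Connection timed out")


def peer_ipv4_from_dgid(log_line):
    """Hypothetical helper: return the IPv4 address embedded in dgid=..., or None."""
    match = re.search(r"dgid=(\S+)", log_line)
    if match is None:
        return None
    gid = ipaddress.IPv6Address(match.group(1))
    # ipv4_mapped is None when the GID is not of the form ::ffff:a.b.c.d
    return gid.ipv4_mapped


print(peer_ipv4_from_dgid(line))  # prints 2.1.3.1
```

The resulting address is the peer whose hardware address the neighbour resolver could not obtain, so it is a reasonable first thing to check with `ip neigh` or `ping` from the failing node.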
@amaslenn amaslenn added Bug MTT MTT Error labels Jul 4, 2019
@yosefe yosefe added the External label Jul 4, 2019

yosefe commented Jul 4, 2019

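Invalid IP address on jazz08: p2p2 is configured as 2.1.3.0, which is the network address of its /24 subnet rather than a host address.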

 [yosefe@jazz08 ucx]$ ifconfig p2p2
p2p2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 2.1.3.0  netmask 255.255.255.0  broadcast 2.1.3.255
        inet6 fe80::ee0d:9aff:fe46:9e35  prefixlen 64  scopeid 0x20<link>
        ether ec:0d:9a:46:9e:35  txqueuelen 1000  (Ethernet)
        RX packets 32  bytes 1920 (1.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 24905  bytes 1494468 (1.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
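
A check like this can be scripted before a run. The following is a minimal sketch using only Python's standard `ipaddress` module; the function name `classify_host_address` is hypothetical, and the address and netmask are the ones from the `ifconfig` output above.

```python
import ipaddress


def classify_host_address(addr, netmask):
    """Hypothetical helper: flag an interface address that is not a usable host address."""
    iface = ipaddress.IPv4Interface(f"{addr}/{netmask}")
    net = iface.network
    # /31 and /32 networks have no reserved network or broadcast address.
    if net.prefixlen >= 31:
        return "ok"
    if iface.ip == net.network_address:
        return "network address"
    if iface.ip == net.broadcast_address:
        return "broadcast address"
    return "ok"


print(classify_host_address("2.1.3.0", "255.255.255.0"))  # prints: network address
```

Running it against every node's RoCE interface would flag jazz08 here, while an address such as 2.1.3.1 with the same netmask would be reported as `ok`.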


yosefe commented Jul 4, 2019

Closing, as this is a setup issue; the test passed on another set of nodes.

@yosefe yosefe closed this as completed Jul 4, 2019