assertion failure from ud #1462

Closed
alinask opened this issue Apr 30, 2017 · 2 comments

@alinask
Contributor

alinask commented Apr 30, 2017

The command line to reproduce:

/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 2496 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output --display-map -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc_x,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170424_010351_9868_727186_clx-hercules-001/installs/SAZJ/tests/mpich_tests/mpich-mellanox.git/test/mpi/pt2pt/probe-unexp

Mon Apr 24 01:26:37 2017[1,1195]<stdout>:[1492986397.315859] [clx-hercules-026:7845 :0]         wireup.c:56   UCX  ERROR failed to send wireup: Endpoint timeout
Mon Apr 24 01:26:37 2017[1,2398]<stdout>:[1492986397.346190] [clx-hercules-093:28475:0]         wireup.c:56   UCX  ERROR failed to send wireup: Endpoint timeout
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:[clx-hercules-026:7845 :0]       ud_ep.c:498  Assertion `ep->dest_ep_id == UCT_UD_EP_NULL_ID || ep->dest_ep_id == ctl->conn_rep.src_ep_id' failed
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:==== backtrace ====
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 0 0x000000000003b4c9 uct_ud_ep_rx_ctl()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/ud/base/ud_ep.c:497
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 1 0x000000000003b4c9 uct_ud_ep_process_rx()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/ud/base/ud_ep.c:602
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 2 0x000000000004071c uct_ud_mlx5_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:404
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 3 0x000000000004071c uct_ud_mlx5_iface_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:446
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 4 0x00000000000194ae ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucs/datastruct/callbackq.inl:39
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 5 0x00000000000194ae uct_worker_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/base/uct_md.c:233
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 6 0x000000000000f11d ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucp/core/ucp_worker.c:719
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 7 0x000000000000f11d ucs_async_check_miss()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucs/async/async.h:75
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 8 0x000000000000f11d ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucp/core/ucp_worker.c:720
Mon Apr 24 01:26:38 2017[1,1195]<stderr>: 9 0x0000000000002c77 mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/mca/pml/ucx/pml_ucx.c:421
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:10 0x000000000002eecc opal_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/opal/runtime/opal_progress.c:225
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:11 0x0000000000003a35 mca_pml_ucx_probe()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/mca/pml/ucx/pml_ucx.c:759
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:12 0x000000000005a738 PMPI_Probe()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/mpi/c/profile/pprobe.c:77
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:13 0x000000000040277e main()  ???:0
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:14 0x0000000000021b15 __libc_start_main()  ???:0
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:15 0x0000000000402499 _start()  ???:0
Mon Apr 24 01:26:38 2017[1,1195]<stderr>:===================

78 nodes, ppn=32.

http://e2e-gw.mellanox.com:4080/hpc/scrap/users/mtt/scratch/ucx_ompi/20170424_010351_9868_727186_clx-hercules-001/html/test_stdout_4ls83P.txt

From the comments in the test itself:

/*
 * This program verifies that MPI_Probe() is operating properly in the face of
 * unexpected messages arriving after MPI_Probe() has
 * been called.  This program may hang if MPI_Probe() does not return when the
 * message finally arrives (see req #375).
 */
@alinask alinask added the Bug label Apr 30, 2017
@alinask
Contributor Author

alinask commented May 7, 2017

The same assertion failure reproduced with the IMB benchmark, on a smaller scale, with RoCE.

/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -np 168 -mca btl_openib_warn_default_gid_prefix 0 --debug-daemons --bind-to core --tag-output --timestamp-output --display-map -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 -mca btl_openib_if_include mlx5_2:1 -mca coll_hcoll_enable 0 -x UCX_TLS=ud,sm -mca opal_pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170507_110520_17563_732699_clx-orion-012/installs/iR7B/tests/imb/imb/src/IMB-MPI1 -npmin 168 -iter 1000 -mem 0.9

Sun May  7 11:22:43 2017[1,1]<stderr>:[clx-orion-013:20274:0]       ud_ep.c:478  Assertion `uct_ib_unpack_uint24(ctl->conn_req.ep_addr.ep_id) == ep->dest_ep_id' failed
Sun May  7 11:22:43 2017[1,3]<stderr>:[clx-orion-015:19912:0]       ud_ep.c:478  Assertion `uct_ib_unpack_uint24(ctl->conn_req.ep_addr.ep_id) == ep->dest_ep_id' failed
Sun May  7 11:22:43 2017[1,2]<stderr>:[clx-orion-014:19869:0]       ud_ep.c:478  Assertion `uct_ib_unpack_uint24(ctl->conn_req.ep_addr.ep_id) == ep->dest_ep_id' failed
Sun May  7 11:22:44 2017[1,4]<stderr>:[clx-orion-016:19885:0]       ud_ep.c:478  Assertion `uct_ib_unpack_uint24(ctl->conn_req.ep_addr.ep_id) == ep->dest_ep_id' failed
Sun May  7 11:22:44 2017[1,32]<stderr>:[clx-orion-014:19875:0]       ud_ep.c:498  Assertion `ep->dest_ep_id == UCT_UD_EP_NULL_ID || ep->dest_ep_id == ctl->conn_rep.src_ep_id' failed
Sun May  7 11:22:44 2017[1,3]<stderr>:==== backtrace ====
   Sun May  7 11:22:44 2017[1,3]<stderr>: 0 0x000000000003b420 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/ud/base/ud_ep.c:478
   Sun May  7 11:22:44 2017[1,3]<stderr>: 1 0x000000000003b420 uct_ud_ep_process_rx()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/ud/base/ud_ep.c:573
   Sun May  7 11:22:44 2017[1,3]<stderr>: 2 0x000000000003cd16 uct_ud_verbs_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/ud/verbs/ud_verbs.c:328
   Sun May  7 11:22:44 2017[1,3]<stderr>: 3 0x000000000003cd16 uct_ud_verbs_iface_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/ud/verbs/ud_verbs.c:368
   Sun May  7 11:22:44 2017[1,3]<stderr>: 4 0x000000000001968e ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucs/datastruct/callbackq.inl:39
   Sun May  7 11:22:44 2017[1,3]<stderr>: 5 0x000000000001968e uct_worker_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/base/uct_md.c:233
   Sun May  7 11:22:44 2017[1,3]<stderr>: 6 0x000000000000f1dd ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucp/core/ucp_worker.c:719
   Sun May  7 11:22:44 2017[1,3]<stderr>: 7 0x000000000000f1dd ucs_async_check_miss()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucs/async/async.h:75
   Sun May  7 11:22:44 2017[1,3]<stderr>: 8 0x000000000000f1dd ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucp/core/ucp_worker.c:720
   Sun May  7 11:22:44 2017[1,3]<stderr>: 9 0x0000000000002cb7 mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/mca/pml/ucx/pml_ucx.c:421
   Sun May  7 11:22:44 2017[1,3]<stderr>:10 0x000000000002eecc opal_progress()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/opal/runtime/opal_progress.c:225
   Sun May  7 11:22:44 2017[1,3]<stderr>:11 0x00000000000430dd ompi_request_wait_completion()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/../ompi/request/request.h:392
   Sun May  7 11:22:44 2017[1,3]<stderr>:12 0x00000000000430dd ompi_request_default_wait()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/request/req_wait.c:41
   Sun May  7 11:22:44 2017[1,3]<stderr>:13 0x000000000006a020 ompi_coll_base_bcast_intra_generic()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/mca/coll/base/coll_base_bcast.c:159
   Sun May  7 11:22:44 2017[1,3]<stderr>:14 0x000000000006a467 ompi_coll_base_bcast_intra_binomial()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/mca/coll/base/coll_base_bcast.c:331
   Sun May  7 11:22:44 2017[1,3]<stderr>:15 0x0000000000004c2c ompi_coll_tuned_bcast_intra_dec_fixed()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:258
   Sun May  7 11:22:44 2017[1,3]<stderr>:16 0x0000000000053550 PMPI_Bcast()  /hpc/local/benchmarks/hpcx_install_Friday/src/hpcx-gcc-redhat7.2/ompi-v2.x/ompi/mpi/c/profile/pbcast.c:109
   Sun May  7 11:22:44 2017[1,3]<stderr>:17 0x0000000000404705 IMB_basic_input()  ???:0
   Sun May  7 11:22:44 2017[1,3]<stderr>:18 0x0000000000401f60 main()  ???:0
   Sun May  7 11:22:44 2017[1,3]<stderr>:19 0x0000000000021b15 __libc_start_main()  ???:0
   Sun May  7 11:22:44 2017[1,3]<stderr>:20 0x0000000000401df9 _start()  ???:0
   Sun May  7 11:22:44 2017[1,3]<stderr>:===================
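
For readers who do not have the UCX sources at hand, the two conditions quoted in the assertion messages above (ud_ep.c:478 and ud_ep.c:498) can be restated as a small standalone sketch. Everything prefixed with `sketch_`/`SKETCH_` is hypothetical and only mirrors the expressions printed in the log; the value used for the null id and the little-endian byte order of the 24-bit unpack are assumptions, not taken from UCX.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for UCT_UD_EP_NULL_ID; the real value is defined in UCX. */
#define SKETCH_EP_NULL_ID 0xffffffu

/* Decode a 3-byte endpoint id. Little-endian order is an assumption of this sketch. */
static uint32_t sketch_unpack_uint24(const uint8_t buf[3])
{
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) | ((uint32_t)buf[2] << 16);
}

/* Mirrors the expression asserted at ud_ep.c:478: the endpoint id carried in
 * conn_req must equal the destination id already stored on the endpoint. */
static bool sketch_creq_ok(uint32_t dest_ep_id, const uint8_t creq_ep_id[3])
{
    return sketch_unpack_uint24(creq_ep_id) == dest_ep_id;
}

/* Mirrors the expression asserted at ud_ep.c:498: the stored destination id
 * is either still unset or equal to the source id carried in conn_rep. */
static bool sketch_crep_ok(uint32_t dest_ep_id, uint32_t crep_src_ep_id)
{
    return dest_ep_id == SKETCH_EP_NULL_ID || dest_ep_id == crep_src_ep_id;
}
```

In both failures the endpoint already holds a destination id that disagrees with the id in the incoming control message; in this sketch that corresponds to `sketch_creq_ok` or `sketch_crep_ok` returning `false`.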

@alinask alinask changed the title from "assertion failure from ud on the probe-unexp test" to "assertion failure from ud" May 7, 2017
@yosefe yosefe modified the milestone: v1.2 - release May 10, 2017
@evgeny-leksikov
Contributor

evgeny-leksikov commented May 11, 2017

The first part of the issue, for the RoCE configuration, is a known one in OFED:

libibverbs: resolver: (errno = Connection refused)libibverbs: Neigh resolution process failed

internal issue number: #828609

@yosefe yosefe closed this as completed in 84d1a06 May 11, 2017
yosefe added a commit that referenced this issue May 11, 2017
evgeny-leksikov added a commit to evgeny-leksikov/ucx that referenced this issue May 12, 2017
- fix uninitialized UCT err handler in UCP
- increased UD timeout, fix variable name
yosefe added a commit that referenced this issue May 14, 2017
MattBBaker pushed a commit to MattBBaker/ucx that referenced this issue May 22, 2017
- fix uninitialized UCT err handler in UCP
- increased UD timeout, fix variable name