
uct_ud_ep_rx_creq error at np 1280 #544

Closed
janjust opened this issue Jan 8, 2016 · 6 comments · Fixed by #953

janjust commented Jan 8, 2016

commit 5e4eb9e

Hey guys, at np 1280 I'm seeing the following error. Is this a known issue?
I was testing the IMB-MPI1 benchmarks using pml_ucx from ompi-trunk.

$mpirun -np 1280 --map-by ppr:20:node -mca pml ucx -x UCX_TLS=rc_x -x UCX_DEVICES=mlx5_0:1 ./IMB-MPI1 -npmin 1280

[clx-orion-038:8784 :0]       ud_ep.c:315  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed

/labhome/tomislavj/scrap/ucx-trunk/src/uct/ib/ud/base/ud_ep.c: [ uct_ud_ep_rx_creq() ]
  ...
  312     uct_ud_ctl_hdr_t *ctl = (uct_ud_ctl_hdr_t *)(neth + 1);
  313
  314     ucs_assert_always(ctl->type == UCT_UD_PACKET_CREQ);
==>   315
  316     ep = uct_ud_iface_cep_lookup(iface, &ctl->conn_req.ib_addr, ctl->conn_req.conn_id);
  317     if (!ep) {
  318         ep = uct_ud_ep_create_passive(iface, ctl);

==== backtrace ====
 0 0x000000000001d961 uct_ud_ep_rx_creq()  /labhome/tomislavj/scrap/ucx-trunk/src/uct/ib/ud/base/ud_ep.c:315
 1 0x000000000001d961 uct_ud_ep_rx_creq()  /labhome/tomislavj/scrap/ucx-trunk/src/uct/ib/ud/base/ud_ep.c:319
 2 0x000000000001d961 uct_ud_ep_process_rx()  /labhome/tomislavj/scrap/ucx-trunk/src/uct/ib/ud/base/ud_ep.c:407
 3 0x000000000001ee84 uct_ud_verbs_iface_poll_rx()  /labhome/tomislavj/scrap/ucx-trunk/src/uct/ib/ud/verbs/ud_verbs.c:286
 4 0x000000000001ee84 uct_ud_verbs_iface_progress()  /labhome/tomislavj/scrap/ucx-trunk/src/uct/ib/ud/verbs/ud_verbs.c:315
 5 0x000000000000ef1a ucs_notifier_chain_call()  /labhome/tomislavj/scrap/ucx-trunk/src/ucs/datastruct/notifier.h:52
 6 0x000000000000ef1a uct_worker_progress()  /labhome/tomislavj/scrap/ucx-trunk/src/uct/base/uct_pd.c:210
 7 0x000000000000734d ucp_worker_progress()  /labhome/tomislavj/scrap/ucx-trunk/src/ucp/core/ucp_worker.c:230
 8 0x000000000000734d ucs_async_check_miss()  /labhome/tomislavj/scrap/ucx-trunk/src/ucs/async/async.h:135
 9 0x000000000000734d ucp_worker_progress()  /labhome/tomislavj/scrap/ucx-trunk/src/ucp/core/ucp_worker.c:231
10 0x0000000000002ddb mca_pml_ucx_progress()  /labhome/tomislavj/scrap/ompi-trunk/ompi/mca/pml/ucx/pml_ucx.c:323
11 0x000000000002f5ba opal_progress()  /labhome/tomislavj/scrap/ompi-trunk/opal/runtime/opal_progress.c:189
12 0x0000000000044d1d opal_condition_wait()  /labhome/tomislavj/scrap/ompi-trunk/ompi/../opal/threads/condition.h:76
13 0x0000000000044d1d ompi_request_default_wait_all()  /labhome/tomislavj/scrap/ompi-trunk/ompi/request/req_wait.c:287
14 0x0000000000072178 ompi_coll_base_sendrecv_nonzero_actual()  /labhome/tomislavj/scrap/ompi-trunk/ompi/mca/coll/base/coll_base_util.c:66
15 0x0000000000070a52 ompi_coll_base_sendrecv()  /labhome/tomislavj/scrap/ompi-trunk/ompi/mca/coll/base/coll_base_util.h:67
16 0x000000000002c6ad ompi_comm_split()  /labhome/tomislavj/scrap/ompi-trunk/ompi/communicator/comm.c:457
17 0x0000000000057929 PMPI_Comm_split()  /labhome/tomislavj/scrap/ompi-trunk/ompi/mpi/c/profile/pcomm_split.c:69
18 0x0000000000402ad9 IMB_set_communicator()  ??:0
19 0x0000000000402b27 IMB_init_communicator()  ??:0
20 0x0000000000401f17 main()  ??:0
21 0x0000003c8fe1ed1d __libc_start_main()  ??:0
22 0x0000000000401ce9 _start()  ??:0
===================
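
For context, here is a minimal sketch, not the actual UCX code, of the kind of check that fails above: the CREQ receive path asserts that the control header it was handed really is a connection request, so a stale, duplicated, or mis-routed control packet trips the assertion. All names and the header layout below are simplified assumptions for illustration.

/* Minimal sketch, not the actual UCX code: it only illustrates the kind of
 * check behind ucs_assert_always(ctl->type == UCT_UD_PACKET_CREQ).
 * All names and the header layout here are simplified assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum ctl_type {                /* hypothetical stand-ins for UCT_UD_PACKET_CREQ/CREP */
    CTL_CREQ = 1,              /* connection request */
    CTL_CREP = 2               /* connection reply   */
};

typedef struct {
    uint8_t  type;             /* which control message this is */
    uint32_t conn_id;          /* connection id carried by a CREQ */
} ctl_hdr_t;

/* Receive path for a packet already classified as a CREQ. If a stale or
 * duplicate CREP, or a corrupted header, reaches this path, the type check
 * fails, which is what the real code reports via ucs_assert_always(). */
static void rx_creq(const ctl_hdr_t *ctl)
{
    if (ctl->type != CTL_CREQ) {
        fprintf(stderr, "unexpected control type %u on the CREQ path\n",
                (unsigned)ctl->type);
        abort();
    }
    printf("accepting connection request, conn_id=%u\n", (unsigned)ctl->conn_id);
    /* ... the real code then looks up or creates the passive endpoint ... */
}

int main(void)
{
    ctl_hdr_t creq = { CTL_CREQ, 42 };
    rx_creq(&creq);            /* accepted */
    return 0;
}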
@shamisp shamisp added the Bug label Jan 9, 2016
@yosefe yosefe assigned alex-mikheev and unassigned yosefe Jan 9, 2016
alex-mikheev (Contributor) commented:

This one may be fixed by:
58524b0 Merge pull request #546 from yosefe/topic/ucp-unset-am-handlers
02b7251 Merge pull request #542 from alex-mikheev/topic/ud_dc_fixes
afdda89 UCT/UD: add lock when doing async callback
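
The last commit in that list ("add lock when doing async callback") points at serializing the UD async callback against the regular progress path. As a rough, hedged illustration only, with hypothetical names rather than UCX's actual locking, the pattern is:

/* Hedged sketch with hypothetical names; not UCX's actual locking scheme.
 * It only shows why taking the same lock in both the async callback and the
 * progress path keeps the two from mutating endpoint state concurrently. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;      /* protects interface/endpoint state */
    /* ... endpoint and connection state ... */
} iface_t;

/* Regular progress, driven from the application thread. */
static void iface_progress(iface_t *iface)
{
    pthread_mutex_lock(&iface->lock);
    /* poll completions, process received CREQ/CREP packets, ... */
    pthread_mutex_unlock(&iface->lock);
}

/* Async (timer/signal-driven) callback, e.g. for resends. The fix is to take
 * the same lock here so it cannot race with iface_progress(). */
static void iface_async_cb(void *arg)
{
    iface_t *iface = (iface_t *)arg;
    pthread_mutex_lock(&iface->lock);
    /* resend timed-out packets, advance connection state, ... */
    pthread_mutex_unlock(&iface->lock);
}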

@yosefe yosefe closed this as completed Mar 28, 2016
@yosefe yosefe reopened this Jul 5, 2016
yosefe (Contributor) commented Jul 5, 2016

reported by @alinask

http://e2e-gw.mellanox.com:4080/hpc/scrap/users/mtt/scratch/ucx_ompi/20160705_092657_15284_33052_clx-orion-001/html/test_stdout_tuw9O6.txt

[clx-orion-064:14583:0]       ud_ep.c:470  Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed
[clx-orion-021:14989:0]       ud_ep.c:437  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed

More of them - http://e2e-gw.mellanox.com:4080/hpc/scrap/users/mtt/scratch/ucx_ompi/20160705_092657_15284_33052_clx-orion-001/Test_Run-mpich_tests_mpi_comm-ompi_ofed-1.10.3rc4-ompi_ofed.html
I'm not sure we are ready for such a scale, since the 8-node MTT on pvegas isn't perfect yet, but I just wanted to let you know.
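
For the head_sn assertion above, here is a minimal sketch of the invariant being checked, using simplified, hypothetical types rather than the real UCX structures: a packet delivered as in-order must carry exactly the sequence number the receive window expects next, so a duplicate or reordered packet that slips through violates it.

/* Simplified, hypothetical types; not the real UCX structures. */
#include <assert.h>
#include <stdint.h>

typedef uint32_t psn_t;        /* packet sequence number (width is an assumption) */

typedef struct {
    psn_t head_sn;             /* next in-order PSN the receiver expects */
} rx_window_t;

/* Deliver a packet the receiver has decided is in order. This mirrors the
 * asserted invariant ep->rx.ooo_pkts.head_sn == neth->psn: the arriving PSN
 * must match the expected head of the window. */
static void deliver_in_order(rx_window_t *win, psn_t psn)
{
    assert(win->head_sn == psn);
    win->head_sn++;            /* advance the expected sequence number */
}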

alinask (Contributor) commented Jul 5, 2016

Adding the command line here for reproduction (in case the web page expires):

/hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/bin/mpirun -np 896 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output --display-map -mca pml ucx -x UCX_SHM_DEVICES=all -x UCX_NET_DEVICES=mlx5_2:1 -x UCX_ACC_DEVICES=all -mca coll_hcoll_enable 0 -x UCX_TLS=all --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20160705_092657_15284_33052_clx-orion-001/installs/5_Ux/tests/mpich_tests/mpich-mellanox.git/test/mpi/comm/comm_idup_overlap

@yosefe yosefe added this to the v1.2 milestone Jul 22, 2016
alinask (Contributor) commented Sep 6, 2016

Reproduces on a smaller scale (np=128):

/hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/installs/q3UC/install/bin/mpirun -np 128 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output --display-map -mca pml ucx -x UCX_SHM_DEVICES=all -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_ACC_DEVICES=all -mca coll_hcoll_enable 0 -x UCX_TLS=rc_x,mm --map-by node /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/installs/q3UC/tests/mpich_tests/mpich-mellanox.git/test/mpi/coll/uoplong

 =============================================================[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
Sat Sep  3 08:42:25 2016[1,125]<stderr>:[vegas32:15493:0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed
Sat Sep  3 08:42:25 2016[1,125]<stderr>:
Sat Sep  3 08:42:25 2016[1,125]<stderr>:/hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/uct/ib/ud/base/ud_ep.c: [ uct_ud_ep_rx_creq() ]
Sat Sep  3 08:42:25 2016[1,125]<stderr>:      ...
Sat Sep  3 08:42:25 2016[1,125]<stderr>:      436     uct_ud_ctl_hdr_t *ctl = (uct_ud_ctl_hdr_t *)(neth + 1);
Sat Sep  3 08:42:25 2016[1,125]<stderr>:      437 
Sat Sep  3 08:42:25 2016[1,125]<stderr>:      438     ucs_assert_always(ctl->type == UCT_UD_PACKET_CREQ);
Sat Sep  3 08:42:25 2016[1,125]<stderr>:==>   439 
Sat Sep  3 08:42:25 2016[1,125]<stderr>:      440     ep = uct_ud_iface_cep_lookup(iface, uct_ud_creq_ib_addr(ctl),
Sat Sep  3 08:42:25 2016[1,125]<stderr>:      441                                  &ctl->conn_req.ep_addr.iface_addr,
Sat Sep  3 08:42:25 2016[1,125]<stderr>:      442                                  ctl->conn_req.conn_id);
Sat Sep  3 08:42:25 2016[1,125]<stderr>:Sat Sep  3 08:42:25 2016[1,125]<stderr>:==== backtrace ====
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 0 0x000000000003f0c5 uct_ud_ep_rx_creq()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/uct/ib/ud/base/ud_ep.c:439
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 1 0x000000000003f0c5 uct_ud_ep_process_rx()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/uct/ib/ud/base/ud_ep.c:552
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 2 0x00000000000465eb uct_ud_mlx5_iface_poll_rx()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/uct/ib/ud/accel/ud_mlx5.c:390
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 3 0x00000000000465eb uct_ud_mlx5_iface_progress()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/uct/ib/ud/accel/ud_mlx5.c:433
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 4 0x0000000000015dbe ucs_callbackq_dispatch()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/ucs/datastruct/callbackq.h:263
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 5 0x0000000000015dbe uct_worker_progress()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/uct/base/uct_md.c:229
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 6 0x000000000000c838 ucp_worker_progress()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/ucp/core/ucp_worker.c:433
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 7 0x000000000000c838 ucs_async_check_miss()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/ucs/async/async.h:135
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 8 0x000000000000c838 ucp_worker_progress()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/ucx_src/src/ucp/core/ucp_worker.c:434
Sat Sep  3 08:42:25 2016[1,125]<stderr>: 9 0x0000000000003d13 mca_pml_ucx_progress()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/mpi-install/GjoN/src/ompi-release/ompi/mca/pml/ucx/pml_ucx.c:285
Sat Sep  3 08:42:25 2016[1,125]<stderr>:10 0x000000000002f872 opal_progress()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/mpi-install/GjoN/src/ompi-release/opal/runtime/opal_progress.c:187
Sat Sep  3 08:42:25 2016[1,125]<stderr>:11 0x0000000000057413 opal_condition_wait()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/mpi-install/GjoN/src/ompi-release/ompi/../opal/threads/condition.h:78
Sat Sep  3 08:42:25 2016[1,125]<stderr>:12 0x0000000000057d5a ompi_request_default_wait_all()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/mpi-install/GjoN/src/ompi-release/ompi/request/req_wait.c:281
Sat Sep  3 08:42:25 2016[1,125]<stderr>:13 0x0000000000017f8d ompi_coll_tuned_reduce_generic()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/mpi-install/GjoN/src/ompi-release/ompi/mca/coll/tuned/coll_tuned_reduce.c:192
Sat Sep  3 08:42:25 2016[1,125]<stderr>:14 0x0000000000018f23 ompi_coll_tuned_reduce_intra_in_order_binary()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/mpi-install/GjoN/src/ompi-release/ompi/mca/coll/tuned/coll_tuned_reduce.c:566
Sat Sep  3 08:42:25 2016[1,125]<stderr>:15 0x00000000000071d1 ompi_coll_tuned_reduce_intra_dec_fixed()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/mpi-install/GjoN/src/ompi-release/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:377
Sat Sep  3 08:42:25 2016[1,125]<stderr>:16 0x00000000000b0650 PMPI_Reduce()  /hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20160903_025248_31719_100509_vegas27/mpi-install/GjoN/src/ompi-release/ompi/mpi/c/profile/preduce.c:136
Sat Sep  3 08:42:25 2016[1,125]<stderr>:17 0x0000000000402776 main()  ??:0
Sat Sep  3 08:42:25 2016[1,125]<stderr>:18 0x0000000000021b15 __libc_start_main()  ??:0
Sat Sep  3 08:42:25 2016[1,125]<stderr>:19 0x0000000000402409 _start()  ??:0
Sat Sep  3 08:42:25 2016[1,125]<stderr>:===================

@yosefe yosefe assigned brminich and unassigned alex-mikheev Sep 6, 2016
yosefe (Contributor) commented Sep 6, 2016

@brminich, you have a fix for this, right?

brminich (Contributor) commented Sep 6, 2016

Yes, I'm creating a test case now.

dmitrygx pushed a commit to dmitrygx/ucx that referenced this issue Dec 1, 2021
Set connection endpoint flag to no loopback