UD CREQ Assertion failure #892

Closed
alex--m opened this issue Jul 19, 2016 · 4 comments

@alex--m
Contributor

alex--m commented Jul 19, 2016

Reproduces on orion with 16 nodes (but not with 4), so this may be a scale or race issue.

The original command is:

module load hpcx-gcc && salloc -N 16 -p orion mpirun --display-map --bind-to core -mca pml ucx /usr/mpi/gcc/openmpi-1.10.3rc4/tests/osu-micro-benchmarks-5.2/osu_allgather

The visible error is:

[clx-orion-105:1450 :0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed
[clx-orion-105:1444 :0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed
[clx-orion-105:1437 :0]       ud_ep.c:472  Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed
[clx-orion-105:1436 :0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed
[clx-orion-105:1435 :0]       ud_ep.c:472  Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed
[clx-orion-105:1434 :0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed
[clx-orion-105:1439 :0]       ud_ep.c:472  Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed
[clx-orion-105:1447 :0]       ud_ep.c:472  Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed
==== backtrace ====
 0 0x000000000002b2e7 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:472
 1 0x000000000002b2e7 uct_ud_ep_process_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:552
 2 0x000000000002fd94 uct_ud_mlx5_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:390
 3 0x000000000002fd94 uct_ud_mlx5_iface_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:433
 4 0x00000000000136ce ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/datastruct/callbackq.h:201
 5 0x00000000000136ce uct_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/base/uct_md.c:229
 6 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:433
 7 0x00000000000096ed ucs_async_check_miss()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/async/async.h:135
 8 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:434
 9 0x0000000000002d4b mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/pml/ucx/pml_ucx.c:285
10 0x00000000000297ba opal_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/opal/runtime/opal_progress.c:187
11 0x000000000004378c ompi_mpi_init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/runtime/ompi_mpi_init.c:825
12 0x0000000000058270 PMPI_Init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mpi/c/profile/pinit.c:84
13 0x000000000040116d main()  ??:0
14 0x0000000000021b15 __libc_start_main()  ??:0
15 0x0000000000401599 _start()  ??:0
===================
[clx-orion-105:01437] *** Process received signal ***
[clx-orion-105:01437] Signal: Aborted (6)
[clx-orion-105:01437] Signal code:  (-6)
[clx-orion-105:01437] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff7612100]
[clx-orion-105:01437] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7ffff72775f7]
[clx-orion-105:01437] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7ffff7278ce8]
[clx-orion-105:01437] [ 3] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libucs.so.2(+0x33947)[0x7ffff0c0d947]
[clx-orion-105:01437] [ 4] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(uct_ud_ep_process_rx+0x457)[0x7ffff11092e7]
[clx-orion-105:01437] [ 5] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(+0x2fd94)[0x7ffff110dd94]
[clx-orion-105:01437] [ 6] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(uct_worker_progress+0x1e)[0x7ffff10f16ce]
[clx-orion-105:01437] [ 7] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libucp.so.2(ucp_worker_progress+0xd)[0x7ffff13346ed]
[clx-orion-105:01437] [ 8] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x2b)[0x7ffff1546d4b]
[clx-orion-105:01437] [ 9] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libopen-pal.so.13(opal_progress+0x2a)[0x7ffff6d1a7ba]
[clx-orion-105:01437] [10] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libmpi.so.12(ompi_mpi_init+0xa2c)[0x7ffff786278c]
[clx-orion-105:01437] [11] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libmpi.so.12(MPI_Init+0x170)[0x7ffff7877270]
[clx-orion-105:01437] [12] /usr/mpi/gcc/openmpi-1.10.3rc4/tests/osu-micro-benchmarks-5.
==== backtrace ====
 0 0x000000000002b6c9 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:439
 1 0x000000000002b6c9 uct_ud_ep_process_ack()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:370
 2 0x000000000002b6c9 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:464
 3 0x000000000002b6c9 uct_ud_ep_process_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:552
 4 0x000000000002fd94 uct_ud_mlx5_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:390
 5 0x000000000002fd94 uct_ud_mlx5_iface_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:433
 6 0x00000000000136ce ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/datastruct/callbackq.h:201
 7 0x00000000000136ce uct_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/base/uct_md.c:229
 8 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:433
 9 0x00000000000096ed ucs_async_check_miss()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/async/async.h:135
10 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:434
11 0x0000000000002d4b mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/pml/ucx/pml_ucx.c:285
12 0x00000000000297ba opal_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/opal/runtime/opal_progress.c:187
13 0x000000000004378c ompi_mpi_init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/runtime/ompi_mpi_init.c:825
14 0x0000000000058270 PMPI_Init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mpi/c/profile/pinit.c:84
15 0x000000000040116d main()  ??:0
16 0x0000000000021b15 __libc_start_main()  ??:0
17 0x0000000000401599 _start()  ??:0
===================
[clx-orion-105:01444] *** Process received signal ***
[clx-orion-105:01444] Signal: Aborted (6)
[clx-orion-105:01444] Signal code:  (-6)
[clx-orion-105:01444] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff7612100]
[clx-orion-105:01444] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7ffff72775f7]
[clx-orion-105:01444] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7ffff7278ce8]
[clx-orion-105:01444] [ 3] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libucs.so.2(+0x33947)[0x7ffff0c0d947]
[clx-orion-105:01444] [ 4] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(uct_ud_ep_process_rx+0x839)[0x7ffff11096c9]
[clx-orion-105:01444] [ 5] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(+0x2fd94)[0x7ffff110dd94]
[clx-orion-105:01444] [ 6] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(uct_worker_progress+0x1e)[0x7ffff10f16ce]
[clx-orion-105:01444] [ 7] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libucp.so.2(ucp_worker_progress+0xd)[0x7ffff13346ed]
[clx-orion-105:01444] [ 8] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x2b)[0x7ffff1546d4b]
[clx-orion-105:01444] [ 9] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libopen-pal.so.13(opal_progress+0x2a)[0x7ffff6d1a7ba]
[clx-orion-105:01444] [10] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat
==== backtrace ====
 0 0x000000000002b6c9 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:439
 1 0x000000000002b6c9 uct_ud_ep_process_ack()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:370
 2 0x000000000002b6c9 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:464
 3 0x000000000002b6c9 uct_ud_ep_process_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:552
 4 0x000000000002fd94 uct_ud_mlx5_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:390
 5 0x000000000002fd94 uct_ud_mlx5_iface_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:433
 6 0x00000000000136ce ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/datastruct/callbackq.h:201
 7 0x00000000000136ce uct_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/base/uct_md.c:229
 8 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:433
 9 0x00000000000096ed ucs_async_check_miss()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/async/async.h:135
10 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:434
11 0x0000000000002d4b mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/pml/ucx/pml_ucx.c:285
12 0x00000000000297ba opal_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/opal/runtime/opal_progress.c:187
13 0x000000000004378c ompi_mpi_init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/runtime/ompi_mpi_init.c:825
14 0x0000000000058270 PMPI_Init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mpi/c/profile/pinit.c:84
15 0x000000000040116d main()  ??:0
16 0x0000000000021b15 __libc_start_main()  ??:0
17 0x0000000000401599 _start()  ??:0
===================
[clx-orion-105:01450] *** Process received signal ***
[clx-orion-105:01450] Signal: Aborted (6)
[clx-orion-105:01450] Signal code:  (-6)
[clx-orion-105:01450] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff7612100]
[clx-orion-105:01450] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7ffff72775f7]
[clx-orion-105:01450] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7ffff7278ce8]
[clx-orion-105:01450] [ 3] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libucs.so.2(+0x33947)[0x7ffff0c0d947]
[clx-orion-105:01450] [ 4] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(uct_ud_ep_process_rx+0x839)[0x7ffff11096c9]
[clx-orion-105:01450] [ 5] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(+0x2fd94)[0x7ffff110dd94]
[clx-orion-105:01450] [ 6] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(uct_worker_progress+0x1e)[0x7ffff10f16ce]
[clx-orion-105:01450] [ 7] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libucp.so.2(ucp_worker_progress+0xd)[0x7ffff13346ed]
[clx-orion-105:01450] [ 8] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x2b)[0x7ffff1546d4b]
[clx-orion-105:01450] [ 9] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libopen-pal.so.13(opal_progress+0x2a)[0x7ffff6d1a7ba]
[clx-orion-105:01450] [10] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat
==== backtrace ====
 0 0x000000000002b2e7 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:472
 1 0x000000000002b2e7 uct_ud_ep_process_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:552
 2 0x000000000002fd94 uct_ud_mlx5_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:390
 3 0x000000000002fd94 uct_ud_mlx5_iface_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/accel/ud_mlx5.c:433
 4 0x00000000000136ce ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/datastruct/callbackq.h:201
 5 0x00000000000136ce uct_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/base/uct_md.c:229
 6 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:433
 7 0x00000000000096ed ucs_async_check_miss()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/async/async.h:135
 8 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:434
 9 0x0000000000002d4b mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/pml/ucx/pml_ucx.c:285
10 0x00000000000297ba opal_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/opal/runtime/opal_progress.c:187
11 0x000000000004378c ompi_mpi_init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/runtime/ompi_mpi_init.c:825
12 0x0000000000058270 PMPI_Init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mpi/c/profile/pinit.c:84
13 0x000000000040116d main()  ??:0
14 0x0000000000021b15 __libc_start_main()  ??:0
15 0x0000000000401599 _start()  ??:0
===================

<snip>

Ran with ucx master, built with debug, and got the following:

[clx-orion-105:1220 :0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed
==== backtrace ====
 0 0x000000000002b6c9 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:439
 1 0x000000000002b6c9 uct_ud_ep_process_ack()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:370
 2 0x000000000002b6c9 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:464
 3 0x000000000002b6c9 uct_ud_ep_process_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:552
 4 0x000000000002d8a0 uct_ud_verbs_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/verbs/ud_verbs.c:326
 5 0x000000000002d8a0 uct_ud_verbs_iface_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/verbs/ud_verbs.c:360
 6 0x00000000000136ce ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/datastruct/callbackq.h:201
 7 0x00000000000136ce uct_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/base/uct_md.c:229
 8 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:433
 9 0x00000000000096ed ucs_async_check_miss()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/async/async.h:135
10 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:434
11 0x0000000000002d4b mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/pml/ucx/pml_ucx.c:285
12 0x00000000000297ba opal_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/opal/runtime/opal_progress.c:187
13 0x00000000000851ac wait_completion()  hcoll_collectives.c:0
14 0x0000000000030982 comm_allreduce_hcolrte_generic()  common_allreduce.c:0
15 0x000000000003114c comm_allreduce_hcolrte()  ??:0
16 0x0000000000087c7c hcoll_get_context_from_cache()  ??:0
17 0x00000000000857b8 hcoll_create_context()  ??:0
18 0x0000000000002ce6 mca_coll_hcoll_comm_query()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/hcoll/coll_hcoll_module.c:309
19 0x000000000006802e query_2_0_0()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:392
20 0x000000000006802e query()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:375
21 0x000000000006802e check_one_component()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:337
22 0x000000000006802e check_components()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:301
23 0x000000000006802e mca_coll_base_comm_select()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:131
24 0x00000000000438c4 ompi_mpi_init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/runtime/ompi_mpi_init.c:895
25 0x0000000000058270 PMPI_Init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mpi/c/profile/pinit.c:84
26 0x000000000040116d main()  ??:0
27 0x0000000000021b15 __libc_start_main()  ??:0
28 0x0000000000401599 _start()  ??:0
===================
[clx-orion-105:01220] *** Process received signal ***
[clx-orion-105:01220] Signal: Aborted (6)
[clx-orion-105:01220] Signal code:  (-6)
[clx-orion-105:01220] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff7612100]
[clx-orion-105:01220] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7ffff72775f7]
[clx-orion-105:01220] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7ffff7278ce8]
[clx-orion-105:01220] [ 3] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libucs.so.2(+0x33947)[0x7ffff0c0d947]
[clx-orion-105:01220] [ 4] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(uct_ud_ep_process_rx+0x839)[0x7ffff11096c9]
[clx-orion-105:01220] [ 5] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(+0x2d8a0)[0x7ffff110b8a0]
[clx-orion-105:01220] [ 6] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libuct.so.2(uct_worker_progress+0x1e)[0x7ffff10f16ce]
[clx-orion-105:01220] [ 7] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../ucx/lib/libucp.so.2(ucp_worker_progress+0xd)[0x7ffff13346ed]
[clx-orion-105:01220] [ 8] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x2b)[0x7ffff1546d4b]
[clx-orion-105:01220] [ 9] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libopen-pal.so.13(opal_progress+0x2a)[0x7ffff6d1a7ba]
[clx-orion-105:01220] [10] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../hcoll/lib/libhcoll.so.1(+0x851ac)[0x7fffee0b91ac]
[clx-orion-105:01220] [11] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../hcoll/lib/libhcoll.so.1(+0x30982)[0x7fffee064982]
[clx-orion-105:01220] [12] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../hcoll/lib/libhcoll.so.1(comm_allreduce_hcolrte+0x4c)[0x7fffee06514c]
[clx-orion-105:01220] [13] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../hcoll/lib/libhcoll.so.1(hcoll_get_context_from_cache+0x34c)[0x7fffee0bbc7c]
[clx-orion-105:01220] [14] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/../../../hcoll/lib/libhcoll.so.1(hcoll_create_context+0xd8)[0x7fffee0b97b8]
[clx-orion-105:01220] [15] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x236)[0x7fffee46dce6]
[clx-orion-105:01220] [16] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libmpi.so.12(mca_coll_base_comm_select+0x15ae)[0x7ffff788702e]
[clx-orion-105:01220] [17] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libmpi.so.12(ompi_mpi_init+0xb64)[0x7ffff78628c4]
[clx-orion-105:01220] [18] /hpc/local/benchmarks/hpcx_install_Monday/hpcx-gcc-redhat6.5/ompi-v1.10/lib/libmpi.so.12(MPI_Init+0x170)[0x7ffff7877270]
[clx-orion-105:01220] [19] /usr/mpi/gcc/openmpi-1.10.3rc4/tests/osu-micro-benchmarks-5.2/osu_allgather[0x40116d]
[clx-orion-105:01220] [20] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff7263b15]
[clx-orion-105:01220] [21] /usr/mpi/gcc/openmpi-1.10.3rc4/tests/osu-micro-benchmarks-5.2/osu_allgather[0x401599]
[clx-orion-105:01220] *** End of error message ***
[clx-orion-116:14698:0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed
==== backtrace ====
 0 0x000000000002b6c9 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:439
 1 0x000000000002b6c9 uct_ud_ep_process_ack()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:370
 2 0x000000000002b6c9 uct_ud_ep_rx_creq()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:464
 3 0x000000000002b6c9 uct_ud_ep_process_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/base/ud_ep.c:552
 4 0x000000000002d8a0 uct_ud_verbs_iface_poll_rx()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/verbs/ud_verbs.c:326
 5 0x000000000002d8a0 uct_ud_verbs_iface_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/ib/ud/verbs/ud_verbs.c:360
 6 0x00000000000136ce ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/datastruct/callbackq.h:201
 7 0x00000000000136ce uct_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/uct/base/uct_md.c:229
 8 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:433
 9 0x00000000000096ed ucs_async_check_miss()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucs/async/async.h:135
10 0x00000000000096ed ucp_worker_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ucx-master/src/ucp/core/ucp_worker.c:434
11 0x0000000000002d4b mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/pml/ucx/pml_ucx.c:285
12 0x00000000000297ba opal_progress()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/opal/runtime/opal_progress.c:187
13 0x00000000000851ac wait_completion()  hcoll_collectives.c:0
14 0x0000000000030f8d comm_allreduce_hcolrte_generic()  common_allreduce.c:0
15 0x000000000003114c comm_allreduce_hcolrte()  ??:0
16 0x0000000000087c7c hcoll_get_context_from_cache()  ??:0
17 0x00000000000857b8 hcoll_create_context()  ??:0
18 0x0000000000002ce6 mca_coll_hcoll_comm_query()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/hcoll/coll_hcoll_module.c:309
19 0x000000000006802e query_2_0_0()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:392
20 0x000000000006802e query()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:375
21 0x000000000006802e check_one_component()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:337
22 0x000000000006802e check_components()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:301
23 0x000000000006802e mca_coll_base_comm_select()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mca/coll/base/coll_base_comm_select.c:131
24 0x00000000000438c4 ompi_mpi_init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/runtime/ompi_mpi_init.c:895
25 0x0000000000058270 PMPI_Init()  /hpc/local/benchmarks/hpcx_install_Monday/src/hpcx-gcc-redhat6.5/ompi-v1.10/ompi/mpi/c/profile/pinit.c:84
26 0x000000000040116d main()  ??:0
27 0x0000000000021b15 __libc_start_main()  ??:0
28 0x0000000000401599 _start()  ??:0
===================

<snip>
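
Since output from several ranks is interleaved above, it can help to count how often each assertion fires before reading the individual backtraces. The following is a minimal sketch of such a tally and is not part of UCX or Open MPI. The regular expression is an assumption derived only from the assertion lines shown in this report, and the names `ASSERT_RE` and `tally_assertions` are hypothetical.

```python
import re
from collections import Counter

# Assumed line format, taken from the report above, e.g.:
# [clx-orion-105:1450 :0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed
ASSERT_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\s*:\d+\]\s+"
    r"(?P<file>\S+):(?P<line>\d+)\s+Assertion `(?P<expr>.+)' failed"
)


def tally_assertions(log_text):
    """Count assertion failures per (file, line, expression).

    Lines that do not look like assertion failures (backtrace frames,
    signal reports, truncated output) are ignored.
    """
    counts = Counter()
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match:
            key = (match["file"], int(match["line"]), match["expr"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Three assertion lines copied from the report, plus one unrelated line.
    sample = "\n".join([
        "[clx-orion-105:1450 :0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed",
        "[clx-orion-105:1437 :0]       ud_ep.c:472  Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed",
        "[clx-orion-116:14698:0]       ud_ep.c:439  Assertion `ctl->type == UCT_UD_PACKET_CREQ' failed",
        "[clx-orion-105:01437] Signal: Aborted (6)",
    ])
    for (src, line, expr), n in tally_assertions(sample).most_common():
        print(f"{n}x {src}:{line}  {expr}")
```

Keying on file, line, and expression keeps the two distinct failures (`ud_ep.c:439` and `ud_ep.c:472`) separate even when they come from the same host.
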
@alex--m
Contributor Author

alex--m commented Jul 19, 2016

I also collected core files, if that helps, but the failure should be easily reproducible anyway.

@alinask
Contributor

alinask commented Jul 20, 2016

Looks like the same issue as #544.

@alex--m
Contributor Author

alex--m commented Jul 20, 2016

Agreed, I didn't notice that one when I posted. On the other hand, mine is easier to reproduce. This can be closed as a duplicate.

@yosefe
Contributor

yosefe commented Jul 22, 2016

Duplicate of #544.

@yosefe yosefe closed this as completed Jul 22, 2016