RoCE problem with OMPI direct modex - UD assertion #1005

Closed
artpol84 opened this issue Sep 23, 2016 · 10 comments
@artpol84
Contributor

OMPI version open-mpi/ompi@917d96ba50efa8 (compiled without debug)
UCX version 69545a1 (default configuration)

On the RoCE adapters for the command:
mpirun --mca pml ucx --mca bml '^r2' --mca mpi_add_procs_cutoff 0 --mca pmix_base_collect_data 0 --mca pmix_base_async_modex 1 --map-by core -np 3556 ./hello_c

I see the following backtrace:

~/openucx-ucx-69545a1/src/uct/ib/ud/base/ud_ep.c: [ uct_ud_ep_rx_ctl() ]
      ...
      481     ucs_trace_func("");
      482     ucs_assert_always(ctl->type == UCT_UD_PACKET_CREP);
      483     ucs_assert_always(ep->dest_ep_id == UCT_UD_EP_NULL_ID ||
==>   484                       ep->dest_ep_id == ctl->conn_rep.src_ep_id);
      485
      486     /* Discard duplicate CREP */
      487     if (UCT_UD_PSN_COMPARE(neth->psn, <, ep->rx.ooo_pkts.head_sn)) {
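
The `UCT_UD_PSN_COMPARE` on line 487 is what makes the duplicate check work across sequence-number wraparound. As a hedged illustration (not UCX's actual macro — `psn_cmp` and `psn_t` are hypothetical names), a wraparound-safe PSN comparison can be sketched like this:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch only: a wraparound-safe packet-serial-number (PSN)
 * comparison in the spirit of UCT_UD_PSN_COMPARE above. psn_cmp() is a
 * hypothetical name, not UCX's macro. UD PSNs live in a small circular
 * space, so "older than" must be computed modulo the PSN width
 * (RFC 1982-style serial number arithmetic). */
typedef uint16_t psn_t;

static inline int psn_cmp(psn_t a, psn_t b)
{
    /* Negative if a precedes b on the circle, zero if equal,
     * positive if a follows b. The cast to a signed type of the
     * same width does the modular interpretation. */
    return (int16_t)(a - b);
}
```

With this, a retransmitted packet whose PSN precedes `rx.ooo_pkts.head_sn` compares as "older" even when the counter has wrapped past zero.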


==== backtrace ====
0 0x000000000003f52d uct_ud_ep_rx_ctl()  ~/openucx-ucx-69545a1/src/uct/ib/ud/base/ud_ep.c:484
1 0x000000000003f52d uct_ud_ep_process_rx()  ~/openucx-ucx-69545a1/src/uct/ib/ud/base/ud_ep.c:589
2 0x0000000000040c79 uct_ud_verbs_iface_poll_rx()  ~/openucx-ucx-69545a1/src/uct/ib/ud/verbs/ud_verbs.c:326
3 0x0000000000040c79 uct_ud_verbs_iface_async_progress()  ~/openucx-ucx-69545a1/src/uct/ib/ud/verbs/ud_verbs.c:345
4 0x000000000003b127 uct_ud_iface_async_progress()  ~/openucx-ucx-69545a1/src/uct/ib/ud/base/ud_iface.c:715
5 0x00000000000380ca ucs_async_dispatch_handler_cb()  ~/openucx-ucx-69545a1/src/ucs/async/async.c:94
6 0x0000000000038df5 ucs_hashed_ucs_async_handler_t_find()  ~/openucx-ucx-69545a1/src/ucs/async/async.c:23
7 0x0000000000038efb ucs_async_dispatch_handler()  ~/openucx-ucx-69545a1/src/ucs/async/async.c:126
8 0x000000000003c375 ucs_async_thread_func()  ~/openucx-ucx-69545a1/src/ucs/async/thread.c:82
9 0x0000000000007dc5 start_thread()  pthread_create.c:0
10 0x00000000000f61cd __clone()  ??:0
===================
[clx-orion-113:32433] *** Process received signal ***
[clx-orion-113:32433] Signal: Aborted (6)
[clx-orion-113:32433] Signal code:  (-6)
[clx-orion-113:32433] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff78d5100]
[clx-orion-113:32433] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7ffff753a5f7]
[clx-orion-113:32433] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7ffff753bce8]
[clx-orion-113:32433] [ 3] /openucx-ucx-69545a1_latest/lib/libucs.so.2(__ucs_abort+0x10e)[0x7fffea36938e]
[clx-orion-113:32433] [ 4] /openucx-ucx-69545a1_latest/lib/libuct.so.2(uct_ud_ep_process_rx+0x89d)[0x7fffea89952d]
[clx-orion-113:32433] [ 5] /openucx-ucx-69545a1_latest/lib/libuct.so.2(+0x40c79)[0x7fffea89ac79]
[clx-orion-113:32433] [ 6] /openucx-ucx-69545a1_latest/lib/libuct.so.2(+0x3b127)[0x7fffea895127]
[clx-orion-113:32433] [ 7] /openucx-ucx-69545a1_latest/lib/libucs.so.2(+0x380ca)[0x7fffea35b0ca]
[clx-orion-113:32433] [ 8] /openucx-ucx-69545a1_latest/lib/libucs.so.2(ucs_hashed_ucs_async_handler_t_find+0x35)[0x7fffea35bdf5]
[clx-orion-113:32433] [ 9] /openucx-ucx-69545a1_latest/lib/libucs.so.2(ucs_async_dispatch_handler+0x4b)[0x7fffea35befb]
[clx-orion-113:32433] [10] /openucx-ucx-69545a1_latest/lib/libucs.so.2(+0x3c375)[0x7fffea35f375]
[clx-orion-113:32433] [11] /usr/lib64/libpthread.so.0(+0x7dc5)[0x7ffff78cddc5]
[clx-orion-113:32433] [12] /usr/lib64/libc.so.6(clone+0x6d)[0x7ffff75fb1cd]
[clx-orion-113:32433] *** End of error message ***

~/openucx-ucx-69545a1/src/uct/ib/ud/base/ud_ep.c: [ uct_ud_ep_rx_ctl() ]
      ...
      481     ucs_trace_func("");
      482     ucs_assert_always(ctl->type == UCT_UD_PACKET_CREP);
      483     ucs_assert_always(ep->dest_ep_id == UCT_UD_EP_NULL_ID ||
==>   484                       ep->dest_ep_id == ctl->conn_rep.src_ep_id);
      485
      486     /* Discard duplicate CREP */
      487     if (UCT_UD_PSN_COMPARE(neth->psn, <, ep->rx.ooo_pkts.head_sn)) {

==== backtrace ====
0 0x000000000003f52d uct_ud_ep_rx_ctl()  ~/openucx-ucx-69545a1/src/uct/ib/ud/base/ud_ep.c:484
1 0x000000000003f52d uct_ud_ep_process_rx()  ~/openucx-ucx-69545a1/src/uct/ib/ud/base/ud_ep.c:589
2 0x0000000000040f9e uct_ud_verbs_iface_poll_rx()  ~/openucx-ucx-69545a1/src/uct/ib/ud/verbs/ud_verbs.c:326
3 0x0000000000040f9e uct_ud_verbs_iface_progress()  ~/openucx-ucx-69545a1/src/uct/ib/ud/verbs/ud_verbs.c:360
4 0x00000000000162ae ucs_callbackq_dispatch()  ~/openucx-ucx-69545a1/src/ucs/datastruct/callbackq.h:263
5 0x00000000000162ae uct_worker_progress()  ~/openucx-ucx-69545a1/src/uct/base/uct_md.c:229
6 0x000000000000dfc0 ucp_worker_progress()  ~/openucx-ucx-69545a1/src/ucp/core/ucp_worker.c:546
7 0x000000000000dfc0 ucs_async_check_miss()  ~/openucx-ucx-69545a1/src/ucs/async/async.h:135
8 0x000000000000dfc0 ucp_worker_progress()  ~/openucx-ucx-69545a1/src/ucp/core/ucp_worker.c:547
9 0x0000000000004e4b mca_pml_ucx_send()  ??:0
10 0x000000000009dba3 ompi_coll_base_reduce_generic()  ??:0
11 0x000000000009e6f5 ompi_coll_base_reduce_intra_binomial()  ??:0
12 0x0000000000004fee ompi_coll_tuned_reduce_intra_dec_fixed()  ??:0
13 0x0000000000077251 PMPI_Reduce()  ??:0
14 0x0000000000400c7b main()  hello_c.c:39
15 0x0000000000021b15 __libc_start_main()  ??:0
16 0x0000000000400a09 _start()  ??:0
===================
[clx-orion-113:32403] *** Process received signal ***
[clx-orion-113:32403] Signal: Aborted (6)
[clx-orion-113:32403] Signal code:  (-6)
[clx-orion-113:32403] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff78d5100]
[clx-orion-113:32403] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7ffff753a5f7]
[clx-orion-113:32403] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7ffff753bce8]
[clx-orion-113:32403] [ 3] /openucx-ucx-69545a1_latest/lib/libucs.so.2(__ucs_abort+0x10e)[0x7fffea36938e]
[clx-orion-113:32403] [ 4] /openucx-ucx-69545a1_latest/lib/libuct.so.2(uct_ud_ep_process_rx+0x89d)[0x7fffea89952d]
[clx-orion-113:32403] [ 5] /openucx-ucx-69545a1_latest/lib/libuct.so.2(+0x40f9e)[0x7fffea89af9e]
[clx-orion-113:32403] [ 6] /openucx-ucx-69545a1_latest/lib/libuct.so.2(uct_worker_progress+0x1e)[0x7fffea8702ae]
[clx-orion-113:32403] [ 7] /openucx-ucx-69545a1_latest/lib/libucp.so.2(ucp_worker_progress+0x20)[0x7fffeaad6fc0]
[clx-orion-113:32403] [ 8] /ompi2/msg/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x12b)[0x7fffeacf5e4b]
[clx-orion-113:32403] [ 9] /ompi2/msg/lib/libmpi.so.0(ompi_coll_base_reduce_generic+0x463)[0x7ffff7b7fba3]
[clx-orion-113:32403] [10] /ompi2/msg/lib/libmpi.so.0(ompi_coll_base_reduce_intra_binomial+0xe5)[0x7ffff7b806f5]
[clx-orion-113:32403] [11] /ompi2/msg/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x18e)[0x7fffe8664fee]
[clx-orion-113:32403] [12] /ompi2/msg/lib/libmpi.so.0(MPI_Reduce+0x1d1)[0x7ffff7b59251]
[clx-orion-113:32403] [13] ./hello_c[0x400c7b]
[clx-orion-113:32403] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff7526b15]
…..
@yosefe
Contributor

yosefe commented Sep 24, 2016

@brminich is it related to #953?

@yosefe yosefe added the Bug label Sep 24, 2016
@brminich
Contributor

Not sure. I've seen such a fault with my UD error-handling code, but it should not be relevant for master (the root cause there was a specific disconnect change). Needs to be analyzed.

@brminich
Contributor

I think we should drop the duplicated packet before that assert. I will check whether that helps on Orion.
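
As a hedged sketch of that reordering (the names `ud_ep_t`, `crep_t`, `UD_EP_NULL_ID`, etc. are illustrative stand-ins for the structures in the excerpt above, not the actual patch), the duplicate check would move ahead of the `src_ep_id` assertion:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: discard a duplicate CREP *before* validating its
 * src_ep_id, so a stale retransmission cannot trip the assertion. */
#define UD_EP_NULL_ID 0xffffffffu

typedef struct { uint32_t src_ep_id; uint16_t psn; } crep_t;
typedef struct { uint32_t dest_ep_id; uint16_t rx_head_sn; } ud_ep_t;

/* Wraparound-safe "a precedes b", as UCT_UD_PSN_COMPARE would compute */
static int psn_before(uint16_t a, uint16_t b) { return (int16_t)(a - b) < 0; }

/* Returns 0 if the CREP was accepted, -1 if dropped as a duplicate */
static int ep_rx_crep(ud_ep_t *ep, const crep_t *crep)
{
    if (psn_before(crep->psn, ep->rx_head_sn)) {
        return -1;  /* duplicate CREP: drop before any assertion fires */
    }
    assert(ep->dest_ep_id == UD_EP_NULL_ID ||
           ep->dest_ep_id == crep->src_ep_id);
    ep->dest_ep_id = crep->src_ep_id;
    return 0;
}
```

With this ordering, a retransmitted CREP carrying an old PSN is silently discarded even if its `src_ep_id` no longer matches the endpoint's recorded destination.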

@yosefe yosefe changed the title RoCE problem with OMPI direct modex RoCE problem with OMPI direct modex - UD assertion May 8, 2017
@brminich
Contributor

Noticed the following errors from libibverbs while reproducing the issue:

libibverbs: resolver: Destination unrechable (type 7)
libibverbs: resolver: Missing params
libibverbs: resolver: Not a VLAN link (errno = Operation not supported)
libibverbs: Neigh resolution process failed
[clx-orion-123:29358] pml_ucx.c:271 Error: Failed to connect to proc: 310, Address not valid
[clx-orion-123:29358] pml_ucx.c:668 Error: Failed to get ep for rank 310

@brminich
Contributor

brminich commented May 30, 2017

Another ibverbs error with latest UCX:
[clx-orion-018:4708 :0] ud_ep.c:503 Assertion ep->dest_ep_id == UCT_UD_EP_NULL_ID || ep->dest_ep_id == ctl->conn_rep.src_ep_id failed
libibverbs: resolver: Couldn't allocate neigh cache Netlink Error (errno = Connection refused)
[clx-orion-008:19690:0] ud_ep.c:436 Assertion status == UCS_OK failed

@brminich
Contributor

Added a unique ID to all CREQ packets for debugging. It turns out the same CREQ is received by two different processes (on different hosts); both of them then reply with a CREP, and the fault appears.
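
A minimal sketch of that debugging aid, assuming a sender-side helper (`next_creq_id` is a hypothetical name; the actual change to UCX's CREQ format is not shown in this thread):

```c
#include <stdint.h>
#include <unistd.h>

/* Illustrative only: stamp every outgoing CREQ with a process-unique ID
 * so a receiver can log when the "same" CREQ is delivered twice, even to
 * two different processes on different hosts. */
static uint64_t next_creq_id(void)
{
    static uint32_t counter;
    /* pid in the high bits keeps IDs unique across sender processes;
     * the low-bits counter distinguishes CREQs from the same sender */
    return ((uint64_t)getpid() << 32) | counter++;
}
```

Logging this ID on both send and receive is what lets the duplicate delivery described above be spotted across hosts.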

@yosefe yosefe added this to the v1.3 milestone Jun 1, 2017
@alinask
Contributor

alinask commented Sep 26, 2017

@artpol84 @brminich @amaslenn Can we close this issue?

@amaslenn
Contributor

@alinask
Contributor

alinask commented Sep 26, 2017

@amaslenn this is not the same issue, since it doesn't have the UD assertion described in this ticket. The error in the link is:
libibverbs: resolver: Couldn't allocate neigh cache Netlink Error (errno = Connection refused)

@alinask
Contributor

alinask commented Sep 26, 2017

Discussed with @artpol84 offline; closing this.
