loop_spawn IBM test hanging #99

Closed

ompiteam opened this issue Oct 1, 2014 · 9 comments

ompiteam (Contributor) commented Oct 1, 2014

On the trunk and v1.5 branches (r22536), the IBM test loop_spawn is hanging. The exact iteration on which it hangs is nondeterministic; it hangs for me somewhere around iteration 200.
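
For context, loop_spawn is a parent program that repeatedly calls MPI_Comm_spawn to launch a one-process child and then disconnects from it; the child essentially just calls MPI_Init, gets the parent intercommunicator, and disconnects. A minimal sketch of the parent loop (hypothetical code for illustration, not the actual IBM test source):

```c
/* Minimal sketch of the spawn/disconnect loop the test exercises
 * (hypothetical; not the actual IBM loop_spawn source). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    int i, err;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 1000; ++i) {
        /* Every parent collectively spawns a single child process... */
        err = MPI_Comm_spawn("./loop_child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                             0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
        if (0 == i % 20) {
            printf("parent: MPI_Comm_spawn #%d return : %d\n", i, err);
        }
        /* ...and then tears the intercommunicator down again.  The hangs
         * reported below occur in MPI_Comm_disconnect or in the next spawn. */
        MPI_Comm_disconnect(&child);
    }
    MPI_Finalize();
    return 0;
}
```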

I'm running on two 4-core Linux nodes as follows:

$ mpirun -np 3 -bynode loop_spawn
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #20 return : 0
parent: MPI_Comm_spawn #40 return : 0
parent: MPI_Comm_spawn #60 return : 0
parent: MPI_Comm_spawn #80 return : 0
parent: MPI_Comm_spawn #100 return : 0
parent: MPI_Comm_spawn #120 return : 0
parent: MPI_Comm_spawn #140 return : 0
parent: MPI_Comm_spawn #160 return : 0
[...hang...]

Note that this does *not* happen on the v1.4 branch; the test seems to work fine there. This suggests that something has changed on the trunk/v1.5 that caused the problem.

**SIDENOTE:** It is worth noting that when using the openib BTL with this test on the v1.4 branch, the test fails much later (i.e., around iteration 1300 for me) because of what looks like a problem in the openib BTL; see https://svn.open-mpi.org/trac/ompi/ticket/1928.

I was unable to determine *why* it was hanging. The BT from 2 of the 3 parent processes appears to be nearly the same:

(gdb) bt
#0  0x0000002a9625590c in epoll_wait () from /lib64/tls/libc.so.6
#1  0x0000002a95a7016c in epoll_dispatch (base=0x519f20, arg=0x519db0,
    tv=0x7fbfffd110) at epoll.c:210
#2  0x0000002a95a6d83b in opal_event_base_loop (base=0x519f20, flags=2)
    at event.c:823
#3  0x0000002a95a6d568 in opal_event_loop (flags=2) at event.c:746
#4  0x0000002a95a49cb2 in opal_progress () at runtime/opal_progress.c:189
#5  0x0000002a958d458f in orte_grpcomm_base_allgather_list (
    names=0x7fbfffd410, sbuf=0x7fbfffd2e0, rbuf=0x7fbfffd280)
    at base/grpcomm_base_allgather.c:155
#6  0x0000002a958d5535 in orte_grpcomm_base_full_modex (procs=0x7fbfffd410,
    modex_db=true) at base/grpcomm_base_modex.c:115
#7  0x0000002a969fb470 in modex (procs=0x7fbfffd410)
    at grpcomm_bad_module.c:607
#8  0x0000002a9d418f67 in connect_accept (comm=0x5012e0, root=0,
    port_string=0x7fbfffd590 "", send_first=false, newcomm=0x7fbfffd990)
    at dpm_orte.c:375
#9  0x0000002a956c909a in PMPI_Comm_spawn (
    command=0x7fbfffda00 "./loop_child", argv=0x0, maxprocs=1, info=0x5016e0,
    root=0, comm=0x5012e0, intercomm=0x7fbfffdc20,
    array_of_errcodes=0x7fbfffdc38) at pcomm_spawn.c:126
#10 0x0000000000400c86 in main (argc=1, argv=0x7fbfffdd28) at loop_spawn.c:34
(gdb) 

Here's a bt from one of the two children:

(gdb) bt
#0  0x0000002a9623df89 in sched_yield () from /lib64/tls/libc.so.6
#1  0x0000002a95a49d0d in opal_progress () at runtime/opal_progress.c:220
#2  0x0000002a958d458f in orte_grpcomm_base_allgather_list (
    names=0x7fbfffd7a0, sbuf=0x7fbfffd670, rbuf=0x7fbfffd610)
    at base/grpcomm_base_allgather.c:155
#3  0x0000002a958d5535 in orte_grpcomm_base_full_modex (procs=0x7fbfffd7a0,
    modex_db=true) at base/grpcomm_base_modex.c:115
#4  0x0000002a969fb470 in modex (procs=0x7fbfffd7a0)
    at grpcomm_bad_module.c:607
#5  0x0000002a97673f67 in connect_accept (comm=0x5016d0, root=0,
    port_string=0x674a60 "4103012352.0;tcp://172.29.218.140:55452;tcp://10.10.20.250:55452;tcp://10.10.30.250:55452+4103012353.0;tcp://172.29.218.202:54128;tcp://10.10.20.202:54128;tcp://10.10.30.202:54128:562", send_first=true,
    newcomm=0x7fbfffd940) at dpm_orte.c:375
#6  0x0000002a97675f27 in dyn_init () at dpm_orte.c:946
#7  0x0000002a956ae7f0 in ompi_mpi_init (argc=1, argv=0x7fbfffdc78,
    requested=0, provided=0x7fbfffdb48) at runtime/ompi_mpi_init.c:846
#8  0x0000002a956d5fd1 in PMPI_Init (argc=0x7fbfffdb9c, argv=0x7fbfffdb90)
    at pinit.c:84
#9  0x0000000000400b14 in main (argc=1, argv=0x7fbfffdc78) at loop_child.c:17
(gdb) 

So they all appear to be in a modex. Beyond that, I am unfamiliar with this portion of the code base...

ompiteam added this to the Open MPI 1.6.6 milestone Oct 1, 2014
ompiteam (Contributor, Author) commented Oct 1, 2014

Imported from trac issue 2224. Created by jsquyres on 2010-02-03T15:43:20, last modified: 2013-10-30T19:07:44

ompiteam (Contributor, Author) commented Oct 1, 2014

Trac comment by jsquyres on 2010-02-03 16:37:42:

I updated the parenthetical about openib/v1.4 to be a bit more clear.

Additionally, this appears to be an openib BTL issue somehow.

If I run with --mca btl tcp,self, everything works fine (!). If I run with --mca btl openib,self, I get the hanging behavior.

I found another detail: during a hang, at least one parent is still stuck in the MPI_Comm_disconnect in the prior iteration (i.e., all other MPI parent procs have moved on to the next MPI_Comm_spawn, but this one is still stuck at the MPI_Comm_disconnect at the end of the prior iteration). That "stuck" parent is stuck in a condition wait in ompi_request_default_wait_all() -- it doesn't think that any of its requests have completed.

I have no idea why this would happen 200+ iterations into the run, nor why it seems to be related to the openib BTL...

ompiteam added the bug label Oct 1, 2014
ompiteam (Contributor, Author) commented Oct 1, 2014

Trac comment by jsquyres on 2010-10-26 17:32:20:

Just a ping that this is still happening on the SVN trunk HEAD as of r23957. It is definitely related to the openib BTL.

FWIW, it now hangs for me on the very first spawn in loop_spawn -- one parent process gets stuck somewhere in MPI_COMM_DISCONNECT (all the other parents complete it).

Re-assigning to Vasily, since Pasha left Mellanox.

ompiteam (Contributor, Author) commented Oct 1, 2014

Trac comment by jsquyres on 2011-01-11 07:23:33:

Just an update...

This is still happening on the trunk at r24214; I ran across it yesterday. It appears to be some kind of progression race condition in the COMM_DISCONNECT code, specifically the ORTE DPM ompi_dpm_base_disconnect_init() and ompi_dpm_base_disconnect_waitall() functions. When I ran into it yesterday, one parent process was stuck in ompi_dpm_base_disconnect_waitall() while all the others had moved on to the modex / grpcomm.

I replicated the problem by running the following on a single node with 2 active IB HCA ports. The node has 4 cores; this launches 4 processes each iteration (3 parents + 1 spawned child). I added some additional printfs to the IBM loop_spawn test; you can clearly see that one of the parent processes does not complete MPI_COMM_DISCONNECT in the very first iteration:

$ mpirun -np 3 --mca btl_openib_device_type ib --mca btl openib,self loop_spawn
parent: MPI_Comm_spawn #0 return : 0
parent 6304: after comm free
parent 6305: after comm free
parent 6306: after comm free
parent 6304: after comm disconnect
parent 6305: after comm disconnect

Here's the BT from 2 (out of 3) of the parent processes:

#0  0x00000032f840441b in clock_gettime () from /lib64/librt.so.1
#1  0x00002b42bb2babad in gettime (base=0x151ebd70, tp=0x151ebfe8)
    at /home/jsquyres/svn/ompi4/opal/mca/event/libevent207/libevent/event.c:359
#2  0x00002b42bb2bcd6f in event_base_loop (base=0x151ebd70, flags=2)
    at /home/jsquyres/svn/ompi4/opal/mca/event/libevent207/libevent/event.c:1548
#3  0x00002b42bb26d9e9 in opal_progress ()
    at /home/jsquyres/svn/ompi4/opal/runtime/opal_progress.c:189
#4  0x00002b42bb22bfee in orte_grpcomm_base_allgather_list (
    names=0x7fffad5ffcf0, sbuf=0x7fffad5ffae0, rbuf=0x7fffad5ffa80)
    at /home/jsquyres/svn/ompi4/orte/mca/grpcomm/base/grpcomm_base_allgather.c:217
#5  0x00002b42bb227d73 in orte_grpcomm_base_full_modex (procs=0x7fffad5ffcf0,
    modex_db=true)
    at /home/jsquyres/svn/ompi4/orte/mca/grpcomm/base/grpcomm_base_modex.c:116
#6  0x00002b42bc6413da in modex (procs=0x7fffad5ffcf0)
    at /home/jsquyres/svn/ompi4/orte/mca/grpcomm/bad/grpcomm_bad_module.c:335
#7  0x00002aaaab6fce7d in connect_accept (comm=0x601460, root=0,
    port_string=0x7fffad5ffed0 "3012820992.0;tcp://172.29.218.140:41711;tcp://10.10.10.140:41711;tcp://10.10.20.140:41711;tcp://10.10.30.140:41711;tcp://172.16.68.1:41711;tcp://172.16.29.1:41711+3012820993.0;tcp://172.29.218.140:591"..., send_first=false, newcomm=0x7fffad6002d0)
    at /home/jsquyres/svn/ompi4/ompi/mca/dpm/orte/dpm_orte.c:405
#8  0x00002b42bb17abb8 in PMPI_Comm_spawn (
    command=0x7fffad600320 "./loop_child", argv=0x0, maxprocs=1,
    info=0x601360, root=0, comm=0x601460, intercomm=0x7fffad600530,
    array_of_errcodes=0x7fffad600548)
    at /home/jsquyres/svn/ompi4/ompi/mpi/c/profile/pcomm_spawn.c:126
#9  0x0000000000400cef in main (argc=1, argv=0x7fffad600648)
    at /home/jsquyres/svn/ompi-tests/ibm/dynamic/loop_spawn.c:34

and here's the BT from the process that was stuck in COMM_DISCONNECT, waiting for its MPI requests to complete:

#0  0x00002b76654bb19a in btl_openib_component_progress ()
    at /home/jsquyres/svn/ompi4/ompi/mca/btl/openib/btl_openib_component.c:3563
#1  0x00002b76619b1a11 in opal_progress ()
    at /home/jsquyres/svn/ompi4/opal/runtime/opal_progress.c:207
#2  0x00002b766189c664 in opal_condition_wait (c=0x2b7661cf04e0,
    m=0x2b7661cf0560) at /home/jsquyres/svn/ompi4/opal/threads/condition.h:94
#3  0x00002b766189cc05 in ompi_request_default_wait_all (count=2,
    requests=0xcc0fc60, statuses=0x0)
    at /home/jsquyres/svn/ompi4/ompi/request/req_wait.c:263
#4  0x00002b7661910acf in ompi_dpm_base_disconnect_waitall (count=1,
    objs=0x7fffe03e9098)
    at /home/jsquyres/svn/ompi4/ompi/mca/dpm/base/dpm_base_common_fns.c:212
#5  0x00002aaaab6fd978 in disconnect (comm=0xcc820e0)
    at /home/jsquyres/svn/ompi4/ompi/mca/dpm/orte/dpm_orte.c:549
#6  0x00002b76618b9ea9 in PMPI_Comm_disconnect (comm=0x7fffe03e9300)
    at /home/jsquyres/svn/ompi4/ompi/mpi/c/profile/pcomm_disconnect.c:63
#7  0x0000000000400d89 in main (argc=1, argv=0x7fffe03e9418)
    at /home/jsquyres/svn/ompi-tests/ibm/dynamic/loop_spawn.c:49

(I updated the CC list, too)

ompiteam (Contributor, Author) commented Oct 1, 2014

Trac comment by jsquyres on 2011-01-11 07:29:31:

(adding George, Edgar, and Ralph because this is now a PML / DPM / ORTE question)

One interesting thing that I just noticed is that if I change MCA_PML_BASE_SEND_SYNCHRONOUS to MCA_PML_BASE_SEND_STANDARD in the DPM base function, it works:

Index: ompi/mca/dpm/base/dpm_base_common_fns.c
===================================================================
--- ompi/mca/dpm/base/dpm_base_common_fns.c (revision 24214)
+++ ompi/mca/dpm/base/dpm_base_common_fns.c (working copy)
@@ -154,7 +154,7 @@
         }
         ret = MCA_PML_CALL(isend (&(obj->buf), 0, MPI_INT, i,
                      OMPI_COMM_BARRIER_TAG,
-                     MCA_PML_BASE_SEND_SYNCHRONOUS,
+                     MCA_PML_BASE_SEND_STANDARD,
                      comm, &(obj->reqs[2*i+1])));

         if ( OMPI_SUCCESS != ret ) {
  • Why is the send synchronous in the first place? Since both isends and ireceives are posted, the use of a synchronous send seems superfluous from an MPI-semantics perspective. Is there another reason? (See the sketch after this list.)
  • Why would using a synchronous send cause the openib BTL to hang? Is this the same behavior that we saw forever ago in the collectives (e.g., with barrier), where the btl/pml returned when the buffer was available for re-use but the message hadn't yet been placed on the network stack? (That doesn't seem plausible, though, since it hangs in the *first* iteration -- the network resources shouldn't be backed up yet...)
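
For reference, the handshake implemented by ompi_dpm_base_disconnect_init()/ompi_dpm_base_disconnect_waitall() boils down to a zero-byte irecv/isend pair per remote peer followed by a wait-all. Here is a simplified MPI-level sketch of that pattern (illustration only, not the actual OMPI internals; the function name and tag value are made up):

```c
/* Simplified sketch of the zero-byte "barrier" the DPM disconnect code
 * performs over the communicator being torn down.  Illustration only;
 * the real code lives in dpm_base_common_fns.c and calls the PML
 * directly with OMPI_COMM_BARRIER_TAG. */
#include <mpi.h>
#include <stdlib.h>

static void disconnect_barrier(MPI_Comm intercomm)
{
    int i, size, buf = 0;

    MPI_Comm_remote_size(intercomm, &size);
    MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));

    for (i = 0; i < size; ++i) {
        /* One zero-byte receive and one zero-byte send per remote peer. */
        MPI_Irecv(&buf, 0, MPI_INT, i, 4242 /* placeholder tag */, intercomm,
                  &reqs[2 * i]);
        /* The real code posts this send in synchronous mode (the
         * MCA_PML_BASE_SEND_SYNCHRONOUS flag); MPI_Issend is the closest
         * MPI-level equivalent.  The patch above switches it to a
         * standard-mode send. */
        MPI_Issend(&buf, 0, MPI_INT, i, 4242 /* placeholder tag */, intercomm,
                   &reqs[2 * i + 1]);
    }

    /* The "stuck" parent described earlier never gets out of this wait. */
    MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```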

ompiteam (Contributor, Author) commented Oct 1, 2014

Trac comment by bosilca on 2011-01-12 13:45:03:

So we do have a fully-fledged barrier in the MPI_Comm_disconnect? Terrific!

I might have an explanation for why going from SYNCHRONOUS to STANDARD might remove the deadlock. Imagine that no process has yet exchanged messages with rank 0. Since there is a full barrier (not an optimized one) on disconnect, and this barrier forces contacting the processes in rank order, at the beginning of the barrier every process starts bashing poor rank 0 with connection requests. This might overload the number of allowed accepts and lead to some of the connection requests being dropped. As a result, we might end up deadlocking. We had a similar problem a while back in the TCP BTL, but it was fixed. The openib BTL might have the same issue, but at this point this is pure conjecture.
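
To illustrate the mechanism being conjectured (not OMPI code, and only an analogy with the TCP case mentioned above): a listener with a small pending-connection queue can silently drop or refuse connection attempts that arrive simultaneously, which is how a connection storm toward rank 0 could stall the barrier.

```c
/* Plain-sockets illustration of the conjectured overload, not OMPI code.
 * With a tiny listen() backlog, simultaneous connect() attempts beyond
 * the queue limit are dropped (forcing SYN retransmits) or refused,
 * depending on the OS -- analogous to rank 0 being bashed with
 * connection requests at the start of the disconnect barrier. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;                /* let the kernel pick a free port */

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 2);                    /* only 2 pending connections are queued */

    /* If this process never (or too slowly) calls accept(), additional
     * peers trying to connect at the same time pile up and get dropped. */
    close(fd);
    return 0;
}
```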

ompiteam (Contributor, Author) commented Oct 1, 2014

Trac comment by jsquyres on 2011-04-26 09:58:47:

Re-assigning to new Mellanox contact...

ompiteam (Contributor, Author) commented Oct 1, 2014

Trac comment by hjelmn on 2013-10-30 19:07:44:

I want to get this bug (and another one associated with loop_spawn) fixed for the 1.7.x/1.8.x series.

Quick question. Why is the isend in disconnect synchronous anyway? The irecv will end up synchronizing the two peers so it seems unnecessary.

I can confirm that loop_spawn does indeed hang in 1.7.x and trunk when isend is synchronous but does not if the isend is standard.
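
To make the standard-vs-synchronous distinction concrete, here is a tiny standalone example (plain MPI, not OMPI internals): both peers pre-post the receive, as the disconnect code does, and then use a standard-mode isend; swapping MPI_Isend for MPI_Issend gives the synchronous variant being questioned.

```c
/* Run with at least 2 processes; ranks beyond 1 sit out. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < 2) {
        int peer = 1 - rank;

        /* Pre-posted receive, mirroring the irecv in the disconnect code. */
        MPI_Irecv(&buf, 0, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);

        /* Standard mode: the zero-byte send may complete as soon as the
         * library has taken it.  A synchronous send (MPI_Issend) would not
         * complete until the peer's matching receive has started -- the
         * extra rendezvous that the comment above argues is unnecessary,
         * since the pre-posted irecv already pairs up the peers. */
        MPI_Isend(&buf, 0, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d: handshake complete\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```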

yosefe pushed a commit to yosefe/ompi that referenced this issue Mar 5, 2015: …-warning-fix-v1.8 ("Remove unused variable").
rhc54 (Contributor) commented Jan 3, 2017

Fixed long ago.
