
Update PMIx and integration glue #3696

Merged
merged 1 commit into open-mpi:master from topic/pmix3
Jun 20, 2017
Conversation

rhc54
Contributor

@rhc54 rhc54 commented Jun 13, 2017

No description provided.

@ggouaillardet
Contributor

@rhc54 I made rhc54/ompi#5 in order to fix a runtime failure
(FWIW, everything looks fine when configured with --enable-debug)

@rhc54
Contributor Author

rhc54 commented Jun 13, 2017

THANK YOU!! I was just going to start chasing it 😄

@open-mpi open-mpi deleted a comment from ibm-ompi Jun 13, 2017
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 13, 2017
@rhc54
Contributor Author

rhc54 commented Jun 13, 2017

@ggouaillardet I had to rebase this to get rid of that blasted "merge" commit as it was "unsigned". I left the commit messages intact.

@rhc54
Contributor Author

rhc54 commented Jun 13, 2017

@bwbarrett Is there any way to tell if this had something to do with this PR? It's a failure when --enable-built-in-atomics is used, so I suspect that either OPAL or PMIx has a bug in the built-ins.

@ibm-ompi

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/e86ac666b05846eaaa421e1e32a78587

@ibm-ompi

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/7a572bb1fbfed0ad825eb0975098a38c

@ibm-ompi

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/f0e7a74b468a756b1d703c15375fa46b

@ggouaillardet
Contributor

@rhc54 did you do something else besides the rebase?
Even hello_c run as a singleton is hanging.

I dug a bit and found that this could be caused by a recursive lock.

global lock is initially taken at

#0  0x00007ffff787cc02 in __GI___pthread_mutex_lock (mutex=0x7ffff4954730 <pmix_global_lock+16>) at pthread_mutex_lock.c:64
#1  0x00007ffff469e02d in pmix_mutex_lock (m=0x7ffff4954720 <pmix_global_lock>) at /home/gilles/src/ompi-master/opal/mca/pmix/pmix2x/pmix/src/threads/mutex_unix.h:145
#2  0x00007ffff469f698 in OPAL_MCA_PMIX2X_PMIx_Init (proc=0x7ffff49568a0 <my_proc>, info=0x6ca870, ninfo=4) at ../../../../../../../../src/ompi-master/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client.c:329
#3  0x00007ffff466a753 in pmix2x_client_init (ilist=0x7fffffffd8d0) at ../../../../../../src/ompi-master/opal/mca/pmix/pmix2x/pmix2x_client.c:98
#4  0x00007ffff7ad1e09 in ompi_interlib_declare (threadlevel=0, version=0x7ffff7b84141 "4.0.0a1") at ../../../src/ompi-master/ompi/interlib/interlib.c:158
#5  0x00007ffff7ad8770 in ompi_mpi_init (argc=1, argv=0x7fffffffdd58, requested=0, provided=0x7fffffffdb2c) at ../../../src/ompi-master/ompi/runtime/ompi_mpi_init.c:549
#6  0x00007ffff7b0f0c0 in PMPI_Init (argc=0x7fffffffdb5c, argv=0x7fffffffdb50) at ../../../../../../src/ompi-master/ompi/mpi/c/profile/pinit.c:66
#7  0x00000000004008de in main (argc=1, argv=0x7fffffffdd58) at ../../src/ompi-master/examples/hello_c.c:18

and then it is taken again at

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:138
#1  0x00007ffff787cd02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007ffff787cc08 in __GI___pthread_mutex_lock (mutex=0x7ffff4954730 <pmix_global_lock+16>) at pthread_mutex_lock.c:64
#3  0x00007ffff467fcb1 in pmix_mutex_lock (m=0x7ffff4954720 <pmix_global_lock>) at /home/gilles/src/ompi-master/opal/mca/pmix/pmix2x/pmix/src/threads/mutex_unix.h:145
#4  0x00007ffff467fe42 in OPAL_MCA_PMIX2X_PMIx_Notify_event (status=-147, source=0x7ffff4955204 <OPAL_MCA_PMIX2X_pmix_globals+4>, range=7 '\a', info=0x6cb100, ninfo=5, cbfunc=0x7ffff469e684 <release_info>,
    cbdata=0x6c77e0) at ../../../../../../../../src/ompi-master/opal/mca/pmix/pmix2x/pmix/src/event/pmix_event_notification.c:48
#5  0x00007ffff469f634 in _check_for_notify (info=0x6ca870, ninfo=4) at ../../../../../../../../src/ompi-master/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client.c:304
#6  0x00007ffff469f758 in OPAL_MCA_PMIX2X_PMIx_Init (proc=0x7ffff49568a0 <my_proc>, info=0x6ca870, ninfo=4) at ../../../../../../../../src/ompi-master/opal/mca/pmix/pmix2x/pmix/src/client/pmix_client.c:343
#7  0x00007ffff466a753 in pmix2x_client_init (ilist=0x7fffffffd8d0) at ../../../../../../src/ompi-master/opal/mca/pmix/pmix2x/pmix2x_client.c:98
#8  0x00007ffff7ad1e09 in ompi_interlib_declare (threadlevel=0, version=0x7ffff7b84141 "4.0.0a1") at ../../../src/ompi-master/ompi/interlib/interlib.c:158
#9  0x00007ffff7ad8770 in ompi_mpi_init (argc=1, argv=0x7fffffffdd58, requested=0, provided=0x7fffffffdb2c) at ../../../src/ompi-master/ompi/runtime/ompi_mpi_init.c:549
#10 0x00007ffff7b0f0c0 in PMPI_Init (argc=0x7fffffffdb5c, argv=0x7fffffffdb50) at ../../../../../../src/ompi-master/ompi/mpi/c/profile/pinit.c:66
#11 0x00000000004008de in main (argc=1, argv=0x7fffffffdd58) at ../../src/ompi-master/examples/hello_c.c:18

Note that the second lock attempt hangs because the mutex is not recursive:

pmix_lock_t pmix_global_lock = {
    .mutex = PMIX_MUTEX_STATIC_INIT,
};

If it is declared with PMIX_RECURSIVE_MUTEX_STATIC_INIT, this goes a bit further, but then hangs in the cond_wait.
That being said, I am not sure we should have run into this (recursive lock) in the first place.
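
For illustration, a minimal plain-pthreads sketch of the non-recursive vs. recursive behavior described above; it uses standard pthread calls rather than the PMIx lock macros, and the variable names are made up:

#include <pthread.h>
#include <stdio.h>

int main(void)
{
    /* Plain (non-recursive) mutex, as with a default static initializer:
     * a second lock from the same thread blocks forever - the hang
     * described above. */
    pthread_mutex_t plain = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&plain);
    /* pthread_mutex_lock(&plain);     <-- would deadlock right here */
    pthread_mutex_unlock(&plain);

    /* Recursive mutex: the owning thread may re-lock it, provided it
     * unlocks the same number of times. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_t recursive;
    pthread_mutex_init(&recursive, &attr);
    pthread_mutex_lock(&recursive);
    pthread_mutex_lock(&recursive);     /* ok: recursion depth is now 2 */
    pthread_mutex_unlock(&recursive);
    pthread_mutex_unlock(&recursive);
    pthread_mutex_destroy(&recursive);
    pthread_mutexattr_destroy(&attr);

    printf("done\n");
    return 0;
}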

@ggouaillardet
Contributor

@rhc54 I found some differences.
I pushed my commit ggouaillardet/ompi@ebfbbae on top of your ggouaillardet/ompi@4075982 and ggouaillardet/ompi@fdd958f.

If I compare my branch to the existing one, the main difference is that PMIX_WAIT_THREAD() was replaced with PMIX_ACQUIRE_THREAD(), plus a few more changes.

Note that you squashed all the commits (e.g. the PMIx refresh from the upstream repo and the ompi glue) into one.

@rhc54
Contributor Author

rhc54 commented Jun 13, 2017

Yes, there is a Coverity fix in there as well, and I didn't get all the changes correct. I'll continue working on it. As I said, I had to rebase to get rid of the "unsigned" merge commit.

@open-mpi open-mpi deleted a comment from ibm-ompi Jun 13, 2017
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 13, 2017
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 13, 2017
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 13, 2017
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 13, 2017
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 13, 2017
@rhc54
Contributor Author

rhc54 commented Jun 13, 2017

bot:ompi:retest

@jjhursey
Member

jjhursey commented Jun 13, 2017

I just ran a hello world loop test (-np 2 on a single machine) and saw a segv - I'm investigating now.

--------------------------------------------
-------------------- Iteration 30 of 2000
[c656f6n03:86064] *** Process received signal ***
[c656f6n03:86064] Signal: Segmentation fault (11)
[c656f6n03:86064] Signal code: Address not mapped (1)
[c656f6n03:86064] Failing at address: 0x90
[c656f6n03:86064] [ 0] [0x3fffb21b0478]
[c656f6n03:86064] [ 1] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(pmix_bfrop_pack+0xec)[0x3fffb0cf94f4]
[c656f6n03:86064] [ 2] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(+0x51d70)[0x3fffb0c51d70]
[c656f6n03:86064] [ 3] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(+0x52480)[0x3fffb0c52480]
[c656f6n03:86064] [ 4] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(+0x55ce4)[0x3fffb0c55ce4]
[c656f6n03:86064] [ 5] /tmp/ompi/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xcd0)[0x3fffb1b1e290]
[c656f6n03:86064] [ 6] /tmp/ompi/lib/openmpi/mca_pmix_pmix2x.so(+0x1163ec)[0x3fffb0d163ec]
[c656f6n03:86064] [ 7] /lib64/libpthread.so.0(+0x8728)[0x3fffb1f28728]
[c656f6n03:86064] [ 8] /lib64/libc.so.6(clone+0x98)[0x3fffb1e5d210]
[c656f6n03:86064] *** End of error message ***
./loop-exec.sh: line 31: 86058 Killed                  $cmd
Loop 30 of 2000. rc=137 r= s=1497371445 e=1497371507 d=62
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX FAILURE

It looks like a race or corruption of the ptl pointer

Program terminated with signal 11, Segmentation fault.
#0  0x00003fffb0d405fc in pmix_ptl_stub_send_recv (peer=0x1001a2636a0, bfr=0x3fffa4002ce0, cbfunc=0x3fffb0c4fb54 <regevents_cbfunc>, 
    cbdata=0x3fffa4000a20) at base/ptl_base_stubs.c:80
80	    return pr->compat.ptl->send_recv(peer, bfr, cbfunc, cbdata);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7.ppc64le
(gdb) bt
#0  0x00003fffb0d405fc in pmix_ptl_stub_send_recv (peer=0x1001a2636a0, bfr=0x3fffa4002ce0, cbfunc=0x3fffb0c4fb54 <regevents_cbfunc>, 
    cbdata=0x3fffa4000a20) at base/ptl_base_stubs.c:80
#1  0x00003fffb0c51d70 in _send_to_server (rcd=0x3fffa4000a20) at event/pmix_event_registration.c:201
#2  0x00003fffb0c52480 in _add_hdlr (cd=0x1001a26b100, xfer=0x3fffb0b6e5b0) at event/pmix_event_registration.c:290
#3  0x00003fffb0c55ce4 in reg_event_hdlr (sd=-1, args=4, cbdata=0x1001a26b100) at event/pmix_event_registration.c:560
#4  0x00003fffb1b1e290 in event_process_active_single_queue (activeq=0x1001a263270, base=0x1001a262dd0) at event.c:1370
#5  event_process_active (base=<optimized out>) at event.c:1440
#6  opal_libevent2022_event_base_loop (base=0x1001a262dd0, flags=<optimized out>) at event.c:1644
#7  0x00003fffb0d163ec in progress_engine (obj=0x1001a262d88) at runtime/pmix_progress_threads.c:109
#8  0x00003fffb1f28728 in start_thread () from /lib64/libpthread.so.0
#9  0x00003fffb1e5d210 in clone () from /lib64/libc.so.6
(gdb) p pr->compat
$1 = {
  psec = 0x3fffb0bc0120 <pmix_native_module>, 
  ptl = 0x80
}
(gdb) p pmix_client_globals.myserver.compat
$2 = {
  psec = 0x3fffb0bc0120 <pmix_native_module>, 
  ptl = 0x80
}

A little context on the three threads from this core file:

(gdb) info threads
  Id   Target Id         Frame 
  3    Thread 0x3fffb17bf1d0 (LWP 86066) 0x00003fffb1e4d5d8 in poll () from /lib64/libc.so.6
  2    Thread 0x3fffb22166c0 (LWP 86064) 0x00003fffb1f2dd60 in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
* 1    Thread 0x3fffb0b6f1d0 (LWP 86068) 0x00003fffb0d405fc in pmix_ptl_stub_send_recv (peer=0x1001a2636a0, bfr=0x3fffa4002ce0, 
    cbfunc=0x3fffb0c4fb54 <regevents_cbfunc>, cbdata=0x3fffa4000a20) at base/ptl_base_stubs.c:80
(gdb) thread 3
[Switching to thread 3 (Thread 0x3fffb17bf1d0 (LWP 86066))]
#0  0x00003fffb1e4d5d8 in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x00003fffb1e4d5d8 in poll () from /lib64/libc.so.6
#1  0x00003fffb1b29888 in poll_dispatch (base=0x1001a236c60, tv=0x3fffb17be6d0) at poll.c:165
#2  0x00003fffb1b1d824 in opal_libevent2022_event_base_loop (base=0x1001a236c60, flags=<optimized out>) at event.c:1630
#3  0x00003fffb1a9a7cc in progress_engine (obj=0x1001a236ba8) at runtime/opal_progress_threads.c:105
#4  0x00003fffb1f28728 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fffb1e5d210 in clone () from /lib64/libc.so.6
(gdb) thread 2
[Switching to thread 2 (Thread 0x3fffb22166c0 (LWP 86064))]
#0  0x00003fffb1f2dd60 in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00003fffb1f2dd60 in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x00003fffb0c27778 in pmix2x_client_init (ilist=0x0) at pmix2x_client.c:138
#2  0x00003fffb17c22e0 in rte_init () at ess_pmi_module.c:127
#3  0x00003fffb1c01680 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:273
#4  0x00003fffb1ff50ec in ompi_mpi_init (argc=1, argv=0x3fffd53d2de8, requested=0, provided=0x3fffd53d2800) at runtime/ompi_mpi_init.c:512
#5  0x00003fffb2056ec0 in PMPI_Init (argc=0x3fffd53d29b0, argv=0x3fffd53d29b8) at pinit.c:66
#6  0x0000000010000cdc in main (argc=1, argv=0x3fffd53d2de8) at hello_c.c:30
(gdb) thread 1
[Switching to thread 1 (Thread 0x3fffb0b6f1d0 (LWP 86068))]
#0  0x00003fffb0d405fc in pmix_ptl_stub_send_recv (peer=0x1001a2636a0, bfr=0x3fffa4002ce0, cbfunc=0x3fffb0c4fb54 <regevents_cbfunc>, 
    cbdata=0x3fffa4000a20) at base/ptl_base_stubs.c:80
80	    return pr->compat.ptl->send_recv(peer, bfr, cbfunc, cbdata);
(gdb) bt
#0  0x00003fffb0d405fc in pmix_ptl_stub_send_recv (peer=0x1001a2636a0, bfr=0x3fffa4002ce0, cbfunc=0x3fffb0c4fb54 <regevents_cbfunc>, 
    cbdata=0x3fffa4000a20) at base/ptl_base_stubs.c:80
#1  0x00003fffb0c51d70 in _send_to_server (rcd=0x3fffa4000a20) at event/pmix_event_registration.c:201
#2  0x00003fffb0c52480 in _add_hdlr (cd=0x1001a26b100, xfer=0x3fffb0b6e5b0) at event/pmix_event_registration.c:290
#3  0x00003fffb0c55ce4 in reg_event_hdlr (sd=-1, args=4, cbdata=0x1001a26b100) at event/pmix_event_registration.c:560
#4  0x00003fffb1b1e290 in event_process_active_single_queue (activeq=0x1001a263270, base=0x1001a262dd0) at event.c:1370
#5  event_process_active (base=<optimized out>) at event.c:1440
#6  opal_libevent2022_event_base_loop (base=0x1001a262dd0, flags=<optimized out>) at event.c:1644
#7  0x00003fffb0d163ec in progress_engine (obj=0x1001a262d88) at runtime/pmix_progress_threads.c:109
#8  0x00003fffb1f28728 in start_thread () from /lib64/libpthread.so.0
#9  0x00003fffb1e5d210 in clone () from /lib64/libc.so.6

@rhc54
Contributor Author

rhc54 commented Jun 13, 2017

You might try setting PMIX_MCA_ptl=tcp in your environment to see if that makes any difference. It will eliminate consideration of the usock component, which should be lower priority and ignored anyway. This would just remove one more possibility.

@jjhursey
Member

Just tried that and got the same result (although it took longer to trigger).

@jjhursey
Member

Progress update.

So far I've isolated the corruption to around this section of code (or around this time) in client/pmix_client.c during PMIx_Init.

If I check the value of pmix_client_globals.myserver->compat.ptl at the top of this section it is set correctly; sometimes when I check it again after this section it is corrupted.

    PMIX_RELEASE_THREAD(&pmix_global_lock);

    /* lood for a debugger attach key */
    (void)strncpy(wildcard.nspace, pmix_globals.myid.nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;
    PMIX_INFO_LOAD(&ginfo, PMIX_IMMEDIATE, NULL, PMIX_BOOL);
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_DEBUG_STOP_IN_INIT, &ginfo, 1, &val)) {
        PMIX_VALUE_FREE(val, 1); // cleanup memory
        /* if the value was found, then we need to wait for debugger attach here */
        /* register for the debugger release notification */
        PMIX_CONSTRUCT_LOCK(&reglock);
        PMIx_Register_event_handler(&code, 1, NULL, 0,
                                    notification_fn, NULL, (void*)&reglock);
        /* wait for it to arrive */
        PMIX_WAIT_THREAD(&reglock);
        PMIX_DESTRUCT_LOCK(&reglock);
    }
    PMIX_INFO_DESTRUCT(&ginfo);

I do not necessarily think that this section of code is the problem. However, I do think that after the release of the pmix_global_lock another thread is active and damaging memory - so there is a race.

I have a number of things to look into yet today to isolate further. But I just wanted to post an update since I know a number of folks are watching this PR.

pt->valcbfunc = cbfunc;
pt->cbdata = cbdata;
pt->lock.active = false;
OPAL_PMIX2X_THREADSHIFT(pt, _fence);
Member

Should this be _get instead of _fence?

@rhc54
Contributor Author

rhc54 commented Jun 14, 2017

Yeah, I found that too - but it turns out that we don't exercise that code path.

OPAL_PMIX_WAIT_THREAD(&pt.lock);
rc = pt.status;
*val = pt.val;
OBJ_DESTRUCT(&pt);
Member

I've seen a crash (just one in many, many runs) where at the very bottom of _get we call:

if (!pt->lock.active) {
    OBJ_RELEASE(pt);
}

At the exact moment as the OBJ_DESTRUCT(&pt) here, both destructors become active at the same time, leading to a crash in free as two threads fight over which one gets to free the memory. Should we move these OBJ_CONSTRUCT/OBJ_DESTRUCT to OBJ_NEW/OBJ_RETAIN/OBJ_RELEASE?
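
For reference, a minimal sketch of the ownership pattern being proposed, written with C11 atomics to stand in for what OPAL's OBJ_NEW/OBJ_RETAIN/OBJ_RELEASE provide; the caddy_t type and function names here are illustrative, not the actual OPAL macros:

#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    atomic_int refcount;
    /* ... payload: lock, status, val, callbacks ... */
} caddy_t;

caddy_t *caddy_new(void)              /* roughly OBJ_NEW */
{
    caddy_t *c = calloc(1, sizeof(*c));
    atomic_init(&c->refcount, 1);
    return c;
}

void caddy_retain(caddy_t *c)         /* roughly OBJ_RETAIN */
{
    atomic_fetch_add(&c->refcount, 1);
}

void caddy_release(caddy_t *c)        /* roughly OBJ_RELEASE */
{
    if (1 == atomic_fetch_sub(&c->refcount, 1)) {
        free(c);                      /* only the last owner frees */
    }
}

/* Usage: the caller does caddy_new() plus caddy_retain() before handing the
 * pointer to the other thread; each side then calls caddy_release() when it
 * is done, and the order in which they finish no longer matters. */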

Contributor Author

just about to push that up now 😄

@rhc54
Contributor Author

rhc54 commented Jun 14, 2017

@ggouaillardet Could you perhaps give us a hand? @jjhursey and I have been banging our head on the memory corruption problem and are having difficulty tracking it down - we could use some of your "master detective" skills 😄

@jjhursey provided a very simple script for triggering it:

#!/bin/bash

loops=2000
#loops=2

TIMEOUT="timeout --preserve-status -k 32 30"
TIMEOUT=""

#timeout --preserve-status -k 32 30 ./opal_fifo
MCA_PARAMS="-host rhc001:24 -mca coll ^hcoll -mca pml ob1 -mca btl tcp,self -mca oob tcp --tag-output"

VALGRIND_OPTS=""
#VALGRIND_OPTS="valgrind --track-origins=yes"

cmd="${TIMEOUT} mpirun -np 2 ${MCA_PARAMS} ${VALGRIND_OPTS} ./hello_c"

#export PMIX_DEBUG=2
export PMIX_MCA_ptl=tcp

i=1
while [ $i -le $loops ]; do
    echo "--------------------------------------------"
    echo "-------------------- Iteration $i of $loops"
    starttime=`date +%s`
    $cmd
    rc=$?
    endtime=`date +%s`

    echo "Loop $i of $loops. rc=$rc r=$ranks s=$starttime e=$endtime d=$((endtime - starttime))"
    if [ $rc != 0 ] ; then
        if [ $rc == 124 ] ; then
            echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX TIMEOUT - skip"
        else
            echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX FAILURE"
        fi
        exit -1
    fi

    i=$((i + 1))
done

exit 0

It takes me a few hundred iterations for it to "explode", but that varies by machine. We run it on a single node. It fails equally well on ppc, linux/x86, and Mac - equal opportunity! The failure occurs in the messaging system, but that may be a symptom and not the root cause location.

Any help would be greatly appreciated!

@jjhursey
Member

Per Ralph's previous comment - the hello_c need only call MPI_Init then MPI_Finalize.

What I have found so far is that inside PMIx_Init around this call site:

if (PMIX_SUCCESS != (rc = pmix_ptl.send_recv(pmix_client_globals.myserver, req, job_data, (void*)&cb))){

if I add code to check the value of pmix_client_globals.myserver->compat.ptl after this if statement - something like this:

// check pmix_client_globals.myserver->compat.ptl  --- all is good
if( (intptr_t)(pmix_client_globals.myserver->compat.ptl) < 0x1000 ) {
   printf("BROKEN!!! ptr=%15p HERE 1\n", pmix_client_globals.myserver->compat.ptl);
}
usleep(100);
// check pmix_client_globals.myserver->compat.ptl --- Corrupted
if( (intptr_t)(pmix_client_globals.myserver->compat.ptl) < 0x1000 ) {
   printf("BROKEN!!! ptr=%15p HERE 2\n", pmix_client_globals.myserver->compat.ptl);
}

So this is telling me that the other thread is activated to do something (trying to pin that down now) which is causing the corruption. My current theory is that the pmix_ptl_base_recv_handler gets triggered here. Note that the pmix_client_globals.myserver->compat.ptl field is not really that special, but it seems to be a consistent point of the corruption.

@rhc54
Contributor Author

rhc54 commented Jun 14, 2017

Hmmm...hold on a minute. Take a gander at the recv_handler function. We allocate space for the message:

                /* allocate the data region */
                peer->recv_msg->data = (char*)malloc(peer->recv_msg->hdr.nbytes);
                memset(peer->recv_msg->data, 0, peer->recv_msg->hdr.nbytes);
                /* point to it */
                peer->recv_msg->rdptr = peer->recv_msg->data;
                peer->recv_msg->rdbytes = peer->recv_msg->hdr.nbytes;

but then look down below where we read it:

        /* continue to read the data block - we start from
         * wherever we left off, which could be at the
         * beginning or somewhere in the message
         */
        if (PMIX_SUCCESS == (rc = read_bytes(peer->sd, &msg->rdptr, &msg->rdbytes))) {
            /* we recvd all of the message */
            pmix_output_verbose(2, pmix_globals.debug_output,
                                "RECVD COMPLETE MESSAGE FROM SERVER OF %d BYTES FOR TAG %d ON PEER SOCKET %d",
                                (int)peer->recv_msg->hdr.nbytes,
                                peer->recv_msg->hdr.tag, peer->sd);
            /* post it for delivery */
            PMIX_ACTIVATE_POST_MSG(peer->recv_msg);
            peer->recv_msg = NULL;
            return;
        } else if (PMIX_ERR_RESOURCE_BUSY == rc ||
                   PMIX_ERR_WOULD_BLOCK == rc) {
            /* exit this event and let the event lib progress */
            return;
        } else {
            /* the remote peer closed the connection - report that condition
             * and let the caller know
             */
            pmix_output_verbose(2, pmix_globals.debug_output,
                                "ptl:base:msg_recv: peer closed connection");
            goto err_close;
        }

In other words, we forgot to point the msg->rdptr at the allocated space! Let me try and see if that fixes it.

@rhc54
Contributor Author

rhc54 commented Jun 14, 2017

Nurts - I see I was wrong as msg = peer->recv_msg, which was set. Back to the drawing board...

@rhc54
Contributor Author

rhc54 commented Jun 15, 2017

But I wonder: isn't the assert a result of the race condition here? The scenario we have is:

  • Thread A calls a function that sets up a shift to another thread (B) and returns
  • Thread A then goes into "wait_thread"
  • Thread B does its thing and executes a callback that then calls "wakeup_thread"

So if B executes the callback before A can get into wait_thread, then the assert is going to hit, yes? I'm wondering if we need to lock the mutex before we call the function to ensure that we are ready for the callback.

My head hurts...
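
For reference, a hedged sketch of the standard condition-variable pattern that tolerates B firing the callback before A reaches the wait; this is plain pthreads with made-up names, not the actual PMIX_WAIT_THREAD/wakeup implementation:

#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            active;    /* true while the operation is outstanding */
} lock_t;

void lock_init(lock_t *lk)
{
    pthread_mutex_init(&lk->mutex, NULL);
    pthread_cond_init(&lk->cond, NULL);
    lk->active = true;
}

/* Thread A: wait for the callback to complete. */
void wait_thread(lock_t *lk)
{
    pthread_mutex_lock(&lk->mutex);
    while (lk->active) {               /* re-check a predicate, don't just wait */
        pthread_cond_wait(&lk->cond, &lk->mutex);
    }
    pthread_mutex_unlock(&lk->mutex);
}

/* Thread B's callback: record completion under the lock, then signal. */
void wakeup_thread(lock_t *lk)
{
    pthread_mutex_lock(&lk->mutex);
    lk->active = false;
    pthread_cond_signal(&lk->cond);
    pthread_mutex_unlock(&lk->mutex);
}

/* Because the state change is recorded under the mutex and the waiter loops
 * on the predicate, it does not matter whether B runs before A reaches
 * wait_thread() - the wakeup cannot be lost. */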

@bosilca
Member

bosilca commented Jun 15, 2017

I did notice you are using trylock in an assert, and this is in general troublesome.

@rhc54
Contributor Author

rhc54 commented Jun 15, 2017

@bosilca I think we could use some advice here - does it otherwise look correct? If so, the assert can be removed. I just want to ensure that we are pursuing this correctly.

@ggouaillardet
Contributor

assert(0 == trylock)
should be replaced with
pmix_mutex_lock
(for debugging purposes, I'd rather have a crash than a deadlock)

assert(0 != trylock)
should be removed
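
A small sketch of the concern and the suggested change, assuming the usual reason a trylock inside assert() is troublesome (the whole expression disappears in NDEBUG builds); plain pthreads, illustrative lock name:

#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void before(void)
{
    /* Problematic: the lock acquisition is a side effect of the assert, so
     * in an NDEBUG (optimized) build the trylock is compiled out and the
     * lock is never taken - behavior silently differs between builds. */
    assert(0 == pthread_mutex_trylock(&lock));
    /* ... critical section ... */
    pthread_mutex_unlock(&lock);
}

void after(void)
{
    /* Suggested: always take the lock; a genuine locking bug then shows up
     * as a hang that can be inspected in a debugger. */
    pthread_mutex_lock(&lock);
    /* ... critical section ... */
    pthread_mutex_unlock(&lock);
}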

@bosilca
Member

bosilca commented Jun 15, 2017

👍 on @ggouaillardet suggestion.

@rhc54
Contributor Author

rhc54 commented Jun 15, 2017

@ggouaillardet @jjhursey Please give this a try on your systems - it works fine on mine!

@rhc54
Contributor Author

rhc54 commented Jun 15, 2017

Looks like the PRBC died for an unrelated issue - somehow lost connection.

bot:ompi:retest

@rhc54
Contributor Author

rhc54 commented Jun 15, 2017

Just ran this with an optimized build, and ran clean thru the test again! So we might actually have this fixed now. Will wait to hear from @jjhursey as their system was the more sensitive one.

@rhc54
Contributor Author

rhc54 commented Jun 15, 2017

Much sadness - I reran it again as a debug build, and this time it failed:

-------------------- Iteration 369 of 2000
[1,0]<stderr>:[rhc001:21478] *** Process received signal ***
[1,0]<stderr>:[rhc001:21478] Signal: Segmentation fault (11)
[1,0]<stderr>:[rhc001:21478] Signal code: Address not mapped (1)
[1,0]<stderr>:[rhc001:21478] Failing at address: 0x90
[1,0]<stderr>:[rhc001:21478] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x7f054632c370]
[1,0]<stderr>:[rhc001:21478] [ 1] [1,0]<stderr>:/home/common/openmpi/build/foobar/lib/libpmix.so.0(pmix_ptl_stub_send_recv+0x2b)[0x7f054249619a]
[1,0]<stderr>:[rhc001:21478] [ 2] /home/common/openmpi/build/foobar/lib/libpmix.so.0(+0x2456d)[0x7f05423c956d]
[1,0]<stderr>:[rhc001:21478] [ 3] /home/common/openmpi/build/foobar/lib/libpmix.so.0(+0x24bf4)[0x7f05423c9bf4]
[1,0]<stderr>:[rhc001:21478] [ 4] /home/common/openmpi/build/foobar/lib/libpmix.so.0(+0x2871f)[0x7f05423cd71f]
[1,0]<stderr>:[rhc001:21478] [ 5] [1,0]<stderr>:/home/common/openmpi/build/foobar/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xe38)[0x7f05459a2878]
[1,0]<stderr>:[rhc001:21478] [ 6] /home/common/openmpi/build/foobar/lib/libpmix.so.0(+0xd18a0)[0x7f05424768a0]
[1,0]<stderr>:[rhc001:21478] [ 7] /usr/lib64/libpthread.so.0(+0x7dc5)[0x7f0546324dc5]
[1,0]<stderr>:[rhc001:21478] [ 8] [1,0]<stderr>:/usr/lib64/libc.so.6(clone+0x6d)[0x7f054605373d]
[1,0]<stderr>:[rhc001:21478] *** End of error message ***
^C^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate

^CLoop 369 of 2000. rc=1 r= s=1497553461 e=1497553546 d=85
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX FAILURE

@rhc54
Contributor Author

rhc54 commented Jun 16, 2017

I'm trying to run the test thru valgrind, but so far no luck - I'm on iteration 1673 and it hasn't failed. I guess valgrind is slowing things down enough that the corruption doesn't occur.

Would it make sense to try things like electric fence? Anyone have a suggestion?

@ggouaillardet
Contributor

I noted pmix_obj_update() is not thread safe (it should use PMIX_THREAD_ADD32()).

Neither are output() and friends from src/util/output.c (the global variable temp_str and friends are neither mutex protected nor per-thread private).
I still observe some crashes with --mca pmix_base_verbose 5, and I am tempted to blame them on pmix_output_verbose() and friends.
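
For illustration, a sketch of the refcount-update race, using C11 atomics to stand in for PMIX_THREAD_ADD32(); the function names are made up:

#include <stdatomic.h>

/* Not thread safe: two threads can read the same old value and one of the
 * updates is lost. */
int obj_update_plain(int *refcount, int inc)
{
    *refcount += inc;
    return *refcount;
}

/* Thread safe: the read-modify-write is a single atomic operation. */
int obj_update_atomic(atomic_int *refcount, int inc)
{
    return atomic_fetch_add(refcount, inc) + inc;
}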

@rhc54
Contributor Author

rhc54 commented Jun 16, 2017

These should be fixed here: openpmix/openpmix#401

After pondering some more, I'm convinced that we should move this to the PMIx area where we can get more control over the test program. We are trying to "peer" thru the lens of OMPI's restrictions (e.g., we must return certain data from get, must register event handlers, etc). I'm going to modify @jjhursey's program to run a test over there and see if I can replicate the problem.

@rhc54
Contributor Author

rhc54 commented Jun 17, 2017

Please see openpmix/openpmix#403 as we track the PMIx-only tests on this issue.

@rhc54
Contributor Author

rhc54 commented Jun 17, 2017

I've done some more digging, and with the latest PMIx updates I am now seeing a segfault in the OPAL pmix layer. It stems from a thread-shift caddy being overwritten by an opal_value_t.

The top-level call that starts the problem is in orte/mca/ess/pmi/ess_pmi_module.c, line 330:

        OPAL_MODEX_RECV_VALUE_OPTIONAL(ret, OPAL_PMIX_LOCALITY_STRING,
                                       ORTE_PROC_MY_NAME, &val, OPAL_STRING);

The "val" is a pointer to an opal_value_t - the macro will return the pointer to an allocated opal_value_t in the val address. If we look at what all the threads in the app proc are doing:

(gdb) thread apply all bt

Thread 3 (Thread 0x7f90cbb2c740 (LWP 65211)):
#0  0x00007f90cb3706d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007f90c7727759 in pmix2x_get (proc=0x7f90caf9d2e8 <orte_process_info+104>, key=0x7f90c834d040 "pmix.locstr", info=0x7ffdb8b23160, val=0x7ffdb8b23238) at pmix2x_client.c:628
#2  0x00007f90c834bb80 in rte_init () at ess_pmi_module.c:329
#3  0x00007f90cacaec07 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:273
#4  0x00007f90cb5dd511 in ompi_mpi_init (argc=1, argv=0x7ffdb8b239f8, requested=0, provided=0x7ffdb8b237ac) at runtime/ompi_mpi_init.c:512
#5  0x00007f90cb61c16e in PMPI_Init (argc=0x7ffdb8b237dc, argv=0x7ffdb8b237d0) at pinit.c:66
#6  0x0000000000400bd9 in main (argc=1, argv=0x7ffdb8b239f8) at hello_c.c:28

Thread 2 (Thread 0x7f90c6977700 (LWP 65284)):
#0  0x00007f90cb09bd13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007f90ca9e6918 in epoll_dispatch (base=0x1fcce00, tv=<optimized out>) at epoll.c:407
#2  0x00007f90ca9e9bd6 in opal_libevent2022_event_base_loop (base=0x1fcce00, flags=1) at event.c:1630
#3  0x00007f90c74bc8bf in progress_engine (obj=0x1fccd58) at runtime/pmix_progress_threads.c:109
#4  0x00007f90cb36cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007f90cb09b73d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7f90c8346700 (LWP 65213)):
#0  0x00007f90cafd91d7 in raise () from /usr/lib64/libc.so.6
#1  0x00007f90cafda8c8 in abort () from /usr/lib64/libc.so.6
#2  0x00007f90cafd2146 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x00007f90cafd21f2 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00007f90ca97decf in opal_list_item_destruct (item=0x1fda710) at class/opal_list.c:69
#5  0x00007f90c7722381 in opal_obj_run_destructors (object=0x1fda710) at ../../../../opal/class/opal_object.h:462
#6  0x00007f90c772748e in _get (sd=-1, args=4, cbdata=0x1fda710) at pmix2x_client.c:588
#7  0x00007f90ca9ea878 in event_process_active_single_queue (activeq=0x1f8a430, base=0x1f9fb10) at event.c:1370
#8  event_process_active (base=<optimized out>) at event.c:1440
#9  opal_libevent2022_event_base_loop (base=0x1f9fb10, flags=1) at event.c:1644
#10 0x00007f90ca98c9f9 in progress_engine (obj=0x1f8a2a8) at runtime/opal_progress_threads.c:105
#11 0x00007f90cb36cdc5 in start_thread () from /usr/lib64/libpthread.so.0
#12 0x00007f90cb09b73d in clone () from /usr/lib64/libc.so.6

we see that thread 3, the main thread, is waiting in the PMIx_Get inside the macro, and the PMIx progress thread is quiescent. These are both correct as the value is located in the PMIx dstore.

The returned value has been thread-shifted out of the PMIx thread and into the OPAL progress thread so it can access the OPAL pmix framework-level data (e.g., to convert the nspace to a local jobid). The return void* cbdata is expected to be the pmix2x_threadshift_t object, which is to be released at the end of the callback.

However, as the bt shows, the object incorrectly has an opal_list_item_t class in its "super" location instead of the expected opal_object_t. Printing out the threadshift object shows:

588	        OBJ_RELEASE(pt);
(gdb) print *pt
$1 = {super = {obj_magic_id = 0, obj_class = 0x7f90cac883a0 <opal_value_t_class>, obj_reference_count = 0, cls_init_file_name = 0x7f90c834cc60 "ess_pmi_module.c", cls_init_lineno = 330}, 
  ev = {ev_active_next = {tqe_next = 0x7ffdb8b23188, tqe_prev = 0x7ffdb8b23188}, ev_next = {tqe_next = 0x100000001, tqe_prev = 0x7ffdb8b23160}, ev_timeout_pos = {
      ev_next_with_common_timeout = {tqe_next = 0x1fda6f0, tqe_prev = 0x100000002}, min_heap_idx = 33400560}, ev_fd = 1, ev_base = 0x0, _ev = {ev_io = {ev_io_next = {tqe_next = 0x0, 
          tqe_prev = 0x0}, ev_timeout = {tv_sec = 0, tv_usec = 0}}, ev_signal = {ev_signal_next = {tqe_next = 0x0, tqe_prev = 0x0}, ev_ncalls = 0, ev_pncalls = 0x0}}, ev_events = 0, 
    ev_res = 0, ev_flags = 0, ev_pri = 0 '\000', ev_closure = 0 '\000', ev_timeout = {tv_sec = 0, tv_usec = 0}, ev_callback = 0x0, ev_arg = 0x0}, lock = {mutex = {super = {obj_magic_id = 0, 
        obj_class = 0x0, obj_reference_count = 0, cls_init_file_name = 0x261 <Address 0x261 out of bounds>, cls_init_lineno = -558907667}, m_lock_pthread = {__data = {__lock = -946600160, 
          __count = 32656, __owner = 1, __nusers = 0, __kind = -948716262, __spins = 32656, __list = {__prev = 0x26c, __next = 0x0}}, 
        __size = " \a\224ǐ\177\000\000\001\000\000\000\000\000\000\000\032\275sǐ\177\000\000l\002", '\000' <repeats 13 times>, __align = 140259800385312}, m_lock_debug = 33072176, 
      m_lock_file = 0x0, m_lock_line = 0, m_lock_atomic = {u = {lock = 0, sparc_lock = 0 '\000', padding = "\000\000\000"}}}, cond = {__data = {__lock = -1, __futex = 2, 
        __total_seq = 18446744073709551615, __wakeup_seq = 4294967295, __woken_seq = 33159952, __mutex = 0x1fda7e8, __nwaiters = 0, __broadcast_seq = 1}, 
      __size = "\377\377\377\377\002\000\000\000", '\377' <repeats 12 times>, "\000\000\000\000\020\373\371\001\000\000\000\000\350\247\375\001\000\000\000\000\000\000\000\000\001\000\000", 
      __align = 12884901887}, active = false}, msg = 0x0, strings = 0x8800040004 <Address 0x8800040004 out of bounds>, id = 33400352, status = 0, pname = {jobid = 0, vpid = 3346164881}, 
  jobid = 32656, source = 0x1fda7e0, range = 3736059629, nondefault = 237, handler = 140259854156192, val = 0x7f9000000001, event_codes = 0x7f90c773adce, info = 0x5db, results = {super = {
      obj_magic_id = 0, obj_class = 0x100000000, obj_reference_count = 2, cls_init_file_name = 0x0, cls_init_lineno = 0}, opal_list_sentinel = {super = {obj_magic_id = 16046253922460893184, 
        obj_class = 0x0, obj_reference_count = 0, cls_init_file_name = 0x100000000 <Address 0x100000000 out of bounds>, cls_init_lineno = 1}, opal_list_next = 0x0, opal_list_prev = 0x0, 
      item_free = 33401016, opal_list_item_refcount = 0, opal_list_item_belong_to = 0x2}, opal_list_length = 1}, evhandler = 0x7f90c834d040, cbfunc = 0x0, opcbfunc = 0x0, pmixcbfunc = 0x0, 
  valcbfunc = 0x7f90c7726976 <getcbfunc>, lkcbfunc = 0x7f90caf9d2e8 <orte_process_info+104>, cbdata = 0x101fda710}

So what has happened here is that the "val" originally passed in the macro has been overlaid on the pmix2x_threadshift_t object. My best guess is that the cbdata pointers are getting confused as we weave our way thru all these thread-shifts.
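
For illustration, a minimal made-up example of this failure class - two layers share a void* cbdata, and a callback that blindly casts it to its own caddy type scribbles over whatever object it was actually handed:

#include <stdio.h>

typedef struct {                 /* what the callback expects as cbdata */
    int magic;
    void (*release)(void *);
} caddy_t;

typedef struct {                 /* what it may actually be handed */
    int  type;
    long data;
} value_t;

void completion_cb(void *cbdata)
{
    caddy_t *c = (caddy_t *)cbdata;   /* blind cast: nothing checks the type */
    c->magic = 0;                     /* overwrites whatever is really there */
}

int main(void)
{
    value_t val = { .type = 42, .data = 7 };
    completion_cb(&val);              /* wrong object passed as cbdata */
    printf("val.type is now %d (was 42)\n", val.type);
    return 0;
}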

@rhc54
Contributor Author

rhc54 commented Jun 19, 2017

@jjhursey @ggouaillardet I believe we now have this fixed! See what you think

@rhc54
Contributor Author

rhc54 commented Jun 19, 2017

Rats - spoke too soon. Issues in comm_spawn, working them now.

@rhc54
Contributor Author

rhc54 commented Jun 20, 2017

Appear to have the event handler problems resolved - still checking on some of the dynamics. Got comm_spawn fixed at least.

@rhc54
Contributor Author

rhc54 commented Jun 20, 2017

FWIW: Mellanox is aware of the problem - there is an issue with GitHub rejecting their attempt to set the test result to "passed". It is being investigated.

…d support in the opal/pmix framework to protect the framework-level structures.

This now passes the loop test, and so we believe it resolves the random hangs in finalize.

Changes in PMIx master that are included here:

* Fixed a bug in the PMIx_Get logic
* Fixed self-notification procedure
* Made pmix_output functions thread safe
* Fixed a number of thread safety issues
* Updated configury to use 'uname -n' when hostname is unavailable

Work on cleaning up the event handler thread safety problem
Rarely used functions, but protect them anyway
Fix the last part of the intercomm problem
Ensure we don't cover any PMIx calls with the framework-level lock.
Protect against NULL argv comm_spawn

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@jjhursey
Member

I'm running the branch now. Looking good so far. I'm letting it run a bit longer over lunch. I'll let you know when it's done.

@rhc54
Contributor Author

rhc54 commented Jun 20, 2017

MTT results:

| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                                          |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+
| MPI Install | my installation | 4.0.0a1     | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-4.0.0a1-my_installation.html |
| Test Build  | trivial         | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-4.0.0a1-my_installation.html          |
| Test Build  | ibm             | 4.0.0a1     | 00:45    | 1    |      |          |      | Test_Build-ibm-my_installation-4.0.0a1-my_installation.html              |
| Test Build  | intel           | 4.0.0a1     | 01:14    | 1    |      |          |      | Test_Build-intel-my_installation-4.0.0a1-my_installation.html            |
| Test Build  | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Build-java-my_installation-4.0.0a1-my_installation.html             |
| Test Build  | orte            | 4.0.0a1     | 00:00    | 1    |      |          |      | Test_Build-orte-my_installation-4.0.0a1-my_installation.html             |
| Test Run    | trivial         | 4.0.0a1     | 00:06    | 8    |      |          |      | Test_Run-trivial-my_installation-4.0.0a1-my_installation.html            |
| Test Run    | ibm             | 4.0.0a1     | 10:57    | 505  |      | 1        |      | Test_Run-ibm-my_installation-4.0.0a1-my_installation.html                |
| Test Run    | spawn           | 4.0.0a1     | 00:13    | 13   |      |          | 1    | Test_Run-spawn-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | loopspawn       | 4.0.0a1     | 10:07    | 1    |      |          |      | Test_Run-loopspawn-my_installation-4.0.0a1-my_installation.html          |
| Test Run    | intel           | 4.0.0a1     | 16:10    | 474  |      |          | 4    | Test_Run-intel-my_installation-4.0.0a1-my_installation.html              |
| Test Run    | intel_skip      | 4.0.0a1     | 12:06    | 431  |      |          | 47   | Test_Run-intel_skip-my_installation-4.0.0a1-my_installation.html         |
| Test Run    | java            | 4.0.0a1     | 00:01    | 1    |      |          |      | Test_Run-java-my_installation-4.0.0a1-my_installation.html               |
| Test Run    | orte            | 4.0.0a1     | 00:44    | 19   |      |          |      | Test_Run-orte-my_installation-4.0.0a1-my_installation.html               |
+-------------+-----------------+-------------+----------+------+------+----------+------+--------------------------------------------------------------------------+


    Total Tests:    1459
    Total Failures: 1
    Total Passed:   1452
    Total Duration: 3145 secs. (52:25)

@jjhursey
Member

This passed clean for me at 10K iterations on powerpc. I think this is good to go. Thanks @rhc54 !

@rhc54
Contributor Author

rhc54 commented Jun 20, 2017

very good - thanks!

@rhc54 rhc54 merged commit 814e858 into open-mpi:master Jun 20, 2017
@rhc54 rhc54 deleted the topic/pmix3 branch June 20, 2017 17:38
@jjhursey
Member

@rhc54 Let me know when the PR to v3.0.x is ready and I'll check it out and retest there.
