osc/pt2pt fails with multiple threads #2614

Closed
jjhursey opened this issue Dec 20, 2016 · 18 comments

@jjhursey
Member

jjhursey commented Dec 20, 2016

Testing with the osc/pt2pt component revealed multiple hangs and wrong answers when running with two threads. Each thread works with its own communicator (a copy of MPI_COMM_WORLD) and its own private windows.

The test is here:

% mpicc -o x mt_1sided.c mt_1sided_td1.c mt_1sided_td2.c
% mpirun -host hostA,hostB -mca osc pt2pt -mca pml ob1 -mca btl openib,self,vader ./x

PR #2630 will need to be reverted once a fix for this issue is committed on the v2.0.x branch.

@jjhursey jjhursey added this to the v2.0.3 milestone Dec 20, 2016
hppritcha added a commit to hppritcha/ompi that referenced this issue Dec 22, 2016
As a workaround for issue open-mpi#2614 for the v2.0.2 release,
do not allow for selection of the OSC PT2PT when
creating an MPI RMA window.  Print a hopefully helpful
message and return a not-supported error.

This PR should be reverted once a fix for open-mpi#2614
is in place.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@hjelmn
Member

hjelmn commented Jan 7, 2017

I can't get this to hang with osc/rdma. @markalle Was the 1sided.c test hanging with osc/rdma or just osc/pt2pt?

hjelmn added a commit to hjelmn/ompi that referenced this issue Jan 7, 2017
This commit fixes a bug in the timer check. When -fPIC is used we need
to save/restore ebx. The code copied from patcher was meant for 32-bit
systems and did not work correctly on 64-bit systems. This commit
updates the save/restore to use rbx instead of ebx.

Fixes open-mpi#2614

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn
Member

hjelmn commented Jan 7, 2017

Oops. Wrong bug :). Deleting those.

@hjelmn
Member

hjelmn commented Jan 9, 2017

Ok, I can reproduce the issue with osc/pt2pt. Looks like something is still not right with PSCW. Taking a look now.

@hjelmn
Member

hjelmn commented Jan 9, 2017

Found the bug. It's an artifact of the original design. I haven't had the time to move the counters into the sync object. Think I have a workaround. Testing it now.
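
For readers following along, here is a minimal sketch of why counters kept outside the sync object can cause trouble once two synchronization epochs are active at the same time (for example, one per thread). This is not the osc/pt2pt code: the types and function names are hypothetical, and the actual failure in this issue may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration; not the osc/pt2pt structures. */

/* Design A: one counter shared by every active epoch. */
typedef struct { int outstanding; } shared_counters_t;

/* Design B: each sync object carries its own counter. */
typedef struct { int outstanding; } sync_object_t;

/* Scenario used by both functions: epoch 1 starts two operations,
 * epoch 2 starts one, and only epoch 2's operation completes.
 * Each function returns whether epoch 2 is reported as finished. */

bool epoch2_done_shared(void)
{
    shared_counters_t c = { 0 };
    c.outstanding += 2;          /* epoch 1 issues two operations */
    c.outstanding += 1;          /* epoch 2 issues one operation  */
    assert(c.outstanding > 0);
    c.outstanding -= 1;          /* epoch 2's operation completes */
    /* Epoch 1's pending work is still counted, so epoch 2 looks unfinished. */
    return c.outstanding == 0;
}

bool epoch2_done_per_sync(void)
{
    sync_object_t e1 = { 0 }, e2 = { 0 };
    e1.outstanding += 2;         /* epoch 1 issues two operations */
    e2.outstanding += 1;         /* epoch 2 issues one operation  */
    assert(e2.outstanding > 0);
    e2.outstanding -= 1;         /* epoch 2's operation completes */
    (void) e1;                   /* epoch 1 is still pending, but separate */
    return e2.outstanding == 0;
}
```

In this sketch, `epoch2_done_shared()` returns false even though epoch 2 has nothing left to wait for, which is the shape of a hang. `epoch2_done_per_sync()` returns true because each epoch only looks at its own counter.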

@hjelmn
Member

hjelmn commented Jan 10, 2017

Ok, definitely fixed. Running it through the tests a couple more times to shake out any remaining bugs. Will have a PR open for master, v2.x, and v2.0.3.

@hjelmn
Member

hjelmn commented Jan 11, 2017

Still have one stubborn bug holding up the fix. It now hangs about 10% of the time. Will hopefully have this resolved today.

@hppritcha
Member

@hjelmn can this issue be closed?

@hppritcha hppritcha modified the milestones: v2.0.3, v2.0.4 Jun 1, 2017
@markalle
Contributor

I had been meaning to re-evaluate this one for a while. I built a vanilla OMPI v2.0.x with --enable-mpi-thread-multiple, and my results were:

mpirun -mca osc rdma -host hostA:2 ... : passed
mpirun -mca osc rdma -host hostA:4 ... : passed
mpirun -mca osc rdma -host hostA:1,hostB:1 ... : segv quickly
mpirun -mca osc pt2pt ... : message that "OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release"

@hppritcha
Member

@gpaulsen will check to see if we have more data on this issue. @hjelmn says he'll work on it.

@hjelmn
Member

hjelmn commented Sep 14, 2017

Finally found the time to track this down. I see what is going wrong in osc/pt2pt. I should have a fix ready for testing tomorrow.

@hppritcha hppritcha modified the milestones: v2.0.4, v2.1.3 Sep 19, 2017
@hjelmn
Member

hjelmn commented Sep 20, 2017

Damn, this is a nasty bug. It's getting a lot further, but now I am running into another issue. Will keep cranking away at it until I know what the root cause is. Will start a parallel effort to ensure osc/rdma passes next month.

@gpaulsen
Member

@hjelmn, is this issue specific to osc_pt2pt, or is it a problem with osc progress / multithreading in general that is only tickled by osc_pt2pt?

@hppritcha hppritcha modified the milestones: v2.1.3, v2.1.4 Mar 15, 2018
@jsquyres
Member

Per discussion with @hjelmn and @hppritcha on 26 Mar 2018:

  • v2.0.x already disables osc/pt2pt in THREAD_MULTIPLE scenarios (because of this issue)
  • v2.x does not disable itself in THREAD_MULTIPLE scenarios
    • It looks like we thought this would get fixed in time for v2.1.x. But it didn't.
  • At this point, it doesn't seem worth fixing this issue in the v2.x series.
  • @hjelmn said he will fix in the v3.0.x series.
    • So I'll re-target this issue for v3.0.x.
  • I will PR cherry-pick d0ffd66 to the v2.x branch (i.e., disable osc/pt2pt in THREAD_MULTIPLE scenarios)
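
As a rough illustration of the kind of guard being discussed (refuse osc/pt2pt when THREAD_MULTIPLE is active, print a message, return a not-supported error), here is a minimal sketch. It is not the code from d0ffd66: the function name and status codes are hypothetical, and the real component selection logic in Open MPI looks different. The message text is the one reported earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes; Open MPI uses its own constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_SUPPORTED = -1 };

/* Hypothetical helper: decide whether a one-sided component may be
 * selected when a new RMA window is created. */
int sketch_osc_select(const char *component, bool thread_multiple)
{
    assert(component != NULL);

    /* Refuse pt2pt only when the application runs with THREAD_MULTIPLE. */
    if (thread_multiple && 0 == strcmp(component, "pt2pt")) {
        fprintf(stderr,
                "OSC pt2pt component does not support "
                "MPI_THREAD_MULTIPLE in this release\n");
        return SKETCH_ERR_NOT_SUPPORTED;
    }

    return SKETCH_SUCCESS;
}
```

With this sketch, `sketch_osc_select("pt2pt", true)` returns the not-supported code, while single-threaded use of pt2pt and any use of another component such as rdma are still accepted.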

@jsquyres jsquyres modified the milestones: v2.1.4, v3.0.2 Mar 26, 2018
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Mar 26, 2018
As a workaround for issue open-mpi#2614 for the v2.0.2 release,
do not allow for selection of the OSC PT2PT when
creating an MPI RMA window.  Print a hopefully helpful
message and return a not-supported error.

This PR should be reverted once a fix for open-mpi#2614
is in place.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit d0ffd66)

The original commit message is shown above.  Followup: as of this
writing (26 Mar 2018), we do not plan to fix this issue for the v2.0.x
or v2.x.  Hence, the osc/pt2pt component will continue to disable
itself in THREAD_MULTIPLE scenarios for the life of all v2.x series.
It is possible (likely?) that this will be fixed in a v3.0.x release
(where x>1).
@jsquyres
Member

jsquyres commented May 29, 2018

Per discussion 2018-05-29: @hjelmn says that a better solution would be to get everything to support osc/rdma. E.g., get vader, TCP, and the upcoming OFI BTLs to support RDMA, which then works with osc/rdma.

So here's the way forward:

  • Disable osc/pt2pt in 3.0.x and 3.1.x (just like we did in all of 2.x)
  • Add put/get support in vader, and (upcoming) OFI BTL
  • Eventually delete osc/pt2pt (perhaps after v4.0.x series? Would be good to keep it around just until all the new osc/rdma support in the BTLs solidifies/matures)

jsquyres pushed a commit to jsquyres/ompi that referenced this issue May 29, 2018
As a workaround for issue open-mpi#2614 for the v2.0.2 release,
do not allow for selection of the OSC PT2PT when
creating an MPI RMA window.  Print a hopefully helpful
message and return a not-supported error.

This PR should be reverted once a fix for open-mpi#2614
is in place.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit d0ffd66)
jsquyres pushed a commit to jsquyres/ompi that referenced this issue May 29, 2018
Per discussion at
open-mpi#2614 (comment),
do not allow for selection of the OSC PT2PT when creating an MPI RMA
window when THREAD_MULTIPLE is active.  Print a helpful message and
return a not-supported error.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit d0ffd66)
@matcabral
Contributor

Eventually delete osc/pt2pt (perhaps after v4.0.x series? Would be good to keep it around just until all the new osc/rdma support in the BTLs solidifies/matures)

Currently the psm and psm2 MTLs use osc/pt2pt, with its limitations understood. There is a workaround using the OFI path, but it is still a workaround.

@jsquyres
Member

@matcabral I think @hjelmn's plan is to make all the BTLs support put/get, and then all transports can use osc/rdma -- therefore the need for osc/pt2pt can go away. That would be the purpose of the OFI BTL -- just for one-sided. Make sense?

@matcabral
Contributor

@jsquyres, yes, thanks for the clarification.

@hjelmn
Member

hjelmn commented May 29, 2018

I expect all transports that do not currently work with osc/rdma will see a performance improvement. How large will depend on the AMO and RDMA implementations.

jsquyres pushed a commit to jsquyres/ompi that referenced this issue May 29, 2018
Per discussion at
open-mpi#2614 (comment),
do not allow for selection of the OSC PT2PT when creating an MPI RMA
window when THREAD_MULTIPLE is active.  Print a helpful message and
return a not-supported error.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit d0ffd66)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 5b7c866)
bwbarrett pushed a commit that referenced this issue May 30, 2018
Per discussion at
#2614 (comment),
do not allow for selection of the OSC PT2PT when creating an MPI RMA
window when THREAD_MULTIPLE is active.  Print a helpful message and
return a not-supported error.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit d0ffd66)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 5b7c866)