comm/cid: use ibcast to distribute result in intercomm case #2061

Merged
merged 1 commit into open-mpi:master on Sep 7, 2016

Conversation

hjelmn
Member

@hjelmn hjelmn commented Sep 7, 2016

This commit updates the intercomm allgather to do a local comm bcast
as the final step. This should resolve a hang seen in intercomm
tests.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>

@hjelmn
Member Author

hjelmn commented Sep 7, 2016

@ggouaillardet This should resolve the hang. Can you verify?

@ggouaillardet
Contributor

Thanks, will try.

Btw, why do we iallreduce?
Could we ireduce with root=0 instead?

@ggouaillardet
Contributor

I think I understand better now ...
An ibcast on the intercomm basically cannot work since it is not bidirectional, hence the iallgatherv, but with scount=1 on the roots and scount=0 elsewhere, and rcounts[0]=1 and rcounts[i>0]=0 everywhere.
These settings were lost during the refactor.
Anyway, that was overkill, since an ibcast on the local comm is just what is needed.
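For illustration, here is a minimal sketch of the parameter pattern described above (not the removed code itself; the helper name, `intercomm`, and `remote_size` are placeholders):

```c
#include <mpi.h>

/* Sketch only: an iallgatherv over the inter-communicator in which only the
 * group leaders (local rank 0) contribute one element and every process keeps
 * just the remote leader's slot.  All names here are illustrative. */
static void leader_only_iallgatherv(const int *sendval, int *recvbuf,
                                    int remote_size, int local_rank,
                                    int *rcounts, int *disps,
                                    MPI_Comm intercomm, MPI_Request *req)
{
    int scount = (0 == local_rank) ? 1 : 0;   /* scount=1 on roots, 0 elsewhere */

    for (int i = 0; i < remote_size; ++i) {
        rcounts[i] = (0 == i) ? 1 : 0;        /* rcounts[0]=1, rcounts[i>0]=0 */
        disps[i]   = 0;
    }

    MPI_Iallgatherv(sendval, scount, MPI_INT,
                    recvbuf, rcounts, disps, MPI_INT, intercomm, req);
}
```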

I am confident ireduce can be used instead of iallreduce, so we only need to allocate tmpbuf on the roots.
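A sketch of that suggestion, assuming the reduce runs on the local (intra) communicator with the leader at rank 0; the MPI_MAX op and the function name are placeholders, and the temporary buffer is only allocated where the result lands:

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch only: nonblocking reduce to the local leader (rank 0), allocating
 * tmpbuf only on the root.  The recvbuf argument is ignored on non-root
 * ranks; the caller must wait on *req before reading tmpbuf. */
static int *ireduce_to_leader(const int *inbuf, int count,
                              MPI_Comm local_comm, MPI_Request *req)
{
    int rank;
    int *tmpbuf = NULL;

    MPI_Comm_rank(local_comm, &rank);
    if (0 == rank) {
        tmpbuf = malloc(count * sizeof(int));   /* result needed only on root */
    }

    MPI_Ireduce(inbuf, tmpbuf, count, MPI_INT, MPI_MAX, 0, local_comm, req);
    return tmpbuf;
}
```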

Also, the allreduce_fn functions take a count parameter, but they are only ever invoked with count=1, so we might want to simplify that too.

I will test all these tomorrow

@hjelmn
Member Author

hjelmn commented Sep 7, 2016

@ggouaillardet I agree. I think the count was used at some point in the past. I agree about the reduce vs allreduce. Will make that change.

@bosilca
Member

bosilca commented Sep 7, 2016

I don't think that using bcast on an inter-communicator is equivalent to using allgatherv. The result of this operation is that only one group will get the data.

@hjelmn
Member Author

hjelmn commented Sep 7, 2016

@bosilca We first do a reduce on the intercommunicator. This gives the leaders (rank 0 in each comm) the result from the other group. Then we exchange the info between the leaders and do the reduction. At this point both leaders have the same result. They then bcast to their local groups.
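For readers following along, a hedged blocking sketch of this pattern (placeholder names, not the actual comm_cid.c code, and MPI_MAX stands in for whichever operation is being agreed on):

```c
#include <mpi.h>

/* Hypothetical helper illustrating the agreement pattern described above.
 * `local_comm` is the intra-communicator kept alongside the inter-communicator
 * `intercomm`; both names are placeholders, not Open MPI internals. */
static int agree_max(int my_val, MPI_Comm local_comm, MPI_Comm intercomm)
{
    int local_rank, local_result = 0, remote_result = 0, agreed = 0;

    MPI_Comm_rank(local_comm, &local_rank);

    /* Step 1: reduce each group's contribution to its local leader (rank 0). */
    MPI_Reduce(&my_val, &local_result, 1, MPI_INT, MPI_MAX, 0, local_comm);

    if (0 == local_rank) {
        /* Step 2: the leaders exchange partial results over the
         * inter-communicator; the peer is rank 0 of the remote group. */
        MPI_Sendrecv(&local_result, 1, MPI_INT, 0, 0,
                     &remote_result, 1, MPI_INT, 0, 0,
                     intercomm, MPI_STATUS_IGNORE);
        /* Step 3: both leaders apply the same operation, so they agree. */
        agreed = local_result > remote_result ? local_result : remote_result;
    }

    /* Step 4: each leader broadcasts the agreed value within its own group. */
    MPI_Bcast(&agreed, 1, MPI_INT, 0, local_comm);
    return agreed;
}
```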

@hjelmn
Member Author

hjelmn commented Sep 7, 2016

@bosilca Just to be clear the ibcast is on the intra-communicator we keep with every inter-communicator.

@bosilca
Member

bosilca commented Sep 7, 2016

@hjelmn the bcast on an intercommunicator is not symmetric. Only one of the leaders (the process providing MPI_ROOT as the root) will broadcast its data to the remote group.
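For reference, a minimal sketch of the root semantics being described (the assumption that the root is rank 0 of its group is only for illustration):

```c
#include <mpi.h>

/* Sketch only: broadcast on an INTER-communicator.  Only the group that
 * contains the root sends; the other group receives. */
static void intercomm_bcast(int *buf, int in_root_group, int local_rank,
                            MPI_Comm intercomm)
{
    if (in_root_group) {
        /* The actual root passes MPI_ROOT; every other process in the root's
         * group passes MPI_PROC_NULL. */
        int root = (0 == local_rank) ? MPI_ROOT : MPI_PROC_NULL;
        MPI_Bcast(buf, 1, MPI_INT, root, intercomm);
    } else {
        /* The remote group passes the root's rank in the other group. */
        MPI_Bcast(buf, 1, MPI_INT, 0, intercomm);
    }
}
```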

@bosilca
Member

bosilca commented Sep 7, 2016

If the ibcast is done on an intra-communicator, then the result should be correct.

@bosilca
Member

bosilca commented Sep 7, 2016

If we have an intra-communicator, then why all the pain of dealing with the reduce on an inter-communicator?

@hjelmn
Member Author

hjelmn commented Sep 7, 2016

@bosilca Not sure why the code was so complicated before. Once we added the c_local_comm we should have updated the algorithm to do what this PR makes it do. This is much less complicated than we made it out to be :).

@hjelmn
Member Author

hjelmn commented Sep 7, 2016

@ggouaillardet Never mind about not using reduce. I forgot that I had already changed the initial reduce to be on the intra-communicator. Made the change from allreduce to reduce.

This commit updates the intercomm allgather to do a local comm bcast
as the final step. This should resolve a hang seen in intercomm
tests.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
@hjelmn
Member Author

hjelmn commented Sep 7, 2016

Might go ahead and merge this before 9pm EDT to get it into MTT tonight.

@hjelmn hjelmn merged commit 63d73a5 into open-mpi:master Sep 7, 2016
@ggouaillardet
Contributor

It works great for me, thanks!

@rhc54
Contributor

rhc54 commented Sep 8, 2016

I don't see any change in MTT, though - still a ton of hangs. However, the tests that are now hanging are comm_dup.

ggouaillardet added a commit that referenced this pull request Sep 9, 2016
use MPI_MIN instead of MPI_MAX when appropriate, otherwise
a currently used CID can be reused, and bad things will likely happen.

Refs #2061
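A hedged illustration of the reasoning in that commit (the flag and its meaning are assumptions for the example, not the actual comm_cid.c variables): when each process contributes a flag that is 1 only if a candidate CID is free locally, MPI_MIN accepts the CID only when it is free everywhere, whereas MPI_MAX would accept it as soon as any one process reports it free.

```c
#include <mpi.h>

/* Illustration only: agree that a candidate CID is unused on ALL ranks.
 * `free_locally` is a hypothetical flag: 1 = free on this rank, 0 = in use. */
static int cid_free_everywhere(int free_locally, MPI_Comm comm)
{
    int free_everywhere = 0;

    /* MPI_MIN yields 1 only if every rank reported 1; MPI_MAX here could
     * return 1 while the CID is still in use somewhere, letting a live CID
     * be reused. */
    MPI_Allreduce(&free_locally, &free_everywhere, 1, MPI_INT, MPI_MIN, comm);
    return free_everywhere;
}
```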
@ggouaillardet
Contributor

3b968ec fixes MPI_Comm_dup_c.
(That was a separate issue; if it did not occur earlier, it is probably because it was hidden by the bug fixed in this PR.)
I am now running some other tests from the intel_tests suite to see if I hit some more bugs.

@rhc54
Contributor

rhc54 commented Sep 9, 2016

Just wondering - are these some of the same problems we are seeing in 2.x? I don't know if these fixes need to go over or not.

@ggouaillardet
Contributor

These two issues came from the refactor, which was not brought to v2.x.
I'll check v2.x now; maybe something was backported that should not have been ...

@ggouaillardet
Contributor

Several hangs occur with the *_stride* collective tests from the ibm test suite. Strictly speaking, these are not hangs, since the tests complete in less than 10 minutes, but they take more than 3 minutes, so MTT considers them hangs.

That should be fixed by open-mpi/ompi-tests@53ad6f22e1f9ceda580a06bf8e00324a5f15c6c0
(I simply #ifdef'ed out sanity checks that are very time consuming since they access strided data.)

alltoallvw_zero from the ibm test suite (and maybe some more) will be fixed when open-mpi/ompi-release#1368 is merged.

bosilca pushed a commit to bosilca/ompi that referenced this pull request Sep 15, 2016
use MPI_MIN instead of MPI_MAX when appropriate, otherwise
a currently used CID can be reused, and bad things will likely happen.

Refs open-mpi#2061
hjelmn pushed a commit to hjelmn/ompi that referenced this pull request Oct 12, 2016
use MPI_MIN instead of MPI_MAX when appropriate, otherwise
a currently used CID can be reused, and bad things will likely happen.

Refs open-mpi#2061

(cherry picked from commit 3b968ec)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn pushed a commit to hjelmn/ompi that referenced this pull request Nov 7, 2016
use MPI_MIN instead of MPI_MAX when appropriate, otherwise
a currently used CID can be reused, and bad things will likely happen.

Refs open-mpi#2061

(cherry picked from commit 3b968ec)
hjelmn pushed a commit to hjelmn/ompi that referenced this pull request Nov 7, 2016
use MPI_MIN instead of MPI_MAX when appropriate, otherwise
a currently used CID can be reused, and bad things will likely happen.

Refs open-mpi#2061

(cherry picked from commit 3b968ec)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>