
v2.x communicator code updates #2215

Merged 13 commits, Oct 31, 2016

Conversation

hjelmn (Member) commented Oct 12, 2016

This PR contains the following:

  • Cleanup of the CID code to only have non-blocking. This reduces the overhead of trying to maintain multiple CID paths. (Cleanup)
  • Optimization for CID generation on intercomm.
  • Optimization for MPI_Comm_split_type.

The last "feature" is probably the most important part of the PR. Before this PR, MPI_Comm_split_type performed an MPI_Allgather on the full communicator. That scales very poorly and is a performance bug in Open MPI. As such, I see this PR as a performance bug fix, not necessarily a new feature. If you disagree, then postpone this to 2.2.0.
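To make the scaling concern concrete, here is a rough back-of-the-envelope memory model. The function names, the 8-byte entry size, and the 4-int allreduce payload are illustrative assumptions, not Open MPI's actual accounting:

```python
# Rough cost model (an illustration, not Open MPI's real bookkeeping):
# the old MPI_Comm_split_type gathered one entry per rank of the parent
# communicator on EVERY rank, so the per-rank buffer grows with the job
# size P.  The new path pays a constant-size allreduce on the parent
# plus an allgather only over the S ranks that share the split
# (e.g. one node).

def old_split_type_mem(p, entry_bytes=8):
    """Per-rank buffer for an allgather over the full parent comm."""
    return p * entry_bytes

def new_split_type_mem(s, entry_bytes=8, allreduce_ints=4):
    """Constant allreduce payload plus an allgather over the subgroup."""
    return allreduce_ints * 4 + s * entry_bytes

# At 16384 ranks with 32 ranks per node, the full-comm gather buffer is
# hundreds of times larger than the subgroup one.
print(old_split_type_mem(16384))  # 131072 bytes per rank
print(new_split_type_mem(32))     # 272 bytes per rank
```

The point is only that the old cost grows linearly with the parent communicator while the new cost grows with the (usually far smaller) subgroup.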

bosilca and others added 13 commits October 12, 2016 11:48
(cherry picked from commit 7397276)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit simplifies the communicator context ID generation by
removing the blocking code. The high level calls: ompi_comm_nextcid
and ompi_comm_activate remain but now call the non-blocking variants
and wait on the resulting request. This was done to remove the
parallel paths for context ID generation in preparation for further
improvements of the CID generation code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
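The pattern the commit describes, keeping the blocking entry points but implementing them as thin wrappers over the non-blocking variants, can be sketched like this. The `Request` class and the `comm_nextcid*` names below are illustrative stand-ins, not the Open MPI C functions:

```python
# Toy model of "blocking call = start non-blocking variant + wait".
# Keeping a single non-blocking code path removes the duplicated
# blocking implementation the commit message refers to.

class Request:
    """Minimal stand-in for a completion request."""
    def __init__(self, result):
        self._result = result
        self.completed = False

    def wait(self):
        # In the real library this would progress the engine until the
        # request completes; here completion is immediate.
        self.completed = True
        return self._result

def comm_nextcid_nb(proposed_cid):
    """Non-blocking CID allocation: returns a request immediately."""
    return Request(proposed_cid)

def comm_nextcid(proposed_cid):
    """Blocking wrapper: one code path, no parallel blocking variant."""
    req = comm_nextcid_nb(proposed_cid)
    return req.wait()

print(comm_nextcid(42))  # 42
```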

(cherry picked from commit 035c2e2)

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit introduces a new algorithm for MPI_Comm_split_type. The
old algorithm performed an allgather on the communicator to decide
which processes were part of the new communicators. This does not
scale well in either time or memory.

The new algorithm performs a couple of all reductions to determine the
global parameters of the MPI_Comm_split_type call. If any rank gives
an inconsistent split_type (as defined by the standard) an error is
returned without proceeding further. The algorithm then creates a
communicator with all the ranks that match the split_type (no
communication required) in the same order as the original
communicator. It then does an allgather on the new communicator (which
should be much smaller) to determine 1) if the new communicator is in
the correct order, and 2) if any ranks in the new communicator
supplied MPI_UNDEFINED as the split_type. If either of these
conditions are detected the new communicator is split using
ompi_comm_split and the intermediate communicator is freed.
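The control flow described above can be modeled in a few lines. This is a pure-Python sketch of the decision logic, not the Open MPI implementation; `split_type_model` and the simplifying assumption that every rank's placement matches the split are mine:

```python
# Model of the new MPI_Comm_split_type algorithm.  entries[r] is the
# (split_type, key) pair passed by parent rank r.  For simplicity we
# assume all ranks' placement matches, so the intermediate communicator
# initially contains everyone.

UNDEFINED = None  # stands in for MPI_UNDEFINED

def split_type_model(entries):
    # Step 1: "allreduce" to find the global split_type and reject
    # inconsistent arguments (the standard allows one split_type plus
    # MPI_UNDEFINED).
    kinds = {t for t, _ in entries if t is not UNDEFINED}
    if len(kinds) > 1:
        raise ValueError("inconsistent split_type")
    if not kinds:
        return []  # every rank passed MPI_UNDEFINED

    # Step 2: intermediate communicator built without communication,
    # in parent-comm order (UNDEFINED ranks are still members here).
    members = list(range(len(entries)))

    # Step 3: small allgather on the intermediate comm reveals the keys
    # and any MPI_UNDEFINED members.
    keys = [entries[r][1] for r in members]
    dropped = any(entries[r][0] is UNDEFINED for r in members)
    unordered = keys != sorted(keys)

    # Step 4: only if needed, fall back to an ompi_comm_split-style
    # reorder/drop on the (smaller) intermediate communicator.
    if dropped or unordered:
        members = [r for r in members if entries[r][0] is not UNDEFINED]
        members.sort(key=lambda r: (entries[r][1], r))
    return members

# Rank 2 opts out; ranks reorder by key, ties broken by parent rank.
print(split_type_model([("shared", 0), ("shared", 2),
                        (UNDEFINED, 1), ("shared", 1)]))  # [0, 3, 1]
```

When every rank participates with monotonically increasing keys, step 4 is skipped entirely, which is the cheap common case.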

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 4c49c42)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 36a9063)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Back-ported from 01a653d

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
…_allreduce_intra_pmix_nb()

(cherry picked from commit bbc6d4b)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit ba77d9b)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
That was causing CUDA collectives to crash.

(cherry picked from commit 61e900e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit should restore the pre-non-blocking behavior of the CID
allocator when threads are used. There are two primary changes: 1)
do not hold the cid allocator lock past the end of a request callback,
and 2) if a lower id communicator is detected during CID allocation
back off and let the lower id communicator finish before continuing.
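The second change, backing off in favor of the lower-id communicator, amounts to serving concurrent allocations in ascending communicator-id order. A toy model of that ordering rule (illustrative only, not the actual lock and request machinery):

```python
# Deadlock-avoidance sketch: when several threads race to allocate CIDs
# for different communicators, the allocation for the communicator with
# the lowest id finishes first and the others effectively retry later.
# Granting in a globally agreed order means every process resolves the
# races the same way.

def allocate_cids(pending_comm_ids, free_cids):
    """Grant one CID per pending communicator, lowest comm id first."""
    granted = {}
    for comm_id in sorted(pending_comm_ids):   # lower id goes first
        granted[comm_id] = free_cids.pop(0)    # others "back off" here
    return granted

print(allocate_cids([7, 3, 5], [100, 101, 102]))
# {3: 100, 5: 101, 7: 102}
```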

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit fbbf743)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the intercomm allgather to do a local comm bcast
as the final step. This should resolve a hang seen in intercomm
tests.
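The shape this describes, gather to each side's local leader, an exchange between leaders, then a broadcast over the local communicator as the final step, can be sketched as a data-flow model. This is not the Open MPI code; the function and its single-process data representation are illustrative:

```python
# Data-flow model of an inter-communicator allgather finished with a
# local-comm bcast.  Each side's ranks contribute one value; every rank
# must end up with the full remote buffer.

def intercomm_allgather(local_values, remote_values):
    # 1) gather each side's contributions to its local leader (rank 0)
    gathered_local = list(local_values)
    gathered_remote = list(remote_values)
    # 2) the two local leaders exchange the gathered buffers
    recv_on_local, recv_on_remote = gathered_remote, gathered_local
    # 3) final step: each leader bcasts the received buffer over its
    #    LOCAL communicator, so all local ranks complete together
    return ([recv_on_local] * len(local_values),
            [recv_on_remote] * len(remote_values))

a, b = intercomm_allgather([1, 2], [3, 4, 5])
print(a)  # [[3, 4, 5], [3, 4, 5]]: every local rank has the remote data
```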

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
(cherry picked from commit 54cc829)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
use MPI_MIN instead of MPI_MAX when appropriate, otherwise
a currently used CID can be reused, and bad things will likely happen.
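An illustrative model of why the operator choice matters: if each rank contributes a 0/1 flag saying whether a proposed CID is free locally, the CID is safe only when it is free on every rank, which is MPI_MIN (a logical AND) over the flags. MPI_MAX would report the CID as free as soon as any single rank has it free. This is a simplification of the CID agreement protocol, not the exact reduction Open MPI performs:

```python
# 0/1 availability flags, one per rank, for a proposed CID.

def cid_available_min(flags):
    """MPI_MIN semantics: free only if free on EVERY rank (correct)."""
    return min(flags)

def cid_available_max(flags):
    """MPI_MAX semantics: free if free on ANY rank (WRONG here)."""
    return max(flags)

flags = [1, 0, 1]  # rank 1 is still using this CID
print(cid_available_min(flags))  # 0: correctly rejected
print(cid_available_max(flags))  # 1: would wrongly reuse a live CID
```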

Refs open-mpi#2061

(cherry picked from commit 3b968ec)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 803897a)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 6c6e35b)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hjelmn (Member, Author) commented Oct 12, 2016

@jsquyres FYI. Please let me know whether this meets the requirements of a bug fix. This was originally intended for 2.1.0 when the target was December. The code has been soaking on master for a while and is probably ready to go now.

bosilca (Member) commented Oct 12, 2016

@hjelmn I just checked the MPI standard and it is illegal to supply different values for split_type (page 247, line 45), with the exception of MPI_UNDEFINED. Thus, I wonder whether we really need the validity check. Second, I understand this operation as a different form of MPI_Comm_split, where the color is globally defined based on prior local knowledge. In other words, since each process has information about the entire process placement and architecture, it can decide its local color from that. Once the color is defined, it can simply call MPI_Comm_split.

hjelmn (Member, Author) commented Oct 12, 2016

@bosilca Processes that supply MPI_UNDEFINED still need to know what the split type is or we will hang.

MPI_Comm_split_type is indeed just a special case of MPI_Comm_split but the restrictions allow us to do some optimization. The algorithm I implemented does the following:

  • Form the local and remote groups based on information about the process placement.
  • Use the above groups to create a new intermediary communicator.
  • Perform an allgather on the (hopefully much smaller) intermediate communicator if either 1) the procs may need to be reordered, or 2) any procs supplied MPI_UNDEFINED.
  • If reordering or dropping ranks is needed, MPI_Comm_split is run on the intermediate communicator to do the dirty work.

bosilca (Member) commented Oct 12, 2016

I see. So instead of the MPI_Comm_split_type allgather, you assume that a 4-int reduction, followed by a communicator creation and a smaller allgather, will lead to better results. Do you have any pointer to what the improvement is?

hjelmn (Member, Author) commented Oct 12, 2016

There is a graph on #1873 that shows the improvement on an XC40 on up to 2048 ranks.

See https://cloud.githubusercontent.com/assets/1226817/16821220/5658435c-4912-11e6-8e9c-bde7e8639711.png

jsquyres (Member) commented

bot:lanl:retest

bosilca (Member) commented Oct 17, 2016

The code looks good. 👍

jsquyres (Member) commented

@bosilca We're using the GitHub reviews these days -- the 👍 is no longer enough. 😄

bosilca (Member) commented Oct 17, 2016

For some reason I don't have the review at the top on this ticket (but I did on Gilles's PR).

jsquyres added this to the v2.2.0 milestone Oct 17, 2016
jsquyres removed this from the v2.1.0 milestone Oct 17, 2016
hppritcha (Member) commented Oct 17, 2016

This is way too big a code change this late in the 2.1.0 release cycle. If the release is delayed considerably, we'll think about merging this in.

jsquyres modified the milestones: v2.1.0, v2.2.0 Oct 31, 2016
jsquyres (Member) commented

After discussion with @hppritcha, I moved the milestone back to v2.1.0.

I have also confirmed that this fixes COMM_SPAWN (i.e., #2234).

It would be nice if we could have a smaller version of this for v2.0.x -- e.g., could we leave out 91337bf?

jsquyres merged commit 87a79fa into open-mpi:v2.x Oct 31, 2016
hjelmn (Member, Author) commented Oct 31, 2016

@jsquyres Sure. I don't see why it can't be done. Will take a look tomorrow.

jsquyres (Member) commented Nov 3, 2016

@hjelmn Any progress on making a smaller PR for v2.0.2?

6 participants