
Improve MPI_Comm_split_type scalability #1873

Merged
merged 2 commits into open-mpi:master on Jul 19, 2016

Conversation


@hjelmn hjelmn commented Jul 13, 2016

This is built on top of the refactor work in #1855.


hjelmn commented Jul 13, 2016

@bosilca The idea behind this change is that most programs are not going to either reorder ranks or drop ranks (with MPI_UNDEFINED). We can optimize this most common case and leave out the global allgather operation.


hjelmn commented Jul 13, 2016

[benchmark plot: comm_split_type, average time per call, old vs. new algorithm]

There is a slight slowdown for small communicators when reordering (dropping ranks should behave similarly), but overall the performance looks much better than with the old algorithm.

This benchmark runs MPI_Comm_split_type in a loop 1000 times and takes the average time per call. The lines w/out reordering all pass MPI_COMM_TYPE_SHARED for split_type and 0 for key. The lines w/ reordering all pass MPI_COMM_TYPE_SHARED for split_type and comm_size - rank for key (reversing the rank ordering).

@hjelmn hjelmn added this to the v2.1.0 milestone Jul 13, 2016
@hjelmn hjelmn self-assigned this Jul 13, 2016

rhc54 commented Jul 17, 2016

I'm not sure why there is concern over the performance of comm_split, but @hjelmn did raise a question with me about the time lost in looking up hostnames when that operation is tagged as "optional". Looking into it, the time appears to be spent performing several hash table lookups because we don't have general rules about naming conventions.

Specifically, we could improve this algorithm by:

  • requiring that any lookup request for job-level info (i.e., info that is not directly tied to a specific rank) be done by specifying a rank of PMIX_RANK_WILDCARD or PMIX_RANK_INVALID. This would let us eliminate 1/3 of the time.
  • requiring that any modex info use a non-PMIx namespace key - i.e., that user-provided data cannot have a key starting with "pmix". This would eliminate an additional 1/3 of the time.

I cannot see a way to eliminate more than this, but the combination should reduce the time by nearly 70%, and I don't think these rules would be particularly onerous. Is this worth pursuing?
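The two proposed rules amount to making the routing decision computable from the request itself, instead of probing several hash tables in turn. A minimal sketch of the resulting dispatch, where the sentinel values and key names are hypothetical stand-ins for the real PMIx constants:

```c
#include <string.h>

/* Hypothetical stand-ins for the PMIx sentinel ranks (the real constants
 * are PMIX_RANK_WILDCARD and PMIX_RANK_INVALID). */
#define RANK_WILDCARD (-1)
#define RANK_INVALID  (-2)

/* Rule 1: a job-level lookup must carry a sentinel rank, so it can be
 * routed straight to the job-info table without probing per-rank tables. */
static int is_job_level_lookup(int rank)
{
    return rank == RANK_WILDCARD || rank == RANK_INVALID;
}

/* Rule 2: user-provided (modex) keys must not start with "pmix", so the
 * key prefix alone decides between the internal and the modex table. */
static int is_pmix_internal_key(const char *key)
{
    return strncmp(key, "pmix", 4) == 0;
}
```

With both rules in place, each lookup touches exactly one table, which is where the estimated 2/3 reduction comes from.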


hjelmn commented Jul 18, 2016

This is comm_split_type, so it is a little more important to optimize. At a minimum, it is not a great idea to do an allgather on a large communicator when each rank only needs to know the result for a small subset of ranks. BTW, to do a fair comparison of old vs. new I removed the modex receive from the old algorithm as well. On master, no ompi_proc_t means not node-local.


rhc54 commented Jul 18, 2016

Okay, I'll propose adding those requirements to the PMIx specification - if nobody objects, I can update the algorithm.

This commit simplifies the communicator context ID generation by
removing the blocking code. The high level calls: ompi_comm_nextcid
and ompi_comm_activate remain but now call the non-blocking variants
and wait on the resulting request. This was done to remove the
parallel paths for context ID generation in preparation for further
improvements of the CID generation code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
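The pattern this commit describes — blocking entry points becoming thin wrappers that start the non-blocking variant and then wait on its request — can be sketched with hypothetical stand-ins. Here `fake_request_t`, `nextcid_nb`, and `progress` are illustrations of the shape of the change, not the actual OMPI types or functions:

```c
/* Hypothetical stand-in for an OMPI request: completes after a fixed
 * number of progress steps. */
typedef struct {
    int pending;
    int cid;
} fake_request_t;

/* Stand-in for the non-blocking variant (ompi_comm_nextcid_nb in the PR):
 * it only starts the operation and hands back a request. */
static void nextcid_nb(fake_request_t *req)
{
    req->pending = 3;   /* pretend three progress steps remain        */
    req->cid = 42;      /* the context ID the operation will produce  */
}

/* One progress step; returns nonzero once the request has completed. */
static int progress(fake_request_t *req)
{
    if (req->pending > 0)
        --req->pending;
    return req->pending == 0;
}

/* The blocking call is now just: start the non-blocking variant, then
 * drive the progress engine until the request completes. */
static int nextcid(void)
{
    fake_request_t req;
    nextcid_nb(&req);
    while (!progress(&req))
        ;
    return req.cid;
}
```

The benefit is a single code path: all future CID-generation improvements only need to touch the non-blocking implementation.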
This commit introduces a new algorithm for MPI_Comm_split_type. The
old algorithm performed an allgather on the communicator to decide
which processes were part of the new communicators. This does not
scale well in either time or memory.

The new algorithm performs a couple of allreductions to determine the
global parameters of the MPI_Comm_split_type call. If any rank gives
an inconsistent split_type (as defined by the standard) an error is
returned without proceeding further. The algorithm then creates a
communicator with all the ranks that match the split_type (no
communication required) in the same order as the original
communicator. It then does an allgather on the new communicator (which
should be much smaller) to determine 1) if the new communicator is in
the correct order, and 2) if any ranks in the new communicator
supplied MPI_UNDEFINED as the split_type. If either of these
conditions are detected the new communicator is split using
ompi_comm_split and the intermediate communicator is freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
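The fallback decision at the end of the new algorithm can be sketched as a pure function over the small allgather's results. This is a sketch only; `UNDEFINED_SENTINEL` is a stand-in for MPI_UNDEFINED (whose real value is implementation-defined and comes from mpi.h):

```c
/* Hypothetical sentinel standing in for MPI_UNDEFINED. */
#define UNDEFINED_SENTINEL (-1)

/* After the allgather of (key, split_type) pairs on the candidate
 * communicator, the expensive ompi_comm_split fallback is needed only if
 * the keys are not already non-decreasing (ranks must be reordered) or
 * some member passed MPI_UNDEFINED (ranks must be dropped). Otherwise the
 * candidate communicator is kept as-is: the fast path. */
static int needs_split_fallback(const int *keys, const int *split_types, int n)
{
    for (int i = 0; i < n; ++i) {
        if (split_types[i] == UNDEFINED_SENTINEL)
            return 1;                      /* a rank must be dropped  */
        if (i > 0 && keys[i] < keys[i - 1])
            return 1;                      /* ranks must be reordered */
    }
    return 0;                              /* fast path: keep comm    */
}
```

In the common case (key 0 everywhere, no MPI_UNDEFINED) this returns 0 and the intermediate communicator is used directly, which is why the global allgather could be dropped.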

rhc54 commented Jul 18, 2016

Working its way thru the PMIx approval process: openpmix/openpmix#114


bosilca commented Jul 19, 2016

@hjelmn nice catch, this patch indeed improves the case where no processes are using MPI_UNDEFINED. 👍

@hjelmn hjelmn merged commit 40f71f2 into open-mpi:master Jul 19, 2016