Big MPI---large-count and displacement support--collective chapter #80
Comments
We are going to read this in Barcelona. Just this base ticket, not all its relatives that were spawned on June 14 (97, 98, 99, 100). We will bring those forward later. Tickets #98, #99, and #100 are all important and no more controversial than this ticket (#80), while #97 remains highly controversial. Also, the latest text for Ticket #80 is here: mpi32-report-ticket80-04sep2018.pdf [reductions]. (Note there is other work that we need to consider under s-collectives and v-collectives, but it is not part of this pull request.) |
There are some small errors in argument lists: |
Thank you — I will apply corrections
Anthony Skjellum, PhD
205-807-4968
… On Sep 5, 2018, at 8:58 AM, Hubert Ritzdorf ***@***.***> wrote:
There are some small errors in argument lists:
Page 164, Line 20: add INTENT(IN) ::
INTEGER(KIND = MPI_COUNT_KIND) sendcounts(*), recvcount -> INTEGER(KIND = MPI_COUNT_KIND), INTENT(IN) :: sendcounts(*), recvcount
Line 21: add ::
INTEGER(KIND = MPI_ADDRESS_KIND), INTENT(IN) displs(*) -> INTEGER(KIND = MPI_ADDRESS_KIND), INTENT(IN) :: displs(*)
Line 30: add a space between ) and DISPLS
INTEGER(KIND = MPI_ADDRESS_KIND)DISPLS(*) -> INTEGER(KIND = MPI_ADDRESS_KIND) DISPLS(*)
Page 207, Line 16: remove root
INTEGER(KIND = MPI_COUNT_KIND), INTENT(IN) :: count, root -> INTEGER(KIND = MPI_COUNT_KIND), INTENT(IN) :: count
Page 213, Lines 19, 20, 30: same corrections as Page 164
|
I am having trouble with Git so this is delaying publishing a new version; not sure why we are no longer seeing those repos. |
I apologize for Git issues. Nobody has ever tried to contribute to the large-count effort before so I was not aware that I was the only person who could write to the repo. Everyone in the GitHub group now has write access. Anyone who wants to contribute just needs to request access to that group. |
At the Barcelona WG meeting, @jdinan suggested that everyone in the HPC world is moving to a 64-bit ABI (ILP64); that would make integers 64-bit. See https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models . |
Note that the topology chapter is not covered by this version of the ticket nor by the proposed reading material. A separate ticket will be made for that so this can proceed. If there is an objection at the reading of this ticket that it does not address the topology chapter, we will point to the second ticket. |
Rolf notes that MPI_Alltoallw is inconsistent in its definition because it has byte displacements, yet they are defined as int, not MPI_Aint. Therefore, the new API must account for this inconsistency and should handle it via MPI_Aint for displacements; that is in fact what is currently proposed in the pull request as written. So, we need a ticket for an "Advice to Users" note. |
Per Rolf, there are two kinds of displacements: index displacements within an array (declared as int), and byte displacements (declared as MPI_Aint). For index displacements within an array, all arithmetic (for example, count = disp2 - disp1) is done with the normal built-in plus and/or minus operators. Byte displacements can always be used as relative displacements to the beginning of a buffer, or they can be used as absolute displacements (relative to MPI_BOTTOM); thus, they must always be MPI_Aint. Additionally, the difference of two relative displacements should always be calculated with MPI_Aint_diff(), not with the arithmetic minus (-) operator. The same applies to MPI_Aint_add() for the summation of an absolute address plus a relative displacement. Therefore:
It is necessary that the size of the integer representing MPI_Count >= the size of the integer representing MPI_Aint. This rule is already in the standard. [See p. 17 of the MPI-3.1 standard, Section 2.5.8 Counts, Lines 15-19.] In MPI-3.0, we already have MPI_GET_EXTENT_X, which uses MPI_Count, so MPI_Count is not new. What we are recommending is to change the text of this proposal as follows: we will not put MPI_Aint on all displacements. We will put MPI_Aint on displacements involving bytes; we will put MPI_Count on displacements that are of index type. |
@jdinan may not be aware, but his employer tried that a little over ten years ago.😜
|
The inconsistency Rolf notes in MPI_Alltoallw is also a significant point in the BigMPI paper, which people should read to understand the proposals Tony is reading.
|
The key outcome of the reading is the plan for a holistic look across the entire API; a voting strategy followed by a final vote on the entire API addition was discussed and accepted as Forum-compliant (by acclamation / without objection). There were no specific objections to the API as currently presented in this ticket, ticket #105. It was pointed out that we still have more tickets to write and implement besides those already open for "Big MPI." We have to look at the entire standard end-to-end. The current goal is to read "all" Big MPI tickets in December. |
Note also the creation of issue #107 and, in particular, the consequential question of whether we should actually replace MPI_COUNT with size_t in all C bindings and replace MPI_AINT with ptrdiff_t in all C bindings (with similar appropriate changes towards using language-specified types for the Fortran bindings).
Assertion: using the naturally-sized types specified in the C language would achieve the goal of all the Big MPI issues for the C bindings. The short-term consequence (huge one-off churn affecting most APIs) is identical.
Question: are there similar appropriate types specified in the Fortran language?
Observation: the datatype naming rule proposed in issue #74 (if accepted) will permit the addition of MPI datatypes for size_t and ptrdiff_t (plus Fortran equivalents, if any) without further changes to the MPI Standard.
Corollary: issues #107 and #109 become moot.
@jdinan had a good reason to keep the MPI-namespaced types but I have completely forgotten it. @jdinan: please could you comment?
Do we want MPI to continue to move in the direction of a DSL for communication or return to its roots of a library for communication? Note, IMHO, the concept of this/these proposal(s) is essential (cope with big machines); only the presentation style in the API is being debated. If we cannot find a technical reason to choose between language-specified and MPI-defined types, then we need the Architecture Review Board to reconvene and decide by fiat. |
If Fortran doesn’t have those options, what should we do?
Anthony Skjellum, PhD
205-807-4968
… On Sep 23, 2018, at 3:36 PM, Dan Holmes ***@***.***> wrote:
Note also the creation of issue #107 and, in particular, the consequential question of whether we should actually replace MPI_COUNT with size_t in all C bindings and replace MPI_AINT with ptrdiff_t in all C bindings (with similar appropriate changes towards using language-specified types for the Fortran bindings).
Assertion: using the naturally-sized types specified in the C language would achieve the goal of all the Big MPI issues for the C bindings. The short-term consequence (huge one-off churn affecting most APIs) is identical.
Question: are there similar appropriate types specified in the Fortran language?
Observation: the datatype naming rule proposed in issue #74 (if accepted) will permit the addition of MPI datatypes for size_t and ptrdiff_t (plus Fortran equivalents, if any) without further changes to the MPI Standard.
Corollary: issues #107 and #109 become moot.
Corollary: MPI_AINT_ADD and MPI_AINT_DIFF become superfluous.
@jdinan had a good reason to keep the MPI-namespaced types but I have completely forgotten it. @jdinan: please could you comment?
Do we want MPI to continue to move in the direction of a DSL for communication or return to its roots of a library for communication?
Note, IMHO, the concept of this/these proposal(s) is essential (cope with big machines); only the presentation style in the API is being debated. If we cannot find a technical reason not to choose between language-specified and MPI-defined types, then we need the Architecture Review Board to reconvene and expurgate via a fiat.
|
@tonyskjellum deprecate Fortran? <end_troll_mode> That possibly constitutes a technical reason not to choose language-specific types, at least for the Fortran bindings. |
@dholmes-epcc-ed-ac-uk Please remember that if we change As far as I can tell, this didn't happen with POSIX when those APIs switched from |
@dholmes-epcc-ed-ac-uk Fortran does not have unsigned integers, so it is rather hard to support |
@jeffhammond You have C99 so you can say |
@mhoemmen What does C99 have to do with Fortran not supporting unsigned types? In any case, the MPI standard does not require C99, although it supports |
@mhoemmen Assuming https://stackoverflow.com/a/1089204/2189128 is reliable, ISO C recommends that |
What I mean is that switching from |
@mhoemmen We are never going to replace |
ah ok never mind then :) |
Having the vector arguments be typed with MPI_COUNT or MPI_AINT does not help with ABI portability with respect to using size_t or ptrdiff_t instead. Both sets of types are of a fixed length on a particular machine but could be different between machines. If I write code that assumes the size of any of these, it will break when that size changes.
For the avoidance of doubt: I say above that the consequences to the API of using size_t are identical to using MPI_COUNT because the proposal is to churn the API in exactly the same manner. Specifically, if it is decided that we will have two symbols, the existing function signature and one with "_X" appended, then the "_X" variant will have the new type(s), whichever set of types that ends up being. Users can continue to compile against the existing symbols with their existing code and variable declarations. If and only if they wish to switch do they have to verify they are using suitably sized variables and arrays.
If the MPI Forum decides to fork MPI (seriously discussed as an option at the Sept 2018 meeting, straw poll 16,2,0 in favour), then MPI-4.0 may change the types in the existing API function definitions without changing their symbol names, which breaks backward compatibility. This option imposes a burden on the MPI Forum and on MPI library writers to continue support for a line of MPI-3.x releases that contain existing MPI-3.1 interfaces plus minor fixes and updates cherry-picked from the MPI-4 fork. |
@dholmes-epcc-ed-ac-uk Sorry, I misread your comment and thought you were suggesting replacing If we are going to fork the standard, I suggest that we use |
@jeffhammond I like that. So, we are suggesting that part of the C binding as defined in MPI-4k (pronounced MPI-fork) should be:
That allows humans and compilers alike to see the equivalence and use whichever they are more comfortable with. The Fortran binding can do whatever seems appropriate for that language (probably these will remain "opaque" types). Issue #107 becomes moot. Issue #109 is not, in fact it should be expanded to include F2C and C2F conversion functions or a promise of automatic representation conversion during heterogenous MPI communication. |
@dholmes-epcc-ed-ac-uk We need to stop talking about forks. Python forking was/is a disaster for users and maintainers of dependent projects. MPI-4 needs to be one standard with two well-defined ABIs. |
Agree. The word "fork" has connotations of splitting and becoming two entirely different things. Even though I'm not there at the meeting, I get the sense that that's not what the Forum is talking about here. @dholmes-epcc-ed-ac-uk I appreciate the pun "MPI-4K" =~ "MPI Fork", but I think it sends the wrong message. |
I am not entirely sure I agree. There is a discussion about breaking backward compatibility going forward and providing only a 64-bit-clean interface. How this is done is under discussion. BTW, I was there for the discussion. I will vote no on any attempt at adding additional _x symbols unless we plan to fork afterward. |
I suspect The Register is already writing a salacious article about the forking of MPI that will terrify users and cause them to rewrite their apps in Spark. |
Dan et al, correct me if I am wrong, but I interpreted the 16-2 straw poll to allow breaking backward compatibility to imply this kind of thinking, not a wholesale change:
* New proposals, if backward incompatible, do not arrive DOA (or get struck down immediately) simply for the lack of backward compatibility.
Tony
|
To clarify for those not present at the meeting, the discussion prior to the 16-2 in favour straw-poll covered a number of possible API changes related to how we should express the Big MPI adjustments (and others). There was a general (and strong) feeling that creating "_X" versions in MPI-4 only to be faced later with the necessity of creating "_Y" versions in future for some other API change was a really bad idea.
The straw-poll itself immediately followed a suggestion that MPI-4 should define two APIs, possibly to be expressed via two header files in C (and, I guess, two modules in Fortran), for example, "mpi3.h" and "mpi4.h". The straw-poll question was carefully worded to extract maximum support, something like "given the dislike for the _X mess, could you countenance supporting a proposal that breaks backwards compatibility, for example, in this way?" with the other option being "I will never support anything that is not backwards compatible under any circumstances".
Despite heavily biasing the question, I was not expecting the strength of support for such a radical idea. Perhaps "fork" is the wrong word. However, Python was mentioned as a cautionary tale during the discussion and before the straw-poll. Others present can correct me if I am mis-remembering or over-editorialising. |
Latest update (Chapter 5 and change log) |
@tonyskjellum / @puribangalore - Is this issue replaced by #137? Can we close this? |
Yes, we can close this issue.
|
Problem
Sending more than 2Gi elements in MPI is a pain.
The general strategy for implementing large-count operations is to use datatypes. In some cases, this is straightforward, but it appears to be a very poor solution in the case of v-collectives and reductions. In order to use the datatype solution for v-collectives, one has to map (counts[], type) to (newcounts[], newtypes[]), which then requires the w-collective, since only it takes a vector of types. For reductions, one has to unwind the datatype inside of a user-defined reduction. None of the solutions available outside of MPI work for nonblocking collectives, due to the allocation of temporary vector arguments. If it is possible with generalized requests, it is onerous.
A more subtle issue is the large-displacement problem, which exists even if all of the counts are less than INT_MAX, because of the limitations of the offset vector. If the sum of counts[i] up to any i < comm_size exceeds INT_MAX, then displs[i] will overflow. This means that one cannot use any of the v-collectives for relatively small data sets, e.g. 3B floats, which is only 12 GB per process. This is likely to be limiting when implementing 3D FFT, matrix transpose, and IO aggregation, all of which are likely to use v-collectives. Neighborhood collectives fixed the large-displacement problem, but if a user wants to use those as a drop-in replacement, they have to create a new communicator.
The displacement issue is exacerbated in the large-count case because all the displacements are interpreted in bytes rather than in units of the extent of the datatype, so there is no way to index beyond 2 GB of data, irrespective of the datatype and the counts.
Using the w-collective for large-count v-collectives has these issues: MPI_ALLTOALLW takes displacements of type int and interprets these irrespective of the extent of the datatype (see page 173 of MPI-3), so it is hard to index more than 2 GB of data using any datatype. There is a solution using datatypes encoded with the offset internally (e.g. via MPI_Type_create_struct), but it is far from user-friendly.
In the absence of proper support in the MPI standard, the most reasonable implementation of large-count v-collectives uses point-to-point, which means that users must make relatively nontrivial changes to their code to support large counts, or they have to use something like BigMPI, which already implements these functions (vcollectives_x.c). An RMA-based implementation is also possible, but users are unlikely to accept this suggestion.
One can also map the v-collectives to MPI_Neighbor_alltoallw, but in a far-from-efficient manner, and this is not particularly useful for the nonblocking case because MPI_Dist_graph_create_adjacent is blocking.
Proposal
The straightforward, user-friendly solution to this problem is to add new functions that use MPI_Count and MPI_Aint for counts and displacements, respectively. We are not proposing to add new functions for everything, just the standard collectives (neighborhood collectives will be proposed later as a separate ticket).
Adding _x versions of the v-collectives and w-collectives that have counts of type MPI_Count and displacement vectors of type MPI_Aint[] is the most direct solution and prevents users from having to allocate and set O(Nproc) vectors in the course of mapping to the most general collective available (e.g. MPI_NEIGHBOR_ALLTOALLW).
We add reductions (reduce, allreduce, reduce_scatter, reduce_scatter_block, scan, exscan) as well, with the limitation that user-defined reductions are not supported, because these would require a new version of
MPI_User_function, MPI_Op_create, and MPI_Op_free, which is error-prone. For user-defined reductions, it is feasible to use user-defined datatypes without an obvious loss of efficiency. Furthermore, there are other issues (mpi-forum/mpi-forum-historic#339) with user-defined reductions that should be addressed if this change is made.
Alternative solution
Another solution would be to add large-count support to derived datatypes, e.g. MPI_Type_contiguous_x, but this is not user-friendly. We should not ask users to start using derived datatypes to broadcast a contiguous array of 2.2 billion elements, for example.
Changes to the Text
These changes have been made in https://github.com/mpi-forum/mpi-standard/pull/34.
Impact on implementations
BigMPI implements large-count variants of most of the proposed functions, sometimes in more than one way. For example, large-count blocking collectives were implemented using point-to-point, neighbor_alltoallw, and one-sided. Nonblocking collectives are a problem, which is one of the big motivations for this ticket.
The implementation inside MPI libraries is straightforward, assuming they convert message sizes to bytes internally and correctly support, e.g., one billion 4-byte types.
Impact on Users
This ticket is the result of user complaints about MPI (e.g. http://gentryx.de/news_the_troubling_state_of_MPI.html, which was prominently cited in https://www.hpcwire.com/2014/04/30/time-look-beyond-mpi/).
The BigMPI project thoroughly evaluated the Forum's contention that datatypes were sufficient to address the large-count issue and found that this solution is unlikely to satisfy the majority of users, due to a number of performance and usability issues.