-
Notifications
You must be signed in to change notification settings - Fork 859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ROMIO 3.4.1 refresh #8740
ROMIO 3.4.1 refresh #8740
Conversation
@markalle Are your patches in this refresh? |
I'd still like for some of the commits here to become part of the mpich codebase. Eg I'm sure that modernization commit would go in pretty easy. And I bet some of the OMPI patch commit could be #ifdef in a way that would work for both projects, like where it sets/clears errhandlers differently I guess based on OMPI having a different load/unload location for using romio. But I think the main thing everybody's wanting from me is that list of MPICH commits and whether they're represented here. So the good news is they're all at least completely checked in to MPICH's "main" branch now, so we're at least not stuck waiting for anything more at that level. I don't think any of them are in romio 3.4 unfortunately and thus aren't represented here, they're just in mpich "main". And here's the list (of PRs, some of which contain multiple commits): So I'm thinking I should just make another PR added on top of this one to reflect all the above additions. It actually shouldn't take real long now that I've gone back and looked up all the PRs. Getting stuff like that organized is most of the battle. |
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/2cac31ac8c48da2554c089095605aec3 |
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/06c2a0ba433ca56f7f25034d06f8a2c6 |
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/a1bd47d6d8d8f2631996b916838d921c |
@ggouaillardet |
@ggouaillardet Any time to look at @markalle's comment? Love to get this into master soon. Thanks, |
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
c671da5
to
78bcbec
Compare
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/8989f9e5bf0d0dfd2c5ce5289a26ed2a |
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/07985725cda3be985b9e71ab4c5072d0 |
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/6e9e1c3a7819e2bbec48ebb09bc439a0 |
78bcbec
to
e8f7a68
Compare
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/9e9a320b174431fe1c4a6d3fdd820842 |
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/5a33ba5b3bb640688f9631cf229ce37e |
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/2e8751b5ce3e8e8da22f66b88671d111 |
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/5eb4f03c5dcdf5b6b9c789db7a6b4b3a |
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
257273a
to
99a02c4
Compare
:bot:as:retest |
:bot:aws:retest |
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
…ubtype creation > Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee When Type_create_darray() uses the block()/cyclic() function to construct a subtype for whatever dimension it's processing, it needs to resize the resulting type before it goes into the type construction as the input for processing the next dimension. The same problem is in the main path's type creation, and in romio's mirroring of it. Gist for a testcase: https://gist.github.com/markalle/940de93d64fd779e304ee124855b8e6a The darray_bug_romio.c testcase creates a darray using * 4 ranks in a 2x2 grid * looking at the type made for rank 0 * inner dimension: 4 ints distributed block over 2 ranks with 2 items each * outer dimension: 6 of the above distributed cyclic over 2 ranks with 2 items each The type created by processing the first dimension should look like this [ x x . . ] And then distributing those for the second dimension becomes [ x x x x ] [ . . . . ] [ . . . . ] [ . . . . ] [ x x x x ] [ . . . . ] Going to the MPI standard to justify why the first layout is right, it's where the definiton of the cyclic() function has a ub_marker of gsize*ex, eg 4 ints for that first type. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee ADIOI_Count_contiguous_blocks and ADIOI_Flatten are very similar functions, but they diverge a lot around the edges and in degenerate cases. In Spectrum MPI I spent some time making them consistent with each other but found that to be a losing battle. So the approach I used here is to not have Count() be as definitive, and rather let Flatten() have the last word on what the final flat->count really is. Eg Flatten's *curr_index is the real count. The changes I made are 1. Fix a couple places in Flatten where *curr_index was updated out of sync with what was actually being added to flat->indices[] and flat->blocklens[]. That's one of the important book-keeping rules Flatten should follow. There were a couple places (when counts[] arrays have 0s) where that wasn't the case (see the "nonzeroth" var in this commit). 2. The main change I made was to reset flat->count based on Flatten's curr_index. This is because the divergence between the two functions is too great to reliably fix. 3. A third change is just a safety valve, using a flatlist_node_grow() function just in case the Count function returns a smaller value than Flatten ends up trying to write into the array, and related to this I got rid of the various --(flat->count) lines, since that var now represents the allocated size of the array, until right after the Flatten function when it's reset to represent the data written to the array like it used to be. I don't think we even need ADIOI_Count_contiguous_blocks() anymore, but I didn't remove it. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee The ADIOI_GEN_WriteStrided funcion uses data sieving on non-contiguous types. That is, if it wants to write data at locations [W...X...Y...Z...] it reads the whole buffer [dddddddddddddddd] changes the locations it wants to write to [WdddXdddYdddZddd] then writes the whole thing back. It uses locks to make this safe, but the problem is this only protects against other parts of the product that are using locks. And without this PR a peer who is simultaneously making a simple non-contiguous write wouldn't have locked. A testcase to demonstrate the original problem is here: https://gist.github.com/markalle/d7da240c19e57f095c5d1b13240dae24 % mpicc -o x romio_write_timing.c % mpirun -np 4 ./x Note: you need to use a filesystem that uses ADIOI_GEN_WriteStrided to hit the bug. I was using GPFS. This commit is pulled from wkliao after discussing where to put the new lock. It adds locks to contiguous writes in independent write functions when data sieving write is not disabled Signed-off-by: Mark Allen <markalle@us.ibm.com>
…ments > Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee typecasting a (void*) due to compiler complaint when -Wpedantic is used Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee Configuring mpich with --with-file-system=nfs+ufs+gpfs it wouldn't build because the last arg of ADIOI_OneSidedWriteAggregation() changed, so this PR adds an & to it Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee Configuring mpich with --with-file-system=nfs+ufs+gpfs it wouldn't run because there are a couple free() calls that reported null pointers, so I changed a couple places to "if (x) { ADIOI_Free(x); }" Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee Romio GPFS had a runtime problem due to a .mem_ptrs field that was uninitialized that it was potentially trying to free. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee adio.h must be included before any system headers so that config settings are taken into account. Fixes an issue where aligned_alloc may be used without a declaration. See similar fix in [4fcffbe695cb]. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee mpl.h must be included before any system headers so that config settings are taken into account. Fixes an issue where aligned_alloc may be used without a declaration. Signed-off-by: Mark Allen <markalle@us.ibm.com>
> Pulled in from mpich romio, branch "main". > Their commit message is below. > > This is part of a batch of commits from the > following set of PRs: > * pmodels/mpich#4943 > -- darray fix which contains a flatten fix > 73a3eba > c4b5762 > * pmodels/mpich#4995 > -- write strided fix > bbb5210 > * pmodels/mpich#5100 > -- build fix for -Wpedantic > ad0e435 > * pmodels/mpich#5099 > -- build fix, they had let file-system=...gpfs bit rot > e1d42af > 313289a > 83bbb82 > * pmodels/mpich#5150 > -- build fix, configure-related _GNU_SOURCE > a712b56 > 5a036e7 > * pmodels/mpich#5184 > -- build fix, continuation of _GNU_SOURCE fix > d97c4ee There was a problem with romio's use of MPL and the include order sometimes having system files before the config files generated by configure. So for example at configure time _GNU_SOURCE was used and thus configure concluded that aligned_memory() was already declared, but then later in romio sometimes stdlib.h or similar sys includes were coming before mplconfig.h and maybe even before romioconf.h. My understanding of the design described by hzhou is that universlaly putting the config files at the top is the best solution to this ordering. All over the romio code they seem to use adio.h as their top level include, which has romioconf.h as its first include. So in this PR I'm moved adio.h to the top of a few more files. And I also made romioconf.h include mplconfig.h as its first include. Otherwise what we have for romio's use of MPL is adio.h: includes romioconf.h <-- we're relying on _GNU_SOURCE being here too includes many things including system files includes adioi.h includes mpl.h includes mplconfig.h with _GNU_SOURCE so having _GNU_SOURCE down in mplconfig.h doesn't really ensure it's in front of the system files unless we can rely on it being there in romioconf.h (Maybe we can? I'm not sure). Anyway as long as the top level romioconf.h knows it's consuming MPL it can put mplconfig.h at its top and then all the rest of the romio code only needs to follow the rule of putting adio.h first and then both romio and mpl will get their config settings included at the top Signed-off-by: Mark Allen <markalle@us.ibm.com>
The official mpich romio PRs I checked into included old-style MPI calls, so those needed modernized same as an earlier commit. I do think both of the modernize PRs could be put into mpich. Signed-off-by: Mark Allen <markalle@us.ibm.com>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
99a02c4
to
f186076
Compare
@gpaulsen I think this PR is now ready for prime time! My apologies for the delay, I have been unfortunately unable to spare more time than I hoped for Open MPI... @markalle I could not figure out some paperwork, so I am afraid my contributions to MPICH won't be accepted. That being said, I prepared a commit for MPICH at https://github.com/ggouaillardet/mpich/commit/055562718e685fcbe2e3b08fff9c3a48bc76d9e1.patch |
No description provided.