Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mpich additions for Romio341 refresh #5

Commits on Apr 13, 2021

  1. [mpich romio 73a3eba] darray needs a resize in the block()/cyclic() s…

    …ubtype creation
    
    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    When Type_create_darray() uses the block()/cyclic() function to
    construct a subtype for whatever dimension it's processing, it
    needs to resize the resulting type before it goes into the type
    construction as the input for processing the next dimension.
    
    The same problem is in the main path's type creation, and in romio's
    mirroring of it.
    
    Gist for a testcase:
    https://gist.github.com/markalle/940de93d64fd779e304ee124855b8e6a
    
    The darray_bug_romio.c testcase creates a darray using
    * 4 ranks in a 2x2 grid
    * looking at the type made for rank 0
    * inner dimension: 4 ints distributed block over 2 ranks with 2 items each
    * outer dimension: 6 of the above distributed cyclic over 2 ranks with 2 items each
    
    The type created by processing the first dimension should look like this
    [ x x . . ]
    
    And then distributing those for the second dimension becomes
    
    [ x x x x ]
    [ . . . . ]
    [ . . . . ]
    [ . . . . ]
    [ x x x x ]
    [ . . . . ]
    
    Going to the MPI standard to justify why the first layout is right,
    it's where the definiton of the cyclic() function has a ub_marker
    of gsize*ex, eg 4 ints for that first type.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    fd408dd View commit details
    Browse the repository at this point in the history
  2. [mpich romio c4b5762] romio flatten has flat->count consistency problems

    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    ADIOI_Count_contiguous_blocks and ADIOI_Flatten are very similar functions,
    but they diverge a lot around the edges and in degenerate cases.  In Spectrum
    MPI I spent some time making them consistent with each other but
    found that to be a losing battle.
    
    So the approach I used here is to not have Count() be as definitive,
    and rather let Flatten() have the last word on what the final flat->count
    really is.  Eg Flatten's *curr_index is the real count.
    
    The changes I made are
    
    1. Fix a couple places in Flatten where *curr_index was updated out
    of sync with what was actually being added to flat->indices[] and
    flat->blocklens[].  That's one of the important book-keeping rules
    Flatten should follow.  There were a couple places (when counts[] arrays
    have 0s) where that wasn't the case (see the "nonzeroth" var in this
    commit).
    
    2. The main change I made was to reset flat->count based on
    Flatten's curr_index.  This is because the divergence between the
    two functions is too great to reliably fix.
    
    3. A third change is just a safety valve, using a
    flatlist_node_grow() function just in case the
    Count function returns a smaller value than Flatten ends up
    trying to write into the array, and related to this I
    got rid of the various --(flat->count) lines, since that var now
    represents the allocated size of the array, until right after
    the Flatten function when it's reset to represent the data written
    to the array like it used to be.
    
    I don't think we even need ADIOI_Count_contiguous_blocks() anymore,
    but I didn't remove it.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    66b76ae View commit details
    Browse the repository at this point in the history
  3. [mpich romio bbb5210] make ADIOI_GEN_WriteStrided not step on itself

    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    The ADIOI_GEN_WriteStrided funcion uses data sieving on non-contiguous
    types. That is, if it wants to write data at locations
        [W...X...Y...Z...]
    it reads the whole buffer
        [dddddddddddddddd]
    changes the locations it wants to write to
        [WdddXdddYdddZddd]
    then writes the whole thing back. It uses locks to make this safe, but
    the problem is this only protects against other parts of the product that
    are using locks. And without this PR a peer who is simultaneously making
    a simple non-contiguous write wouldn't have locked.
    
    A testcase to demonstrate the original problem is here:
    https://gist.github.com/markalle/d7da240c19e57f095c5d1b13240dae24
    
    % mpicc -o x romio_write_timing.c
    % mpirun -np 4 ./x
    
    Note: you need to use a filesystem that uses ADIOI_GEN_WriteStrided to
    hit the bug. I was using GPFS.
    
    This commit is pulled from wkliao after discussing where to put the
    new lock.  It adds locks to contiguous writes in independent write
    functions when data sieving write is not disabled
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    wkliao authored and markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    8a5ea1b View commit details
    Browse the repository at this point in the history
  4. [mpich romio ad0e435] romio fix when used in -Wpedantic build environ…

    …ments
    
    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    typecasting a (void*) due to compiler complaint when -Wpedantic is used
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    6bda513 View commit details
    Browse the repository at this point in the history
  5. [mpich romio e1d42af] making romio GPFS able to build

    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    Configuring mpich with --with-file-system=nfs+ufs+gpfs it wouldn't
    build because the last arg of ADIOI_OneSidedWriteAggregation()
    changed, so this PR adds an & to it
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    956c53e View commit details
    Browse the repository at this point in the history
  6. [mpich romio 313289a] romio gpfs: avoid freeing NULL

    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    Configuring mpich with --with-file-system=nfs+ufs+gpfs it wouldn't
    run because there are a couple free() calls that reported null
    pointers, so I changed a couple places to "if (x) { ADIOI_Free(x); }"
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    ba39cd2 View commit details
    Browse the repository at this point in the history
  7. [mpich romio 83bbb82] romio GPFS: missing initialization

    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    Romio GPFS had a runtime problem due to a .mem_ptrs field that
    was uninitialized that it was potentially trying to free.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    5d9deb3 View commit details
    Browse the repository at this point in the history
  8. [mpich romio a712b56] romio/gpfs: Fix header include order

    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    adio.h must be included before any system headers so that config
    settings are taken into account. Fixes an issue where aligned_alloc
    may be used without a declaration. See similar fix in [4fcffbe695cb].
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    raffenet authored and markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    736a186 View commit details
    Browse the repository at this point in the history
  9. [mpich romio 5a036e7] mpl/cuda: Fix header include order

    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    mpl.h must be included before any system headers so that config
    settings are taken into account. Fixes an issue where aligned_alloc
    may be used without a declaration.
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    raffenet authored and markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    4e2ba6f View commit details
    Browse the repository at this point in the history
  10. [mpich romio d97c4ee] romio: ensure config headers are included first

    > Pulled in from mpich romio, branch "main".
    > Their commit message is below.
    >
    > This is part of a batch of commits from the
    > following set of PRs:
    >  *  pmodels/mpich#4943
    >     -- darray fix which contains a flatten fix
    >         73a3eba
    >         c4b5762
    >  *  pmodels/mpich#4995
    >     -- write strided fix
    >         bbb5210
    >  *  pmodels/mpich#5100
    >     -- build fix for -Wpedantic
    >         ad0e435
    >  *  pmodels/mpich#5099
    >     -- build fix, they had let file-system=...gpfs bit rot
    >         e1d42af
    >         313289a
    >         83bbb82
    >  *  pmodels/mpich#5150
    >     -- build fix, configure-related _GNU_SOURCE
    >         a712b56
    >         5a036e7
    >  *  pmodels/mpich#5184
    >     -- build fix, continuation of _GNU_SOURCE fix
    >         d97c4ee
    
    There was a problem with romio's use of MPL and the include order
    sometimes having system files before the config files generated
    by configure.  So for example at configure time _GNU_SOURCE was
    used and thus configure concluded that aligned_memory() was
    already declared, but then later in romio sometimes stdlib.h or
    similar sys includes were coming before mplconfig.h and maybe
    even before romioconf.h.
    
    My understanding of the design described by hzhou is that
    universlaly putting the config files at the top is the best
    solution to this ordering.
    
    All over the romio code they seem to use adio.h as their top
    level include, which has romioconf.h as its first include.
    So in this PR I'm moved adio.h to the top of a few more files.
    
    And I also made romioconf.h include mplconfig.h as its first
    include.  Otherwise what we have for romio's use of MPL is
    
    adio.h:
      includes romioconf.h  <-- we're relying on _GNU_SOURCE being here too
      includes many things including system files
      includes adioi.h
        includes mpl.h
          includes mplconfig.h with _GNU_SOURCE
    
    so having _GNU_SOURCE down in mplconfig.h doesn't really
    ensure it's in front of the system files unless we can
    rely on it being there in romioconf.h (Maybe we can? I'm not
    sure).  Anyway as long as the top level romioconf.h knows
    it's consuming MPL it can put mplconfig.h at its top and then
    all the rest of the romio code only needs to follow the rule
    of putting adio.h first and then both romio and mpl will get
    their config settings included at the top
    
    Signed-off-by: Mark Allen <markalle@us.ibm.com>
    markalle committed Apr 13, 2021
    Configuration menu
    Copy the full SHA
    d6998de View commit details
    Browse the repository at this point in the history