
2022 10 31

Aurelien Bouteiller edited this page Oct 31, 2022 · 1 revision

Agenda

Ignacio presents the changes made to ReInit based on feedback from the last meeting

Reinit

Tools interactions

Many of the issues raised were related to tool interactions

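New (replacement) processes execute the MPI_INIT call, but the existing (surviving) processes do not. This means that if a tool intercepts MPI_INIT through PMPI (i.e., wraps it and calls PMPI_Init), the tool will deadlock whenever its wrapper does something synchronizing, which it is normally allowed to do: only the new processes reach the wrapper.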
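The counting argument behind the deadlock can be sketched in plain C, without MPI. This is only a toy model under stated assumptions: `is_replacement`, `count_ranks_entering_init_wrapper`, and `tool_barrier_completes` are hypothetical names, and the "barrier" is modeled as a check that every rank in the world arrives.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (not MPI code): is_replacement[i] is true when rank i is a
 * freshly started process, and false when rank i survived the fault.
 * Assumption from the notes: only freshly started processes execute
 * MPI_INIT, so only they enter a tool's PMPI wrapper around MPI_Init. */
int count_ranks_entering_init_wrapper(const bool *is_replacement, int world_size)
{
    int entered = 0;
    for (int i = 0; i < world_size; i++) {
        if (is_replacement[i]) {
            entered++;
        }
    }
    return entered;
}

/* A world-wide synchronization inside the wrapper can only complete if
 * every rank of the world takes part in it. */
bool tool_barrier_completes(const bool *is_replacement, int world_size)
{
    return count_ranks_entering_init_wrapper(is_replacement, world_size) == world_size;
}
```

At initial startup every rank is new, so the model reports that the synchronization completes. After a reinit in which only one rank out of four was replaced, one rank enters the wrapper and waits for three survivors that never arrive.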

The idea to fix this is "reinit callbacks", invoked at two points:

Before cleanup: before the state of MPI is cleaned up, e.g., before the longjmp (maybe this one is not needed?)

After cleanup: just before the resilient_fn, e.g., after the longjmp, just before the handover to user code
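The ordering of the two proposed callbacks around the longjmp can be illustrated with a small MPI-free sketch. Everything here is an assumption made for illustration: `register_reinit_callbacks`, `simulate_failure`, and `run_reinit_demo` are hypothetical names, not part of MPI or of the ReInit proposal, and the "cleanup" step is just a recorded event.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type, invented for this sketch. */
typedef void (*reinit_cb_fn)(void *tool_state);

static reinit_cb_fn before_cleanup_cb = NULL;
static reinit_cb_fn after_cleanup_cb = NULL;
static void *cb_state = NULL;
static jmp_buf restart_point;
static char trace[160]; /* records the order in which events happen */

static void record(const char *event)
{
    strncat(trace, event, sizeof(trace) - strlen(trace) - 1);
    strncat(trace, ";", sizeof(trace) - strlen(trace) - 1);
}

/* Hypothetical registration call a tool would use. */
void register_reinit_callbacks(reinit_cb_fn before, reinit_cb_fn after, void *state)
{
    before_cleanup_cb = before;
    after_cleanup_cb = after;
    cb_state = state;
}

/* Stand-in for the runtime reacting to a fault on a surviving process. */
static void simulate_failure(void)
{
    if (before_cleanup_cb) before_cleanup_cb(cb_state); /* MPI state still intact */
    record("cleanup");                                  /* MPI state torn down */
    longjmp(restart_point, 1);                          /* roll back to the restart point */
}

static void tool_before(void *state) { (void)state; record("before_cleanup"); }
static void tool_after(void *state)  { (void)state; record("after_cleanup"); }

/* Stand-in for the user's resilient function. */
static void resilient_fn(int restarted)
{
    record(restarted ? "resilient_fn(restart)" : "resilient_fn(first)");
    if (!restarted) simulate_failure(); /* inject exactly one fault */
}

const char *run_reinit_demo(void)
{
    volatile int restarted = 0;
    trace[0] = '\0';
    register_reinit_callbacks(tool_before, tool_after, NULL);
    if (setjmp(restart_point) != 0) {
        /* Back from the longjmp: run the after-cleanup callback just
         * before handing control over to user code again. */
        restarted = 1;
        if (after_cleanup_cb) after_cleanup_cb(cb_state);
    }
    resilient_fn(restarted);
    return trace;
}
```

Calling `run_reinit_demo()` from a `main` and printing the result shows the order of events: the first run of the resilient function, the before-cleanup callback, the cleanup, the after-cleanup callback, then the restarted resilient function. Survivors thereby get a tool hook that does not depend on re-executing MPI_INIT.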

Question: what about attribute callbacks (e.g., delete callbacks)? When should they be called?

Should they be called for MPI_COMM_WORLD and MPI_COMM_SELF? Probably not (these communicators are still present after reinit).

Should they be called for user communicators? Maybe (Ignacio and Aurelien disagree on this; Aurelien wants to call them). If called, they would run between the before-cleanup and after-cleanup callbacks.

Are the attribute delete functions synchronizing (or potentially synchronizing)?

Are the attribute functions PMPI-able?

Potential problems: Q-MPI will have a very different interface, and there may be problems there. Q-MPI needs to chain multiple tools and callbacks; details TBD.

Dynamic processes and Reinit

Question: what is the status of spawnees after MPI reinit? The intercommunicator is broken (at least on the spawner side it has been deleted, so on the spawnee side it connects to nothing).

Should the MPI_COMM_SPAWN call become erroneous when MPI_ERRORS_REINIT is set?

What about CONNECT/ACCEPT? Should they also be erroneous to use with REINIT? Some papers (from Bill Gropp) argued that they can be used for lightly coupled FT.

GET_PARENT? Others?

The general idea is to write text that makes using any of these functions erroneous when Reinit is active.

Reinit and multiple faults during reinit

Question: could multiple failures during reinit drift the state into becoming unrecoverable?

World group membership: can it be inconsistent after a reinit? To be discussed next meeting.

What if a failure strikes before the replacement process calls reinit? To be discussed next meeting.
