2017 12 05

Working Group

FT Interoperability Test

How does the user pick a model? How do we tell the user what was picked and/or is available?

Options

Requested / Provided
- Positives:
  - Requesting an FT model is atomic.
  - The user will always get an answer on the first try.
  - If someone else has "requested" something else, that would be "provided".
- Negatives:
  - FT models don't "fall back" on each other the way threading models do.
  - The user might not get the second choice that they want (if there are more than two).
Request + Yes / No
- Positives:
  - User has complete control over which model they pick.
  - Doesn't cause an error if the model is not available.
- Negatives:
  - Could have to iterate over lots of models.
  - Can't change your mind after you've picked (could pair with an API to get the list of models).
- This seems to be the best option.
  - It should be paired with an API that lists the models the implementation supports.
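
A minimal sketch of how "Request + Yes / No" paired with a list API might look. Every name here (`ft_model`, `ft_request_model`, `ft_get_supported_models`, the model identifiers) is hypothetical and made up for illustration; nothing in it is proposed text or an existing MPI interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical FT model identifiers; not part of any MPI standard. */
typedef enum { FT_MODEL_NONE = 0, FT_MODEL_ULFM, FT_MODEL_REINIT } ft_model;

/* Models that this imaginary implementation supports (an assumption). */
static const ft_model supported_models[] = { FT_MODEL_ULFM };
static const size_t num_supported =
    sizeof supported_models / sizeof supported_models[0];

/* The model picked so far, if any. */
static ft_model selected = FT_MODEL_NONE;

/* List API: hands back the supported models and returns how many there are. */
size_t ft_get_supported_models(const ft_model **models)
{
    *models = supported_models;
    return num_supported;
}

/* Request API: answers yes (true) or no (false) and never raises an error.
 * Once a model has been picked, only a request for that same model gets
 * a "yes", which mirrors "can't change your mind after you've picked". */
bool ft_request_model(ft_model model)
{
    if (selected != FT_MODEL_NONE)
        return selected == model;
    for (size_t i = 0; i < num_supported; i++) {
        if (supported_models[i] == model) {
            selected = model;
            return true;
        }
    }
    return false;
}

/* Walks a preference list in order and returns the first model that was
 * granted, or FT_MODEL_NONE if none was. */
ft_model ft_pick_first_available(const ft_model *prefs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ft_request_model(prefs[i]))
            return prefs[i];
    }
    return FT_MODEL_NONE;
}
```

With the list API available, a user could check `ft_get_supported_models` first and avoid iterating over many requests that would be answered "no".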
Request + Error / Success

Next Step

Write up some text for this and create a separate ticket.

Data Resilience

We went over the comments on the data resilience proposal (https://github.com/mpiwg-ft/mpi-standard/pull/4) and decided to pause work on it in the context of ULFM. If work continues, it will be as a separate proposal.

Ishrink

Keita asked about the status of MPI_ISHRINK, but Aurelien was not present to give an update.

Reinit

What are the parts of the Reinit proposal that might be difficult to standardize?

Function Pointers

We require a function pointer that we can use to long jump.
This will force the application to use a recent enough version of Fortran at least for this part of the application.
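
To make the function pointer and long jump idea concrete, here is a rough sketch in plain C using `setjmp`/`longjmp`. The names `restart_fn`, `run_with_reinit`, `trigger_reinit`, and `demo_entry` are hypothetical and are not taken from the Reinit proposal; the sketch only shows the control flow, assuming the application hands over its entry point as a function pointer.

```c
#include <setjmp.h>

/* Hypothetical signature of the entry point the application hands over. */
typedef int (*restart_fn)(int restart_count);

/* Saved execution context of the restart point. */
static jmp_buf restart_point;

/* Stand-in for "a failure was detected": jump back to the restart point. */
void trigger_reinit(void)
{
    longjmp(restart_point, 1);
}

/* Calls entry and re-enters it after every jump back, up to max_restarts
 * times. Returns the entry point's result, or -1 once the limit is hit. */
int run_with_reinit(restart_fn entry, int max_restarts)
{
    /* volatile: modified between setjmp and a later longjmp. */
    volatile int restarts = 0;

    if (setjmp(restart_point) != 0) {
        /* We arrive here via longjmp from trigger_reinit. */
        restarts++;
        if (restarts > max_restarts)
            return -1;
    }
    return entry(restarts);
}

/* Demo entry point that simulates a failure on its first two attempts. */
int demo_entry(int restart_count)
{
    if (restart_count < 2)
        trigger_reinit();
    return restart_count;
}
```

In this sketch, `run_with_reinit(demo_entry, 5)` re-enters `demo_entry` twice before it completes. Anything the entry point allocated or opened before the jump is simply abandoned, which is the cleanup question raised in the next section.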

Cleaning Up

How do we handle files that are still open when the application jumps back to reinit?
- Do we close the file? Leave it open and transparently deal with it inside MPI?
- If we want to leave the files open, we need to figure out what fopen does and try to do something similar.
What do we do with allocated memory?
The original solution was to have lightweight error handlers decide whether to free memory and close files.
- What can you do inside a cleanup handler?
How do we handle dynamic processes?
- We probably have to disconnect all dynamic processes.
What do we do with PVARs? Do they get reset on reinit? How do we handle all of the different kinds of PVARs?
Same thing with CVARs. These probably need to be reset to their initial values.
Need to reset attributes and info keys on MPI_COMM_WORLD (and friends).
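
One way to picture the lightweight cleanup handlers mentioned above is a small stack of callbacks that runs in reverse registration order before the jump back to reinit. This is only a sketch under that assumption; `cleanup_push`, `cleanup_run_all`, and the two example handlers are hypothetical names, and the sketch does not answer what a handler would be allowed to do.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cleanup handler type: receives the resource to release. */
typedef void (*cleanup_fn)(void *resource);

typedef struct {
    cleanup_fn fn;
    void *resource;
} cleanup_entry;

#define MAX_CLEANUP 16
static cleanup_entry cleanup_stack[MAX_CLEANUP];
static size_t cleanup_depth = 0;

/* Registers a handler for a resource. Returns 0 on success, -1 if full. */
int cleanup_push(cleanup_fn fn, void *resource)
{
    if (cleanup_depth == MAX_CLEANUP)
        return -1;
    cleanup_stack[cleanup_depth].fn = fn;
    cleanup_stack[cleanup_depth].resource = resource;
    cleanup_depth++;
    return 0;
}

/* Runs all registered handlers, most recent first, and empties the stack.
 * Returns the number of handlers that ran. */
size_t cleanup_run_all(void)
{
    size_t ran = 0;
    while (cleanup_depth > 0) {
        cleanup_entry e = cleanup_stack[--cleanup_depth];
        e.fn(e.resource);
        ran++;
    }
    return ran;
}

/* Example handler: frees memory obtained from malloc. */
void release_buffer(void *resource)
{
    free(resource);
}

/* Example handler: closes a FILE* opened by the application. */
void close_file(void *resource)
{
    fclose((FILE *)resource);
}
```

An application would call `cleanup_push(release_buffer, buf)` after each allocation, and the reinit path would call `cleanup_run_all()` before jumping back.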

Readings

Error Handler Changes

The reading was generally successful. A few minor changes were requested and made. These will need to be voted on at a no-no vote at the February meeting.

The sentence about MPI being undefined after an error was removed from this proposal given that the catastrophic error proposal is going to tackle that problem in a different way.

Catastrophic Errors

The Forum felt strongly that the way to detect catastrophic errors should not be via an API call, but should come from the error class itself. The initial concern about the fact that not all errors have an error class was dismissed because you would never have checked for an error until you received an error code anyway.

Furthermore, the Forum decided that it would rather remove the notion of catastrophic errors completely and just treat all errors the same, as non-catastrophic errors. It would be up to the user to determine which errors are actually catastrophic and which ones aren't.

This has these main consequences:

- If the MPI library has what it considers a "catastrophic error", it might have to just abort. The set of errors that falls into this category should be very limited, however.
- The user will be responsible for deciding which kinds of errors it wants to handle and which ones it doesn't. This means that we'll need to provide more specific error classes whenever possible. We should look at what kinds of error classes might be useful. One example would be to look at errno for similar errors that we could borrow.
- The proposal should be changed to remove all of the notions of catastrophic errors and just remove the sentence about MPI being undefined after an error.
- Catastrophic (or any other) errors cannot be permanent. If they are, the library is probably in a situation where it just has to abort.
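
As an illustration of leaving the decision to the user, the sketch below maps error classes to actions in application code. The error classes and the `classify_error` policy are hypothetical and loosely modeled on errno-style categories; in a real MPI program the class would come from calling `MPI_Error_class` on a returned error code, which is left out here so the example stays self-contained.

```c
/* Hypothetical fine-grained error classes, loosely modeled on errno. */
typedef enum {
    ERRCLASS_SUCCESS = 0,
    ERRCLASS_PROC_FAILED,   /* a peer process is gone */
    ERRCLASS_TIMED_OUT,     /* compare ETIMEDOUT */
    ERRCLASS_NO_MEM,        /* compare ENOMEM */
    ERRCLASS_OTHER
} err_class;

/* What the application decides to do about an error. */
typedef enum { ACTION_CONTINUE, ACTION_RECOVER, ACTION_ABORT } err_action;

/* User-side policy: the application, not the library, decides which
 * classes it treats as catastrophic. This particular mapping is just
 * one possible choice. */
err_action classify_error(err_class cls)
{
    switch (cls) {
    case ERRCLASS_SUCCESS:
        return ACTION_CONTINUE;
    case ERRCLASS_PROC_FAILED:
    case ERRCLASS_TIMED_OUT:
        return ACTION_RECOVER;
    case ERRCLASS_NO_MEM:
    case ERRCLASS_OTHER:
    default:
        return ACTION_ABORT;
    }
}
```

The more specific the classes the library reports, the more precisely a policy like this can separate errors worth recovering from those where aborting is the only sensible response.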
