2017 09 27

Wesley Bland edited this page Sep 27, 2017 · 1 revision

Attendees

  • Intel - Wesley
  • Argonne - Ken, Yanfei
  • UTK - Aurelien
  • ORNL - Geoffroy

Error Handlers

  • Wesley made edits based on the feedback from the face-to-face.
    • A couple of very minor edits still need to be made.

Process Fault Tolerance

  • Is it possible to use ULFM and Reinit at the same time?
    • Not sure how they can be composed, even if only the smaller communicator used ULFM: after a process failure, the error handler for the larger communicator is still likely to be triggered, which would trigger reinit.
  • We don't think it's a problem to use error handlers, but if MPI_ERRORS_REINIT is used, it would need to be set consistently across all communicators.
    • We still like using error handlers better than an API call:
      • It doesn't create a new API interface
      • Changing the error handler is already required for process fault tolerance anyway.
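
The composition concern above can be illustrated with a small toy model. This is only a sketch and does not call MPI: the `Comm` class and both helper functions are hypothetical names invented for illustration, `MPI_ERRORS_REINIT` is the handler under discussion rather than a constant from a released MPI standard, and the model assumes (more strongly than the "likely" in the notes) that a failure triggers the handler of every communicator containing the failed rank.

```python
from dataclasses import dataclass

# Handler names used as plain strings in this toy model.
# MPI_ERRORS_REINIT is the proposed handler from the discussion (assumption).
ERRORS_RETURN = "MPI_ERRORS_RETURN"
ERRORS_REINIT = "MPI_ERRORS_REINIT"


@dataclass
class Comm:
    """Hypothetical stand-in for a communicator: a name, its ranks, its handler."""
    name: str
    ranks: frozenset
    errhandler: str = ERRORS_RETURN


def handlers_triggered(comms, failed_rank):
    """Map each communicator containing the failed rank to its error handler."""
    return {c.name: c.errhandler for c in comms if failed_rank in c.ranks}


def reinit_is_consistent(comms):
    """True if all communicators use ERRORS_REINIT, or none of them do."""
    uses_reinit = {c.errhandler == ERRORS_REINIT for c in comms}
    return len(uses_reinit) <= 1


# A larger communicator set to reinit, and a smaller one meant for
# ULFM-style local recovery through returned error codes.
world = Comm("world", frozenset(range(4)), ERRORS_REINIT)
sub = Comm("sub", frozenset({0, 1}), ERRORS_RETURN)

# Rank 1 belongs to both communicators, so both handlers are triggered.
triggered = handlers_triggered([world, sub], failed_rank=1)
```

In this model, the failure of rank 1 reaches the reinit handler on `world` even though `sub` was set up for local recovery, and `reinit_is_consistent([world, sub])` reports the mixed configuration that the notes say would need to be avoided.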

TODO Items

  • Aurelien - Write first draft of ULFM composability/recovery advice to have libraries repair MPI in one place.
  • Aurelien - Merge MPI_COMM_ISHRINK branch
  • Aurelien - Go back over other ULFM branches so we can discuss them next time
  • Wesley - Go back through ULFM RMA discussions to see what we need to do (if anything) to move forward.
  • Wesley - Improve slides for catastrophic errors to include example use cases