master/v5.0.x: no de-duplication of mpirun launch failure messages #10157
Comments
FWIW: I can only reproduce a single duplicate message no matter how many nodes I involve. The culprit seems to be this commit, added 2 years ago to prrte: openpmix/prrte@d702d8a. It sends the message to both the tool and stderr, as opposed to just one or the other.
Nice work tracking it down.
If the IO was successfully delivered to the tool via PMIx, let the tool display the show_help() message and not duplicate it to stderr. Refs: open-mpi/ompi#10157 Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Print the help to stderr only, don't send it to the tool. This restores behavior to be consistent with that of ORTE. Refs: open-mpi/ompi#10157 Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
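The two commit messages above both come down to the same rule: a help message should reach exactly one destination. As a minimal sketch of that rule (not the PRRTE code itself; `deliver_help`, `tool_sink`, and the fallback on `OSError` are hypothetical names and assumptions made for illustration):

```python
import io


def deliver_help(message, tool_sink, stderr):
    """Write `message` to exactly one destination and report which one.

    `tool_sink` is a hypothetical callable standing in for delivery to the
    tool; it is assumed to raise OSError when delivery fails. `stderr` is
    any writable text stream.
    """
    if tool_sink is not None:
        try:
            tool_sink(message)
            return "tool"  # delivered to the tool: do NOT also write stderr
        except OSError:
            pass  # delivery failed, so fall through to stderr
    stderr.write(message)
    return "stderr"


def failing_sink(message):
    """Hypothetical sink that always fails, to exercise the fallback."""
    raise OSError("tool not reachable")


if __name__ == "__main__":
    received = []
    err = io.StringIO()
    print(deliver_help("launch failed\n", received.append, err))
    print(deliver_help("launch failed\n", failing_sink, err))
```

The duplicate described in this thread corresponds to writing to both destinations unconditionally; returning early after a successful delivery is what avoids it in this sketch.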
Just FYI: The fix I provided doesn't address the issue of "show_help" messages from the OMPI or OPAL layers (or PMIx for that matter). Those messages don't flow thru the same code path - I believe they go thru the PMIx_Log API, or they may just appear on stderr. So those messages are not aggregated or otherwise handled in the "show_help" manner. So if you get any show_help messages from those sources, I'm afraid you'll still be missing the aggregation feature.
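For readers unfamiliar with the aggregation feature mentioned here, the idea can be sketched as follows. This is only an illustration of de-duplicating by help file and topic, not the ORTE, PRRTE, or PMIx implementation; `HelpAggregator`, its methods, and the summary wording are all hypothetical.

```python
class HelpAggregator:
    """Show the first copy of each help message; count later duplicates."""

    def __init__(self):
        # Maps (filename, topic) -> number of suppressed duplicates.
        # Plain dicts preserve insertion order, so summaries come out
        # in the order the messages were first seen.
        self._suppressed = {}

    def report(self, filename, topic, text):
        """Return `text` for the first occurrence, None for duplicates."""
        key = (filename, topic)
        if key in self._suppressed:
            self._suppressed[key] += 1
            return None
        self._suppressed[key] = 0
        return text

    def summary(self):
        """One line per message that was suppressed at least once."""
        return [
            f"{count} more process(es) sent help message {filename} / {topic}"
            for (filename, topic), count in self._suppressed.items()
            if count > 0
        ]


if __name__ == "__main__":
    agg = HelpAggregator()
    # Three hosts report the same (hypothetical) launch failure.
    for _ in range(3):
        shown = agg.report("help-example.txt", "launch-failed", "launch failed")
        if shown is not None:
            print(shown)
    for line in agg.summary():
        print(line)
```

Messages that bypass such a collector, as described above for the OMPI and OPAL layers, would each be printed in full.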
PMIx Updates:
6692c28a - Support the common "-np" option
PRRTE updates:
7ae2c08318 - Revise show_help to use PMIx IOF
cd4bd7333d - prtereachable: missed something in pr 1315
de5b560e85 - prte: check if dvm actually got set up
2fbc6a8555 - prtereachable: fix problem with nl-route
65545059b6 - Correct --do-not-launch option
52bd7dbf88 - Restore use of "--cpu-bind=none".
4dfcdd3cb0 - ompi/schizo: Expose "--mca" when parsing command line.
eb9502718f - Pass the allow-run-as-root option to the backend daemons
ab7fae01c7 - Fix indirect slurm launch
db0f97b5db - Bugfix: ompi_schizo would modify a const string in base_expose()
OMPI refs: open-mpi#10159 open-mpi#10157 open-mpi#10153
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Changes:
f3828e8307 - Revise show_help to use PMIx IOF
9998bfadad - schizo/ompi: Convert all single dashes to double dashes.
744710d33f - prte: check if dvm actually got set up
Refs: open-mpi#10097 open-mpi#10157
Note: There are no OpenPMIx changes to pull in at this time (4/06/2022).
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
This brings in changes to have PMIx/PRRTE handle and aggregate opal/pmix/prte_show_help() messages using PMIx_Log().
Refs: open-mpi#10157 openpmix/prrte#1326
PMIx changes:
26ff1684 - Cleanup the show_help changes a bit
d490e100 - Refactor show_help() to use the PMIx_Log() api.
69e6965c - Remove unnecessary function call in pmix_gds_hash_fetch().
6c2e4f4e - Include Value_compare in pmix_deprecated.h
43715832 - Missed one
0ad46040 - Fix descriptions - ensure they are NULL terminated
3a8ef094 - Streamline operations
3797102a - Remove enum assignment
fec3c6f6 - Improve value compare coverage
8ef6b16c - Initial implementation of the memory footprint reduction
5a62b782 - Add a struct for storing data in the hash
5230c2a4 - Provide a mechanism for registering/looking up attributes
dbcaf776 - Add an index to dictionary of PMIx standard keys
07ba28c6 - Make pmix_common.h stand alone
1e31b7f2 - Error out if no atomic support is available
481074a9 - Rename function pointer members in pmix_tma_t.
1e24bfc6 - Checkpoint gds/shmem work.
PRRTE changes:
47b5ad1653 - Convert to use pmix_show_help and control aggregation behavior
b80d1311ca - Use PMIx_Log() for show_help() messages.
6a66855529 - Update to account for PMIX_MYSERVER_URI
e2c78e9aac - Streamline operations a bit
2bfb452012 - Error out if no atomic support is available
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
This brings in the changes to support opal_show_help() aggregation and de-duplication.
OpenPMIx changes:
8c39d8e6a9 - Cleanup the show_help changes a bit
ed812e275a - Refactor show_help() to use the PMIx_Log() api.
e3974c6cd9 - Make pmix_common.h stand alone
0f8ace8735 - Error out if no atomic support is available
PRRTE changes:
f75647a051 - Remove setting of PRTE_MCA_prte_base_help_aggregate.
8cd2c6a191 - Protect against compiling with PMIx without agg support.
dee64c5b77 - Restore noloop for logging.
06bd67ef6f - Control aggregation behavior.
ebfba531d7 - Use PMIx_Log() for show_help() messages.
6489a88203 - Cleanup report of bad executable name
9b3b16833f - Remove non-existent function
cac281f42c - Bugfix: ompi_schizo would modify a const string in base_expose()
fb1043dd88 - Add missing CLI option and parsing
09ed5f3d0b - Minor cleanups
804475bc1b - Minor formatting cleanups
bd8559209c - Correct --do-not-launch option
448c37d5d0 - Restore use of "--cpu-bind=none".
05c007fe63 - Pass the allow-run-as-root option to the backend daemons
f2f33e1f27 - Fix indirect slurm launch
7903cd5ba7 - Protect against proxy confusion
0d80549ad5 - Add some missing help verbiage
f235d08152 - Correctly determine when to daemonize backend prted
25a36cb2ba - Some really minor cleanups
e235d830c7 - build: check_package static improvements
19f0d87764 - prtereachable: missed something in pr 1315
b962f9d8a3 - prtereachable: fix problem with nl-route
a4fb790472 - Error out if no atomic support is available
Refs: open-mpi#10157 openpmix/prrte#1326
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
On HEAD of master and v5.0.x, we seem to have lost the ability to de-duplicate launch failure messages.
For example, in v4.1.x, we only see one copy of the launch failure message:
But on master / v5.0.x HEAD, the same message is repeated once per host, and overlays itself in mpirun's output, making it very difficult to read/understand (the problem gets worse if there are more nodes involved):