Difficulties with spawning new processes on the victim's node #33

Closed
abouteiller opened this issue Feb 7, 2018 · 2 comments
Labels
bug Something isn't working major

Comments

@abouteiller

Original report by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).


As reported on the ULFM mailing list, using a machinefile to restrict or drive the allocation of newly spawned processes is difficult.

@abouteiller
Author

Original comment by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).


This issue is rooted in OMPI: job-level constraints are forwarded from the original job to all spawnees. In this particular case, adding "-npernode 1" prevents all future processes from sharing a node, across all jobids handled by the same HNP. In a normal MPI application such behavior might be desired, but in the context of ULFM we need to be able to reuse nodes, that is, to respawn processes on a node where older processes failed.

Multiple solutions might be envisioned, but I think the cleanest is to provide an info key that prevents inheritance of the original job's parameters. I have created an OMPI issue on this topic: open-mpi/ompi#5376.
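
To make the failure mode concrete, here is a toy model in Python. It is only a sketch and not Open MPI's mapper: the `Launcher` class, its `place` method, and the `inherit` flag are hypothetical names, with `inherit=False` standing in for the proposed info key. It also assumes, for illustration, that the launcher's per-node count still includes the process that failed.

```python
# Toy model only: NOT Open MPI code. `Launcher`, `place`, and `inherit`
# are hypothetical names invented for this illustration.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Launcher:
    """Stands in for the HNP: counts processes placed per node, across all jobs."""
    npernode: Optional[int] = None               # job-level limit of the original job
    placed: Dict[str, int] = field(default_factory=dict)

    def place(self, node: str, inherit: bool = True) -> bool:
        """Try to place one process on `node`; return True if accepted.

        With inherit=True the original job's npernode limit is applied to
        this placement as well, which models the forwarding of job-level
        constraints to spawnees. With inherit=False the limit is ignored.
        """
        limit = self.npernode if inherit else None
        used = self.placed.get(node, 0)
        # Assumption: a failed process still counts toward `used`.
        if limit is not None and used >= limit:
            return False
        self.placed[node] = used + 1
        return True


hnp = Launcher(npernode=1)                       # original job ran with "-npernode 1"
hnp.place("node0")
hnp.place("node1")                               # suppose this process later fails
refused = hnp.place("node1")                     # respawn with inherited limit
accepted = hnp.place("node1", inherit=False)     # respawn without inheritance
```

In this model the first respawn attempt on `node1` is refused because the inherited limit is already reached, while the second succeeds once inheritance is switched off, which is the behavior ULFM needs in order to reuse the victim's node.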

@abouteiller
Author

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


open-mpi#5376 has been imported, along with a fix for the 'oversubscribe' non-propagation issue; this should resolve the problem.

@abouteiller abouteiller added major bug Something isn't working labels Apr 28, 2021