Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix finding "prte" for singleton comm_spawn #12390

Merged
merged 1 commit into from
Mar 13, 2024
Merged

Conversation

rhc54
Copy link
Contributor

@rhc54 rhc54 commented Mar 4, 2024

Copy the search code from mpirun to use when starting "prte" in support of comm_spawn. This respects things like OPAL_PREFIX.

@rhc54
Copy link
Contributor Author

rhc54 commented Mar 4, 2024

We don't have a way of putting a time limit on how long a PR can be posted, so I'll just put one here by comment. I'll close this PR out on Mar 11 if not committed by then.

@hppritcha hppritcha self-requested a review March 4, 2024 20:12
ompi/dpm/dpm.c Outdated Show resolved Hide resolved
@hppritcha
Copy link
Member

@jsquyres did you add the suggested strdup? I don't see it in the diff.

Copy the search code from mpirun to use when starting
"prte" in support of comm_spawn. This respects things
like OPAL_PREFIX.

Signed-off-by: Ralph Castain <rhc@pmix.org>
@jsquyres
Copy link
Member

@jsquyres did you add the suggested strdup? I don't see it in the diff.

Wow; I goofed there. I think I fixed it; have a look.

@hppritcha
Copy link
Member

@dalcinl does this patch fix your issue #12349 ?

@dalcinl
Copy link
Contributor

dalcinl commented Mar 12, 2024

I had no spare time to actually test the patch, but the code looks totally good. I'm still concerned about a the use of getenv("OMPI_PRTERUN") and OMPI_PRTERUN_PATH. I'd rather submit a PR myself in the future to open up a discussion about my concerns rather than keep being bullied and called to shut up.

@jsquyres
Copy link
Member

Here's my $0.02:

  • If this PR fixes @dalcinl's issue
  • Even if it doesn't fix all issues
  • And this PR isn't harmful to other cases

Then I think we should merge it and file an issue with the remaining case(s) that need to be fixed (pointing back to the original issue and pair of PRs so that all the history / context / discussion is preserved, etc.).

@rhc54
Copy link
Contributor Author

rhc54 commented Mar 12, 2024

rather than keep being bullied and called to shut up.

Wow, that's quite a comment. If true, my apologies - not my intent. I was simply trying to say that while I understand your point, the fact is that we have decided to go a different route. No point in continually arguing a point that we simply decided against following. Not every recommendation or concern can be addressed - sometimes, the answer is just "thanks, but no".

@rhc54
Copy link
Contributor Author

rhc54 commented Mar 12, 2024

@jsquyres You want me to update this branch and squash it?

@dalcinl
Copy link
Contributor

dalcinl commented Mar 12, 2024

Wow, that's quite a comment. If true, my apologies - not my intent.

No matter how polite you pretended to be, I took this comment #12363 (comment) as an explicit request to shut up. No problem at all, really, it's your prerogative and I'm not a snowflake. Just don't get surprised if people dislike you way to approach discussions.

I was simply trying to say that while I understand your point, the fact is that we have decided to go a different route.

I don't think you understood my point, not at all. And rather than trying, or asking for clarification, you just told me shutting me down.

I'm going to write here, once again, what is my concern, which is mostly about aesthetics, consistency, symmetry, and usability. Although often dismissed, attention to detail prevent bugs and surprises, so I consider my addmitedly annoying insistence worthwhile.

  • The mpirun code uses the environment variable OMPI_PRTERUN and a configure variable OMPI_PRTERUN_PATH to specify the path to the prterun executable. All good with this, the PRETERUN name for a prterun command makes total sense and is what everyone would expect.
  • The fixes in this PR are using the SAME environment variable name OMPI_PRTERUN and configure variable name OMPI_PRTERUN_PATH, but this time to specify the path to the (different?) prte executable to start the DVM in singleton init. IHMO, the environment variable should be named OMPI_PRTE and the configure variable should OMPI_PRTE_PATH. Otherwise, the naming convention is confusing and leads to inconsistencies.

I ask again: I'm getting all this wrong? To add to my confussion, prterun and prte are the same binary but with different names via a symlink. Do they effectively have different personalities depending on argv[0]? If they don't, then all is good. In other words, if either prterun and prte can start the DVM, then I'll stop complaining.
However, If prterun and prte are indeed different things (as I understood from the PRRTE documentation that I did read), then requiring users to set export OMPI_PRETERUN=/path/to/prte for running singleton, but export OMPI_PRETERUN=/path/to/prterun for running with mpirun is a misfeature and an invitation to confusion. Ideally, users should be able to specify each command separately with OMPI_PRETERUN and OMPI_PRTE.

@dalcinl
Copy link
Contributor

dalcinl commented Mar 12, 2024

Sorry, I'll not be able to test this today. I'll do my best to leave a note tomorrow with a confirmation.

@jsquyres
Copy link
Member

@dalcinl Ok, thanks. Please let us know how the test goes.

I think the root issue here is that we probably named the environment variable poorly: it probably really shouldn't have been named OMPI_PRTERUN, because that tends to imply things that it is not. You're right that the whole point of that environment variable is to look for the tree where PRRTE is installed. I propose the following:

  1. If this PR allows you to setenv OPAL_PREFIX and it fixes your problem, we merge it.
  2. We open another issue:
    • We should make a better name for the OMPI_PRTERUN environment variable. Since it's an env variable exposed to the Open MPI user, let's name it what it is meant to be used for: OMPI_PRRTE_PATH (or OMPI_PRRTE_PREFIX, or ... I'm open to suggestions here -- on the new issue, not the existing issue/PRs).
    • We'll have both mpirun (the OMPI executable) and the DPM look for this new env variable first, and if it finds it, it'll use it. If it doesn't find it, look for the legacy OMPI_PRTERUN env variable, and if it finds it, it'll use it. This is not a lot of code to write; we just agree on a new env variable name and add a little logic in mpirun and the DPM.
  3. I notice that the env variable OMPI_PRTERUN isn't documented anywhere 🤦‍♂️ (!). We need to document the new env variable name in mpirun(1).
    • I don't think we should even document the old name OMPI_PRTERUN anywhere. All the confusion and discussion on these 2 PRs and corresponding issue are evidence as to why we should just jettison the old name and never speak of it again.

Note that I also talked to @bosilca (on the phone): we did clear up something. He was thinking that this PR would set OPAL_PREFIX in the current MPI process (which would be far too late in the bootstrapping sequence to set it). However, the whole premise of this PR is that OPAL_PREFIX (etc.) are set before the MPI process was invoked. That changes the discussion significantly.

@jsquyres
Copy link
Member

I should point out that @rhc54 and I chatted on the phone about this for a while this afternoon. My comment above represents the results of that discussion.

@dalcinl
Copy link
Contributor

dalcinl commented Mar 13, 2024

As per my local testing, things seem to be working OK when setting OPAL_PREFIX and LD_LIBRARY_PATH. I occasionally get the following output in low-stress spawn runs:

prte: ../../../../../ompi-main/3rd-party/openpmix/src/class/pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.

but that is probably unrelated.

I tried to NOT set OPAL_PREFIX and instead use OMPI_PRTERUN (and setting PRTE_PREFIX because I build with embedded PRRTE) but it does not seem to work, and I cannot make sense of it. Anyway, I don't care that much about that case, the fact that OPAL_PREFIX is honored without need of setting PATH is all what I was concerned about.

@hppritcha
Copy link
Member

As per my local testing, things seem to be working OK when setting OPAL_PREFIX and LD_LIBRARY_PATH. I occasionally get the following output in low-stress spawn runs:

prte: ../../../../../ompi-main/3rd-party/openpmix/src/class/pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.

but that is probably unrelated.

I tried to NOT set OPAL_PREFIX and instead use OMPI_PRTERUN (and setting PRTE_PREFIX because I build with embedded PRRTE) but it does not seem to work, and I cannot make sense of it. Anyway, I don't care that much about that case, the fact that OPAL_PREFIX is honored without need of setting PATH is all what I was concerned about.

I think this assert is fixed by this - openpmix/prrte#1929

@hppritcha
Copy link
Member

@jsquyres per the 1-2-3 steps you outlined above i will merge this PR as step 1. I will open an issue for step 2 hopefully today but more likely tomorrow.

@hppritcha hppritcha merged commit 921e3db into open-mpi:main Mar 13, 2024
11 checks passed
@rhc54 rhc54 deleted the topic/dpm branch March 13, 2024 15:36
@hppritcha
Copy link
Member

@jsquyres do you think we should still do number two item above? maybe specific to v5.0.x?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants