Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mca_base_param_files option is no longer read and should be removed #11532

Closed
wckzhang opened this issue Mar 24, 2023 · 28 comments
Closed

mca_base_param_files option is no longer read and should be removed #11532

wckzhang opened this issue Mar 24, 2023 · 28 comments

Comments

@wckzhang
Copy link
Contributor

ompi_info and the code show an mca parameter as:

MCA mca base: parameter "mca_base_param_files" (current value: "/home/ec2-user/.openmpi/mca-params.conf,/home/ec2-user/ompimaininstall/etc/openmpi-mca-params.conf", data source: default, level: 2 user/detail, type: string, synonyms: mca_param_files)

However, when I actually provide this parameter, it doesn't seem to be getting picked up.

@bosilca
Copy link
Member

bosilca commented Mar 24, 2023

Thanks ! I was just struggling with the same issue. The reason is still eludes me.

@rhc54
Copy link
Contributor

rhc54 commented Mar 24, 2023

Please see the explanation in #11459

@wckzhang
Copy link
Contributor Author

We should probably remove the code and have it not show up in ompi_info considering it's not being used from mpirun

@wckzhang wckzhang changed the title mca_base_param_files option doesn't seem to be working mca_base_param_files option is no longer read and should be removed Mar 28, 2023
@jsquyres
Copy link
Member

We should probably remove the code and have it not show up in ompi_info considering it's not being used from mpirun

Please see #11459 (comment)

@wckzhang
Copy link
Contributor Author

We decided during the weekly telecon that our course of action here is to have a hard abort if the user attempts to use this parameter.

@wckzhang
Copy link
Contributor Author

Created an issue in PRRTE - openpmix/prrte#1731
There's a conflict with how we prefix different projects and needs more community feedback.

@awlauria
Copy link
Contributor

awlauria commented May 2, 2023

@wenduwan
Copy link
Contributor

wenduwan commented May 2, 2023

@jsquyres @rhc54 Can we assign this issue to another person as William is no longer working on ompi?

@jsquyres
Copy link
Member

jsquyres commented May 2, 2023

Sure, who do you want me to assign it to?

@wenduwan
Copy link
Contributor

wenduwan commented May 2, 2023

Thanks Jeff. Could you assign it to @qkoziol for now?

@jsquyres jsquyres assigned qkoziol and unassigned wckzhang May 2, 2023
@qkoziol
Copy link
Contributor

qkoziol commented Jun 23, 2023

After extensive exploration of the issues, PRs, wiki notes, and code for this issue, I believe that the final task at this point is this action item from the (above) wiki page notes:

Have OMPI / OPAL layer read in param/tuned files, shove everything into the environment. Then, later, when PMIx is
initialized, we'll have to pass an attribute that says "this is an OMPI process" so that PMIx can see/react the OMPI_MCA
env vars and translate them to the appropriate PMIx and PRTE MCA params.

I'm going to look at that piece of code and see what it's actually doing currently and how to effect the change desired.

@qkoziol
Copy link
Contributor

qkoziol commented Jun 28, 2023

Update from @rhc54 about this not being the correct approach:

no, that is not what we discussed. the point i made was that you had to play some games in OPAL to deal with the case when mpirun was NOT used to launch your processes - e.g., when srun is used. In that lone case, you have to deal with the param files yourself, parsing out any PMIx-related params and exposing them so PMIx will see them.

which is why PRRTE sets an envar to tell OPAL "I already handled everything - don't do it again"

there already is a check in OPAL to avoid the duplication. the only thing that had to be added was the parsing out of PMIx-related params when OPAL actually has to do the work itself. There is a PMIx utility for that purpose, so it isn't a heavy lift.

@qkoziol
Copy link
Contributor

qkoziol commented Jun 28, 2023

I've asked Ralph for more details (about whet section of code he believes is involved) and am also investigating in parallel also.

@qkoziol
Copy link
Contributor

qkoziol commented Jun 28, 2023

Jeff chatted w/Ralph earlier today, Ralph is looking at whether this is possibly already addressed w/changes in PRRTE already. If it has, then it may need to be backported to the branch of PRRTE that OMPI is using and the submodule pointer updated.

@rhc54
Copy link
Contributor

rhc54 commented Jun 29, 2023

So this is pretty much complete, minus my comment above. Specifically, here is how this works:

For the case of mpirun-based launch:

  • when the PMIx server in the daemon is setting itself up, it initializes its MCA base. This causes the detection of any param file MCA params and the subsequent reading of those files. The info is stored on an internal list - including any "override" settings.
  • the server then passes that list across all the programming model (pmdl) components so they can extract any recognized values. This includes the OMPI component, which has a list of currently known components - this can be overridden by a list provided by envar (code copied from PRRTE). Recognized values are stored on a component-internal list.
  • when procs for an app are being fork'd, we use the pmdl component corresponding to that app (was included in the app description) to load any provided MCA params into their environment.

This has all been implemented and is in the v4.2.4 tag. Note that I don't currently worry about PRRTE params - imagine this is something I will add at some point. One can specify multiple files in the "param_files" value, so you can combine a file of PMIx-specific values with one from OMPI.

For direct launch (e.g., srun):
You already detect the PRRTE-provided envar indicating that you were launched by PRRTE, and thus do not need to process any MCA param files. In the event that you were direct launched, your code already handles the parsing of the files. However, you simply place them on an internal list, and you don't check to see if they relate to PMIx as opposed to OMPI.

If you care about that, then you would need to check each entry to see if it references a PMIx framework/component. You could copy the code in PRRTE's src/mca/schizo/base/schizo_base_stubs.c:prte_schizo_base_check_pmix_param. I'm going to create a PMIx utility that contains that code so it doesn't get copied all over the place, but that means you'd have to bump to PMIx v4.2.5 - dunno if you care.

Feel free to test this and report problems. From my perspective, you can close this issue - or you can leave it open if someone wants to implement the direct launch support.

@rhc54
Copy link
Contributor

rhc54 commented Jun 29, 2023

Actually, I take that back about direct launch. In that scenario, both OPAL and PMIx are going to read the param files, each extracting what it needs. So there is nothing further to do here. Sadly, it means each proc is opening/reading these files twice - I suppose someone can optimize that if they care.

@rhc54
Copy link
Contributor

rhc54 commented Jun 29, 2023

Sorry for the noise, but returning to direct launch. The only way to pass param file options in this situation is by envar. This means that the user must set an appropriate OMPI_MCA_ value pointing at the file to be processed. Since it is an OMPI prefix, PMIx won't recognize it - and thus, only OMPI will process it.

Of course, the user could provide both OMPI and PMIx values - and we currently handle that just fine, whether they point to the same or different files.

If someone wants to process the OMPI file and handle any PMIx-related values in it, feel free. Another option would be to replicate the OMPI envar with a PMIx prefix and push it into the environment for PMIx to pick up and process.

I'm not sure it is worth it myself, but up to you folks.

@qkoziol
Copy link
Contributor

qkoziol commented Jun 30, 2023

Thanks for the detailed analysis @rhc54 ! I'll bring it up at the next OMPI developer meeting to see if the PMIx value aspect should be addressed.

@qkoziol
Copy link
Contributor

qkoziol commented Jul 17, 2023

We discussed this at last week's OMPI developer meeting. The outcome of that discussion was that OMPI should replicate the appropriate envvar(s) with a PMIx prefix and put those back into the environment for PMIx to process.

@rhc54
Copy link
Contributor

rhc54 commented Jul 18, 2023

I'm not sure I totally grok the comment, so let me just try to see if I can clarify a bit. My prior suggestion only dealt with the mca_base_param_files parameter. Pushing a replicated version of that param into the environ will indeed cause PMIx to pick it up, and any PMIx names in it will get handled. Just ensure it only gets done if the envar indicating that PRRTE already handled it isn't present.

What will not be handled are any translations or overlaps of params in those files. For example, an ORTE oob param specifying the network to use for runtime TCP connections will be ignored as the PMIx param has a different name. This may (likely will?) lead to unexpected behaviors when coming from prior OMPI versions.

Again, only pertains to direct launch. There are functions in PRRTE for detecting and dealing with these translations and overlaps if you want to copy them. Unfortunately, we cannot just move them to PMIx as it would have to be executed in the pmdl/ompi component, and that gets opened and called too late to affect many of the more common MCA params.

@jsquyres
Copy link
Member

I'm recording my own notes here, just because I only pop in and out of PRTE/PMIx issues periodically, and the information does not stay resident in my brain cache. This is what I discovered via code diving and talking to @rhc54.

  • For direct launch: in the case of mca_base_param_files if we also rename this var and effectively give it to PMIx, then every process opens the file twice (in the OMPI layer and the PMIx layer) -- this is bad for scale.
  • Also, renaming+giving mca_base_param_files doesn't get us backwards compatibility for the reasons @rhc54 cited (e.g., OOB name needs to be translated)
    • Note: PRTE has MCA-var-name translator functions. The PRTE schizo OMPI component calls these translator functions -- you can find the specific function names there.
  • What we should do (in OPAL in the MPI process, not mpirun):
    • if direct launched (note: there's an env var that indicates whether the process was direct launched or mpirun launched, and I'm pretty sure that there's already an "if" statement in OPAL somewhere that checks this var)
    • then we should call these MCA var name translation functions. We'll need to actually copy these functions from PRTE to OMPI->OPAL, since there's no libprte.

qkoziol added a commit to qkoziol/ompi that referenced this issue Aug 14, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
@qkoziol
Copy link
Contributor

qkoziol commented Aug 14, 2023

PRs to OMPI, PRRTE, and OpenPMIX to address the direct launch case:

#11854
openpmix/prrte#1777
openpmix/openpmix#3133

qkoziol added a commit to qkoziol/ompi that referenced this issue Aug 15, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
qkoziol added a commit to qkoziol/ompi that referenced this issue Aug 15, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
@qkoziol
Copy link
Contributor

qkoziol commented Aug 15, 2023

Dropped the PMIX PR, and updated the other two, based on review feedback. Please re-review.

qkoziol added a commit to qkoziol/ompi that referenced this issue Aug 21, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
qkoziol added a commit that referenced this issue Sep 5, 2023
Address Github issue #11532 by translating legacy parameters for direct launches
@qkoziol
Copy link
Contributor

qkoziol commented Sep 5, 2023

OMPI PR merged to main, needs merged to 5.x branch

qkoziol added a commit to qkoziol/ompi that referenced this issue Sep 7, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
qkoziol added a commit to qkoziol/ompi that referenced this issue Sep 7, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
(cherry picked from commit 5d236e9)
Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
@qkoziol
Copy link
Contributor

qkoziol commented Sep 7, 2023

PR for merge to 5.x branch: #11916

qkoziol added a commit to qkoziol/ompi that referenced this issue Sep 8, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
(cherry picked from commit 5d236e9)
Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
qkoziol added a commit to qkoziol/ompi that referenced this issue Sep 9, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
(cherry picked from commit 5d236e9)
Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
qkoziol added a commit to qkoziol/ompi that referenced this issue Sep 9, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
(cherry picked from commit 5d236e9)
Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
@qkoziol
Copy link
Contributor

qkoziol commented Sep 9, 2023

PR for merge to 5.x branch: #11916

Corrected PR for merge to 5.x branch: #11921

qkoziol added a commit to qkoziol/ompi that referenced this issue Sep 11, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
(cherry picked from commit 5d236e9)
Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
qkoziol added a commit to qkoziol/ompi that referenced this issue Sep 11, 2023
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
(cherry picked from commit 5d236e9)
Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
@qkoziol
Copy link
Contributor

qkoziol commented Sep 11, 2023

New new PR for bringing this to the v5.0.x branch: #11923

TIL: Clicking "sync branch" on Github will close any PRs open on that branch, likely due to the branch discarding those commits

awlauria added a commit that referenced this issue Sep 11, 2023
Address Github issue #11532 by translating legacy parameters for direct launches
@janjust
Copy link
Contributor

janjust commented Sep 11, 2023

Addressed with #11923

@janjust janjust closed this as completed Sep 11, 2023
bosilca pushed a commit to bosilca/ompi that referenced this issue Feb 14, 2024
…for direct launches

Borrow code from the OMPI schizo module in PRRTE that translates legacy
MCA parameters when an application is direct launched (PRRTE will translate
legacy parameters when natively launched).

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants