
investigate PMIx #12

Open
rongou opened this issue Jun 5, 2018 · 20 comments

@rongou
Member

rongou commented Jun 5, 2018

The Open MPI people suggested running a PMIx server on each worker pod and using the PMIx API to launch processes. We need to investigate whether that's a better approach.

SLURM has some useful information: https://slurm.schedmd.com/mpi_guide.html
PMIx home: https://pmix.org/

@rongou added the "help wanted" label on Jun 5, 2018
@yncxcw

yncxcw commented Jun 8, 2018

Hi, how can I get involved with this issue?

@rongou
Member Author

rongou commented Jun 8, 2018

I really don't know much about PMIx. If you are interested, you can try to prototype a solution.

Right now we start the worker pods and have them sleep; the launcher then calls mpirun to launch the processes remotely. With PMIx, my understanding is that each worker pod would start a PMIx server, and the launcher could then start the processes using the PMIx API.

Probably need to dig a bit into Open MPI and/or SLURM code to figure this out.

@yncxcw

yncxcw commented Jun 8, 2018

I see. I can give this issue a try.

@rhc54

rhc54 commented Jun 8, 2018

Just to help clarify a bit: PMIx is just a library - there is no PMIx server "daemon" to run. The way you use it is to have your local launcher daemon on each node dlopen (or link against) the PMIx library and initialize it as a "server" (instead of a "client" or "tool"). This provides access to all the PMIx APIs, including the ones dedicated to server-side operations (see the pmix_server.h header for a list of them).

You would use these to get a launch "blob" for configuring the network prior to starting procs on the compute nodes, and for various other operations. Your launcher would still start the processes itself.
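
For illustration, here is a minimal sketch of what initializing the library in "server" mode looks like, assuming the OpenPMIx C headers and library are installed. The pmix_server_module_t callbacks are left empty here; a real launcher daemon would implement the ones it needs:

```c
#include <stdio.h>
#include <pmix_server.h>

/* Server-side callback module: a real launcher daemon would implement
 * the callbacks it needs (client_connected, fence_nb, spawn, ...).
 * Leaving them NULL is sufficient to bring the library up. */
static pmix_server_module_t mymodule = {0};

int main(void)
{
    pmix_status_t rc;

    /* Initialize the PMIx library in "server" mode. */
    rc = PMIx_server_init(&mymodule, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_server_init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* ... register the namespace and local clients, hand each client its
     * environment (the launch "blob"), then fork/exec the app procs ... */

    PMIx_server_finalize();
    return 0;
}
```

It would typically be built against the installed library (e.g. with -lpmix).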

I'd be happy to advise/help get this running - we'd love to see it in Kubernetes!

@rhc54

rhc54 commented Jun 19, 2018

Please do holler if/when I can be of help. I confess my ignorance of the Kubernetes internals, but am happy to advise (and hopefully be educated along the way). I'd like to see all the MPIs and workflow-based models that rely on interprocess messaging get supported.

@gaocegege
Member

/assign @zw0610

@zw0610
Member

zw0610 commented Apr 16, 2021

From my experience with Slurm built with PMIx, users need no 'launcher pod' for each submitted job. This seems like a clear benefit for mpi-operator.

I've been searching for a minimal working example/tutorial on using OpenPMIx without Slurm for a while, but haven't succeeded. So @rhc54, would you mind providing us with such a tutorial/example for setting up a PMIx environment and launching an MPI task? It seems related to prte, but the whole workflow is not very clear to me.

@rhc54

rhc54 commented Apr 16, 2021

Happy to try. I'll read up a bit on Kubernetes and Kubeflow so I can try to provide more concrete direction (pointers to material would be welcome!). Meantime, you might be able to gain some insights from the following:

  • https://www.sciencedirect.com/science/article/abs/pii/S0167819118302424?via%3Dihub is a paper that explains PMIx and how it fits within an application lifecycle. In particular, it walks you thru the launch and wireup procedure for an MPI app. Everything in there remains accurate, though we have extended some of those features to support async wireup of process collections to support non-MPI programming models such as data analytics and deep learning.
  • https://openpmix.github.io/uploads/2019/04/PMIxSUG2019.pdf is a presentation given to the Singularity user group about how one can utilize PMIx in a containerized "cloud" setting. There is a video of the presentation (https://sylabs.io/videos) - it is the one entitled "PMIx: Bridging the Container Boundary".
  • PRRTE is the PMIx reference RTE, so it does provide a complete example of a full PMIx supported environment. This link (https://openpmix.github.io/code/getting-the-pmix-reference-server) starts you on a step-by-step walkthru for installing it and running applications with it. I can point you to the relevant areas of the code base, but it can be a little dense to jump into, I'm afraid.
  • There is a 3-part video (produced by the EasyBuild folks) that covers the ABCs of OpenMPI and PMIx (starts with https://www.youtube.com/watch?v=WpVbcYnFJmQ). It is mostly focused on how the code is organized, but I did spend a fair amount of the time giving an overview of how jobs get launched, so it might be worth a look.

I'll work on a wiki specifically aimed at Kubernetes as a couple of organizations have expressed interest in such an integration, especially with the PMIx support for app-directed optimized operations becoming more popular with the workflow community. Can't promise a completion date, but I'll do my best.

Meantime, please feel free to ask questions.

@zw0610
Member

zw0610 commented Apr 17, 2021

Thank you so much for your prompt help, Ralph.

I watched your presentation video, and I believe there are two scenarios for making Kubernetes work with PMIx:

  1. Each container is treated as an RM (resource manager) node, with the RM daemon running as the entrypoint process. When a new MPI task is dispatched, new processes can be launched via the RM daemon.
  2. The PMIx client is wrapped so that Kubernetes (the kubelet) can treat it as a container runtime.

While the second scenario looks more native to Kubernetes, the first one is much more similar to the current design of this repo (mpi-operator) and should take less effort to achieve. So I prefer the first one as the short-term, smaller-scoped approach, and I will go through the material you offered.

As some users/developers have suggested moving mpi-operator from the Kubeflow community to the Kubernetes community, we can try the second option as a long-term, broader-scoped project after we accumulate enough experience from the first attempt.

@rhc54

rhc54 commented Apr 17, 2021

The negative to the first option is that you still have to completely instantiate a secondary RM - e.g., if that RM is Slurm, then one of the containers must include the slurmctld, and the other containers must include the required info for the slurmd daemons to connect back to that slurmctld. This means that the users who construct these containers must essentially be Slurm sys admins, or at least know how to install and setup Slurm.

Alternatively, someone (probably the sys admin for the target Kubernetes environment) could provide users with a "base" container that has Slurm setup in it. However, that places constraints on the user as (for instance) the sys admin is unlikely to provide a wide array of containers users can choose from based on various operating systems. The sys admin would also have to guess when configuring Slurm as to how a user plans to utilize the container - or else the user will have to learn enough about Slurm to at least modify the configuration as required for their use-case.

The objective of the second option is to eliminate the need for a secondary RM and allow the user's container to strictly focus on the application. As you note, it does require that the PMIx server be integrated into Kubernetes itself so that the applications in the containers can remain simple PMIx clients. However, I believe it would best support the growing interest in HPC workflow computing (i.e., breaking down the traditional MPI bulk-synchronous model into many independent tasks that coalesce into the answer) and hybrid (data analytics + deep learning + MPI) programming models. It is the only method that lets the user focus solely on their application instead of having to learn how to setup and manage an RM, and the only method that allows the container to be portable across systems (e.g., a Kubernetes-based cloud and a Slurm-based HPC cluster).
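
For a sense of scale, here is a minimal sketch of what such a "simple PMIx client" looks like in C; an MPI library built with PMIx support makes essentially these calls internally, and the local server it connects to could be prte today or, under the second option, a Kubernetes-integrated server:

```c
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    /* Connect to the local PMIx server provided by the runtime. */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }
    printf("Hello from %s rank %u\n", myproc.nspace, myproc.rank);

    /* Synchronize with the other processes in the job. */
    rc = PMIx_Fence(NULL, 0, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Fence failed: %s\n", PMIx_Error_string(rc));
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```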

Personally, I believe the second option is the better one and have focused my attention on it. However, I certainly understand it is more challenging and you may choose to pursue the first option in its place, at least for now. Let me know if/how I can help.

@zw0610
Member

zw0610 commented May 10, 2021

Let me update the progress so far. But first, sorry for the late update, as I was working on the python-sdk-for-mpijob feature.

I went through most of the material mentioned by Ralph and got a fairly basic understanding of PMIx. Following this article from PBS Pro, I've made a Docker image with both OpenPMIx and PRRTE installed.
(Please note that 1) neither /opt/pmix/bin nor /opt/prrte/bin is included in PATH, and 2) no entrypoint is specified for the image.)

After starting PRRTE with prte -d, I was able to use prun to launch processes within the same container (where prte is running). So far, I am blocked by two issues:

  1. prun -H <hostfile> -n 2 xxx fails when prun is executed in another container (on k8s). In short, I have not been able to launch processes remotely via prun and prte.
  2. Even if I were able to launch processes remotely with prte and prun, that still doesn't show me exactly how to get rid of the launcher container, i.e. how to start a job without prun. Maybe we can follow the PMIx standard and let mpi-operator tell prte on each worker pod directly what processes should be launched (see the sketch after this list). Is that workable? Is it really a good idea?
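
One way that "telling prte directly" could look, purely as an assumption, is to have the controller connect to the prte daemon as a PMIx tool and request a spawn through the PMIx API instead of shelling out to prun. A minimal sketch (the ./ring executable and the process count of 2 are placeholders):

```c
#include <stdio.h>
#include <string.h>
#include <pmix_tool.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_app_t app;
    pmix_nspace_t nspace;
    pmix_status_t rc;

    /* Connect to a running PMIx server (e.g. the prte daemon on this node). */
    rc = PMIx_tool_init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_tool_init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Describe the application to launch; "./ring" and 2 procs are placeholders. */
    PMIX_APP_CONSTRUCT(&app);
    app.cmd = strdup("./ring");
    PMIX_ARGV_APPEND(rc, app.argv, "./ring");
    app.maxprocs = 2;

    /* Ask the server to spawn the job and return its namespace. */
    rc = PMIx_Spawn(NULL, 0, &app, 1, nspace);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Spawn failed: %s\n", PMIx_Error_string(rc));
    } else {
        printf("Spawned job in namespace %s\n", nspace);
    }

    PMIX_APP_DESTRUCT(&app);
    PMIx_tool_finalize();
    return 0;
}
```

Whether prte accepts such a tool connection from outside the pod would still depend on solving the same remote-connection problem as in issue 1.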

But anyway, let me fix the first issue and I'll keep updating here.

@ArangoGutierrez
Contributor

/assign

@thoraxe

thoraxe commented Nov 14, 2022

There hasn't been much motion on this issue. Is there anything I can do to help?

@rhc54

rhc54 commented Nov 14, 2022

I would love to see this completed, if possible. I learned that a colleague at IBM (@jjhursey) is going to describe a Kubernetes/PMIx effort this week - perhaps he could share how that might relate here?

@alculquicondor
Collaborator

It can also be shared during a kubeflow training meeting. Please let me know if you plan to do so, as I'd like to attend.

@rhc54

rhc54 commented Nov 14, 2022

Also, for those of you at SC this week - Jai Dayal (@jaidayal) of Samsung is going to give a very brief description of the work we are collaborating on, which uses PMIx to integrate dynamic applications with a dynamic scheduler, at the PMIx BoF meeting. There will also be a couple of other talks about similar efforts. I would heartily recommend attending, if you can. If nothing else, it might be worth your while to make the connections so as to follow those works.

My expectation is that we will be releasing several updates next year focused on dynamic operations - preemption of running jobs, request for resource allocations/adjustments, etc. The BoF will provide an introduction to those efforts.

@jjhursey

My talk at SC22 was presented at the CANOPIE-HPC workshop.

The organizers should be posting the slides at some point. I'm planning on giving a more PMIx-focused version of the talk during the PMIx Standard ASC meeting in January (link).

@alculquicondor
Collaborator

alculquicondor commented Nov 15, 2022

Thanks, I'll take a look once the talks are available.

Shameless plug, in case you didn't know: https://opensource.googleblog.com/2022/10/kubeflow-applies-to-become-a-cncf-incubating-project.html

It would be great if the PMIx solution could be integrated with Kubeflow or (even better) the Kubernetes Job API directly.

@ahg-g

ahg-g commented Nov 15, 2022

/cc

@rhc54

rhc54 commented Nov 15, 2022

> Shameless plug, in case you didn't know: https://opensource.googleblog.com/2022/10/kubeflow-applies-to-become-a-cncf-incubating-project.html

Ah - no, I was unaware of this! Congrats to all involved. I'm retired and so wouldn't really be able to write the integration code, but I am happy to advise and/or contribute where possible if someone wishes to pursue this. Having a "native" way of starting parallel applications in a Kubeflow environment seems desirable, and extending that later to support dynamic integration with the scheduler itself would be a win-win for all.
