
MPI_Comm_spawn inherit job options from the original job #5376

Closed
bosilca opened this issue Jul 4, 2018 · 9 comments

@bosilca
Member

bosilca commented Jul 4, 2018


Background information

Processes started with MPI_Comm_spawn inherit all parameters from the original job. For example, if the original job was launched with "-npernode 1", all future dynamically spawned processes carry the same constraint. I could not figure out a way, using MPI info keys, to clean the environment and remove the constraints.
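As an illustration (not code from the issue), here is a minimal spawn call that tries to override the inherited mapping through an MPI_Info key. The "map_by" key is an Open MPI-specific extension and "worker" is a placeholder executable name; as reported here, such keys let you replace a specific directive but offer no way to clear the inherited constraints outright.

```c
/* Sketch only: attempting to override an inherited "-npernode 1"
 * mapping via MPI_Info keys on the spawn call. "map_by" is an
 * Open MPI-specific info key; whether any key can *remove* an
 * inherited constraint is exactly what this issue is about. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Open MPI-specific key; other implementations ignore it. */
    MPI_Info_set(info, "map_by", "slot");

    /* "worker" is a placeholder executable name. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, info, 0,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```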

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

The issue was discovered on master, and can be replicated on all 3.x branches (and almost certainly in older versions, but I haven't checked).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Developer installation, i.e., a git clone followed by configure and make, with --enable-debug.

Please describe the system on which you are running

Issue can be replicated in multiple environments, with and without RM, and with multiple networks.

@rhc54
Contributor

rhc54 commented Jul 5, 2018

Yeah, this has always been a debatable point. Here is the relevant code (recall that npernode is translated to a ppr value):

    if (NULL == jdata->map->ppr && NULL != orte_rmaps_base.ppr) {
        jdata->map->ppr = strdup(orte_rmaps_base.ppr);
    }
    if (NULL != jdata->map->ppr) {
        /* get the procs/object */
        ppx = strtoul(jdata->map->ppr, NULL, 10);
        if (NULL != strstr(jdata->map->ppr, "node")) {
            pernode = true;
        } else {
            pernode = false;
        }
    } else {
        if (orte_rmaps_base_pernode) {
            ppx = 1;
            pernode = true;
        } else if (0 < orte_rmaps_base_n_pernode) {
            ppx = orte_rmaps_base_n_pernode;
            pernode = true;
        } else if (0 < orte_rmaps_base_n_persocket) {
            ppx = orte_rmaps_base_n_persocket;
            persocket = true;
        }
    }
    if (0 == jdata->map->cpus_per_rank) {
        jdata->map->cpus_per_rank = orte_rmaps_base.cpus_per_rank;
    }

You can see that we apply the MCA params given at the start of the job unless you override them. However, overrides are one-to-one: you can change the value of a specific MCA param directive, but you can't turn it "off".

So I guess the questions are: do MCA params only apply to the initial launch? Is that true for all MCA params (e.g., does it include BTL directives)? If only some, then which ones? Does the user decide, and if so, how do they tell us?

@bosilca
Member Author

bosilca commented Jul 6, 2018

Let's assume my app is created by 2 different services and that I need to start it in 2 phases: first a set of processes (one per node) that will act as managers, and then additional processes equitably divided among the available nodes. The npernode option on the original mpirun is convenient because I don't need to know how many nodes my allocation has (they are automatically extracted from the RM). But then I can't figure out how to start my second set of processes without handling HWLOC information in my application, and then messing around with MPI's view of the resources.

I see your concern about the scope of the original mpirun parameters. Personally, I think the mpirun parameters should only apply to the original app, while those in the MCA configuration file must be global. More precisely, we should treat all MCA parameters not coming from configuration files as equal, and provide a means either to clean the environment for spawned applications, so that the user can populate the new environment with only the information necessary for the new job, or to inherit all MCA parameters from the original job.

Going one step further, when spawning new processes we should not only be able to add more processes to the current allocation (the current behavior), but also to request a new allocation and spawn the new processes directly there. I am not sure how we can mix these two together yet, but if we want to provide generic dynamic process support we need to cover all cases.

@rhc54
Contributor

rhc54 commented Jul 6, 2018

I grok your suggestion about the mpirun cmd line params, and it would be relatively easy to remove those from the environment passed to the child processes. Solving your immediate problem, however, only requires that we not apply params related to launch (mapping, ranking, etc.) to dynamically spawned apps unless directed to do so. This would be a trivial change. How do we get the community to bless it?

I agree with your "one step further", and PMIx v2 supports that request. Problem is that we don't yet have an RM that supports the PMIx_Allocate API. I expect we will start to see that next year. Meantime, I could mock that behavior in PRRTE if you need a place to test it.
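For reference, a hedged sketch of how an application might invoke the allocation API referred to above as "PMIx_Allocate". This assumes the PMIx v2 headers: the nonblocking call PMIx_Allocation_request_nb, the PMIX_ALLOC_NEW directive, and the PMIX_ALLOC_NUM_NODES key are taken from the PMIx v2 standard; error handling is elided, and whether the host RM actually services the request is the open question in this thread.

```c
/* Sketch (PMIx v2): request a brand-new allocation of 2 nodes.
 * Nothing here is from the issue itself; names follow the PMIx v2
 * standard as best I can tell. */
#include <pmix.h>
#include <stdio.h>

/* Callback invoked once the RM has granted (or refused) the request. */
static void alloc_cbfunc(pmix_status_t status,
                         pmix_info_t *info, size_t ninfo,
                         void *cbdata,
                         pmix_release_cbfunc_t release_fn,
                         void *release_cbdata)
{
    printf("allocation request completed: %d\n", status);
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
}

static pmix_status_t request_new_nodes(void)
{
    pmix_info_t info[1];
    uint64_t nnodes = 2;

    /* PMIX_ALLOC_NUM_NODES carries the number of nodes requested. */
    PMIX_INFO_LOAD(&info[0], PMIX_ALLOC_NUM_NODES, &nnodes, PMIX_UINT64);

    /* PMIX_ALLOC_NEW = ask for a new allocation rather than extending
     * the current one. Completion is reported via alloc_cbfunc. */
    return PMIx_Allocation_request_nb(PMIX_ALLOC_NEW, info, 1,
                                      alloc_cbfunc, NULL);
}
```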

@bosilca
Member Author

bosilca commented Jul 6, 2018

The community that would be impacted by such a change is minimal, as there are currently very few users of the dynamic process capabilities. We can bring this up during one of our weekly calls to see what the rest of the community thinks.

For the addition of PMIx_Allocate we have the perfect testing environment: ULFM. We are currently working to add a non-blocking spawn to help applications that want to manage spare nodes, and in this context being able to request a new allocation would be an interesting capability for many of our users.

@rhc54
Contributor

rhc54 commented Jul 10, 2018

We talked about this on today's telecon and decided on a "first step" for OMPI v4.0 which branches at the end of this week. I'll add a new MCA param and cmd line option to indicate if launch directives are to be inherited or not (default to not) and then modify ORTE accordingly. This will affect the map-by, rank-by, bind-to, npernode, pernode, npersocket, persocket, and cpus-per-rank directives. I'll review the code and report any others I can identify that fit in this category.
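To make the decided behavior concrete, here is the two-phase launch from the scenario above. The spelling of the new inheritance option is hypothetical: the thread records only that a new MCA param and command-line option would be added for v4.0, not their final names, and "./manager" is a placeholder.

```shell
# Phase 1: one manager per node. Under the pre-v4.0 behavior, this
# mapping directive is silently inherited by any MPI_Comm_spawn job.
mpirun -npernode 1 ./manager

# With the change decided on the telecon, spawned jobs no longer inherit
# map-by, rank-by, bind-to, npernode, pernode, npersocket, persocket, or
# cpus-per-rank by default. To opt back in, the user would pass the new
# option -- name HYPOTHETICAL, not recorded in this thread:
#   mpirun --mca rmaps_base_inherit 1 -npernode 1 ./manager
```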

The broader issue of inheritance got too thorny to resolve in time for the OMPI v4.0 branch - we'll deal with those later.

@bosilca
Member Author

bosilca commented Jul 10, 2018

👍

@gpaulsen gpaulsen reopened this Jul 10, 2018
@gpaulsen
Member

oops, sorry for closing.

@rhc54
Contributor

rhc54 commented Jul 10, 2018

In looking at it, I wonder if the oversubscribe and overload directives should always be inherited? These seem like pretty common conditions to hit when dynamically spawning. For now, I've included them in the "do not inherit by default" category, as someone can always put those flags in their spawn request, but I wanted to raise the question.

@rhc54
Contributor

rhc54 commented Sep 11, 2018

Committed to v4.0.x
