restoring more hwloc --cpu-set behavior from OMPI 3.x #6755

markalle · 2019-06-11T23:58:00Z

There are two commits here

restoring --cpu-set behavior that OMPI 3.x had
addressing an error detection (for oversubscription) that was too aggressive

I made an older checkin bdd92a7
that partially restored --cpu-set behavior using OMPI 3.x code,
and I was about to restore more code from 3.x, but I'm taking a
different approach here and reverting the previous addition of
3.x code too.

For the (partial) revert:
commit: bdd92a7.
title: "-cpu-set as a constraint rather than as a binding"
I'm reverting the functional changes to how the hwloc tree is
iterated over, but keeping the error-detection changes modeled
after 3.x.

For the new approach to --cpu-set:

The --cpu-set option is logically similar to running under a cgroup
just without the OS-level enforcement that comes with a cgroup. For
cgroups the new code loads the topology without the WHOLE_SYSTEM
flag so the tree only contains what's in the cgroup. We can do the
same thing with a hwloc_topology_restrict() call to constrain the
topology to whatever --cpu-set the user enters. Then all the 3.x
code that skips over unavailable tree entries becomes unnecessary.

Here are the cases that were fixed in bdd92a7,
and which are confirmed to still work with the new approach:

hardware: [..../..../..../....] numbered sequentially 0-15

% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by hwthread --cpu-set 6,7 hostname
MCW rank 0 [..../..B./..../....]
MCW rank 1 [..../...B/..../....]

% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
MCW rank 0 [..../..BB/..../....]
MCW rank 1 [..../..../..../BB..]

The additional case I was in progress to fix here and which
is also handled by the new approach is

% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by ppr:2:node:pe=3 --cpu-set 4,5,9,11,14,15 ./x
MCW rank 0 [..../BB../.B../....]
MCW rank 1 [..../..../...B/..BB]

jjhursey · 2019-06-12T00:51:16Z

bot:ibm:retest

jjhursey · 2019-06-12T01:04:16Z

bot:ibm:retest

bgoglin · 2019-06-12T05:30:08Z

hwloc_topology_get_allowed_cpuset(topo) is only useful if hwloc is configured to list disallowed CPUs (those outside of our cgroup). AFAIK, that's not the case in OMPI anymore (HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM is never passed to hwloc_topology_set_flags())

jjhursey · 2019-06-12T14:23:48Z

bot:ibm:retest

markalle · 2019-06-12T18:22:12Z

Oh gosh, I was about to ask how --report-bindings manages to display the affinity as a portion of the whole machine if WHOLE_SYSTEM isn't used... but it doesn't anymore. I guess that's a matter for a separate discussion what --report-bindings should show. I prefer the old way where the whole machine is there. Eg if I run in a tiny cgroup that contains cores 6,7,10,11 I'd prefer to see

[..../..BB/..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../....]

than [BB/..][]
Do you know if this change was intentional or just a side effect of other changes?

I think I agree this makes hwloc_topology_get_allowed_cpuset() unnecessary, but the intersections with rdata->available would still be needed

bgoglin · 2019-06-12T19:35:30Z

Enabling WHOLE_SYSTEM makes things much more complicated since you have to manually ignore disallowed objects when iterating, counting objects, etc. I don't remember the actual reason why it was disabled, but enabling it just for having a nicer --report-bindings would seem crazy to me.

I don't see an easy way to bring back the feature. hwloc still has the info that some other PUs exist, but we don't know which cores/sockets they correspond to.

If the transmitted XML contains disallowed PUs, one could load a secondary topology for --report-bindings with the WHOLE_SYSTEM flag.

rhc54 · 2019-06-12T21:45:07Z

It was a very deliberate decision - the inclusion of disallowed PUs confused people (and software) that didn't realize they had to ignore them. As @bgoglin explained, it made things much easier and less confusing to users.

Please don't bring it back 😄

markalle · 2019-06-13T22:31:54Z

I'll open a separate PR related to --report-bindings.

For the purpose of this PR I like the decision to not use WHOLE_SYSTEM, and I'm curious if an analogous approach should be used with rdata->available. I'd consider --cpu-set to be an analogous concept, it's like specifying a cgroup minus the enforcement that comes with a real cgroup. It seems like the bitwise intersections with rdata->available are scattered around quite a bit in the 3.x code with lots of checks and skips in the loops like mentioned above. I'd rather see the whole topology modified early on to restrict it to rdata->available, close to topology_load time.

I haven't gone down that path, but might consider it for the future if that sounds plausible to others.

markalle · 2019-06-18T17:03:14Z

I was about to modify this PR to remove the calls to hwloc_topology_get_allowed_cpuset() since none of the topologies use WHOLE_SYSTEM anymore, but I heard there's discussion about whether the mappings are right under the new system and it might be coming back (?). I'm not clear on why they'd be wrong, empirically I'm able to put different cgroups on each host and mpirun is seeing a correct looking just-what's-available topology for each host and the assignments it makes are translating out correctly back at the ranks, at least for my runs.

But anyway where does that put this PR? Here I'm restoring 3.x behavior for the --cpu-set option and currently still have code that intersects with hwloc_topology_get_allowed_cpuset() which shouldn't be harmful but might be unnecessary

rhc54 · 2019-06-19T17:54:58Z

Afraid I don't know what you are talking about relative to this PR. Removing get_allowed_cpuset makes sense per the comments from @bgoglin . The other discussion related to your report-bindings PR which has some problems associated with it.

jjhursey · 2019-06-19T21:03:14Z

@rhc54 You mentioned on the OMPI teleconf that you were concerned that the process mapping was going to be wrong in an environment with cgroups set this way - since the hwloc signature doesn't include the offline/online information. I think that's what @markalle is asking about.

rhc54 · 2019-06-19T22:22:20Z

Understood, but that doesn't have anything to do with this PR. Looking at it again, my question would be: if you remove the get_allowed_cpuset changes, is there anything left?

markalle · 2019-07-02T00:08:08Z

The reason I brought up the teleconference conversation was to ask if the WHOLE_SYSTEM + get_allowed_cpuset() might be coming back due to the concerns about mapper output with the new non-WHOLE_SYSTEM topologies or if the former WHOLE_SYSTEM topologies are really staying gone. That question would affect whether to keep the get_allowed_cpuset() in this PR.

But even if we keep the topologies as non-WHOLE_SYSTEM and remove the get_allowed_cpuset() here, there is still active code change in this PR. The --cpu-set option puts a mask in
rdata->available
In OMPI 3.x that value was checked a bunch of places as the hwloc tree was searched and iterated over. Much of that code was removed in 4.x producing issues like the one addressed here.

jsquyres · 2019-07-02T15:36:35Z

This is also planned to be discussed on the 9 July 2019 webex.

markalle · 2019-07-18T20:59:57Z

I reverted not just the current addition of 3.x code but also a previous checkin from april that had also added 3.x code for skipping over unavailable tree entries. And I switched to a hwloc_topology_restrict() call that makes --cpu-set get handled analogously to how a cgroup is handled. That approach always seemed more logical to me; I just hadn't bothered to see if it would work, figuring restoring 3.x behavior was safer than taking a new approach.

But I bet cgroups get more attention and testing than --cpu-set, so piggybacking that approach ought to be good

rhc54 · 2019-07-24T04:21:21Z

You can squash these if you like - probably need to do so to quiet the signed-off-by checker. I simply fixed some code style violations. Something to perhaps keep in mind for future PRs.

I have no issue with these changes.

rhc54 · 2019-07-24T22:15:55Z

Just a second - something just hit me.

@bgoglin What happens if an application proc calls hwloc_topology_restrict on a shared memory topology tree owned by the daemon? What if it tries to add an object to the "userdata" field?

bgoglin · 2019-07-25T06:00:25Z

@rhc54 It's not allowed, it's a read-only shared mapping.

jsquyres · 2019-07-29T19:25:39Z

@markalle Are you planning to update this PR (per @bgoglin's comments)?

markalle · 2019-07-31T19:14:00Z

Well, I think it should be okay, but would like confirmation.

The code we're concerned about is where I added a call at the bottom of

int opal_hwloc_base_filter_cpus(hwloc_topology_t topo)
{
    ...

// Treat --cpu-set analogously to how a cgroup is treated, eg make
// the loaded topology only contain what is available.
    hwloc_topology_restrict(topo, avail, 0);
}

to shrink the topology in a manner that I expect is similar to what would happen if I had created an equivalent cgroup and had done a topology_load() from inside that cgroup.

My two reasons for believing this is okay are

looking at where opal_hwloc_base_filter_cpus() is called, it sure looks like all its uses come from topologies that had just been created by ordinary calls to hwloc_topology_init() and hwloc_topology_load(), and
nearly the first line of opal_hwloc_base_filter_cpus() is already setting root->userdata so I think that means it's already not a legal function to call on a shared topology

markalle · 2019-09-10T21:22:16Z

I repushed with Ralph's code violation fixes squashed into my commits.

In bind_generic() there's a loop that picks a starting trg_obj and then walks through a loop of next = trg_obj->next_cousin until it has made total_cpus assignments. But the code doesn't accept that those assignments might not be adjacent objects. Example: % mpirun -np 2 --report-bindings --map-by ppr:2:node:pe=3 \ --cpu-set 4,5,7,8,9,11 -bind-to hwthread:overload-allowed > MCW 0 : [..../BB.B/..../....] > MCW 1 : [..../..../BB.B/....] It will want to assign 3 cpus and will loop through trg_obj 00001 (with ncpus 1) trg_obj 000001 (with ncpus 1) trg_obj 0000001 (with ncpus 0) trg_obj 000000011 (with ncpus 1) The original code on the third entry would see num_bound for the object become too high for its ncpus and think oversubscription was happening. I changed it to only ++num_bound eg to use that object if the object has cpus in its cpuset after intersected with the allowed/available masks. The error message from the original code (if you remove the overload-allowed) would be > A request was made to bind to that would result in binding more > processes than cpus on a resource: > Bind to: HWTHREAD > Node: ... > #processes: 1 > #cpus: 0 Signed-off-by: Mark Allen <markalle@us.ibm.com> [ This checkin also includes a squashed commit from Ralph Castain to fix code style violations. ]

I made an older checkin bdd92a7 that partially restored --cpu-set behavior using OMPI 3.x code, and I was about to restore more code from 3.x, but I'm taking a different approach here and reverting the previous addition of 3.x code too. For the (partial) revert: commit: bdd92a7. title: "-cpu-set as a constraint rather than as a binding" I'm reverting the functional changes to how the hwloc tree is iterated over, but keeping the error-detection changes modeled after 3.x. For the new approach to --cpu-set: The --cpu-set option is logically similar to running under a cgroup just without the OS-level enforcement that comes with a cgroup. For cgroups the new code loads the topology without the WHOLE_SYSTEM flag so the tree only contains what's in the cgroup. We can do the same thing with a hwloc_restrict_topology() call to constrain the topology to whatever --cpu-set the user enters. Then all the 3.x code that skips over unavailable tree entries becomes unnecessary. Here are the cases that were fixed in bdd92a7, and which are confirmed to still work with the new approach: hardware: [..../..../..../....] numbered sequentially 0-15 % mpirun -np 2 --report-bindings --bind-to hwthread \ --map-by hwthread --cpu-set 6,7 hostname > MCW rank 0 [..../..B./..../....] > MCW rank 1 [..../...B/..../....] % mpirun -np 2 --report-bindings --bind-to hwthread \ --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname > MCW rank 0 [..../..BB/..../....] > MCW rank 1 [..../..../..../BB..] The additional case I was in progress to fix here and which is also handled by the new approach is % mpirun -np 2 --report-bindings --bind-to hwthread \ --map-by ppr:2:node:pe=3 --cpu-set 4,5,9,11,14,15 ./x > MCW rank 0 [..../BB../.B../....] > MCW rank 1 [..../..../...B/..BB] Signed-off-by: Mark Allen <markalle@us.ibm.com> [ This checkin also includes a squashed commit from Ralph Castain to fix code style violations. ]

rhc54 · 2019-09-10T21:49:41Z

Something may be off here - we have reports that cpu-set isn't working on v3.1.x, and so I'm not sure restoring it is the correct answer. Probably just need to start by looking at what the behavior actually is vs what we want it to be and fix it. I'll see what is going on.

markalle · 2019-09-11T19:35:15Z

I think I've got the commit message updated to reflect what this commit is doing, although I guess the one-liner about "restoring behavior" is a little misleading. This PR isn't trying to restore old OMPI 3.1 code anymore.

It's even reverting an older commit bdd92a7 where I had previously restored some pieces of the 3.1 code for --cpu-set behavior. Initially this PR was going to pull in even more 3.1 code, but after our discussions I've ended up reversing course and reverting most of bdd92a7 too.

The new direction for supporting the old behavior is very analogous to what would happen if the user created a cgroup and all the affinity code was running inside of that. I feel like the new code should be quite a bit more reliable than the 3.1 code that had dozens of locations where it would be iterating over the hwloc tree and would have extra checks to see if the things it was iterating on were part of the rdata->available mask that comes from --cpu-set. I suspect cgroups are more frequently used and tested than -cpu-set, and they're logically doing the same thing, so I think it makes a lot of sense for them to be the same code path.

jsquyres · 2019-09-12T15:38:47Z

@markalle I'm a bit confused: can you clarify what the problem is that we're solving here?

I ask because I'm not entirely sure that we really know what we want our mapping/binding options to do. It feels like we have a huge sprawl of mapping/binding options, and there's decent-enough documentation of the basics in the man page, and they seem to work well (--map-by core --bind-to core and friends). But there seem to be a lot of corner cases and legacy CLI options that I don't know if they are tested well or even well-defined (much less documented). Additionally, I don't know if it's well-defined what happens when these CLI options are mixed together.

E.g., @rhc54 and I were talking about #6966 and we realized that we didn't know the difference between --cpu-set and --cpu-list. ...and then it turns out that although mpirun(1) says that they are different, they look like they behave the same to us (for v3.x and master). Indeed, if this PR is about restoring 3.1.x behavior, I'm not convinced that that behavior is "correct" / what we want.

It might be a good idea to have a webex to discuss: what exactly do we want? (perhaps in the context of v5.0.x)

For example, maybe we could cut down/deprecate some of these older CLI options. It might be nice if everything could be done through --map-by X --bind-to Y, which might reduce the possibility of mixing-n-matching different corner cases...? Shrug. Just a thought / food for thought.

markalle · 2019-09-13T20:21:45Z

I agree there's a lot of sprawl, and some options might be "one off" designs that aren't clear on their interactions with other options.

My fixes are generally in response to our test team, which is probably trying out options based on the documentation more often than they're making tests based on customer usage.

Running normal affinity options inside of a cgroup though is pretty fundamental (and works I think). And this option is loosely related to that. For cgroups you probably have a job scheduler who knows what resources are available, and maybe another job is on the first socket, and the current job is being given the second socket so the job scheduler creates a cgroup for the part of the machine you're allowed to be on. Then this option would be kind of a cheap mimicry for people without a job scheduler who are writing their own scripts to do things like that by hand.

I wouldn't say it's the most important option in the world, but I still think it can be handy.

lanl-ompi · 2020-10-25T21:49:42Z

Can one of the admins verify this patch?

ibm-ompi · 2021-05-12T20:36:42Z

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/a426a1c0813e015c139743456eeaf7d8

rhc54 · 2021-05-18T03:18:12Z

For example, maybe we could cut down/deprecate some of these older CLI options. It might be nice if everything could be done through --map-by X --bind-to Y, which might reduce the possibility of mixing-n-matching different corner cases...? Shrug. Just a thought / food for thought.

That's precisely the direction we went in PRRTE.

I don't understand this PR at all as it (a) modifies a non-existent file in the non-existent ORTE tree and (b) modifies a file in hwloc/base that is no longer used (and should probably be removed from OMPI's master branch). Why does this PR even exist?

jsquyres · 2021-05-18T14:08:36Z

I'd be willing to bet that this PR is from before we PRTE-ized. It's probably stale / moot now.

jjhursey · 2021-09-21T15:09:58Z

Ref #9299 and #9298

awlauria · 2021-09-21T15:20:51Z

As of 9/21 ompi call, it was decided this is too risky to take this late in the v4 series. Instead we will print a warning saying that this option is broken, as well as update the man page.

jjhursey · 2021-09-21T15:22:45Z

Someone should check the behavior on master/v5.0.x to see if we need to fix the problem there. The concern with fixing the v4.0 and v4.1 branches was that it was a behavior change. In v5.0 we should do the right thing.

open-mpi deleted a comment from ibm-ompi Jun 12, 2019

markalle force-pushed the cpu_set branch from b2338de to 0015c07 Compare June 12, 2019 01:19

open-mpi deleted a comment from ibm-ompi Jun 12, 2019

markalle force-pushed the cpu_set branch from 0015c07 to 19dd4fa Compare July 18, 2019 20:51

bwbarrett added the Target: main label Jul 23, 2019

rhc54 added the ⚠️ WIP-DNM! label Jul 24, 2019

markalle force-pushed the cpu_set branch from ec7e82c to 2233146 Compare September 10, 2019 21:18

markalle added 2 commits September 10, 2019 17:23

markalle mentioned this pull request Oct 3, 2019

Cpu set v40x #7034

Closed

awlauria closed this Sep 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

restoring more hwloc --cpu-set behavior from OMPI 3.x #6755

restoring more hwloc --cpu-set behavior from OMPI 3.x #6755

markalle commented Jun 11, 2019 •

edited by jsquyres

Loading

jjhursey commented Jun 12, 2019

jjhursey commented Jun 12, 2019

bgoglin commented Jun 12, 2019 •

edited

Loading

jjhursey commented Jun 12, 2019

markalle commented Jun 12, 2019 •

edited by jsquyres

Loading

bgoglin commented Jun 12, 2019

rhc54 commented Jun 12, 2019

markalle commented Jun 13, 2019

markalle commented Jun 18, 2019

rhc54 commented Jun 19, 2019

jjhursey commented Jun 19, 2019

rhc54 commented Jun 19, 2019

markalle commented Jul 2, 2019

jsquyres commented Jul 2, 2019

markalle commented Jul 18, 2019

rhc54 commented Jul 24, 2019

rhc54 commented Jul 24, 2019

bgoglin commented Jul 25, 2019

jsquyres commented Jul 29, 2019

markalle commented Jul 31, 2019

markalle commented Sep 10, 2019

rhc54 commented Sep 10, 2019

markalle commented Sep 11, 2019

jsquyres commented Sep 12, 2019

markalle commented Sep 13, 2019

lanl-ompi commented Oct 25, 2020

ibm-ompi commented May 12, 2021

rhc54 commented May 18, 2021

jsquyres commented May 18, 2021

jjhursey commented Sep 21, 2021

awlauria commented Sep 21, 2021

jjhursey commented Sep 21, 2021

restoring more hwloc --cpu-set behavior from OMPI 3.x #6755

restoring more hwloc --cpu-set behavior from OMPI 3.x #6755

Conversation

markalle commented Jun 11, 2019 • edited by jsquyres Loading

jjhursey commented Jun 12, 2019

jjhursey commented Jun 12, 2019

bgoglin commented Jun 12, 2019 • edited Loading

jjhursey commented Jun 12, 2019

markalle commented Jun 12, 2019 • edited by jsquyres Loading

bgoglin commented Jun 12, 2019

rhc54 commented Jun 12, 2019

markalle commented Jun 13, 2019

markalle commented Jun 18, 2019

rhc54 commented Jun 19, 2019

jjhursey commented Jun 19, 2019

rhc54 commented Jun 19, 2019

markalle commented Jul 2, 2019

jsquyres commented Jul 2, 2019

markalle commented Jul 18, 2019

rhc54 commented Jul 24, 2019

rhc54 commented Jul 24, 2019

bgoglin commented Jul 25, 2019

jsquyres commented Jul 29, 2019

markalle commented Jul 31, 2019

markalle commented Sep 10, 2019

rhc54 commented Sep 10, 2019

markalle commented Sep 11, 2019

jsquyres commented Sep 12, 2019

markalle commented Sep 13, 2019

lanl-ompi commented Oct 25, 2020

ibm-ompi commented May 12, 2021

rhc54 commented May 18, 2021

jsquyres commented May 18, 2021

jjhursey commented Sep 21, 2021

awlauria commented Sep 21, 2021

jjhursey commented Sep 21, 2021

markalle commented Jun 11, 2019 •

edited by jsquyres

Loading

bgoglin commented Jun 12, 2019 •

edited

Loading

markalle commented Jun 12, 2019 •

edited by jsquyres

Loading