Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restoring more hwloc --cpu-set behavior from OMPI 3.x #6755

Closed
wants to merge 2 commits into from

Conversation

markalle
Copy link
Contributor

@markalle markalle commented Jun 11, 2019

There are two commits here

  1. restoring --cpu-set behavior that OMPI 3.x had
  2. addressing an error detection (for oversubscription) that was too aggressive

I made an older checkin bdd92a7
that partially restored --cpu-set behavior using OMPI 3.x code,
and I was about to restore more code from 3.x, but I'm taking a
different approach here and reverting the previous addition of
3.x code too.

For the (partial) revert:
commit: bdd92a7.
title: "-cpu-set as a constraint rather than as a binding"
I'm reverting the functional changes to how the hwloc tree is
iterated over, but keeping the error-detection changes modeled
after 3.x.

For the new approach to --cpu-set:

The --cpu-set option is logically similar to running under a cgroup
just without the OS-level enforcement that comes with a cgroup. For
cgroups the new code loads the topology without the WHOLE_SYSTEM
flag so the tree only contains what's in the cgroup. We can do the
same thing with a hwloc_topology_restrict() call to constrain the
topology to whatever --cpu-set the user enters. Then all the 3.x
code that skips over unavailable tree entries becomes unnecessary.

Here are the cases that were fixed in bdd92a7,
and which are confirmed to still work with the new approach:

hardware: [..../..../..../....] numbered sequentially 0-15

% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by hwthread --cpu-set 6,7 hostname
MCW rank 0 [..../..B./..../....]
MCW rank 1 [..../...B/..../....]
% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
MCW rank 0 [..../..BB/..../....]
MCW rank 1 [..../..../..../BB..]

The additional case I was in progress to fix here and which
is also handled by the new approach is

% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by ppr:2:node:pe=3 --cpu-set 4,5,9,11,14,15 ./x
MCW rank 0 [..../BB../.B../....]
MCW rank 1 [..../..../...B/..BB]

@open-mpi open-mpi deleted a comment from ibm-ompi Jun 12, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 12, 2019
@jjhursey
Copy link
Member

bot:ibm:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Jun 12, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 12, 2019
@jjhursey
Copy link
Member

bot:ibm:retest

@bgoglin
Copy link
Contributor

bgoglin commented Jun 12, 2019

hwloc_topology_get_allowed_cpuset(topo) is only useful if hwloc is configured to list disallowed CPUs (those outside of our cgroup). AFAIK, that's not the case in OMPI anymore (HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM is never passed to hwloc_topology_set_flags())

@jjhursey
Copy link
Member

bot:ibm:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Jun 12, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Jun 12, 2019
@markalle
Copy link
Contributor Author

markalle commented Jun 12, 2019

Oh gosh, I was about to ask how --report-bindings manages to display the affinity as a portion of the whole machine if WHOLE_SYSTEM isn't used... but it doesn't anymore. I guess that's a matter for a separate discussion what --report-bindings should show. I prefer the old way where the whole machine is there. Eg if I run in a tiny cgroup that contains cores 6,7,10,11 I'd prefer to see

[..../..BB/..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../..../....]

than [BB/..][]
Do you know if this change was intentional or just a side effect of other changes?

I think I agree this makes hwloc_topology_get_allowed_cpuset() unnecessary, but the intersections with rdata->available would still be needed

@bgoglin
Copy link
Contributor

bgoglin commented Jun 12, 2019

Enabling WHOLE_SYSTEM makes things much more complicated since you have to manually ignore disallowed objects when iterating, counting objects, etc. I don't remember the actual reason why it was disabled, but enabling it just for having a nicer --report-bindings would seem crazy to me.

I don't see an easy way to bring back the feature. hwloc still has the info that some other PUs exist, but we don't know which cores/sockets they correspond to.

If the transmitted XML contains disallowed PUs, one could load a secondary topology for --report-bindings with the WHOLE_SYSTEM flag.

@rhc54
Copy link
Contributor

rhc54 commented Jun 12, 2019

It was a very deliberate decision - the inclusion of disallowed PUs confused people (and software) that didn't realize they had to ignore them. As @bgoglin explained, it made things much easier and less confusing to users.

Please don't bring it back 😄

@markalle
Copy link
Contributor Author

I'll open a separate PR related to --report-bindings.

For the purpose of this PR I like the decision to not use WHOLE_SYSTEM, and I'm curious if an analogous approach should be used with rdata->available. I'd consider --cpu-set to be an analogous concept, it's like specifying a cgroup minus the enforcement that comes with a real cgroup. It seems like the bitwise intersections with rdata->available are scattered around quite a bit in the 3.x code with lots of checks and skips in the loops like mentioned above. I'd rather see the whole topology modified early on to restrict it to rdata->available, close to topology_load time.

I haven't gone down that path, but might consider it for the future if that sounds plausible to others.

@markalle
Copy link
Contributor Author

I was about to modify this PR to remove the calls to hwloc_topology_get_allowed_cpuset() since none of the topologies use WHOLE_SYSTEM anymore, but I heard there's discussion about whether the mappings are right under the new system and it might be coming back (?). I'm not clear on why they'd be wrong, empirically I'm able to put different cgroups on each host and mpirun is seeing a correct looking just-what's-available topology for each host and the assignments it makes are translating out correctly back at the ranks, at least for my runs.

But anyway where does that put this PR? Here I'm restoring 3.x behavior for the --cpu-set option and currently still have code that intersects with hwloc_topology_get_allowed_cpuset() which shouldn't be harmful but might be unnecessary

@rhc54
Copy link
Contributor

rhc54 commented Jun 19, 2019

Afraid I don't know what you are talking about relative to this PR. Removing get_allowed_cpuset makes sense per the comments from @bgoglin . The other discussion related to your report-bindings PR which has some problems associated with it.

@jjhursey
Copy link
Member

@rhc54 You mentioned on the OMPI teleconf that you were concerned that the process mapping was going to be wrong in an environment with cgroups set this way - since the hwloc signature doesn't include the offline/online information. I think that's what @markalle is asking about.

@rhc54
Copy link
Contributor

rhc54 commented Jun 19, 2019

Understood, but that doesn't have anything to do with this PR. Looking at it again, my question would be: if you remove the get_allowed_cpuset changes, is there anything left?

@markalle
Copy link
Contributor Author

markalle commented Jul 2, 2019

The reason I brought up the teleconference conversation was to ask if the WHOLE_SYSTEM + get_allowed_cpuset() might be coming back due to the concerns about mapper output with the new non-WHOLE_SYSTEM topologies or if the former WHOLE_SYSTEM topologies are really staying gone. That question would affect whether to keep the get_allowed_cpuset() in this PR.

But even if we keep the topologies as non-WHOLE_SYSTEM and remove the get_allowed_cpuset() here, there is still active code change in this PR. The --cpu-set option puts a mask in
rdata->available
In OMPI 3.x that value was checked a bunch of places as the hwloc tree was searched and iterated over. Much of that code was removed in 4.x producing issues like the one addressed here.

@jsquyres
Copy link
Member

jsquyres commented Jul 2, 2019

This is also planned to be discussed on the 9 July 2019 webex.

@markalle
Copy link
Contributor Author

I reverted not just the current addition of 3.x code but also a previous checkin from april that had also added 3.x code for skipping over unavailable tree entries. And I switched to a hwloc_topology_restrict() call that makes --cpu-set get handled analogously to how a cgroup is handled. That approach always seemed more logical to me; I just hadn't bothered to see if it would work, figuring restoring 3.x behavior was safer than taking a new approach.

But I bet cgroups get more attention and testing than --cpu-set, so piggybacking that approach ought to be good

@rhc54
Copy link
Contributor

rhc54 commented Jul 24, 2019

You can squash these if you like - probably need to do so to quiet the signed-off-by checker. I simply fixed some code style violations. Something to perhaps keep in mind for future PRs.

I have no issue with these changes.

@rhc54
Copy link
Contributor

rhc54 commented Jul 24, 2019

Just a second - something just hit me.

@bgoglin What happens if an application proc calls hwloc_topology_restrict on a shared memory topology tree owned by the daemon? What if it tries to add an object to the "userdata" field?

@bgoglin
Copy link
Contributor

bgoglin commented Jul 25, 2019

@rhc54 It's not allowed, it's a read-only shared mapping.

@jsquyres
Copy link
Member

@markalle Are you planning to update this PR (per @bgoglin's comments)?

@markalle
Copy link
Contributor Author

Well, I think it should be okay, but would like confirmation.

The code we're concerned about is where I added a call at the bottom of

int opal_hwloc_base_filter_cpus(hwloc_topology_t topo)
{
    ...

// Treat --cpu-set analogously to how a cgroup is treated, eg make
// the loaded topology only contain what is available.
    hwloc_topology_restrict(topo, avail, 0);
}

to shrink the topology in a manner that I expect is similar to what would happen if I had created an equivalent cgroup and had done a topology_load() from inside that cgroup.

My two reasons for believing this is okay are

  1. looking at where opal_hwloc_base_filter_cpus() is called, it sure looks like all its uses come from topologies that had just been created by ordinary calls to hwloc_topology_init() and hwloc_topology_load(), and
  2. nearly the first line of opal_hwloc_base_filter_cpus() is already setting root->userdata so I think that means it's already not a legal function to call on a shared topology

@markalle
Copy link
Contributor Author

I repushed with Ralph's code violation fixes squashed into my commits.

In bind_generic() there's a loop that picks a starting trg_obj and then
walks through a loop of next = trg_obj->next_cousin until it has made
total_cpus assignments.  But the code doesn't accept that those assignments
might not be adjacent objects.

Example:
% mpirun -np 2 --report-bindings --map-by ppr:2:node:pe=3 \
    --cpu-set 4,5,7,8,9,11 -bind-to hwthread:overload-allowed
> MCW 0 : [..../BB.B/..../....]
> MCW 1 : [..../..../BB.B/....]

It will want to assign 3 cpus and will loop through
  trg_obj 00001 (with ncpus 1)
  trg_obj 000001 (with ncpus 1)
  trg_obj 0000001 (with ncpus 0)
  trg_obj 000000011 (with ncpus 1)

The original code on the third entry would see num_bound for the
object become too high for its ncpus and think oversubscription was
happening.  I changed it to only ++num_bound eg to use that object
if the object has cpus in its cpuset after intersected with the
allowed/available masks.

The error message from the original code (if you remove the overload-allowed)
would be
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>    Bind to:     HWTHREAD
>    Node:        ...
>    #processes:  1
>    #cpus:      0

Signed-off-by: Mark Allen <markalle@us.ibm.com>

[ This checkin also includes a squashed commit from Ralph Castain
to fix code style violations. ]
I made an older checkin bdd92a7
that partially restored --cpu-set behavior using OMPI 3.x code,
and I was about to restore more code from 3.x, but I'm taking a
different approach here and reverting the previous addition of
3.x code too.

For the (partial) revert:
  commit: bdd92a7.
  title: "-cpu-set as a constraint rather than as a binding"
I'm reverting the functional changes to how the hwloc tree is
iterated over, but keeping the error-detection changes modeled
after 3.x.

For the new approach to --cpu-set:

The --cpu-set option is logically similar to running under a cgroup
just without the OS-level enforcement that comes with a cgroup. For
cgroups the new code loads the topology without the WHOLE_SYSTEM
flag so the tree only contains what's in the cgroup. We can do the
same thing with a hwloc_restrict_topology() call to constrain the
topology to whatever --cpu-set the user enters.  Then all the 3.x
code that skips over unavailable tree entries becomes unnecessary.

Here are the cases that were fixed in bdd92a7,
and which are confirmed to still work with the new approach:

hardware: [..../..../..../....] numbered sequentially 0-15

% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0 [..../..B./..../....]
> MCW rank 1 [..../...B/..../....]

% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0 [..../..BB/..../....]
> MCW rank 1 [..../..../..../BB..]

The additional case I was in progress to fix here and which
is also handled by the new approach is
% mpirun -np 2 --report-bindings --bind-to hwthread \
  --map-by ppr:2:node:pe=3 --cpu-set 4,5,9,11,14,15 ./x
> MCW rank 0 [..../BB../.B../....]
> MCW rank 1 [..../..../...B/..BB]

Signed-off-by: Mark Allen <markalle@us.ibm.com>

[ This checkin also includes a squashed commit from Ralph Castain
to fix code style violations. ]
@rhc54
Copy link
Contributor

rhc54 commented Sep 10, 2019

Something may be off here - we have reports that cpu-set isn't working on v3.1.x, and so I'm not sure restoring it is the correct answer. Probably just need to start by looking at what the behavior actually is vs what we want it to be and fix it. I'll see what is going on.

@markalle
Copy link
Contributor Author

I think I've got the commit message updated to reflect what this commit is doing, although I guess the one-liner about "restoring behavior" is a little misleading. This PR isn't trying to restore old OMPI 3.1 code anymore.

It's even reverting an older commit bdd92a7 where I had previously restored some pieces of the 3.1 code for --cpu-set behavior. Initially this PR was going to pull in even more 3.1 code, but after our discussions I've ended up reversing course and reverting most of bdd92a7 too.

The new direction for supporting the old behavior is very analogous to what would happen if the user created a cgroup and all the affinity code was running inside of that. I feel like the new code should be quite a bit more reliable than the 3.1 code that had dozens of locations where it would be iterating over the hwloc tree and would have extra checks to see if the things it was iterating on were part of the rdata->available mask that comes from --cpu-set. I suspect cgroups are more frequently used and tested than -cpu-set, and they're logically doing the same thing, so I think it makes a lot of sense for them to be the same code path.

@jsquyres
Copy link
Member

@markalle I'm a bit confused: can you clarify what the problem is that we're solving here?

I ask because I'm not entirely sure that we really know what we want our mapping/binding options to do. It feels like we have a huge sprawl of mapping/binding options, and there's decent-enough documentation of the basics in the man page, and they seem to work well (--map-by core --bind-to core and friends). But there seem to be a lot of corner cases and legacy CLI options that I don't know if they are tested well or even well-defined (much less documented). Additionally, I don't know if it's well-defined what happens when these CLI options are mixed together.

E.g., @rhc54 and I were talking about #6966 and we realized that we didn't know the difference between --cpu-set and --cpu-list. ...and then it turns out that although mpirun(1) says that they are different, they look like they behave the same to us (for v3.x and master). Indeed, if this PR is about restoring 3.1.x behavior, I'm not convinced that that behavior is "correct" / what we want.

It might be a good idea to have a webex to discuss: what exactly do we want? (perhaps in the context of v5.0.x)

For example, maybe we could cut down/deprecate some of these older CLI options. It might be nice if everything could be done through --map-by X --bind-to Y, which might reduce the possibility of mixing-n-matching different corner cases...? Shrug. Just a thought / food for thought.

@markalle
Copy link
Contributor Author

I agree there's a lot of sprawl, and some options might be "one off" designs that aren't clear on their interactions with other options.

My fixes are generally in response to our test team, which is probably trying out options based on the documentation more often than they're making tests based on customer usage.

Running normal affinity options inside of a cgroup though is pretty fundamental (and works I think). And this option is loosely related to that. For cgroups you probably have a job scheduler who knows what resources are available, and maybe another job is on the first socket, and the current job is being given the second socket so the job scheduler creates a cgroup for the part of the machine you're allowed to be on. Then this option would be kind of a cheap mimicry for people without a job scheduler who are writing their own scripts to do things like that by hand.

I wouldn't say it's the most important option in the world, but I still think it can be handy.

@markalle markalle mentioned this pull request Oct 3, 2019
@lanl-ompi
Copy link
Contributor

Can one of the admins verify this patch?

@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/a426a1c0813e015c139743456eeaf7d8

@rhc54
Copy link
Contributor

rhc54 commented May 18, 2021

For example, maybe we could cut down/deprecate some of these older CLI options. It might be nice if everything could be done through --map-by X --bind-to Y, which might reduce the possibility of mixing-n-matching different corner cases...? Shrug. Just a thought / food for thought.

That's precisely the direction we went in PRRTE.

I don't understand this PR at all as it (a) modifies a non-existent file in the non-existent ORTE tree and (b) modifies a file in hwloc/base that is no longer used (and should probably be removed from OMPI's master branch). Why does this PR even exist?

@jsquyres
Copy link
Member

I'd be willing to bet that this PR is from before we PRTE-ized. It's probably stale / moot now.

@jjhursey
Copy link
Member

Ref #9299 and #9298

@awlauria
Copy link
Contributor

As of 9/21 ompi call, it was decided this is too risky to take this late in the v4 series. Instead we will print a warning saying that this option is broken, as well as update the man page.

@awlauria awlauria closed this Sep 21, 2021
@jjhursey
Copy link
Member

Someone should check the behavior on master/v5.0.x to see if we need to fix the problem there. The concern with fixing the v4.0 and v4.1 branches was that it was a behavior change. In v5.0 we should do the right thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants