v4.1.x: -cpu-set as a constraint rather than as a binding #9299

awlauria · 2021-08-19T18:16:02Z

The first category of issue I'm addressing is that recent code changes
seem to only consider -cpu-set as a binding option. Eg a command like
this
% mpirun -np 2 --report-bindings --use-hwthread-cpus
--bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname
which just round robins over the --cpu-set list.

Example output which seems fine to me:

MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

It should also be possible though to pass a --cpu-set to most other
map/bind options and have it be a constraint on that binding. Eg
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The first command above errors that

Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: RANK_FILE
Prior policy: BYHWTHREAD

The error check in orte_rmaps_rank_file_open() is likely too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
and using an explicit rankfile is a conflict.

But the check has been expanded to not just check
NULL != orte_rankfile
but also errors out if
(NULL != opal_hwloc_base_cpu_list &&
!OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and
ignoring -cpu-set as a constraint on other binding policies.

For now I've changed the
NULL != opal_hwloc_base_cpu_list
to
OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding
policy. Whether I did that right or not it's enough to get to the next
stage of testing the example commands I have above.

Another place similar logic is used is hwloc_base_frame.c where it has
/* did the user provide a slot list? */
if (NULL != opal_hwloc_base_cpu_list) {
OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
}
where it used to (long ago) only do that if
!OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything
other than a binding policy.

That brings us past the error detection and into the real functionality, some of
which has been stripped out, probably in moving to hwloc-2:
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname

MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The rank_by() function in rmaps_base_ranking.c makes an array out of objects
returned from
opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search(). That function changed quite a bit from hwloc-1 to 2
but it used to include a check for
available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes. And it used to skip objs that
had hwloc_bitmap_iszero(available).

So I restored that behavior in ds_search() by adding a "constrained_cpuset" to
replace start->cpuset that it was otherwise processing. With that change in
place the first command works:
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname

MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The other command uses a different path though that still ignored the
available mask:
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname

MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
In bind_generic() the code used to call
opal_hwloc_base_find_min_bound_target_under_obj() which used
opal_hwloc_base_get_ncpus(), and that's where it would
intersect objects with the available cpuset and skip over ones
that were't available. To match the old behavior I added a few
lines in bind_generic() to skip over objects that don't intersect
the available mask. After that we get
% mpirun -np 2 --report-bindings
--bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

I think the above changes are improvements, but I don't feel like they're
comprehensive. I only traced through enough code to fix the two specific
bugs I was dealing with.

Fixes #6540

Signed-off-by: Mark Allen markalle@us.ibm.com
(cherry picked from commit bdd92a7)

The first category of issue I'm addressing is that recent code changes seem to only consider -cpu-set as a binding option. Eg a command like this % mpirun -np 2 --report-bindings --use-hwthread-cpus \ --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname which just round robins over the --cpu-set list. Example output which seems fine to me: > MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] It should also be possible though to pass a --cpu-set to most other map/bind options and have it be a constraint on that binding. Eg % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname The first command above errors that > Conflicting directives for mapping policy are causing the policy > to be redefined: > New policy: RANK_FILE > Prior policy: BYHWTHREAD The error check in orte_rmaps_rank_file_open() is likely too aggressive. The intent seems to be that any option like "--map-by whatever" will check to see if a rankfile is in use, and report that mapping via rmaps and using an explicit rankfile is a conflict. But the check has been expanded to not just check NULL != orte_rankfile but also errors out if (NULL != opal_hwloc_base_cpu_list && !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy)) which seems to be only recognizing -cpu-set as a binding option and ignoring -cpu-set as a constraint on other binding policies. For now I've changed the NULL != opal_hwloc_base_cpu_list to OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy) so it hopefully only errors out if -cpu-set is being used as a binding policy. Whether I did that right or not it's enough to get to the next stage of testing the example commands I have above. Another place similar logic is used is hwloc_base_frame.c where it has /* did the user provide a slot list? */ if (NULL != opal_hwloc_base_cpu_list) { OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET); } where it used to (long ago) only do that if !OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy) I think the new code is making it impossible to use --cpu-set as anything other than a binding policy. That brings us past the error detection and into the real functionality, some of which has been stripped out, probably in moving to hwloc-2: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname > MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] The rank_by() function in rmaps_base_ranking.c makes an array out of objects returned from opal_hwloc_base_get_obj_by_type(,,,i,) which uses df_search(). That function changed quite a bit from hwloc-1 to 2 but it used to include a check for available = opal_hwloc_base_get_available_cpus(topo, start) which is where the bitmask from --cpu-set goes. And it used to skip objs that had hwloc_bitmap_iszero(available). So I restored that behavior in ds_search() by adding a "constrained_cpuset" to replace start->cpuset that it was otherwise processing. With that change in place the first command works: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname > MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] The other command uses a different path though that still ignored the available mask: % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname > MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] In bind_generic() the code used to call opal_hwloc_base_find_min_bound_target_under_obj() which used opal_hwloc_base_get_ncpus(), and that's where it would intersect objects with the available cpuset and skip over ones that were't available. To match the old behavior I added a few lines in bind_generic() to skip over objects that don't intersect the available mask. After that we get % mpirun -np 2 --report-bindings \ --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname > MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] > MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....] I think the above changes are improvements, but I don't feel like they're comprehensive. I only traced through enough code to fix the two specific bugs I was dealing with. Signed-off-by: Mark Allen <markalle@us.ibm.com> (cherry picked from commit bdd92a7)

awlauria · 2021-08-19T18:16:41Z

This is a long forgotten commit that never got ported to a release branch.

awlauria · 2021-08-19T18:39:14Z

Note - there was another commit in the PR from this cherry-pick (bf3980d), but from my testing that doesn't need to go back as it was resolved in another patch.

awlauria · 2021-08-23T19:21:51Z

@jsquyres I did verify that this is still a problem on the v4 series and is fixed with this patch.

rhc54 · 2021-08-29T17:06:04Z

As I noted on the v4.0 version of this, this actually isn't a "long forgotten" PR - we intentionally never committed it because it creates collateral damage to other use cases. Instead, we fixed it on the master branch using a different approach. As noted in the original PR, the author only verified it fixed the bugs he was targeting - he didn't check all the other use-cases (which admittedly would have been a rather large task). The problems only appeared when others began playing with it.

Up to you how you want to proceed. I don't know how hard the backport of the final fix from master would be.

jsquyres · 2021-08-29T18:41:26Z

This seems like a high risk PR to merge on a release branch. I'm going to mark as "draft" until the other use cases can be verified, or some other forward path can be determined.

awlauria · 2021-08-30T13:02:45Z

I'm not following how this got fixed in a different way on master. This PR (and the v4.0.x version) are literally cherry-picks on from master. Meaning - this code still lives on master - and is still running on master.

If this PR caused collateral damage elsewhere on master - that damage must still be there - since this code is still running on master.

Do we know what damage this PR caused? I checked the old master PR, and the old issues it referenced - and I see no mention of it.

rhc54 · 2021-08-30T14:09:49Z

Errr...I just looked at the code in PRRTE and I'm afraid it doesn't look like the code here, so I'm not sure what "master" branch you are referring to in your comment. IIRC, the problems were in other edge cases that passed thru the same code.

And no, we didn't bother flagging them on the old issues and PR - we just fixed them and moved forward.

As I said, given that master appears to be working correctly, it might be safer to backport that code than to use this one. Bringing in such old code patches is always a risk as there has been considerable drift since this was created, some of which is related or interacts with it, and none of which is easily traced.

I have no strong opinions here - please do whatever you guys want. I was only trying to warn you regarding the prior history.

awlauria · 2021-08-30T14:22:29Z

@rhc54 all of the changes to opal/hwloc are still there. Also, the code change here:

3f4b9bd#diff-d3b95ef71b882aafb439b26229c28e6ebf41962f8621a6a68fd9af4d412673a4R182

Lives on PRRTE here:

https://github.com/openpmix/prrte/blob/master/src/mca/rmaps/base/rmaps_base_binding.c#L304

the only change that doesn't live in PRRTE - as far as I can tell, is this:

3f4b9bd#diff-11aa1c9970c691d0ebffd25e38d908c757c4a5450927136499f6acdb0be6a943R110

awlauria · 2021-08-30T14:23:51Z

I'll try reverting this on master, and will see if the referenced issue comes back. If it doesn't come back, then it really was fixed in a different way/another commit.

rhc54 · 2021-08-30T14:52:21Z

I see quite a few lines of difference between the post-PR modified v4 code and what is in master, so I'm not sure what code you are reading. Still, this isn't worth the angst. You are welcome to do whatever you want - all I'm saying is that there have been some problems associated with this in the past. If it means that much to you, please feel free to proceed 😄

awlauria · 2021-08-30T15:03:01Z

Can you point me to the significant differences? I stand by my comments that all of this code is still in master except for that one change, but I could be wrong.

Again - I'll try reverting this PR on master and see if the issue comes back. if it doesn't, I'll close this and the v4.0 version as un-needed.

awlauria · 2021-09-02T20:57:22Z

Copying what I put in the companion v4.0.x pr:

I brought in these changes one-by-one, and can confirm that all changes are needed to make this option, --cpu-set, work. Even including the changes made to the hwloc files.

If this patch is to not be taken in to fix this bug - I wonder if instead we should disable --cpu-set, since as is on the v4 series it seems broken. Thoughts?

gpaulsen · 2021-09-14T21:33:36Z

This is ready for review for v4.1.x

awlauria · 2021-09-21T15:18:22Z

As of 9/21 ompi call, it was decided this is too risky to take this late in the v4 series. Instead we will print a warning saying that this option is broken, as well as update the man page.

awlauria added the Target: v4.1.x label Aug 19, 2021

awlauria added this to the v4.1.x milestone Aug 19, 2021

awlauria requested review from jsquyres, rhc54 and markalle August 19, 2021 18:16

jsquyres marked this pull request as draft August 29, 2021 18:41

gpaulsen marked this pull request as ready for review September 14, 2021 21:33

jjhursey mentioned this pull request Sep 21, 2021

restoring more hwloc --cpu-set behavior from OMPI 3.x #6755

Closed

awlauria closed this Sep 21, 2021

jsquyres mentioned this pull request Sep 21, 2021

v4.0.x and v4.1.x: document that --cpu-set doesn't work #9397

Open

awlauria deleted the cpu_fixes_v4.1.x_port branch March 17, 2022 17:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v4.1.x: -cpu-set as a constraint rather than as a binding #9299

v4.1.x: -cpu-set as a constraint rather than as a binding #9299

awlauria commented Aug 19, 2021 •

edited

Loading

awlauria commented Aug 19, 2021 •

edited

Loading

awlauria commented Aug 19, 2021

awlauria commented Aug 23, 2021

rhc54 commented Aug 29, 2021

jsquyres commented Aug 29, 2021

awlauria commented Aug 30, 2021

rhc54 commented Aug 30, 2021

awlauria commented Aug 30, 2021

awlauria commented Aug 30, 2021

rhc54 commented Aug 30, 2021

awlauria commented Aug 30, 2021

awlauria commented Sep 2, 2021 •

edited

Loading

gpaulsen commented Sep 14, 2021

awlauria commented Sep 21, 2021 •

edited

Loading

v4.1.x: -cpu-set as a constraint rather than as a binding #9299

v4.1.x: -cpu-set as a constraint rather than as a binding #9299

Conversation

awlauria commented Aug 19, 2021 • edited Loading

awlauria commented Aug 19, 2021 • edited Loading

awlauria commented Aug 19, 2021

awlauria commented Aug 23, 2021

rhc54 commented Aug 29, 2021

jsquyres commented Aug 29, 2021

awlauria commented Aug 30, 2021

rhc54 commented Aug 30, 2021

awlauria commented Aug 30, 2021

awlauria commented Aug 30, 2021

rhc54 commented Aug 30, 2021

awlauria commented Aug 30, 2021

awlauria commented Sep 2, 2021 • edited Loading

gpaulsen commented Sep 14, 2021

awlauria commented Sep 21, 2021 • edited Loading

awlauria commented Aug 19, 2021 •

edited

Loading

awlauria commented Aug 19, 2021 •

edited

Loading

awlauria commented Sep 2, 2021 •

edited

Loading

awlauria commented Sep 21, 2021 •

edited

Loading