Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binding: -cpu-set as a constraint rather than as a binding #6584

Merged
merged 2 commits into from
Apr 15, 2019

Conversation

markalle
Copy link
Contributor

There are two commits here, but one is pretty small. The bulk of this PR is -cpu-set related:

The first category of issue I'm addressing is that recent code changes seem to only consider -cpu-set as a binding option. Eg a command like this

  % mpirun -np 2 --report-bindings --use-hwthread-cpus \
      --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname

which just round robins over the --cpu-set list.

Example output which seems fine to me:
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

It should also be possible though to pass a --cpu-set to most other map/bind options and have it be a constraint on that binding. Eg

  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The first command above errors that

Conflicting directives for mapping policy are causing the policy
to be redefined:
New policy: RANK_FILE
Prior policy: BYHWTHREAD

The error check in orte_rmaps_rank_file_open() is likely too aggressive. The intent seems to be that any option like "--map-by whatever" will check to see if a rankfile is in use, and report that mapping via rmaps and using an explicit rankfile is a conflict.

But the check has been expanded to not just check
NULL != orte_rankfile
but also errors out if
(NULL != opal_hwloc_base_cpu_list &&
!OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and ignoring -cpu-set as a constraint on other binding policies.

For now I've changed the
NULL != opal_hwloc_base_cpu_list
to
OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding policy. Whether I did that right or not it's enough to get to the next stage of testing the example commands I have above.

Another place similar logic is used is hwloc_base_frame.c where it has

    /* did the user provide a slot list? */
    if (NULL != opal_hwloc_base_cpu_list) {
        OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
    }

where it used to (long ago) only do that if
!OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything other than a binding policy.

That brings us past the error detection and into the real functionality, some of which has been stripped out, probably in moving to hwloc-2:

  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The rank_by() function in rmaps_base_ranking.c makes an array out of objects returned from
opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search(). That function changed quite a bit from hwloc-1 to 2 but it used to include a check for
available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes. And it used to skip objs that had hwloc_bitmap_iszero(available).

So I restored that behavior in ds_search() by adding a "constrained_cpuset" to replace start->cpuset that it was otherwise processing. With that change in place the first command works:

  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The other command uses a different path though that still ignored the available mask:

  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

In bind_generic() the code used to call opal_hwloc_base_find_min_bound_target_under_obj() which used opal_hwloc_base_get_ncpus(), and that's where it would intersect objects with the available cpuset and skip over ones that were't available. To match the old behavior I added a few lines in bind_generic() to skip over objects that don't intersect the available mask. After that we get

  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

I think the above changes are improvements, but I don't feel like they're comprehensive. I only traced through enough code to fix the two specific bugs I was dealing with.

The first category of issue I'm addressing is that recent code changes
seem to only consider -cpu-set as a binding option. Eg a command like
this
  % mpirun -np 2 --report-bindings --use-hwthread-cpus \
      --bind-to cpulist:ordered --map-by hwthread --cpu-set 6,7 hostname
which just round robins over the --cpu-set list.

Example output which seems fine to me:
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

It should also be possible though to pass a --cpu-set to most other
map/bind options and have it be a constraint on that binding. Eg
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The first command above errors that
> Conflicting directives for mapping policy are causing the policy
> to be redefined:
>   New policy:   RANK_FILE
>   Prior policy:  BYHWTHREAD

The error check in orte_rmaps_rank_file_open() is likely too aggressive.
The intent seems to be that any option like "--map-by whatever" will
check to see if a rankfile is in use, and report that mapping via rmaps
and using an explicit rankfile is a conflict.

But the check has been expanded to not just check
    NULL != orte_rankfile
but also errors out if
    (NULL != opal_hwloc_base_cpu_list &&
    !OPAL_BIND_ORDERED_REQUESTED(opal_hwloc_binding_policy))
which seems to be only recognizing -cpu-set as a binding option and
ignoring -cpu-set as a constraint on other binding policies.

For now I've changed the
    NULL != opal_hwloc_base_cpu_list
to
    OPAL_BIND_TO_CPUSET == OPAL_GET_BINDING_POLICY(opal_hwloc_binding_policy)
so it hopefully only errors out if -cpu-set is being used as a binding
policy.  Whether I did that right or not it's enough to get to the next
stage of testing the example commands I have above.

Another place similar logic is used is hwloc_base_frame.c where it has
    /* did the user provide a slot list? */
    if (NULL != opal_hwloc_base_cpu_list) {
        OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);
    }
where it used to (long ago) only do that if
    !OPAL_BINDING_POLICY_IS_SET(opal_hwloc_binding_policy)
I think the new code is making it impossible to use --cpu-set as anything
other than a binding policy.

That brings us past the error detection and into the real functionality, some of
which has been stripped out, probably in moving to hwloc-2:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [B.../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [.B../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The rank_by() function in rmaps_base_ranking.c makes an array out of objects
returned from
    opal_hwloc_base_get_obj_by_type(,,,i,)
which uses df_search().  That function changed quite a bit from hwloc-1 to 2
but it used to include a check for
    available = opal_hwloc_base_get_available_cpus(topo, start)
which is where the bitmask from --cpu-set goes.  And it used to skip objs that
had hwloc_bitmap_iszero(available).

So I restored that behavior in ds_search() by adding a "constrained_cpuset" to
replace start->cpuset that it was otherwise processing.  With that change in
place the first command works:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
> MCW rank 0: [..../..B./..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../...B/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

The other command uses a different path though that still ignored the
available mask:
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [BB../..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..BB/..../..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
In bind_generic() the code used to call
opal_hwloc_base_find_min_bound_target_under_obj() which used
opal_hwloc_base_get_ncpus(), and that's where it would
intersect objects with the available cpuset and skip over ones
that were't available. To match the old behavior I added a few
lines in bind_generic() to skip over objects that don't intersect
the available mask. After that we get
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node:pe=2 --cpu-set 6,7,12,13 hostname
> MCW rank 0: [..../..BB/..../..../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]
> MCW rank 1: [..../..../..../BB../..../..../..../..../..../..../..../....][..../..../..../..../..../..../..../..../..../..../..../....]

I think the above changes are improvements, but I don't feel like they're
comprehensive.  I only traced through enough code to fix the two specific
bugs I was dealing with.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
The following command hangs:
  % mpirun --rank-by core -np 3 --report-bindings hostname
because of a loop where i is supposed to cycle through an
array of size num_objs, but for some reason it's only
looking at node->num_procs entries.

I changed the counter so it stays in the loop (stays on this
node) until it makes a full cycle through the array of objects
without any assignments then it ends the loop so it can go to
the next node.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2019

I don't see any cpulist:ordered binding policy in OMPI - is that some Spectrum thingy?

I'm leery of these changes unless we can demonstrate they have been tested against a wide range of mapping/binding combinations as you are touching code used by all of them - but your comments imply you have only looked at rankfile (which is the least-used mapper we have)

@markalle
Copy link
Contributor Author

The description here:
https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php
talks about --bind-to cpu-list:ordered and has some syntactic synonyms for it. It's not a spectrum addition.

But I was only using that command to show the context in which -cpu-set currently seems to be functional.

In the 3.x branch -cpu-set could be used in other contexts like these:

 % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
  % mpirun -np 2 --report-bindings \
      --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

Those are the example commands re-enabled in this PR.

@rhc54
Copy link
Contributor

rhc54 commented Apr 12, 2019

Dunno where that came from - I don't recognize that binding policy, but maybe someone added it?

Anyway, I still want to see some broader testing than just rankfile - as I said, you're touching code that many map/bind combinations traverse, even when --cpu-set isn't specified. We have lots of examples where someone "fixed" it for one use-case and "broke" it for others 😢

@markalle
Copy link
Contributor Author

I agree about wanting more testing. But the commands I'm re-enabling here aren't using rankfile. They're simple mapping options like --map-by hwthread or --map-by ppr:2:node,pe=2 but operating within a constrained -cpu-set:

% mpirun -np 2 --report-bindings \
    --bind-to hwthread --map-by hwthread --cpu-set 6,7 hostname
% mpirun -np 2 --report-bindings \
    --bind-to hwthread --map-by ppr:2:node,pe=2 --cpu-set 6,7,12,13 hostname

The only way I see rankfile coming into this is that the first error-check is (I think wrongly) conflating any use of -cpu-set as being equivalent to using a rankfile. I don't know all the ways -cpu-set can be used, but I think my examples above shouldn't be considered on par with the explicit bindings that come from a rankfile.

The part that I guess you're saying is most dangerous is the change that makes this statement in hwloc_base_frame.c

        OPAL_SET_BINDING_POLICY(opal_hwloc_binding_policy, OPAL_BIND_TO_CPUSET);

less active. I would like to know the original rationale for the above. But the code that's there now makes it as if the user typed --bind-to cpu-list even if all they did was add --cpu-set to another map/bind option like my examples.

But the rest of the changes are relatively simple, using hwloc_root->userdata->rdata if it exists, and is only replicating functionality that existed in 3.x prior to changing from hwloc-1 to hwloc-2. The part that isn't simple is the fact that it's likely not comprehensive -- I traced through the code for my two examples comparing old and new behavior. I suspect there are plenty more places essentially the same functionality was removed in moving from hwloc-1 to hwloc-2.

@markalle
Copy link
Contributor Author

For me this PR also appears to fix
#5020

Copy link
Contributor

@rhc54 rhc54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get it into master and under MTT. There is also a map tester in ompi-tests/orte/pass/maprankbind.pl - might be good to give that a try as well. You have to generate a "signature" file as it only checks for changes, so I'd do before-PR to after-PR.

@gpaulsen
Copy link
Member

Thanks.

@awlauria
Copy link
Contributor

@markalle @gpaulsen this never went back to the v4 series. Is this intentional?

@awlauria
Copy link
Contributor

I brought brack bdd92a7 to v4/v4.1.

However, the second commit, bf3980d, seems to have been fixed by another PR that did go back - I think this one specifically: #7643

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants