removing an overly aggressive error check in binding
In bind_generic() there's a loop that picks a starting trg_obj and then
walks next = trg_obj->next_cousin until it has made total_cpus
assignments.  But the code doesn't accept that those assignments might
not fall on adjacent objects.

Example:
% mpirun -np 2 --report-bindings --map-by ppr:2:node:pe=3 \
    --cpu-set 4,5,7,8,9,11 -bind-to hwthread:overload-allowed
> MCW 0 : [..../BB.B/..../....]
> MCW 1 : [..../..../BB.B/....]

It will want to assign 3 cpus and will loop through
  trg_obj 00001 (with ncpus 1)
  trg_obj 000001 (with ncpus 1)
  trg_obj 0000001 (with ncpus 0)
  trg_obj 000000011 (with ncpus 1)

On the third entry, the original code would increment num_bound for an
object whose ncpus is 0, see num_bound exceed ncpus, and conclude that
oversubscription was happening.  I changed it to only ++num_bound (i.e.,
to use that object) if the object still has cpus in its cpuset after
being intersected with the allowed/available masks.
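
The walk described above can be imitated with a small standalone sketch. It is not the ORTE code: sim_obj_t, sim_bind() and sim_example() are hypothetical names, the list is a plain linked list rather than an hwloc topology, and the overload test (num_bound > ncpus) is an assumption based on the description in this message. The ncpus values 1, 1, 0, 1 are taken from the example walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an hwloc object at the binding level. */
typedef struct sim_obj {
    unsigned ncpus;               /* usable cpus left after masking */
    unsigned num_bound;           /* procs counted against this object */
    struct sim_obj *next_cousin;  /* next object at the same level */
} sim_obj_t;

/*
 * Walk cousins from 'start' until 'total_cpus' usable cpus were collected.
 * skip_empty == false mimics the old behavior (always ++num_bound),
 * skip_empty == true mimics the patched behavior (only if ncpus != 0).
 * Returns true if any visited object ends up with num_bound > ncpus.
 */
static bool sim_bind(sim_obj_t *start, unsigned total_cpus, bool skip_empty)
{
    assert(start != NULL && total_cpus > 0);

    unsigned assigned = 0;
    bool overload = false;

    for (sim_obj_t *obj = start;
         obj != NULL && assigned < total_cpus;
         obj = obj->next_cousin) {
        if (!skip_empty || obj->ncpus > 0) {
            obj->num_bound++;
        }
        if (obj->num_bound > obj->ncpus) {
            overload = true;      /* assumed form of the overload check */
        }
        assigned += obj->ncpus;
    }
    return overload;
}

/* Build the 1, 1, 0, 1 chain from the example and bind 3 cpus. */
static bool sim_example(bool skip_empty)
{
    sim_obj_t d = { 1, 0, NULL };
    sim_obj_t c = { 0, 0, &d };   /* masked out: no usable cpus */
    sim_obj_t b = { 1, 0, &c };
    sim_obj_t a = { 1, 0, &b };

    return sim_bind(&a, 3, skip_empty);
}
```

In this sketch sim_example(false) flags an overload on the third object, while sim_example(true) walks past it and collects the three cpus from the first, second and fourth objects without flagging anything.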

The error message from the original code (if you remove the overload-allowed)
would be
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>    Bind to:     HWTHREAD
>    Node:        ...
>    #processes:  1
>    #cpus:      0

Signed-off-by: Mark Allen <markalle@us.ibm.com>
markalle committed Jun 12, 2019
1 parent bd9ad69 commit 0015c07
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion orte/mca/rmaps/base/rmaps_base_binding.c
@@ -224,7 +224,9 @@ static int bind_generic(orte_job_t *jdata,
                 data = OBJ_NEW(opal_hwloc_obj_data_t);
                 trg_obj->userdata = data;
             }
-            data->num_bound++;
+            if (ncpus) {
+                data->num_bound++;
+            }
             /* error out if adding a proc would cause overload and that wasn't allowed,
              * and it wasn't a default binding policy (i.e., the user requested it)
              */