
Fix resource usage tracking and remove stale mapper #1383

Merged 1 commit into openpmix:master on Jul 19, 2022

Conversation

@rhc54 (Contributor) commented Jul 15, 2022

Fix resource usage tracking for map/bind operations

Tracking solely at the slot level doesn't adequately protect against
overlapping CPU assignments and other more complex mapping requests.
Modify the resource usage tracking to operate at the CPU level and
combine the mapping/binding operation into a single pass.

Ranking must still be done as a second pass, but restrict the options
to simplify implementation and avoid confusion. Update the help output
to reflect the changes.

Allow the DVM to also support "do-not-launch" directives for testing
purposes, and to accept simulated node/topologies.

Fix a few other minor problems along the way.

Signed-off-by: Ralph Castain <rhc@pmix.org>

Remove the mindist mapper - no maintainer

Signed-off-by: Ralph Castain <rhc@pmix.org>
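
For illustration, the CPU-level accounting described above amounts to roughly the following, sketched with hwloc bitmaps; the function and variable names are illustrative, not the PR's actual code:

    #include <hwloc.h>

    /* Mark the CPUs a proc is bound to as in-use on its node. A bare
     * slot counter cannot detect overlapping CPU assignments, but a
     * bitmap inclusion check can. */
    static int sketch_claim_cpus(hwloc_bitmap_t node_avail_cpus,
                                 hwloc_const_bitmap_t proc_binding)
    {
        if (!hwloc_bitmap_isincluded(proc_binding, node_avail_cpus)) {
            return -1;   /* some requested CPU is already in use */
        }
        /* remove the claimed CPUs from the node's available set */
        hwloc_bitmap_andnot(node_avail_cpus, node_avail_cpus, proc_binding);
        return 0;
    }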

@rhc54 (Contributor Author) commented Jul 16, 2022

@jjhursey Your CI contains an error in one of its cmd lines:

prterun --map-by ppr:5:node --host 192.168.210.115:5,192.168.160.63:5 hostname

cannot possibly succeed. By definition, the --host option will only assign one slot to each of the specified hosts. Thus, this cmd line is required to error out with the "not enough slots" error, which is precisely what it does.

The cmd line would work for the case where the allocation was given by a resource manager. It just cannot work in environments where the allocation is given by a hostfile or is only given via the --host cmd line option.

I'll correct this in the pmix-tests repo, but wanted to explain why it had to be changed.

@rhc54 (Contributor Author) commented Jul 16, 2022

Funny enough, it actually worked in one environment - which leads me to suspect you must have specified a default hostfile in there somewhere. I don't rule out that it might be working/failing randomly due to some other factor. However, it should fail every time as-written.

@rhc54 (Contributor Author) commented Jul 16, 2022

Ignore the noise - I saw that the cmd line does indeed specify the :5 slot number as it should. Something else must be going on.

@rhc54 force-pushed the topic/pe branch 2 times, most recently from 791d15e to c07ce0e on July 16, 2022 22:22
@rhc54 (Contributor Author) commented Jul 17, 2022

@jjhursey Afraid I am going to need your help with the darned PGI case again. I simply cannot replicate this behavior. It appears that the node pointer in the PMIX_LIST_FOREACH_SAFE loop is not getting advanced for some reason. Can you try to find the reason for this behavior?
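
For reference, the traversal pattern in question looks roughly like this; a sketch assuming the usual PMIX_LIST_FOREACH_SAFE(item, next, list, type) signature, that prte_node_t's list-item base is named super, and using the slots/slots_inuse fields shown in the allocation dumps below:

    prte_node_t *node, *nd_next;

    PMIX_LIST_FOREACH_SAFE(node, nd_next, node_list, prte_node_t) {
        if (node->slots_inuse >= node->slots) {
            /* removing the current item should be safe here because
             * the macro cached its successor in nd_next */
            pmix_list_remove_item(node_list, &node->super);
            PMIX_RELEASE(node);
        }
    }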

@rhc54 (Contributor Author) commented Jul 17, 2022

Update: I tried modifying the CI script to isolate the test and then adding debug output. Best I could find is that the "foreach" list macro in src/mca/rmaps/ppr/rmaps_ppr.c is in fact operating correctly. However, all the nodes on the list report the same hostname when running in the PGI environment.

I was unable to get any further diagnostics from it, so I can't tell if something strange is in the environment (making all the nodes resolve to the same hostname) or if something odd is happening inside PRRTE. Finally had to give up - this will require someone with access to the VM to debug it.

I reset the CI scripts and removed the debug I had added to the PR. I may have to do a little more cleanup to ensure I reverted things cleanly, but otherwise this should be good-to-go.

@rhc54 (Contributor Author) commented Jul 17, 2022

@jjhursey I believe the indirect-multi tests are in fact running correctly, but that the signature has changed due to the revised default ranking algorithms. Could you please double-check to confirm, and if that is correct, then update the signatures? I have found that I cannot replicate things closely enough here to regenerate them for your CI setup.

@jjhursey (Member):

bot:ibm:retest

I can take a look at the PGI issue. The new CI environment does set

PRTE_MCA_prte_default_hostfile=/opt/mpi/etc/hostfile

So there is a default hostfile (trying to emulate a properly scheduled environment). We then use the --host option for the few tests that need to restrict down from there.

@rhc54 (Contributor Author) commented Jul 18, 2022

Thanks! I just pushed a debug change in for the dmodex test, so expect some extra verbiage from the OOB. If it gets in your way, I can revert it and work the dmodex problem later.

@rhc54 (Contributor Author) commented Jul 18, 2022

I have removed that OOB debug - sorry for the noise.

Comment on lines 709 to 710
pmix_list_remove_item(node_list, cur_node_item);
pmix_list_prepend(node_list, cur_node_item);
@jjhursey (Member):

While debugging the PGI issue I saw the allocated_nodes list become corrupted after calling the prte_rmaps_base_get_starting_point function at the bottom of the prte_rmaps_base_get_target_nodes function. These two lines seemed to be the cause.

This fixed the PGI issue, but I'm a bit afraid that it is masking a different issue.

Suggested change:

-pmix_list_remove_item(node_list, cur_node_item);
-pmix_list_prepend(node_list, cur_node_item);
+if (pmix_list_get_first(node_list) != cur_node_item) {
+    pmix_list_remove_item(node_list, cur_node_item);
+    pmix_list_prepend(node_list, cur_node_item);
+}

Can you give me a scenario where you would need to execute this section of code to move the node to the front of the list?

@rhc54 (Contributor Author):

Sure - when doing a comm_spawn, you want to find the best node that can be used (hopefully as close to the bookmark as possible) and use it first when mapping the next job. So you search per the criteria above those lines and then move the selected starting point to the front of the list.

What is disturbing to me is that (a) that code has been there a long time without causing trouble, and (b) removing the first item from the list and then putting it back should not cause corruption. Either the corruption is coming before that point, or (more likely) the prte versions of those list functions worked but the pmix versions of them have a problem.

Probably worth taking a peek at the difference to see what might be going on. IIRC, I think we removed some thread locks in the PMIx versions - not sure why that would be causing a problem here (as we should be in an event), but maybe it is?

@rhc54 (Contributor Author):

I dug thru the old PRRTE list functions and compared them to the PMIx functions we now use. There were only two differences:

  1. PRRTE uses atomics to modify the refcount and PMIx does not. I don't think this is the issue as we only typically access lists from inside events. Still, something to keep in mind.
  2. PMIx was missing a couple of type casts. These were right in the "remove_first" and "remove_last" inline functions, which means they might have had an impact here. I added the type casts in "Properly cast the list_item_t" (openpmix#2642).

Let's see if that makes a difference - if not, then I'll add your fix.
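
For illustration, the shape of such an inline helper and the cast in question, with generic stand-in types rather than the actual openpmix definitions (the real fix lives in openpmix#2642):

    #include <stddef.h>

    typedef struct sketch_item {
        volatile struct sketch_item *next;
        volatile struct sketch_item *prev;
    } sketch_item_t;

    typedef struct {
        sketch_item_t sentinel;   /* self-linked when the list is empty */
        size_t length;
    } sketch_list_t;

    static inline sketch_item_t *sketch_remove_first(sketch_list_t *list)
    {
        sketch_item_t *item;

        if (0 == list->length) {
            return NULL;
        }
        /* the explicit casts strip the volatile qualifier from the raw
         * link pointers -- the kind of cast the fix above added */
        item = (sketch_item_t *) list->sentinel.next;
        ((sketch_item_t *) item->prev)->next = item->next;
        ((sketch_item_t *) item->next)->prev = item->prev;
        list->length--;
        return item;
    }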

@jjhursey (Member):

I encountered another odd scenario while pushing on this branch (I'm not sure if it is specific to the branch, though).

No MCA envar set. I have a hostfile:

shell$ cat hostfile_4_slots 
192.168.95.100 slots=4
192.168.21.86 slots=4
192.168.53.129 slots=4

Running (three nodes at 4 slots each, so exactly the 12 slots requested):

shell$ prterun --hostfile hostfile_4_slots --np 12 --mca ras_base_verbose 100 ./hello
...
======================   ALLOCATED NODES   ======================
    192.168.95.100: slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: SLOTS_GIVEN
	aliases: NONE
    192.168.21.86: slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: SLOTS_GIVEN
	aliases: NONE
    192.168.53.129: slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: SLOTS_GIVEN
	aliases: NONE
=================================================================

======================   ALLOCATED NODES   ======================
    ci-pmix-pgi-upstream-cn-0: slots=4 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 192.168.95.100
    ci-pmix-pgi-upstream-cn-1: slots=4 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 192.168.21.86
    ci-pmix-pgi-upstream-cn-2: slots=4 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 192.168.53.129
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 12
slots that were requested by the application:

  ./hello

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

@jjhursey (Member):

Actually the error above seems to be the problem with the debug example in the other test cases. The prterun is just not mapping so the test is failing.

@rhc54 (Contributor Author) commented Jul 18, 2022

Actually the error above seems to be the problem with the debug example in the other test cases. The prterun is just not mapping so the test is failing.

Not sure what you mean by this statement - are you saying prterun isn't mapping due to the corruption you found? Or are you saying it isn't mapping in general?

@jjhursey (Member):

Even with the fix I noted the following fails with the hostfile I provided above:
prterun --hostfile hostfile_4_slots --np 12 --mca ras_base_verbose 100 ./hello

It looks like rmaps_rr_mappers.c is not proceeding to the next node once it hits the slot limit on the current one.
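
Roughly, the behavior I'd expect is the following; a sketch only, with illustrative names rather than the actual rmaps_rr_mappers.c code:

    typedef struct {
        int slots;
        int slots_inuse;
    } sketch_node_t;

    /* round-robin by slot: place num_procs ranks, advancing past any
     * node whose slots are exhausted instead of erroring out early */
    static int sketch_map_by_slot(sketch_node_t *nodes, int num_nodes,
                                  int num_procs)
    {
        int n = 0;
        for (int rank = 0; rank < num_procs; rank++) {
            while (n < num_nodes && nodes[n].slots_inuse >= nodes[n].slots) {
                n++;   /* the advance that appears to be missing */
            }
            if (n >= num_nodes) {
                return -1;   /* genuinely out of slots */
            }
            nodes[n].slots_inuse++;   /* assign this rank to node n */
        }
        return 0;
    }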

@rhc54 (Contributor Author) commented Jul 18, 2022

Oh? Weird - I'll have to take a look at that one.

@rhc54 (Contributor Author) commented Jul 18, 2022

I think I found this last problem - working on it now. Not sure what to do about the list corruption just yet. Probably needs more investigation.

@rhc54 (Contributor Author) commented Jul 19, 2022

Hmmm...well, GNU ran fine except for this last debugger test:

01:07:00 Starting testcase pid 76982: './indirect-multi --num-nodes 3 --hostfile hostfile_5_slots prterun --hostfile hostfile_5_slots --np 12 ./hello '
01:07:03 Verify stdout/stderr for testcase indirect-multi
10,15c10,12
< daemon-0            : [<host>:PID<?>:@NS<1>@2:0:PID<?>] Local proctable received for nspace '<host>:PID<?>:@NS<2>' has 5 entries
< daemon-0            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 0 exec hello
< daemon-0            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 1 exec hello
< daemon-0            : Proctable[2], namespace <host>:PID<?>:@NS<2> rank 2 exec hello
< daemon-0            : Proctable[3], namespace <host>:PID<?>:@NS<2> rank 3 exec hello
< daemon-0            : Proctable[4], namespace <host>:PID<?>:@NS<2> rank 4 exec hello
---
> daemon-0            : [<host>:PID<?>:@NS<1>@2:0:PID<?>] Local proctable received for nspace '<host>:PID<?>:@NS<2>' has 2 entries
> daemon-0            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 10 exec hello
> daemon-0            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 11 exec hello
32,36c29,33
< daemon-1            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 5 exec hello
< daemon-1            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 6 exec hello
< daemon-1            : Proctable[2], namespace <host>:PID<?>:@NS<2> rank 7 exec hello
< daemon-1            : Proctable[3], namespace <host>:PID<?>:@NS<2> rank 8 exec hello
< daemon-1            : Proctable[4], namespace <host>:PID<?>:@NS<2> rank 9 exec hello
---
> daemon-1            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 0 exec hello
> daemon-1            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 1 exec hello
> daemon-1            : Proctable[2], namespace <host>:PID<?>:@NS<2> rank 2 exec hello
> daemon-1            : Proctable[3], namespace <host>:PID<?>:@NS<2> rank 3 exec hello
> daemon-1            : Proctable[4], namespace <host>:PID<?>:@NS<2> rank 4 exec hello
51,53c48,53
< daemon-2            : [<host>:PID<?>:@NS<1>@2:2:PID<?>] Local proctable received for nspace '<host>:PID<?>:@NS<2>' has 2 entries
< daemon-2            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 10 exec hello
< daemon-2            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 11 exec hello
---
> daemon-2            : [<host>:PID<?>:@NS<1>@2:2:PID<?>] Local proctable received for nspace '<host>:PID<?>:@NS<2>' has 5 entries
> daemon-2            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 5 exec hello
> daemon-2            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 6 exec hello
> daemon-2            : Proctable[2], namespace <host>:PID<?>:@NS<2> rank 7 exec hello
> daemon-2            : Proctable[3], namespace <host>:PID<?>:@NS<2> rank 8 exec hello
> daemon-2            : Proctable[4], namespace <host>:PID<?>:@NS<2> rank 9 exec hello

I'm not sure I understand the output here. Based on the cmd line, daemon 0 should host ranks 0-4, daemon 1 should host ranks 5-9, and daemon 2 should host ranks 10-11. Is that what happened? It looks instead like we wound up starting on daemon 1 and then mapping around - is that correct?

I can try to reproduce this locally.

On the PGI test case, it simply stops at run.sh for prun-wrapper without any output. I'll try adding your patch.

@rhc54 (Contributor Author) commented Jul 19, 2022

@jjhursey Still seeing a few warnings from the MPIR shim:

mpirshim.c: In function 'register_launcher_complete_handler':
mpirshim.c:1135:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_launcher_ready_handler':
mpirshim.c:1184:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_launcher_terminate_handler':
mpirshim.c:1235:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_application_terminate_handler':
mpirshim.c:1286:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
  CCLD     libmpirshim.la
  CCLD     mpirc
  CC       libmpirshimtest_la-mpirshim.lo
mpirshim.c: In function 'register_launcher_complete_handler':
mpirshim.c:1135:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_launcher_ready_handler':
mpirshim.c:1184:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_launcher_terminate_handler':
mpirshim.c:1235:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_application_terminate_handler':
mpirshim.c:1286:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
  CCLD     libmpirshimtest.la
make[1]: Leaving directory '/opt/ci/support/pmix-tests/prrte/mpir-shim/mpir-to-pmix-guide/src'
Making all in test
make[1]: Entering directory '/opt/ci/support/pmix-tests/prrte/mpir-shim/mpir-to-pmix-guide/test'
  CC       mpirshim_test-mpirshim_test.o
mpirshim_test.c:148:6: warning: no previous prototype for 'MPIR_Breakpoint_hook' [-Wmissing-prototypes]
 void MPIR_Breakpoint_hook(void)
      ^~~~~~~~~~~~~~~~~~~~
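
The usual remedy for this class of warning is to either act on the return code or drop the variable; a minimal sketch, where some_pmix_call is a hypothetical stand-in for whatever sets rc:

    #include <stdio.h>
    #include <pmix.h>

    pmix_status_t rc;

    rc = some_pmix_call();   /* hypothetical call whose result was being dropped */
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "handler registration failed: %s\n",
                PMIx_Error_string(rc));
    }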

@rhc54 (Contributor Author) commented Jul 19, 2022

So what I'm seeing is that the ppr, seq, and rank-file mappers are working just fine (the latter two are tested locally and not in CI). The round-robin mappers also seem to be working okay, but are starting from the wrong node for some reason (at least in the indirect-multi case). I'll have to explore and see why - need to reproduce the spawn cmd that indirect-multi is generating to see if I can make it fail.

Specifically, it is this cmd line that is the only one failing:

./indirect-multi --num-nodes 3 --hostfile hostfile_5_slots prterun --hostfile hostfile_5_slots --np 12 ./hello 


Ensure output goes only to subscribing tools and fix cleanup
for jobs with multiple cpus/rank

Signed-off-by: Ralph Castain <rhc@pmix.org>

Correctly restore usage from binding to multi-cpu regions

Signed-off-by: Ralph Castain <rhc@pmix.org>

Tools are always not bound

Signed-off-by: Ralph Castain <rhc@pmix.org>

Don't require bind support if bind-to-none is active

Signed-off-by: Ralph Castain <rhc@pmix.org>

Fix colocate

Signed-off-by: Ralph Castain <rhc@pmix.org>

Fix round-robin mapping to ensure that the mapper
moves to the next node when it fills all slots
on the current one.

Signed-off-by: Ralph Castain <rhc@pmix.org>

Add revised patch from @jjhursey - do not
reorder node list if the first node on the
list is the desired one.

Signed-off-by: Ralph Castain <rhc@pmix.org>

@rhc54 (Contributor Author) commented Jul 19, 2022

@jjhursey I finally fixed it, but I'm disturbed by the problem/fix. Your patch didn't quite resolve the problems with that last indirect-multi command. I had to revise it a bit as we were still getting the wrong node at the start of the list.

I did some digging around to see if the prepend list function is used elsewhere. Obviously, we use remove first in a lot of places, so I doubt that is causing the problem. Prepend is used less often, but it is used in the code base. However, it isn't clear if those code paths are actually ever traversed - it is possible therefore that pmix_list_prepend has been bad for quite some time and just doesn't get used.
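
As a sanity reference, these are the invariants a doubly-linked prepend has to maintain; a generic sketch with stand-in types, not openpmix's actual implementation. If any one of the four pointer updates were wrong, later traversals would look corrupted exactly as described:

    #include <stddef.h>

    typedef struct item { struct item *next, *prev; } item_t;
    typedef struct { item_t sentinel; size_t length; } list_t;

    static void sketch_prepend(list_t *list, item_t *item)
    {
        item_t *old_head = list->sentinel.next;
        item->prev = &list->sentinel;   /* new head points back at sentinel */
        item->next = old_head;
        old_head->prev = item;          /* old head points back at new head */
        list->sentinel.next = item;     /* sentinel now points at new head */
        list->length++;
    }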

What bothers me is that I didn't touch any of that code path involving the bookmark. I only modified the mapping algorithms themselves. Assembling the list of nodes and traversing them - I didn't change that at all. So why did this now break?

Leaves me a little uneasy that the corruption is actually occurring elsewhere and we are only seeing it when we get to that spot in the procedure. I just can't nail down where it might be happening. Any further guidance would be appreciated.

Thanks for all the help!

@jjhursey (Member):

Yeah, things are running clean for me too now with your branch. I'm not sure why the list operations would cause an issue - I reviewed their implementations and they seemed ok. Running valgrind on prterun --hostfile hostfile_4_slots --np 12 ./hello didn't show anything unusual, so 🤷 We'll just have to see if PGI trips over something in the future.

@rhc54 merged commit 4b41c8e into openpmix:master on Jul 19, 2022
@rhc54 deleted the topic/pe branch on July 19, 2022 14:06