
Fix resource usage tracking and remove stale mapper #1383

Merged 1 commit into openpmix:master on Jul 19, 2022

Conversation

@rhc54 (Contributor) commented Jul 15, 2022

Fix resource usage tracking for map/bind operations

Tracking solely at the slot level doesn't adequately protect against
overlapping CPU assignments and other more complex mapping requests.
Modify the resource usage tracking to operate at the CPU level and
combine the mapping/binding operation into a single pass.

Ranking must still be done as a second pass, but restrict the options
to simplify implementation and avoid confusion. Update the help output
to reflect the changes.

Allow the DVM to also support "do-not-launch" directives for testing
purposes, and to accept simulated node/topologies.

Fix a few other minor problems along the way.

Signed-off-by: Ralph Castain <rhc@pmix.org>

Remove the mindist mapper - no maintainer

Signed-off-by: Ralph Castain <rhc@pmix.org>
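
For illustration, the CPU-level accounting described above amounts to roughly the following, sketched with hwloc bitmaps; the function and variable names are illustrative, not the PR's actual code:

    #include <hwloc.h>

    /* Mark the CPUs a proc is bound to as in-use on its node. A bare
     * slot counter cannot detect overlapping CPU assignments, but a
     * bitmap inclusion check can. */
    static int sketch_claim_cpus(hwloc_bitmap_t node_avail_cpus,
                                 hwloc_const_bitmap_t proc_binding)
    {
        if (!hwloc_bitmap_isincluded(proc_binding, node_avail_cpus)) {
            return -1;   /* some requested CPU is already in use */
        }
        /* remove the claimed CPUs from the node's available set */
        hwloc_bitmap_andnot(node_avail_cpus, node_avail_cpus, proc_binding);
        return 0;
    }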

@rhc54 (Contributor Author) commented Jul 16, 2022

@jjhursey Your CI contains an error in one of its cmd lines:

prterun --map-by ppr:5:node --host 192.168.210.115:5,192.168.160.63:5 hostname

cannot possibly succeed. By definition, the --host option will only assign one slot to each of the specified hosts. Thus, this cmd line is required to error out with the "not enough slots" error, which is precisely what it does.

The cmd line would work for the case where the allocation was given by a resource manager. It just cannot work in environments where the allocation is given by a hostfile or is only given via the --host cmd line option.

I'll correct this in the pmix-tests repo, but wanted to explain why it had to be changed.

@rhc54 (Contributor Author) commented Jul 16, 2022

Funny enough, it actually worked in one environment - which leads me to suspect you must have specified a default hostfile in there somewhere. I don't rule out that it might be working/failing randomly due to some other factor. However, it should fail every time as-written.

@rhc54 (Contributor Author) commented Jul 16, 2022

Ignore the noise - I saw that the cmd line does indeed specify the :5 slot number as it should. Something else must be going on.

@rhc54 force-pushed the topic/pe branch 2 times, most recently from 791d15e to c07ce0e on July 16, 2022 22:22
@rhc54 (Contributor Author) commented Jul 17, 2022

@jjhursey Afraid I am going to need your help with the darned PGI case again. I simply cannot replicate this behavior. It appears that the node pointer in the PMIX_LIST_FOREACH_SAFE loop is not getting advanced for some reason. Can you try to find the reason for this behavior?
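
For reference, the traversal pattern in question looks roughly like this; a sketch assuming the usual PMIX_LIST_FOREACH_SAFE(item, next, list, type) signature, that prte_node_t's list-item base is named super, and using the slots/slots_inuse fields shown in the allocation dumps below:

    prte_node_t *node, *nd_next;

    PMIX_LIST_FOREACH_SAFE(node, nd_next, node_list, prte_node_t) {
        if (node->slots_inuse >= node->slots) {
            /* removing the current item should be safe here because
             * the macro cached its successor in nd_next */
            pmix_list_remove_item(node_list, &node->super);
            PMIX_RELEASE(node);
        }
    }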

@rhc54 (Contributor Author) commented Jul 17, 2022

Update: I tried modifying the CI script to isolate the test and then adding debug output. Best I could find is that the "foreach" list macro in src/mca/rmaps/ppr/rmaps_ppr.c is in fact operating correctly. However, all the nodes on the list report the same hostname when running in the PGI environment.

I was unable to get any further diagnostics from it, so I can't tell if something strange is in the environment (making all the nodes resolve to the same hostname) or if something odd is happening inside PRRTE. Finally had to give up - this will require someone with access to the VM to debug it.

I reset the CI scripts and removed the debug I had added to the PR. I may have to do a little more cleanup to ensure I reverted things cleanly, but otherwise this should be good-to-go.

@rhc54 (Contributor Author) commented Jul 17, 2022

@jjhursey I believe the indirect-multi tests are in fact running correctly, but that the signature has changed due to the revised default ranking algorithms. Could you please double-check to confirm, and if that is correct, then update the signatures? I have found that I cannot replicate things closely enough here to regenerate them for your CI setup.

@jjhursey (Member):

bot:ibm:retest

I can take a look at the PGI issue. The new CI environment does set

PRTE_MCA_prte_default_hostfile=/opt/mpi/etc/hostfile

So there is a default hostfile (trying to emulate a properly scheduled environment). We then use the --host option for the few tests that need to restrict down from there.

@rhc54 (Contributor Author) commented Jul 18, 2022

Thanks! I just pushed a debug change in for the dmodex test, so expect some extra verbiage from the OOB. If it gets in your way, I can revert it and work the dmodex problem later.

@rhc54 (Contributor Author) commented Jul 18, 2022

I have removed that OOB debug - sorry for the noise.

Comment on lines 709 to 710
pmix_list_remove_item(node_list, cur_node_item);
pmix_list_prepend(node_list, cur_node_item);
@jjhursey (Member):

While debugging the PGI issue I saw the allocated_nodes list become corrupted after calling the prte_rmaps_base_get_starting_point function at the bottom of the prte_rmaps_base_get_target_nodes function. These two lines seemed to be the cause.

This fixed the PGI issue, but I'm a bit afraid that it is masking a different issue.

Suggested change:

-pmix_list_remove_item(node_list, cur_node_item);
-pmix_list_prepend(node_list, cur_node_item);
+if (pmix_list_get_first(node_list) != cur_node_item) {
+    pmix_list_remove_item(node_list, cur_node_item);
+    pmix_list_prepend(node_list, cur_node_item);
+}

Can you give me a scenario where you would need to execute this section of code to move the node to the front of the list?

@rhc54 (Contributor Author):

Sure - when doing a comm_spawn, you want to find the best node that can be used (hopefully as close to the bookmark as possible) and use it first when mapping the next job. So you search per the criteria above those lines and then move the selected starting point to the front of the list.

What is disturbing to me is that (a) that code has been there a long time without causing trouble, and (b) removing the first item from the list and then putting it back should not cause corruption. Either the corruption is coming before that point, or (more likely) the prte versions of those list functions worked but the pmix versions of them have a problem.

Probably worth taking a peek at the difference to see what might be going on. IIRC, I think we removed some thread locks in the PMIx versions - not sure why that would be causing a problem here (as we should be in an event), but maybe it is?

@rhc54 (Contributor Author):

I dug thru the old PRRTE list functions and compared them to the PMIx functions we now use. There were only two differences:

  1. PRRTE uses atomics to modify the refcount and PMIx does not. I don't think this is the issue as we only typically access lists from inside events. Still, something to keep in mind.
  2. PMIx was missing a couple of type casts. These were right in the "remove_first" and "remove_last" inline functions, which means they might have had an impact here. I added the type casts in "Properly cast the list_item_t" (openpmix#2642).

Let's see if that makes a difference - if not, then I'll add your fix.
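
For illustration, the shape of such an inline helper and the cast in question, with generic stand-in types rather than the actual openpmix definitions (the real fix lives in openpmix#2642):

    #include <stddef.h>

    typedef struct sketch_item {
        volatile struct sketch_item *next;
        volatile struct sketch_item *prev;
    } sketch_item_t;

    typedef struct {
        sketch_item_t sentinel;   /* self-linked when the list is empty */
        size_t length;
    } sketch_list_t;

    static inline sketch_item_t *sketch_remove_first(sketch_list_t *list)
    {
        sketch_item_t *item;

        if (0 == list->length) {
            return NULL;
        }
        /* the explicit casts strip the volatile qualifier from the raw
         * link pointers -- the kind of cast the fix above added */
        item = (sketch_item_t *) list->sentinel.next;
        ((sketch_item_t *) item->prev)->next = item->next;
        ((sketch_item_t *) item->next)->prev = item->prev;
        list->length--;
        return item;
    }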

@jjhursey (Member):

I encountered another odd scenario while pushing on this branch (I'm not sure if it is specific to the branch, though).

No MCA envar set. I have a hostfile:

shell$ cat hostfile_4_slots 
192.168.95.100 slots=4
192.168.21.86 slots=4
192.168.53.129 slots=4

Running (three nodes at 4 slots each, so exactly the 12 slots requested):

shell$ prterun --hostfile hostfile_4_slots --np 12 --mca ras_base_verbose 100 ./hello
...
======================   ALLOCATED NODES   ======================
    192.168.95.100: slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: SLOTS_GIVEN
	aliases: NONE
    192.168.21.86: slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: SLOTS_GIVEN
	aliases: NONE
    192.168.53.129: slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: SLOTS_GIVEN
	aliases: NONE
=================================================================

======================   ALLOCATED NODES   ======================
    ci-pmix-pgi-upstream-cn-0: slots=4 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 192.168.95.100
    ci-pmix-pgi-upstream-cn-1: slots=4 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 192.168.21.86
    ci-pmix-pgi-upstream-cn-2: slots=4 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 192.168.53.129
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 12
slots that were requested by the application:

  ./hello

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

@jjhursey (Member):

Actually the error above seems to be the problem with the debug example in the other test cases. The prterun is just not mapping so the test is failing.

@rhc54 (Contributor Author) commented Jul 18, 2022

Actually the error above seems to be the problem with the debug example in the other test cases. The prterun is just not mapping so the test is failing.

Not sure what you mean by this statement - are you saying prterun isn't mapping due to the corruption you found? Or are you saying it isn't mapping in general?

@jjhursey (Member):

Even with the fix I noted the following fails with the hostfile I provided above:
prterun --hostfile hostfile_4_slots --np 12 --mca ras_base_verbose 100 ./hello

It looks like rmaps_rr_mappers.c is not proceeding to the next node once it hits the slot limit on the current one.
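
Roughly, the behavior I'd expect is the following; a sketch only, with illustrative names rather than the actual rmaps_rr_mappers.c code:

    typedef struct {
        int slots;
        int slots_inuse;
    } sketch_node_t;

    /* round-robin by slot: place num_procs ranks, advancing past any
     * node whose slots are exhausted instead of erroring out early */
    static int sketch_map_by_slot(sketch_node_t *nodes, int num_nodes,
                                  int num_procs)
    {
        int n = 0;
        for (int rank = 0; rank < num_procs; rank++) {
            while (n < num_nodes && nodes[n].slots_inuse >= nodes[n].slots) {
                n++;   /* the advance that appears to be missing */
            }
            if (n >= num_nodes) {
                return -1;   /* genuinely out of slots */
            }
            nodes[n].slots_inuse++;   /* assign this rank to node n */
        }
        return 0;
    }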

@rhc54 (Contributor Author) commented Jul 18, 2022

Oh? Weird - I'll have to take a look at that one.

@rhc54 (Contributor Author) commented Jul 18, 2022

I think I found this last problem - working on it now. Not sure what to do about the list corruption just yet. Probably needs more investigation.

@rhc54 (Contributor Author) commented Jul 19, 2022

Hmmm...well, GNU ran fine except for this last debugger test:

01:07:00 Starting testcase pid 76982: './indirect-multi --num-nodes 3 --hostfile hostfile_5_slots prterun --hostfile hostfile_5_slots --np 12 ./hello '
01:07:03 Verify stdout/stderr for testcase indirect-multi
10,15c10,12
< daemon-0            : [<host>:PID<?>:@NS<1>@2:0:PID<?>] Local proctable received for nspace '<host>:PID<?>:@NS<2>' has 5 entries
< daemon-0            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 0 exec hello
< daemon-0            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 1 exec hello
< daemon-0            : Proctable[2], namespace <host>:PID<?>:@NS<2> rank 2 exec hello
< daemon-0            : Proctable[3], namespace <host>:PID<?>:@NS<2> rank 3 exec hello
< daemon-0            : Proctable[4], namespace <host>:PID<?>:@NS<2> rank 4 exec hello
---
> daemon-0            : [<host>:PID<?>:@NS<1>@2:0:PID<?>] Local proctable received for nspace '<host>:PID<?>:@NS<2>' has 2 entries
> daemon-0            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 10 exec hello
> daemon-0            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 11 exec hello
32,36c29,33
< daemon-1            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 5 exec hello
< daemon-1            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 6 exec hello
< daemon-1            : Proctable[2], namespace <host>:PID<?>:@NS<2> rank 7 exec hello
< daemon-1            : Proctable[3], namespace <host>:PID<?>:@NS<2> rank 8 exec hello
< daemon-1            : Proctable[4], namespace <host>:PID<?>:@NS<2> rank 9 exec hello
---
> daemon-1            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 0 exec hello
> daemon-1            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 1 exec hello
> daemon-1            : Proctable[2], namespace <host>:PID<?>:@NS<2> rank 2 exec hello
> daemon-1            : Proctable[3], namespace <host>:PID<?>:@NS<2> rank 3 exec hello
> daemon-1            : Proctable[4], namespace <host>:PID<?>:@NS<2> rank 4 exec hello
51,53c48,53
< daemon-2            : [<host>:PID<?>:@NS<1>@2:2:PID<?>] Local proctable received for nspace '<host>:PID<?>:@NS<2>' has 2 entries
< daemon-2            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 10 exec hello
< daemon-2            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 11 exec hello
---
> daemon-2            : [<host>:PID<?>:@NS<1>@2:2:PID<?>] Local proctable received for nspace '<host>:PID<?>:@NS<2>' has 5 entries
> daemon-2            : Proctable[0], namespace <host>:PID<?>:@NS<2> rank 5 exec hello
> daemon-2            : Proctable[1], namespace <host>:PID<?>:@NS<2> rank 6 exec hello
> daemon-2            : Proctable[2], namespace <host>:PID<?>:@NS<2> rank 7 exec hello
> daemon-2            : Proctable[3], namespace <host>:PID<?>:@NS<2> rank 8 exec hello
> daemon-2            : Proctable[4], namespace <host>:PID<?>:@NS<2> rank 9 exec hello

I'm not sure I understand the output here. Based on the cmd line, daemon 0 should host ranks 0-4, daemon 1 should host ranks 5-9, and daemon 2 should host ranks 10-11. Is that what happened? It looks instead like we wound up starting on daemon 1 and then mapping around - is that correct?

I can try to reproduce this locally.

On the PGI test case, it simply stops at run.sh for prun-wrapper without any output. I'll try adding your patch.

@rhc54 (Contributor Author) commented Jul 19, 2022

@jjhursey Still seeing a few warnings from the MPIR shim:

mpirshim.c: In function 'register_launcher_complete_handler':
mpirshim.c:1135:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_launcher_ready_handler':
mpirshim.c:1184:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_launcher_terminate_handler':
mpirshim.c:1235:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_application_terminate_handler':
mpirshim.c:1286:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
  CCLD     libmpirshim.la
  CCLD     mpirc
  CC       libmpirshimtest_la-mpirshim.lo
mpirshim.c: In function 'register_launcher_complete_handler':
mpirshim.c:1135:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_launcher_ready_handler':
mpirshim.c:1184:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_launcher_terminate_handler':
mpirshim.c:1235:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
mpirshim.c: In function 'register_application_terminate_handler':
mpirshim.c:1286:26: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
     pmix_status_t event, rc;
                          ^~
  CCLD     libmpirshimtest.la
make[1]: Leaving directory '/opt/ci/support/pmix-tests/prrte/mpir-shim/mpir-to-pmix-guide/src'
Making all in test
make[1]: Entering directory '/opt/ci/support/pmix-tests/prrte/mpir-shim/mpir-to-pmix-guide/test'
  CC       mpirshim_test-mpirshim_test.o
mpirshim_test.c:148:6: warning: no previous prototype for 'MPIR_Breakpoint_hook' [-Wmissing-prototypes]
 void MPIR_Breakpoint_hook(void)
      ^~~~~~~~~~~~~~~~~~~~
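
The usual remedy for this class of warning is to either act on the return code or drop the variable; a minimal sketch, where some_pmix_call is a hypothetical stand-in for whatever sets rc:

    #include <stdio.h>
    #include <pmix.h>

    pmix_status_t rc;

    rc = some_pmix_call();   /* hypothetical call whose result was being dropped */
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "handler registration failed: %s\n",
                PMIx_Error_string(rc));
    }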

@rhc54 (Contributor Author) commented Jul 19, 2022

So what I'm seeing is that the ppr, seq, and rank-file mappers are working just fine (the latter two are tested locally and not in CI). The round-robin mappers also seem to be working okay, but are starting from the wrong node for some reason (at least in the indirect-multi case). I'll have to explore and see why - need to reproduce the spawn cmd that indirect-multi is generating to see if I can make it fail.

Specifically, it is this cmd line that is the only one failing:

./indirect-multi --num-nodes 3 --hostfile hostfile_5_slots prterun --hostfile hostfile_5_slots --np 12 ./hello 


Ensure output goes only to subscribing tools and fix cleanup
for jobs with multiple cpus/rank

Signed-off-by: Ralph Castain <rhc@pmix.org>

Correctly restore usage from binding to multi-cpu regions

Signed-off-by: Ralph Castain <rhc@pmix.org>

Tools are always not bound

Signed-off-by: Ralph Castain <rhc@pmix.org>

Don't require bind support if bind-to-none is active

Signed-off-by: Ralph Castain <rhc@pmix.org>

Fix colocate

Signed-off-by: Ralph Castain <rhc@pmix.org>

Fix round-robin mapping to ensure that the mapper
moves to the next node when it fills all slots
on the current one.

Signed-off-by: Ralph Castain <rhc@pmix.org>

Add revised patch from @jjhursey - do not
reorder node list if the first node on the
list is the desired one.

Signed-off-by: Ralph Castain <rhc@pmix.org>

@rhc54 (Contributor Author) commented Jul 19, 2022

@jjhursey I finally fixed it, but I'm disturbed by the problem/fix. Your patch didn't quite resolve the problems with that last indirect-multi command. I had to revise it a bit as we were still getting the wrong node at the start of the list.

I did some digging around to see if the prepend list function is used elsewhere. Obviously, we use remove first in a lot of places, so I doubt that is causing the problem. Prepend is used less often, but it is used in the code base. However, it isn't clear if those code paths are actually ever traversed - it is possible therefore that pmix_list_prepend has been bad for quite some time and just doesn't get used.
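
As a sanity reference, these are the invariants a doubly-linked prepend has to maintain; a generic sketch with stand-in types, not openpmix's actual implementation. If any one of the four pointer updates were wrong, later traversals would look corrupted exactly as described:

    #include <stddef.h>

    typedef struct item { struct item *next, *prev; } item_t;
    typedef struct { item_t sentinel; size_t length; } list_t;

    static void sketch_prepend(list_t *list, item_t *item)
    {
        item_t *old_head = list->sentinel.next;
        item->prev = &list->sentinel;   /* new head points back at sentinel */
        item->next = old_head;
        old_head->prev = item;          /* old head points back at new head */
        list->sentinel.next = item;     /* sentinel now points at new head */
        list->length++;
    }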

What bothers me is that I didn't touch any of that code path involving the bookmark. I only modified the mapping algorithms themselves. Assembling the list of nodes and traversing them - I didn't change that at all. So why did this now break?

Leaves me a little uneasy that the corruption is actually occurring elsewhere and we are only seeing it when we get to that spot in the procedure. I just can't nail down where it might be happening. Any further guidance would be appreciated.

Thanks for all the help!

@jjhursey (Member):

Yeah, things are running clean for me too now with your branch. I'm not sure why the list operations would cause an issue - I reviewed their implementations and they seemed ok. Running valgrind on prterun --hostfile hostfile_4_slots --np 12 ./hello didn't show anything unusual, so 🤷 We'll just have to see if PGI trips over something in the future.

@rhc54 merged commit 4b41c8e into openpmix:master on Jul 19, 2022
@rhc54 deleted the topic/pe branch on July 19, 2022 14:06