Fix resource usage tracking and remove stale mapper #1383
Conversation
@jjhursey Your CI contains an error in one of its cmd lines: the `…` cannot possibly succeed by definition. The cmd line would work for the case where the allocation was given by a resource manager. It just cannot work in environments where the allocation is given by a hostfile or is only given via the `…`. I'll correct this in the `…`.
Funny enough, it actually worked in one environment - which leads me to suspect you must have specified a default hostfile in there somewhere. I don't rule out that it might be working/failing randomly due to some other factor. However, it should fail every time as-written.
Ignore the noise - I saw that the cmd line does indeed specify the `…`.
Force-pushed from 791d15e to c07ce0e
@jjhursey Afraid I am going to need your help with the darned PGI case again. I simply cannot replicate this behavior. It appears that the node pointer in the `…`.
Update: I tried modifying the CI script to isolate the test and then adding debug output. The best I could find is that the "foreach" list macro in src/mca/rmaps/ppr/rmaps_ppr.c is in fact operating correctly. However, all the nodes on the list report the same hostname when running in the PGI environment. I was unable to get any further diagnostics from it, so I can't tell if something strange is in the environment (making all the nodes resolve to the same hostname) or if something odd is happening inside PRRTE. I finally had to give up - this will require someone with access to the VM to debug it. I reset the CI scripts and removed the debug I had added to the PR. I may have to do a little more cleanup to ensure I reverted things cleanly, but otherwise this should be good-to-go.
@jjhursey I believe the `…`.
bot:ibm:retest

I can take a look at the PGI issue. The new CI environment does set `…`. So there is a default hostfile (trying to emulate a properly scheduled environment). Then use the `…`.
Thanks! I just pushed a debug change in for the dmodex test, so expect some extra verbiage from the OOB. If it gets in your way, I can revert it and work the dmodex problem later.
I have removed that OOB debug - sorry for the noise.
```c
pmix_list_remove_item(node_list, cur_node_item);
pmix_list_prepend(node_list, cur_node_item);
```
While debugging the PGI issue I saw the allocated_nodes list become corrupted after calling the prte_rmaps_base_get_starting_point function at the bottom of the prte_rmaps_base_get_target_nodes function. These two lines seemed to be the cause.

This fixed the PGI issue, but I'm a bit afraid that it is masking a different issue.
Suggested change:

```diff
-    pmix_list_remove_item(node_list, cur_node_item);
-    pmix_list_prepend(node_list, cur_node_item);
+    if (pmix_list_get_first(node_list) != cur_node_item) {
+        pmix_list_remove_item(node_list, cur_node_item);
+        pmix_list_prepend(node_list, cur_node_item);
+    }
```
Can you give me a scenario where you would need to execute this section of code to move the node to the front of the list?
Sure - when doing a comm_spawn, you want to find the best node that can be used (hopefully as close to the bookmark as possible) and use it first when mapping the next job. So you search per the criteria above those lines and then move the selected starting point to the front of the list.
What is disturbing to me is that (a) that code has been there a long time without causing trouble, and (b) removing the first item from the list and then putting it back should not cause corruption. Either the corruption is coming before that point, or (more likely) the PRRTE versions of those list functions worked but the PMIx versions of them have a problem.
Probably worth taking a peek at the difference to see what might be going on. IIRC, I think we removed some thread locks in the PMIx versions - not sure why that would be causing a problem here (as we should be in an event), but maybe it is?
I dug thru the old PRRTE list functions and compared them to the PMIx functions we now use. There were only two differences:
- PRRTE uses atomics to modify the refcount and PMIx does not. I don't think this is the issue as we only typically access lists from inside events. Still, something to keep in mind.
- PMIx was missing a couple of type casts. These were right in the "remove_first" and "remove_last" inline functions, which means they might have had an impact here. I added the type casts in openpmix#2642 (Properly cast the list_item_t).
Let's see if that makes a difference - if not, then I'll add your fix.
I encountered another odd scenario while pushing on this branch (I'm not sure if it is specific to the branch, though). No MCA envar set. I have a hostfile:

Running:
Actually, the error above seems to be the problem with the debug example in the other test cases. The `…`.
Not sure what you mean by this statement - are you saying `…`?
Even with the fix, the following fails with the hostfile I provided above. It looks like rmaps_rr_mappers.c is not proceeding to the next node once it hits the slot limit.
Oh? Weird - I'll have to take a look at that one.
I think I found this last problem - working on it now. Not sure what to do about the list corruption just yet. Probably needs more investigation.
Hmmm...well, GNU ran fine except for this last debugger test:
I'm not sure I understand the output here. Based on the cmd line, daemon 0 should host ranks 0-4, daemon 1 should host ranks 5-9, and daemon 2 should host ranks 10-11. Is that what happened? It looks instead like we wound up starting on daemon 1 and then mapping around - is that correct? I can try to reproduce locally. On the PGI test case, it simply stops at `…`.
@jjhursey Still seeing a few warnings from the MPIR shim:
So what I'm seeing is that the `…`. Specifically, it is this cmd line that is the only one failing:

`./indirect-multi --num-nodes 3 --hostfile hostfile_5_slots prterun --hostfile hostfile_5_slots --np 12 ./hello`
Fix resource usage tracking for map/bind operations

Tracking solely at the slot level doesn't adequately protect against overlapping CPU assignments and other more complex mapping requests. Modify the resource usage tracking to operate at the CPU level and combine the mapping/binding operation into a single pass. Ranking must still be done as a second pass, but restrict the options to simplify implementation and avoid confusion. Update the help output to reflect the changes.

Allow the DVM to also support "do-not-launch" directives for testing purposes, and to accept simulated node/topologies. Fix a few other minor problems along the way.

Follow-on commits (each Signed-off-by: Ralph Castain <rhc@pmix.org>):

- Remove the mindist mapper - no maintainer
- Ensure output goes only to subscribing tools and fix cleanup for jobs with multiple cpus/rank
- Correctly restore usage from binding to multi-cpu regions
- Tools are always not bound
- Don't require bind support if bind-to-none is active
- Fix colocate
- Fix round-robin mapping to ensure that the mapper moves to the next node when it fills all slots on the current one
- Add revised patch from @jjhursey - do not reorder node list if the first node on the list is the desired one
@jjhursey I finally fixed it, but I'm disturbed by the problem/fix. Your patch didn't quite resolve the problems with that last `…`.

I did some digging around to see if the `…`.

What bothers me is that I didn't touch any of that code path involving the bookmark. I only modified the mapping algorithms themselves. Assembling the list of nodes and traversing them - I didn't change that at all. So why did this now break? It leaves me a little uneasy that the corruption is actually occurring elsewhere and we are only seeing it when we get to that spot in the procedure. I just can't nail down where it might be happening. Any further guidance would be appreciated. Thanks for all the help!
Yeah, things are running clean for me too now with your branch. I'm not sure why the list operations would cause an issue - I reviewed their implementations and they seemed ok.