mindist mapper broken on master #1623

Closed
jsquyres opened this issue May 3, 2016 · 8 comments

Comments

@jsquyres
Member

jsquyres commented May 3, 2016

A bunch of Mellanox Jenkins runs have been failing on master with this kind of error:

18:32:14 + /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/bin/mpirun -np 8 --map-by dist -mca rmaps_dist_device mlx4_0 -x TEST_CLOSEST_NUMA -x TEST_PHYS_ID_COUNT -x TEST_CORE_ID_COUNT /scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace-2/jenkins_scripts/jenkins/ompi/mindist_test
18:32:16 
18:32:16 Success rank - 4: only one NUMA is scheduled.
18:32:16 
18:32:16 Success rank - 6: only one NUMA is scheduled.
18:32:16 
18:32:16 Success rank - 0: only one NUMA is scheduled.
18:32:16 
18:32:16 Success rank - 2: only one NUMA is scheduled.
18:32:16 
18:32:16 Error rank - 5: scheduled on wrong NUMA node - 1, should be 0
18:32:16 
18:32:16 Error rank - 3: scheduled on wrong NUMA node - 1, should be 0
18:32:16 
18:32:16 Error rank - 1: scheduled on wrong NUMA node - 1, should be 0
18:32:16 
18:32:16 Error rank - 7: scheduled on wrong NUMA node - 1, should be 0
18:32:17 -------------------------------------------------------
18:32:17 Primary job  terminated normally, but 1 process returned
18:32:17 a non-zero exit code. Per user-direction, the job has been aborted.
18:32:17 -------------------------------------------------------
18:32:17 --------------------------------------------------------------------------
18:32:17 mpirun detected that one or more processes exited with non-zero status, thus causing
18:32:17 the job to be terminated. The first process to do so was:
18:32:17 
18:32:17   Process name: [[34935,1],5]
18:32:17   Exit code:    1
18:32:17 --------------------------------------------------------------------------
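
As a reading aid for the log above, here is a minimal sketch that tallies the per-rank results printed by mindist_test. It is not part of the Jenkins scripts: the function name `summarize_mindist_log` and the regular expression are assumptions based only on the line format shown in this issue, and the sample text is copied from the log above with blank lines dropped.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log in this issue, e.g.:
#   "18:32:16 Error rank - 5: scheduled on wrong NUMA node - 1, should be 0"
#   "18:32:16 Success rank - 4: only one NUMA is scheduled."
RANK_LINE = re.compile(r"(Success|Error) rank - (\d+):")


def summarize_mindist_log(text):
    """Group rank numbers by the status word printed for each rank."""
    ranks = defaultdict(list)
    for status, rank in RANK_LINE.findall(text):
        ranks[status].append(int(rank))
    return {status: sorted(found) for status, found in ranks.items()}


SAMPLE = """\
18:32:16 Success rank - 4: only one NUMA is scheduled.
18:32:16 Success rank - 6: only one NUMA is scheduled.
18:32:16 Success rank - 0: only one NUMA is scheduled.
18:32:16 Success rank - 2: only one NUMA is scheduled.
18:32:16 Error rank - 5: scheduled on wrong NUMA node - 1, should be 0
18:32:16 Error rank - 3: scheduled on wrong NUMA node - 1, should be 0
18:32:16 Error rank - 1: scheduled on wrong NUMA node - 1, should be 0
18:32:16 Error rank - 7: scheduled on wrong NUMA node - 1, should be 0
"""

if __name__ == "__main__":
    for status, found in summarize_mindist_log(SAMPLE).items():
        print(status, found)
```

Run against this log, the even ranks (0, 2, 4, 6) are grouped under Success and the odd ranks (1, 3, 5, 7) under Error, which matches the "scheduled on wrong NUMA node - 1, should be 0" messages.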

In #1612 (comment), @jladd-mlnx reported that the PMIx external component changes (i.e., the changes from that PR) seem to be what broke this test.

@rhc54 Can you investigate?

@jladd-mlnx
Member

I see it in Brice's unmerged PR too. The HWLOC change makes sense as a cause, but how has it crept into master? I don't see it on 2.x. Strange...

#1584

@jsquyres
Member Author

jsquyres commented May 3, 2016

@jladd-mlnx Did you git bisect and find that the place where it came into master was the commit that came from #1612?

@jladd-mlnx
Member

No, I was just combing through closed PRs to see where it first showed up. It may have nothing to do with this commit. It also shows up in Brice's unmerged HWLOC update, which would make more sense, but then I don't understand how it would have leaked into master. I'm not sure I will have time to chase much down today. If I get a few spare cycles, I can bisect.

@jsquyres
Member Author

jsquyres commented May 3, 2016

Discussed on teleconf today: @rhc54 is going to have a look, but will need help in testing it because he doesn't have any HCAs.

@jsquyres
Member Author

jsquyres commented May 4, 2016

@jladd-mlnx @Di0gen Can you temporarily disable the mindist test while we know that it is broken? It's making all other PRs fail the Mellanox Jenkins test. We can re-enable it when mindist gets fixed.

@jladd-mlnx
Member

Done. I plan to re-enable it in one week.

On Wed, May 4, 2016 at 6:58 AM, Jeff Squyres wrote:

> @jladd-mlnx @Di0gen Can you temporarily disable the mindist test while we know that it is broken? It's making all other PR's fail the Mellanox Jenkins test. We can re-enable it when mindist gets fixed.

@rhc54 added this to the v3.0.0 milestone May 4, 2016

@ggouaillardet
Contributor

@jladd-mlnx Does Mellanox Jenkins explicitly configure ompi with external pmix?
If yes, then both external hwloc and libevent are required.
If not, then external pmix should not be built, and this is a bug.
@rhc54 Do you agree with the last statement?

@rhc54
Contributor

rhc54 commented May 5, 2016

Yes, that would certainly be true. However, someone on the call looked back at the logs and claimed it started with a different PR, so there is some uncertainty here.
