This repository has been archived by the owner on Dec 9, 2022. It is now read-only.

Avoid shared memory bug in openmpi>=3.1.2 #523

Closed
leofang opened this issue Jan 26, 2019 · 5 comments · Fixed by #561

Comments

@leofang
Collaborator

leofang commented Jan 26, 2019

Currently we build v3.1.2 in nsls2-tag:

{% set version = "3.1.2" %}

However, this version seems to be buggy. If one spawns a few MPI processes, lets them do some work, and then terminates them abnormally (Ctrl-C and whatnot), shared memory segments belonging to openmpi's vader component (don't ask me what this is...) are left behind in /dev/shm/, because openmpi does not unlink them during the cleanup phase:

leofang@xf03id-srv5:~$ ls -lt /dev/shm | more
total 28464
-rw------- 1 leofang leofang 4194312 Jan 25 23:06 vader_segment.xf03id-srv5.73dc0001.2
-rw------- 1 leofang leofang 4194312 Jan 25 23:06 vader_segment.xf03id-srv5.73dc0001.3
-rw------- 1 leofang leofang 4194312 Jan 25 23:06 vader_segment.xf03id-srv5.73dc0001.1
-rw------- 1 leofang leofang 4194312 Jan 25 23:06 vader_segment.xf03id-srv5.73dc0001.0
-rw------- 1 xjhuang xjhuang 4194312 Jan 15 11:08 vader_segment.xf03id-srv5.32270001.2
-rw------- 1 xjhuang xjhuang 4194312 Jan 15 11:08 vader_segment.xf03id-srv5.32270001.0
-rw------- 1 xjhuang xjhuang 4194312 Jan 15 11:08 vader_segment.xf03id-srv5.32270001.3
-rw------- 1 xjhuang xjhuang 4194312 Jan 15 11:08 vader_segment.xf03id-srv5.32270001.1
-rw------- 1 xjhuang xjhuang 4194312 Jan 15 11:07 vader_segment.xf03id-srv5.358e0001.3
-rw------- 1 xjhuang xjhuang 4194312 Jan 15 11:07 vader_segment.xf03id-srv5.358e0001.2
......

and they will remain there until the system is rebooted, slowly eating up the system's memory!
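To see which leftover segments are sitting on a machine, something like the following could help. It is only a minimal sketch: the function name `find_stale_vader_segments` and its parameters are hypothetical, and using file age as a proxy for "stale" is an assumption of this example, not something openmpi provides. It does not check whether an MPI job is still running, so inspect the output before deleting anything.

```python
import os
import time
from pathlib import Path


def find_stale_vader_segments(shm_dir="/dev/shm", min_age_seconds=3600, uid=None):
    """List vader_segment.* files owned by `uid` that are at least
    `min_age_seconds` old (by modification time).

    Age is only a heuristic for "left behind by a dead job"; this
    helper never deletes anything.
    """
    if uid is None:
        uid = os.getuid()  # default to the current user (Unix only)
    now = time.time()
    stale = []
    for path in Path(shm_dir).glob("vader_segment.*"):
        try:
            st = path.stat()
        except FileNotFoundError:
            # The segment vanished between listing and stat; skip it.
            continue
        if st.st_uid == uid and now - st.st_mtime >= min_age_seconds:
            stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    for segment in find_stale_vader_segments():
        # Print only; remove manually once no MPI job is using the file.
        print(segment)
```

Running it as a script just prints the candidate paths for the current user; each user can only clean up their own segments, since the files are created with mode `-rw-------`.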

Based on openmpi's changelog (see here), it seems that vader was reworked in v3.1.2, so presumably this bug sneaked in then. There's a bug fix in v3.1.3 that hopefully addresses this; if not, we can downgrade to v3.1.1, which I tested and which works without this issue.

(UPDATE: v3.1.3 also has this problem, so we have to use 3.1.1...)

I have other questions related to building conda packages for mpi4py, openmpi, and mpich, but perhaps I should ask them somewhere else...

@tacaswell
Member

@leofang can you sync our recipe with the conda-forge one (which sadly for now is a copy-paste job 😞 )

@leofang
Collaborator Author

leofang commented Jan 29, 2019

Sure, happy to do so. But let me first check with the Open MPI people to see whether this is a known bug or is caused by other problems.

@leofang
Collaborator Author

leofang commented Jan 29, 2019

@tacaswell since you brought it up, I'd also like to copy and paste the conda-forge recipes for mpi4py and mpich. This would solve the MPI nuisance. Is that OK?

@stuartcampbell
Member

@leofang Definitely - go ahead

leofang changed the title from "Avoid shared memory bug in openmpi v3.1.2" to "Avoid shared memory bug in openmpi>=3.1.2" on Feb 15, 2019
leofang added a commit to leofang/lightsource2-recipes that referenced this issue Feb 15, 2019
@leofang
Collaborator Author

leofang commented Feb 16, 2019

ref: open-mpi/ompi/issues/6322

leofang added a commit to leofang/lightsource2-recipes that referenced this issue Feb 16, 2019
mrakitin pushed a commit that referenced this issue Mar 1, 2019