vader transport appears to leave SHM files laying around after successful termination #7220

Closed
mwheinz opened this issue Dec 4, 2019 · 11 comments

Comments


mwheinz commented Dec 4, 2019

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.1.4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Packaged with Intel OPA 10.10.0.0.445

Please describe the system on which you are running

Two back-to-back Xeon systems, one running RHEL 7.6 and the other RHEL 8.0.


Details of the problem

I was using OMPI to stress-test some minor changes to the OPA PSM library when I discovered that the vader transport appears to be leaking memory-mapped files.

I wrote a bash script to run the OSU micro-benchmarks in a continuous loop, alternating between the PSM2 MTL and the OFI MTL. After a 24-hour run, I ran into "resource exhausted" issues when trying to start new shells, execute shell scripts, etc.

Investigating, I found over 100k shared memory files in /dev/shm, all of the form vader_segment.<hostname>.<hex number>.<decimal number>.

It's not clear at this point that the shared memory files are the cause of the problems I had, but they certainly shouldn't be there!
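
For anyone trying to reproduce this, a quick way to count and inspect the leftover segments is something like the following (a minimal sketch; it assumes the default /dev/shm backing directory and the vader_segment.* naming shown above):

#!/bin/bash
# Count the leftover vader segment files, then list a few of them.
find /dev/shm -maxdepth 1 -name 'vader_segment.*' | wc -l
find /dev/shm -maxdepth 1 -name 'vader_segment.*' -exec ls -lh {} + | head -n 5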

Sample run lines:

mpirun --allow-run-as-root --oversubscribe -np 48 --mca osc pt2pt --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd -H hdsmpriv01,hdsmpriv02 ./mpi/pt2pt/osu_mbw_mr
mpirun --allow-run-as-root --oversubscribe -np 48 --mca osc pt2pt --mca pml cm --mca mtl psm2 -H hdsmpriv01,hdsmpriv02 ./mpi/pt2pt/osu_mbw_mr

Script that was used to run the benchmarks:

#!/bin/bash

# mpirun --mca mtl_base_verbose 10 --mca osc pt2pt --allow-run-as-root --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd -np 2 -H hdsmpriv01,hdsmpriv02 $PWD/IMB-EXT accumulate 2>&1 | tee a

OPTS1="--mca osc pt2pt --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2,ofi_rxd"
OPTS2="--mca osc pt2pt --mca pml cm --mca mtl psm2"
HOSTS="-H hdsmpriv01,hdsmpriv02"
N=48

TEST_PAIR=(./mpi/pt2pt/osu_bw
	./mpi/pt2pt/osu_bibw
	./mpi/pt2pt/osu_latency_mt
	./mpi/pt2pt/osu_latency
	./mpi/one-sided/osu_get_latency
	./mpi/one-sided/osu_put_latency
	./mpi/one-sided/osu_cas_latency
	./mpi/one-sided/osu_get_acc_latency
	./mpi/one-sided/osu_acc_latency
	./mpi/one-sided/osu_fop_latency
	./mpi/one-sided/osu_get_bw
	./mpi/one-sided/osu_put_bibw
	./mpi/one-sided/osu_put_bw
)
TEST_FULL=(
	./mpi/pt2pt/osu_mbw_mr
	./mpi/pt2pt/osu_multi_lat
	./mpi/startup/osu_init
	./mpi/startup/osu_hello
	./mpi/collective/osu_allreduce
	./mpi/collective/osu_scatter
	./mpi/collective/osu_iallgatherv
	./mpi/collective/osu_alltoallv
	./mpi/collective/osu_ireduce
	./mpi/collective/osu_alltoall
	./mpi/collective/osu_igather
	./mpi/collective/osu_allgatherv
	./mpi/collective/osu_iallgather
	./mpi/collective/osu_reduce
	./mpi/collective/osu_ialltoallv
	./mpi/collective/osu_ibarrier
	./mpi/collective/osu_ibcast
	./mpi/collective/osu_gather
	./mpi/collective/osu_barrier
	./mpi/collective/osu_iscatter
	./mpi/collective/osu_scatterv
	./mpi/collective/osu_igatherv
	./mpi/collective/osu_allgather
	./mpi/collective/osu_ialltoall
	./mpi/collective/osu_ialltoallw
	./mpi/collective/osu_reduce_scatter
	./mpi/collective/osu_iscatterv
	./mpi/collective/osu_gatherv
	./mpi/collective/osu_bcast
	./mpi/collective/osu_iallreduce)

while true; do
	echo "------------------------"
	date
	echo "------------------------"
	for t in ${TEST_PAIR[@]}
	do
		CMD="mpirun --allow-run-as-root -np 2 ${OPTS1} ${HOSTS} ${t}"
		
		echo "${CMD}"

		eval ${CMD}

		CMD="mpirun --allow-run-as-root -np 2 ${OPTS2} ${HOSTS} ${t}"
		
		echo "${CMD}"

		eval ${CMD}
	done
	for t in ${TEST_FULL[@]}
	do
		CMD="mpirun --allow-run-as-root --oversubscribe -np ${N} ${OPTS1} ${HOSTS} ${t}"
		
		echo "${CMD}"

		eval ${CMD}

		CMD="mpirun --allow-run-as-root --oversubscribe -np ${N} ${OPTS2} ${HOSTS} ${t}"
		
		echo "${CMD}"

		eval ${CMD}
	done
	sleep 60
done
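
Not part of the original script, but as a stopgap while debugging, a cleanup step could be added just before the "sleep 60" above; a minimal sketch, assuming passwordless ssh to the same two hosts and that no other MPI jobs have live segments in /dev/shm at that point:

# Hypothetical cleanup step: remove any vader segments left behind by
# the previous iteration on both hosts.
for h in hdsmpriv01 hdsmpriv02; do
	ssh ${h} "find /dev/shm -maxdepth 1 -name 'vader_segment.*' -delete"
done
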
@mwheinz mwheinz changed the title vader transport appears to leave SHM files laying around after termination vader transport appears to leave SHM files laying around after successful termination Dec 4, 2019

mwheinz commented Dec 4, 2019

Looks like this problem does not exist in 4.0.2. I haven't figured out which commit corrects the issue, however.


mwheinz commented Dec 4, 2019

Looks like this is known issue #6565. The fix is in master and 4.0.2 but not in the 3.1.x branch.

(Edit: this does not seem to be true.)

@mwheinz mwheinz closed this as completed Dec 9, 2019

mwheinz commented Dec 9, 2019

Okay - I tried backporting the patch from #6565 because it fit much of the description, but it does not actually fix the problem for 3.1.4. I tried testing 3.1.5 but failed to build it due to the GLIBC_PRIVATE issue.

@mwheinz mwheinz reopened this Dec 10, 2019
@maxhgerlach

This comment has been minimized.

hjelmn (Member) commented Jan 20, 2020

These files are supposed to be cleaned up by PMIx. Not sure why that isn't happening in this case.

@jsquyres (Member)

FWIW: we discussed this on the weekly OMPI call today:

  1. Open MPI >= v4.0.x uses PMIx 3.x, which has a "register to do something at job shutdown" hook. Hence, in Open MPI master and >= v4.0, we shouldn't be seeing these leftover files. If we are, it's a bug.
  2. Open MPI < v4.0.x uses PMIx 2.x, which does not have the "register to do something at job shutdown" hook. @hjelmn today said he'd look at the logic / workarounds we were supposed to have in place in the v3.0.x / v3.1.x trees and make sure those are working as well as they can. (A quick way to check which PMIx component a given build uses is sketched below.)
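
For reference, one way to see which internal PMIx component a given Open MPI build uses (a hedged sketch; component names such as pmix2x, pmix3x, or ext3x depend on the release and on whether an external PMIx was configured):

# List the PMIx-related MCA components reported by this installation.
ompi_info | grep -i pmix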

rhc54 (Contributor) commented Jan 23, 2020

I examined OMPI v4.0.2 and it appears to be doing everything correctly (ditto for master). I cannot see any reason why it would be leaving those files behind. Even the terminate-by-signal path flows through the cleanup.

No real ideas here - can anyone replicate this behavior? I can't on my VMs - it all works correctly.


mkre commented Jan 23, 2020

@rhc54, I can confirm it's working with 4.0.2. However, I can reliably reproduce the behavior using 3.1.x.

I think it's the same underlying issue I'm running into in #7308: if another user left behind a segment file and there is a segment file name conflict with my current job, the run will abort with "permission denied" because the existing segment file can't be opened.

As @jsquyres pointed out, it seems to be an issue with PMIx 2.x. While @hjelmn is looking into possible workarounds, I'm wondering whether we can use PMIx 3.x with Open MPI 3.1.5.
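
To illustrate the conflict described above, checking the ownership of the stale segments shows why a second user hits "permission denied" (a minimal sketch; same /dev/shm assumption as earlier):

# Show owner and permissions of any leftover vader segments; a segment
# owned by another user cannot be reopened by the current job.
find /dev/shm -maxdepth 1 -name 'vader_segment.*' -exec ls -l {} + | head -n 5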

@maxhgerlach

Sorry for the confusion: It was a bug in our setup. I can now confirm that /dev/shm/vader* files are cleaned up after SIGTERM in Open MPI 4.0.2.

@awlauria (Contributor)

@mwheinz can you check whether #10040 fixes this issue for you? I noticed the same thing on master recently.

@awlauria (Contributor)

I misread this issue. It appears it only happens in the Open MPI v3 series, which is frozen. Since it is fixed in v4 and beyond, this should probably be closed.

I confirmed that #10040 is a master/v5 regression - it works on v4/v4.1.

v5.0.x PR: #10046
