Open MPI 2.1.0: MPI_Finalize hangs because cuIpcCloseMemHandle fails #3244

Closed
Evgueni-Petrov-aka-espetrov opened this issue Mar 28, 2017 · 10 comments

Evgueni-Petrov-aka-espetrov commented Mar 28, 2017

Hi Open MPI,

Thank you very much for fixing #3042!

We want to switch from version 2.0.2 to 2.1.0, which contains the fix, but when we do, our application starts hanging in MPI_Finalize.
From our point of view, this behavior is a regression in version 2.1.0 with respect to version 2.0.2.

First, MPI_Finalize warns that cuIpcCloseMemHandle failed with return value 4 (CUDA_ERROR_DEINITIALIZED), and then it prints the following messages in a loop:

[hostname:87484] Sleep on 87484
[hostname:87483] Sleep on 87483
[hostname:87478] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed
[hostname:87478] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed

...
gdb shows the following stack:

#0  0x00007f12df23393d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f12df2337d4 in __sleep (seconds=0)
    at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x00007f12d59f63bd in cuda_closememhandle ()
   from /home/espetrov/sandbox/install_mpi/lib/libmca_common_cuda.so.20
#3  0x00007f12d55e93c9 in mca_rcache_rgpusm_finalize ()
   from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_rcache_rgpusm.so
#4  0x00007f12deca5b92 in mca_rcache_base_module_destroy ()
   from /home/espetrov/sandbox/install_mpi/lib/libopen-pal.so.20
#5  0x00007f12d435e57a in mca_btl_smcuda_del_procs ()
   from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_btl_smcuda.so
#6  0x00007f12d51e1042 in mca_bml_r2_del_procs ()
   from /home/espetrov/sandbox/install_mpi/lib/openmpi/mca_bml_r2.so
#7  0x00007f12dfaa2918 in ompi_mpi_finalize ()
   from /home/espetrov/sandbox/install_mpi/lib/libmpi.so.20

I am not sure, but it looks like MPI_Finalize tries to close a remote memory handle after the remote MPI process has already unloaded libcuda.so.

Perhaps getting CUDA_ERROR_DEINITIALIZED from cuIpcCloseMemHandle at this point is harmless?
Our CUDA version is 7.5, and the CUDA driver version is 361.93.02.
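
For reference, the numeric code can be mapped back to its symbolic name with the driver API itself instead of grepping cuda.h. Here is a minimal standalone sketch (not Open MPI code; it assumes CUDA 6.0 or newer, which provides cuGetErrorName/cuGetErrorString, and that cuda.h and libcuda are on the build paths):

/* err2name.c -- minimal sketch (not part of Open MPI): look up a CUDA
 * driver API error code by number instead of reading cuda.h.
 * Build: gcc err2name.c -o err2name -lcuda   (with cuda.h on the include path)
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main(int argc, char **argv)
{
    /* Default to 4, the value reported by cuIpcCloseMemHandle here. */
    CUresult code = (CUresult) (argc > 1 ? atoi(argv[1]) : 4);
    const char *name = NULL;
    const char *desc = NULL;

    /* These lookups do not need cuInit(), so they work even after the
     * driver has been deinitialized. */
    cuGetErrorName(code, &name);
    cuGetErrorString(code, &desc);

    printf("%d -> %s: %s\n", (int) code,
           name ? name : "unknown", desc ? desc : "unknown");
    return 0;
}

With no argument it looks up code 4, i.e. CUDA_ERROR_DEINITIALIZED, which means the driver was already shutting down when the handle was closed.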

Evgueni.

@jsquyres
Member

@sjeaugey This appears to be stuck in a CUDA call. Can you look into this?

@sjeaugey
Member

Indeed, looks like a problem on our part. I'll look into this.

@sjeaugey
Member

Maybe it's not even stuck. For some reason, there is a sleep(20) (!) for every cuIpcCloseMemHandle that fails. So if we're emptying our cache, there may be a lot of handles to close, and it can take a really long time!
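
To put rough numbers on that, here is a back-of-the-envelope sketch (illustrative only, not Open MPI code; the handle counts are made up, and the 20 s is simply the hard-coded sleep(20) in the error path):

/* sleep_cost.c -- back-of-the-envelope sketch: the pre-fix error path
 * sleeps 20 s for every cuIpcCloseMemHandle that fails, so a rank that
 * flushes n stale IPC registrations at finalize pays roughly 20*n
 * seconds before MPI_Finalize can return. */
#include <stdio.h>

int main(void)
{
    const int sleep_per_failure_s = 20;   /* the sleep(20) in cuda_closememhandle() */

    /* Hypothetical cache sizes, just to show how quickly this adds up. */
    for (int n = 10; n <= 1000; n *= 10) {
        int total = sleep_per_failure_s * n;
        printf("%4d failed closes -> %6d s (~%d min)\n", n, total, total / 60);
    }
    return 0;
}

Even a modest cache of stale IPC registrations is enough to make MPI_Finalize look hung rather than merely slow.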

I noticed yesterday that my MTT run hadn't finished by the morning, so I had to kill it, but I didn't have time to look into it. Same today. Maybe this is the reason.

I'm currently testing a fix that ignores CUDA_ERROR_DEINITIALIZED return codes and, most importantly, removes the sleep(20).

@sjeaugey
Member

I haven't been able to reproduce the bug so far.

I tested this patch:

diff --git a/opal/mca/common/cuda/common_cuda.c b/opal/mca/common/cuda/common_cuda.c
index 2ce3b20..d66f00b 100644
--- a/opal/mca/common/cuda/common_cuda.c
+++ b/opal/mca/common/cuda/common_cuda.c
@@ -1157,10 +1157,10 @@ int cuda_closememhandle(void *reg_data, mca_rcache_base_registration_t *reg)
     if (ctx_ok) {
         result = cuFunc.cuIpcCloseMemHandle((CUdeviceptr)cuda_reg->base.alloc_base);
         if (OPAL_UNLIKELY(CUDA_SUCCESS != result)) {
-            opal_show_help("help-mpi-common-cuda.txt", "cuIpcCloseMemHandle failed",
-                           true, result, cuda_reg->base.alloc_base);
-            opal_output(0, "Sleep on %d", getpid());
-            sleep(20);
+            if (CUDA_ERROR_DEINITIALIZED != result) {
+                opal_show_help("help-mpi-common-cuda.txt", "cuIpcCloseMemHandle failed",
+                true, result, cuda_reg->base.alloc_base);
+            }
             /* We will just continue on and hope things continue to work. */
         } else {
             opal_output_verbose(10, mca_common_cuda_output,

It compiles and works, but since I can't reproduce the bug, I can't confirm for sure that it fixes the problem.

@Evgueni-Petrov-aka-espetrov can you give it a try?

@hppritcha added this to the v2.1.1 milestone Mar 30, 2017

Evgueni-Petrov-aka-espetrov commented Mar 31, 2017

Thanks for the fix, @sjeaugey!
It works for our application.

@sjeaugey
Member

Thanks for sharing the result. @jsquyres, is it OK for me to push that patch to master directly, since Evgueni confirmed it fixes the issue?


rhc54 commented Mar 31, 2017

Why not just put it in a branch and submit a PR like normal? It would allow the CI to ensure nothing broke outside of this environment.

@sjeaugey
Member

Sure -- just takes more time. I'll submit a PR.


renganxu commented Apr 24, 2018

@sjeaugey I still have this problem when running the Horovod benchmark (an MPI framework for TensorFlow). I tried both Open MPI 2.1.1 and the latest 3.0.1, and the problem is still there. My CUDA version is 9.0.176 and the GPU driver is 387.26.

The following are the last few lines of my output for Open MPI 3.0.1:

The call to cuIpcCloseMemHandle failed. This is a warning and the program
will continue to run.
  cuIpcCloseMemHandle return value:   4
  address: 0x2ab6b8a00000
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[node023:07102] Sleep on 7102
[node023:07100] Sleep on 7100
[node023:07101] Sleep on 7101

@sjeaugey
Member

@hfutxrg This is expected, since the patch above has not been merged into 3.0.x, only into 3.1.x.
So I would suggest you try 3.1 and see whether you can still reproduce the issue.
https://www.open-mpi.org/software/ompi/v3.1/

Thanks!
