
MPI_Finalize slow with CUDA + OpenMPI 2.x (known OpenMPI issue; fixed in 3.1) #2698

Closed
mhoemmen opened this issue May 8, 2018 · 9 comments

Labels: CLOSED_DUE_TO_INACTIVITY, Framework tasks, MARKED_FOR_CLOSURE, system: gpu

Comments


mhoemmen commented May 8, 2018

Tpetra::CrsMatrix UnitTests2 takes > 560s in a CUDA 8 release build on K80. Seriously, what's going on? Do I need different KOKKOS_ARCH settings? It would have been nice to have some performance tracking so we could have caught this earlier. I don't think this is anything we did; we've only been fixing CUDA issues over time.

36/145 Test  #36: TpetraCore_CrsMatrix_UnitTests2_MPI_4 .......................................................   Passed  561.02 sec

@trilinos/tpetra


mhoemmen commented May 8, 2018

I ran this test with verbose mode enabled. It looks like a known OpenMPI issue, open-mpi/ompi#3244:

37: Summary: total = 48, run = 48, passed = 48, failed = 0
37:
37: End Result: TEST PASSED
37: --------------------------------------------------------------------------
37: The call to cuIpcCloseMemHandle failed. This is a warning and the program
37: will continue to run.
37:   cuIpcCloseMemHandle return value:   1
37:   address: 0x20d120000
37: Check the cuda.h file for what the return value means. Perhaps a reboot
37: of the node will clear the problem.
37: --------------------------------------------------------------------------
37: [...:113163] Sleep on 113163
37: [...:113164] Sleep on 113164
37: [...:113166] Sleep on 113166
37: [...:113165] Sleep on 113165
37: [...:113158] 3 more processes have sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed
37: [...:113158] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
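One stopgap, not mentioned in this thread but commonly suggested when cuIpcCloseMemHandle misbehaves, is to disable CUDA IPC in OpenMPI's smcuda BTL via the `btl_smcuda_use_cuda_ipc` MCA parameter, at the cost of slower intra-node GPU-to-GPU transfers. A sketch; the test binary name below is illustrative, not an actual ctest target path:

```shell
#!/bin/sh
# Stopgap sketch: turn off CUDA IPC so OpenMPI never opens the IPC memory
# handles whose teardown (cuIpcCloseMemHandle) fails during MPI_Finalize.
# The binary name below is illustrative, not a real Trilinos test path.
MCA_FLAGS="--mca btl_smcuda_use_cuda_ipc 0"
echo "mpirun -np 4 $MCA_FLAGS ./TpetraCore_CrsMatrix_UnitTests2.exe"
```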


mhoemmen commented May 8, 2018

I am using OpenMPI 2.0.1, which, according to the link above, exhibits the issue. The patch that fixes it was reportedly merged for OpenMPI 3.1.

@bartlettroscoe @nmhamster @micahahoward might want to know about this.

@bartlettroscoe (Member) commented:

@mhoemmen said:

@bartlettroscoe @nmhamster @micahahoward might want to know about this.

We are not seeing runtimes like that in any of the current Trilinos builds (including all of the ATDM Trilinos builds) as shown today, for example, at:

The max runtime for that test shown there is 31 seconds.

@mhoemmen mhoemmen changed the title Tpetra::CrsMatrix UnitTests2 takes > 560s in a CUDA 8 release build on K80 MPI_Finalize slow with CUDA + OpenMPI 2.x (known OpenMPI issue; fixed in 3.1) May 8, 2018

mhoemmen commented May 8, 2018

Today: @trilinos/tpetra suggests adding a configure-time test to Tpetra's CMake logic, to detect the OpenMPI version and report a warning (not an error) if it's one of the versions known to have this issue.


kddevin commented May 8, 2018

@trilinos/framework The test should be at the Trilinos CMake level rather than at the Tpetra level, right? Codes could call MPI_Finalize() without enabling Tpetra.
Can @trilinos/framework add the CMake logic?


mhoemmen commented May 9, 2018

@kddevin wrote:

The test should be at the Trilinos CMake level rather than at the Tpetra level, right? Codes could call MPI_Finalize() without enabling Tpetra.

I agree. The test applies to any package that uses both MPI and CUDA. It can't be in Kokkos, because Kokkos (at least Core) does not depend on MPI. STK depends on Kokkos(Core) and MPI, but not Tpetra. Thus, it makes practical sense for the test to live at the Trilinos CMake level.
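Mechanically, such a check could shell out to `mpirun --version` (which prints a line like `mpirun (Open MPI) 2.0.1`) from the top-level CMake logic, e.g. via execute_process. The core predicate is sketched below as a standalone shell function; the function name is made up, and the affected range (everything before 3.1) is taken from this thread, not from an authoritative OpenMPI changelog:

```shell
#!/bin/sh
# Configure-time check sketch: warn when the detected OpenMPI release
# predates the MPI_Finalize/cuIpcCloseMemHandle fix (reportedly merged
# for 3.1, per open-mpi/ompi#3244). Hypothetical helper, not actual
# Trilinos CMake. The version string is passed in for testability; a
# real check would feed it:  mpirun --version | head -n 1
check_ompi_version() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*(Open MPI) \([0-9][0-9.]*\).*/\1/p')
  major=${ver%%.*}        # e.g. 2 from 2.0.1
  rest=${ver#*.}
  minor=${rest%%.*}       # e.g. 0 from 2.0.1
  if [ -n "$ver" ] && { [ "$major" -lt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -eq 0 ]; }; }; then
    echo "WARNING: OpenMPI $ver may hang in MPI_Finalize with CUDA (fixed in 3.1)"
  else
    echo "OpenMPI $ver looks OK"
  fi
}

check_ompi_version "mpirun (Open MPI) 2.0.1"
# prints: WARNING: OpenMPI 2.0.1 may hang in MPI_Finalize with CUDA (fixed in 3.1)
```

If the stop-with-an-error idea discussed later were adopted, the same predicate would simply drive message(FATAL_ERROR ...) instead of message(WARNING ...) on the CMake side.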

@kddevin kddevin added Framework tasks Framework tasks (used internally by Framework team) and removed pkg: Tpetra TpetraRF labels May 10, 2018

@kddevin

@jwillenbring and I chatted about this on the phone today. I think it's a higher priority to fix the Dashboard CUDA builds so they use the right version of OpenMPI. Jim asked whether it would make more sense for the configure process to stop with an error instead of just printing a warning; I thought that would be good, but there would always be that one user: https://xkcd.com/1172/

bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Jan 24, 2019
Makes it easier to load modules and run utilities out of tribits/python_utils.
Since tribits/ci_support depends on tibits/python_utils, this is not making
things any less general.

This will make it easier to write unit tests for cdash_build_testing_date.py.
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Apr 18, 2019
…2698)

This should complete the major features for a TriBITS-based install for
Trilinos that is robust to package build failures and correctly sets the
installed directory permissions.

Build/Test Cases Summary
Enabled Packages:
Enabled all Packages
0) MPI_DEBUG => passed: passed=353,notpassed=0 (1.14 min)
1) SERIAL_RELEASE => passed: passed=353,notpassed=0 (1.20 min)
Other local commits for this build/test group: 747ad3d, be2522b, 4903a53
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Apr 19, 2019
…ault (trilinos/Trilinos#2698)

Turns out the default for <Project>_ENABLE_INSTALL_CMAKE_CONFIG_FILES is OFF,
not ON.  That was very confusing.

It is important that we test installs of TribitsExampleProject where there are
install failures and we ensure that the file <Project>Config.cmake gets
installed correctly and is usable when an installation fails.

github-actions bot commented Jun 9, 2021

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 9, 2021

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jul 10, 2021