
MPI_Finalize slow with CUDA + OpenMPI 2.x (known OpenMPI issue; fixed in 3.1) #2698

Closed
mhoemmen opened this issue May 8, 2018 · 9 comments

Labels: CLOSED_DUE_TO_INACTIVITY, Framework tasks, MARKED_FOR_CLOSURE, system: gpu

Comments


mhoemmen commented May 8, 2018

Tpetra::CrsMatrix UnitTests2 takes > 560s in a CUDA 8 release build on K80. Seriously, what's going on? Do I need different KOKKOS_ARCH settings? It would have been nice to have some performance tracking so we could have caught this earlier. I don't think this is anything we did; we've only been fixing CUDA issues over time.

36/145 Test  #36: TpetraCore_CrsMatrix_UnitTests2_MPI_4 .......................................................   Passed  561.02 sec

@trilinos/tpetra


mhoemmen commented May 8, 2018

I ran this test with verbose mode enabled. It looks like a known OpenMPI issue, open-mpi/ompi#3244:

37: Summary: total = 48, run = 48, passed = 48, failed = 0
37:
37: End Result: TEST PASSED
37: --------------------------------------------------------------------------
37: The call to cuIpcCloseMemHandle failed. This is a warning and the program
37: will continue to run.
37:   cuIpcCloseMemHandle return value:   1
37:   address: 0x20d120000
37: Check the cuda.h file for what the return value means. Perhaps a reboot
37: of the node will clear the problem.
37: --------------------------------------------------------------------------
37: [...:113163] Sleep on 113163
37: [...:113164] Sleep on 113164
37: [...:113166] Sleep on 113166
37: [...:113165] Sleep on 113165
37: [...:113158] 3 more processes have sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed
37: [...:113158] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
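One stopgap, not mentioned in this thread but commonly suggested when cuIpcCloseMemHandle misbehaves, is to disable CUDA IPC in OpenMPI's smcuda BTL via the `btl_smcuda_use_cuda_ipc` MCA parameter, at the cost of slower intra-node GPU-to-GPU transfers. A sketch; the test binary name below is illustrative, not an actual ctest target path:

```shell
#!/bin/sh
# Stopgap sketch: turn off CUDA IPC so OpenMPI never opens the IPC memory
# handles whose teardown (cuIpcCloseMemHandle) fails during MPI_Finalize.
# The binary name below is illustrative, not a real Trilinos test path.
MCA_FLAGS="--mca btl_smcuda_use_cuda_ipc 0"
echo "mpirun -np 4 $MCA_FLAGS ./TpetraCore_CrsMatrix_UnitTests2.exe"
```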


mhoemmen commented May 8, 2018

I am using OpenMPI 2.0.1, which, according to the link above, exhibits the issue. The patch that fixes it was reportedly merged for OpenMPI 3.1.

@bartlettroscoe @nmhamster @micahahoward might want to know about this.

@bartlettroscoe (Member) commented:

@mhoemmen said:

@bartlettroscoe @nmhamster @micahahoward might want to know about this.

We are not seeing runtimes like that in any of the current Trilinos builds (including all of the ATDM Trilinos builds) as shown today, for example, at:

The max runtime for that test shown there is 31 seconds.

@mhoemmen mhoemmen changed the title Tpetra::CrsMatrix UnitTests2 takes > 560s in a CUDA 8 release build on K80 MPI_Finalize slow with CUDA + OpenMPI 2.x (known OpenMPI issue; fixed in 3.1) May 8, 2018

mhoemmen commented May 8, 2018

Today: @trilinos/tpetra suggests adding a configure-time test to Tpetra's CMake logic, to detect the OpenMPI version and report a warning (not an error) if it's one of the versions known to have this issue.


kddevin commented May 8, 2018

@trilinos/framework The test should be at the Trilinos CMake level rather than at the Tpetra level, right? Codes could call MPI_Finalize() without enabling Tpetra.
Can @trilinos/framework add the CMake logic?


mhoemmen commented May 9, 2018

@kddevin wrote:

The test should be at the Trilinos CMake level rather than at the Tpetra level, right? Codes could call MPI_Finalize() without enabling Tpetra.

I agree. The test applies to any package that uses both MPI and CUDA. It can't be in Kokkos, because Kokkos (at least Core) does not depend on MPI. STK depends on Kokkos(Core) and MPI, but not Tpetra. Thus, it makes practical sense for the test to live at the Trilinos CMake level.
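Mechanically, such a check could shell out to `mpirun --version` (which prints a line like `mpirun (Open MPI) 2.0.1`) from the top-level CMake logic, e.g. via execute_process. The core predicate is sketched below as a standalone shell function; the function name is made up, and the affected range (everything before 3.1) is taken from this thread, not from an authoritative OpenMPI changelog:

```shell
#!/bin/sh
# Configure-time check sketch: warn when the detected OpenMPI release
# predates the MPI_Finalize/cuIpcCloseMemHandle fix (reportedly merged
# for 3.1, per open-mpi/ompi#3244). Hypothetical helper, not actual
# Trilinos CMake. The version string is passed in for testability; a
# real check would feed it:  mpirun --version | head -n 1
check_ompi_version() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*(Open MPI) \([0-9][0-9.]*\).*/\1/p')
  major=${ver%%.*}        # e.g. 2 from 2.0.1
  rest=${ver#*.}
  minor=${rest%%.*}       # e.g. 0 from 2.0.1
  if [ -n "$ver" ] && { [ "$major" -lt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -eq 0 ]; }; }; then
    echo "WARNING: OpenMPI $ver may hang in MPI_Finalize with CUDA (fixed in 3.1)"
  else
    echo "OpenMPI $ver looks OK"
  fi
}

check_ompi_version "mpirun (Open MPI) 2.0.1"
# prints: WARNING: OpenMPI 2.0.1 may hang in MPI_Finalize with CUDA (fixed in 3.1)
```

If the stop-with-an-error idea discussed later were adopted, the same predicate would simply drive message(FATAL_ERROR ...) instead of message(WARNING ...) on the CMake side.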

@kddevin kddevin added Framework tasks Framework tasks (used internally by Framework team) and removed pkg: Tpetra TpetraRF labels May 10, 2018

@kddevin

@jwillenbring and I chatted about this on the phone today. I think it's a higher priority to fix the Dashboard CUDA builds so they use the right version of OpenMPI. Jim asked whether it would make more sense for the configure process to stop with an error instead of just printing a warning; I thought that would be good, but there would always be that one user: https://xkcd.com/1172/

bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Jan 24, 2019
Makes it easier to load modules and run utilities out of tribits/python_utils.
Since tribits/ci_support depends on tibits/python_utils, this is not making
things any less general.

This will make it easier to write unit tests for cdash_build_testing_date.py.
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Apr 18, 2019
…2698)

This should complete the major features for a TriBITS-based install for
Trilinos that is robust to package build failures and correctly sets the
installed directory permissions.

Build/Test Cases Summary
Enabled Packages:
Enabled all Packages
0) MPI_DEBUG => passed: passed=353,notpassed=0 (1.14 min)
1) SERIAL_RELEASE => passed: passed=353,notpassed=0 (1.20 min)
Other local commits for this build/test group: 747ad3d, be2522b, 4903a53
bartlettroscoe added a commit to TriBITSPub/TriBITS that referenced this issue Apr 19, 2019
…ault (trilinos/Trilinos#2698)

Turns out the default for <Project>_ENABLE_INSTALL_CMAKE_CONFIG_FILES is OFF,
not ON.  That was very confusing.

It is important that we test installs of TribitsExampleProject where there are
install failures and we ensure that the file <Project>Config.cmake gets
installed correctly and is usable when an installation fails.

github-actions bot commented Jun 9, 2021

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 9, 2021

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jul 10, 2021