Tpetra::MultiVector: Don't use CudaUVMSpace for internal comm buffers #1088
Inside Tpetra::DistObject:
- Change packAndPrepareNew so that its exports buffer does not live in CudaUVMSpace.
- Change unpackAndCombineNew so that its imports buffer does not live in CudaUVMSpace.
I am on it; I was thinking I could get that in one batch through with the testing for KokkosKernels.
One step at a time :) We don't know if this is really the issue yet; also, we may have to do the same for dot products, and I haven't tested whether this builds yet ;-P
Digging around inside Tpetra_Import_Util2.hpp
Looks like it is compiling :-)
And it only broke the CrsMatrix UnitTests and the BlockCrsMatrix UnitTests, but those are not important, right ;-)?
This probably means the sync logic is bad inside pack and unpack ...
:( Note that we've never tested Tpetra on CUDA without UVM. On the other hand, pack and unpack for CrsGraph and CrsMatrix should use the old interface that takes Teuchos::Array*.
Just to summarize our debugging session this afternoon: Kokkos::DualView::modify checks whether the input template parameter has the same memory_space as the DualView. If not, it marks the "host view" modified. This means that, for example, if the DualView's memory_space is CudaUVMSpace and the input template parameter is CudaSpace, the DualView's host view will get marked as modified, contrary to expectations. The work-around is to make sure that the input memory_space is right.
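A minimal sketch of the pitfall and the work-around, assuming a CUDA-enabled Kokkos build and the template form of DualView::modify that Kokkos provided at the time (the names `example` and `dv` are just for illustration):

```c++
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

void example () {
  // DualView whose device memory_space is CudaUVMSpace.
  Kokkos::DualView<double*, Kokkos::CudaUVMSpace> dv ("dv", 100);

  // Intent: mark the device side modified.  However, since CudaSpace
  // differs from the DualView's memory_space (CudaUVMSpace), modify
  // marks the *host* view modified instead -- contrary to expectations.
  dv.modify<Kokkos::CudaSpace> ();

  // Work-around: pass the DualView's own device memory_space, so the
  // input template parameter matches exactly.
  dv.modify<decltype (dv)::t_dev::memory_space> ();
}
```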
@trilinos/tpetra (pinging the issue at MH's request)
I think @crtrott has the benchmark -- did we commit it to Tpetra?
I'm working on this now. The idea is to link up with @tjfulle's work on thread-parallel CrsMatrix pack and unpack, and then get rid of the "old" DistObject interface for good.
I'm getting some funny debugging results that suggest that SEMS' build of OpenMPI might not be CUDA-aware. In particular, I'm getting segfaults when I give CudaSpace buffers to MPI.
I've been wondering how to detect at compile time if MPI is CUDA-aware... that seems increasingly relevant.
@ibaned I would very much like that. If it's not CUDA-aware, it segfaults. Not sure if I want Trilinos to get into the business of setting SIGSEGV handlers, though I think PETSc does or did that as a way to catch user error.
OpenMPI 2.0.0 gives users a way to tell if it supports CUDA: https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-aware-support
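For reference, a minimal sketch of the compile-time and run-time checks that the linked FAQ describes. These extensions (mpi-ext.h, MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support) are specific to OpenMPI >= 2.0.0; other MPI implementations need not provide them.

```c++
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#  include <mpi-ext.h>  // defines MPIX_CUDA_AWARE_SUPPORT, if available
#endif
#include <cstdio>

int main (int argc, char* argv[]) {
  MPI_Init (&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  std::printf ("Compile time: this MPI library claims CUDA-aware support.\n");
#elif defined(MPIX_CUDA_AWARE_SUPPORT)
  std::printf ("Compile time: this MPI library was built without CUDA-aware support.\n");
#else
  std::printf ("Compile time: cannot tell whether MPI is CUDA aware.\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
  // Run-time query; returns 1 if CUDA-aware support is available.
  std::printf ("Run time: MPIX_Query_cuda_support() = %d\n",
               MPIX_Query_cuda_support ());
#endif

  MPI_Finalize ();
  return 0;
}
```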
Thanks for that link @mhoemmen!
@mhoemmen, yes, giving CudaSpace pointers to MPI is valid if the MPI implementation accepts it. I haven't used the SEMS MPI module for GPU builds.
@ambrad wrote:
That might explain it... How hard is it to configure, say, OpenMPI 2.0.x for CUDA?
@trilinos/tpetra This commit will help me prepare to work on #1088. All changes affect DistObject::doTransferNew only. Highlights:
1. Sum up totalImportPackets in thread-parallel fashion on device or host, depending on the commOnHost input argument. Before, this was only host thread-parallel.
2. Set modified flags for imports_ to avoid spurious Kokkos::DualView debug-mode errors, since we're using it in write-only mode.
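A sketch of what item 1 above could look like; the names countTotalImportPackets and packet_counts_type are hypothetical stand-ins for DistObject's actual code, and the host/device branch mirrors the commOnHost argument:

```c++
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Hypothetical stand-in for DistObject's numImportPacketsPerLID_ member.
using packet_counts_type = Kokkos::DualView<size_t*>;

size_t countTotalImportPackets (packet_counts_type numImportPacketsPerLID,
                                const bool commOnHost)
{
  size_t total = 0;
  if (commOnHost) {
    auto counts = numImportPacketsPerLID.h_view;  // host View
    Kokkos::parallel_reduce ("sumImportPackets(host)",
      Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace> (0, counts.extent (0)),
      KOKKOS_LAMBDA (const int i, size_t& lcl) { lcl += counts (i); },
      total);
  }
  else {
    auto counts = numImportPacketsPerLID.d_view;  // device View
    Kokkos::parallel_reduce ("sumImportPackets(device)",
      Kokkos::RangePolicy<packet_counts_type::t_dev::execution_space> (0, counts.extent (0)),
      KOKKOS_LAMBDA (const int i, size_t& lcl) { lcl += counts (i); },
      total);
  }
  return total;
}
```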
@trilinos/tpetra Add a CMake option Tpetra_ASSUME_CUDA_AWARE_MPI, with associated macro TPETRA_ASSUME_CUDA_AWARE_MPI defined in TpetraCore_config.h. If the CMake option is ON, Tpetra may assume that the MPI implementation it uses is CUDA aware. See #1571 for discussion, and #1088 for an application. The option currently defaults to OFF. #1571 requires that we actually have a test for this option, at least for OpenMPI, so we haven't finished the issue yet.
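Downstream C++ code can then consult the generated macro. A minimal sketch of that usage (the assumeCudaAwareMpi name is just for illustration, not Tpetra's actual code):

```c++
#include "TpetraCore_config.h"

// If the CMake option Tpetra_ASSUME_CUDA_AWARE_MPI was ON, the macro is
// defined, and Tpetra may hand device buffers directly to MPI.
constexpr bool assumeCudaAwareMpi =
#ifdef TPETRA_ASSUME_CUDA_AWARE_MPI
  true;
#else
  false;  // default: stage device data through host buffers for MPI
#endif
```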
@trilinos/tpetra This helps with trilinos#1088.
@trilinos/tpetra I added an all-reduce at the end of each test, to ensure that all processes made it through the test and passed. This should help with trilinos#1088.
Recent Kokkos changes and merging with @tjfulle's packCrsMatrix added some challenge to this task, but I'm making progress. For an interesting complication, see this OpenMPI issue:
I got Tpetra to pass tests with CUDA 8.0 and (CUDA-aware) OpenMPI 2.0.1. I'm working on Stokhos now.
@trilinos/tpetra This will help with trilinos#1088.
@trilinos/tpetra @trilinos/stokhos It's unfortunate that I had to merge these two fixes into one commit. The reason is that trilinos#1658 blocks trilinos#1088, but I couldn't reorder the two commits without the risk of losing information. The two commits' messages follow.

* Stokhos,Tpetra: Fix trilinos#1658 (Tpetra::MultiVector unpack kernels)

@trilinos/tpetra Tpetra::MultiVector's unpack kernels (used in unpackAndCombineNew) now take a Kokkos execution space argument. Before, the kernels used the output Kokkos::View (the MultiVector's data) to determine the execution space. The problem is that for a Kokkos::DualView of CudaUVMSpace, the host and device Views are the same. This blocks trilinos#1088, whose fix requires that the input View (the buffer to unpack) be either a CudaSpace or a HostSpace View. (CudaUVMSpace::execution_space == Cuda, so the kernel would always run on device, even given a HostSpace buffer to unpack.) The new execution space input argument lets the user specify the execution space on which to run. It's an execution space instance, so this could give us a future option to run in a separate CUDA stream (e.g., for overlap of communication and computation), or to run on a subset of threads. I also fixed the unpack kernel specializations in Stokhos.

* Tpetra: Fix trilinos#1088 (never give CUDA UVM allocations to MPI)

@trilinos/tpetra This commit does two things:
1. Change DistObject so that it never gives CUDA UVM (i.e., Kokkos::CudaUVMSpace) allocations to MPI sends or receives.
2. Make packCrsMatrix and unpackCrsMatrix use ETI (explicit template instantiation), to reduce build time and speed up debugging of the pack and unpack routines.

I did the first of these by insisting on Kokkos::CudaSpace, rather than Kokkos::CudaUVMSpace, for all of DistObject's communication buffers: imports_, numImportPacketsPerLID_, exports_, and numExportPacketsPerLID_. I added a public typedef to DistObject, buffer_device_type, to help users allocate communication buffers of the right type. This required changing Tpetra::MultiVector and Tpetra::Details::CooMatrix, which are currently the only two implementations of the "new" DistObject interface. This commit fixes trilinos#1088. However, it will break CUDA tests if the MPI implementation is not CUDA aware. See trilinos#1088 and trilinos#1571.

I also added a lot of optional debug output. For example, doTransferNew's debug output now says whether the communication buffers live on host or device. The debug output is off by default. If you would like to enable it temporarily for debugging, search for the string "constexpr bool debug = false"; the comment above that line explains what to do.

The second thing I did was to make packCrsMatrix and unpackCrsMatrix use explicit template instantiation (ETI). This reduces build time. I did it specifically to speed up debugging of my changes to the pack and unpack routines. [Author's note: This is one reason why it was hard to reorder the two commits.] Stokhos' ETI system is more restrictive than Tpetra's, so I had to rename unpackCrsMatrix's headers. Sorry to folks who are working on this at the moment. I also fixed possible errors in Tpetra::CrsMatrix::*[gG]aussSeidel*, just to make sure.
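To illustrate the first commit's change, here is a minimal sketch of an unpack kernel that takes an execution space instance, so the caller decides where (and potentially on which CUDA stream) it runs. The names unpack_sum, target, and imports are hypothetical; the real MultiVector kernels also handle LIDs, combine modes, and multiple columns.

```c++
#include <Kokkos_Core.hpp>

template <class ExecSpace, class OutView, class InView>
void unpack_sum (const ExecSpace& execSpace,  // where to run (an instance)
                 const OutView& target,       // e.g., MultiVector data
                 const InView& imports)       // packed receive buffer
{
  Kokkos::parallel_for ("unpack_sum",
    Kokkos::RangePolicy<ExecSpace> (execSpace, 0, imports.extent (0)),
    KOKKOS_LAMBDA (const int k) {
      target (k) += imports (k);  // "ADD" combine mode, flat indexing
    });
}
```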
@trilinos/tpetra See trilinos#1088. UVM allocations are slow with MPI. This is an MPI and CUDA issue, not a Kokkos or Tpetra issue. To prevent users from doing a slow thing, explicitly forbid them from passing any Kokkos::View into Tpetra::Distributor whose memory space is Kokkos::CudaUVMSpace.
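A minimal sketch of how such a compile-time prohibition can look (not Tpetra's actual code; the simplified doPostsAndWaits signature is an assumption): reject any View whose memory_space is Kokkos::CudaUVMSpace.

```c++
#include <Kokkos_Core.hpp>
#include <type_traits>

template <class ViewType>
void doPostsAndWaits (const ViewType& sendBuffer /* , ... */)
{
#ifdef KOKKOS_ENABLE_CUDA
  static_assert (
    ! std::is_same<typename ViewType::memory_space,
                   Kokkos::CudaUVMSpace>::value,
    "Please do not pass CudaUVMSpace-allocated Views to the Distributor; "
    "UVM allocations are slow with MPI.  Use CudaSpace (or a HostSpace "
    "copy) instead.");
#endif
  // ... post MPI sends/receives using sendBuffer.data() ...
  (void) sendBuffer;
}
```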
@trilinos/tpetra Set the environment variable TPETRA_ASSUME_CUDA_AWARE_MPI to tell Tpetra to assume that MPI is CUDA aware (see trilinos#1088 and trilinos#1571).
Fixed in develop.
@trilinos/tpetra See trilinos#1571 and trilinos#1088 to learn what it means for an MPI implementation to be "CUDA aware," and why this matters for performance. This commit adds a test for whether the MPI implementation that Trilinos uses is actually CUDA aware, if it claims to be CUDA aware, or if the user claims that MPI is CUDA aware. The test will only build if CUDA is enabled. If you want to exercise this test, you must build Tpetra with CUDA enabled, and set the environment variable TPETRA_ASSUME_CUDA_AWARE_MPI to some true value (e.g., "1" or "TRUE"). If you set the environment variable to some false value (e.g., "0" or "FALSE"), the test will run but will pass trivially. This means that you may control the test's behavior at run time.
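A minimal sketch of the core of such a check (assumptions only; the actual Tpetra test differs, and only the environment variable name comes from this thread): pass a CudaSpace allocation directly to MPI and verify the received value. Run with exactly 2 MPI processes.

```c++
#include <Kokkos_Core.hpp>
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main (int argc, char* argv[]) {
  MPI_Init (&argc, &argv);
  Kokkos::initialize (argc, argv);
  {
    // Crude parse; the real option accepts values like "1" or "TRUE".
    const char* env = std::getenv ("TPETRA_ASSUME_CUDA_AWARE_MPI");
    const bool assumeCudaAwareMpi =
      env != nullptr && (env[0] == '1' || env[0] == 'T' || env[0] == 't');

    int myRank = 0;
    MPI_Comm_rank (MPI_COMM_WORLD, &myRank);

    if (! assumeCudaAwareMpi) {
      if (myRank == 0) {
        std::printf ("MPI not assumed CUDA aware; test passes trivially.\n");
      }
    }
    else {
      // CudaSpace (not UVM) allocation handed directly to MPI.
      Kokkos::View<double*, Kokkos::CudaSpace> buf ("buf", 1);
      if (myRank == 0) {
        Kokkos::deep_copy (buf, 42.0);  // fill on device
        MPI_Send (buf.data (), 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
      }
      else if (myRank == 1) {
        MPI_Recv (buf.data (), 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                  MPI_STATUS_IGNORE);
        auto host = Kokkos::create_mirror_view (buf);
        Kokkos::deep_copy (host, buf);
        std::printf ("%s\n", host (0) == 42.0 ? "PASSED" : "FAILED");
      }
    }
  }
  Kokkos::finalize ();
  MPI_Finalize ();
  return 0;
}
```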
@trilinos/tpetra See trilinos#1571 and trilinos#1088. The new function is called Tpetra::Details::assumeMpiIsCudaAware.
@trilinos/tpetra
Blocked by: #1571, #1602, #1658
MPI on Pascal (but not on K80) appears to be slow when accessing UVM-allocated device buffers, though not when accessing standard CUDA device buffers. This causes trouble with Tpetra::Distributor sending and receiving packed data, as measured by @ambrad. It also probably causes trouble with dot products and norms.
@crtrott and I wrote an MPI + Kokkos benchmark to measure this independently of Tpetra.
Possible work-around: Don't use CudaUVMSpace for internal comm buffers in DistObject methods packAndPrepareNew and unpackAndCombineNew. Define a typedef pack_dev_memory_space which is CudaSpace when execution_space is Cuda.
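A minimal sketch of that typedef using std::conditional; this illustrates the idea rather than Tpetra's actual implementation, and the names PackTraits and exports_view_type are hypothetical.

```c++
#include <Kokkos_Core.hpp>
#include <type_traits>

template <class ExecutionSpace>
struct PackTraits {
#ifdef KOKKOS_ENABLE_CUDA
  // When the execution space is Cuda, use CudaSpace -- never
  // CudaUVMSpace -- so MPI never sees UVM allocations.
  using pack_dev_memory_space = typename std::conditional<
    std::is_same<ExecutionSpace, Kokkos::Cuda>::value,
    Kokkos::CudaSpace,
    typename ExecutionSpace::memory_space>::type;
#else
  using pack_dev_memory_space = typename ExecutionSpace::memory_space;
#endif
};

// Example: an internal exports buffer allocated in that memory space.
using exec_space = Kokkos::DefaultExecutionSpace;
using exports_view_type =
  Kokkos::View<char*, PackTraits<exec_space>::pack_dev_memory_space>;
```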