
Tpetra::MultiVector: Don't use CudaUVMSpace for internal comm buffers #1088

Closed

mhoemmen opened this issue Feb 23, 2017 · 36 comments

@mhoemmen
Contributor

mhoemmen commented Feb 23, 2017

@trilinos/tpetra

Blocked by: #1571, #1602, #1658

On Pascal (but not on K80), MPI appears to be slow when accessing UVM-allocated device buffers, but not standard CUDA device buffers. This causes trouble when Tpetra::Distributor sends and receives packed data, as measured by @ambrad. It probably also causes trouble with dot products and norms.

@crtrott and I wrote an MPI + Kokkos benchmark to measure this independently of Tpetra.

Possible work-around: Don't use CudaUVMSpace for internal comm buffers in the DistObject methods packAndPrepareNew and unpackAndCombineNew. Define a typedef pack_dev_memory_space, which is CudaSpace when execution_space is Cuda.
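
Not the actual benchmark, but a minimal sketch of the kind of measurement (it assumes a CUDA build of Kokkos, two MPI processes, and a CUDA-aware MPI; with a non-CUDA-aware MPI the CudaSpace case would crash):

#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <cstdio>

// Time a simple ping-pong between ranks 0 and 1, with the message buffer
// allocated in the given Kokkos memory space.
template <class MemorySpace>
double pingPongSeconds (const int myRank, const int numBytes, const int numTrials) {
  Kokkos::View<char*, MemorySpace> buf ("buf", numBytes);
  MPI_Barrier (MPI_COMM_WORLD);
  const double t0 = MPI_Wtime ();
  for (int trial = 0; trial < numTrials; ++trial) {
    if (myRank == 0) {
      MPI_Send (buf.data (), numBytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      MPI_Recv (buf.data (), numBytes, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    else if (myRank == 1) {
      MPI_Recv (buf.data (), numBytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send (buf.data (), numBytes, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }
  }
  return MPI_Wtime () - t0;
}

int main (int argc, char* argv[]) {
  MPI_Init (&argc, &argv);
  Kokkos::initialize (argc, argv);
  {
    int myRank = 0;
    MPI_Comm_rank (MPI_COMM_WORLD, &myRank);
    const int numBytes = 1 << 20;
    const int numTrials = 100;

    // Compare plain device memory against UVM memory as the MPI buffer.
    const double tDev = pingPongSeconds<Kokkos::CudaSpace> (myRank, numBytes, numTrials);
    const double tUvm = pingPongSeconds<Kokkos::CudaUVMSpace> (myRank, numBytes, numTrials);
    if (myRank == 0) {
      std::printf ("CudaSpace:    %g s\nCudaUVMSpace: %g s\n", tDev, tUvm);
    }
  }
  Kokkos::finalize ();
  MPI_Finalize ();
  return 0;
}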

@mhoemmen
Contributor Author

mhoemmen commented Feb 23, 2017

Inside Tpetra::DistObject:

#ifdef KOKKOS_HAVE_CUDA
typedef typename std::conditional<
    std::is_same<typename device_type::execution_space, Kokkos::Cuda>::value,
    Kokkos::CudaSpace,
    typename device_type::memory_space
  >::type pack_memory_space;
#else
typedef typename device_type::memory_space pack_memory_space;
#endif // KOKKOS_HAVE_CUDA

typedef Kokkos::Device<typename device_type::execution_space,
                       pack_memory_space> pack_device_type;

Change packAndPrepareNew so that the exports DualView (and only that DualView) uses pack_device_type as its Device template parameter.

Change unpackAndCombineNew so that the imports DualView (and only that DualView) uses pack_device_type as its Device template parameter.
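
For concreteness, a minimal sketch of the resulting buffer type on a CUDA build (the packet type and the variable name below are placeholders, not the actual DistObject members):

#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

int main (int argc, char* argv[]) {
  Kokkos::initialize (argc, argv);
  {
    typedef char packet_type; // placeholder; DistObject is templated on Packet

    // pack_device_type, written out for the CUDA case: the usual execution
    // space, but plain CudaSpace instead of CudaUVMSpace.
    typedef Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> pack_device_type;

    // The exports buffer's device data now lives in CudaSpace, so the
    // pointer handed to MPI is an ordinary device pointer, never a UVM
    // allocation.
    Kokkos::DualView<packet_type*, pack_device_type> exports ("exports", 100);
    (void) exports;
  }
  Kokkos::finalize ();
  return 0;
}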

@crtrott
Member

crtrott commented Feb 23, 2017

I am on it, was thinking I can get that in one batch through with the testing for KokkosKernels.

@mhoemmen
Contributor Author

@crtrott wrote:

I am on it, was thinking I can get that in one batch through with the testing for KokkosKernels.

One step at a time :) We don't know if this is really the issue yet; also, we may have to do the same for dot products, and I haven't tested whether this builds yet ;-P

@crtrott
Member

crtrott commented Feb 23, 2017

Digging around inside Tpetra_Import_Util2.hpp

@crtrott
Member

crtrott commented Feb 24, 2017

Looks like it is compiling :-)

@crtrott
Member

crtrott commented Feb 24, 2017

And it only broke the CrsMatrix UnitTests and the BlockCrsMatrix UnitTests, but those aren't important, right? ;-)

@crtrott
Member

crtrott commented Feb 24, 2017

This probably means the sync logic is bad inside pack and unpack ...

@mhoemmen
Contributor Author

@crtrott wrote:

This probably means the sync logic is bad inside pack and unpack ...

:( Note that we've never tested Tpetra on CUDA without UVM. On the other hand, pack and unpack for CrsGraph and CrsMatrix should use the old interface that takes Teuchos::Array*.

@mhoemmen
Contributor Author

Just to summarize our debugging session this afternoon: Kokkos::DualView::modify checks whether the input template parameter has the same memory_space as the DualView. If not, it marks the "host view" modified. This means that, for example, if the DualView's memory_space is CudaUVMSpace and the input template parameter is CudaSpace, the DualView's host view will get marked as modified, contrary to expectations. The work-around is to make sure that the input memory_space is right.
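
For reference, a minimal sketch of the pitfall and the work-around (illustrative only, not Tpetra code):

#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

int main (int argc, char* argv[]) {
  Kokkos::initialize (argc, argv);
  {
    // A DualView whose device side lives in CudaUVMSpace.
    typedef Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> uvm_device_type;
    Kokkos::DualView<double*, uvm_device_type> x ("x", 10);

    // Intent: mark the device side modified.  However, CudaSpace is not
    // this DualView's memory_space (CudaUVMSpace), so modify treats it as
    // "the other side" and marks the host view modified instead.
    x.modify<Kokkos::CudaSpace> ();

    // Work-around: pass the DualView's actual device memory space.
    x.modify<uvm_device_type::memory_space> ();
  }
  Kokkos::finalize ();
  return 0;
}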

@ambrad
Contributor

ambrad commented Mar 21, 2017

@trilinos/tpetra (pinging the issue at MH's request)

@mhoemmen
Contributor Author

@ambrad Could you send both @crtrott and me (resend if we have it already) a configure script, a computer on which to demonstrate the problem, and an executable that demonstrates the problem?

@mhoemmen
Contributor Author

I think @crtrott has the benchmark -- did we commit it to Tpetra?

mhoemmen added this to the Tpetra-FY17-Q3 milestone Mar 22, 2017
mhoemmen added the "stage: in progress" label Aug 1, 2017
@mhoemmen
Contributor Author

mhoemmen commented Aug 1, 2017

I'm working on this now. The idea is to link up with @tjfulle's work on thread-parallel CrsMatrix pack and unpack, and then get rid of the "old" DistObject interface for good.

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

I'm getting some funny debugging results that suggest that SEMS' build of OpenMPI might not be CUDA-aware. In particular, I'm getting segfaults when I give CudaSpace buffers to MPI_Send, but not when I give CudaUVMSpace buffers to MPI_Send.

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

@crtrott @ambrad Have you noticed this? Should it actually be OK to give CudaSpace buffers to MPI?

@ibaned
Contributor

ibaned commented Aug 2, 2017

I've been wondering how to detect at compile time if MPI is CUDA-aware... that seems increasingly relevant

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

@ibaned I would very much like that. If it's not CUDA-aware, it segfaults. Not sure if I want Trilinos to get into the business of setting SIGSEGV handlers, though I think PETSc does or did that as a way to catch user error.

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

OpenMPI 2.0.0 gives users a way to tell if it supports CUDA:

https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-aware-support
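
In code, the check looks roughly like this (OpenMPI-specific; the macro and the run-time query come from mpi-ext.h and won't exist for other MPI implementations):

#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#  include <mpi-ext.h> // OpenMPI extensions, including the CUDA-awareness query
#endif
#include <cstdio>

int main (int argc, char* argv[]) {
  MPI_Init (&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  std::printf ("Compile time: this OpenMPI was built with CUDA support.\n");
#else
  std::printf ("Compile time: no CUDA support reported (or not OpenMPI).\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
  // Run-time check; the library loaded at run time may differ from the one
  // seen at compile time.
  std::printf ("Run time: CUDA support is %s.\n",
               MPIX_Query_cuda_support () == 1 ? "enabled" : "disabled");
#endif

  MPI_Finalize ();
  return 0;
}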

@ibaned
Contributor

ibaned commented Aug 2, 2017

Thanks for that link, @mhoemmen!
Even if we can't set the default automatically yet, it certainly seems like portable code is going to need #ifdef CUDA_AWARE_MPI conditionals. I'd be interested in performance numbers if you find a proper CUDA-aware MPI and pass it device buffers.

@ambrad
Contributor

ambrad commented Aug 2, 2017

@mhoemmen, yes, giving CudaSpace pointers to MPI is valid if the MPI accepts it. Haven't used the SEMS MPI module for GPU builds.

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

@ambrad wrote:

Haven't used the SEMS MPI module for GPU builds.

That might explain it.... How hard is it to configure, say, OpenMPI 2.0.x for CUDA?

lxmota pushed a commit that referenced this issue Aug 8, 2017
@trilinos/tpetra This commit will help me prepare to work on #1088.
All changes affect DistObject::doTransferNew only.  Highlights:

  1. Sum up totalImportPackets in thread-parallel fashion on device or
     host, dependent on commOnHost input argument.  Before, this was
     only host thread-parallel.

  2. Set modified flags for imports_ to avoid spurious
     Kokkos::DualView debug-mode errors, since we're using it in
     write-only mode.
mhoemmen pushed a commit that referenced this issue Aug 14, 2017
@trilinos/tpetra Add a CMake option Tpetra_ASSUME_CUDA_AWARE_MPI, with
associated macro TPETRA_ASSUME_CUDA_AWARE_MPI defined in
TpetraCore_config.h.  If the CMake option is ON, Tpetra may assume
that the MPI implementation it uses is CUDA aware.  See #1571 for
discussion, and #1088 for an application.

The option currently defaults to OFF.  #1571 requires that we actually
have a test for this option, at least for OpenMPI, so we haven't
finished the issue yet.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 14, 2017
crtrott pushed a commit to mndevec/Trilinos that referenced this issue Aug 17, 2017
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 28, 2017
@trilinos/tpetra I added an all-reduce at the end of each test, to
ensure that all processes made it through the test and passed.  This
should help with trilinos#1088.
@mhoemmen
Contributor Author

mhoemmen commented Aug 28, 2017

Recent Kokkos changes, and merging with @tjfulle's packCrsMatrix work, made this task more challenging, but I'm making progress. For an interesting complication, see this OpenMPI issue:

open-mpi/ompi#3244

ambrad pushed a commit to ambrad/Trilinos that referenced this issue Aug 29, 2017
@mhoemmen
Contributor Author

I got Tpetra to pass tests with CUDA 8.0 and (CUDA-aware) OpenMPI 2.0.1. I'm working on Stokhos now.

mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra This will help with trilinos#1088.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra @trilinos/stokhos It's unfortunate that I had to
merge these two fixes into one commit.  The reason is that trilinos#1658
blocks trilinos#1088, but I couldn't reorder the two commits without the risk
of losing information.  I will give the two commits' messages below.

* Stokhos,Tpetra: Fix trilinos#1658 (Tpetra::MultiVector unpack kernels)

@trilinos/tpetra Tpetra::MultiVector's unpack kernels (used in
unpackAndCombineNew) now take a Kokkos execution space argument.
Before, the kernels were using the output Kokkos::View (the
MultiVector's data) to determine the execution space.  The problem
with this, is that for a Kokkos::DualView of CudaUVMSpace, the host
and device Views are the same.  This blocks trilinos#1088, whose fix requires
that the input View (the buffer to unpack) be either a CudaSpace or a
HostSpace View.  (CudaUVMSpace::execution_space == Cuda, so the kernel
will always run on device, even given a HostSpace buffer to unpack.)

The new execution space input argument lets the user specify the
execution space on which to run.  It's an execution space instance, so
this could give us a future option to run in a separate CUDA stream
(e.g., for overlap of communication and computation), or to run on a
subset of threads.

I also fixed the unpack kernel specializations in Stokhos.

* Tpetra: Fix trilinos#1088 (never give CUDA UVM allocations to MPI)

@trilinos/tpetra This commit does two things:

  1. Change DistObject so that it never gives CUDA UVM (i.e.,
     Kokkos::CudaUVMSpace) allocations to MPI sends or receives.

  2. Make packCrsMatrix and unpackCrsMatrix use ETI (explicit template
     instantiation), to reduce build time and speed up debugging of
     the pack and unpack routines.

I did the first of these by insisting on Kokkos::CudaSpace, rather
than Kokkos::CudaUVMSpace, for all of DistObject's communication
buffers: imports_, numImportPacketsPerLID_, exports_, and
numExportPacketsPerLID_.  I added a public typedef to DistObject,
buffer_device_type, to help users allocate communication buffers of
the right type.  This required changing Tpetra::MultiVector and
Tpetra::Details::CooMatrix, which are currently the only two
implementations of the "new" DistObject interface.

This commit fixes trilinos#1088.  However, this commit will break CUDA tests
if the MPI implementation is not CUDA aware.  See trilinos#1088 and trilinos#1571.

I also added a lot of optional debug output.  For example,
doTransferNew's debug output now says whether the communication
buffers live on host or device.  The debug output is off by default.
If you would like to enable it temporarily for debugging, just search
for the string "constexpr bool debug = false"; the comment above this
line explains what to do.

The second thing I did was to make packCrsMatrix and unpackCrsMatrix
use explicit template instantiation (ETI).  This reduces build time.
I did it specifically to speed up debugging of my changes to the pack
and unpack routines.  [Author's note: This is one reason why it was
hard to reorder the two commits.]

Stokhos' ETI system is more restrictive than Tpetra's ETI system, so I
had to rename unpackCrsMatrix's headers.  Sorry to folks who are
working on this at the moment.

I also fixed possible errors in Tpetra::CrsMatrix::*[gG]aussSeidel*
just to make sure.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra See trilinos#1088.  UVM allocations are slow with MPI.  This
is an MPI and CUDA issue, not a Kokkos or Tpetra issue.  To prevent
users from doing a slow thing, explicitly forbid them from passing any
Kokkos::View into Tpetra::Distributor whose memory space is
Kokkos::CudaUVMSpace.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra Set the environment variable
TPETRA_ASSUME_CUDA_AWARE_MPI to tell Tpetra to assume that MPI is CUDA
aware (see trilinos#1088 and trilinos#1571).
@mhoemmen
Contributor Author

Fixed in develop.

mhoemmen removed the "stage: in progress" label Aug 31, 2017
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra See trilinos#1571 and trilinos#1088 to learn what it means for an MPI
implementation to be "CUDA aware," and why this matters for
performance.

This commit adds a test for whether the MPI implementation that
Trilinos uses is actually CUDA aware, if it claims to be CUDA aware,
or if the user claims that MPI is CUDA aware.

The test will only build if CUDA is enabled.  If you want to exercise
this test, you must build Tpetra with CUDA enabled, and set the
environment variable TPETRA_ASSUME_CUDA_AWARE_MPI to some true value
(e.g., "1" or "TRUE").  If you set the environment variable to some
false value (e.g., "0" or "FALSE"), the test will run but will pass
trivially.  This means that you may control the test's behavior at run
time.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra See trilinos#1571 and trilinos#1088.  The new function is called
Tpetra::Details::assumeMpiIsCudaAware.