
Tpetra::MultiVector: Don't use CudaUVMSpace for internal comm buffers #1088

Closed

mhoemmen opened this issue Feb 23, 2017 · 36 comments

@mhoemmen
Contributor

mhoemmen commented Feb 23, 2017

@trilinos/tpetra

Blocked by: #1571, #1602, #1658

On Pascal (but not on K80), MPI appears to be slow when accessing UVM-allocated device buffers, but not standard CUDA device buffers. This causes trouble when Tpetra::Distributor sends and receives packed data, as measured by @ambrad. It probably also causes trouble with dot products and norms.

@crtrott and I wrote an MPI + Kokkos benchmark to measure this independently of Tpetra.

Possible work-around: Don't use CudaUVMSpace for internal comm buffers in the DistObject methods packAndPrepareNew and unpackAndCombineNew. Define a typedef pack_dev_memory_space, which is CudaSpace when execution_space is Cuda.
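
Not the actual benchmark, but a minimal sketch of the kind of measurement (it assumes a CUDA build of Kokkos, two MPI processes, and a CUDA-aware MPI; with a non-CUDA-aware MPI the CudaSpace case would crash):

#include <mpi.h>
#include <Kokkos_Core.hpp>
#include <cstdio>

// Time a simple ping-pong between ranks 0 and 1, with the message buffer
// allocated in the given Kokkos memory space.
template <class MemorySpace>
double pingPongSeconds (const int myRank, const int numBytes, const int numTrials) {
  Kokkos::View<char*, MemorySpace> buf ("buf", numBytes);
  MPI_Barrier (MPI_COMM_WORLD);
  const double t0 = MPI_Wtime ();
  for (int trial = 0; trial < numTrials; ++trial) {
    if (myRank == 0) {
      MPI_Send (buf.data (), numBytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
      MPI_Recv (buf.data (), numBytes, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    else if (myRank == 1) {
      MPI_Recv (buf.data (), numBytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Send (buf.data (), numBytes, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }
  }
  return MPI_Wtime () - t0;
}

int main (int argc, char* argv[]) {
  MPI_Init (&argc, &argv);
  Kokkos::initialize (argc, argv);
  {
    int myRank = 0;
    MPI_Comm_rank (MPI_COMM_WORLD, &myRank);
    const int numBytes = 1 << 20;
    const int numTrials = 100;

    // Compare plain device memory against UVM memory as the MPI buffer.
    const double tDev = pingPongSeconds<Kokkos::CudaSpace> (myRank, numBytes, numTrials);
    const double tUvm = pingPongSeconds<Kokkos::CudaUVMSpace> (myRank, numBytes, numTrials);
    if (myRank == 0) {
      std::printf ("CudaSpace:    %g s\nCudaUVMSpace: %g s\n", tDev, tUvm);
    }
  }
  Kokkos::finalize ();
  MPI_Finalize ();
  return 0;
}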

@mhoemmen
Contributor Author

mhoemmen commented Feb 23, 2017

Inside Tpetra::DistObject:

#ifdef KOKKOS_HAVE_CUDA
typedef typename std::conditional<
    std::is_same<typename device_type::execution_space, Kokkos::Cuda>::value,
    Kokkos::CudaSpace,
    typename device_type::memory_space
  >::type pack_memory_space;
#else
typedef typename device_type::memory_space pack_memory_space;
#endif // KOKKOS_HAVE_CUDA

typedef Kokkos::Device<typename device_type::execution_space,
                       pack_memory_space> pack_device_type;

Change packAndPrepareNew so that the exports DualView (and only that DualView) uses pack_device_type as its Device template parameter.

Change unpackAndCombineNew so that the imports DualView (and only that DualView) uses pack_device_type as its Device template parameter.
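
For concreteness, a minimal sketch of the resulting buffer type on a CUDA build (the packet type and the variable name below are placeholders, not the actual DistObject members):

#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

int main (int argc, char* argv[]) {
  Kokkos::initialize (argc, argv);
  {
    typedef char packet_type; // placeholder; DistObject is templated on Packet

    // pack_device_type, written out for the CUDA case: the usual execution
    // space, but plain CudaSpace instead of CudaUVMSpace.
    typedef Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> pack_device_type;

    // The exports buffer's device data now lives in CudaSpace, so the
    // pointer handed to MPI is an ordinary device pointer, never a UVM
    // allocation.
    Kokkos::DualView<packet_type*, pack_device_type> exports ("exports", 100);
    (void) exports;
  }
  Kokkos::finalize ();
  return 0;
}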

@crtrott
Member

crtrott commented Feb 23, 2017

I am on it, was thinking I can get that in one batch through with the testing for KokkosKernels.

@mhoemmen
Contributor Author

@crtrott wrote:

I am on it, was thinking I can get that in one batch through with the testing for KokkosKernels.

One step at a time :) We don't know if this is really the issue yet; also, we may have to do the same for dot products, and I haven't tested whether this builds yet ;-P

@crtrott
Member

crtrott commented Feb 23, 2017

Digging around inside Tpetra_Import_Util2.hpp

@crtrott
Member

crtrott commented Feb 24, 2017

Looks like it is compiling :-)

@crtrott
Member

crtrott commented Feb 24, 2017

And it only broke the CrsMatrix UnitTests and the BlockCrsMatrix UnitTests, but those aren't important, right? ;-)

@crtrott
Member

crtrott commented Feb 24, 2017

This probably means the sync logic is bad inside pack and unpack ...

@mhoemmen
Contributor Author

@crtrott wrote:

This probably means the sync logic is bad inside pack and unpack ...

:( Note that we've never tested Tpetra on CUDA without UVM. On the other hand, pack and unpack for CrsGraph and CrsMatrix should use the old interface that takes Teuchos::Array*.

@mhoemmen
Contributor Author

Just to summarize our debugging session this afternoon: Kokkos::DualView::modify checks whether the input template parameter has the same memory_space as the DualView. If not, it marks the "host view" modified. This means that, for example, if the DualView's memory_space is CudaUVMSpace and the input template parameter is CudaSpace, the DualView's host view will get marked as modified, contrary to expectations. The work-around is to make sure that the input memory_space is right.
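
For reference, a minimal sketch of the pitfall and the work-around (illustrative only, not Tpetra code):

#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

int main (int argc, char* argv[]) {
  Kokkos::initialize (argc, argv);
  {
    // A DualView whose device side lives in CudaUVMSpace.
    typedef Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace> uvm_device_type;
    Kokkos::DualView<double*, uvm_device_type> x ("x", 10);

    // Intent: mark the device side modified.  However, CudaSpace is not
    // this DualView's memory_space (CudaUVMSpace), so modify treats it as
    // "the other side" and marks the host view modified instead.
    x.modify<Kokkos::CudaSpace> ();

    // Work-around: pass the DualView's actual device memory space.
    x.modify<uvm_device_type::memory_space> ();
  }
  Kokkos::finalize ();
  return 0;
}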

@ambrad
Contributor

ambrad commented Mar 21, 2017

@trilinos/tpetra (pinging the issue at MH's request)

@mhoemmen
Contributor Author

@ambrad Could you send both @crtrott and me (resend if we have it already) a configure script, a computer on which to demonstrate the problem, and an executable that demonstrates the problem?

@mhoemmen
Contributor Author

I think @crtrott has the benchmark -- did we commit it to Tpetra?

mhoemmen added this to the Tpetra-FY17-Q3 milestone Mar 22, 2017
mhoemmen added the "stage: in progress" label Aug 1, 2017
@mhoemmen
Contributor Author

mhoemmen commented Aug 1, 2017

I'm working on this now. The idea is to link up with @tjfulle's work on thread-parallel CrsMatrix pack and unpack, and then get rid of the "old" DistObject interface for good.

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

I'm getting some funny debugging results that suggest that SEMS' build of OpenMPI might not be CUDA-aware. In particular, I'm getting segfaults when I give CudaSpace buffers to MPI_Send, but not when I give CudaUVMSpace buffers to MPI_Send.

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

@crtrott @ambrad Have you noticed this? Should it actually be OK to give CudaSpace buffers to MPI?

@ibaned
Contributor

ibaned commented Aug 2, 2017

I've been wondering how to detect at compile time if MPI is CUDA-aware... that seems increasingly relevant

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

@ibaned I would very much like that. If it's not CUDA-aware, it segfaults. Not sure if I want Trilinos to get into the business of setting SIGSEGV handlers, though I think PETSc does or did that as a way to catch user error.

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

OpenMPI 2.0.0 gives users a way to tell if it supports CUDA:

https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-aware-support
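
In code, the check looks roughly like this (OpenMPI-specific; the macro and the run-time query come from mpi-ext.h and won't exist for other MPI implementations):

#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#  include <mpi-ext.h> // OpenMPI extensions, including the CUDA-awareness query
#endif
#include <cstdio>

int main (int argc, char* argv[]) {
  MPI_Init (&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  std::printf ("Compile time: this OpenMPI was built with CUDA support.\n");
#else
  std::printf ("Compile time: no CUDA support reported (or not OpenMPI).\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
  // Run-time check; the library loaded at run time may differ from the one
  // seen at compile time.
  std::printf ("Run time: CUDA support is %s.\n",
               MPIX_Query_cuda_support () == 1 ? "enabled" : "disabled");
#endif

  MPI_Finalize ();
  return 0;
}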

@ibaned
Contributor

ibaned commented Aug 2, 2017

Thanks for that link, @mhoemmen!
Even if we can't set the default automatically yet, it certainly seems like portable code is going to need #ifdef CUDA_AWARE_MPI conditionals. I'd be interested in performance numbers if you find a proper CUDA-aware MPI and pass it device buffers.

@ambrad
Contributor

ambrad commented Aug 2, 2017

@mhoemmen, yes, giving CudaSpace pointers to MPI is valid if the MPI accepts it. Haven't used the SEMS MPI module for GPU builds.

@mhoemmen
Contributor Author

mhoemmen commented Aug 2, 2017

@ambrad wrote:

Haven't used the SEMS MPI module for GPU builds.

That might explain it.... How hard is it to configure, say, OpenMPI 2.0.x for CUDA?

lxmota pushed a commit that referenced this issue Aug 8, 2017
@trilinos/tpetra This commit will help me prepare to work on #1088.
All changes affect DistObject::doTransferNew only.  Highlights:

  1. Sum up totalImportPackets in thread-parallel fashion on device or
     host, dependent on commOnHost input argument.  Before, this was
     only host thread-parallel.

  2. Set modified flags for imports_ to avoid spurious
     Kokkos::DualView debug-mode errors, since we're using it in
     write-only mode.
mhoemmen pushed a commit that referenced this issue Aug 14, 2017
@trilinos/tpetra Add a CMake option Tpetra_ASSUME_CUDA_AWARE_MPI, with
associated macro TPETRA_ASSUME_CUDA_AWARE_MPI defined in
TpetraCore_config.h.  If the CMake option is ON, Tpetra may assume
that the MPI implementation it uses is CUDA aware.  See #1571 for
discussion, and #1088 for an application.

The option currently defaults to OFF.  #1571 requires that we actually
have a test for this option, at least for OpenMPI, so we haven't
finished the issue yet.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 14, 2017
crtrott pushed a commit to mndevec/Trilinos that referenced this issue Aug 17, 2017
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 28, 2017
@trilinos/tpetra I added an all-reduce at the end of each test, to
ensure that all processes made it through the test and passed.  This
should help with trilinos#1088.
@mhoemmen
Contributor Author

mhoemmen commented Aug 28, 2017

Recent Kokkos changes, and merging with @tjfulle's packCrsMatrix work, made this task more challenging, but I'm making progress. For an interesting complication, see this OpenMPI issue:

open-mpi/ompi#3244

ambrad pushed a commit to ambrad/Trilinos that referenced this issue Aug 29, 2017
@mhoemmen
Contributor Author

I got Tpetra to pass tests with CUDA 8.0 and (CUDA-aware) OpenMPI 2.0.1. I'm working on Stokhos now.

mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra This will help with trilinos#1088.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra @trilinos/stokhos It's unfortunate that I had to
merge these two fixes into one commit.  The reason is that trilinos#1658
blocks trilinos#1088, but I couldn't reorder the two commits without the risk
of losing information.  I will give the two commits' messages below.

* Stokhos,Tpetra: Fix trilinos#1658 (Tpetra::MultiVector unpack kernels)

@trilinos/tpetra Tpetra::MultiVector's unpack kernels (used in
unpackAndCombineNew) now take a Kokkos execution space argument.
Before, the kernels were using the output Kokkos::View (the
MultiVector's data) to determine the execution space.  The problem
with this, is that for a Kokkos::DualView of CudaUVMSpace, the host
and device Views are the same.  This blocks trilinos#1088, whose fix requires
that the input View (the buffer to unpack) be either a CudaSpace or a
HostSpace View.  (CudaUVMSpace::execution_space == Cuda, so the kernel
will always run on device, even given a HostSpace buffer to unpack.)

The new execution space input argument lets the user specify the
execution space on which to run.  It's an execution space instance, so
this could give us a future option to run in a separate CUDA stream
(e.g., for overlap of communication and computation), or to run on a
subset of threads.

I also fixed the unpack kernel specializations in Stokhos.

* Tpetra: Fix trilinos#1088 (never give CUDA UVM allocations to MPI)

@trilinos/tpetra This commit does two things:

  1. Change DistObject so that it never gives CUDA UVM (i.e.,
     Kokkos::CudaUVMSpace) allocations to MPI sends or receives.

  2. Make packCrsMatrix and unpackCrsMatrix use ETI (explicit template
     instantiation), to reduce build time and speed up debugging of
     the pack and unpack routines.

I did the first of these by insisting on Kokkos::CudaSpace, rather
than Kokkos::CudaUVMSpace, for all of DistObject's communication
buffers: imports_, numImportPacketsPerLID_, exports_, and
numExportPacketsPerLID_.  I added a public typedef to DistObject,
buffer_device_type, to help users allocate communication buffers of
the right type.  This required changing Tpetra::MultiVector and
Tpetra::Details::CooMatrix, which are currently the only two
implementations of the "new" DistObject interface.

This commit fixes trilinos#1088.  However, this commit will break CUDA tests
if the MPI implementation is not CUDA aware.  See trilinos#1088 and trilinos#1571.

I also added a lot of optional debug output.  For example,
doTransferNew's debug output now says whether the communication
buffers live on host or device.  The debug output is off by default.
If you would like to enable it temporarily for debugging, just search
for the string "constexpr bool debug = false"; the comment above this
line explains what to do.

The second thing I did was to make packCrsMatrix and unpackCrsMatrix
use explicit template instantiation (ETI).  This reduces build time.
I did it specifically to speed up debugging of my changes to the pack
and unpack routines.  [Author's note: This is one reason why it was
hard to reorder the two commits.]

Stokhos' ETI system is more restrictive than Tpetra's ETI system, so I
had to rename unpackCrsMatrix's headers.  Sorry to folks who are
working on this at the moment.

I also fixed possible errors in Tpetra::CrsMatrix::*[gG]aussSeidel*
just to make sure.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra See trilinos#1088.  UVM allocations are slow with MPI.  This
is an MPI and CUDA issue, not a Kokkos or Tpetra issue.  To prevent
users from doing a slow thing, explicitly forbid them from passing any
Kokkos::View into Tpetra::Distributor whose memory space is
Kokkos::CudaUVMSpace.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra Set the environment variable
TPETRA_ASSUME_CUDA_AWARE_MPI to tell Tpetra to assume that MPI is CUDA
aware (see trilinos#1088 and trilinos#1571).
@mhoemmen
Contributor Author

Fixed in develop.

mhoemmen removed the "stage: in progress" label Aug 31, 2017
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra See trilinos#1571 and trilinos#1088 to learn what it means for an MPI
implementation to be "CUDA aware," and why this matters for
performance.

This commit adds a test for whether the MPI implementation that
Trilinos uses is actually CUDA aware, if it claims to be CUDA aware,
or if the user claims that MPI is CUDA aware.

The test will only build if CUDA is enabled.  If you want to exercise
this test, you must build Tpetra with CUDA enabled, and set the
environment variable TPETRA_ASSUME_CUDA_AWARE_MPI to some true value
(e.g., "1" or "TRUE").  If you set the environment variable to some
false value (e.g., "0" or "FALSE"), the test will run but will pass
trivially.  This means that you may control the test's behavior at run
time.
mhoemmen pushed a commit to mhoemmen/Trilinos that referenced this issue Aug 31, 2017
@trilinos/tpetra See trilinos#1571 and trilinos#1088.  The new function is called
Tpetra::Details::assumeMpiIsCudaAware.