Tpetra,Stokhos: Fix trilinos#1658 and trilinos#1088
@trilinos/tpetra @trilinos/stokhos It's unfortunate that I had to
merge these two fixes into one commit.  The reason is that trilinos#1658
blocks trilinos#1088, and I couldn't reorder the two commits without
risking loss of information.  The two commits' messages follow.

* Stokhos,Tpetra: Fix trilinos#1658 (Tpetra::MultiVector unpack kernels)

@trilinos/tpetra Tpetra::MultiVector's unpack kernels (used in
unpackAndCombineNew) now take a Kokkos execution space argument.
Before, the kernels used the output Kokkos::View (the MultiVector's
data) to determine the execution space.  The problem is that, for a
Kokkos::DualView allocated in CudaUVMSpace, the host and device Views
are the same.  This blocks trilinos#1088, whose fix requires that the
input View (the buffer to unpack) be either a CudaSpace or a
HostSpace View.  (CudaUVMSpace::execution_space == Cuda, so the kernel
would always run on device, even given a HostSpace buffer to unpack.)

The new execution space input argument lets the user specify the
execution space on which to run.  It's an execution space instance, so
this could give us a future option to run in a separate CUDA stream
(e.g., for overlap of communication and computation), or to run on a
subset of threads.
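The shape of that change can be sketched without Kokkos.  In this
hypothetical stand-in (none of these names are the real Tpetra or
Kokkos API), the caller passes an execution-space instance explicitly
instead of letting the kernel deduce one from the destination View:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for Kokkos execution spaces (not real Kokkos types).
struct HostSpace {};
struct DeviceSpace {};

// Sketch of the new pattern: the unpack kernel takes an execution space
// *instance* as its first argument, so the caller decides where it runs.
// The "views" here are plain std::vectors for illustration only.
template <class ExecutionSpace>
void unpackMultiColumn (const ExecutionSpace& /* execSpace */,
                        std::vector<double>& dst,        // output "view", numCols per row
                        const std::vector<double>& src,  // packed buffer to unpack
                        const std::vector<std::size_t>& idx, // local rows to unpack into
                        const std::size_t numCols)
{
  for (std::size_t k = 0; k < idx.size (); ++k) {
    const std::size_t row = idx[k];
    for (std::size_t j = 0; j < numCols; ++j) {
      dst[row*numCols + j] += src[k*numCols + j]; // "ADD" combine mode
    }
  }
}
```

A host-side caller would write `unpackMultiColumn (HostSpace {}, ...)`;
with the real Kokkos types, the passed instance is what could one day
name a separate CUDA stream or thread subset.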

I also fixed the unpack kernel specializations in Stokhos.

* Tpetra: Fix trilinos#1088 (never give CUDA UVM allocations to MPI)

@trilinos/tpetra This commit does two things:

  1. Change DistObject so that it never gives CUDA UVM (i.e.,
     Kokkos::CudaUVMSpace) allocations to MPI sends or receives.

  2. Make packCrsMatrix and unpackCrsMatrix use ETI (explicit template
     instantiation), to reduce build time and speed up debugging of
     the pack and unpack routines.

I did the first of these by insisting on Kokkos::CudaSpace, rather
than Kokkos::CudaUVMSpace, for all of DistObject's communication
buffers: imports_, numImportPacketsPerLID_, exports_, and
numExportPacketsPerLID_.  I added a public typedef to DistObject,
buffer_device_type, to help users allocate communication buffers of
the right type.  This required changing Tpetra::MultiVector and
Tpetra::Details::CooMatrix, which are currently the only two
implementations of the "new" DistObject interface.
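The role of the new typedef can be illustrated with a plain C++ sketch
(DistObjectSketch, MockCudaSpace, and MockHostSpace are hypothetical
stand-ins, not actual Tpetra or Kokkos types):

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical memory-space tags, standing in for Kokkos::CudaSpace etc.
struct MockCudaSpace {};
struct MockHostSpace {};

// Sketch of the idea behind the new DistObject::buffer_device_type typedef:
// the distributed object publishes the device type it uses for its
// communication buffers, so callers can allocate matching buffers
// instead of guessing.
template <class BufferSpace>
class DistObjectSketch {
public:
  using buffer_device_type = BufferSpace; // analogous to Tpetra's public typedef
private:
  // In the real class, these would be allocations in buffer_device_type's
  // memory space (imports_, exports_, etc.).
  std::vector<double> imports_;
  std::vector<double> exports_;
};

// Callers can check buffer compatibility at compile time:
static_assert (std::is_same<DistObjectSketch<MockCudaSpace>::buffer_device_type,
                            MockCudaSpace>::value,
               "buffer space must match what the DistObject advertises");
```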

This commit fixes trilinos#1088.  However, this commit will break CUDA tests
if the MPI implementation is not CUDA aware.  See trilinos#1088 and trilinos#1571.

I also added a lot of optional debug output.  For example,
doTransferNew's debug output now says whether the communication
buffers live on host or device.  The debug output is off by default.
If you would like to enable it temporarily for debugging, just search
for the string "constexpr bool debug = false"; the comment above this
line explains what to do.
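The on/off mechanism is an ordinary compile-time flag.  Here is a
generic sketch of the pattern (doTransferSketch is a hypothetical
stand-in, not the actual Tpetra code):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>

// Sketch of the compile-time debug switch described above.  The actual
// flag lives inside Tpetra's source files; flip it to true there to
// enable the output.
constexpr bool debug = false;

template <class OutputStream>
void doTransferSketch (OutputStream& out, const bool bufferOnHost)
{
  if (debug) { // dead code when debug == false; the compiler may strip it
    out << "doTransfer: communication buffers live on "
        << (bufferOnHost ? "host" : "device") << std::endl;
  }
  // ... the actual transfer work would go here ...
}
```

Because `debug` is `constexpr`, the branch costs nothing at run time
when it is false.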

The second thing I did was to make packCrsMatrix and unpackCrsMatrix
use explicit template instantiation (ETI).  This reduces build time.
I did it specifically to speed up debugging of my changes to the pack
and unpack routines.  [Author's note: This is one reason why it was
hard to reorder the two commits.]
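For readers unfamiliar with ETI, here is a minimal self-contained
sketch of the technique.  In a real ETI setup the "extern template"
declaration lives in a header and the instantiation in a single .cpp
file; both appear here in one translation unit for brevity, and
packValue is a hypothetical stand-in, not a Tpetra routine:

```cpp
#include <cassert>

// A function template whose instantiations we want to control.
template <class T>
T packValue (const T& x) {
  return x; // a real pack routine would serialize x into a buffer
}

// "Header" side: an explicit instantiation declaration.  It promises that
// some translation unit provides the instantiation, so files including the
// header do not instantiate packValue<int> themselves (saving build time).
extern template int packValue<int> (const int&);

// "Source" side: the one explicit instantiation definition.
template int packValue<int> (const int&);
```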

Stokhos' ETI system is more restrictive than Tpetra's ETI system, so I
had to rename unpackCrsMatrix's headers.  Sorry to folks who are
working on this at the moment.

I also fixed possible errors in Tpetra::CrsMatrix::*[gG]aussSeidel*
just to make sure.
Mark Hoemmen committed Aug 31, 2017
1 parent 2b1a05a commit 876bba2
Showing 19 changed files with 1,692 additions and 805 deletions.
packages/stokhos/src/CMakeLists.txt (2 additions, 0 deletions)
@@ -614,6 +614,8 @@ IF (Stokhos_ENABLE_Sacado)
   Experimental::BlockMultiVector
   Experimental::BlockCrsMatrix
   Details::getDiagCopyWithoutOffsets
+  Details::packCrsMatrix
+  Details::unpackCrsMatrixAndCombine
   )
 SET(TPETRAEXT_ETI_CLASSES
   MatrixMatrix
@@ -235,18 +235,23 @@ namespace Details {
     }
   };
 
-  template <typename DS, typename ... DP,
-            typename SS, typename ... SP,
-            typename IdxView, typename Op>
+  template <typename ExecutionSpace,
+            typename DS,
+            typename ... DP,
+            typename SS,
+            typename ... SP,
+            typename IdxView,
+            typename Op>
   struct UnpackArrayMultiColumn<
+    ExecutionSpace,
     Kokkos::View<Sacado::UQ::PCE<DS>**,DP...>,
     Kokkos::View<const Sacado::UQ::PCE<SS>*,SP...>,
     IdxView,
     Op >
   {
     typedef Kokkos::View<Sacado::UQ::PCE<DS>**,DP...> DstView;
     typedef Kokkos::View<const Sacado::UQ::PCE<SS>*,SP...> SrcView;
-    typedef typename DstView::execution_space execution_space;
+    typedef typename ExecutionSpace::execution_space execution_space;
     typedef typename execution_space::size_type size_type;
 
     static const unsigned BlockSize = 32;
@@ -257,50 +262,75 @@
     Op op;
     size_t numCols;
 
-    UnpackArrayMultiColumn(const DstView& dst_,
-                           const SrcView& src_,
-                           const IdxView& idx_,
-                           const Op& op_,
-                           size_t numCols_) :
-      dst(dst_), src(src_), idx(idx_), op(op_), numCols(numCols_) {}
-
-    KOKKOS_INLINE_FUNCTION
-    void operator()( const size_type k ) const {
+    UnpackArrayMultiColumn (const ExecutionSpace& /* execSpace */,
+                            const DstView& dst_,
+                            const SrcView& src_,
+                            const IdxView& idx_,
+                            const Op& op_,
+                            const size_t numCols_) :
+      dst (dst_),
+      src (src_),
+      idx (idx_),
+      op (op_),
+      numCols (numCols_)
+    {}
+
+    KOKKOS_INLINE_FUNCTION void
+    operator() (const size_type k) const
+    {
       const typename IdxView::value_type localRow = idx(k);
       const size_t offset = k*numCols;
-      for (size_t j = 0; j < numCols; ++j)
-        op( dst(localRow,j), src(offset+j) );
+      for (size_t j = 0; j < numCols; ++j) {
+        op (dst(localRow, j), src(offset+j));
+      }
     }
 
-    KOKKOS_INLINE_FUNCTION
-    void operator()( const size_type k, const size_type tidx ) const {
+    KOKKOS_INLINE_FUNCTION void
+    operator() (const size_type k, const size_type tidx) const
+    {
       const typename IdxView::value_type localRow = idx(k);
       const size_t offset = k*numCols;
-      for (size_t j = 0; j < numCols; ++j)
-        for (size_type i=tidx; i<Kokkos::dimension_scalar(dst); i+=BlockSize)
-          op( dst(localRow,j).fastAccessCoeff(i),
-              src(offset+j).fastAccessCoeff(i) );
+      for (size_t j = 0; j < numCols; ++j) {
+        for (size_type i=tidx; i<Kokkos::dimension_scalar(dst); i+=BlockSize) {
+          op (dst(localRow, j).fastAccessCoeff(i),
+              src(offset+j).fastAccessCoeff(i));
+        }
+      }
     }
 
-    static void unpack(const DstView& dst,
-                       const SrcView& src,
-                       const IdxView& idx,
-                       const Op& op,
-                       size_t numCols) {
-      if ( Details::device_is_cuda<execution_space>::value )
-        Kokkos::parallel_for(
-          Kokkos::MPVectorWorkConfig<execution_space>( idx.size(), BlockSize ),
-          UnpackArrayMultiColumn(dst,src,idx,op,numCols) );
-      else
-        Kokkos::parallel_for( idx.size(),
-                              UnpackArrayMultiColumn(dst,src,idx,op,numCols) );
+    static void
+    unpack (const ExecutionSpace& execSpace,
+            const DstView& dst,
+            const SrcView& src,
+            const IdxView& idx,
+            const Op& op,
+            const size_t numCols)
+    {
+      if (Details::device_is_cuda<execution_space>::value) {
+        Kokkos::parallel_for
+          ("Tpetra::MultiVector (Sacado::UQ::PCE) unpack (constant stride)",
+           Kokkos::MPVectorWorkConfig<execution_space> (idx.size (), BlockSize),
+           UnpackArrayMultiColumn (execSpace, dst, src, idx, op, numCols));
+      }
+      else {
+        Kokkos::parallel_for
+          ("Tpetra::MultiVector (Sacado::UQ::PCE) unpack (constant stride)",
+           Kokkos::RangePolicy<execution_space, size_type> (0, idx.size ()),
+           UnpackArrayMultiColumn (execSpace, dst, src, idx, op, numCols));
+      }
     }
   };
 
-  template <typename DS, typename ... DP,
-            typename SS, typename ... SP,
-            typename IdxView, typename ColView, typename Op>
+  template <typename ExecutionSpace,
+            typename DS,
+            typename ... DP,
+            typename SS,
+            typename ... SP,
+            typename IdxView,
+            typename ColView,
+            typename Op>
   struct UnpackArrayMultiColumnVariableStride<
+    ExecutionSpace,
     Kokkos::View<Sacado::UQ::PCE<DS>**,DP...>,
     Kokkos::View<const Sacado::UQ::PCE<SS>*,SP...>,
     IdxView,
@@ -309,7 +339,7 @@
   {
     typedef Kokkos::View<Sacado::UQ::PCE<DS>**,DP...> DstView;
     typedef Kokkos::View<const Sacado::UQ::PCE<SS>*,SP...> SrcView;
-    typedef typename DstView::execution_space execution_space;
+    typedef typename ExecutionSpace::execution_space execution_space;
     typedef typename execution_space::size_type size_type;
 
     static const unsigned BlockSize = 32;
@@ -321,12 +351,13 @@
     Op op;
     size_t numCols;
 
-    UnpackArrayMultiColumnVariableStride(const DstView& dst_,
-                                         const SrcView& src_,
-                                         const IdxView& idx_,
-                                         const ColView& col_,
-                                         const Op& op_,
-                                         size_t numCols_) :
+    UnpackArrayMultiColumnVariableStride (const ExecutionSpace& /* execSpace */,
+                                          const DstView& dst_,
+                                          const SrcView& src_,
+                                          const IdxView& idx_,
+                                          const ColView& col_,
+                                          const Op& op_,
+                                          size_t numCols_) :
       dst(dst_), src(src_), idx(idx_), col(col_), op(op_), numCols(numCols_) {}
 
     KOKKOS_INLINE_FUNCTION
@@ -347,20 +378,29 @@
               src(offset+j).fastAccessCoeff(i) );
     }
 
-    static void unpack(const DstView& dst,
-                       const SrcView& src,
-                       const IdxView& idx,
-                       const ColView& col,
-                       const Op& op,
-                       size_t numCols) {
-      if ( Details::device_is_cuda<execution_space>::value )
-        Kokkos::parallel_for(
-          Kokkos::MPVectorWorkConfig<execution_space>( idx.size(), BlockSize ),
-          UnpackArrayMultiColumnVariableStride(dst,src,idx,col,op,numCols) );
-      else
-        Kokkos::parallel_for( idx.size(),
-                              UnpackArrayMultiColumnVariableStride(
-                                dst,src,idx,col,op,numCols) );
+    static void
+    unpack (const ExecutionSpace& execSpace,
+            const DstView& dst,
+            const SrcView& src,
+            const IdxView& idx,
+            const ColView& col,
+            const Op& op,
+            const size_t numCols)
+    {
+      if (Details::device_is_cuda<execution_space>::value) {
+        Kokkos::parallel_for
+          ("Tpetra::MultiVector unpack (Sacado::UQ::PCE) (nonconstant stride)",
+           Kokkos::MPVectorWorkConfig<execution_space> (idx.size (), BlockSize),
+           UnpackArrayMultiColumnVariableStride (execSpace, dst, src,
+                                                 idx, col, op, numCols));
+      }
+      else {
+        Kokkos::parallel_for
+          ("Tpetra::MultiVector unpack (Sacado::UQ::PCE) (nonconstant stride)",
+           Kokkos::RangePolicy<execution_space, size_type> (0, idx.size ()),
+           UnpackArrayMultiColumnVariableStride (execSpace, dst, src,
+                                                 idx, col, op, numCols));
+      }
     }
   };
 
