Tpetra,Stokhos: Fix trilinos#1658 and trilinos#1088
@trilinos/tpetra @trilinos/stokhos It's unfortunate that I had to
merge these two fixes into one commit.  The reason is that trilinos#1658
blocks trilinos#1088, and I couldn't reorder the two commits without
risking loss of information.  The two commits' messages follow.

* Stokhos,Tpetra: Fix trilinos#1658 (Tpetra::MultiVector unpack kernels)

@trilinos/tpetra Tpetra::MultiVector's unpack kernels (used in
unpackAndCombineNew) now take a Kokkos execution space argument.
Before, the kernels used the output Kokkos::View (the MultiVector's
data) to determine the execution space.  The problem is that, for a
Kokkos::DualView allocated in CudaUVMSpace, the host and device Views
are the same.  This blocks trilinos#1088, whose fix requires that the
input View (the buffer to unpack) be either a CudaSpace or a
HostSpace View.  (CudaUVMSpace::execution_space == Cuda, so the kernel
would always run on device, even given a HostSpace buffer to unpack.)

The new execution space input argument lets the user specify the
execution space on which to run.  It's an execution space instance, so
this could give us a future option to run in a separate CUDA stream
(e.g., for overlap of communication and computation), or to run on a
subset of threads.
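The shape of that change can be sketched without Kokkos.  In this
hypothetical stand-in (none of these names are the real Tpetra or
Kokkos API), the caller passes an execution-space instance explicitly
instead of letting the kernel deduce one from the destination View:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for Kokkos execution spaces (not real Kokkos types).
struct HostSpace {};
struct DeviceSpace {};

// Sketch of the new pattern: the unpack kernel takes an execution space
// *instance* as its first argument, so the caller decides where it runs.
// The "views" here are plain std::vectors for illustration only.
template <class ExecutionSpace>
void unpackMultiColumn (const ExecutionSpace& /* execSpace */,
                        std::vector<double>& dst,        // output "view", numCols per row
                        const std::vector<double>& src,  // packed buffer to unpack
                        const std::vector<std::size_t>& idx, // local rows to unpack into
                        const std::size_t numCols)
{
  for (std::size_t k = 0; k < idx.size (); ++k) {
    const std::size_t row = idx[k];
    for (std::size_t j = 0; j < numCols; ++j) {
      dst[row*numCols + j] += src[k*numCols + j]; // "ADD" combine mode
    }
  }
}
```

A host-side caller would write `unpackMultiColumn (HostSpace {}, ...)`;
with the real Kokkos types, the passed instance is what could one day
name a separate CUDA stream or thread subset.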

I also fixed the unpack kernel specializations in Stokhos.

* Tpetra: Fix trilinos#1088 (never give CUDA UVM allocations to MPI)

@trilinos/tpetra This commit does two things:

  1. Change DistObject so that it never gives CUDA UVM (i.e.,
     Kokkos::CudaUVMSpace) allocations to MPI sends or receives.

  2. Make packCrsMatrix and unpackCrsMatrix use ETI (explicit template
     instantiation), to reduce build time and speed up debugging of
     the pack and unpack routines.

I did the first of these by insisting on Kokkos::CudaSpace, rather
than Kokkos::CudaUVMSpace, for all of DistObject's communication
buffers: imports_, numImportPacketsPerLID_, exports_, and
numExportPacketsPerLID_.  I added a public typedef to DistObject,
buffer_device_type, to help users allocate communication buffers of
the right type.  This required changing Tpetra::MultiVector and
Tpetra::Details::CooMatrix, which are currently the only two
implementations of the "new" DistObject interface.
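The role of the new typedef can be illustrated with a plain C++ sketch
(DistObjectSketch, MockCudaSpace, and MockHostSpace are hypothetical
stand-ins, not actual Tpetra or Kokkos types):

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical memory-space tags, standing in for Kokkos::CudaSpace etc.
struct MockCudaSpace {};
struct MockHostSpace {};

// Sketch of the idea behind the new DistObject::buffer_device_type typedef:
// the distributed object publishes the device type it uses for its
// communication buffers, so callers can allocate matching buffers
// instead of guessing.
template <class BufferSpace>
class DistObjectSketch {
public:
  using buffer_device_type = BufferSpace; // analogous to Tpetra's public typedef
private:
  // In the real class, these would be allocations in buffer_device_type's
  // memory space (imports_, exports_, etc.).
  std::vector<double> imports_;
  std::vector<double> exports_;
};

// Callers can check buffer compatibility at compile time:
static_assert (std::is_same<DistObjectSketch<MockCudaSpace>::buffer_device_type,
                            MockCudaSpace>::value,
               "buffer space must match what the DistObject advertises");
```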

This commit fixes trilinos#1088.  However, this commit will break CUDA tests
if the MPI implementation is not CUDA aware.  See trilinos#1088 and trilinos#1571.

I also added a lot of optional debug output.  For example,
doTransferNew's debug output now says whether the communication
buffers live on host or device.  The debug output is off by default.
If you would like to enable it temporarily for debugging, just search
for the string "constexpr bool debug = false"; the comment above this
line explains what to do.
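The on/off mechanism is an ordinary compile-time flag.  Here is a
generic sketch of the pattern (doTransferSketch is a hypothetical
stand-in, not the actual Tpetra code):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>

// Sketch of the compile-time debug switch described above.  The actual
// flag lives inside Tpetra's source files; flip it to true there to
// enable the output.
constexpr bool debug = false;

template <class OutputStream>
void doTransferSketch (OutputStream& out, const bool bufferOnHost)
{
  if (debug) { // dead code when debug == false; the compiler may strip it
    out << "doTransfer: communication buffers live on "
        << (bufferOnHost ? "host" : "device") << std::endl;
  }
  // ... the actual transfer work would go here ...
}
```

Because `debug` is `constexpr`, the branch costs nothing at run time
when it is false.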

The second thing I did was to make packCrsMatrix and unpackCrsMatrix
use explicit template instantiation (ETI).  This reduces build time.
I did it specifically to speed up debugging of my changes to the pack
and unpack routines.  [Author's note: This is one reason why it was
hard to reorder the two commits.]
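For readers unfamiliar with ETI, here is a minimal self-contained
sketch of the technique.  In a real ETI setup the "extern template"
declaration lives in a header and the instantiation in a single .cpp
file; both appear here in one translation unit for brevity, and
packValue is a hypothetical stand-in, not a Tpetra routine:

```cpp
#include <cassert>

// A function template whose instantiations we want to control.
template <class T>
T packValue (const T& x) {
  return x; // a real pack routine would serialize x into a buffer
}

// "Header" side: an explicit instantiation declaration.  It promises that
// some translation unit provides the instantiation, so files including the
// header do not instantiate packValue<int> themselves (saving build time).
extern template int packValue<int> (const int&);

// "Source" side: the one explicit instantiation definition.
template int packValue<int> (const int&);
```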

Stokhos' ETI system is more restrictive than Tpetra's ETI system, so I
had to rename unpackCrsMatrix's headers.  Sorry to folks who are
working on this at the moment.

I also fixed possible errors in Tpetra::CrsMatrix::*[gG]aussSeidel*
just to make sure.
Mark Hoemmen committed Aug 31, 2017
1 parent 2b1a05a commit 876bba2
Showing 19 changed files with 1,692 additions and 805 deletions.
packages/stokhos/src/CMakeLists.txt (2 additions, 0 deletions)
@@ -614,6 +614,8 @@ IF (Stokhos_ENABLE_Sacado)
   Experimental::BlockMultiVector
   Experimental::BlockCrsMatrix
   Details::getDiagCopyWithoutOffsets
+  Details::packCrsMatrix
+  Details::unpackCrsMatrixAndCombine
   )
 SET(TPETRAEXT_ETI_CLASSES
   MatrixMatrix
@@ -235,18 +235,23 @@ namespace Details {
     }
   };
 
-  template <typename DS, typename ... DP,
-            typename SS, typename ... SP,
-            typename IdxView, typename Op>
+  template <typename ExecutionSpace,
+            typename DS,
+            typename ... DP,
+            typename SS,
+            typename ... SP,
+            typename IdxView,
+            typename Op>
   struct UnpackArrayMultiColumn<
+    ExecutionSpace,
     Kokkos::View<Sacado::UQ::PCE<DS>**,DP...>,
     Kokkos::View<const Sacado::UQ::PCE<SS>*,SP...>,
     IdxView,
     Op >
   {
     typedef Kokkos::View<Sacado::UQ::PCE<DS>**,DP...> DstView;
     typedef Kokkos::View<const Sacado::UQ::PCE<SS>*,SP...> SrcView;
-    typedef typename DstView::execution_space execution_space;
+    typedef typename ExecutionSpace::execution_space execution_space;
     typedef typename execution_space::size_type size_type;
 
     static const unsigned BlockSize = 32;
@@ -257,50 +262,75 @@
     Op op;
     size_t numCols;
 
-    UnpackArrayMultiColumn(const DstView& dst_,
-                           const SrcView& src_,
-                           const IdxView& idx_,
-                           const Op& op_,
-                           size_t numCols_) :
-      dst(dst_), src(src_), idx(idx_), op(op_), numCols(numCols_) {}
-
-    KOKKOS_INLINE_FUNCTION
-    void operator()( const size_type k ) const {
+    UnpackArrayMultiColumn (const ExecutionSpace& /* execSpace */,
+                            const DstView& dst_,
+                            const SrcView& src_,
+                            const IdxView& idx_,
+                            const Op& op_,
+                            const size_t numCols_) :
+      dst (dst_),
+      src (src_),
+      idx (idx_),
+      op (op_),
+      numCols (numCols_)
+    {}
+
+    KOKKOS_INLINE_FUNCTION void
+    operator() (const size_type k) const
+    {
       const typename IdxView::value_type localRow = idx(k);
       const size_t offset = k*numCols;
-      for (size_t j = 0; j < numCols; ++j)
-        op( dst(localRow,j), src(offset+j) );
+      for (size_t j = 0; j < numCols; ++j) {
+        op (dst(localRow, j), src(offset+j));
+      }
     }
 
-    KOKKOS_INLINE_FUNCTION
-    void operator()( const size_type k, const size_type tidx ) const {
+    KOKKOS_INLINE_FUNCTION void
+    operator() (const size_type k, const size_type tidx) const
+    {
       const typename IdxView::value_type localRow = idx(k);
       const size_t offset = k*numCols;
-      for (size_t j = 0; j < numCols; ++j)
-        for (size_type i=tidx; i<Kokkos::dimension_scalar(dst); i+=BlockSize)
-          op( dst(localRow,j).fastAccessCoeff(i),
-              src(offset+j).fastAccessCoeff(i) );
+      for (size_t j = 0; j < numCols; ++j) {
+        for (size_type i=tidx; i<Kokkos::dimension_scalar(dst); i+=BlockSize) {
+          op (dst(localRow, j).fastAccessCoeff(i),
+              src(offset+j).fastAccessCoeff(i));
+        }
+      }
     }
 
-    static void unpack(const DstView& dst,
-                       const SrcView& src,
-                       const IdxView& idx,
-                       const Op& op,
-                       size_t numCols) {
-      if ( Details::device_is_cuda<execution_space>::value )
-        Kokkos::parallel_for(
-          Kokkos::MPVectorWorkConfig<execution_space>( idx.size(), BlockSize ),
-          UnpackArrayMultiColumn(dst,src,idx,op,numCols) );
-      else
-        Kokkos::parallel_for( idx.size(),
-                              UnpackArrayMultiColumn(dst,src,idx,op,numCols) );
+    static void
+    unpack (const ExecutionSpace& execSpace,
+            const DstView& dst,
+            const SrcView& src,
+            const IdxView& idx,
+            const Op& op,
+            const size_t numCols)
+    {
+      if (Details::device_is_cuda<execution_space>::value) {
+        Kokkos::parallel_for
+          ("Tpetra::MultiVector (Sacado::UQ::PCE) unpack (constant stride)",
+           Kokkos::MPVectorWorkConfig<execution_space> (idx.size (), BlockSize),
+           UnpackArrayMultiColumn (execSpace, dst, src, idx, op, numCols));
+      }
+      else {
+        Kokkos::parallel_for
+          ("Tpetra::MultiVector (Sacado::UQ::PCE) unpack (constant stride)",
+           Kokkos::RangePolicy<execution_space, size_type> (0, idx.size ()),
+           UnpackArrayMultiColumn (execSpace, dst, src, idx, op, numCols));
+      }
     }
   };
 
-  template <typename DS, typename ... DP,
-            typename SS, typename ... SP,
-            typename IdxView, typename ColView, typename Op>
+  template <typename ExecutionSpace,
+            typename DS,
+            typename ... DP,
+            typename SS,
+            typename ... SP,
+            typename IdxView,
+            typename ColView,
+            typename Op>
   struct UnpackArrayMultiColumnVariableStride<
+    ExecutionSpace,
     Kokkos::View<Sacado::UQ::PCE<DS>**,DP...>,
     Kokkos::View<const Sacado::UQ::PCE<SS>*,SP...>,
     IdxView,
@@ -309,7 +339,7 @@
   {
     typedef Kokkos::View<Sacado::UQ::PCE<DS>**,DP...> DstView;
     typedef Kokkos::View<const Sacado::UQ::PCE<SS>*,SP...> SrcView;
-    typedef typename DstView::execution_space execution_space;
+    typedef typename ExecutionSpace::execution_space execution_space;
     typedef typename execution_space::size_type size_type;
 
     static const unsigned BlockSize = 32;
@@ -321,12 +351,13 @@
     Op op;
     size_t numCols;
 
-    UnpackArrayMultiColumnVariableStride(const DstView& dst_,
-                                         const SrcView& src_,
-                                         const IdxView& idx_,
-                                         const ColView& col_,
-                                         const Op& op_,
-                                         size_t numCols_) :
+    UnpackArrayMultiColumnVariableStride (const ExecutionSpace& /* execSpace */,
+                                          const DstView& dst_,
+                                          const SrcView& src_,
+                                          const IdxView& idx_,
+                                          const ColView& col_,
+                                          const Op& op_,
+                                          size_t numCols_) :
       dst(dst_), src(src_), idx(idx_), col(col_), op(op_), numCols(numCols_) {}
 
     KOKKOS_INLINE_FUNCTION
@@ -347,20 +378,29 @@
               src(offset+j).fastAccessCoeff(i) );
     }
 
-    static void unpack(const DstView& dst,
-                       const SrcView& src,
-                       const IdxView& idx,
-                       const ColView& col,
-                       const Op& op,
-                       size_t numCols) {
-      if ( Details::device_is_cuda<execution_space>::value )
-        Kokkos::parallel_for(
-          Kokkos::MPVectorWorkConfig<execution_space>( idx.size(), BlockSize ),
-          UnpackArrayMultiColumnVariableStride(dst,src,idx,col,op,numCols) );
-      else
-        Kokkos::parallel_for( idx.size(),
-                              UnpackArrayMultiColumnVariableStride(
-                                dst,src,idx,col,op,numCols) );
+    static void
+    unpack (const ExecutionSpace& execSpace,
+            const DstView& dst,
+            const SrcView& src,
+            const IdxView& idx,
+            const ColView& col,
+            const Op& op,
+            const size_t numCols)
+    {
+      if (Details::device_is_cuda<execution_space>::value) {
+        Kokkos::parallel_for
+          ("Tpetra::MultiVector unpack (Sacado::UQ::PCE) (nonconstant stride)",
+           Kokkos::MPVectorWorkConfig<execution_space> (idx.size (), BlockSize),
+           UnpackArrayMultiColumnVariableStride (execSpace, dst, src,
+                                                 idx, col, op, numCols));
+      }
+      else {
+        Kokkos::parallel_for
+          ("Tpetra::MultiVector unpack (Sacado::UQ::PCE) (nonconstant stride)",
+           Kokkos::RangePolicy<execution_space, size_type> (0, idx.size ()),
+           UnpackArrayMultiColumnVariableStride (execSpace, dst, src,
+                                                 idx, col, op, numCols));
+      }
     }
   };
 
