Tpetra: Thread parallelization of unpackAndCombineIntoCrsArrays #1665
There is only one section of the (new) unpacking code that is not yet thread parallel. Once that is complete, this PR will be complete. |
b08e108
to
2262f5f
Compare
@mhoemmen, with my latest changes, all Tpetra tests pass on CUDA, all Tpetra tests pass on Darwin with OpenMP node type, and all Tpetra+downstream tests pass with the checkin script. However, two @trilinos/muelu tests fail for OpenMP node types. But, I don't think the issue is with this unpacking work, since they also fail on the HEAD of the develop branch. I'm bisecting the git history as we speak to determine when they started failing for OpenMP node types. Unfortunately, it takes sooooooooooooooooooooo long to recompile... |
@tjfulle See #1660 -- it looks like MueLu uses Mehmet's sparse matrix-matrix multiply now by default, when OpenMP is enabled. I'm not sure if it's possible to disable this with a CMake option any more, but you could try the following, just to see if tests pass with that code disabled:
|
@mhoemmen the failing tests are Muelu_TpetraUnitTes_MPI_[1,4]. My machine is running git bisect right now, I should have the offending commit in a few hours |
Editing last comment to add that those are approximate names; I don't recall the exact names and I am away from a computer |
Ross mentioned that there was a MueLu issue with those tests. Scroll to the bottom of #1304 (today's comments). I think it would be wise to wait until those tests pass, alas -- MueLu is the code that uses CrsMatrix pack and unpack the most. |
2262f5f
to
bb0dff8
Compare
@tjfulle Sweet! :-D Albany has been reporting CUDA build issues involving Stokhos and Tpetra. See https://github.com/gahansen/Albany/issues/176 I'm worried that I built Stokhos and Tpetra, but wasn't able to manifest this build error. Could you take a look? Thanks! :-D |
@mhoemmen I'll take a look on Tuesday morning - I'm checking out for the holiday. |
3834521
to
a58e8e1
Compare
See inline comments. Thanks!
# Tpetra::Details::unpackCrsMatrixAndCombine. Do so for the same | ||
# Tpetra::Details::unpackCrsMatrixAndCombine, | ||
# Tpetra::Details::unpackAndCombineIntoCrsArrays, and | ||
# Tpetra::Details::unpackAndCombineWithOwningPIDsCount. Do so for the same |
Fine for now, but later we could think about splitting these into separate files, for more parallel builds.
@@ -8172,12 +8166,12 @@ namespace Tpetra { | |||
// in a huge list of arrays is icky. Can't we have a bit of an | |||
// abstraction? Implementing a concrete DistObject subclass only | |||
// takes five methods. |
My comment still stands, but it stands for the future, not for this PR ;-) .
/// \param numPacketsPerLID [out] Entry k gives the number of bytes | ||
/// packed for row exportLIDs[k] of the local matrix. | ||
/// | ||
/// \param ixportLIDs [in] Local indices of the rows to pack. |
importLIDs not "ixportLIDs"
/// | ||
/// \param sourceMatrix [in] the CrsMatrix source | ||
/// | ||
/// \param ixports [in/out] Output pack buffer; resized if needed. |
"imports" not "ixports". More importantly, imports is a const Teuchos::ArrayView<const char>&, so it can't possibly be "resized if needed." (This could be just a matter of fixing the documentation.)
I copied and pasted documentation. Perhaps at some time in the past it was an input/output? I'll fix the documentation to reflect the current state
Please do; thanks!
|
||
if (num_bytes_per_value == 0) { | ||
ST val; // packValueCount wants this | ||
num_bytes_per_value = PackTraits<ST, Device>::packValueCount(val); | ||
num_bytes_per_value = PackTraits<ST, DT>::packValueCount(ST()); |
This will break Stokhos, due to the assumption that default-constructed ST has the right size, but it was broken before anyway ;-) . Thus, it's OK to defer fixing this to a separate PR.
I think this branch is unused and can be removed. All callers to this function send in a value > 0
@tjfulle What's actually the point of the num_bytes_per_value branch?
I believe when I put the branch in I was combining the several unpacking schemes, and this branch allowed for one case in which num_bytes_per_value was not known and Scalar was default constructed. On further combining/refactoring, that case went away.
Stokhos never calls this function; it calls packValueCount and friends. This function is a combination of similar functions in the unpackCrsMatrix and unpackWithOwningPIDs functions. Stokhos's use of packValueCount is pretty simple and will allow simplifying packValueCount, as @ibaned has advocated.
I'm just looking at the code in full now and remember why the branch exists. When unpacking the CrsMatrix, matrix values are not known a priori, so the only (current) way of getting the size of the values is by the default constructor. When unpacking into CrsArrays, at least some matrix values are known and those values can be used to determine the size needed to unpack.
I've modified the code to push the computation of the number of bytes per value further upstream. It does not remove the issue, but simplifies this function by removing the branch.
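To make the concern in this thread concrete, here is a minimal sketch of why sizing a pack buffer from a default-constructed scalar can go wrong. Everything in it is hypothetical: `RuntimeSizedScalar` only stands in for a scalar type whose storage is sized at run time, and `packedBytes` is not Tpetra's `PackTraits` API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar whose storage is sized at run time. It only imitates
// the kind of type discussed above; it is not a real Stokhos class.
struct RuntimeSizedScalar {
  std::vector<double> coeffs;  // empty when default-constructed
  RuntimeSizedScalar() = default;
  explicit RuntimeSizedScalar(std::size_t n) : coeffs(n, 0.0) {}
};

// Hypothetical "bytes needed to pack one value" query (NOT PackTraits).
inline std::size_t packedBytes(const double&) { return sizeof(double); }
inline std::size_t packedBytes(const RuntimeSizedScalar& x) {
  return x.coeffs.size() * sizeof(double);
}

// Size taken from a default-constructed value: the pattern under review.
template <class ST>
std::size_t bytesFromDefault() { return packedBytes(ST()); }

// Size taken from a value that already exists, e.g. one matrix entry.
template <class ST>
std::size_t bytesFromSample(const ST& sample) { return packedBytes(sample); }
```

For `double` both queries agree. For the run-time-sized type, `bytesFromDefault` reports zero bytes while a sample with three coefficients reports `3 * sizeof(double)`, which is the mismatch the reviewers describe.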
struct TotNumEntTag {}; | ||
|
||
/// \brief Functor to determine the number of entries in a matrix using | ||
/// Kokkos::parallel_reduce |
The documentation looks wrong -- this is a functor to determine either the total number of bytes required for unpacking incoming matrix entries, or the maximum number of bytes required for unpacking a row of incoming matrix entries, right?
If so, could you do both at once? Each requires one pass over the incoming packed data, so you could get the max and total in one shot.
I believe the functor does compute the max and total number of entries for array allocation purposes. They can probably be combined.
The tag could be more explicit: it's the max num entries on any one row. It's used to allocate scratch space on device.
@mhoemmen said:
If so, could you do both at once? Each requires one pass over the incoming packed data, so you could get the max and total in one shot.
I suppose they could be combined, but it would not save work done as the two counting functions are not called in the same procedure. Unpacking into the CrsMatrix requires the maximum number of entries in any one row and unpacking into CrsArrays requires the total number of entries, thus two separate counting procedures (but only one functor).
but it would not save work done as the two counting functions are not called in the same procedure.
It seems like you could compute both of them once, before anybody needs them, and then pass along those two integer values (the max and the total).
Maybe I wasn't too clear. The total is needed for transferAndFillComplete and the maximum is needed for packAndPrepare. I'm having a hard time seeing how computing the max and total at once would be beneficial as these two procedures are independent.
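For reference, the "max and total in one shot" idea from this thread can be sketched as a single reduction over per-row entry counts. This is a serial, standard-library illustration only. The names `EntryCounts` and `countEntries` are made up here, and a Kokkos version would presumably use a `parallel_reduce` with a two-field value type instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type: both quantities discussed in the review.
struct EntryCounts {
  std::size_t total = 0;      // total number of entries over all rows
  std::size_t maxPerRow = 0;  // largest number of entries in any one row
};

// One pass over the per-row counts, updating both fields together.
inline EntryCounts countEntries(const std::vector<std::size_t>& numEntPerRow) {
  return std::accumulate(
      numEntPerRow.begin(), numEntPerRow.end(), EntryCounts{},
      [](EntryCounts acc, std::size_t n) {
        acc.total += n;
        acc.maxPerRow = std::max(acc.maxPerRow, n);
        return acc;
      });
}
```

Whether this pays off depends on the point raised above: if the total and the maximum are consumed by independent procedures, the combined pass only helps when its result can be computed once and passed to both.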
Addresses feedback from @mhoemmen for PR trilinos#1665
@mhoemmen, the latest commit addresses your comments. Passes all Tpetra+downstream on RHEL6 with/without OpenMP and Tpetra tests pass on CUDA 8.0.44 (I really ought to write a script that automagically runs all "standard" Tpetra tests for me...) |
@tjfulle Not to scold, but I'm just curious: Do you have trouble getting downstream tests to build with CUDA? I would say, as long as MueLu and Stokhos pass with CUDA, we're golden :-) . |
If you mean "run Trilinos tests with all combinations of OpenMP and CUDA options," Christian might have a script that you could modify and use. |
I've only built and run Tpetra tests, I can run the others later tonight or tomorrow
It's trickier to get CUDA builds to work with downstream stuff. You may have to disable some packages that don't matter so much for MueLu and Stokhos testing. Also, no hurry :) . |
Kinda, I was more thinking a script like Ross' remote testing script that would build and run tests on different remote machines to test all important builds simultaneously |
@tjfulle wrote:
I think we could build something like that out of Ross' script. It would make sense to farm out the test platforms to a cloud service. Ross is probably working on that already! |
The fix will be in Stokhos. |
@mhoemmen , I can run tests easily on Cuda, but debugging is a bit clumsy on an interactive node. |
I forgot to mention this issue in the commit message, but I just pushed a change to Stokhos that appears to fix the failing Stokhos test. |
Thanks @etphipp ! |
74d47e6
to
c8769a7
Compare
Adding the [WIP] label until the failing Stokhos tests can be resolved. |
OK, I believe all of the failing Stokhos tests have been resolved. There was a problem with compute capability 6.x that was preventing the PCE mat-vec from functioning properly, which was causing the failures on ride (and nowhere else). All of the Stokhos tests pass on ride for me now. |
c8769a7
to
a8b943e
Compare
- Moved unpackAndCombineIntoCrsArrays (and friends) from Tpetra_Import_Util2.hpp to Tpetra_Details_unpackCrsMatrixAndCombine_de*.hpp.
- unpackAndCombineIntoCrsArrays was broken up into many smaller functions (it was previously one large monolithic function). Each of the small functions was refactored to be thread parallel.
- Race conditions were identified and resolved, mostly by using Kokkos::atomic_fetch_add where appropriate.

Addresses: trilinos#797, trilinos#800, trilinos#802
Review: @mhoemmen

Tests were run on two different machines and their results amended to this commit:

Build/Test Cases Summary [RHEL6, standard checkin script]
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1506,notpassed=0 (19.13 min)
1) MPI_RELEASE_DEBUG_SHARED_OPENMP_PT => passed: passed=1509,notpassed=0 (13.75 min)

Build/Test Cases Summary [ride.sandia.gov, CUDA]
Enabled Packages: Tpetra,MueLu,Stokhos
0) MPI_RELEASE_SHARED_CUDA => passed=233,notpassed=14 (8.76 min)

The 14 failing tests are unrelated MueLu tests that can be ignored; see trilinos#1699.
The failing Stokhos tests mentioned in trilinos#1655 were fixed with commit e97e37b.
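The race-condition fix named in the commit message follows a common pattern: several threads increment shared per-row counters, so each increment has to be atomic. The sketch below shows that pattern with `std::atomic` and `std::thread` as stand-ins. It is not the Tpetra code, `countPerRow` is a hypothetical name, and the real implementation uses `Kokkos::atomic_fetch_add` on device views.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Count how many incoming entries target each row. Several threads may hit
// the same row, so a plain ++counts[row] would be a data race.
inline std::vector<std::size_t> countPerRow(
    const std::vector<std::size_t>& targetRows,  // row index of each entry
    std::size_t numRows, unsigned numThreads) {
  if (numThreads == 0) numThreads = 1;
  std::vector<std::atomic<std::size_t>> counts(numRows);
  for (auto& c : counts) c.store(0);

  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back([&counts, &targetRows, t, numThreads] {
      // Each thread takes a strided slice of the incoming entries.
      for (std::size_t i = t; i < targetRows.size(); i += numThreads) {
        counts[targetRows[i]].fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& w : workers) w.join();

  // Copy the atomic counters into an ordinary vector for the caller.
  std::vector<std::size_t> result(numRows);
  for (std::size_t r = 0; r < numRows; ++r) result[r] = counts[r].load();
  return result;
}
```

Relaxed ordering is enough in this sketch because the counters are only read after every worker has been joined.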
a8b943e
to
7cc9af2
Compare
@mhoemmen, with the updates to Stokhos, all tests are passing on RHEL6 and CUDA. This PR is ready once it passes your review. |
@tjfulle SWEET :-D I'll (re-)review! |
I apologize that the commit is so large! Swapping in the |
I'm working on #1706 first, then this :-) |
@tjfulle turns out I can multitask! ;-) thanks for working on this!!! :-D |
Hm, did we ever test this for complex builds? I'm getting some build errors.... I'll work on it though. |
I've got a fix for complex builds ready and will push soon. |
@trilinos/tpetra If Scalar could be std::complex<T>, it needs to turn into impl_scalar_type (via reinterpret_cast) before it enters the Kokkos world.
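As a rough illustration of the cast mentioned above, the sketch below reinterprets an array of `std::complex<double>` as interleaved `double` values before handing it to a routine that knows nothing about `std::complex`. The C++ standard guarantees this particular cast to the underlying real type. Treating `impl_scalar_type` (a Kokkos complex type when Scalar is `std::complex<T>`) as layout-compatible in the same way is an assumption the Tpetra code relies on, and is not something this sketch demonstrates. Both function names are hypothetical.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical kernel-side routine: it sees only raw interleaved doubles
// (real part at index 2*i, imaginary part at index 2*i + 1).
inline double sumRealInterleaved(const double* interleaved, std::size_t n) {
  double s = 0.0;
  for (std::size_t i = 0; i < n; ++i) s += interleaved[2 * i];
  return s;
}

// Host-side wrapper: convert the pointer type once, at the boundary, so
// that std::complex never enters the kernel-side code.
inline double sumReal(const std::complex<double>* z, std::size_t n) {
  return sumRealInterleaved(reinterpret_cast<const double*>(z), n);
}
```

The design point is that the conversion happens in exactly one place, which makes a missed cast (the kind of error behind the complex-build failure mentioned here) easier to spot.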
Doh! The last push was not tested with complex. What was the error? |
@tjfulle Just pushed the fix for complex builds. There was one spot that was passing a I think we could fix |
Work in Progress pull request to discuss with @mhoemmen
This pull request addresses thread parallelization of unpackAndCombineIntoCrsArrays. The current state passes all standard tests with and without OpenMP. All @trilinos/tpetra tests pass on CUDA (I did not run any other tests).