
Fix performance bugs in scalar reductions #509

Merged

Conversation

magnatelee (Contributor)

No description provided.

* Use unsigned 64-bit integers instead of signed integers wherever
  possible; CUDA hasn't added an atomic intrinsic for the latter yet
  (a sketch follows this list).

* Move reduction buffers from zero-copy memory to framebuffer. This
  makes the slow atomic update code path in reduction operators
  run much more efficiently.
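
To make the first bullet concrete, here is a minimal CUDA sketch (hypothetical kernels, not code from this PR): atomicAdd has a native overload for unsigned long long, whereas a signed 64-bit accumulator has to be emulated with an atomicCAS retry loop, which is much slower under contention.

// Fast path: CUDA provides a hardware atomicAdd for unsigned 64-bit values.
__global__ void count_unsigned(unsigned long long* total)
{
  atomicAdd(total, 1ULL);  // single native intrinsic
}

// What a signed 64-bit accumulator would force instead: emulate the atomic
// with a compare-and-swap loop on the same bits, retrying until it succeeds.
__device__ void atomic_add_signed(long long* addr, long long value)
{
  auto* bits = reinterpret_cast<unsigned long long*>(addr);
  unsigned long long old = *bits, assumed;
  do {
    assumed = old;
    auto updated = static_cast<long long>(assumed) + value;
    old = atomicCAS(bits, assumed, static_cast<unsigned long long>(updated));
  } while (old != assumed);
}

__global__ void count_signed(long long* total) { atomic_add_signed(total, 1); }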
magnatelee (Contributor, Author)

This fixes #506 (cc @rohany).

manopapad (Contributor)

Do you want to also replace the use of DeferredReduction in binary_red.cu?

magnatelee (Contributor, Author)

> Do you want to also replace the use of DeferredReduction in binary_red.cu?

Addressed in b204473.

src/cunumeric/cuda_help.h
CHECK_CUDA(cudaMemcpyAsync(ptr_, &identity, sizeof(LHS), cudaMemcpyHostToDevice, stream));
}

__device__ void operator<<=(const RHS& value) const
Contributor:

Nit: obviously these things come down to preference and this is just matching Legion, but I would personally suggest writing this out as a function name rather than overloading an operator. This appears to be doing an atomic reduce. The reduce_output helper function was a little difficult to parse with the <<= (a bit-shift operator borrowed for a different purpose) instead of just having a function call that says exactly what the code is doing (result.non_exclusive_fold, e.g.).
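
To illustrate the suggestion, a hypothetical before/after sketch (the body is schematic, and non_exclusive_fold is the reviewer's example name, not necessarily the final API):

// Before: operator overload; call sites read `result <<= value`, which looks
// like a bit shift even though it performs an atomic reduction.
__device__ void operator<<=(const RHS& value) const
{
  REDOP::template fold<false /*exclusive*/>(*ptr_, value);
}

// After: a named method; call sites read `result.non_exclusive_fold(value)`,
// which says exactly what happens: a non-exclusive (atomic) fold.
__device__ void non_exclusive_fold(const RHS& value) const
{
  REDOP::template fold<false /*exclusive*/>(*ptr_, value);
}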

using RHS = typename REDOP::RHS;

public:
ScalarReductionBuffer(cudaStream_t stream) : buffer_(legate::create_buffer<LHS>(1))
Contributor:

The class name obviously gets annoyingly long, but consider calling this 'DeviceScalarReductionBuffer' to make it clear that this is not a general reduction buffer and is only designed for device reductions.

Contributor:

Since this is again only going to run on the device, do we want to explicitly pass GPU_FB_MEM to create_buffer to make it clearer what is happening? Otherwise this uses the default kind = NO_MEMKIND, which relies on get_executing_processor() returning TOC_PROC to allocate this in the right place and seems potentially fragile.
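
A sketch of the suggested change, assuming legate's create_buffer takes a Legion Memory::Kind as an optional argument (the exact spelling of the enumerator may differ):

// Before: kind defaults to NO_MEMKIND, so placement silently depends on
// get_executing_processor() resolving to a TOC_PROC (GPU) at runtime.
auto buffer = legate::create_buffer<LHS>(1);

// After: request framebuffer memory explicitly, making the intent visible
// at the call site and independent of where the task happens to run.
auto buffer = legate::create_buffer<LHS>(1, Legion::Memory::Kind::GPU_FB_MEM);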

magnatelee merged commit e65032b into nv-legate:branch-22.10 on Aug 6, 2022
magnatelee deleted the fix-perf-bug-scalar-reduction branch on Aug 6, 2022 at 00:58
sbak5 pushed a commit to sbak5/cunumeric that referenced this pull request Aug 17, 2022
* Unify the template for device reduction tree and do some cleanup

* Fix performance bugs in scalar reduction kernels:

* Use unsigned 64-bit integers instead of signed integers wherever
  possible; CUDA hasn't added an atomic intrinsic for the latter yet.

* Move reduction buffers from zero-copy memory to framebuffer. This
  makes the slow atomic update code path in reduction operators
  run much more efficiently.

* Use the new scalar reduction buffer in binary reductions as well

* Use only the RHS type in the reduction buffer as we never call apply

* Minor clean up per review

* Rename the buffer class and method to make the intent explicit

* Flip the polarity of reduce's template parameter
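
Pulling the commit list above together, a hypothetical sketch of what the final class might look like (names and signatures are illustrative, not the exact cunumeric code): the buffer lives in framebuffer memory, stores only REDOP::RHS since apply is never called, and exposes a named method instead of operator<<=.

template <typename REDOP>
class DeviceScalarReductionBuffer {
  using RHS = typename REDOP::RHS;

 public:
  DeviceScalarReductionBuffer(cudaStream_t stream)
    : buffer_(legate::create_buffer<RHS>(1, Legion::Memory::Kind::GPU_FB_MEM))
  {
    ptr_ = buffer_.ptr(0);
    // Seed the accumulator with the reduction identity.
    RHS identity{REDOP::identity};
    CHECK_CUDA(cudaMemcpyAsync(ptr_, &identity, sizeof(RHS), cudaMemcpyHostToDevice, stream));
  }

  // Non-exclusive (atomic) fold, used by concurrently running thread blocks.
  __device__ void reduce(const RHS& value) const
  {
    REDOP::template fold<false /*exclusive*/>(*ptr_, value);
  }

  // Copy the final value back to the host once the kernels have finished.
  __host__ RHS read(cudaStream_t stream) const
  {
    RHS result{REDOP::identity};
    CHECK_CUDA(cudaMemcpyAsync(&result, ptr_, sizeof(RHS), cudaMemcpyDeviceToHost, stream));
    CHECK_CUDA(cudaStreamSynchronize(stream));
    return result;
  }

 private:
  legate::Buffer<RHS> buffer_;
  RHS* ptr_;
};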
marcinz pushed a commit to marcinz/cunumeric that referenced this pull request on Aug 17, 2022 (same commit message as above)
marcinz added a commit that referenced this pull request on Aug 17, 2022 (same commit message as above)

Co-authored-by: Wonchan Lee <wonchanl@nvidia.com>