Make no warp sync assumption for CC7.x #3211
Conversation
src/cudamatrix/cu-kernels.cu (outdated)

@@ -1010,11 +1010,14 @@ static void _trace_mat_mat(const Real* A, const Real* B, MatrixDim dA,
     __syncthreads();
   }

-  // Warp reduce
+  // Warp reduce. Implicitly synchronized within a warp.
Thanks.
I think this reduction is just summing the array. Surely there must be a CUB approach for this -- wouldn't that be a more standard approach? I don't know much about this stuff, just want to know.
cub may be better if it can be used here. I will take a look soon.
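For context, a CUB block-wide sum usually looks like the sketch below. This is illustrative only, not the actual Kaldi change; the kernel name, `BLOCK_SIZE`, and the grid-stride-free indexing are assumptions.

```cuda
#include <cub/cub.cuh>

// Minimal sketch of a CUB block-wide sum (hypothetical kernel, not Kaldi's).
// Each block reduces BLOCK_SIZE elements; CUB handles all intra-block
// synchronization internally, so no hand-written warp-synchronous code.
template <typename Real, int BLOCK_SIZE>
__global__ static void _block_sum_sketch(const Real* in, Real* out, int n) {
  typedef cub::BlockReduce<Real, BLOCK_SIZE> BlockReduceT;
  __shared__ typename BlockReduceT::TempStorage temp_storage;

  // Each thread loads one element (0 when out of range).
  int idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  Real val = (idx < n) ? in[idx] : Real(0);

  // Block-wide reduction; the result is only valid in thread 0.
  Real sum = BlockReduceT(temp_storage).Sum(val);
  if (threadIdx.x == 0) out[blockIdx.x] = sum;
}
```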
Speed (GFLOPS):

                                           size  no-cub     cub  speedup
CuMatrix::TraceMatMat<double>                16    0.01    0.01    1.01x
CuMatrix::TraceMatMat<double>                32    0.05    0.05    0.99x
CuMatrix::TraceMatMat<double>                64    0.20    0.20    1.00x
CuMatrix::TraceMatMat<double>               128    0.73    0.80    1.10x
CuMatrix::TraceMatMat<double>               256    2.34    2.33    1.00x
CuMatrix::TraceMatMat<double>               512    6.74    5.60    0.83x
CuMatrix::TraceMatMat<double>              1024   11.78   11.54    0.98x
CuMatrix::TraceMatMat<double>              2048   14.71   14.58    0.99x
CuMatrix::TraceMatMat<double>              4096   15.82   15.70    0.99x
CuMatrix::TraceMatMat<double>              8192   16.01   15.90    0.99x
CuMatrix::TraceMatMat<double>[transposed]    16    0.01    0.01    1.03x
CuMatrix::TraceMatMat<double>[transposed]    32    0.05    0.05    1.02x
CuMatrix::TraceMatMat<double>[transposed]    64    0.19    0.20    1.05x
CuMatrix::TraceMatMat<double>[transposed]   128    0.64    0.78    1.23x
CuMatrix::TraceMatMat<double>[transposed]   256    2.33    2.34    1.00x
CuMatrix::TraceMatMat<double>[transposed]   512    6.60    5.68    0.86x
CuMatrix::TraceMatMat<double>[transposed]  1024   11.83   10.99    0.93x
CuMatrix::TraceMatMat<double>[transposed]  2048   14.78   14.77    1.00x
CuMatrix::TraceMatMat<double>[transposed]  4096   15.98   15.93    1.00x
CuMatrix::TraceMatMat<double>[transposed]  8192   16.17   16.17    1.00x
CuMatrix::TraceMatMat<float>                 16    0.01    0.01    1.01x
CuMatrix::TraceMatMat<float>                 32    0.05    0.05    1.01x
CuMatrix::TraceMatMat<float>                 64    0.21    0.22    1.02x
CuMatrix::TraceMatMat<float>                128    0.83    0.85    1.03x
CuMatrix::TraceMatMat<float>                256    3.21    3.27    1.02x
CuMatrix::TraceMatMat<float>                512    9.09    9.10    1.00x
CuMatrix::TraceMatMat<float>               1024   19.55   19.67    1.01x
CuMatrix::TraceMatMat<float>               2048   27.42   27.53    1.00x
CuMatrix::TraceMatMat<float>               4096   30.54   30.50    1.00x
CuMatrix::TraceMatMat<float>               8192   31.49   31.44    1.00x
CuMatrix::TraceMatMat<float>[transposed]     16    0.01    0.01    1.03x
CuMatrix::TraceMatMat<float>[transposed]     32    0.05    0.05    1.05x
CuMatrix::TraceMatMat<float>[transposed]     64    0.21    0.22    1.05x
CuMatrix::TraceMatMat<float>[transposed]    128    0.81    0.86    1.05x
CuMatrix::TraceMatMat<float>[transposed]    256    3.20    3.25    1.02x
CuMatrix::TraceMatMat<float>[transposed]    512    9.06    9.13    1.01x
CuMatrix::TraceMatMat<float>[transposed]   1024   17.29   19.05    1.10x
CuMatrix::TraceMatMat<float>[transposed]   2048   26.17   26.22    1.00x
CuMatrix::TraceMatMat<float>[transposed]   4096   29.22   29.32    1.00x
CuMatrix::TraceMatMat<float>[transposed]   8192   30.68   30.63    1.00x

cub block reduce for _add_diag_mat_mat_MNT
CUB block reduce for

All GPU tests have passed.
Thanks. There is a largish PR from @luitjens that I want to merge first to check for conflicts, before I merge this.
Hopefully it won't conflict, but there will be one new routine that will need to be updated to match. I chose not to fix the warp-sync issues in the code I was touching, as I knew someone else was working on it. So the routine that I based my code on had warp-sync issues, which persisted into the new routine.
@kangshiyin: @luitjens has, I believe, reworked or fixed this in some way in his PR #3221 which I'm about to merge.
No, this has not been fixed. Reposting my comment here for visibility.

=====

I think we should remove this code from TOT and wait for a fully vetted patch before reapplying. The CUB reduce code all works fine. I've tested the functions individually using make test and they pass. However, this assumes make test covers these functions.

I then went and disabled all warp syncs and the code passes. I then went and selectively enabled some warp syncs. Here is at least one area that fails:
template<EnumTransformReduce TransReduceType, typename Real>
__global__
static void _group_transform_reduce(
    Real *y, const Real *x, const MatrixDim d, const int src_stride,
    const int group_size, const TransReduceOp<TransReduceType, Real> op) {

  __shared__ Real sreduction[CU1DBLOCK];
  const int i = blockIdx.x;
  const int x_start = i * src_stride;
  const int y_start = i * d.stride;
  const int threads_per_group = blockDim.x;

  // Reduce n groups per thread block
  const int n = blockDim.y;
  const int len = group_size * n;

  // linear thread id
  const int tid = threadIdx.y * threads_per_group + threadIdx.x;
  int j = threadIdx.y * group_size + threadIdx.x;  // col-id of *x
  int group_id = threadIdx.y;                      // col-id of *y
  int group_end = x_start + (group_id + 1) * group_size;

  while (group_id < d.cols) {
    // reduce to threads_per_group elements per group
    int x_idx = x_start + j;
    Real treduction = op.Transform(x[x_idx]);
    x_idx += threads_per_group;
    while (x_idx < group_end) {
      treduction = op.Reduce(treduction, op.Transform(x[x_idx]));
      x_idx += threads_per_group;
    }
    sreduction[tid] = treduction;
    if (threads_per_group > warpSize) {
      __syncthreads();
    }
    // PROBLEM HERE: when threads_per_group <= 32 this is warp synchronous
    // and writes to sreduction above may not be visible at reads below.

    // tree-reduce to 2x warpSize elements per group
#   pragma unroll
    for (int shift = threads_per_group / 2; shift > warpSize; shift >>= 1) {
      if (threadIdx.x < shift) {
        sreduction[tid] = op.Reduce(sreduction[tid], sreduction[tid + shift]);
      }
      __syncthreads();
    }

    // Warp-reduce to 1 element per group.
    const int warp_reduce_size =
        threads_per_group / 2 < warpSize ? threads_per_group / 2 : warpSize;
    if (threadIdx.x < warp_reduce_size) {
#     pragma unroll
      for (int shift = warp_reduce_size; shift > 0; shift >>= 1) {
        Real buf = op.Reduce(sreduction[tid], sreduction[tid + shift]);
        __syncwarp();  // PROBLEM HERE: assumes 0xFFFFFFFF
        sreduction[tid] = buf;
        __syncwarp();  // PROBLEM HERE: assumes 0xFFFFFFFF
      }
    }
There are two issues in the code above.

1) When blockDim.x <= 32 we don't call __syncwarp() before reading from shared memory.
2) When doing the warp reduction we don't pass in the mask, which is wrong when blockDim.x < 32.

Everywhere in this patchset that __syncwarp() is called, we are just using the default mask 0xFFFFFFFF, which is only safe if we can guarantee blockDim.x >= 32. If we cannot, then we need to rework those kernels.

Overall I think we should move this code to CUB or cooperative groups and get rid of the __syncwarp() stuff, as it is very brittle.
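To illustrate the second issue (the missing mask): one way to build an explicit member mask for a sub-warp group is sketched below. This is a hypothetical helper, not the committed fix, and it assumes threads_per_group is a power of two no larger than warpSize, so each group lies entirely inside one warp.

```cuda
// Hypothetical sketch: explicit member mask for a sub-warp group, so that
// __syncwarp(mask) only involves the group's own lanes instead of the
// default 0xFFFFFFFF. Assumes threads_per_group is a power of two <= 32.
__device__ inline unsigned int GroupMask(int threads_per_group, int tid) {
  const int lane = tid & (warpSize - 1);                   // lane id in warp
  const int group_base = lane & ~(threads_per_group - 1);  // group's 1st lane
  const unsigned int ones =                                // tpg low bits set
      (threads_per_group == warpSize) ? 0xFFFFFFFFu
                                      : ((1u << threads_per_group) - 1u);
  return ones << group_base;
}

// Usage inside the warp-reduce loop (sketch):
//   unsigned int mask = GroupMask(threads_per_group, tid);
//   Real buf = op.Reduce(sreduction[tid], sreduction[tid + shift]);
//   __syncwarp(mask);
//   sreduction[tid] = buf;
//   __syncwarp(mask);
```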
See here for more details on the issues with warpsync:
https://devblogs.nvidia.com/using-cuda-warp-level-primitives/
If we don't want to move directly to CUB for these reductions, we should at minimum use cooperative groups, which give us the portable abstractions we need to safely express these algorithms. In particular we could use the tiled partition: https://devblogs.nvidia.com/cooperative-groups/
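A tiled-partition version of a sub-warp reduction might look roughly like this. It is a sketch only (the function name and the sum operation are illustrative); TILE must be a power of two no larger than 32 for a static thread_block_tile.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch of the tiled-partition approach suggested above (illustrative only).
// The statically sized tile gives portable sub-warp synchronization; the
// shuffle-based reduction needs no shared memory and no hand-built masks.
template <typename Real, int TILE>
__device__ Real TileSum(cg::thread_block block, Real val) {
  cg::thread_block_tile<TILE> tile = cg::tiled_partition<TILE>(block);
  for (int shift = tile.size() / 2; shift > 0; shift >>= 1) {
    val += tile.shfl_down(val, shift);  // implicitly synced within the tile
  }
  return val;  // full sum is valid only at tile.thread_rank() == 0
}
```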
Note: the reason I'm seeing these failures and you are not is likely that this is a Volta+ issue and internally your hardware is not Volta+. Do you have any Turing boards you could test on?
Revert "…pute capability 7.x (kaldi-asr#3211)" (kaldi-asr#3236)

This reverts commit 4cfbd21.
Trying to fix the issue mentioned in #3080.

Adding __syncwarp() to all warp reduction code, as some of it may not be replaceable by CUB block reduction. It should not change the behavior for CC6.x and older GPUs. All GPU tests have passed on my CC5.x GPU with CUDA 9. Performance is exactly the same for TraceMatMat(). But I don't have a CC7.x GPU, so I'm not 100% sure it works on one.
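The pattern being added (an explicit __syncwarp() between the shared-memory read and write in the warp-reduce tail) looks roughly like the sketch below. This is illustrative, not the exact diff; it assumes a full warp participates (blockDim.x >= 32) and that sdata has at least 2 * warpSize valid entries, so the values computed by the upper lanes are simply unused.

```cuda
// Sketch of the described change: a shared-memory warp-reduce tail with
// explicit __syncwarp() so it stays correct on CC7.x, where warp lanes are
// no longer guaranteed to execute in lockstep. All 32 lanes run the loop;
// only sdata[tid] for tid == 0 ends up holding the full sum.
template <typename Real>
__device__ void WarpReduceSum(volatile Real* sdata, int tid) {
  for (int shift = warpSize / 2; shift > 0; shift >>= 1) {
    Real buf = sdata[tid] + sdata[tid + shift];
    __syncwarp();   // all lanes finish reading before anyone writes...
    sdata[tid] = buf;
    __syncwarp();   // ...and all writes land before the next iteration reads
  }
}
```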