Make no warp sync assumption for CC7.x #3211
Conversation
src/cudamatrix/cu-kernels.cu (outdated)

@@ -1010,11 +1010,14 @@ static void _trace_mat_mat(const Real* A, const Real* B, MatrixDim dA,
     __syncthreads();
   }

-  // Warp reduce
+  // Warp reduce. Implicitly synchronized within a warp.
Thanks.
I think this reduction is just summing the array. Surely there must be a CUB approach for this -- wouldn't that be a more standard approach? I don't know much about this stuff, just want to know.
cub may be better if it can be used here. I will take a look soon.
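For context, a CUB block-wide sum usually looks like the sketch below. This is illustrative only, not the actual Kaldi change; the kernel name, `BLOCK_SIZE`, and the grid-stride-free indexing are assumptions.

```cuda
#include <cub/cub.cuh>

// Minimal sketch of a CUB block-wide sum (hypothetical kernel, not Kaldi's).
// Each block reduces BLOCK_SIZE elements; CUB handles all intra-block
// synchronization internally, so no hand-written warp-synchronous code.
template <typename Real, int BLOCK_SIZE>
__global__ static void _block_sum_sketch(const Real* in, Real* out, int n) {
  typedef cub::BlockReduce<Real, BLOCK_SIZE> BlockReduceT;
  __shared__ typename BlockReduceT::TempStorage temp_storage;

  // Each thread loads one element (0 when out of range).
  int idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
  Real val = (idx < n) ? in[idx] : Real(0);

  // Block-wide reduction; the result is only valid in thread 0.
  Real sum = BlockReduceT(temp_storage).Sum(val);
  if (threadIdx.x == 0) out[blockIdx.x] = sum;
}
```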
Speed (GFLOPS):

                                           size  no-cub     cub  speedup
CuMatrix::TraceMatMat<double>                16    0.01    0.01    1.01x
CuMatrix::TraceMatMat<double>                32    0.05    0.05    0.99x
CuMatrix::TraceMatMat<double>                64    0.20    0.20    1.00x
CuMatrix::TraceMatMat<double>               128    0.73    0.80    1.10x
CuMatrix::TraceMatMat<double>               256    2.34    2.33    1.00x
CuMatrix::TraceMatMat<double>               512    6.74    5.60    0.83x
CuMatrix::TraceMatMat<double>              1024   11.78   11.54    0.98x
CuMatrix::TraceMatMat<double>              2048   14.71   14.58    0.99x
CuMatrix::TraceMatMat<double>              4096   15.82   15.70    0.99x
CuMatrix::TraceMatMat<double>              8192   16.01   15.90    0.99x
CuMatrix::TraceMatMat<double>[transposed]    16    0.01    0.01    1.03x
CuMatrix::TraceMatMat<double>[transposed]    32    0.05    0.05    1.02x
CuMatrix::TraceMatMat<double>[transposed]    64    0.19    0.20    1.05x
CuMatrix::TraceMatMat<double>[transposed]   128    0.64    0.78    1.23x
CuMatrix::TraceMatMat<double>[transposed]   256    2.33    2.34    1.00x
CuMatrix::TraceMatMat<double>[transposed]   512    6.60    5.68    0.86x
CuMatrix::TraceMatMat<double>[transposed]  1024   11.83   10.99    0.93x
CuMatrix::TraceMatMat<double>[transposed]  2048   14.78   14.77    1.00x
CuMatrix::TraceMatMat<double>[transposed]  4096   15.98   15.93    1.00x
CuMatrix::TraceMatMat<double>[transposed]  8192   16.17   16.17    1.00x
CuMatrix::TraceMatMat<float>                 16    0.01    0.01    1.01x
CuMatrix::TraceMatMat<float>                 32    0.05    0.05    1.01x
CuMatrix::TraceMatMat<float>                 64    0.21    0.22    1.02x
CuMatrix::TraceMatMat<float>                128    0.83    0.85    1.03x
CuMatrix::TraceMatMat<float>                256    3.21    3.27    1.02x
CuMatrix::TraceMatMat<float>                512    9.09    9.10    1.00x
CuMatrix::TraceMatMat<float>               1024   19.55   19.67    1.01x
CuMatrix::TraceMatMat<float>               2048   27.42   27.53    1.00x
CuMatrix::TraceMatMat<float>               4096   30.54   30.50    1.00x
CuMatrix::TraceMatMat<float>               8192   31.49   31.44    1.00x
CuMatrix::TraceMatMat<float>[transposed]     16    0.01    0.01    1.03x
CuMatrix::TraceMatMat<float>[transposed]     32    0.05    0.05    1.05x
CuMatrix::TraceMatMat<float>[transposed]     64    0.21    0.22    1.05x
CuMatrix::TraceMatMat<float>[transposed]    128    0.81    0.86    1.05x
CuMatrix::TraceMatMat<float>[transposed]    256    3.20    3.25    1.02x
CuMatrix::TraceMatMat<float>[transposed]    512    9.06    9.13    1.01x
CuMatrix::TraceMatMat<float>[transposed]   1024   17.29   19.05    1.10x
CuMatrix::TraceMatMat<float>[transposed]   2048   26.17   26.22    1.00x
CuMatrix::TraceMatMat<float>[transposed]   4096   29.22   29.32    1.00x
CuMatrix::TraceMatMat<float>[transposed]   8192   30.68   30.63    1.00x

cub block reduce for _add_diag_mat_mat_MNT
CUB block reduce for

All GPU tests have passed.
Thanks. There is a largish PR from @luitjens that I want to merge first to check for conflicts, before I merge this.
Hopefully it won't conflict, but there will be one new routine that will need to be updated to match. I chose not to fix the warp-sync issues in the code I was touching, as I knew someone else was working on it. So the routine that I based my code on had warp-sync issues, which persisted into the new routine.
@kangshiyin: @luitjens has, I believe, reworked or fixed this in some way in his PR #3221 which I'm about to merge.
No, this has not been fixed. Reposting my comment here for visibility.

=====

I think we should remove this code from TOT and wait for a fully vetted patch before reapplying. The CUB reduce code all works fine. I've tested the functions individually using make test and they pass. However, this assumes make test covers these functions.

I then went and disabled all warp syncs and the code passes. I then went and selectively enabled some warp syncs. Here is at least one area that fails:
template<EnumTransformReduce TransReduceType, typename Real>
__global__
static void _group_transform_reduce(
    Real *y, const Real *x, const MatrixDim d, const int src_stride,
    const int group_size, const TransReduceOp<TransReduceType, Real> op) {

  __shared__ Real sreduction[CU1DBLOCK];
  const int i = blockIdx.x;
  const int x_start = i * src_stride;
  const int y_start = i * d.stride;
  const int threads_per_group = blockDim.x;

  // Reduce n groups per thread block
  const int n = blockDim.y;
  const int len = group_size * n;

  // linear thread id
  const int tid = threadIdx.y * threads_per_group + threadIdx.x;
  int j = threadIdx.y * group_size + threadIdx.x;  // col-id of *x
  int group_id = threadIdx.y;                      // col-id of *y
  int group_end = x_start + (group_id + 1) * group_size;

  while (group_id < d.cols) {
    // reduce to threads_per_group elements per group
    int x_idx = x_start + j;
    Real treduction = op.Transform(x[x_idx]);
    x_idx += threads_per_group;
    while (x_idx < group_end) {
      treduction = op.Reduce(treduction, op.Transform(x[x_idx]));
      x_idx += threads_per_group;
    }
    sreduction[tid] = treduction;
    if (threads_per_group > warpSize) {
      __syncthreads();
    }
    // PROBLEM HERE: when threads_per_group <= 32 this is warp synchronous
    // and writes to sreduction above may not be visible at reads below.

    // tree-reduce to 2x warpSize elements per group
#   pragma unroll
    for (int shift = threads_per_group / 2; shift > warpSize; shift >>= 1) {
      if (threadIdx.x < shift) {
        sreduction[tid] = op.Reduce(sreduction[tid], sreduction[tid + shift]);
      }
      __syncthreads();
    }

    // Warp-reduce to 1 element per group.
    const int warp_reduce_size =
        threads_per_group / 2 < warpSize ? threads_per_group / 2 : warpSize;
    if (threadIdx.x < warp_reduce_size) {
#     pragma unroll
      for (int shift = warp_reduce_size; shift > 0; shift >>= 1) {
        Real buf = op.Reduce(sreduction[tid], sreduction[tid + shift]);
        __syncwarp();  // PROBLEM HERE: assumes 0xFFFFFFFF
        sreduction[tid] = buf;
        __syncwarp();  // PROBLEM HERE: assumes 0xFFFFFFFF
      }
    }
There are two issues in the code above.

1) When blockDim.x <= 32 we don't call __syncwarp() before reading from shared memory.
2) When doing the warp reduction we don't pass in the mask, which is wrong when blockDim.x < 32.

Everywhere in this patchset that __syncwarp() is called, we are just using the default mask 0xFFFFFFFF, which is only safe if we can guarantee blockDim.x >= 32. If we cannot, then we need to rework those kernels.

Overall I think we should move this code to CUB or cooperative groups and get rid of the __syncwarp() stuff, as it is very brittle.
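To illustrate the second issue (the missing mask): one way to build an explicit member mask for a sub-warp group is sketched below. This is a hypothetical helper, not the committed fix, and it assumes threads_per_group is a power of two no larger than warpSize, so each group lies entirely inside one warp.

```cuda
// Hypothetical sketch: explicit member mask for a sub-warp group, so that
// __syncwarp(mask) only involves the group's own lanes instead of the
// default 0xFFFFFFFF. Assumes threads_per_group is a power of two <= 32.
__device__ inline unsigned int GroupMask(int threads_per_group, int tid) {
  const int lane = tid & (warpSize - 1);                   // lane id in warp
  const int group_base = lane & ~(threads_per_group - 1);  // group's 1st lane
  const unsigned int ones =                                // tpg low bits set
      (threads_per_group == warpSize) ? 0xFFFFFFFFu
                                      : ((1u << threads_per_group) - 1u);
  return ones << group_base;
}

// Usage inside the warp-reduce loop (sketch):
//   unsigned int mask = GroupMask(threads_per_group, tid);
//   Real buf = op.Reduce(sreduction[tid], sreduction[tid + shift]);
//   __syncwarp(mask);
//   sreduction[tid] = buf;
//   __syncwarp(mask);
```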
See here for more details on the issues with warpsync:
https://devblogs.nvidia.com/using-cuda-warp-level-primitives/
If we don't want to move directly to CUB for these reductions, we should at minimum use cooperative groups, which give us the portable abstractions we need to safely express these algorithms. In particular we could use the tiled partition: https://devblogs.nvidia.com/cooperative-groups/
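A tiled-partition version of a sub-warp reduction might look roughly like this. It is a sketch only (the function name and the sum operation are illustrative); TILE must be a power of two no larger than 32 for a static thread_block_tile.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch of the tiled-partition approach suggested above (illustrative only).
// The statically sized tile gives portable sub-warp synchronization; the
// shuffle-based reduction needs no shared memory and no hand-built masks.
template <typename Real, int TILE>
__device__ Real TileSum(cg::thread_block block, Real val) {
  cg::thread_block_tile<TILE> tile = cg::tiled_partition<TILE>(block);
  for (int shift = tile.size() / 2; shift > 0; shift >>= 1) {
    val += tile.shfl_down(val, shift);  // implicitly synced within the tile
  }
  return val;  // full sum is valid only at tile.thread_rank() == 0
}
```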
Note: the reason I'm seeing these failures and you are not is likely that this is a Volta+ issue and internally your hardware is not Volta+. Do you have any Turing boards you could test on?
Revert "…pute capability 7.x (kaldi-asr#3211)" (kaldi-asr#3236)

This reverts commit 4cfbd21.
Trying to fix the issue mentioned in #3080.

Adding __syncwarp() to all warp reduction code, as some of it may not be replaceable by CUB block reduction. It should not change the behavior for CC6.x and older GPUs. All GPU tests have passed on my CC5.x GPU with CUDA 9. Performance is exactly the same for TraceMatMat(). But I don't have a CC7.x GPU, so I'm not 100% sure it works on one.
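The pattern being added (an explicit __syncwarp() between the shared-memory read and write in the warp-reduce tail) looks roughly like the sketch below. This is illustrative, not the exact diff; it assumes a full warp participates (blockDim.x >= 32) and that sdata has at least 2 * warpSize valid entries, so the values computed by the upper lanes are simply unused.

```cuda
// Sketch of the described change: a shared-memory warp-reduce tail with
// explicit __syncwarp() so it stays correct on CC7.x, where warp lanes are
// no longer guaranteed to execute in lockstep. All 32 lanes run the loop;
// only sdata[tid] for tid == 0 ends up holding the full sum.
template <typename Real>
__device__ void WarpReduceSum(volatile Real* sdata, int tid) {
  for (int shift = warpSize / 2; shift > 0; shift >>= 1) {
    Real buf = sdata[tid] + sdata[tid + shift];
    __syncwarp();   // all lanes finish reading before anyone writes...
    sdata[tid] = buf;
    __syncwarp();   // ...and all writes land before the next iteration reads
  }
}
```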