Make no warp sync assumption for CC7.x #3211

Merged (2 commits) on Apr 12, 2019

kangshiyin
Contributor

Trying to fix the issue mentioned in #3080.

This adds __syncwarp() to all warp-reduction code, since some of it may not be replaceable by CUB block reduction. It should not change behavior on CC6.x and older GPUs.

All GPU tests pass on my CC5.x GPU with CUDA 9, and the performance of TraceMatMat() is exactly the same. I don't have a CC7.x GPU, though, so I can't be 100% sure it works there.
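
For readers unfamiliar with the pattern, here is a minimal sketch of a shared-memory reduction that ends with a warp-level stage guarded by __syncwarp(). It is an illustration only, not the code changed in this PR: the kernel name `block_sum`, the helper `check_block_sum`, and the sizes `kBlockSize` and `kWarpSize` are all assumptions made for the example. It assumes CUDA 9 or later, where __syncwarp() is available.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sizes for illustration; not taken from the Kaldi sources.
constexpr int kBlockSize = 256;  // threads per block (power of two)
constexpr int kWarpSize = 32;

// Each block sums kBlockSize consecutive floats into out[blockIdx.x].
__global__ void block_sum(const float* in, float* out) {
  __shared__ float smem[kBlockSize];
  const int tid = threadIdx.x;
  smem[tid] = in[blockIdx.x * kBlockSize + tid];
  __syncthreads();

  // Cross-warp tree reduce: a block-wide barrier is needed after each step.
  for (int shift = kBlockSize / 2; shift >= kWarpSize; shift >>= 1) {
    if (tid < shift) smem[tid] += smem[tid + shift];
    __syncthreads();
  }

  // Warp reduce over smem[0..31]. No lockstep assumption is made:
  // __syncwarp() orders the shared-memory accesses of one step before
  // those of the next step.
  if (tid < kWarpSize) {
    for (int shift = kWarpSize / 2; shift > 0; shift >>= 1) {
      if (tid < shift) smem[tid] += smem[tid + shift];
      __syncwarp();
    }
    if (tid == 0) out[blockIdx.x] = smem[0];
  }
}

// Host-side check: fills the input with ones, so each block sum must
// equal kBlockSize.
bool check_block_sum() {
  const int num_blocks = 4;
  const int n = num_blocks * kBlockSize;
  float* in = nullptr;
  float* out = nullptr;
  if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return false;
  if (cudaMallocManaged(&out, num_blocks * sizeof(float)) != cudaSuccess) {
    cudaFree(in);
    return false;
  }
  for (int i = 0; i < n; ++i) in[i] = 1.0f;
  block_sum<<<num_blocks, kBlockSize>>>(in, out);
  bool ok = (cudaDeviceSynchronize() == cudaSuccess);
  for (int b = 0; ok && b < num_blocks; ++b) {
    ok = (out[b] == static_cast<float>(kBlockSize));
  }
  cudaFree(in);
  cudaFree(out);
  return ok;
}

int main() {
  const bool ok = check_block_sum();
  std::printf("%s\n", ok ? "ok" : "mismatch");
  return ok ? 0 : 1;
}
```

The `if (tid < shift)` guard keeps each step free of read/write overlap, and the __syncwarp() call sits outside that guard so that all 32 threads of the first warp reach it.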

@@ -1010,11 +1010,14 @@ static void _trace_mat_mat(const Real* A, const Real* B, MatrixDim dA,
 __syncthreads();
 }

-// Warp reduce. Implicitly synchronized within a warp.
+// Warp reduce
Contributor

Thanks.
I think this reduction is just summing the array. Surely there is a CUB approach for this; wouldn't that be more standard?
I don't know much about this stuff, I just want to know.

Contributor Author

CUB may be better if it can be used here. I will take a look soon.

cub block reduce for _add_diag_mat_mat_MNT
@kangshiyin
Contributor Author

Switched _trace_mat_mat and _add_diag_mat_mat_MNT to CUB block reduce. Performance is almost the same: most sizes are within a few percent, with outliers from 0.83x to 1.23x in the table below.

All GPU tests have passed.

              benchmark (speed in GFLOPS),  size no-cub    cub speedup
            CuMatrix::TraceMatMat<double>,    16   0.01   0.01   1.01x
            CuMatrix::TraceMatMat<double>,    32   0.05   0.05   0.99x
            CuMatrix::TraceMatMat<double>,    64   0.20   0.20   1.00x
            CuMatrix::TraceMatMat<double>,   128   0.73   0.80   1.10x
            CuMatrix::TraceMatMat<double>,   256   2.34   2.33   1.00x
            CuMatrix::TraceMatMat<double>,   512   6.74   5.60   0.83x
            CuMatrix::TraceMatMat<double>,  1024  11.78  11.54   0.98x
            CuMatrix::TraceMatMat<double>,  2048  14.71  14.58   0.99x
            CuMatrix::TraceMatMat<double>,  4096  15.82  15.70   0.99x
            CuMatrix::TraceMatMat<double>,  8192  16.01  15.90   0.99x
CuMatrix::TraceMatMat<double>[transposed],    16   0.01   0.01   1.03x
CuMatrix::TraceMatMat<double>[transposed],    32   0.05   0.05   1.02x
CuMatrix::TraceMatMat<double>[transposed],    64   0.19   0.20   1.05x
CuMatrix::TraceMatMat<double>[transposed],   128   0.64   0.78   1.23x
CuMatrix::TraceMatMat<double>[transposed],   256   2.33   2.34   1.00x
CuMatrix::TraceMatMat<double>[transposed],   512   6.60   5.68   0.86x
CuMatrix::TraceMatMat<double>[transposed],  1024  11.83  10.99   0.93x
CuMatrix::TraceMatMat<double>[transposed],  2048  14.78  14.77   1.00x
CuMatrix::TraceMatMat<double>[transposed],  4096  15.98  15.93   1.00x
CuMatrix::TraceMatMat<double>[transposed],  8192  16.17  16.17   1.00x
             CuMatrix::TraceMatMat<float>,    16   0.01   0.01   1.01x
             CuMatrix::TraceMatMat<float>,    32   0.05   0.05   1.01x
             CuMatrix::TraceMatMat<float>,    64   0.21   0.22   1.02x
             CuMatrix::TraceMatMat<float>,   128   0.83   0.85   1.03x
             CuMatrix::TraceMatMat<float>,   256   3.21   3.27   1.02x
             CuMatrix::TraceMatMat<float>,   512   9.09   9.10   1.00x
             CuMatrix::TraceMatMat<float>,  1024  19.55  19.67   1.01x
             CuMatrix::TraceMatMat<float>,  2048  27.42  27.53   1.00x
             CuMatrix::TraceMatMat<float>,  4096  30.54  30.50   1.00x
             CuMatrix::TraceMatMat<float>,  8192  31.49  31.44   1.00x
 CuMatrix::TraceMatMat<float>[transposed],    16   0.01   0.01   1.03x
 CuMatrix::TraceMatMat<float>[transposed],    32   0.05   0.05   1.05x
 CuMatrix::TraceMatMat<float>[transposed],    64   0.21   0.22   1.05x
 CuMatrix::TraceMatMat<float>[transposed],   128   0.81   0.86   1.05x
 CuMatrix::TraceMatMat<float>[transposed],   256   3.20   3.25   1.02x
 CuMatrix::TraceMatMat<float>[transposed],   512   9.06   9.13   1.01x
 CuMatrix::TraceMatMat<float>[transposed],  1024  17.29  19.05   1.10x
 CuMatrix::TraceMatMat<float>[transposed],  2048  26.17  26.22   1.00x
 CuMatrix::TraceMatMat<float>[transposed],  4096  29.22  29.32   1.00x
 CuMatrix::TraceMatMat<float>[transposed],  8192  30.68  30.63   1.00x
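
As a rough illustration of what a CUB block reduce looks like, the sketch below computes per-block partial sums of an elementwise product with cub::BlockReduce. It is not the Kaldi kernel: `dot_partial`, `check_dot`, `kBlockSize`, and the test data are assumptions made for the example, and an elementwise dot product is used only as a simple stand-in for the trace computation. It assumes the CUB headers are on the include path (they ship with recent CUDA toolkits).

```cuda
#include <cstdio>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical block size for illustration; not taken from the Kaldi sources.
constexpr int kBlockSize = 256;

// Each block writes the sum of a[i] * b[i] over its strided slice
// to partial[blockIdx.x].
__global__ void dot_partial(const float* a, const float* b, int n,
                            float* partial) {
  using BlockReduce = cub::BlockReduce<float, kBlockSize>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  // Per-thread accumulation over a grid-stride loop.
  float local = 0.0f;
  for (int i = blockIdx.x * kBlockSize + threadIdx.x; i < n;
       i += gridDim.x * kBlockSize) {
    local += a[i] * b[i];
  }

  // CUB handles all intra-block synchronization; the returned value is
  // only valid in thread 0.
  const float sum = BlockReduce(temp_storage).Sum(local);
  if (threadIdx.x == 0) partial[blockIdx.x] = sum;
}

// Host-side check with a = 1 and b = 2 everywhere, so the dot product
// must equal 2 * n.
bool check_dot() {
  const int n = 1000;
  const int num_blocks = 2;
  float* a = nullptr;
  float* b = nullptr;
  float* partial = nullptr;
  if (cudaMallocManaged(&a, n * sizeof(float)) != cudaSuccess) return false;
  if (cudaMallocManaged(&b, n * sizeof(float)) != cudaSuccess) {
    cudaFree(a);
    return false;
  }
  if (cudaMallocManaged(&partial, num_blocks * sizeof(float)) != cudaSuccess) {
    cudaFree(a);
    cudaFree(b);
    return false;
  }
  for (int i = 0; i < n; ++i) {
    a[i] = 1.0f;
    b[i] = 2.0f;
  }
  dot_partial<<<num_blocks, kBlockSize>>>(a, b, n, partial);
  bool ok = (cudaDeviceSynchronize() == cudaSuccess);
  float total = 0.0f;
  for (int k = 0; ok && k < num_blocks; ++k) total += partial[k];
  ok = ok && (total == 2.0f * n);
  cudaFree(a);
  cudaFree(b);
  cudaFree(partial);
  return ok;
}

int main() {
  const bool ok = check_dot();
  std::printf("%s\n", ok ? "ok" : "mismatch");
  return ok ? 0 : 1;
}
```

With this approach the kernel contains no hand-written warp stage, so it makes no assumption about implicit warp synchronization.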

@danpovey
Contributor

Thanks. There is a largish PR from @luitjens that I want to merge first to check for conflicts, before I merge this.

@luitjens
Contributor

Hopefully it won't conflict, but there will be one new routine that needs to be updated to match. I chose not to fix the warp-sync issues in the code I was touching, as I knew someone else was working on them. So the routine I based my code on had warp-sync issues, and they persisted into the new routine.

@danpovey danpovey merged commit 4cfbd21 into kaldi-asr:master Apr 12, 2019
danpovey added a commit that referenced this pull request Apr 16, 2019
danpovey added a commit that referenced this pull request Apr 16, 2019
@danpovey
Contributor

@kangshiyin: @luitjens has, I believe, reworked or fixed this in some way in his PR #3221 which I'm about to merge.

@luitjens
Contributor

luitjens commented Apr 22, 2019 via email

danpovey pushed a commit to danpovey/kaldi that referenced this pull request Jun 19, 2019
danpovey added a commit to danpovey/kaldi that referenced this pull request Jun 19, 2019