
[NHWC] add gpu reference kernel for nhwc #728

Merged

merged 8 commits into develop on Mar 18, 2021
Conversation

carlushuang
Contributor

@carlushuang carlushuang commented Feb 4, 2021

  • Support NHWC/NDHWC, fwd/bwd/wrw, fp32/fp16/bf16, and group conv for the reference kernel.
  • Fix MIOpenDriver so the CPU NHWC convolution produces correct results.
  • Refactor LayoutToStrides() into test/tensor_layout.hpp so that both the driver and test code can call this function.
  • Dump layout info when MIOPEN_ENABLE_LOGGING_CMD=1 if the current layout is not the default NCHW layout.

Testing

Essentially all tests are in test/gpu_reference_kernel.cpp, which covers combinations of these conv problems (NCHW/NHWC, 2D/3D, fp32/fp16/bf16 cases) with a little randomness to speed up the ctest process.


driver/conv_driver.hpp (resolved)
@JehandadKhan
Collaborator

@asroy Can I bother you to review the kernel?

src/problem_description.cpp (resolved)
warmup_wei = tensor<warmup_Tgpu>(miopen::deref(warmupWeightTensor).GetLengths());
warmup_out = tensor<warmup_Tgpu>(miopen::deref(warmupOutputTensor).GetLengths());
warmup_in  = tensor<warmup_Tgpu>(miopen::deref(warmupInputTensor).GetLengths(),
                                 miopen::deref(warmupInputTensor).GetStrides());
Contributor

Optimization for speed?

Contributor Author

Nope... this is indeed the way we pass layout information into the tensor type: by supplying the stride information when calling the constructor.

Contributor

Ah, I see.

[Recommendation] Revert changes for warmup_* tensors. These are always NCHW, and strides can be initialized implicitly.

@jerryyin
Member

jerryyin commented Feb 9, 2021

I did a test run for this branch:

MIOPEN_FIND_MODE=1 MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=6 ./bin/MIOpenDriver convfp16 -n 256 -c 128 -H 28 -W 28 -k 512 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --out_layout NHWC --fil_layout NHWC -m conv -g 1 -F 4 -t 1

The result is:

MIOpen Backward Weights Conv. Algorithm: 0, Solution: 33/gemm
GPU Kernel Time Backward Weights Conv. Elapsed: 10.103986 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 256, 128, 28, 28, 1, 1, 512, 26306674688, 0, 0, 2604, 0, 10.103986
Backward Convolution Weights Verifies OK on CPU reference (0.00277143)

Only when I add MIOPEN_DEBUG_CONV_GEMM=0 do I see Solution: 87/ConvDirectNaiveConvWrw.

My question is: why is gemm still selected when I didn't disable it explicitly? I'd assume that in either case only the naive convolution kernel would be picked.

@carlushuang
Contributor Author

> My question is: why is gemm still selected when I didn't disable it explicitly? I'd assume that in either case only the naive convolution kernel would be picked.

I guess gemm may already support NHWC, via some sort of NHWC-to-NCHW transformation? Anyway, this naive implementation is not performance-optimized, so it may be very slow.

@jerryyin
Member

jerryyin commented Feb 9, 2021

I guess gemm may already support NHWC, via some sort of NHWC-to-NCHW transformation?

I don't think so. As far as I remember, gemm is hard-coded to support NCHW. What this means is that MIOpen profiles the available algorithms, finds that gemm is faster (naturally), and therefore uses gemm instead of the naive direct convolution. In this situation, execution will give wrong results.

I think this implies that you will need to disable gemm in NHWC mode in order not to return wrong results. I will leave it to the MIOpen developers to decide whether this needs to be addressed in this PR, or whether to file an issue and address it later (in case it gets forgotten). For now, a short-term workaround is to use the environment variable to disable gemm.

@carlushuang
Contributor Author

Let's first open an issue for gemm NHWC #742

@atamazov
Contributor

@jerryyin @carlushuang

I don't think so. As far as I remember, gemm is hard-coded to support NCHW.

But how did it pass validation then?

@carlushuang
Contributor Author

@atamazov I think it is due to computation error: I tested a fwd case and its computation error is larger than the default nrms threshold, while @jerryyin's case may happen to fall within the default nrms.

 ./bin/MIOpenDriver  conv -n 256 -c 128 -H 28 -W 28 -k 512 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --out_layout NHWC --fil_layout NHWC -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 256 -c 128 -H 28 -W 28 -k 512 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 --in_layout NHWC --out_layout NHWC --fil_layout NHWC -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 0, Solution: 33/gemm
GPU Kernel Time Forward Conv. Elapsed: 1.145352 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 256, 128, 28, 28, 1, 1, 512,  26306674688, 103022592, 411041792, 22968, 449, 1.145352
Forward Convolution Failed: 0.28159 > 1.5e-06

@asroy
Contributor

asroy commented Feb 12, 2021

@carlushuang Could you elaborate on what tests you have done for these reference kernels?

@carlushuang
Contributor Author

carlushuang commented Feb 12, 2021

@asroy Essentially all tests are in test/gpu_reference_kernel.cpp, which covers combinations of these conv problems (NCHW/NHWC, 2D/3D, fp32/fp16/bf16 cases) with a little randomness to speed up the ctest process.

@atamazov left a comment (Contributor)

Some minor comments. Otherwise LGTM.

kernel.kernel_name = ConvDirectNaiveConvKernelName(ctx);
kernel.g_wk.clear();

kernel.g_wk.push_back(grid_size * block_size);
Contributor

[Notice] If there is a possibility that grid_size * block_size exceeds INT_MAX, please consider using size_t. g_wk and l_wk in KernelInfo are vectors of size_t.

Contributor Author

OK, let me change this

Contributor

?

src/solver/conv_direct_naive_conv_bwd.cpp (outdated, resolved)
src/solver/conv_direct_naive_conv_fwd.cpp (outdated, resolved)
src/solver/conv_direct_naive_conv_wrw.cpp (outdated, resolved)
@atamazov
Contributor

PR description updated with testing info from #728 (comment)

- int block_size = 256;
- int grid_size = k;
+ size_t block_size = 256;
+ size_t grid_size = static_cast<size_t>(k);
Contributor

[Note] FYI, the static cast is not needed here. No need to fix.

@atamazov left a comment (Contributor)

LGTM

@atamazov atamazov marked this pull request as draft March 3, 2021 23:20
@atamazov
Contributor

atamazov commented Mar 3, 2021

Converted to draft to avoid spoiling the docker image cache on CI. Please change this back to a "normal" PR only after the CI moratorium ends (see mail). Then first merge from develop. Push the [Ready for review] button only after that. Thanks.

@carlushuang carlushuang marked this pull request as ready for review March 15, 2021 23:55
@atamazov
Contributor

@carlushuang Please first merge from develop, then re-run CI; otherwise your job will never pass. CI job stopped.

@atamazov
Contributor

atamazov commented Mar 16, 2021

@jerryyin Is this PR ready to be merged? There is a blocking review from you.

@carlushuang
Contributor Author

@atamazov OK let me merge develop

@atamazov
Contributor

Gagarin: Off we go! 🚀

@atamazov atamazov merged commit 1c72099 into develop Mar 18, 2021
@carlushuang carlushuang deleted the reference_kernel_nhwc branch April 9, 2021 00:21
@atamazov
Contributor

@carlushuang Please look at #1532 (comment), thanks!

Contributor

🐛 This PR creates a copy of tensor_layout.hpp from ./src/include/miopen, which is wrong: it creates an implicit dependence and is error-prone. The immediate bug is that this header uses THE SAME guard as the original: GUARD_TENSOR_LAYOUT_HPP.

5 participants