
3x3 convolutions performance problem #19

Closed
ghost opened this issue Oct 12, 2017 · 10 comments

ghost commented Oct 12, 2017

Hi, dear friends!

We are investigating MIOpen's potential for use in our deep learning applications, but ResNet-52 shows strangely long times for layers with 3x3 convolutions (batch size 64).

For 1x1 convolutions the AMD Vega often performs on par with a GTX 1080 Ti, but for 3x3 convolutions the time is up to 10 times longer.

For example,
{28, 28, 128} -> {28, 28, 128} by filter {3, 3} with stride {1, 1}:
NVIDIA 1080 Ti: 13.6 µs (73470 ips)
AMD RX Vega: 127.2 µs (7862 ips)

Full per-layer performance in TSV: https://gist.github.com/hex000/de2aebf622d2120fb6a57c42a0c7d90e

The time here is given per image. The test runs for a full second before hipDeviceSynchronize(), so the sync time is amortized.

What could be wrong here? Maybe I haven't set some secret flag? The algorithm chosen by FindConvolution is always "miopenConvolutionFwdAlgoWinograd", and it offers no other options for such convolutions.

The test code is at https://gist.github.com/hex000/81fe99ae8c10f4384e64960275e91554; it compiles for both AMD and NVIDIA.
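
For reference, here is a minimal sketch of the timing scheme described above. This is not the gist code: the time_per_image helper is hypothetical and assumes the handle, tensor/convolution descriptors, device buffers, and workspace have already been created, but the MIOpen and HIP calls are the public API.

#include <chrono>
#include <hip/hip_runtime.h>
#include <miopen/miopen.h>

// Hypothetical helper: descriptors, buffers and workspace are assumed to be set up already.
double time_per_image(miopenHandle_t handle,
                      miopenTensorDescriptor_t xDesc, void* x,
                      miopenTensorDescriptor_t wDesc, void* w,
                      miopenConvolutionDescriptor_t convDesc,
                      miopenTensorDescriptor_t yDesc, void* y,
                      void* workspace, size_t workspaceSize, int batchSize)
{
    // Let MIOpen pick the algorithm it considers fastest for this layer
    // (in our runs it always returns miopenConvolutionFwdAlgoWinograd for 3x3).
    int returnedAlgoCount = 0;
    miopenConvAlgoPerf_t perf;
    miopenFindConvolutionForwardAlgorithm(handle, xDesc, x, wDesc, w, convDesc,
                                          yDesc, y, /*requestAlgoCount=*/1,
                                          &returnedAlgoCount, &perf,
                                          workspace, workspaceSize,
                                          /*exhaustiveSearch=*/false);

    const float alpha = 1.0f, beta = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    long iters = 0;
    // Keep enqueueing forward convolutions for roughly one second of wall time.
    do {
        miopenConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                                 perf.fwd_algo, &beta, yDesc, y,
                                 workspace, workspaceSize);
        ++iters;
    } while (std::chrono::duration<double>(
                 std::chrono::steady_clock::now() - start).count() < 1.0);

    // One synchronization at the end, so its cost is amortized over all iterations.
    hipDeviceSynchronize();
    const double total = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    return total / (static_cast<double>(iters) * batchSize);  // seconds per image
}

The per-image figure reported above is then total time / (iterations * batch size).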

dagamayank (Contributor) commented Oct 12, 2017

@Hex000 Thanks for contacting us. While I run your test code and look into the issue, can you please tell me about your system configuration?

  • Which distro?
  • Output of uname -a
  • Output of /opt/rocm/bin/hipconfig

How was MIOpen installed? Was it built from source?

@dagamayank (Contributor)

@Hex000 I am fairly certain something is wrong with your environment. On my Vega GPU I see times quite similar to an NVIDIA 1080 Ti for the particular layer you mentioned above. Can you please also let me know the exact model of your GPU?

12	13.0 µs (76852 ips)	conv {28, 28, 128} -> {28, 28, 128} by filter {3, 3} with stride {1, 1}	algo 3	expected 11.9 µs	0.0 Mb

ghost (Author) commented Oct 13, 2017

@dagamayank, thank you!
This is good news; I need to check my configuration.

The GPU is an AMD RX Vega 64.

hipconfig output:
https://gist.github.com/hex000/2fdc37178854038cf7600133acebf92d

MIOpen installation info:
https://gist.github.com/hex000/12462cd6e0a75796b9beb24b6a798f4b
It is the official binary distribution from http://repo.radeon.com/rocm; is that the best source, or would it be better to build from source?

Driver info:
https://gist.github.com/hex000/1c311c4a6fc5545b5aba4231bbf9300b
The driver is integrated into the kernel; is that the best way?

@prostowww

@dagamayank We tried running with the driver provided with kernel 4.11.0-kfd-compute-rocm-rel-1.6-148 from http://repo.radeon.com/rocm and got around 120 µs.
With the default Ubuntu kernel, 4.4.0-62-generic, and the AMDGPU-PRO 17.30 driver we got 19 µs, which is much better but still slower than your result.

@dagamayank (Contributor)

@Hex000 @prostowww Please consider the performance that I shared as "dev-preview".

MIOpen requires some improvements in the base software stack, which are planned to be released within the next two weeks as part of ROCm 1.6.4. Your systems are currently configured with ROCm 1.6.3 (the last public release), which may be the cause of the poor performance you are seeing.

I highly recommend not mixing the AMDGPU-PRO and ROCm software stacks on the same system.

As an experiment, can you please try setting this environment variable and checking the performance again?
export MIOPEN_DEBUG_AMD_ROCM_PRECOMPILED_BINARIES=0

I will ping you once ROCm 1.6.4 is public.

masahi commented Oct 18, 2017

@Hex000 Just curious, do you have both CUDA and ROCm installed on the same machine? Have you experienced any issues?

ghost (Author) commented Oct 18, 2017

Dear friends, thank you for the answers!

No, CUDA and ROCm are on different machines, so this was not tested.

One more complaint from me as a user: warping nine-dimensional space for 30 minutes is too tough:
[02] Testing conv {56, 56, 64} -> {56, 56, 64} by filter {3, 3} with stride {1, 1}
Searching the best solution in the 9 dim space. Please, be patient it may take few minutes.
Runs left : 13714, min time so far : 3.7838, curr time : 3.96322, 8, 16, 8, 16, 1, 2, 5, 3, 1
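
For what it is worth, if I read the MIOpen API correctly, this long sweep is the exhaustive tuning pass gated by the final boolean argument of miopenFindConvolutionForwardAlgorithm. A minimal sketch, reusing the hypothetical handle/descriptor/workspace names from the sketch above:

int returnedAlgoCount = 0;
miopenConvAlgoPerf_t perf;
// Passing false skips the long per-layer tuning sweep and picks an algorithm
// heuristically; passing true runs the search once and, as far as I understand,
// caches the winning configuration in MIOpen's performance database so it is not
// repeated on later runs.
miopenFindConvolutionForwardAlgorithm(handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
                                      /*requestAlgoCount=*/1, &returnedAlgoCount, &perf,
                                      workspace, workspaceSize,
                                      /*exhaustiveSearch=*/false);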

@dagamayank (Contributor)

@Hex000 @prostowww

Just wanted to let you know that we released ROCm 1.6.4 last week along with MIOpen v1.1.4, in case you are not already aware. Can you please try updating your systems and running the performance experiment again?

For performance measurements, I would first like you to set an additional parameter for now:

sudo -s
echo 1 > /sys/module/amdkfd/parameters/noretry 
exit

The above is currently a workaround for a page-migration issue in ROCm; it will be fixed soon.

ghost (Author) commented Nov 14, 2017

@dagamayank, thank you!
noretry gives ~10% improvement.

dagamayank pushed a commit that referenced this issue Dec 20, 2017
@dagamayank (Contributor)

@Hex000 FYI - there is a new release of both ROCm and MIOpen, so you may want to benchmark again. For now, I am closing this issue. Please create a new issue if you have questions or notice discrepancies.

ltqin pushed a commit that referenced this issue Oct 28, 2021
646fcc268 Merge pull request #47 from ROCmSoftwarePlatform/develop
6014185ac [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44)
3e9113707 Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all
211dae822 Merge branch 'develop' into miopen_downstream_all
5890e3007 [Composable Kernel] update develop branch code to ck_upstream
d5297abae fix bug in gridwise gemm xdlops v2r3 (#45)
38a90b6ed Merge pull request #43 from ROCmSoftwarePlatform/develop
c3018794b bug fix (#39)
fd49ff808 add nchw atomic , nhwc and nhwc atomic method   for backward weight (#30)
b2dc55f82 [MIOpen Downstream] Fix Reduction Kernel (#34)
b3e8d57d5 Tweak GEMM kernel (#38)
846f462bd Add VectorType support into StaticBuffer (#27)
dfb80c4e3 [Enhancements] Several bugfixes and refactoring of dynamic generic reduction  (#1156)
8557901d0 Merge pull request #1165 from ROCmSoftwarePlatform/develop
f305bebdc Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr
b725e3fc8 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr
88833bd9a Merge pull request #32 from ROCmSoftwarePlatform/develop
df0d68106 :Merge remote-tracking branch 'origin/develop' into CK_upstream
f3acd2510 Add  a version of Merge transform that use integerdivision and mod (#25)
19613902b GEMM driver and kernel (#29)
627d8ef35 Backward weight v4r4r2 with xdlops (#18)
10bb81106 Misc fixes (#24)
9e80cdceb [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction  (#1108)
a7a758d8c GlobalAtomicAdd for fp32/int32 (#23)
9d3f634a3 Xdlops refactor fix (#22)
c6f26bb48 magic division use __umulhi() (#19)
6fe3627a9 Composable kernel init integration v3 (#1097)
a2ad6d353 refactor dynamic xdlops iGemm (#13)
ba6f79a75 Added host_conv_wrw for verification (#15)

git-subtree-dir: src/composable_kernel
git-subtree-split: 646fcc268ede841a16cdaafb68aa64803d8390e1