Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancements] Several bugfixes and refactoring of dynamic generic reduction #1156

Merged
merged 65 commits into from
Sep 29, 2021

Conversation

qianfengz
Copy link
Contributor

@qianfengz qianfengz commented Sep 15, 2021

This P.R fixes several visible bugs in dynamic generic reduction, and commits are also added to refactor the codes following the review from 54615.

Here are my explanations about three commits

  • 2f4fe70 fixes the issue that one passed type parameter of the template is not consistent with the dst buffer data type (compType) while using BlockWise and ThreadWise transfer
  • f098bfb fixes the issue that GetZeroVal() is not correctly implemented with some binary operator on half type on both the kernel and host
  • ba91b99 fixes the issue in the kernel that value of dstDataType is converted to compType before writing out, this could cause losing of precision when srcDataType and compType are both half, while the dstDataType is float, even though this situation might not be common, miopenReduceTensor() does support it

The commits have passed the following testing locally on MI100

bin/test_reduce_test --all
bin/test_reduce_test --all --half
bin/test_reduce_test --all --double

MIOpenDriver script 1 for non-indexable operations

#!/bin/bash

PRECISION=reduce  ## reducefp64 reducefp16

if test -n $PRECISION && test "$PRECISION" = "reducefp16"; then 
   CTYPE="-C 1"
else
   CTYPE=""
fi

if [ $# -ge 1 ] ; then
    NREPEAT=$1
else
    NREPEAT=1
fi

for op in 0 5 6 7; do
    set -x
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0 -O $op $CTYPE  -t 1 -i $NREPEAT 
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 1 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 2 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 3 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,1 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 1,2 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 2,3 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,2 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 1,3 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,3 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,1,2 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,1,3 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 1,2,3 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,2,3 -O $op $CTYPE  -t 1 -i $NREPEAT
    bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,1,2,3 -O $op $CTYPE  -t 1 -i $NREPEAT
    set +x
done 

MIOpenDriver script 2 for indexable operations

#!/bin/bash

PRECISION=reducefp16    ## reducefp64 reduce

if test -n $PRECISION && test "$PRECISION" = "reducefp16"; then
   CTYPE="-C 0"
else
   CTYPE="-C 1"
fi

if [ $# -ge 1 ] ; then
    NREPEAT=$1
else
    NREPEAT=1
fi

for op in 2 3 4; do
    for use_idx in 0 1; do
        set -x
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 1 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 2 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 3 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,1 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 1,2 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 2,3 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,2 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 1,3 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,3 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,1,2 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,1,3 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 1,2,3 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,2,3 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        bin/MIOpenDriver $PRECISION -D 64,3,280,81 -R 0,1,2,3 -O $op -I $use_idx $CTYPE -t 1 -i $NREPEAT
        set +x
    done
done

Chao Liu added 30 commits July 30, 2021 17:31
git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6119ebbb237dfa6270797b34f960d7b190
5781adf5c Update develop (#5) (#6)
97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e5 refactor
49c33aaea refactor
54b3e73d1 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5cf4ac753e2e36da7385791775b744bf7
@qianfengz
Copy link
Contributor Author

Commit 21cbb is a big commit, it does the following

  1. Re-order the input tensor into sequences of [invariant dimensions, toReduce dimensions] and pass re-ordered lengths/strides to the kernel. This can reduce the number of kernels to be considered
  2. Split each second_call kernel wrapper file into two files of reduce_all_dims and reduce_partial_dims respectively
  3. Remove unused tensor transformations in all reduce_all kernel wrappers.

@qianfengz
Copy link
Contributor Author

@qianfengz And last but not least -- shall we also add tests to cover the new functionality (i.e. improved precision)?

Some situations could not be easily covered by CTEST since for test_reduce_test, our implementation assumes the srcDataType and dstDataType are the same, and also test_reduce_test --all can only cover those cases that srcDataType is the same as compType. To test those cases where srcDataType is half, compDataType is float , we have to use test_reduce_test or MIOpenDriver reducefp16 manually with specific options.

Some Correction to the above comments:

  1. For both MIOpenDriver reduce<xxx> and test_reduce_test, the srcDataType and dstDataType are assumed to be same
  2. For test_reduce_test --all --half, MIN/MAX/AMAX still uses half as compType, while ADD/AVG/MUL/NORM1/NORM2 use float as compType, the reason is that ADD/AVG/MUL/NORM1/NORM2 could lead to overflow during the accumulating

@qianfengz
Copy link
Contributor Author

qianfengz commented Sep 20, 2021

Some situations could not be easily covered by CTEST ... To test those cases where srcDataType is half, compDataType is float, we have to use test_reduce_test or MIOpenDriver reducefp16 manually with specific options.

Let's add a custom test to test/CMakeLists.txt. If it is difficult for you to create it, then just give me a list of configs for reduce_test (not for the driver), and I will do it for you, right in this thread. Thanks!

@qianfengz And last but not least -- shall we also add tests to cover the new functionality (i.e. improved precision)?

No need. Since

  1. Issued fixed by Commit ec192 only occurred with codes implementing indiced operation (op = MIN/MAX/AMAX and --I 1) and when srcDataType/dstDataType is half, compType is float. For example bin/MIOpenDriver reducefp16 -D 64,3,280,81 -R 0 -O 2 -I 1 -C 1 -t 1 can arouse the issue, but we have no reason to use -C 1 when the op is MIN/MAX/AMAX for half
  2. Issue fixed by commit b361dc can only be reproduced if either the host or the kernel has the issue, but not both of them.
  3. Issued fixed by commit 3e79f can only occur if dstDataType has higher precision than the compType. But with MIOpenDriver and test_reduce_all, we all assume dstDataType and srcDataType are same, and compType should have higher or equal precision than srcDataType. And so far, we regard dstDataType different than srcDataType as unusual situation.

@atamazov
Copy link
Contributor

@qianfengz Okay.

There will be many commits in next several days

Is PR ready for review/testing/merge?

@atamazov atamazov dismissed their stale review September 20, 2021 22:35

new tests not required

@qianfengz
Copy link
Contributor Author

@qianfengz Okay.

There will be many commits in next several days

Is PR ready for review/testing/merge?

Yes, please

atamazov
atamazov previously approved these changes Sep 22, 2021
Copy link
Contributor

@atamazov atamazov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some code-quality related nitpicks.

const void* p_src2dDesc = ws_global;
const void* p_dst1dDesc = static_cast<char*>(ws_global) + 2048;
const void* p_src2dDesc = cast_pointer_to_generic_address_space(ws_global);
const void* p_dst1dDesc = static_cast<const char*>(p_src2dDesc) + 2048;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] Why 2048?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an implicit convention between two kernels to use one device page to store the two descriptors.

Copy link
Contributor

@atamazov atamazov Sep 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please define a symbol (a macro would be fine) and use that symbol instead of literal value. Please also put a comment which explains the convention nearby the definition of a symbol.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping. I'm worrying about the clarity of the code. Or this is documented elsewhere already? --

This is just an implicit convention between two kernels to use one device page to store the two descriptors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please define a symbol (a macro would be fine) and use that symbol instead of literal value. Please also put a comment which explains the convention nearby the definition of a symbol.

I will add some comments in next P.R

make_naive_tensor_descriptor(ref_tupleDstLengths, ref_tupleDstLengths);

static constexpr index_t ref_invariantLen = ref_dstDesc.GetLength(Number<0>{});
static constexpr index_t ref_toReduceLen = 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question] Why 8?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any positive integer constant should be ok here. The compiler-time codes here is to get the descriptor object's type, so that we can copy the descriptor object generated by the "preparing kernel". According to @asroy , the value of the constant here does not affect the descriptor object's type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires some explanation. Can you please use the same approach that I requested here, thanks.

src/reducetensor.cpp Outdated Show resolved Hide resolved
src/reducetensor.cpp Outdated Show resolved Hide resolved
@atamazov
Copy link
Contributor

I recommend improving clarity of the code. For example, adding explanations for all the non-trivial literals in a program.

@qianfengz It's up to you. I will approve this now.

@qianfengz
Copy link
Contributor Author

I recommend improving clarity of the code. For example, adding explanations for all the non-trivial literals in a program.

@qianfengz It's up to you. I will approve this now.

Thanks. I will add some comments in next P.R

@junliume
Copy link
Collaborator

I recommend improving clarity of the code. For example, adding explanations for all the non-trivial literals in a program.
@qianfengz It's up to you. I will approve this now.

Thanks. I will add some comments in next P.R

Let's wrap up this PR for now :) Priority has shifted and we can revisit reduction only after the urgent task.

@junliume junliume changed the title Several bugfixes and refactoring of dynamic generic reduction [Enhancements] Several bugfixes and refactoring of dynamic generic reduction Sep 29, 2021
@junliume junliume merged commit f21cdc1 into develop Sep 29, 2021
@qianfengz qianfengz deleted the reduction_fix_generic branch October 13, 2021 10:00
ltqin pushed a commit that referenced this pull request Oct 28, 2021
646fcc268 Merge pull request #47 from ROCmSoftwarePlatform/develop
6014185ac [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44)
3e9113707 Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all
211dae822 Merge branch 'develop' into miopen_downstream_all
5890e3007 [Composable Kernel] update develop branch code to ck_upstream
d5297abae fix bug in gridwise gemm xdlops v2r3 (#45)
38a90b6ed Merge pull request #43 from ROCmSoftwarePlatform/develop
c3018794b bug fix (#39)
fd49ff808 add nchw atomic , nhwc and nhwc atomic method   for backward weight (#30)
b2dc55f82 [MIOpen Downstream] Fix Reduction Kernel (#34)
b3e8d57d5 Tweak GEMM kernel (#38)
846f462bd Add VectorType support into StaticBuffer (#27)
dfb80c4e3 [Enhancements] Several bugfixes and refactoring of dynamic generic reduction  (#1156)
8557901d0 Merge pull request #1165 from ROCmSoftwarePlatform/develop
f305bebdc Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr
b725e3fc8 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr
88833bd9a Merge pull request #32 from ROCmSoftwarePlatform/develop
df0d68106 :Merge remote-tracking branch 'origin/develop' into CK_upstream
f3acd2510 Add  a version of Merge transform that use integerdivision and mod (#25)
19613902b GEMM driver and kernel (#29)
627d8ef35 Backward weight v4r4r2 with xdlops (#18)
10bb81106 Misc fixes (#24)
9e80cdceb [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction  (#1108)
a7a758d8c GlobalAtomicAdd for fp32/int32 (#23)
9d3f634a3 Xdlops refactor fix (#22)
c6f26bb48 magic division use __umulhi() (#19)
6fe3627a9 Composable kernel init integration v3 (#1097)
a2ad6d353 refactor dynamic xdlops iGemm (#13)
ba6f79a75 Added host_conv_wrw for verification (#15)

git-subtree-dir: src/composable_kernel
git-subtree-split: 646fcc268ede841a16cdaafb68aa64803d8390e1
junliume pushed a commit that referenced this pull request Jan 18, 2022
…duction (#1156)

* Squashed 'src/composable_kernel/' content from commit f6edda611

git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6119ebbb237dfa6270797b34f960d7b190

* add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

* Squashed 'src/composable_kernel/' changes from f6edda611..5781adf5c

5781adf5c Update develop (#5) (#6)
97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e5 refactor
49c33aaea refactor
54b3e73d1 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5cf4ac753e2e36da7385791775b744bf7

* fix

* refactor

* remove online compilation from CK

* refactor

* fix

* add ctest

* tidy

* add tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* tidy

* add c-style pointer cast

* vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast

* fix clang warning suppression

* tidy

* suppress cppcheck

* fix enum issue

* revert chagnes to hip build

* fix kernel filename

* update CK build script

* rename

* rename

* make innner product compatiable on gfx900

* Update src/include/miopen/solver/ck_utility_common.hpp

Co-authored-by: JD <Jehandad.Khan@amd.com>

* compiler parameter use stream

* use int instead of index_t in kernel wrapper

* DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element

* refactor

* refactor

* change cmakelist

* change ck common utility

* fix

* Squashed 'src/composable_kernel/' changes from 5781adf5c..31b403526

31b403526 Merge pull request #16 from ROCmSoftwarePlatform/develop
b62bf8c3f Merge pull request #14 from ROCmSoftwarePlatform/miopen_downstream_init_integration
ccc4a1d36 Merge pull request #8 from ROCmSoftwarePlatform/miopen_downstream_init_integration
67ad47e7c refactor
16effa767 refactor
a91b68dfc DynamicBuffer, StaticBuffer, amd_buffer_load support customized value for invalid element
2cbabbba5 use int instead of index_t in kernel wrapper
0834bc763 compiler parameter use stream
f2ac7832c make innner product compatiable on gfx900
4e57b30a6 rename
c03045ce2 rename
b2589957f update CK build script
2c48039d0 fix kernel filename
d626dccc9 fix enum issue
643ebd4f3 tidy
ddd49ec9e fix clang warning suppression
4f566c622 vector/scalar pointer cast use c-style pointer cast instead of reinterpret_cast
172036d72 add c-style pointer cast
76f313193 tidy
d18428901 tidy
f885c131d tidy
80120f0a0 tidy
c3efeb5e2 tidy
56fc0842b tidy
54fba515b tidy
e62bae7a4 tidy
24c872894 add tidy
61487e0a0 fix
ae98b52ad remove online compilation from CK
cb9542131 refactor
73ca97015 Merge commit '437cc595c6e206dfebb118985b5171bbc1e29eab' into composable_kernel_init_integration_v3
3b8664611 Merge pull request #7 from ROCmSoftwarePlatform/master
d09ea4f4e Update develop (#5)
3d32ae940 add solver ConvIgemmFwdV6r1DlopsNchwKcyxNkhw; rename static ck source files

git-subtree-dir: src/composable_kernel
git-subtree-split: 31b403526ec54abf13c4bb58dfb6635b4d2aa619

* Tiny fix in using data type template parameters in blockwise and direct_threadwise kernel

* Fix with regard to implementing GetZeroVal() in both kernel and host

* Avoid convert to compType from dstDataType before writting the output value

* Add half_t support to NumericLimits and make constexpr GetZeroVal() of binary operator

* Add CONSTANT decorator for descriptor read buffer

* Use get_thread_local_1d_id() for thread local Id

* Rename GetZeroVal() to GetReductionZeroVal() in the kernels

* Remove constexpr from initialized zeroVal and tiny fix in reduction_operator.hpp

* Occasional tiny simplification and update in the kernel files

* Update in src/reducetensor.cpp for consistent IDs passing to the kernel

* Update to re-order tensor dimensions on the host, split second_call kernel wrapper files and simplify reduce_all kernel wrappers

* Update to remove OpenCL tidy checking failures

* Small updates in src/reducetensor.cpp

* Update for better readability

* Remove unused codes and not-needed template parameters in the kernel wrappers

Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: JD <Jehandad.Khan@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants