Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MAF-13167] Implement outer  #39

Closed
wants to merge 7 commits into from
Closed

Conversation

Yunyoung-Choi-96
Copy link

@Yunyoung-Choi-96 Yunyoung-Choi-96 commented Jul 11, 2024

Added an outer operation kernel with a solver.
Added driver tests and gtests for the outer operation.
Compared to ROCm PyTorch, there is a performance improvement in some cases.

[benchmark result compared to ROCm pytorch](https://morehio-my.sharepoint.com/:x:/r/personal/kyeonghwan_ryu_moreh_io/_layouts/15/Doc.aspx?sourcedoc=%7B6BF2D5F6-D6F2-47D7-94B5-DE831CA02884%7D&file=ROCm%20pytorch%20vs%20moDNN%20pytorch%20%20(outer).xlsx&action=default&mobileredirect=true&wdsle=0)

Geomean of all cases

type direction geomean
float16 fwd 1.81
float16 bwd 0.09
float32 fwd 1.29
float32 bwd 0.09
bfloat16 fwd 1.36
bfloat16 bwd 0.09

@kyeonghwanryu kyeonghwanryu self-requested a review July 11, 2024 10:49
@kyeonghwanryu
Copy link

Prior to the code review,

  1. As I see the benchmark result, modnn's backward outer is very slow but forward ones are competitive to rocm. But we need to get rid of cases that it's losing to rocm as hard as we can, so please think of it and add a condition to the solver's IsApplicable.
  2. Please provide the general improvement on performance by showing geomean improvement by each dtypes. Yes I know you shared to benchmark result but we need a simple and final results in the PR description. And you should only count the cases that passes IsApplicable. (which is all currently but you need to narrow it down) If possible, specify the configs used to calculate the geomean and each one's perf results. Check Add GroupNorm forward operation ROCm/MIOpen#2623 for example.
  3. This PR's purpose was solely on your adaptation to the job so I'll review the full code, but as I mentioned the backward is too slow to be ported, so I'll ask you to remove it at some point after I finish the review.

driver/driver.hpp Outdated Show resolved Hide resolved
driver/main.cpp Outdated Show resolved Hide resolved
@kyeonghwanryu
Copy link

Check all the texts in the codes again. I'll resume the review after that.

@kyeonghwanryu
Copy link

kyeonghwanryu commented Jul 11, 2024

And it is not about this operation or PR, but is there a reason there's a single commit? Did you squash it? Any reason?

driver/driver.hpp Outdated Show resolved Hide resolved
@kyeonghwanryu
Copy link

And it is not about this operation or PR, but is there a reason there's a single commit? Did you squash it? Any reason?

@Yunyoung-Choi-96 ?

@Yunyoung-Choi-96
Copy link
Author

@kyeonghwanryu There was no strict reason. I just thought that too many commits would hamper readability. I will not use squash later.

@kyeonghwanryu
Copy link

@kyeonghwanryu There was no strict reason. I just thought that too many commits would hamper readability. I will not use squash later.

  1. It's squashed when merged so you don't have to.
  2. People cannot see your workflow if you squash. It hamper readability more if you do this.

So please don't.

@kyeonghwanryu
Copy link

There is static code analysis you have to run but I think I forgot to deliver it so please do this if you didn't:

Static Code Analysis
MIOpen uses cppcheck & tidy and it is recommend to run before committing the code.cppcheck
The tool is installed as part of dependency installation of MIOpen
Option 1: Run 'make tidy' & 'make cppcheck' separately inside the "build" directory
Option 2: Run a 'make analyze' inside the "build" directory
Please refer to this post for more information

test/cpu_outer.hpp Outdated Show resolved Hide resolved
test/gtest/outer.hpp Outdated Show resolved Hide resolved
#include <../test/verify.hpp>

template <typename Tgpu, typename Tcheck>
int32_t mloSumForwardRunHost(miopenTensorDescriptor_t input1Desc,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int32_t mloSumForwardRunHost(miopenTensorDescriptor_t input1Desc,
int32_t mloOuterForwardRunHost(miopenTensorDescriptor_t input1Desc,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced "mloSumForwardRunHost" with "mloOuterForwardRunHost".

@kyeonghwanryu
Copy link

kyeonghwanryu commented Jul 16, 2024

Good work. 고생하셨습니다.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants