Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the dynamic reduction GPU Memory Acccess failure with double on ROCM 4.3 #1131

Merged
merged 5 commits into from
Sep 7, 2021

Conversation

qianfengz
Copy link
Contributor

@qianfengz qianfengz commented Aug 31, 2021

Resolves #1123 (pls. see explanation here).

The issue has been reproduced with ROCM 4.3 on both MI100 and MI25.

This fix has passed the testing with ROCM 4.2/ROCM 4.3/ROCM 4.3.1 on either MI100 or MI25.

@codecov

This comment has been minimized.

asroy
asroy previously requested changes Aug 31, 2021
src/reducetensor.cpp Outdated Show resolved Hide resolved
src/reducetensor.cpp Outdated Show resolved Hide resolved
@qianfengz
Copy link
Contributor Author

There is some problem with the CI

@shaojiewang
Copy link
Contributor

There is some problem with the CI

Maybe you could restart the test.

@junliume junliume requested a review from asroy September 6, 2021 01:42
@qianfengz
Copy link
Contributor Author

Luckily, the CI just passed

@atamazov
Copy link
Contributor

atamazov commented Sep 6, 2021

After ~15 retries... awesome :/

Copy link
Contributor

@atamazov atamazov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

So it was a problem in MIOpen, not in ROCm 4.3?

@atamazov
Copy link
Contributor

atamazov commented Sep 6, 2021

@qianfengz No answer required, I see the explanation at #1123 (comment)

@junliume junliume dismissed asroy’s stale review September 7, 2021 03:22

Review outdated and PR changed accordingly

@junliume junliume merged commit d70ece3 into develop Sep 7, 2021
@qianfengz qianfengz deleted the dynamic_reduction_fix branch September 7, 2021 09:54
junliume pushed a commit that referenced this pull request Sep 30, 2021
…e on ROCM 4.3 (#1131)

* Fix the calculation of ws_buf2_bytes_offset for dynamic reduction in src/reducetensor.cpp

* Just remove IsDynamicReductionEnabled()

* Tiny fix in ReduceTensorDescriptor::GetWorkspaceSize()

* Update to the calculation of ws_buf2_bytes_offset
junliume added a commit that referenced this pull request Oct 6, 2021
….5-staging-verified

[REL][Tuning][Cherry-Pick] MIOpen Release Cherry Picks and DB Updates for ROCm 4.5 (#1127)((#1131)(#1142)(#1150)(#1166)(#1195)(#1207)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dynamic reduction GPU access fault
5 participants