Fix the dynamic reduction GPU Memory Access failure with double on ROCM 4.3 #1131
…src/reducetensor.cpp
There is some problem with the CI.
Maybe you could restart the test.
Luckily, the CI just passed.
After ~15 retries... awesome :/
LGTM!
So it was a problem in MIOpen, not in ROCm 4.3?
@qianfengz No answer required, I see the explanation at #1123 (comment).
Review outdated and PR changed accordingly
…e on ROCM 4.3 (#1131)
* Fix the calculation of ws_buf2_bytes_offset for dynamic reduction in src/reducetensor.cpp
* Just remove IsDynamicReductionEnabled()
* Tiny fix in ReduceTensorDescriptor::GetWorkspaceSize()
* Update to the calculation of ws_buf2_bytes_offset
Resolves #1123 (please see the explanation in that issue).
The issue has been reproduced with ROCM 4.3 on both MI100 and MI25.
This fix has passed testing with ROCM 4.2, ROCM 4.3, and ROCM 4.3.1 on either MI100 or MI25.