Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CK] UB when invoker of ConvHipImplicitGemmFwdXdlops is called #1971

Closed
atamazov opened this issue Feb 7, 2023 · 6 comments
Closed

[CK] UB when invoker of ConvHipImplicitGemmFwdXdlops is called #1971

atamazov opened this issue Feb 7, 2023 · 6 comments

Comments

@atamazov
Copy link
Contributor

atamazov commented Feb 7, 2023

The issue is originated from #1906 (comment)

UndefinedBehaviorSanitizer:DEADLYSIGNAL
==120887==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000001 (pc 0x7f26b2fe0def bp 0x7f26e8def960 sp 0x7fff058fe750 T120887)
==120887==The signal is caused by a READ memory access.
==120887==Hint: address points to the zero page.

How to reproduce

  1. Configure the debug build
CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS=-Werror \
cmake \
-DMIOPEN_TEST_FLAGS=--disable-verification-cache \
-DCMAKE_BUILD_TYPE=debug \
-DCMAKE_CXX_FLAGS_DEBUG="-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored" \
-DBUILD_DEV=On \
-DMIOPEN_GPU_SYNC=Off \
..
  1. Build and run the test
make -j $((`nproc`-4)) test_conv2d && \
MIOPEN_FIND_MODE=normal \
MIOPEN_DEBUG_FIND_ONLY_SOLVER=ConvHipImplicitGemmFwdXdlops \
./bin/test_conv2d \
--float \
--verbose \
--disable-backward-data \
--disable-backward-weights \
--input 128 64 56 56 --weights 64 64 1 1 --pads_strides_dilations 0 0 1 1 1 1 \
--in_layout NHWC --fil_layout NHWC --out_layout NHWC \
--disable-verification-cache

Root reason

The member function ConvHipImplicitGemmFwdXdlops::RunCKSolution() is called from within the Invoker. There is no alive instance of the ConvHipImplicitGemmFwdXdlops class when Invoker is used.

ConvHipImplicitGemmBwdXdlops has the same defect.

How to fix

  • Instantiate the class before use, e.g. ConvHipImplicitGemmFwdXdlops{}.RunCKSolution<int8_t>(...)
    • This is sub-optimal from the performance POV.
  • Convert member function to a static member function.
    • The Solvers are stateless by design so all member functions do not really require access to the class state (data members).
  • Convert member function to a "normal" function.
    • The simplest and thus preferable solution.

[Attribution] @junliume @johnny-keker https://github.com/ROCmSoftwarePlatform/MIOpen/labels/bug https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_blocker

I will provide a fix very soon.

@JehandadKhan @iq136boy @averinevg FYI

@atamazov
Copy link
Contributor Author

atamazov commented Feb 8, 2023

Good news: I've looked at all the other solvers and didn't find similar problems.

@iq136boy
Copy link
Contributor

iq136boy commented Feb 8, 2023

@atamazov Thanks for finding the defect in the ck solvers and provide a quick fix and solution on this. I will make the fix in all my current ck integration tasks. Thanks.

@iq136boy
Copy link
Contributor

iq136boy commented Feb 8, 2023

@atamazov One more concern, @JehandadKhan and I were doing the test locally to test the correctness of integrated ck solver. It passes the test and get the same result as the naïve convolution solver. If the ConvHipImplicitGemmFwdXdlops class does not has an instance, why the test case passes locally? Please see the following test case and result:

MIOPEN_FIND_MODE=1 MIOPEN_DEBUG_FIND_ONLY_SOLVER=ConvHipImplicitGemmFwdXdlops ./bin/test_conv2d --float --disable-backward-data --disable-backward-weights --verbose --input 256 128 28 28 --weights 128 128 3 3 --in_layout NHWC --fil_layout NHWC --out_layout NHWC --pads_strides_dilations 1 1 1 1 1 1 --disable-verification-cache

launch_and_time_kernel: grid_dim {784, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
123456789
calling here
error: 0
Max diff: 0

Forward convolution: ConvHipImplicitGemmFwdXdlops
Input tensor: 256, 128, 28, 28
Weights tensor: 128, 128, 3, 3
Output tensor: 256, 128, 28, 28
Filter: conv2d, miopenConvolution, miopenPaddingDefault, {1, 1}, {1, 1}, {1, 1},

BTW, I also get the test passed when using the test case you provide above. Did I miss sth important here? Thanks.

@atamazov
Copy link
Contributor Author

atamazov commented Feb 8, 2023

@iq136boy

If the ConvHipImplicitGemmFwdXdlops class does not has an instance, why the test case passes locally?

RunCKSolution does not use internal state of the solver class. Calling it from Invoker leads to Undefined behavior, which does NOT mean that the program is guaranteed to fail, especially if it has been built without debugging features (i.e. it is allowed to work properly under certain conditions). https://en.wikipedia.org/wiki/Undefined_behavior.

However I would expect a failure when built with sanitizer stuff enabled.

BTW, I also get the test passed when using the test case you provide above. Did I miss sth important here? Thanks.

Maybe you've used the release build. Reproduce instructions were missing the rebuild step (fixed now). Please clean up the build directory and retry with updated instructions.

@iq136boy
Copy link
Contributor

iq136boy commented Feb 9, 2023

@atamazov That make sense. Thanks for the explanation. I will rebuild the test and do the same fixes on my current too.

@atamazov
Copy link
Contributor Author

By #1972

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants