Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

boost cmake version for Tensile 4.2 #961

Merged
merged 6 commits into from
Jun 21, 2021
Merged

boost cmake version for Tensile 4.2 #961

merged 6 commits into from
Jun 21, 2021

Conversation

ce1adon
Copy link
Contributor

@ce1adon ce1adon commented Jun 1, 2021

To issue #960

2021/6/11 blocker #980

@ce1adon ce1adon added this to the ROCm 4.4 milestone Jun 1, 2021
@codecov

This comment has been minimized.

Copy link
Collaborator

@junliume junliume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerryyin could you comment on the selection of cmake version for libMLIRMIOpen?

Dockerfile Outdated
@@ -80,7 +80,7 @@ RUN cget -p $PREFIX init --cxx /opt/rocm/llvm/bin/clang++ --std=c++14 -DAMDGPU_T
# Install dependencies
RUN cget -p $PREFIX install pfultz2/rocm-recipes
# Install a newer version of cmake for libMLIRMIOpen
RUN cget -p $PREFIX install kitware/cmake@v3.13.4
RUN if [ "$USE_TARGETID" = "ON" ] ; then cget -p $PREFIX install kitware/cmake@v3.15.1; else cget -p $PREFIX install kitware/cmake@v3.13.4; fi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ce1adon It's more a question to discuss how do we determine the versions for Common Development Environment (CDE) and Common Runtime Environment (CRE)?
https://pkgs.org/download/cmake and https://cliutils.gitlab.io/modern-cmake/chapters/intro/installing.html as references for cmake versions:
Ubuntu 18.04 LTS has 3.10.2
Ubuntu 20.04 LTS has 3.16.3
RHEL/CentOS well being RHEL/CentOS 2.8.12.2 (7)

The two versions mentioned in this change, however:
3.13.4 seems to be with Ubuntu 19.04, which EOL on Jan 23rd 2021 or Debian 10
Why 3.15.1? Is it selected based on a specific feature, or in other words, why not 3.16.3 or even 3.20.3 directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.15.1 is the minimum version for the latest tensile to build.
3.20.x is not working, too new for latest tensile build.
3.16.x may work but there may be concern about its compatibility with ubuntu 18.04.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. No objections to this PR, just curious how we determine these dependencies and their long term effects on support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully @pfultz2 can give some comment on this question.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junliume Sorry I was busy with the TF/MLIR triaging and didn't got a chance to look at this. I'm a bit concerned at this comment:

 3.20.x is not working, too new for latest tensile build.

The cmake required to build the latest LLVM usually bump to the latest version. I can forsee this causing conflict between libMLIRMIOpen and Tensile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a possibility that we have to bump it to use 3.15.1 at all scenarios for next MLIR PRs, we'll see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I do not see anything that enforces us to stay with 3.13.4. Let's switch to 3.15 globally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#978 (that bumps CMake to 3.15.1 unconditionally) is merged in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

solved

Copy link
Collaborator

@junliume junliume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the rest of the stack are all using cmake 3.15.1 then it is fine with us. Consistency and stability are my only concerns.

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 3, 2021

If the rest of the stack are all using cmake 3.15.1 then it is fine with us. Consistency and stability are my only concerns.

@pfultz2 Hey Paul, I created issue #960 and PR #961 for switching cmake version. Do you have any idea about the impact on MIOpen?

@pfultz2
Copy link
Contributor

pfultz2 commented Jun 3, 2021

Ideally, all of our components should use the cmake version in 18.04(with the exception of LLVM), and most of our components follow that. It is concerning that tensile requires such a new version. The entire rocm stack needs to decide the minimum version of cmake we will use and then there should be a process in place which includes all component owners to consider a newer version.

For now, we have no choice but to upgrade the cmake version in order to build tensile, but there should be a larger discussion about what minimum version of cmake we plan to support in rocm.

Copy link
Collaborator

@junliume junliume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from MLIR, blocking this PR for now

@jerryyin jerryyin self-requested a review June 4, 2021 16:46
@jerryyin
Copy link
Member

jerryyin commented Jun 4, 2021

@junliume I think this PR is good to go with the result here. @krzysz00 confirmed that 3.15.1 successfully builds/installs libMLIRMIOpen.a so we are good.

On the other hand, long term wise, we will plan on exporting artifact from MLIR CI so MIOpen doesn't have to rebuild every time.

jerryyin
jerryyin previously approved these changes Jun 4, 2021
junliume
junliume previously approved these changes Jun 4, 2021
Copy link
Collaborator

@junliume junliume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unblock it for comments from MLIR.
3.15.1 is still a little bit awkward, while again as long as overall consistency is kept.

P.S. this page is outdated: https://github.com/ROCmSoftwarePlatform/Tensile/wiki/Dependencies

@krzysz00
Copy link
Contributor

krzysz00 commented Jun 4, 2021

Y'all could go for any 3.15.x

We're good through 3.20

@atamazov
Copy link
Contributor

atamazov commented Jun 7, 2021

CI errors.

@junliume
Copy link
Collaborator

junliume commented Jun 7, 2021

CI errors.

Strange that boost cmake version can cause Memory access fault

Memory access fault by GPU node-2 (Agent handle: 0x278a6d0) on address 0x7f6d32400000. Reason: Page not present or supervisor privilege.

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 7, 2021

CI errors.

@atamazov For some reason the CI is no longer print out the configures that failed in conv2d and immed_con2d, which makes the tracking of the issue very difficult. Is it possible to enable printing out of the driver cmd in CI?

@atamazov
Copy link
Contributor

atamazov commented Jun 7, 2021

@ce1adon

@atamazov For some reason the CI is no longer print out the configures that failed in conv2d and immed_con2d, which makes the tracking of the issue very difficult. Is it possible to enable printing out of the driver cmd in CI?

WRT "Fp16 Hip Tensile All gfx908" that failed at run #3. I think that adding the following:

-DMIOPEN_TEST_FLAGS='--verbose --disable-verification-cache'

to this line: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/01ff3da68b68b2ffbc31cb681184a8082ac7ce26/Jenkinsfile#L843

should add necessary printing. The line have been added in #556.


🔴 Please note that we should use --disable-verification-cache in all conv* tests (I do not remember exact reasoning, but otherwise CI can run into false verification failures).

@atamazov
Copy link
Contributor

atamazov commented Jun 8, 2021

Please review.

@junliume
Copy link
Collaborator

@ce1adon @atamazov could you take a look at the CI failure issues?

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 10, 2021

@ce1adon @atamazov could you take a look at the CI failure issues?

I've updated the branch to develop with verbose flag. Hopefully it will give some clue.

@atamazov
Copy link
Contributor

test_immed_conv2d was created some time ago by copy-pasting from test_conv2d. At that time there was no valid support for logging. Later, I added acceptable logging to test_conv2d, but test_immed_conv2d still lacks it.

Therefore I recommend using MIOPEN_LOG_LEVEL=6 for triaging the issue. Is it reproducible locally?

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 10, 2021

Please keep both 7 and 8 build alive. They are testing different stages.

@atamazov
Copy link
Contributor

@ce1adon I suspect ther ecould be some problem with some other algorithm, Maybe it's worth disabling ALL algorithms except GEMM for MIOpenTensile "Full" stages? This way, the problems in other algorithms won't affect these stages.

I am looking into ways of implementing this.

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 10, 2021

@ce1adon I suspect ther ecould be some problem with some other algorithm, Maybe it's worth disabling ALL algorithms except GEMM for MIOpenTensile "Full" stages? This way, the problems in other algorithms won't affect these stages.

I am looking into ways of implementing this.

Good thoughts. Will try to prove this theory locally.

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 10, 2021

As @atamazov said, cmd "MIOPEN_DEBUG_HIP_KERNELS=0 MIOPEN_LOG_LEVEL=5 CTEST_PARALLEL_LEVEL=4 MIOPEN_CONV_PRECISE_ROCBLAS_TIMING=0 MIOPEN_DEBUG_CONV_FFT=0 MIOPEN_DEBUG_CONV_DIRECT=0 MIOPEN_DEBUG_CONV_WINOGRAD=0 MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 MIOPEN_DEBUG_CONV_SCGEMM=0 make -j check" actually passed the test so the issue is not in GEMM.

I ran the tests with log level at 6 and got the following error msg. I will share the full log through email.

MIOpen(HIP): Info [GetWrwSolutions] 
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 1-26-26-5x5-128-56-56-1-0x0-2x2-1x1-0-NCHW-FP16-W in file /dockerx/MIOpen/src/kernels/gfx90878.HIP.2_12_0.ufdb.txt
MIOpen(HIP): Info2 [FindRecordUnsafe] Key match: 1-26-26-5x5-128-56-56-1-0x0-2x2-1x1-0-NCHW-FP16-W
MIOpen(HIP): Info2 [FindRecordUnsafe] Contents found: miopenConvolutionBwdWeightsAlgoImplicitGEMM:ConvAsmImplicitGemmGTCDynamicWrwXdlops,0.058538,12800,miopenConvolutionBwdWeightsAlgoImplicitGEMM,<unused>;miopenConvolutionBwdWeightsAlgoGEMM:GemmWrwUniversal,0.315882,4326400,miopenConvolutionBwdWeightsAlgoGEMM,<unused>;miopenConvolutionBwdWeightsAlgoDirect:ConvDirectNaiveConvWrw,2.67181,0,miopenConvolutionBwdWeightsAlgoDirect,<unused>
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.485527 ms
MIOpen(HIP): Info2 [GetWrwSolutionWorkspaceSize] solver_id = ConvAsmImplicitGemmGTCDynamicWrwXdlops
MIOpen(HIP): Info [CompileWrwSolution] solver_id = ConvAsmImplicitGemmGTCDynamicWrwXdlops
MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 1x26x26x5x5x128x56x56x1xNCHWxFP16x0x0x2x2x1x1x1xW and solver ConvAsmImplicitGemmGTCDynamicWrwXdlops
MIOpen(HIP): Info [ConvolutionWrwImmediate] solver_id = ConvAsmImplicitGemmGTCDynamicWrwXdlops, workspace = 12800
MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 1x26x26x5x5x128x56x56x1xNCHWxFP16x0x0x2x2x1x1x1xW and solver ConvAsmImplicitGemmGTCDynamicWrwXdlops
Memory access fault by GPU node-2 (Agent handle: 0x14faf10) on address 0x7f6436e05000. Reason: Page not present or supervisor privilege.
CMake Error at test_test_immed_conv2d.cmake:7 (message):
  Test failed

@atamazov
Copy link
Contributor

@ce1adon I will provide you with the necessary patch (that disables all but GEMM for conv tests) soon.

@atamazov
Copy link
Contributor

@ce1adon Please let me know if you would like me to update this branch directly. I will give you a .diff file otherwise.

@atamazov
Copy link
Contributor

@ce1adon Here it is: test_conv_all_miotensile_only_gemm.diff.txt. Currently it affects test_conv2d/3d and test_immed_conv2d/3d only.

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 11, 2021

@ce1adon Here it is: test_conv_all_miotensile_only_gemm.diff.txt. Currently it affects test_conv2d/3d and test_immed_conv2d/3d only.

@atamazov This is not going to work since some configures requires workspace that exceeds the memory limitation in GEMM path. Let's pin relevant author to solve the root cause in ConvAsmImplicitGemmGTCDynamicWrwXdlops, unless there are other concerns.

@atamazov
Copy link
Contributor

@ce1adon

@atamazov This is not going to work since some configures requires workspace that exceeds the memory limitation in GEMM path.

Please use MIOPEN_USE_MIOPENTENSILE from config.h and skip configures that not applicable for GEMM. See how it is done under #if TEST_DIRECT_SUPPORTED_CONFIG_ONLY and under #if WORKAROUND_MI100_ROM37_HIP_COMPILER_CRASH in conv_common.hpp. I can do that for you, if you wish.

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 11, 2021

@ce1adon

@atamazov This is not going to work since some configures requires workspace that exceeds the memory limitation in GEMM path.

Please use MIOPEN_USE_MIOPENTENSILE from config.h and skip configures that not applicable for GEMM. See how it is done under #if TEST_DIRECT_SUPPORTED_CONFIG_ONLY and under #if WORKAROUND_MI100_ROM37_HIP_COMPILER_CRASH in conv_common.hpp. I can do that for you, if you wish.

Thank you Artem. These are all good thoughts. But I'm afraid I'm not planning to add workarounds to let this test pass. I would prefer to wait for solution to the blocker. Thanks for understanding!

@atamazov
Copy link
Contributor

No problem. But please note that "my" solution improves the quality of miopentensile testing. It makes sure that GEMM will be always fully validated. Without it, the "best" solver will be verified.

@ce1adon
Copy link
Contributor Author

ce1adon commented Jun 11, 2021

No problem. But please note that "my" solution improves the quality of miopentensile testing. It makes sure that GEMM will be always fully validated. Without it, the "best" solver will be verified.

Sure. This is in fact my original proposal to run miopentensile test one year ago. Since the test rule has been settled, let's follow it.

@atamazov
Copy link
Contributor

Merge conflicts.

@atamazov atamazov merged commit fca9fed into develop Jun 21, 2021
@atamazov atamazov deleted the cmake_ver branch June 23, 2021 13:17
atamazov pushed a commit that referenced this pull request Jul 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants