[TF] Eigen unit tests on GPU failed #46333

Open
smuzaffar opened this issue Oct 10, 2024 · 5 comments

Comments

@smuzaffar
Contributor

Hi,

For the TensorFlow special IBs (TF_X), where we have TF 2.17 (CUDA build enabled) and a new Eigen version (https://github.com/cms-externals/eigen-git-mirror/tree/cms/master/c1d637433e3b3f9012b226c2c9125c494b470ae6), a few unit tests that use Eigen are failing [a]. To reproduce, one can run:

> ssh lxplus-gpu
> cd /tmp/$(whoami)
> cmssw-el8 --nv
> scram p CMSSW_14_2_TF_X_2024-10-08-1100
> cd CMSSW_14_2_TF_X_2024-10-08-1100
> cmsenv
> git cms-addpkg RecoTracker/PixelTrackFitting
> scram b -j 8
> scram b runtests_testEigenGPUNoFit_t

Note that we do apply the cms-externals/eigen-git-mirror@3cbe8e7 patch on top of Eigen, so maybe we are missing something else that needs patching?

@fwyzard, do you have any idea how to fix this?

[a]

Pass    0s ... RecoTracker/PixelTrackFitting/testFits
Pass    0s ... RecoTracker/PixelTrackFitting/testFitsDump
Pass    0s ... RecoTracker/PixelTrackFitting/testEigenJacobian
Pass    0s ... RecoTracker/PixelTrackFitting/testRecoPixelVertexingPixelTrackFittingRZLine
Fail    3s ... RecoTracker/PixelTrackFitting/testFitsGPU_t
Fail    3s ... RecoTracker/PixelTrackFitting/testBrokenLineFitGPU_t
Fail    3s ... RecoTracker/PixelTrackFitting/testEigenGPUNoFit_t
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackFits
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackFits_Debug
Pass  158s ... RecoTracker/PixelTrackFitting/PixelTrackBrokenLineFit
> cat unit_tests/testEigenGPUNoFit_t.log
===== Test "testEigenGPUNoFit_t" ====
TEST EIGENVALUES
TEST INVERSE 3x3
TEST INVERSE 4x4
TEST INVERSE 5x5
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02858/el8_amd64_gcc12/external/eigen/c1d637433e3b3f9012b226c2c9125c494b470ae6-42b72b714d1a11d439b86af5ed2418e1/include/eigen3/Eigen/src/Core/PermutationMatrix.h:184: Derived &Eigen::PermutationBase<Derived>::applyTranspositionOnTheRight(long, long) [with Derived = Eigen::PermutationMatrix<5, 5, int>]: block: [0,0,0], thread: [0,0,0] Assertion `i >= 0 && j >= 0 && i < size() && j < size()` failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  
src/RecoTracker/PixelTrackFitting/test/testEigenGPUNoFit.cu, line 173:
cudaCheck(cudaMemcpy(mCPUret, mGPUret, sizeof(Matrix5d), cudaMemcpyDeviceToHost));
cudaErrorAssert: device-side assert triggered

/bin/sh: line 1: 3864396 Aborted                 (core dumped) sh -c 'testEigenGPUNoFit_t '

---> test testEigenGPUNoFit_t had ERRORS
TestTime:3
^^^^ End Test testEigenGPUNoFit_t ^^^^
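For context on the assertion in the log: for fixed sizes above 4x4, Eigen's `inverse()` has no closed-form specialization and goes through `PartialPivLU`, which records the row swaps chosen by partial pivoting and then builds a `PermutationMatrix` from them by calling `applyTranspositionOnTheRight(k, transposition[k])`. Assuming the 5x5 test calls `inverse()` on its `Matrix5d`, the failure therefore means that, on the device, at least one stored pivot index fell outside `[0, 5)`. The host-only sketch below does not use Eigen; it mimics that path for a 5x5 matrix to make the violated precondition concrete. `partialPivotLU`, `toPermutation` and the type aliases are hypothetical names for illustration, not Eigen API, and the sketch does not explain why the new Eigen version produces a bad index on the GPU.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>

using Index = std::ptrdiff_t;  // Eigen::Index is also a signed ptrdiff_t
constexpr Index N = 5;         // matches the failing PermutationMatrix<5, 5, int>

using Mat = std::array<std::array<double, N>, N>;
using Transpositions = std::array<Index, N>;  // tr[k]: row swapped into row k
using Permutation = std::array<int, N>;

// Hypothetical stand-in for Eigen::PermutationBase::applyTranspositionOnTheRight:
// same precondition as the failing eigen_assert, but throws instead of aborting.
void applyTranspositionOnTheRight(Permutation& p, Index i, Index j) {
  if (!(i >= 0 && j >= 0 && i < N && j < N))
    throw std::out_of_range("i >= 0 && j >= 0 && i < size() && j < size()");
  std::swap(p[i], p[j]);
}

// Hypothetical helper: Gaussian elimination with partial pivoting. For each
// column k it picks the row r >= k with the largest |a(r, k)| and records it.
Transpositions partialPivotLU(Mat a) {
  Transpositions tr{};
  for (Index k = 0; k < N; ++k) {
    Index pivot = k;
    for (Index r = k + 1; r < N; ++r)
      if (std::fabs(a[r][k]) > std::fabs(a[pivot][k])) pivot = r;
    assert(pivot >= k && pivot < N);  // a correct pivot search never leaves [k, N)
    tr[k] = pivot;
    std::swap(a[k], a[pivot]);
    if (a[k][k] == 0.0) continue;  // pivot is zero, so the column below is zero too
    for (Index r = k + 1; r < N; ++r) {
      const double f = a[r][k] / a[k][k];
      for (Index c = k; c < N; ++c) a[r][c] -= f * a[k][c];
    }
  }
  return tr;
}

// Builds the row permutation by applying the recorded transpositions one at a
// time, from the last one down to the first.
Permutation toPermutation(const Transpositions& tr) {
  Permutation p{};
  for (Index k = 0; k < N; ++k) p[k] = static_cast<int>(k);
  for (Index k = N - 1; k >= 0; --k) applyTranspositionOnTheRight(p, k, tr[k]);
  return p;
}
```

Feeding `toPermutation` a `Transpositions` array with any entry outside `[0, 5)` (for example `tr[2] = 5`) trips the same precondition on the host that the device reports, while the output of a correct pivot search always passes it.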
@smuzaffar
Contributor Author

assign RecoTracker/PixelTrackFitting

@cmsbuild
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @smuzaffar.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Oct 10, 2024

FYI I will not be able to look into this (or other issues) until the end of November.
