Assertion failed in CPU version of kernel_countMultiplicity #392

Open
makortel opened this issue Sep 23, 2019 · 6 comments
Labels
bug Pixels Pixels-related developments

Comments

makortel commented Sep 23, 2019

While running the CPU profiling workflow (customizePixelTracksSoAonCPUForProfiling()) on 11_0_0_pre7_Patatrack at NERSC, I got an assertion failure:

Begin processing the 3901st record. Run 321177, Event 188714878, LumiSection 142 on stream 13 at 20-Sep-2019 20:12:57.849 PDT
RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsImpl.h:320: void kernel_countMultiplicity(const HitContainer*, const Quality*, CAConstants::TupleMultiplicity*): Assertion `nhits < 8' failed.
wrong mult 347 -1412
...
Thread 76 (Thread 0x2aaebc280700 (LWP 4401)):
...
#4  <signal handler called>
#5  0x00002aaaad63f207 in raise () from /lib64/libc.so.6
#6  0x00002aaaad6408f8 in abort () from /lib64/libc.so.6
#7  0x00002aaaad638026 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00002aaaad6380d2 in __assert_fail () from /lib64/libc.so.6
#9  0x00002aab9a99280d in CAHitNtupletGeneratorKernels<cudaCompat::CPUTraits>::launchKernels(TrackingRecHit2DHeterogeneous<cudaCompat::CPUTraits> const&, TrackSoAT<32768>*, CUstream_st*) () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
#10 0x00002aab9a947413 in CAHitNtupletGeneratorOnGPU::makeTuples(TrackingRecHit2DHeterogeneous<cudaCompat::CPUTraits> const&, float) const () from .../cmssw/CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
#11 0x00002aab9a993b99 in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
...

when running on 64 streams/threads. The failure occurred only once across my tests of 4x{1, 16, 32}, 10x64, 4x{1, 20, 40}, and 10x80 streams/threads ("NxM" meaning "N runs of M streams/threads"), but I thought I should report it anyway.
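
A note on the second number in `wrong mult 347 -1412`: one possible reading (an assumption, not confirmed anywhere in this thread) is that the hit count is an unsigned size derived from an offsets array and printed with `%d`. The sketch below uses hypothetical names (`tupleSize`, `asPrintedWithPercentD`, `passesMultiplicityCheck`) and made-up offsets. It only shows how inconsistent offsets would wrap to a huge value that prints as negative and fails an `nhits < 8` check; it is not the actual Patatrack code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an offsets-based container: tuple i owns the
// hits in [off[i], off[i+1]). This layout is an assumption for illustration.
inline std::uint32_t tupleSize(const std::vector<std::uint32_t>& off, std::size_t i) {
  // Unsigned subtraction wraps modulo 2^32 when off[i+1] < off[i].
  return off[i + 1] - off[i];
}

// Value a reader would typically see when an unsigned 32-bit count is
// printed with %d on a two's-complement platform (guaranteed from C++20,
// implementation-defined but universal in practice before that).
inline std::int32_t asPrintedWithPercentD(std::uint32_t n) {
  return static_cast<std::int32_t>(n);
}

// Mirrors the condition quoted in the assertion message.
inline bool passesMultiplicityCheck(std::uint32_t nhits) { return nhits < 8; }
```

With consistent offsets such as `{0, 4, 7}` the sizes are 4 and 3 and the check passes. With a made-up inconsistent pair such as `{2000, 588}` the size wraps around, shows up as -1412 when reinterpreted as signed, and the check fails.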

@makortel
Author

FYI @VinInn

VinInn commented Sep 23, 2019

would be interesting to understand if it is reproducible (at event level).
It could be due to "memory" corruption.
What events were those? mc/real? 2018/2021?

@makortel
Author

> would be interesting to understand if it is reproducible (at event level).

It's not very reproducible. As I wrote in the description, it occurred once in 44 executions (with varying numbers of streams/threads). I could of course try to repeat it (with a high thread count).

> It could be due to "memory" corruption.
> What events were those? mc/real? 2018/2021?

Real, from the LS 142 of run 321177 from 2018D JetHT ("the usual").

@makortel
Author

On closer inspection I found another assertion failure in the logs of the 44 jobs. It was certainly a different event:

Begin processing the 801st record. Run 321177, Event 188206932, LumiSection 142 on stream 0 at 18-Sep-2019 10:21:15.290 PDT
cmsRun: .../CMSSW_11_0_0_pre7_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsImpl.h:320: void kernel_countMultiplicity(const HitContainer*, const Quality*, CAConstants::TupleMultiplicity*): Assertion `nhits < 8' failed.
wrong mult 439 -1787
...
Thread 91 (Thread 0x2aaec5800700 (LWP 70795)):
...
#5  0x00002aaaad63f207 in raise () from /lib64/libc.so.6
#6  0x00002aaaad6408f8 in abort () from /lib64/libc.so.6
#7  0x00002aaaad638026 in __assert_fail_base () from /lib64/libc.so.6
#8  0x00002aaaad6380d2 in __assert_fail () from /lib64/libc.so.6
#9  0x00002aab9a8c980d in CAHitNtupletGeneratorKernels<cudaCompat::CPUTraits>::launchKernels(TrackingRecHit2DHeterogeneous<cudaCompat::CPUTraits> const&, TrackSoAT<32768>*, CUstream_st*) () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
#10 0x00002aab9a87e413 in CAHitNtupletGeneratorOnGPU::makeTuples(TrackingRecHit2DHeterogeneous<cudaCompat::CPUTraits> const&, float) const () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so
#11 0x00002aab9a8cab99 in CAHitNtupletCUDA::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from .../CMSSW_11_0_0_pre7_Patatrack/lib/slc7_amd64_gcc820/pluginRecoPixelVertexingPixelTripletsPlugins.so

(this was on an 80-stream/thread job)

@makortel
Author

Hmm, I just repeated the 80-stream/thread job 150 times, with no failures.

VinInn commented Sep 24, 2019

The CPU workflow is supposed to be thread safe, except for the stats (not used in perfWf), which I still have to fix (they require proper handling of AtomicAdd).
I can only think of uninitialized memory that is zeroed by "chance".
That was the case at some point.
One may have to try running it under valgrind...
I will not blame cosmic rays or bad memory at NERSC.
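
As an illustration of the "proper handling of AtomicAdd" point for shared statistics, here is a minimal sketch of a counter that several CPU threads can update without a data race. It uses only the standard library (`std::atomic`, `std::thread`); the names `Counters`, `addTuples` and `runStreams` are hypothetical and this is not the Patatrack `cudaCompat` implementation.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical statistics block shared by all streams/threads.
struct Counters {
  std::atomic<std::uint64_t> nTuples{0};
};

// CPU-side analogue of an atomic add: a single indivisible read-modify-write.
inline void addTuples(Counters& c, std::uint64_t n) {
  c.nTuples.fetch_add(n, std::memory_order_relaxed);
}

// Spawn nThreads workers, each adding 1 to the shared counter
// eventsPerThread times, and return the final total.
inline std::uint64_t runStreams(unsigned nThreads, unsigned eventsPerThread) {
  Counters c;
  std::vector<std::thread> workers;
  workers.reserve(nThreads);
  for (unsigned t = 0; t < nThreads; ++t) {
    workers.emplace_back([&c, eventsPerThread] {
      for (unsigned e = 0; e < eventsPerThread; ++e) addTuples(c, 1);
    });
  }
  for (auto& w : workers) w.join();
  return c.nTuples.load();
}
```

With a plain `std::uint64_t` and `+=` instead of `fetch_add`, concurrent increments could be lost; with the atomic version the total always equals `nThreads * eventsPerThread`.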

@fwyzard fwyzard added bug Pixels Pixels-related developments labels Nov 28, 2020