Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HLT crash in run-368566 (BasicSingleVertexState error from PFTauPrimaryVertexProducer) #41914

Closed
missirol opened this issue Jun 9, 2023 · 21 comments

Comments

@missirol
Copy link
Contributor

missirol commented Jun 9, 2023

In run-368566 (pp collisions, release CMSSW_13_0_7), DAQ reported a CMSSW crash at HLT not seen previously, to my knowledge [link to HLT elog]. Metadata and exception message can be found in [1].

Unfortunately, the error-stream file could not be recovered this time, so there is no reproducer right now. I'm opening the issue anyway, in case experts have feedback, or in case this happens again.

FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei

[1]

msgtime:2023-06-07 02:37:39
doc_type:cmsswlog
date:2023-06-07T00:37:40.349Z
run:368566
host:fu-c2b05-37-01
pid:1509400
doctype:exception
severity:FATAL
severityVal:4
instance:global
lexicalId:4094303210
message:

An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 368566 lumi: 308 event: 634032227 stream: 18
   [1] Running path 'HLT_IsoMu24_eta2p1_MediumDeepTauPFTauHPS20_eta2p1_SingleL1_v4'
   [2] Calling method for module PFTauPrimaryVertexProducer/'hltHpsPFTauPrimaryVertexProducerForDeepTauForVBFIsoTau'
Exception Message:
BasicSingleVertexState::could not invert error matrix
@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 9, 2023

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@missirol
Copy link
Contributor Author

missirol commented Jun 9, 2023

assign hlt

@cmsbuild
Copy link
Contributor

cmsbuild commented Jun 9, 2023

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

assign reconstruction

FYI @cms-sw/pf-l2 @cms-sw/tau-pog-l2

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@VinInn
Copy link
Contributor

VinInn commented Jun 11, 2023

since when CMS allows that not being able to invert a matrix abort a production job?
Maybe all throw instances shall be reviewed (and removed, as apparently nobody is able to properly catch them)

@mmusich
Copy link
Contributor

mmusich commented Jul 12, 2023

There were two another instances of this in run 370497:

An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 370497 lumi: 433 event: 922478106 stream: 13
   [1] Running path 'HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v5'
   [2] Calling method for module PFTauPrimaryVertexProducer/'hltHpsPFTauPrimaryVertexProducerForDeepTauForVBFIsoTau'
Exception Message:
BasicSingleVertexState::could not invert error matrix 

@missirol
Copy link
Contributor Author

There were two another instances of this in run 370497:

Unfortunately, in this case too we don't happen to have the corresponding error-stream file. @smorovic (DAQ) explained that the file was deleted too early [1], and he provided the full log of the corresponding cmsRun job (old_hlt_run370497_pid2688582.log), an extract of which is in [2].

[1]

info:2023-07-11 14:26:51 - processCRASHfile - RUN:370497 - 'run370497_ls0000_crash_pid2688582.jsn' with errcode: 65
info:2023-07-11 14:26:51 - processCRASHFile - RUN:370497 - crash info -  numEvents: 1500, errCode: 65, streams:
info:2023-07-11 14:26:51 - processCRASHFile - RUN:370497 - error stream input file /fff/BUs/rubu-c2a15-11-01_ramdisk/run370497/fu/run370497_ls0433_index000060.raw is gone, possibly already deleted by the process [Errno 2] No such file or directory: '/fff/BUs/rubu-c2a15-11-01_ramdisk/run370497/fu/run370497_ls0433_index000060.raw'

[2]

%MSG-e UnusableBeamSpot:  PrimaryVertexProducer:hltVerticesL3  11-Jul-2023 14:26:48 CEST Run: 370497 Event: 922478106
Beamspot with invalid errors [            0           0           0
             0           0           0
             0           0           0 ]
%MSG
%MSG-w PrimaryVertexProducer:  PrimaryVertexProducer:hltVerticesL3  11-Jul-2023 14:26:48 CEST Run: 370497 Event: 922478106
no vertex found with 10 tracks and 1 vertex-candidates
%MSG
----- Begin Fatal Exception 11-Jul-2023 14:26:48 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 370497 lumi: 433 event: 922478106 stream: 13
   [1] Running path 'HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v5'
   [2] Calling method for module PFTauPrimaryVertexProducer/'hltHpsPFTauPrimaryVertexProducerForDeepTauForVBFIsoTau'
Exception Message:
BasicSingleVertexState::could not invert error matrix
----- End Fatal Exception -------------------------------------------------
%MSG-w FastMonitoringService:  PostProcessPath 11-Jul-2023 14:26:48 CEST  Run: 370497 Event: 922478106
 STREAM 13 earlyTermination -: ID:run: 370497 lumi: 433 event: 922478106 LS:433  FromThisContext
%MSG

@francescobrivio
Copy link
Contributor

francescobrivio commented Jul 13, 2023

If I understood correctly, this same error happened again in run 370560:

An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 370560 lumi: 249 event: 374762449 stream: 2
   [1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
   [2] Calling method for module PFTauPrimaryVertexProducer/'hltHpsPFTauPrimaryVertexProducerForDeepTau'
Exception Message:
BasicSingleVertexState::could not invert error matrix 

@smorovic Maybe in this case the streamers were not deleted? It would be really useful to have them to debug this issue...

@VinInn
Copy link
Contributor

VinInn commented Jul 13, 2023

Still: it should NOT throw, just report an error!

@smorovic
Copy link
Contributor

Hi @francescobrivio

Unfortunately, it seems this time it is also gone:
error stream input file /fff/BUs/rubu-c2a11-25-01_ramdisk/run370560/fu/run370560_ls0249_index000014.raw is gone

@mmusich
Copy link
Contributor

mmusich commented Jul 13, 2023

Still: it should NOT throw, just report an error!

While I agree disrupting processing is nasty, I am wondering if we would have ever uncovered the underlying issue without this. I am a bit hesitant in removing the throw.

@VinInn
Copy link
Contributor

VinInn commented Jul 13, 2023

an error message would do it. Throwing is against CMS coding rules

@fwyzard
Copy link
Contributor

fwyzard commented Jul 13, 2023

I would agree, but unfortunately error messages are completely ignored at HLT, and I presume also during the Offline processing.

@mmusich
Copy link
Contributor

mmusich commented Jul 13, 2023

I presume also during the Offline processing.

there's some effort in cleaning things up in #41456

@fwyzard
Copy link
Contributor

fwyzard commented Jul 13, 2023

Ah, thanks for the pointer.

@francescobrivio
Copy link
Contributor

francescobrivio commented Jul 13, 2023

Let me summarize the investigation we have done from the BeamSpot point of view:

  • The crash seems likely to appear due to the fact that HLT consumed a fakeBS (the crash should be now cured by [master] Adding cov matrix to fakeBS object #42239)
  • So the question moves to why the esProducer produced a fakeBS?
    AFIU there are only 3 possible cases:
    • Either the last IOVs/payloads are too old --> this can't be the case in the middle of a run
    • Or the produced BeamSpot fits from the DQM clients converged, but don't satisfy minimal selections --> we checked the payloads and this is also not the case for the LSs in question (Run 370560, LS=246/7/8/9)
    • Or the fwk wasn't able to access the DB so it returned a fakeBS --> this at the moment seems the most likely explanation

This was checked running "offline" (doing re-HLT) on the same LS reading the conditions from the HLT GT and the ESProducer doesn't produce any fake BS.

@smorovic is there a way to check if there were some Frontier/squid errors or warnings during the processing of these LSs?

FYI @gennai @dzuolo @lguzzi

@smorovic
Copy link
Contributor

Last case has a lot of various CMSSW log errors and warnings, not sure what to search for. The log file is here in my .cms home:
/nfshome0/smorovic/old_hlt_run370560_pid3163692.log

However, I think the failure to read condition data would crash the process (it's a fatal error), as far as we know from previous experience with frontier services going down. That would be appearing in F3mon as exception, stacktrace or Fatal log level.

Also, I found a bug causing files to be deleted early and prepared PRs with a fix:
master: #42258 backports: #42259 and #42260

@mmusich
Copy link
Contributor

mmusich commented Sep 12, 2024

As far as I know, this precise exception never happened in 2024.
A similar, albeit not same, issue is documented at #45555, in which a possible explanation is given in terms of temporary connection to DB glitch. We don't have a full proof, but we've equipped the menus with measures to be able to tell if it happens again.
@missirol I would be in favor of closing this issue to clear up the hlt-pending queue, let me know if you think otherwise.

@cmsbuild
Copy link
Contributor

cms-bot internal usage

@missirol
Copy link
Contributor Author

Okay. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants