[13.0.X] Fixed channel decoding for the timeout error in SiPixel RawToDigi #42033

sroychow · 2023-06-21T09:18:43Z

PR description:

Backport of #42010

PR validation:

code compiles

cmsbuild · 2023-06-21T09:19:07Z

A new Pull Request was created by @sroychow (Suvankar Roy Chowdhury) for CMSSW_13_0_X.

It involves the following packages:

EventFilter/SiPixelRawToDigi (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@mroguljic, @VinInn, @Martin-Grunewald, @missirol, @dkotlins, @ferencek, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

Backported from Fixed channel decoding for the timeout error in SiPixel RawToDigi #42010

sroychow · 2023-06-21T09:19:12Z

type bug-fix

sroychow · 2023-06-21T09:21:16Z

urgent

address issue Large GPU/CPU difference in soft electron reconstruction related to pixel unpacker #41715
further comment

sroychow · 2023-06-21T09:23:48Z

test parameters:

enable_tests = gpu
workflows_gpu = 10824.507
relvals_opt= -w upgrade
relvals_opt_gpu = -w upgrade

sroychow · 2023-06-21T09:24:02Z

please test

cmsbuild · 2023-06-21T13:46:38Z

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b59733/33287/summary.html
COMMIT: 1e8b5fd
CMSSW: CMSSW_13_0_X_2023-06-20-2300/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/42033/33287/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

The relvals timed out after 4 hours.

140.065505: 140.065505_Run3-2023_JetMET2022D_RecoPixelOnlyTripletsCPU/step2_Run3-2023_JetMET2022D_RecoPixelOnlyTripletsCPU.log
140.065: 140.065_RunJetMET2022D/step2_RunJetMET2022D.log
140.065511: 140.065511_Run3-2023_JetMET2022D_RecoECALOnlyCPU/step2_Run3-2023_JetMET2022D_RecoECALOnlyCPU.log

Expand to see more relval errors ...

Comparison Summary

Summary:

You potentially removed 239 lines from the logs
Reco comparison results: 675 differences found in the comparisons
DQMHistoTests: Total files compared: 49
DQMHistoTests: Total histograms compared: 3315916
DQMHistoTests: Total failures: 2216
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3313678
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
Checked 213 log files, 164 edm output root files, 49 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 47 differences found in the comparisons
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 56220
DQMHistoTests: Total failures: 2849
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 53371
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found

malbouis · 2023-06-22T12:08:38Z

should the tests be re-triggered?

mmusich · 2023-06-22T12:09:15Z

please test

cmsbuild · 2023-06-22T16:07:01Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b59733/33338/summary.html
COMMIT: 1e8b5fd
CMSSW: CMSSW_13_0_X_2023-06-22-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/42033/33338/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially added 249 lines to the logs
Reco comparison results: 675 differences found in the comparisons
DQMHistoTests: Total files compared: 49
DQMHistoTests: Total histograms compared: 3317136
DQMHistoTests: Total failures: 2216
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3314898
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
Checked 213 log files, 164 edm output root files, 49 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 31 differences found in the comparisons
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 56220
DQMHistoTests: Total failures: 2020
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 54200
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found
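
A quick arithmetic check of the two GPU comparison summaries can be sketched as follows. The counts are copied from the bot reports above; the helper name `summarize` and the dictionary layout are illustrative assumptions and are not part of cms-bot.

```python
# Counts copied from the two "GPU Comparison Summary" reports in this thread,
# keyed by the date of the bot report.
gpu_summaries = {
    "2023-06-21": {"total": 56220, "failures": 2849, "successes": 53371, "skipped": 0},
    "2023-06-22": {"total": 56220, "failures": 2020, "successes": 54200, "skipped": 0},
}


def summarize(counts):
    """Hypothetical helper: verify the counts add up, return the failure fraction."""
    accounted = counts["failures"] + counts["successes"] + counts["skipped"]
    if accounted != counts["total"]:
        raise ValueError(f"counts do not add up: {accounted} != {counts['total']}")
    return counts["failures"] / counts["total"]


failure_fractions = {label: summarize(c) for label, c in gpu_summaries.items()}

for label, fraction in failure_fractions.items():
    print(f"{label}: {fraction:.2%} of the compared GPU DQM histograms differ")
```

Both rounds tested the same commit (1e8b5fd) against different IBs, so the change in the failure count between them does not come from a change in the PR itself.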

malbouis · 2023-06-23T07:06:26Z

@sroychow , would you please confirm that the differences are as expected, also in the master PR: #42010 (comment)
If we would like to have it for data-taking in the new Run2023D Era, this PR should be merged and a new release cut very soon.
@saumyaphor4252 , FYI as ORM.
Thanks!

sroychow · 2023-06-23T14:21:35Z

@malbouis @jordan-martins After some checks, and also discussing with @mmusich, we think the differences in the GPU comparisons are spurious. If you look at even older PRs (which were merged before), e.g. 42014 or before, you should see some differences pointed out by the bot in the GPU comparisons. I would propose we can merge this PR.

malbouis · 2023-06-23T16:25:09Z

Thanks, @sroychow ! Indeed, it would be good to have this PR merged and a new release cut with this in.
@rappoccio , @perrotta , do you think we could move ahead?

rappoccio · 2023-06-23T16:30:04Z

@cms-sw/reconstruction-l2 can someone take a look at this urgent PR to sign off?

mandrenguyen · 2023-06-23T17:34:34Z

Apologies, but can someone reiterate why we would like to put this bug-fix directly into reconstruction without proper release validation?
I've read the comment from PPD here:
#42010 (comment)

But according to HLT, this PR will only affect the fallback mode when no GPU is available.
The motivation to modify the offline reconstruction without full validation is not completely clear to me.
I understand it's a bug-fix, but any change to reco risks introducing further bugs that we may not be able to spot immediately.

missirol · 2023-06-23T17:47:31Z

> But according to HLT, this PR will only affect the fallback mode when no GPU is available.

Small clarification (already mentioned at the last ORP): the CPU unpacker also runs at HLT for the fraction of events used for GPU-vs-CPU comparisons.

Edit: at HLT, the CPU pixel unpacker corresponds to the module hltSiPixelDigisLegacy. This module runs in the trigger AlCa_PFJet40_CPUOnly_v, which, in turn, is used by HLT_PFJet40_GPUvsCPU_v. The so-called "GPUvsCPU" comparisons in DQM use the trigger DQM_PixelReconstruction_v, but I'm not actually sure that the CPU pixel unpacker runs as part of that trigger. The latter uses hltSiPixelRecHitsFromLegacy, but that module consumes the SwitchProducer hltSiPixelClusters, which would correspond to the PixelClusters reconstructed on GPU, because DQM_PixelReconstruction_v only processes events if GPU offloading is enabled.

process.hltSiPixelRecHitsFromLegacy = cms.EDProducer( "SiPixelRecHitSoAFromLegacyPhase1",
    beamSpot = cms.InputTag( "hltOnlineBeamSpot" ),
    src = cms.InputTag( "hltSiPixelClusters" ),
    CPE = cms.string( "hltESPPixelCPEFast" ),
    convertToLegacy = cms.bool( True )
)

mmusich · 2023-06-23T18:00:04Z

> But according to HLT, this PR will only affect the fallback mode when no GPU is available.

Do we have a number about the fraction of events that take this path?

> directly into reconstruction without proper release validation?

I agree. This needs more validation. OTOH, done the standard way it will come too late to be useful if the intention from PPD is not to reprocess the last chunk of 2023 data (and to have it consistent with the reprocessed part, which presumably will have the fix).
Is there a way to expedite the validation, based on central resources?

mmusich · 2023-06-23T18:07:48Z

> but I'm not actually sure that the CPU pixel unpacker runs as part of that trigger

This statement is in square contradiction with the whole of #41715 .
If the pixel legacy unpacker is never run there is no issue to talk about. Please clarify.

missirol · 2023-06-23T18:16:17Z

(I edited my comment above about the GPU-vs-CPU comparisons, as already noticed)

> Do we have a number about the fraction of events that take this path?

I think this has happened very rarely. I think the number is 0 for 2023 collisions, and I could only remember one case where some nodes could not recognise their GPU during MWGR2 of 2023 (run-363833).

missirol · 2023-06-23T19:01:39Z

> > but I'm not actually sure that the CPU pixel unpacker runs as part of that trigger
>
> This statement is in square contradiction with the whole of #41715 . If the pixel legacy unpacker is never run there is no issue to talk about. Please clarify.

#41715 relates to offline studies where we compare the trigger results 'running on GPU' vs 'running on CPU'. The goal is to make sure the two reconstructions give the same results, since (in general) GPU is the default online, while CPU is the fallback online and is used in most offline use cases (e.g. MC).

Part of this validation also runs online as part of the HLT menu: there is a Path (HLT_PFJet40_GPUvsCPU_v) which fires when AlCa_PFJet40_v (GPU) and AlCa_PFJet40_CPUOnly_v (CPU) disagree, and there is the DQMGPUvsCPU stream for DQM comparisons involving Pixel, ECAL and HCAL. Online, the CPU pixel unpacker runs as part of AlCa_PFJet40_CPUOnly_v (which is needed by HLT_PFJet40_GPUvsCPU_v), but I'm not sure it runs as part of the DQMGPUvsCPU stream (pixel tracking on CPU does, but I'm not sure about pixel unpacking on CPU).
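
The decision logic of the comparison Path described above can be sketched in a few lines of Python. This is only a toy model of the statement "fires when the GPU and CPU reference triggers disagree"; the function name `gpu_vs_cpu_fires` is hypothetical, and the actual HLT implementation is not shown in this thread.

```python
def gpu_vs_cpu_fires(gpu_accept: bool, cpu_accept: bool) -> bool:
    """Toy model: the comparison path accepts an event only when the
    GPU-based and CPU-only reference decisions differ."""
    return gpu_accept != cpu_accept


# Enumerate all combinations of (AlCa_PFJet40_v, AlCa_PFJet40_CPUOnly_v) decisions.
truth_table = {
    (gpu, cpu): gpu_vs_cpu_fires(gpu, cpu)
    for gpu in (True, False)
    for cpu in (True, False)
}

for (gpu, cpu), fires in truth_table.items():
    print(f"GPU accept={gpu!s:5} CPU accept={cpu!s:5} -> comparison path fires={fires}")
```

In this model, evaluating the comparison requires the CPU-only decision to be computed, which is why the CPU unpacker has to run for those events.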

#41715 led to identifying a bugfix, and I thought PPD was in favour of backporting it (that's how I read #42010 (comment)). Whether or not this backport is critical for HLT, it can be debated. Maybe @silviodonato or @fwyzard have a different opinion.

jordan-martins · 2023-06-23T19:12:45Z

Hi @mandrenguyen,

Yes, since it is a bug fix, we decided to get in asap for the start of ERA D.

We do indeed want to perform some validation, and we have asked TRK to propose some approaches to assist PdmV with the best way to propose a quick validation that we could rely on. We wanted to get this in now because we only foresee a rereco of the initial chunk of the DATA from ERA A to C.

Do you think this is too risky? Could you propose a way around this to help us move in the safest way possible?

Thanks,
PPD

FYI @cms-sw/ppd-l2 @cms-sw/pdmv-l2

mandrenguyen · 2023-06-23T19:51:10Z

It would appear this is a long-standing bug in the unpacker, which has a very minor effect on the offline reconstruction, at least judging by the comparisons.
We can validate this bug-fix by the usual procedure, integrating it into the master release, and eventually backporting.
This procedure can be accelerated, but if we cannot meet the deadline for ERA D, I think I'm not properly understanding the negative consequences.
The prompt reco will be consistent across all eras with a minor bug (that is already present in all of our published papers?) that will be fixed for an eventual reprocessing.
The HLT will essentially be unaffected, since we are not using CPU at HLT.

I understand that there is some CPU-GPU validation that we would like to converge, but do we really want to risk breaking reco to fix validation? Especially since we heard this is only a partial fix, and there are at least more changes coming on the GPU side. The risk of directly implementing this bug-fix is admittedly small, but the consequences could be quite bad. I suppose we should be able to fix the validation using the fixed CPU code without actually deploying the fix in prompt reco for Era D.

mmusich · 2023-06-23T20:26:49Z

> The HLT will essentially be unaffected, since we are not using CPU at HLT.

> The goal is to make sure the two reconstructions give the same results, since (in general) GPU is the default online, while CPU is the fallback online and is used in most offline use cases (e.g. MC).

So it validates a use case that never happens (in practice). Now that this is clarified I guess it also settles the urgency for a fix (and for all future fix requests of this type as well).

> It would appear this is a long-standing bug in the unpacker, which has a very minor effect on the offline reconstruction, at least judging by the comparisons.

The effect is minor when the detector is well-behaved. I am under the impression the effect becomes more sizeable in the presence of large rates of soft error recoveries.

fwyzard · 2023-06-23T20:34:03Z

The bugfix should definitely be backported and included in the online reconstruction ASAP.

RECO conveners may have a cavalier approach towards the integrity of the data taking, but DAQ and TSG (should) consider the stability and correctness of the online reconstruction and data taking of paramount importance.

mmusich · 2023-06-23T20:35:48Z

Mostly to clarify to myself

> but I'm not sure it runs as part of the DQMGPUvsCPU stream (pixel tracking on CPU does, but I'm not sure about pixel unpacking on CPU).

In case one of the legacy unpacker event products is persisted in order to be output in the event stream, as we're planning in https://its.cern.ch/jira/browse/CMSHLT-2846, I would think it will start to be run as well.

mandrenguyen · 2023-06-24T08:51:45Z

Does something prevent us from using an era or process modifier such that:

* We can delay the deployment of the new pixel unpacker code offline until it's validated
* GPUvsCPU validation can be conducted with the modified code
* HLT can deploy the modified code in the unlikely event that it's executed

?

perrotta · 2023-06-25T14:30:11Z

backport of #42010

clacaputo · 2023-06-26T10:49:31Z

+reconstruction

differences expected, see [13.0.X] Fixed channel decoding for the timeout error in SiPixel RawToDigi #42033 (comment)

cmsbuild · 2023-06-26T10:49:49Z

This pull request is fully signed and it will be integrated in one of the next CMSSW_13_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_13_2_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2023-06-26T11:50:30Z

+1

As agreed at the joint operation meeting of June 26, this may enter a possible new release on top of the already built 13_0_8.

fwyzard · 2023-06-26T12:07:10Z

> Does something prevent us from using an era or process modifier such that:
>
> * We can delay the deployment of the new pixel unpacker code offline until it's validated
>
> * GPUvsCPU validation can be conducted with the modified code
>
> * HLT can deploy the modified code in the unlikely event that it's executed
>
> ?

@mandrenguyen that's an interesting approach, but looking at the changes, I do not think it can be done: an era or process modifier can affect only the python configuration, while this bug fix is a c++ change.
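
The distinction drawn here can be illustrated with a self-contained toy model. The classes `ToyModule` and `ToyModifier`, the parameter `offset`, and the function `run_compiled_plugin` are all hypothetical stand-ins written for this sketch; they are not the CMSSW `FWCore.ParameterSet.Config` API. The sketch only assumes what the comment states: a modifier edits the python configuration, while the logic of a compiled plugin is fixed when the release is built.

```python
class ToyModule:
    """Stand-in for a configured module: a plugin name plus its parameters."""

    def __init__(self, plugin, **params):
        self.plugin = plugin
        self.params = dict(params)


class ToyModifier:
    """Stand-in for an era/process modifier: it can only rewrite parameters."""

    def __init__(self, active=False):
        self.active = active

    def to_modify(self, module, **changes):
        # Applied only when the modifier is active; touches configuration only.
        if self.active:
            module.params.update(changes)


def run_compiled_plugin(module, raw_word):
    """Stand-in for compiled C++ code: the formula below is arbitrary and
    cannot be altered from the configuration, only its inputs can."""
    return raw_word + module.params["offset"]


unpacker = ToyModule("ToyUnpacker", offset=1)

ToyModifier(active=False).to_modify(unpacker, offset=5)
result_inactive = run_compiled_plugin(unpacker, 10)  # parameter unchanged

ToyModifier(active=True).to_modify(unpacker, offset=5)
result_active = run_compiled_plugin(unpacker, 10)  # parameter changed, same formula

print(result_inactive, result_active)
```

In this model a modifier can change what the plugin is fed, but a fix to the body of `run_compiled_plugin` would require rebuilding the code, which is the situation described for this PR.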

missirol · 2023-06-26T12:37:16Z

> Small clarification (already mentioned at the last ORP): the CPU unpacker also runs at HLT for the fraction of events used for GPU-vs-CPU comparisons. Edit : at HLT, the CPU pixel unpacker corresponds to the module hltSiPixelDigisLegacy [..]

For completeness, I have to correct myself again wrt #42033 (comment) (apologies).

As one can see from here, there are two instances of the CPU Pixel unpacker at HLT, hltSiPixelDigisLegacy and hltSiPixelDigisRegForDisplaced. The latter is used by a subset of EXO triggers that still haven't been ported to the heterogeneous pixel reconstruction (see, for example, the trigger HLT_HT430_DelayedJet40_SingleDelay0p5nsTrackless_v*, which uses the module hltSiPixelDigisRegForDisplaced).

fixed channel decoding for the timeout error

1e8b5fd

cmsbuild added this to the CMSSW_13_0_X milestone Jun 21, 2023

cmsbuild added reconstruction-pending pending-signatures tests-pending orp-pending trk labels Jun 21, 2023

cmsbuild added the bug-fix label Jun 21, 2023

cmsbuild added the urgent label Jun 21, 2023

cmsbuild added tests-started and removed tests-pending labels Jun 21, 2023

cmsbuild added tests-rejected and removed tests-started labels Jun 21, 2023

cmsbuild added tests-started and removed tests-rejected labels Jun 22, 2023

cmsbuild added tests-approved and removed tests-started labels Jun 22, 2023

cmsbuild added the backport-ok label Jun 25, 2023

clacaputo mentioned this pull request Jun 26, 2023

[13.1.X] Fixed channel decoding for the timeout error in SiPixel RawToDigi #42034

Merged

cmsbuild added reconstruction-approved fully-signed and removed reconstruction-pending pending-signatures labels Jun 26, 2023

cmsbuild added orp-approved and removed orp-pending labels Jun 26, 2023

cmsbuild merged commit 8d8e22c into cms-sw:CMSSW_13_0_X Jun 26, 2023

cmsbuild mentioned this pull request Jun 26, 2023

[13_0_X] Addition of 2023 WFs and Fixing NANO step #42089

Merged

fwyzard mentioned this pull request Jul 27, 2023

HLT crash in run-367906 (sistrip::FEDBuffer::findChannels()) #41786

Open
