[13.0.X] Fixed channel decoding for the timeout error in SiPixel RawToDigi #42033

Conversation

sroychow
Contributor

PR description:

Backport of #42010

PR validation:

code compiles

@cmsbuild
Contributor

cmsbuild commented Jun 21, 2023

A new Pull Request was created by @sroychow (Suvankar Roy Chowdhury) for CMSSW_13_0_X.

It involves the following packages:

  • EventFilter/SiPixelRawToDigi (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@mroguljic, @VinInn, @Martin-Grunewald, @missirol, @dkotlins, @ferencek, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@sroychow
Contributor Author

type bug-fix

@sroychow
Contributor Author

test parameters:

  • enable_tests = gpu
  • workflows_gpu = 10824.507
  • relvals_opt= -w upgrade
  • relvals_opt_gpu = -w upgrade

@sroychow
Contributor Author

please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b59733/33287/summary.html
COMMIT: 1e8b5fd
CMSSW: CMSSW_13_0_X_2023-06-20-2300/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/42033/33287/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

The relvals timed out after 4 hours.

  • 140.065505 140.065505_Run3-2023_JetMET2022D_RecoPixelOnlyTripletsCPU/step2_Run3-2023_JetMET2022D_RecoPixelOnlyTripletsCPU.log
  • 140.065 140.065_RunJetMET2022D/step2_RunJetMET2022D.log
  • 140.065511 140.065511_Run3-2023_JetMET2022D_RecoECALOnlyCPU/step2_Run3-2023_JetMET2022D_RecoECALOnlyCPU.log

Comparison Summary

Summary:

  • You potentially removed 239 lines from the logs
  • Reco comparison results: 675 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3315916
  • DQMHistoTests: Total failures: 2216
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3313678
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 47 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 56220
  • DQMHistoTests: Total failures: 2849
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 53371
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

@malbouis
Contributor

should the tests be re-triggered?

@mmusich
Contributor

mmusich commented Jun 22, 2023

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b59733/33338/summary.html
COMMIT: 1e8b5fd
CMSSW: CMSSW_13_0_X_2023-06-22-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/42033/33338/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 249 lines to the logs
  • Reco comparison results: 675 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3317136
  • DQMHistoTests: Total failures: 2216
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3314898
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 31 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 56220
  • DQMHistoTests: Total failures: 2020
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 54200
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

@malbouis
Contributor

@sroychow , would you please confirm that the differences are as expected, also in the master PR: #42010 (comment)
If we would like to have it for data-taking in the new Run2023D Era, this PR should be merged and a new release cut very soon.
@saumyaphor4252 , FYI as ORM.
Thanks!

@sroychow
Contributor Author

@malbouis @jordan-martins After some checks, and also after discussing with @mmusich, we think the differences in the GPU comparisons are spurious. If you look at older PRs that were already merged (e.g. 42014 or earlier), you should see the bot pointing out similar differences in the GPU comparisons. I would propose that we merge this PR.
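
As a rough cross-check of the numbers quoted by the bot above, the failure fractions of the two test runs can be computed directly. This is only a sketch: the counts are copied from the two summaries (both run on commit 1e8b5fd, against different IB baselines), while the dictionary layout and the function name are made up for illustration and are not part of any CMSSW tool.

```python
# Counts copied from the two cms-bot summaries above (same commit 1e8b5fd).
# The data layout and names are illustrative only.
summaries = {
    ("CPU", "first test"): {"total": 3315916, "failures": 2216, "successes": 3313678, "skipped": 22},
    ("CPU", "second test"): {"total": 3317136, "failures": 2216, "successes": 3314898, "skipped": 22},
    ("GPU", "first test"): {"total": 56220, "failures": 2849, "successes": 53371, "skipped": 0},
    ("GPU", "second test"): {"total": 56220, "failures": 2020, "successes": 54200, "skipped": 0},
}


def failure_fraction(counts):
    """Return failures / total after checking that the counts add up."""
    assert counts["failures"] + counts["successes"] + counts["skipped"] == counts["total"]
    return counts["failures"] / counts["total"]


for (kind, run), counts in summaries.items():
    print(f"{kind} {run}: {100 * failure_fraction(counts):.3f}% of histograms differ")
```

The CPU failure count is the same in both runs, while the GPU count changes between two runs of the same commit. That is at least consistent with the GPU differences not being driven by this PR, although the different baselines mean it is not a proof.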

@malbouis
Contributor

Thanks, @sroychow ! Indeed, it would be good to have this PR merged and a new release cut with this in.
@rappoccio , @perrotta , do you think we could move ahead?

@rappoccio
Contributor

@cms-sw/reconstruction-l2 can someone take a look at this urgent PR to sign off?

@mandrenguyen
Contributor

Apologies, but can someone reiterate why we would like to put this bug-fix directly into reconstruction without proper release validation?
I've read the comment from PPD here:
#42010 (comment)

But according to HLT, this PR will only affect the fallback mode when no GPU is available.
The motivation to modify the offline reconstruction without full validation is not completely clear to me.
I understand it's a bug-fix, but any change to reco risks introducing further bugs that we may not be able to spot immediately.

@missirol
Contributor

missirol commented Jun 23, 2023

> But according to HLT, this PR will only affect the fallback mode when no GPU is available.

Small clarification (already mentioned at the last ORP): the CPU unpacker also runs at HLT for the fraction of events used for GPU-vs-CPU comparisons.

Edit: at HLT, the CPU pixel unpacker corresponds to the module hltSiPixelDigisLegacy. This module runs in the trigger AlCa_PFJet40_CPUOnly_v, which, in turn, is used by HLT_PFJet40_GPUvsCPU_v. The so-called "GPUvsCPU" comparisons in DQM use the trigger DQM_PixelReconstruction_v, but I'm not actually sure that the CPU pixel unpacker runs as part of that trigger. The latter uses hltSiPixelRecHitsFromLegacy, but that module consumes the SwitchProducer hltSiPixelClusters, which would correspond to the PixelClusters reconstructed on GPU, because DQM_PixelReconstruction_v only processes events if GPU offloading is enabled.

process.hltSiPixelRecHitsFromLegacy = cms.EDProducer( "SiPixelRecHitSoAFromLegacyPhase1",
    beamSpot = cms.InputTag( "hltOnlineBeamSpot" ),
    src = cms.InputTag( "hltSiPixelClusters" ),
    CPE = cms.string( "hltESPPixelCPEFast" ),
    convertToLegacy = cms.bool( True )
)
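
The reasoning above (whether the CPU unpacker runs depends on which branch of the SwitchProducer is taken) can be sketched with a toy dependency walk. This is plain Python, not FWCore, and it does not read a real HLT menu: the names `gpuClusters` and `cpuClusters` are hypothetical placeholders for the two branches, and the assumption that the CPU branch consumes hltSiPixelDigisLegacy is mine, made for illustration.

```python
# Toy dependency graph; this is NOT FWCore and does not read a real HLT menu.
# A plain list means "always consumes these inputs"; a dict models a
# SwitchProducer whose branch depends on whether GPU offloading is enabled.
GRAPH = {
    "hltSiPixelRecHitsFromLegacy": ["hltOnlineBeamSpot", "hltSiPixelClusters"],
    "hltSiPixelClusters": {
        "cuda": ["gpuClusters"],  # hypothetical name for the GPU branch
        "cpu": ["cpuClusters"],   # hypothetical name for the CPU branch
    },
    "gpuClusters": [],
    "cpuClusters": ["hltSiPixelDigisLegacy"],  # assumed: CPU branch needs the CPU unpacker
    "hltSiPixelDigisLegacy": [],
    "hltOnlineBeamSpot": [],
}


def upstream(module, gpu_enabled, graph=GRAPH):
    """Collect every module that must run to produce `module`."""
    needed = set()
    stack = [module]
    while stack:
        current = stack.pop()
        inputs = graph.get(current, [])
        if isinstance(inputs, dict):
            # Resolve the switch to a single branch.
            inputs = inputs["cuda" if gpu_enabled else "cpu"]
        for name in inputs:
            if name not in needed:
                needed.add(name)
                stack.append(name)
    return needed


print("hltSiPixelDigisLegacy" in upstream("hltSiPixelRecHitsFromLegacy", gpu_enabled=True))
print("hltSiPixelDigisLegacy" in upstream("hltSiPixelRecHitsFromLegacy", gpu_enabled=False))
```

In this toy model the CPU unpacker is pulled in only when GPU offloading is disabled, which is the situation described for DQM_PixelReconstruction_v.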

@mmusich
Contributor

mmusich commented Jun 23, 2023

> But according to HLT, this PR will only affect the fallback mode when no GPU is available.

Do we have a number about the fraction of events that take this path?

> directly into reconstruction without proper release validation?

I agree. This needs more validation. OTOH, done the standard way, it will come too late to be useful if the intention from PPD is to not reprocess the last chunk of 2023 data (and to have it consistent with the reprocessed part, which presumably will have the fix).
Is there a way to expedite the validation, based on central resources?

@mmusich
Contributor

mmusich commented Jun 23, 2023

> but I'm not actually sure that the CPU pixel unpacker runs as part of that trigger

This statement is in square contradiction with the whole of #41715 .
If the pixel legacy unpacker is never run there is no issue to talk about. Please clarify.

@missirol
Contributor

(I edited my comment above about the GPU-vs-CPU comparisons, as already noticed)

> Do we have a number about the fraction of events that take this path?

I think this has happened very rarely. I think the number is 0 for 2023 collisions, and I could only remember one case where some nodes could not recognise their GPU during MWGR2 of 2023 (run-363833).

@missirol
Contributor

missirol commented Jun 23, 2023

> > but I'm not actually sure that the CPU pixel unpacker runs as part of that trigger
>
> This statement is in square contradiction with the whole of #41715 . If the pixel legacy unpacker is never run there is no issue to talk about. Please clarify.

#41715 relates to offline studies where we compare the trigger results 'running on GPU' vs 'running on CPU'. The goal is to make sure the two reconstructions give the same results, since (in general) GPU is the default online, while CPU is the fallback online and is used in most offline use cases (e.g. MC).

Part of this validation also runs online as part of the HLT menu: there is a Path (HLT_PFJet40_GPUvsCPU_v) which fires when AlCa_PFJet40_v (GPU) and AlCa_PFJet40_CPUOnly_v (CPU) disagree, and there is the DQMGPUvsCPU stream for DQM comparisons involving Pixel, ECAL and HCAL. Online, the CPU pixel unpacker runs as part of AlCa_PFJet40_CPUOnly_v (which is needed by HLT_PFJet40_GPUvsCPU_v), but I'm not sure it runs as part of the DQMGPUvsCPU stream (pixel tracking on CPU does, but I'm not sure about pixel unpacking on CPU).
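
The logic of the comparison Path described above amounts to an exclusive-or of the two decisions. A minimal sketch, assuming the per-event decisions are available as booleans (the function name and the example events are hypothetical, and this is not how the HLT menu implements it):

```python
def gpu_vs_cpu_fires(gpu_accept: bool, cpu_accept: bool) -> bool:
    """Toy stand-in for HLT_PFJet40_GPUvsCPU_v: accept only on disagreement."""
    return gpu_accept != cpu_accept


# Hypothetical per-event decisions of AlCa_PFJet40_v (GPU)
# and AlCa_PFJet40_CPUOnly_v (CPU).
events = [(True, True), (False, False), (True, False), (False, True)]
fired = [gpu_vs_cpu_fires(gpu, cpu) for gpu, cpu in events]
print(fired)  # [False, False, True, True]
```

Events where both reconstructions agree are never selected, so any event in this Path points to a GPU-vs-CPU discrepancy worth inspecting.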

#41715 led to identifying a bugfix, and I thought PPD was in favour of backporting it (that's how I read #42010 (comment)). Whether or not this backport is critical for HLT can be debated. Maybe @silviodonato or @fwyzard have a different opinion.

@jordan-martins
Contributor

Hi @mandrenguyen,

Yes, since it is a bug fix, we decided to get it in ASAP for the start of Era D.

We do indeed want to perform some validation, and we have asked TRK to propose approaches that would help PdmV set up a quick validation we can rely on. We wanted to get this in now because we only foresee a rereco of the initial chunk of the data, from Era A to C.

Do you think this is too risky? Could you propose a way around this to help us move in the safest way possible?

Thanks,
PPD

FYI @cms-sw/ppd-l2 @cms-sw/pdmv-l2

@mandrenguyen
Contributor

It would appear this is a long-standing bug in the unpacker, which has a very minor effect on the offline reconstruction, at least judging by the comparisons.
We can validate this bug-fix by the usual procedure, integrating it into the master release, and eventually backporting.
This procedure can be accelerated, but if we cannot meet the deadline for ERA D, I think I'm not properly understanding the negative consequences.
The prompt reco will be consistent across all eras with a minor bug (that is already present in all of our published papers?) that will be fixed for an eventual reprocessing.
The HLT will essentially be unaffected, since we are not using CPU at HLT.

I understand that there is some CPU-GPU validation that we would like to converge, but do we really want to risk breaking reco to fix validation? Especially since we heard this is only a partial fix, and there are at least more changes coming on the GPU side. The risk of directly implementing this bug-fix is admittedly small, but the consequences could be quite bad. I suppose we should be able to fix the validation using the fixed CPU code without actually deploying the fix in prompt reco for Era D.

@mmusich
Contributor

mmusich commented Jun 23, 2023

> The HLT will essentially be unaffected, since we are not using CPU at HLT.

> The goal is to make sure the two reconstructions give the same results, since (in general) GPU is the default online, while CPU is the fallback online and is used in most offline use cases (e.g. MC).

So it validates a use case that never happens (in practice). Now that this is clarified I guess it also settles the urgency for a fix (and for all future fix requests of this type as well).

> It would appear this is a long-standing bug in the unpacker, which has a very minor effect on the offline reconstruction, at least judging by the comparisons.

The effect is minor when the detector is well-behaved. I am under the impression that the effect becomes more sizeable in the presence of large rates of soft error recoveries.

@fwyzard
Contributor

fwyzard commented Jun 23, 2023

The bugfix should definitely be backported and included in the online reconstruction ASAP.

RECO conveners may have a cavalier approach towards the integrity of the data taking, but DAQ and TSG (should) consider the stability and correctness of the online reconstruction and data taking of paramount importance.

@mmusich
Contributor

mmusich commented Jun 23, 2023

Mostly to clarify to myself

> but I'm not sure it runs as part of the DQMGPUvsCPU stream (pixel tracking on CPU does, but I'm not sure about pixel unpacking on CPU).

If one of the legacy unpacker event products were persisted to be output in the event stream, as we're planning in https://its.cern.ch/jira/browse/CMSHLT-2846, I would think it would start to be run as well.

@mandrenguyen
Contributor

Does something prevent us from using an era or process modifier such that:

  • We can delay the deployment of the new pixel unpacker code offline until it's validated
  • GPUvsCPU validation can be conducted with the modified code
  • HLT can deploy the modified code in the unlikely event that it's executed

?

@perrotta
Contributor

backport of #42010

@clacaputo
Contributor

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_13_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_13_2_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

  • As agreed at the joint operation meeting of June 26, this may enter a possible new release on top of the already built 13_0_8

@cmsbuild cmsbuild merged commit 8d8e22c into cms-sw:CMSSW_13_0_X Jun 26, 2023

@fwyzard
Contributor

fwyzard commented Jun 26, 2023

> Does something prevent us from using an era or process modifier such that:
>
>   • We can delay the deployment of the new pixel unpacker code offline until it's validated
>   • GPUvsCPU validation can be conducted with the modified code
>   • HLT can deploy the modified code in the unlikely event that it's executed
>
> ?

@mandrenguyen that's an interesting approach, but looking at the changes, I do not think it can be done: an era or process modifier can affect only the Python configuration, while this bug fix is a C++ change.
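
To make the distinction concrete, here is a minimal sketch in plain Python. It uses a mock class, not the real FWCore modifier, and it assumes a hypothetical boolean parameter `useFixedTimeoutDecoding` that the C++ unpacker would have to read. No such parameter is part of this PR, so this only illustrates what a modifier could and could not do.

```python
class MockModifier:
    """Minimal stand-in for an era/process modifier; not the FWCore class."""

    def __init__(self, active=False):
        self.active = active

    def to_modify(self, module_params, **changes):
        # A modifier only rewrites configuration values, and only when active.
        if self.active:
            module_params.update(changes)


# Hypothetical configuration of the CPU unpacker module.
unpacker = {"type": "SiPixelRawToDigi", "useFixedTimeoutDecoding": False}

hlt_only = MockModifier(active=True)
hlt_only.to_modify(unpacker, useFixedTimeoutDecoding=True)
print(unpacker["useFixedTimeoutDecoding"])  # True
```

The modifier can flip the flag, but the flag changes nothing unless the C++ code branches on it, and adding that branch would itself be a C++ change that needs validation.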

@missirol
Contributor

> Small clarification (already mentioned at the last ORP): the CPU unpacker also runs at HLT for the fraction of events used for GPU-vs-CPU comparisons. Edit: at HLT, the CPU pixel unpacker corresponds to the module hltSiPixelDigisLegacy [..]

For completeness, I have to correct myself again wrt #42033 (comment) (apologies).

As one can see from here, there are two instances of the CPU Pixel unpacker at HLT, hltSiPixelDigisLegacy and hltSiPixelDigisRegForDisplaced. The latter is used by a subset of EXO triggers that still haven't been ported to the heterogeneous pixel reconstruction (see, for example, the trigger HLT_HT430_DelayedJet40_SingleDelay0p5nsTrackless_v*, which uses the module hltSiPixelDigisRegForDisplaced).
