Add debugging capabilities to the CachingAllocator #45341

fwyzard · 2024-06-28T14:41:13Z

PR description:

Extend the alpaka CachingAllocator to optionally fill, with a configurable value, every memory block that is allocated, cached for re-use, re-used, or deallocated.

Extend the AlpakaService to configure the host and device CachingAllocators.

Add a simple test to load the AlpakaService.
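As a rough illustration of the four fill points described above, here is a toy, host-only model in Python. It is not the alpaka CachingAllocator: the class `ToyCachingAllocator`, the `max_cached` limit, and the `bytearray` blocks are hypothetical stand-ins. The fallback rules (a re-used block gets the allocation value when only `fillAllocations` is set, and a cached block gets the deallocation value when only `fillDeallocations` is set) are an assumption inferred from the usage notes in this description.

```python
from dataclasses import dataclass


@dataclass
class FillOptions:
    # Option names mirror the configuration parameters listed in this PR.
    fillAllocations: bool = False
    fillAllocationValue: int = 0xA5
    fillReallocations: bool = False
    fillReallocationValue: int = 0x69
    fillDeallocations: bool = False
    fillDeallocationValue: int = 0x5A
    fillCaches: bool = False
    fillCacheValue: int = 0x96


class ToyCachingAllocator:
    """Toy model: freed blocks are cached by size, up to max_cached blocks."""

    def __init__(self, options, max_cached=1):
        self.options = options
        self.max_cached = max_cached
        self.cache = {}  # block size -> list of cached bytearrays

    @staticmethod
    def _fill(block, value):
        # Overwrite every byte of the block in place.
        block[:] = bytes([value]) * len(block)

    def allocate(self, size):
        opts = self.options
        cached = self.cache.get(size)
        if cached:
            # Re-use a cached block.
            block = cached.pop()
            if opts.fillReallocations:
                self._fill(block, opts.fillReallocationValue)
            elif opts.fillAllocations:
                # Assumed fallback: re-use is treated like an allocation.
                self._fill(block, opts.fillAllocationValue)
            return block
        # No cached block of this size: make a new one.
        block = bytearray(size)
        if opts.fillAllocations:
            self._fill(block, opts.fillAllocationValue)
        return block

    def free(self, block):
        opts = self.options
        n_cached = sum(len(blocks) for blocks in self.cache.values())
        if n_cached < self.max_cached:
            # Keep the block for later re-use.
            if opts.fillCaches:
                self._fill(block, opts.fillCacheValue)
            elif opts.fillDeallocations:
                # Assumed fallback: caching is treated like a deallocation.
                self._fill(block, opts.fillDeallocationValue)
            self.cache.setdefault(len(block), []).append(block)
            return "cached"
        # Cache is full: the block is really released.
        if opts.fillDeallocations:
            self._fill(block, opts.fillDeallocationValue)
        return "deallocated"


allocator = ToyCachingAllocator(FillOptions(fillAllocations=True, fillDeallocations=True))
first = allocator.allocate(4)
print(first.hex())           # contents right after allocation
print(allocator.free(first))  # whether the block was cached or released
print(first.hex())           # contents after being returned
```

With only the two boolean switches enabled, the model uses one value on the way out of the allocator and another on the way back in; enabling all four switches distinguishes fresh allocations from re-use, and caching from real deallocation.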

To fill the NVIDIA GPU memory before every allocation or reuse with 0xA5, you can now use

process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocations = True

To fill the NVIDIA GPU memory before every deallocation or caching with 0x5A, you can now use

process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocations = True

To use different values and combinations for allocation, deallocation, caching, and reuse, the full set of options is

process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocationValue = 0xA5
process.AlpakaServiceCudaAsync.deviceAllocator.fillReallocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillReallocationValue = 0x69
process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocationValue = 0x5A
process.AlpakaServiceCudaAsync.deviceAllocator.fillCaches = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillCacheValue = 0x96
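One way these values might be used while debugging: if a buffer copied back to the host consists entirely of one of the four bytes, it was probably read before being written, or after being returned to the allocator. The helper below is a hypothetical sketch, not part of this PR; `classify_fill` and the `PATTERNS` table are illustrative names, and the table assumes the four values shown in the options above.

```python
# Fill bytes taken from the option values listed above (assumed to be in use).
PATTERNS = {
    0xA5: "allocation",
    0x69: "reallocation",
    0x5A: "deallocation",
    0x96: "cache",
}


def classify_fill(data):
    """Return the fill name if every byte of data is the same pattern byte, else None."""
    if len(data) == 0:
        return None
    first = data[0]
    if first in PATTERNS and all(byte == first for byte in data):
        return PATTERNS[first]
    return None


print(classify_fill(b"\xa5" * 16))
print(classify_fill(b"\x00\x01\x02\x03"))
```

A match is only a hint: real data can legitimately contain these byte values, so the check is most useful on buffers of more than a few bytes.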

To do the same for the pinned host memory used in the GPU transfers, process.AlpakaServiceCudaAsync.hostAllocator accepts the same options.

To do the same for AMD GPUs, replace AlpakaServiceCudaAsync with AlpakaServiceROCmAsync.

To do the same for the CPU memory used by the alpaka modules running on the host, replace AlpakaServiceCudaAsync with AlpakaServiceSerialSync.
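The same options can be applied to every backend and both allocators in one pass. The sketch below is an illustration under assumptions: `enable_fill_debugging` is a hypothetical helper, and a `SimpleNamespace` stands in for the `cms.Process` object so that the snippet runs without CMSSW. It has not been tested against a real configuration, where the services are expected to be already loaded on the process.

```python
from types import SimpleNamespace

# Option names and example values as listed in this PR description.
FILL_OPTIONS = {
    "fillAllocations": True,
    "fillAllocationValue": 0xA5,
    "fillReallocations": True,
    "fillReallocationValue": 0x69,
    "fillDeallocations": True,
    "fillDeallocationValue": 0x5A,
    "fillCaches": True,
    "fillCacheValue": 0x96,
}

SERVICES = ("AlpakaServiceCudaAsync", "AlpakaServiceROCmAsync", "AlpakaServiceSerialSync")
ALLOCATORS = ("deviceAllocator", "hostAllocator")


def enable_fill_debugging(process, options=FILL_OPTIONS):
    """Set the fill options on each allocator of each alpaka service present on process.

    Services or allocators that are not defined are skipped.
    Returns the list of "service.allocator" names that were configured.
    """
    configured = []
    for service_name in SERVICES:
        service = getattr(process, service_name, None)
        if service is None:
            continue
        for allocator_name in ALLOCATORS:
            allocator = getattr(service, allocator_name, None)
            if allocator is None:
                continue
            for name, value in options.items():
                setattr(allocator, name, value)
            configured.append(f"{service_name}.{allocator_name}")
    return configured


# Stand-in for a cms.Process with only the CUDA service loaded (assumption).
process = SimpleNamespace(
    AlpakaServiceCudaAsync=SimpleNamespace(
        deviceAllocator=SimpleNamespace(),
        hostAllocator=SimpleNamespace(),
    )
)
configured = enable_fill_debugging(process)
print(configured)
```

Skipping missing services keeps the helper usable on machines where only some backends are configured.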

PR validation:

The new unit tests pass.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 14.0.x for data taking.


fwyzard · 2024-06-28T14:41:25Z

enable gpu

fwyzard · 2024-06-28T14:41:28Z

please test

cmsbuild · 2024-06-28T14:41:33Z

cms-bot internal usage

cmsbuild · 2024-06-28T14:49:13Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45341/40754

This PR adds an extra 28KB to repository
There are other open Pull requests which might conflict with changes you have proposed:
- File HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h modified in PR(s): Aggiornamento efficienze singola traccia #45326
- File HeterogeneousCore/AlpakaInterface/interface/getDeviceCachingAllocator.h modified in PR(s): Aggiornamento efficienze singola traccia #45326

cmsbuild · 2024-06-28T14:49:28Z

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

HeterogeneousCore/AlpakaInterface (heterogeneous)
HeterogeneousCore/AlpakaServices (heterogeneous)

@fwyzard, @makortel can you please review it and eventually sign? Thanks.
@makortel, @missirol, @rovere this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

fwyzard · 2024-06-28T14:50:20Z

+heterogeneous

Self-signed because @makortel is still away. I'm happy to address any comments and accept any suggestions to improve the system when he comes back.

cmsbuild · 2024-06-28T14:50:46Z

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

cmsbuild · 2024-06-28T19:41:39Z

-1

Failed Tests: HeaderConsistency
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-753c9a/40148/summary.html
COMMIT: 660603a
CMSSW: CMSSW_14_1_X_2024-06-28-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45341/40148/install.sh to create a dev area with all the needed externals and cmssw changes.

DAS Queries: The DAS query tests failed, see the summary page for details.

Comparison Summary

Summary:

You potentially removed 1 lines from the logs
Reco comparison results: 2660 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3345088
DQMHistoTests: Total failures: 1046
DQMHistoTests: Total nulls: 4
DQMHistoTests: Total successes: 3344018
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: found differences in 1 / 46 workflows

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 3
DQMHistoTests: Total histograms compared: 39744
DQMHistoTests: Total failures: 18
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 39726
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
Checked 8 log files, 10 edm output root files, 3 DQM output files
TriggerResults: no differences found

mmusich · 2024-07-01T09:35:55Z

ignore tests-rejected with external-failure

rappoccio · 2024-07-01T15:38:44Z

+1

rappoccio · 2024-07-01T15:38:48Z

merge

fwyzard added 2 commits June 28, 2024 16:34

Add debugging capabilities to the CachingAllocator

1577ce4

Extend the CachingAllocator to optionally fill with a configurable value all memory blocks that are: allocated, cached for re-use, re-used, or deallocated. Extend the AlpakaService to configure the host and device CachingAllocators.

Add a simple test to load the AlpakaService

660603a

cmsbuild added this to the CMSSW_14_1_X milestone Jun 28, 2024

cmsbuild added pending-signatures orp-pending tests-started code-checks-pending heterogeneous-pending labels Jun 28, 2024

This was referenced Jun 28, 2024

Add debugging capabilities to the CachingAllocator [14.0.x] #45342

Merged

HLT crashes in Run 380399 #44923

Closed

cmsbuild added code-checks-approved and removed code-checks-pending labels Jun 28, 2024

cmsbuild added fully-signed heterogeneous-approved and removed pending-signatures heterogeneous-pending labels Jun 28, 2024

cmsbuild added tests-rejected and removed tests-started labels Jun 28, 2024

cmsbuild added tests-approved tests-external-failure and removed tests-rejected labels Jul 1, 2024

cmsbuild added orp-approved and removed orp-pending labels Jul 1, 2024

cmsbuild merged commit 919a242 into cms-sw:master Jul 1, 2024
13 of 14 checks passed
