
Add special logic for 'step' in _optimizer_to_device #20019

Merged
merged 10 commits into Lightning-AI:master on Aug 5, 2024

Conversation

@corwinjoy (Contributor) commented on Jun 27, 2024

Fix performance degradation when restoring optimizer from checkpoint.
This fix is to address the issue discussed in #19955

Fixes #19955

This fix also addresses the related issue in PyTorch:
pytorch/pytorch#74424

This change could also use a test to check for continued performance, but I'm not sure how to write one.
On a dedicated GPU the transfer time is negligible; this really becomes an issue when the GPU is shared or when transfers are otherwise a bottleneck.
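
Roughly, the change amounts to the sketch below (illustrative only, not the exact diff in this PR, written against plain PyTorch rather than Lightning's internal helpers). PyTorch's Adam/AdamW keep 'step' as a CPU tensor unless the 'capturable' or 'fused' options are enabled, and moving it to the GPU forces a device-to-host sync on every optimizer step:

import torch
from torch import Tensor
from torch.optim import Optimizer

def _optimizer_to_device_sketch(optimizer: Optimizer, device: torch.device) -> None:
    # Sketch of the special handling for 'step', not the merged code.
    # See pytorch/pytorch#74424: once Adam's 'step' tensor lives on the GPU,
    # every optimizer.step() reads it back, forcing a device-to-host sync.
    fused = any(pg.get("fused") for pg in optimizer.param_groups)
    capturable = any(pg.get("capturable") for pg in optimizer.param_groups)

    for state in optimizer.state.values():
        for key, value in state.items():
            if key == "step" and not (fused or capturable):
                continue  # leave 'step' where the optimizer expects it (usually CPU)
            if isinstance(value, Tensor):
                state[key] = value.to(device)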


📚 Documentation preview 📚: https://pytorch-lightning--20019.org.readthedocs.build/en/20019/

github-actions bot added the fabric (lightning.fabric.Fabric) label on Jun 27, 2024
@corwinjoy (Contributor, Author) commented on Jun 27, 2024

Here is the performance information when using the test code from issue #19955 and continuing from a checkpoint. With the old code, many memory synchronizations are forced; with the update that keeps 'step' as-is, this issue is removed:

nsys profile --stats=true /home/cjoy/src/adam_gpu/.venv/bin/python /home/cjoy/src/adam_gpu/src/test.py


Original _optimizer_to_device function:
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)  Min (ns)   Max (ns)   StdDev (ns)            Operation          
 --------  ---------------  -----  -----------  --------  --------  ----------  ------------  ----------------------------
     60.4      129,388,373  4,094     31,604.4   1,344.0     1,024  16,394,576     672,557.9  [CUDA memcpy Device-to-Host]
     38.7       82,982,124     44  1,885,957.4     608.0       415  67,438,426  10,172,712.8  [CUDA memcpy Host-to-Device]
      0.9        1,971,518  2,000        985.8     992.0       416       2,368         166.8  [CUDA memset]            
      

With special handling for 'step' as per this PR:
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)  Min (ns)   Max (ns)   StdDev (ns)            Operation          
 --------  ---------------  -----  -----------  --------  --------  ----------  ------------  ----------------------------
     59.3      122,887,554     74  1,660,642.6   1,424.0     1,024  16,134,055   4,710,020.5  [CUDA memcpy Device-to-Host]
     39.8       82,420,918     34  2,424,144.6     799.5       415  67,068,637  11,489,504.0  [CUDA memcpy Host-to-Device]
      0.9        1,940,579  2,000        970.3     991.0       415       5,727         197.9  [CUDA memset] 
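
For what it's worth, one way to turn the open question about a performance test into an automated check (just an idea, not something shipped with this PR) is PyTorch's torch.cuda.set_sync_debug_mode, which can warn or error on operations that force a CUDA synchronization, such as the device-to-host copies dominating the first profile:

import torch

# Idea for a check, not code from this PR: make implicit CUDA synchronizations
# raise, then run one optimizer step after the state has been populated/restored.
model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.Adam(model.parameters())
model(torch.randn(4, 8, device="cuda")).sum().backward()
optimizer.step()  # populate optimizer state (exp_avg, exp_avg_sq, step)
optimizer.zero_grad()

model(torch.randn(4, 8, device="cuda")).sum().backward()
torch.cuda.set_sync_debug_mode("error")  # raise on synchronizing CUDA calls
try:
    # If 'step' had been moved to the GPU, Adam's bias-correction math would
    # call .item() on it and trigger a device-to-host sync here.
    optimizer.step()
finally:
    torch.cuda.set_sync_debug_mode("default")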
       

awaelchli added the bug (Something isn't working), performance, and optimization labels on Aug 3, 2024
awaelchli added this to the 2.4 milestone on Aug 3, 2024
codecov bot commented on Aug 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89%. Comparing base (345450b) to head (8bacec9).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #20019   +/-   ##
=======================================
  Coverage      89%      89%           
=======================================
  Files         267      267           
  Lines       23071    23076    +5     
=======================================
+ Hits        20572    20577    +5     
  Misses       2499     2499           

github-actions bot added the pl (Generic label for PyTorch Lightning package) label on Aug 3, 2024
@awaelchli (Contributor) left a comment

@corwinjoy Thanks for the PR. I added the missing tests to cover the change. There was a pre-existing test that allows dataclasses as part of the optimizer state, so I had to keep that functionality. With that, I think the PR is ready to land.
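
For reference, a regression test for the new behaviour could look roughly like the sketch below (an illustration only, assuming the helper is importable as lightning.fabric.utilities.optimizer._optimizer_to_device; the tests actually added in this PR may differ):

import torch
from lightning.fabric.utilities.optimizer import _optimizer_to_device  # assumed import path

def test_step_stays_on_cpu():
    # Sketch only, not the test added in this PR; requires a CUDA device.
    model = torch.nn.Linear(4, 4)
    optimizer = torch.optim.Adam(model.parameters())
    model(torch.randn(2, 4)).sum().backward()
    optimizer.step()  # populate state: exp_avg, exp_avg_sq, step

    _optimizer_to_device(optimizer, "cuda:0")

    for state in optimizer.state.values():
        # 'step' should stay on the CPU; the moment tensors should move.
        assert state["step"].device.type == "cpu"
        assert state["exp_avg"].device.type == "cuda"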

awaelchli added the community (This PR is from the community) label on Aug 3, 2024
mergify bot added the has conflicts label on Aug 5, 2024
mergify bot removed the has conflicts label on Aug 5, 2024
awaelchli merged commit 631911c into Lightning-AI:master on Aug 5, 2024
98 checks passed
ammyk9 pushed a commit to ammyk9/pytorch-lightning that referenced this pull request on Aug 6, 2024
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Labels: bug, community, fabric, optimization, performance, pl
Projects: none yet
Development: successfully merging this pull request may close the issue "Adam optimizer is slower after loading model from checkpoint" (#19955)
2 participants