Better graceful shutdown for KeyboardInterrupt #19976

Merged on Jun 16, 2024 (16 commits)

@awaelchli awaelchli commented Jun 14, 2024

What does this PR do?

When the user sends KeyboardInterrupt (Ctrl+C), Lightning runs shutdown logic. However, it does not guarantee that all processes get stopped. Users sometimes get hanging zombie processes. Furthermore, our logic does not play well when the user spams Ctrl+C repeatedly. This PR addresses both concerns.
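
The two concerns above can be illustrated with a minimal, generic sketch. This is not the code from this PR: `run_with_graceful_exit` and `shutdown_children` are hypothetical names, and the child processes are assumed to be plain `subprocess.Popen` objects. The idea is to ignore further SIGINT once cleanup starts, so repeated Ctrl+C cannot abort it, and to terminate every child, escalating to a kill if one does not exit in time.

```python
import signal
import subprocess
import sys
import threading
from typing import Callable, List


def shutdown_children(procs: List[subprocess.Popen], timeout: float = 5.0) -> None:
    """Ask every live child to terminate, then kill any that does not exit in time."""
    for proc in procs:
        if proc.poll() is None:  # still running
            proc.terminate()
    for proc in procs:
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def run_with_graceful_exit(work: Callable[[], None], procs: List[subprocess.Popen]) -> int:
    """Run `work`; on Ctrl+C, clean up the children and return a non-zero code."""
    try:
        work()
        return 0
    except KeyboardInterrupt:
        # Signal handlers can only be changed from the main thread.
        in_main = threading.current_thread() is threading.main_thread()
        previous = None
        if in_main:
            # Ignore further Ctrl+C so repeated presses cannot interrupt cleanup.
            previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
        try:
            shutdown_children(procs)
        finally:
            if in_main and previous is not None:
                signal.signal(signal.SIGINT, previous)
        return 1


if __name__ == "__main__":
    # Hypothetical worker: a child process that would otherwise outlive the parent.
    worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

    def training_loop() -> None:
        raise KeyboardInterrupt  # stands in for the user pressing Ctrl+C

    sys.exit(run_with_graceful_exit(training_loop, [worker]))
```

The sketch restores the previous SIGINT handler after cleanup so the caller stays interruptible; a program that exits immediately after cleanup would not need that step.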

Video Demo:
https://www.loom.com/share/a7e105baab5a493b89412434abb7c7fc?sid=1ab5b6b4-f9a1-46a1-888d-726336c5f360


📚 Documentation preview 📚: https://pytorch-lightning--19976.org.readthedocs.build/en/19976/

@github-actions github-actions bot added the "pl" label (Generic label for PyTorch Lightning package) Jun 14, 2024

github-actions bot commented Jun 14, 2024

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, lightning, 3.8, 2.0, oldest) success
pl-cpu (macOS-11, lightning, 3.10, 2.0) success
pl-cpu (macOS-11, lightning, 3.10, 2.1) success
pl-cpu (macOS-11, lightning, 3.10, 2.2) success
pl-cpu (macOS-14, lightning, 3.10, 2.3) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 2.0, oldest) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.1) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.2) success
pl-cpu (ubuntu-20.04, lightning, 3.10, 2.3) success
pl-cpu (windows-2022, lightning, 3.8, 2.0, oldest) success
pl-cpu (windows-2022, lightning, 3.10, 2.0) success
pl-cpu (windows-2022, lightning, 3.10, 2.1) success
pl-cpu (windows-2022, lightning, 3.10, 2.2) success
pl-cpu (windows-2022, lightning, 3.10, 2.3) success
pl-cpu (macOS-11, pytorch, 3.8, 2.0) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 2.0) success
pl-cpu (windows-2022, pytorch, 3.8, 2.0) success
pl-cpu (macOS-12, pytorch, 3.11, 2.0) success
pl-cpu (macOS-12, pytorch, 3.11, 2.1) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.0) success
pl-cpu (ubuntu-22.04, pytorch, 3.11, 2.1) success
pl-cpu (windows-2022, pytorch, 3.11, 2.0) success
pl-cpu (windows-2022, pytorch, 3.11, 2.1) success

These checks are required after the changes to src/lightning/fabric/utilities/distributed.py, src/lightning/pytorch/strategies/launchers/multiprocessing.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/call.py, src/lightning/pytorch/trainer/connectors/signal_connector.py, tests/tests_pytorch/callbacks/progress/test_rich_progress_bar.py, tests/tests_pytorch/callbacks/test_lambda_function.py, tests/tests_pytorch/trainer/test_states.py, tests/tests_pytorch/trainer/test_trainer.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) (testing Lightning | latest) success
pytorch-lightning (GPUs) (testing PyTorch | latest) success

These checks are required after the changes to src/lightning/pytorch/strategies/launchers/multiprocessing.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/call.py, src/lightning/pytorch/trainer/connectors/signal_connector.py, tests/tests_pytorch/callbacks/progress/test_rich_progress_bar.py, tests/tests_pytorch/callbacks/test_lambda_function.py, tests/tests_pytorch/trainer/test_states.py, tests/tests_pytorch/trainer/test_trainer.py, src/lightning/fabric/utilities/distributed.py.

🟢 pytorch_lightning: Benchmarks
Check ID Status
lightning.Benchmarks success

These checks are required after the changes to src/lightning/fabric/utilities/distributed.py, src/lightning/pytorch/strategies/launchers/multiprocessing.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/call.py, src/lightning/pytorch/trainer/connectors/signal_connector.py.

🟢 fabric: Docs
Check ID Status
docs-make (fabric, doctest) success
docs-make (fabric, html) success

These checks are required after the changes to src/lightning/fabric/utilities/distributed.py.

🟢 pytorch_lightning: Docs
Check ID Status
docs-make (pytorch, doctest) success
docs-make (pytorch, html) success

These checks are required after the changes to src/lightning/pytorch/strategies/launchers/multiprocessing.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/call.py, src/lightning/pytorch/trainer/connectors/signal_connector.py.

🟢 lightning_fabric: CPU workflow
Check ID Status
fabric-cpu (macOS-11, lightning, 3.8, 2.0, oldest) success
fabric-cpu (macOS-11, lightning, 3.10, 2.0) success
fabric-cpu (macOS-11, lightning, 3.11, 2.1) success
fabric-cpu (macOS-11, lightning, 3.11, 2.2) success
fabric-cpu (macOS-14, lightning, 3.10, 2.3) success
fabric-cpu (ubuntu-20.04, lightning, 3.8, 2.0, oldest) success
fabric-cpu (ubuntu-20.04, lightning, 3.10, 2.0) success
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.1) success
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.2) success
fabric-cpu (ubuntu-20.04, lightning, 3.11, 2.3) success
fabric-cpu (windows-2022, lightning, 3.8, 2.0, oldest) success
fabric-cpu (windows-2022, lightning, 3.10, 2.0) success
fabric-cpu (windows-2022, lightning, 3.11, 2.1) success
fabric-cpu (windows-2022, lightning, 3.11, 2.2) success
fabric-cpu (windows-2022, lightning, 3.11, 2.3) success
fabric-cpu (macOS-11, fabric, 3.8, 2.0) success
fabric-cpu (ubuntu-20.04, fabric, 3.8, 2.0) success
fabric-cpu (windows-2022, fabric, 3.8, 2.0) success
fabric-cpu (macOS-12, fabric, 3.11, 2.0) success
fabric-cpu (macOS-12, fabric, 3.11, 2.1) success
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.0) success
fabric-cpu (ubuntu-22.04, fabric, 3.11, 2.1) success
fabric-cpu (windows-2022, fabric, 3.11, 2.0) success
fabric-cpu (windows-2022, fabric, 3.11, 2.1) success

These checks are required after the changes to src/lightning/fabric/utilities/distributed.py.

🟢 lightning_fabric: Azure GPU
Check ID Status
lightning-fabric (GPUs) (testing Fabric | latest) success
lightning-fabric (GPUs) (testing Lightning | latest) success

These checks are required after the changes to src/lightning/fabric/utilities/distributed.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/lightning/fabric/utilities/distributed.py, src/lightning/pytorch/strategies/launchers/multiprocessing.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/call.py, src/lightning/pytorch/trainer/connectors/signal_connector.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.8) success
install-pkg (ubuntu-22.04, app, 3.11) success
install-pkg (ubuntu-22.04, fabric, 3.8) success
install-pkg (ubuntu-22.04, fabric, 3.11) success
install-pkg (ubuntu-22.04, pytorch, 3.8) success
install-pkg (ubuntu-22.04, pytorch, 3.11) success
install-pkg (ubuntu-22.04, lightning, 3.8) success
install-pkg (ubuntu-22.04, lightning, 3.11) success
install-pkg (ubuntu-22.04, notset, 3.8) success
install-pkg (ubuntu-22.04, notset, 3.11) success
install-pkg (macOS-12, app, 3.8) success
install-pkg (macOS-12, app, 3.11) success
install-pkg (macOS-12, fabric, 3.8) success
install-pkg (macOS-12, fabric, 3.11) success
install-pkg (macOS-12, pytorch, 3.8) success
install-pkg (macOS-12, pytorch, 3.11) success
install-pkg (macOS-12, lightning, 3.8) success
install-pkg (macOS-12, lightning, 3.11) success
install-pkg (macOS-12, notset, 3.8) success
install-pkg (macOS-12, notset, 3.11) success
install-pkg (windows-2022, app, 3.8) success
install-pkg (windows-2022, app, 3.11) success
install-pkg (windows-2022, fabric, 3.8) success
install-pkg (windows-2022, fabric, 3.11) success
install-pkg (windows-2022, pytorch, 3.8) success
install-pkg (windows-2022, pytorch, 3.11) success
install-pkg (windows-2022, lightning, 3.8) success
install-pkg (windows-2022, lightning, 3.11) success
install-pkg (windows-2022, notset, 3.8) success
install-pkg (windows-2022, notset, 3.11) success

These checks are required after the changes to src/lightning/fabric/utilities/distributed.py, src/lightning/pytorch/strategies/launchers/multiprocessing.py, src/lightning/pytorch/strategies/launchers/subprocess_script.py, src/lightning/pytorch/trainer/call.py, src/lightning/pytorch/trainer/connectors/signal_connector.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates for 60 minutes every 180 seconds. If you have any other questions, contact carmocca for help.

@awaelchli awaelchli added this to the 2.4 milestone Jun 14, 2024
@github-actions github-actions bot added the "fabric" label (lightning.fabric.Fabric) Jun 14, 2024

codecov bot commented Jun 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 59%. Comparing base (bb511b0) to head (5c87355).
Report is 87 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (bb511b0) and HEAD (5c87355).

HEAD has 401 uploads less than BASE
Flag BASE (bb511b0) HEAD (5c87355)
python3.10 38 16
cpu 146 48
lightning 88 32
pytest 103 28
python3.8 29 12
lightning_fabric 28 10
python3.11 53 20
examples 9 0
app 9 0
tpu 1 0
pytorch2.0 24 12
pytest-full 48 24
pytorch_lightning 19 10
pytorch2.2 6 3
pytorch2.1 12 6
pytorch2.3 6 3
lightning_app 6 0
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #19976     +/-   ##
=========================================
- Coverage      84%      59%    -25%     
=========================================
  Files         426      421      -5     
  Lines       35284    35195     -89     
=========================================
- Hits        29620    20785   -8835     
- Misses       5664    14410   +8746     

@lantiga lantiga left a comment

Epic

@awaelchli awaelchli changed the title from "WIP: Better graceful shutdown for KeyboardInterrupt" to "Better graceful shutdown for KeyboardInterrupt" Jun 16, 2024
@awaelchli awaelchli merged commit c1af4d0 into master Jun 16, 2024
118 checks passed
@awaelchli awaelchli deleted the feature/graceful-exit branch June 16, 2024 14:43
@quentinblampey

Hello @awaelchli,

I recently upgraded Lightning and discovered this change. Previously, we could stop the training without stopping the rest of the program. I found it useful to be able to stop the training in a notebook and then continue experimenting. Of course, this is not useful for training a "real" model, but it's pretty nice when exploring and testing new things.

Is there another reason for such a change (apart from better stopping the processes)? Or maybe I'm missing something?

Thanks for your help!
