-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving Hydra+DDP support #11617
Improving Hydra+DDP support #11617
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really neat! Mind moving the hydra-specific logic to its own utility file?
@tchaton I'm working through some issues with |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
tests/tests_pytorch/strategies/launchers/test_subprocess_script.py
Outdated
Show resolved
Hide resolved
Finally merged! Thank you so much for your time @jgbos |
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: rohitgr7 <rohitgr1998@gmail.com> Co-authored-by: Jirka <jirka.borovec@seznam.cz>
@jgbos @awaelchli @carmocca @rohitgr7
These issues was not happended with hydra-submitit-launcher (with slurm). |
@Liangtaiwan Not sure what is going on there. Could you report the exact issue, steps to reproduce, and environment you run in a GitHub issue? Perhaps @jgbos will be able to help too? |
@Liangtaiwan sorry, I would need to know more about the error. Definitely open an issue and I will try to help. Would be good to know Hydra version and how you executed the script. Also, this error should not occur if launching on slurm as each gpu task is launched via the submission script (i.e., submitit launcher) instead of this script. |
This reverts commit 45ca781.
This reverts commit 45ca781.
This reverts commit 45ca781.
What does this PR do?
There are documented issues for Hydra+DDP in #11300 and #2727. This PR attempts to fix these issues by redefining how a process is spawned when Hydra is used.
Fixes #11300
Problem
Current approach is defined here: https://github.com/PyTorchLightning/pytorch-lightning/blob/fe34bf2a653ebd50e6a3a00be829e3611f820c3c/pytorch_lightning/strategies/ddp.py#L233
This PR addresses the issue of running with Hydra
multirun
. For example, lets say we have the following Hydra app:We can execute a
multirun
job with this app using the following command:This command will attempt to launch 2 jobs sequentially: one with
foo=1
and one withfoo=2
. For the first job,foo=1
, PL launchers a normal job that begins execution while the second job is spawned with a subprocess using the following command derived fromsys.argv
:This will spawn a new
mutlirun
job instead of running a normal job withfoo=2
. The command should beSolution
Every Hydra process will save the reproducible configuration of the job in a
config.yaml
file located in thehydra.output_subdir
experiment directory. Using Hydra's CLI, we can execute the app with same configuration as the current experiment by defining--config_path, -cp
and--config_name, -cn
:Here
config.yaml
contains the value forfoo
appropriate for the currentmultirun
job. I've outlined the support formultirun
, but this should support any Hydra application launched from the command line.Lingering Issue Not Solved
In order to run
multirun
on a local machine the user must add additional code to their task function before launching the next job. This code, shown below, will destroy all distributed processes and remove PL related environment variables. Without this code themultirun
job will hang after the first job.Thoughts
This solution should help most people, but IMHO the current method of spawning using
sys.argv
is not robust. It would be nice to be able to execute a PL application similar to how I implemented the Hydra solution. This would require dynamically creating a configurations for the application. The use ofsave_hyperparameters
is nice but is only done inside the creation of aLightningModule
, it would be nice to create a description of the model at a higher level — outside of the module. By defining the description outside of the module, one could use a similar approach as the Hydra solution above for PL applications. I would recommend taking a look at hydra-zen (FYI, I'm a contributor).Does your PR introduce any breaking changes? If yes, please list them.
None
Before submitting
Personal TODO:
PL TODO: