
Fix issue with CodeCarbon lock #265

Merged: 3 commits from fix_codecarbon_lock_issue into main on Sep 20, 2024

Conversation

@regisss (Collaborator) commented Sep 18, 2024

Since CodeCarbon v2.7, a lock file is created (see here) to check whether another instance of codecarbon is already running. If so, an error is raised. One has to call my_energy_tracker.stop() to release the lock file.

This PR adds a stop method to the EnergyTracker class, which should be called once the energy tracker is not needed anymore.
It also updates an import from huggingface_hub, since v0.25 introduced a change (the new import is backward-compatible).

Fixes #260.
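
For context, here is a minimal sketch of the kind of wrapper involved (the class below is illustrative, not the actual optimum-benchmark code): the stop method just stops the underlying CodeCarbon tracker, which is what releases the lock file from CodeCarbon v2.7 on.

```python
# Illustrative sketch only -- not the real EnergyTracker from optimum-benchmark.
from codecarbon import EmissionsTracker


class EnergyTracker:
    def __init__(self, output_dir: str = "."):
        # The real class takes more configuration; this only shows the lock lifecycle.
        self.emission_tracker = EmissionsTracker(output_dir=output_dir, log_level="warning")

    def start(self):
        # Acquires CodeCarbon's lock file (v2.7+) and starts measuring.
        self.emission_tracker.start()

    def stop(self):
        # Stops measuring and releases the lock file so another instance can start.
        return self.emission_tracker.stop()
```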

@regisss (Collaborator, Author) commented Sep 18, 2024

@IlyasMoutawwakil We need to merge this fix in Optimum to make the CLI CUDA Torch-ORT Multi- and Single-GPU tests pass: huggingface/optimum#2028

@regisss (Collaborator, Author) commented Sep 18, 2024

For the CLI ROCm PyTorch Multi- and Single-GPU tests, the error is:

ERROR! Intel® Extension for PyTorch* needs to work with PyTorch 2.4.*, but PyTorch 2.2.0.dev20231010+rocm5.7 is found. Please switch to the matching version and run again.

Not sure why the Intel Extension for PyTorch is installed for this test, as I don't think we need it. Should we explicitly uninstall it in the workflow, @IlyasMoutawwakil?
Not sure if this is the actual reason for the failing tests though 🤔

@IlyasMoutawwakil (Member) commented

the failing tests are from the ongoing changes in #263

@IlyasMoutawwakil (Member) commented Sep 19, 2024

Acquiring the lock and releasing it between processes will probably not work (or will cause issues) in a multi-GPU/multi-process setting (torchrun). I can see one process waiting for the other to release the lock before it starts measuring, causing async issues. We actually don't test that (codecarbon + multi-GPU). I think a better thing to do here is to set allow_multiple_runs to True:
https://github.com/mlco2/codecarbon/blob/v2.7.0/codecarbon/emissions_tracker.py#L239-L255
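
For illustration, a minimal sketch of that suggestion, assuming the flag keeps the allow_multiple_runs name used in the linked emissions_tracker.py lines:

```python
# Sketch of the suggested alternative: let several CodeCarbon instances run
# concurrently instead of serializing them on the lock file.
from codecarbon import EmissionsTracker

# Skip the "another instance of codecarbon is already running" check.
tracker = EmissionsTracker(allow_multiple_runs=True)
tracker.start()
# ... workload ...
tracker.stop()
```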

@regisss (Collaborator, Author) commented Sep 19, 2024

> Acquiring the lock and releasing it between processes will probably not work (or will cause issues) in a multi-GPU/multi-process setting (torchrun). I can see one process waiting for the other to release the lock before it starts measuring, causing async issues. We actually don't test that (codecarbon + multi-GPU). I think a better thing to do here is to set allow_multiple_runs to True: https://github.com/mlco2/codecarbon/blob/v2.7.0/codecarbon/emissions_tracker.py#L239-L255

True, for multi-process runs this is needed. I wouldn't set allow_multiple_runs to True for single-process benchmarks though, since we don't really know how several trackers can interfere with each other.
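
A hypothetical way to implement that compromise (the WORLD_SIZE check below is just one possible heuristic for detecting a torchrun launch, not what this PR does):

```python
# Only allow multiple concurrent CodeCarbon runs when we are actually in a
# multi-process launch; keep the stricter single-run check otherwise.
import os

from codecarbon import EmissionsTracker

# torchrun exports WORLD_SIZE for every spawned process.
is_distributed = int(os.environ.get("WORLD_SIZE", "1")) > 1
tracker = EmissionsTracker(allow_multiple_runs=is_distributed)
```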

@regisss (Collaborator, Author) commented Sep 19, 2024

> the failing tests are from the ongoing changes in #263

Let's wait for #263 to be merged then

@IlyasMoutawwakil (Member) commented Sep 19, 2024

> we don't really know how several trackers can interfere with each other

Tbh, it would be much better if we had a single tracker in distributed settings
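
One possible sketch of that idea, measuring only on global rank 0 (illustrative only, not something this PR implements):

```python
# Single-tracker pattern for distributed runs: only rank 0 measures energy.
import os

from codecarbon import EmissionsTracker

rank = int(os.environ.get("RANK", "0"))  # set by torchrun for each process
tracker = EmissionsTracker() if rank == 0 else None

if tracker is not None:
    tracker.start()
# ... training / benchmarking workload ...
if tracker is not None:
    tracker.stop()
```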

@IlyasMoutawwakil marked this pull request as ready for review on September 20, 2024, 09:39
@IlyasMoutawwakil merged commit 1992de3 into main on Sep 20, 2024
49 of 56 checks passed
@regisss deleted the fix_codecarbon_lock_issue branch on September 20, 2024, 12:43

Successfully merging this pull request may close the following issue: "Error: Another instance of codecarbon is already running"