
encountered different devices in metric calculation #1361

Closed
1 task done
webcoderz opened this issue Jun 27, 2023 · 14 comments
Assignees
Labels
type:bug Something isn't working

Comments

@webcoderz

Prerequisites

Describe the bug

When running multi-threaded GPU training sessions, the metrics do not move their tensors from CPU to GPU, and I receive this error:
Encountered different devices in metric calculation. This could be due to the metrics class not being on same device as the input. Instead of MeanAbsoluteError() use MeanAbsoluteError().to(device)

To Reproduce

Steps to reproduce the behavior:

  1. Run a multi-threaded training session with GPU enabled in the trainer config.

Expected behavior

Metrics should move cleanly from CPU to GPU. Currently, utils_metrics.py only selects the error function to be used (L7); there is no device tracking.


Environment (please complete the following information):

  • Python 3.8.16
  • NeuralProphet 0.5.4, installed from PyPI with pip install neuralprophet


@leoniewgnr
Collaborator

Do you have a piece of minimal code for me? Then I'd be happy to fix the bug :)

@webcoderz
Author

Basically, wherever you bring the metrics into the class, you have to move them to the device. There is a dict in utils_metrics.py where you initialize the MeanAbsoluteError function, and when you call it, e.g. in forecast.py, you would have to do something like MeanAbsoluteError().to(device). I have been trying to figure out the proper entrypoint for this but had to move on to other work. My guess: wherever you determine whether the device is CPU or GPU and whether metrics are enabled, set it there.

@leoniewgnr
Collaborator

Thanks for your tip! It seems like we have found the error. We think the metrics are not properly configured inside the LightningModule: the torchmetrics documentation says there is "no need to call .to(device) anymore!" (https://torchmetrics.readthedocs.io/en/latest/pages/lightning.html), but that only holds if the metric is defined inside a LightningModule, and we are defining ours outside, in utils_metrics.py.
I'll try to fix that!
What was your code, so that I can check whether the same error still occurs?

@webcoderz
Author

webcoderz commented Jun 30, 2023

So basically I am doing multi-threaded model training runs on a batch dataframe with ThreadPoolExecutor(), something like:

import concurrent.futures
import logging

from neuralprophet import NeuralProphet

# trainer_config, best_params, batch_df, and args_list are defined elsewhere.
def train_model(*args):
    try:
        m = NeuralProphet(**trainer_config, **best_params)
        df_train, df_val = m.split_df(batch_df, freq='MS', valid_p=0.3)
        m.fit(df_train, freq='MS', validation_df=df_val)
    except Exception as e:
        logging.error(f"Error occurred while training model: {e}")

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(lambda args: train_model(*args), args_list)

@leoniewgnr
Collaborator

Hi @webcoderz, I tried to fix the bug and I think it works now, but I don't have a computer with a GPU. Would you like to check whether this PR fixes the bug? Then I'm happy to merge and do a new release. Thank you!

@webcoderz
Author

Yes, will test it tomorrow!

@webcoderz
Author

Sorry, got wrapped up at work getting a release pushed out. Will get on this first thing tomorrow!

@webcoderz
Author

LGTM!

@leoniewgnr
Collaborator

@webcoderz that's great! Happy to hear that! I will go ahead, merge, and make a new release.

@webcoderz
Author

Nice work! It went great 😊 This is going to save a bunch of time for the large GPU workloads I have.

@leoniewgnr
Collaborator

@webcoderz, so happy that I could help you. I actually also just started to train on a really huge dataset and really don't know where to start...
Would you mind if I send you some questions? I would really appreciate your help! Feel free to shoot me a message (leoniew@stanford.edu) or drop your mail here. Thank you a lot!

@webcoderz
Author

webcoderz commented Aug 11, 2023

Sure! cody.l.webb@gmail.com Feel free to reach out anytime!

@webcoderz
Author

This is great, you will get some experience working with big data that will help you understand some of the things that need to be considered here!

@leoniewgnr
Collaborator

Fixed with #1365.
