
encountered different devices in metric calculation #1361

Closed
1 task done
webcoderz opened this issue Jun 27, 2023 · 14 comments
Assignees
Labels
type:bug Something isn't working

Comments

@webcoderz

Prerequisites

Describe the bug

When running multi-threaded GPU training sessions, the metrics do not move their tensors from CPU to GPU, and I receive this error:
Encountered different devices in metric calculation. This could be due to the metrics class not being on same device as the input. Instead of MeanAbsoluteError() use MeanAbsoluteError().to(device)

To Reproduce

Steps to reproduce the behavior:

  1. Run a multi-threaded training session with GPU enabled in the trainer config.

Expected behavior

Metrics should move cleanly from CPU to GPU. Currently, utils_metrics.py only selects the error function to be used (L7); there is no device tracking.


Environment (please complete the following information):

  • Python 3.8.16
  • NeuralProphet 0.5.4, installed from PyPI with pip install neuralprophet


@leoniewgnr
Collaborator

Do you have a piece of minimal code for me? Then I'd be happy to fix the bug :)

@webcoderz
Author

Basically, wherever you bring the metrics into the class, you have to move them to the device. There is a dict in utils_metrics.py where you initialize the MeanAbsoluteError function, and when you call it, e.g. in forecast.py, you would have to do something like MeanAbsoluteError().to(device). I have been trying to figure out the proper entrypoint for this but had to move on to other work. My guess: wherever you determine whether the device is CPU or GPU and whether metrics are enabled, set it there.

@leoniewgnr
Collaborator

Thanks for your tip! It seems like we have found the error. We think the metrics are not properly configured inside the LightningModule: the torchmetrics documentation says there is "no need to call .to(device) anymore!" (https://torchmetrics.readthedocs.io/en/latest/pages/lightning.html), but that only holds if the metric is defined inside a LightningModule, and we are defining ours outside, in utils_metrics.py.
I'll try to fix that!
What was your code, so that I can check whether the same error still occurs?

@webcoderz
Author

webcoderz commented Jun 30, 2023

So basically I am doing multi-threaded model training runs on a batch dataframe with ThreadPoolExecutor(), something like:

import concurrent.futures
import logging

from neuralprophet import NeuralProphet

# trainer_config, best_params, batch_df, and args_list are defined elsewhere.
def train_model(*args):
    try:
        m = NeuralProphet(**trainer_config, **best_params)
        df_train, df_val = m.split_df(batch_df, freq='MS', valid_p=0.3)
        m.fit(df_train, freq='MS', validation_df=df_val)
    except Exception as e:
        logging.error(f"Error occurred while training model: {e}")

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(lambda args: train_model(*args), args_list)

@leoniewgnr
Collaborator

Hi @webcoderz, I tried to fix the bug and I think it works now, but I don't have a computer with a GPU. Would you like to check whether this PR fixes the bug? Then I'm happy to merge and do a new release. Thank you!

@webcoderz
Author

Yes, will test it tomorrow!

@webcoderz
Author

Sorry, got wrapped up at work getting a release pushed out. Will get on this first thing tomorrow!

@webcoderz
Author

LGTM!

@leoniewgnr
Collaborator

@webcoderz that's great! Happy to hear that! I will go ahead, merge, and make a new release.

@webcoderz
Author

Nice work! It went great 😊 This is going to save a bunch of time for the large GPU workloads I have.

@leoniewgnr
Collaborator

@webcoderz, so happy that I could help you. I actually also just started to train on a really huge dataset and really don't know where to start...
Would you mind if I send you some questions? I would really appreciate your help! Feel free to shoot me a message (leoniew@stanford.edu) or drop your mail here. Thank you a lot!

@webcoderz
Author

webcoderz commented Aug 11, 2023

Sure! cody.l.webb@gmail.com Feel free to reach out anytime!

@webcoderz
Author

This is great, you will get some experience working with big data that will help you understand some of the things that need to be considered here!

@leoniewgnr
Collaborator

Fixed with #1365.
