
[FEA] PyTorch and RMM sharing memory pool #501

Closed
brhodes10 opened this issue Aug 17, 2020 · 6 comments
Labels
? - Needs Triage · feature request · inactive-30d · inactive-90d

Comments

@brhodes10

Is your feature request related to a problem? Please describe.
Currently I'm running a streamz workflow that uses PyTorch, and I continue to encounter errors like the one below, where PyTorch is unable to allocate enough memory.

RuntimeError: CUDA out of memory. Tried to allocate 376.00 MiB (GPU 0; 31.72 GiB total capacity; 29.02 GiB already allocated; 244.88 MiB free; 29.80 GiB reserved in total by PyTorch)

I'm wondering if PyTorch and RMM are competing for memory and, if so, whether there is a recommended way to manage this.

Describe the solution you'd like
If possible, have PyTorch and RMM use the same memory pool, or provide a recommended method for resolving this type of memory issue.
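One interim workaround might be to cap RMM's pool so that PyTorch retains headroom on the same GPU. A minimal sketch, assuming rmm.reinitialize with initial_pool_size/maximum_pool_size; the sizes here are illustrative, not taken from this workflow:

```python
import rmm

# Cap RMM's pool so PyTorch has headroom on the same GPU.
# Sizes are illustrative for a ~32 GiB card shared with a BERT model.
rmm.reinitialize(
    pool_allocator=True,
    initial_pool_size=8 * 2**30,   # start with 8 GiB
    maximum_pool_size=16 * 2**30,  # never grow past 16 GiB
)
```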

Describe alternatives you've considered
None

Additional context
The streamz workflow, end to end, can be found here. In summary, it first initializes a streamz workflow that uses Dask to read data from Kafka. It then processes that data using cyBERT inferencing, which can be found here. cyBERT uses cuDF for the data pre-processing steps and a BERT model for inferencing. The processed data is then published back to Kafka.

brhodes10 added the ? - Needs Triage and feature request labels on Aug 17, 2020
@VibhuJawa
Member

There was some internal discussion about a related issue (27) that plagued the HF implementation, and it was suggested that a path forward could be:

  • We create a PyTorch memory resource for RMM, allowing RMM to use the same memory pool as PyTorch (see the sketch below).
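A minimal sketch of what such a resource could look like, assuming RMM's CallbackMemoryResource (callback signatures follow its documented example) and PyTorch's caching_allocator_alloc/caching_allocator_delete hooks; this is not necessarily the implementation that was ultimately adopted:

```python
import rmm
import torch

def allocate(size):
    # Ask PyTorch's caching allocator for the block, so RMM and PyTorch
    # draw from the same pool.
    return torch.cuda.caching_allocator_alloc(size)

def deallocate(ptr, size):
    # Return the block to PyTorch's caching allocator.
    torch.cuda.caching_allocator_delete(ptr)

# Route all RMM allocations on the current device through PyTorch's pool.
mr = rmm.mr.CallbackMemoryResource(allocate, deallocate)
rmm.mr.set_current_device_resource(mr)
```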

@jakirkham
Member

Another idea that came up was using RMM within PyTorch, possibly via an external memory allocator (as was done with CuPy and Numba), or possibly even through direct usage (as has recently been done with XGBoost). I have filed this as pytorch/pytorch#43144.
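For reference, the CuPy/Numba pattern mentioned here looks roughly like the following; the module paths are those of recent RMM releases (older releases exposed rmm.rmm_cupy_allocator directly):

```python
import cupy
import rmm
from numba import cuda
from rmm.allocators.cupy import rmm_cupy_allocator
from rmm.allocators.numba import RMMNumbaManager

# Put RMM in pool mode, then point both libraries' allocators at it.
rmm.reinitialize(pool_allocator=True)

# CuPy: plug RMM in as the external allocator.
cupy.cuda.set_allocator(rmm_cupy_allocator)

# Numba: use RMM as the CUDA memory manager.
cuda.set_memory_manager(RMMNumbaManager)
```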

@jakirkham
Member

We create a PyTorch memory resource for RMM to allow RMM to use the same memory pool as PyTorch.

On this usage pattern it's worth looking at how CuPy did something similar.

xref: pytorch/pytorch#33860
xref: cupy/cupy#3126

@github-actions

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@VibhuJawa
Member

This was closed by #1168. Can we close this?
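If #1168 refers to the PyTorch allocator work, the resulting usage would look roughly like this, assuming a PyTorch build with pluggable-allocator support (torch.cuda.memory.change_current_allocator) and RMM's rmm_torch_allocator:

```python
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator

# Back both RMM and PyTorch allocations with a single RMM pool.
rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```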
