
Enable PyTorch to share the same memory pool as RMM via CLI #1392

Merged

Conversation

VibhuJawa
Member

@VibhuJawa VibhuJawa commented Oct 8, 2024

This PR closes: #1281

Usage example:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(rmm_allocator_external_lib_list=["torch", "cupy"])
client = Client(cluster)

Verify it is working:

def get_torch_allocator():
    import torch
    return torch.cuda.get_allocator_backend()

client.run(get_torch_allocator)
{'tcp://127.0.0.1:37167': 'pluggable',
 'tcp://127.0.0.1:38749': 'pluggable',
 'tcp://127.0.0.1:43109': 'pluggable',
 'tcp://127.0.0.1:44259': 'pluggable',
 'tcp://127.0.0.1:44953': 'pluggable',
 'tcp://127.0.0.1:45087': 'pluggable',
 'tcp://127.0.0.1:45623': 'pluggable',
 'tcp://127.0.0.1:45847': 'pluggable'}

Without it, the allocator backend is 'native'.
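
For context, routing an external library's allocations through RMM on each worker roughly amounts to the sketch below. It uses RMM's public allocator hooks (rmm.allocators.torch and rmm.allocators.cupy); the helper name is illustrative and this is not necessarily the exact code this PR adds.

def _enable_rmm_pool_for_libs(libs):
    # Illustrative sketch: route PyTorch and/or CuPy allocations through
    # RMM's memory pool; the PR's actual helper may differ.
    if "torch" in libs:
        import torch
        from rmm.allocators.torch import rmm_torch_allocator

        # After this, torch.cuda.get_allocator_backend() reports "pluggable",
        # as in the verification output above.
        torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
    if "cupy" in libs:
        import cupy
        from rmm.allocators.cupy import rmm_cupy_allocator

        cupy.cuda.set_allocator(rmm_cupy_allocator)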

Context: This helps NeMo-Curator use PyTorch together with dask-cuda more stably.

CC: @pentschev.

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
@github-actions github-actions bot added the python python code needed label Oct 8, 2024
@VibhuJawa VibhuJawa marked this pull request as ready for review October 8, 2024 00:37
@VibhuJawa VibhuJawa requested a review from a team as a code owner October 8, 2024 00:37
dask_cuda/cli.py — review thread (outdated, resolved)
Member

@pentschev pentschev left a comment


Looking good @VibhuJawa, I left some minor requests on organization, but we should be good to go afterwards.

dask_cuda/cli.py — review thread (outdated, resolved)
dask_cuda/local_cuda_cluster.py — review thread (outdated, resolved)
Comment on lines 273 to 275

if isinstance(rmm_allocator_external_lib_list, str):
    rmm_allocator_external_lib_list = []
Member

Suggested change (remove these lines):
if isinstance(rmm_allocator_external_lib_list, str):
    rmm_allocator_external_lib_list = []

Member Author


So I added a type check here. The reason is that I expect lazy users like me to pass in the same config they would pass on the CLI; for example, this is how the CLI looks right now:

dask-cuda-worker "tcp://10.33.227.161:8786" --set-rmm-allocator-for-libs "torch"

With the updated behavior we complain loudly (see the example below):

cluster = LocalCUDACluster(rmm_allocator_external_lib_list="torch")
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 cluster = LocalCUDACluster(rmm_allocator_external_lib_list="torch")

File ~/dask-cuda/dask_cuda/local_cuda_cluster.py:275, in LocalCUDACluster.__init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, enable_cudf_spill, cudf_spill_stats, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, rmm_pool_size, rmm_maximum_pool_size, rmm_managed_memory, rmm_async, rmm_allocator_external_lib_list, rmm_release_threshold, rmm_log_directory, rmm_track_allocations, jit_unspill, log_spilling, worker_class, pre_import, **kwargs)
    272     raise ValueError("Number of workers cannot be less than 1.")
    274 if rmm_allocator_external_lib_list is not None and not isinstance(rmm_allocator_external_lib_list, list):
--> 275     raise ValueError(
    276         "rmm_allocator_external_lib_list must be a list of strings. "
    277         "Valid examples: ['torch'], ['cupy'], or ['torch', 'cupy']. "
    278         f"Received: {type(rmm_allocator_external_lib_list)} "
    279         f"with value: {rmm_allocator_external_lib_list}"
    280     )
    282 # Set nthreads=1 when parsing mem_limit since it only depends on n_workers
    283 logger = logging.getLogger(__name__)

ValueError: rmm_allocator_external_lib_list must be a list of strings. Valid examples: ['torch'], ['cupy'], or ['torch', 'cupy']. Received: <class 'str'> with value: torch

Member


Makes sense, but in that case I think the amount of work/code to support a string is about the same. Instead of raising the exception, should we just support passing a comma-separated string as well?

Member Author


Added it here: 2517874.

Let me know if you want me to change anything. Thanks for the suggestion, I agree it made sense.
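
For reference, accepting a comma-separated string in addition to a list can be handled by a small normalization step along these lines (names are illustrative; the actual change is in commit 2517874):

def _normalize_rmm_lib_list(value):
    # Illustrative sketch only, not necessarily the helper from 2517874.
    if value is None:
        return None
    if isinstance(value, str):
        # Accept the same comma-separated form the CLI uses,
        # e.g. "torch,cupy" -> ["torch", "cupy"].
        return [v.strip() for v in value.split(",") if v.strip()]
    if isinstance(value, list):
        return value
    raise ValueError(
        "rmm_allocator_external_lib_list must be a list of strings "
        f"or a comma-separated string, got {type(value)}"
    )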

dask_cuda/utils.py — 3 review threads (outdated, resolved)
Member

@pentschev pentschev left a comment


LGTM, thanks @VibhuJawa!

@pentschev
Member

/merge

@rapids-bot rapids-bot bot merged commit 8d88006 into rapidsai:branch-24.12 Oct 12, 2024
27 checks passed
Labels
3 - Ready for Review, feature request, non-breaking, python
Development

Successfully merging this pull request may close these issues.

Add cli option to enable pytorch to use same memory pool as rapids.
2 participants