
Enable PyTorch to share the same memory pool as RMM via CLI #1392

Merged

Conversation

VibhuJawa
Member

@VibhuJawa VibhuJawa commented Oct 8, 2024

This PR closes: #1281

Usage example:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(rmm_allocator_external_lib_list=["torch", "cupy"])
client = Client(cluster)

Verify it is working:

def get_torch_allocator():
    import torch
    return torch.cuda.get_allocator_backend()

client.run(get_torch_allocator)
{'tcp://127.0.0.1:37167': 'pluggable',
 'tcp://127.0.0.1:38749': 'pluggable',
 'tcp://127.0.0.1:43109': 'pluggable',
 'tcp://127.0.0.1:44259': 'pluggable',
 'tcp://127.0.0.1:44953': 'pluggable',
 'tcp://127.0.0.1:45087': 'pluggable',
 'tcp://127.0.0.1:45623': 'pluggable',
 'tcp://127.0.0.1:45847': 'pluggable'}

Without it, the allocator backend is 'native'.
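
For context, routing an external library's allocations through RMM on each worker roughly amounts to the sketch below. It uses RMM's public allocator hooks (rmm.allocators.torch and rmm.allocators.cupy); the helper name is illustrative and this is not necessarily the exact code this PR adds.

def _enable_rmm_pool_for_libs(libs):
    # Illustrative sketch: route PyTorch and/or CuPy allocations through
    # RMM's memory pool; the PR's actual helper may differ.
    if "torch" in libs:
        import torch
        from rmm.allocators.torch import rmm_torch_allocator

        # After this, torch.cuda.get_allocator_backend() reports "pluggable",
        # as in the verification output above.
        torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
    if "cupy" in libs:
        import cupy
        from rmm.allocators.cupy import rmm_cupy_allocator

        cupy.cuda.set_allocator(rmm_cupy_allocator)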

Context: This helps NeMo-Curator use PyTorch together with dask-cuda more stably.

CC: @pentschev.

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
@github-actions github-actions bot added the python python code needed label Oct 8, 2024
@VibhuJawa VibhuJawa marked this pull request as ready for review October 8, 2024 00:37
@VibhuJawa VibhuJawa requested a review from a team as a code owner October 8, 2024 00:37
dask_cuda/cli.py — review thread (outdated, resolved)
Member

@pentschev pentschev left a comment


Looking good @VibhuJawa, I left some minor requests on organization, but we should be good to go afterwards.

dask_cuda/cli.py — review thread (outdated, resolved)
dask_cuda/local_cuda_cluster.py — review thread (outdated, resolved)
Comment on lines 273 to 275

if isinstance(rmm_allocator_external_lib_list, str):
    rmm_allocator_external_lib_list = []
Member

Suggested change (remove these lines):
if isinstance(rmm_allocator_external_lib_list, str):
    rmm_allocator_external_lib_list = []

Member Author


So I added a type check here. The reason is that I expect lazy users like me to pass in the same config they would pass on the CLI; for example, this is how the CLI looks right now:

dask-cuda-worker "tcp://10.33.227.161:8786" --set-rmm-allocator-for-libs "torch"

With the updated behavior we complain loudly (see the example below):

cluster = LocalCUDACluster(rmm_allocator_external_lib_list="torch")
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 cluster = LocalCUDACluster(rmm_allocator_external_lib_list="torch")

File ~/dask-cuda/dask_cuda/local_cuda_cluster.py:275, in LocalCUDACluster.__init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, enable_cudf_spill, cudf_spill_stats, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, rmm_pool_size, rmm_maximum_pool_size, rmm_managed_memory, rmm_async, rmm_allocator_external_lib_list, rmm_release_threshold, rmm_log_directory, rmm_track_allocations, jit_unspill, log_spilling, worker_class, pre_import, **kwargs)
    272     raise ValueError("Number of workers cannot be less than 1.")
    274 if rmm_allocator_external_lib_list is not None and not isinstance(rmm_allocator_external_lib_list, list):
--> 275     raise ValueError(
    276         "rmm_allocator_external_lib_list must be a list of strings. "
    277         "Valid examples: ['torch'], ['cupy'], or ['torch', 'cupy']. "
    278         f"Received: {type(rmm_allocator_external_lib_list)} "
    279         f"with value: {rmm_allocator_external_lib_list}"
    280     )
    282 # Set nthreads=1 when parsing mem_limit since it only depends on n_workers
    283 logger = logging.getLogger(__name__)

ValueError: rmm_allocator_external_lib_list must be a list of strings. Valid examples: ['torch'], ['cupy'], or ['torch', 'cupy']. Received: <class 'str'> with value: torch

Member


Makes sense, but in that case I think the amount of work/code to support a string is about the same. Instead of raising the exception, should we just support passing a comma-separated string as well?

Member Author


Added it here: 2517874.

Let me know if you want me to change anything. Thanks for the suggestion, I agree it made sense.
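
For reference, accepting a comma-separated string in addition to a list can be handled by a small normalization step along these lines (names are illustrative; the actual change is in commit 2517874):

def _normalize_rmm_lib_list(value):
    # Illustrative sketch only, not necessarily the helper from 2517874.
    if value is None:
        return None
    if isinstance(value, str):
        # Accept the same comma-separated form the CLI uses,
        # e.g. "torch,cupy" -> ["torch", "cupy"].
        return [v.strip() for v in value.split(",") if v.strip()]
    if isinstance(value, list):
        return value
    raise ValueError(
        "rmm_allocator_external_lib_list must be a list of strings "
        f"or a comma-separated string, got {type(value)}"
    )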

dask_cuda/utils.py — 3 review threads (outdated, resolved)
Member

@pentschev pentschev left a comment


LGTM, thanks @VibhuJawa!

@pentschev
Member

/merge

@rapids-bot rapids-bot bot merged commit 8d88006 into rapidsai:branch-24.12 Oct 12, 2024
27 checks passed
Labels
3 - Ready for Review, feature request, non-breaking, python
Development

Successfully merging this pull request may close these issues.

Add cli option to enable pytorch to use same memory pool as rapids.
2 participants