
UCM/CUDA/TEST: Install memory hooks for async Cuda allocations #7204

Conversation

@yosefe (Contributor) commented Aug 7, 2021

Why

As discussed in #7194 and #7110, we need to add memory-hook support for CUDA async allocations. Without this, applications using these allocations may fail to detect CUDA memory and run into segfaults/access errors.

@yosefe (Contributor, Author) commented Aug 9, 2021

@Akshay-Venkatesh WDYT?

@yosefe (Contributor, Author) commented Aug 9, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

AC_CHECK_DECLS([cuMemAllocAsync, cuMemFreeAsync], [], [],
               [[#include <cuda.h>]])
AC_CHECK_DECLS([cudaMallocAsync, cudaFreeAsync], [], [],
               [[#include <cuda_runtime.h>]])
])
Contributor:

Does this mean that HAVE_CUDA is not set if the *Async APIs aren't detected at configure time? That would disallow CUDA for slightly older versions of CUDA, wouldn't it?

I'm probably missing the commit that defines HAVE_DECL_CUMEMALLOCASYNC/HAVE_DECL_CUMEMFREEASYNC

Contributor Author:

It does not affect HAVE_CUDA; it sets a different set of macros, specific to the async APIs.
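
For illustration, a sketch of how the generated guards might be used (AC_CHECK_DECLS defines HAVE_DECL_CUMEMALLOCASYNC to 0 or 1, so the test is #if rather than #ifdef; the hook line mirrors the hunk below):

/* Sketch, not the exact PR code: compile the async hooks only when the
 * declarations were found at configure time, leaving HAVE_CUDA untouched. */
#if HAVE_DECL_CUMEMALLOCASYNC
UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAllocAsync, CUresult, -1, CUdeviceptr*,
                                  size_t, CUstream)
#endif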

@@ -75,6 +75,8 @@ UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAlloc, CUresult, -1, CUdeviceptr*,
                                   size_t)
 UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAlloc_v2, CUresult, -1, CUdeviceptr*,
                                   size_t)
+UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAllocAsync, CUresult, -1, CUdeviceptr*,
+                                  size_t, CUstream)
Contributor:

@yosefe I think we should also intercept cuMemAllocFromPoolAsync

Contributor Author:

ok, will add
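
Presumably something like the following, after the pattern in the hunk above (a sketch; the argument list is taken from the driver API's cuMemAllocFromPoolAsync signature, i.e. dptr, bytesize, pool, stream):

UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAllocFromPoolAsync, CUresult, -1,
                                  CUdeviceptr*, size_t, CUmemoryPool,
                                  CUstream)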

@@ -156,6 +164,9 @@ UCM_CUDA_ALLOC_FUNC(cuMemAlloc, UCS_MEMORY_TYPE_CUDA, CUresult, CUDA_SUCCESS,
                     arg0, CUdeviceptr, "size=%zu", size_t)
 UCM_CUDA_ALLOC_FUNC(cuMemAlloc_v2, UCS_MEMORY_TYPE_CUDA, CUresult, CUDA_SUCCESS,
                     arg0, CUdeviceptr, "size=%zu", size_t)
+UCM_CUDA_ALLOC_FUNC(cuMemAllocAsync, UCS_MEMORY_TYPE_CUDA, CUresult,
+                    CUDA_SUCCESS, arg0, CUdeviceptr, "size=%zu stream=%p",
+                    size_t, CUstream)
Contributor:

For now this is fine, because cuMemAllocAsync can only allocate pinned memory; but setting the default memory pool to a user-created pool can alter the behavior in the future, when other memory types are supported.

In the future, it would be better to get the memory pool associated with the current device and examine its allocation properties to decide the memory type, instead of hard-coding MEMORY_TYPE_CUDA, since the same API may be used for other memory types as well. I don't see an API to get the properties back from a MemPool yet, so we'll need to intercept the MemPoolCreate/Destroy APIs for this.
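
Roughly, the kind of query I mean (a sketch: cuDeviceGetMemPool is a real driver API since CUDA 11.2, but ucm_mempool_lookup_props is a made-up helper standing in for properties recorded at a MemPoolCreate intercept):

#include <cuda.h>
#include <ucs/memory/memory_type.h>

static ucs_memory_type_t ucm_cuda_async_alloc_mem_type(void)
{
    CUmemoryPool pool;
    CUdevice     dev;

    if ((cuCtxGetDevice(&dev) != CUDA_SUCCESS) ||
        (cuDeviceGetMemPool(&pool, dev) != CUDA_SUCCESS)) {
        return UCS_MEMORY_TYPE_UNKNOWN;
    }

    /* Invented helper: would return the properties saved when ucm
     * intercepted cuMemPoolCreate; the default device pool is
     * device-pinned memory. */
    return ucm_mempool_lookup_props(pool);
}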

Contributor Author:

Or we can just set the memory type to UNKNOWN, like we planned to anyway?

@@ -46,15 +46,15 @@
 }

 /* Create a body of CUDA memory release replacement function */
-#define UCM_CUDA_FREE_FUNC(_name, _retval, _ptr_type, _mem_type) \
-    _retval ucm_##_name(_ptr_type ptr) \
+#define UCM_CUDA_FREE_FUNC(_name, _retval, _mem_type, ...) \
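
In effect, the variadic macro lets one definition cover both cuMemFree(ptr) and cuMemFreeAsync(ptr, stream). A simplified sketch of what the cuMemFreeAsync replacement conceptually does (ucm_dispatch_mem_type_free is a hypothetical stand-in for ucm's free-event dispatch):

CUresult ucm_cuMemFreeAsync(CUdeviceptr dptr, CUstream stream)
{
    /* Fire the free event before calling the real function, while the
     * range is still known to the pointer cache as CUDA memory. */
    ucm_dispatch_mem_type_free((void*)dptr, UCS_MEMORY_TYPE_CUDA);
    return ucm_orig_cuMemFreeAsync(dptr, stream);
}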
Contributor:

Technically, the memory should be freed only when the stream moves past the FreeAsync. When the API itself returns, this may not yet be true, so in that sense it may not be exactly right to change the attributes of the memory range or remove it from the pointer cache. But since we don't have a callback for when the free actually occurs, this should be OK for now: users would be very unlikely to issue UCP transfer operations after FreeAsync, knowing that the memory may not actually be freed yet.

Contributor Author:

Yes, I guess once this is submitted it's no longer legal to issue a data transfer from the CPU.
Do you know at which exact point the GPU can map new physical memory to the same virtual address?

@Akshay-Venkatesh (Contributor), Aug 9, 2021:

It would have to be at the next cu*Alloc* call. Since we intercept all of those, I guess we don't have to worry about stream semantics on FreeAsync.

@Akshay-Venkatesh (Contributor)

@yosefe I forgot to bring up the issue of the lack of sync-memops support on MallocAsync memory, which may come up because of this PR. This PR would likely result in the IB or cuda_ipc UCTs being used to move memory allocated through MallocAsync, but the following sequence could lead to stale data being transferred:

cudaMallocAsync(&x, length1, stream1);
cudaStreamSynchronize(stream1);
...
cudaMemcpy(x, y, length2, cudaMemcpyHostToDevice); // potentially non-blocking wrt the CPU; the copy to destination x may still be in flight
ucp_tag_send_nbx(x, ...); // the region pointed to by x is not valid yet because the previous memcpy is still in flight

Setting the sync-memops attribute on x would synchronize all outstanding memory operations on it, but that attribute is not supported on MallocAsync memory, so this could lead to data-validation issues, irrespective of whether transfers go through zcopy operations over ib/cuda_ipc or through pipeline protocols.
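
For context, setting the attribute looks like this on regular cuMemAlloc memory (a minimal sketch; the exact failure mode on stream-ordered allocations is an assumption):

#include <cuda.h>

/* SYNC_MEMOPS forces pending device operations on the range behind ptr to
 * complete before RDMA-style peer accesses. Stream-ordered (MallocAsync)
 * allocations reject this attribute, which is the gap described above. */
static int set_sync_memops(CUdeviceptr ptr)
{
    unsigned int value = 1;
    return cuPointerSetAttribute(&value, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                                 ptr) == CUDA_SUCCESS;
}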

@simonbyrne

Any update on this?

@Akshay-Venkatesh (Contributor)

> Any update on this?

@simonbyrne SYNC_MEMOPS is still not supported with the MallocAsync API. We plan to support such memory once that support becomes available.

@yosefe (Contributor, Author) commented Nov 2, 2022

Replaced by #8623.

@yosefe yosefe closed this Nov 2, 2022
@yosefe yosefe deleted the topic/ucm-cuda-test-install-memory-hooks-for branch May 22, 2023 14:38