UCM/CUDA/TEST: Install memory hooks for async Cuda allocations #7204
Conversation
@Akshay-Venkatesh WDYT?
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
                   [[#include <cuda.h>]])
    AC_CHECK_DECLS([cudaMallocAsync, cudaFreeAsync], [], [],
                   [[#include <cuda_runtime.h>]])
])
Does this mean that HAVE_CUDA is not set if the *Async APIs aren't detected at configure time? That would disallow CUDA for slightly older CUDA versions, wouldn't it?
I'm probably missing the commit that defines HAVE_DECL_CUMEMALLOCASYNC/HAVE_DECL_CUMEMFREEASYNC
It does not affect HAVE_CUDA; it sets a different set of macros, specific to the async APIs.
@@ -75,6 +75,8 @@ UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAlloc, CUresult, -1, CUdeviceptr*,
                                  size_t)
UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAlloc_v2, CUresult, -1, CUdeviceptr*,
                                  size_t)
UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAllocAsync, CUresult, -1, CUdeviceptr*,
                                  size_t, CUstream)
@yosefe I think we should also intercept cuMemAllocFromPoolAsync
ok, will add
@@ -156,6 +164,9 @@ UCM_CUDA_ALLOC_FUNC(cuMemAlloc, UCS_MEMORY_TYPE_CUDA, CUresult, CUDA_SUCCESS,
                    arg0, CUdeviceptr, "size=%zu", size_t)
UCM_CUDA_ALLOC_FUNC(cuMemAlloc_v2, UCS_MEMORY_TYPE_CUDA, CUresult, CUDA_SUCCESS,
                    arg0, CUdeviceptr, "size=%zu", size_t)
UCM_CUDA_ALLOC_FUNC(cuMemAllocAsync, UCS_MEMORY_TYPE_CUDA, CUresult,
For now this is fine, because cuMemAllocAsync can only allocate pinned memory, but setting the default memory pool to a user-created pool could alter this behavior in the future, once other memory types are supported.
In the future, it would be better to get the memory pool associated with the current device and examine its allocation properties to decide the memory type, instead of hard-coding UCS_MEMORY_TYPE_CUDA, since the same API may be used for other memory types as well. I don't see an API to get the allocation properties back from a MemPool yet, so we would need to intercept the MemPoolCreate/Destroy APIs for this.
Or we can just set the memory type to UNKNOWN, like we planned to anyway?
@@ -46,15 +46,15 @@
}

/* Create a body of CUDA memory release replacement function */
#define UCM_CUDA_FREE_FUNC(_name, _retval, _ptr_type, _mem_type) \
    _retval ucm_##_name(_ptr_type ptr) \
#define UCM_CUDA_FREE_FUNC(_name, _retval, _mem_type, ...) \
Technically, the memory is freed only when the stream moves past the FreeAsync operation; when the API itself returns, that may not yet have happened. In that sense it is not exactly right to change the attributes of the memory range, or to remove it from the pointer cache, at call time. But since we don't have a callback for when the free actually occurs, this should be OK for now, as users are very unlikely to issue UCP transfer operations after FreeAsync, knowing the memory may not actually be freed yet.
Yes, I guess once this is submitted it's no longer legal to issue data transfers from the CPU.
Do you know at which exact point the GPU can map new physical memory to the same virtual address?
It would have to be at the next cu*Alloc* call. Since we intercept all of those, I guess we don't have to worry about stream semantics on FreeAsync.
@yosefe I forgot to bring up the issue of the lack of sync-memops support on MallocAsync memory, which may come up because of this PR. This PR would likely result in the IB or cuda-ipc UCTs being used to move memory allocated through MallocAsync, but the following sequence could lead to stale data being transferred:
Setting the sync-memops attribute on x would synchronize all outstanding memory operations on it, but this is not supported on MallocAsync memory, so it could lead to data-validation issues, irrespective of whether zcopy operations go through ib/cuda_ipc or through pipeline protocols.
Any update on this?
@simonbyrne SYNC_MEMOPS is not yet supported with the MallocAsync API. We plan to support such memory once it becomes available.
Replaced by #8623.
Why
As discussed in #7194 and #7110, we need to add memory-hook support for CUDA async allocations. Without this, applications using these allocations may fail to detect CUDA memory and run into segfaults/access errors.