
UCM/CUDA/TEST: Install memory hooks for async Cuda allocations #7204

Conversation

@yosefe (Contributor) commented Aug 7, 2021

Why

As discussed in #7194 and #7110, we need to add memory-hook support for CUDA async allocations. Without this, applications using these allocations may fail to detect CUDA memory and run into segfaults/access errors.

@yosefe (Contributor, Author) commented Aug 9, 2021

@Akshay-Venkatesh WDYT?

@yosefe (Contributor, Author) commented Aug 9, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

AC_CHECK_DECLS([cuMemAllocAsync, cuMemFreeAsync], [], [],
               [[#include <cuda.h>]])
AC_CHECK_DECLS([cudaMallocAsync, cudaFreeAsync], [], [],
               [[#include <cuda_runtime.h>]])
])
Contributor:

Does this mean that HAVE_CUDA is not set if the *Async APIs aren't detected at configure time? That would disallow CUDA for slightly older versions of CUDA, wouldn't it?

I'm probably missing the commit that defines HAVE_DECL_CUMEMALLOCASYNC/HAVE_DECL_CUMEMFREEASYNC

Contributor Author:

It does not affect HAVE_CUDA; it sets a different set of macros, specific to the async APIs.
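
For illustration, a sketch of how the generated guards might be used (AC_CHECK_DECLS defines HAVE_DECL_CUMEMALLOCASYNC to 0 or 1, so the test is #if rather than #ifdef; the hook line mirrors the hunk below):

/* Sketch, not the exact PR code: compile the async hooks only when the
 * declarations were found at configure time, leaving HAVE_CUDA untouched. */
#if HAVE_DECL_CUMEMALLOCASYNC
UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAllocAsync, CUresult, -1, CUdeviceptr*,
                                  size_t, CUstream)
#endif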

@@ -75,6 +75,8 @@ UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAlloc, CUresult, -1, CUdeviceptr*,
                                   size_t)
 UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAlloc_v2, CUresult, -1, CUdeviceptr*,
                                   size_t)
+UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAllocAsync, CUresult, -1, CUdeviceptr*,
+                                  size_t, CUstream)
Contributor:

@yosefe I think we should also intercept cuMemAllocFromPoolAsync

Contributor Author:

ok, will add
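
Presumably something like the following, after the pattern in the hunk above (a sketch; the argument list is taken from the driver API's cuMemAllocFromPoolAsync signature, i.e. dptr, bytesize, pool, stream):

UCM_DEFINE_REPLACE_DLSYM_PTR_FUNC(cuMemAllocFromPoolAsync, CUresult, -1,
                                  CUdeviceptr*, size_t, CUmemoryPool,
                                  CUstream)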

@@ -156,6 +164,9 @@ UCM_CUDA_ALLOC_FUNC(cuMemAlloc, UCS_MEMORY_TYPE_CUDA, CUresult, CUDA_SUCCESS,
                     arg0, CUdeviceptr, "size=%zu", size_t)
 UCM_CUDA_ALLOC_FUNC(cuMemAlloc_v2, UCS_MEMORY_TYPE_CUDA, CUresult, CUDA_SUCCESS,
                     arg0, CUdeviceptr, "size=%zu", size_t)
+UCM_CUDA_ALLOC_FUNC(cuMemAllocAsync, UCS_MEMORY_TYPE_CUDA, CUresult,
+                    CUDA_SUCCESS, arg0, CUdeviceptr, "size=%zu stream=%p",
+                    size_t, CUstream)
Contributor:

For now this is fine, because cuMemAllocAsync can only allocate pinned memory; but setting the default memory pool to a user-created pool can alter the behavior in the future, when other memory types are supported.

In the future, it would be better to get the memory pool associated with the current device and examine its allocation properties to decide the memory type, instead of hard-coding MEMORY_TYPE_CUDA, since the same API may be used for other memory types as well. I don't see an API to get the properties back from a MemPool yet, so we'll need to intercept the MemPoolCreate/Destroy APIs for this.
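
Roughly, the kind of query I mean (a sketch: cuDeviceGetMemPool is a real driver API since CUDA 11.2, but ucm_mempool_lookup_props is a made-up helper standing in for properties recorded at a MemPoolCreate intercept):

#include <cuda.h>
#include <ucs/memory/memory_type.h>

static ucs_memory_type_t ucm_cuda_async_alloc_mem_type(void)
{
    CUmemoryPool pool;
    CUdevice     dev;

    if ((cuCtxGetDevice(&dev) != CUDA_SUCCESS) ||
        (cuDeviceGetMemPool(&pool, dev) != CUDA_SUCCESS)) {
        return UCS_MEMORY_TYPE_UNKNOWN;
    }

    /* Invented helper: would return the properties saved when ucm
     * intercepted cuMemPoolCreate; the default device pool is
     * device-pinned memory. */
    return ucm_mempool_lookup_props(pool);
}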

Contributor Author:

Or we can just set the memory type to UNKNOWN, like we planned to anyway?

@@ -46,15 +46,15 @@
 }

 /* Create a body of CUDA memory release replacement function */
-#define UCM_CUDA_FREE_FUNC(_name, _retval, _ptr_type, _mem_type) \
-    _retval ucm_##_name(_ptr_type ptr) \
+#define UCM_CUDA_FREE_FUNC(_name, _retval, _mem_type, ...) \
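
In effect, the variadic macro lets one definition cover both cuMemFree(ptr) and cuMemFreeAsync(ptr, stream). A simplified sketch of what the cuMemFreeAsync replacement conceptually does (ucm_dispatch_mem_type_free is a hypothetical stand-in for ucm's free-event dispatch):

CUresult ucm_cuMemFreeAsync(CUdeviceptr dptr, CUstream stream)
{
    /* Fire the free event before calling the real function, while the
     * range is still known to the pointer cache as CUDA memory. */
    ucm_dispatch_mem_type_free((void*)dptr, UCS_MEMORY_TYPE_CUDA);
    return ucm_orig_cuMemFreeAsync(dptr, stream);
}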
Contributor:

Technically, the memory should be freed only when the stream moves past the FreeAsync. When the API itself returns, this may not yet be true, so in that sense it may not be exactly right to change the attributes of the memory range or remove it from the pointer cache. But since we don't have a callback for when the free actually occurs, this should be OK for now: users would be very unlikely to issue UCP transfer operations after FreeAsync, knowing that the memory may not actually be freed yet.

Contributor Author:

Yes, I guess once this is submitted it's no longer legal to issue a data transfer from the CPU.
Do you know at which exact point the GPU can map new physical memory to the same virtual address?

@Akshay-Venkatesh (Contributor), Aug 9, 2021:

It would have to be at the next cu*Alloc* call. Since we intercept all of those, I guess we don't have to worry about stream semantics on FreeAsync.

@Akshay-Venkatesh (Contributor)

@yosefe I forgot to bring up the issue of the lack of sync-memops support on MallocAsync memory, which may come up because of this PR. This PR would likely result in the IB or cuda_ipc UCTs being used to move memory allocated through MallocAsync, but the following sequence could lead to stale data being transferred:

cudaMallocAsync(&x, length1, stream1);
cudaStreamSynchronize(stream1);
...
cudaMemcpy(x, y, length2, cudaMemcpyHostToDevice); // potentially non-blocking wrt the CPU; the copy to destination x may still be in flight
ucp_tag_send_nbx(x, ...); // the region pointed to by x is not valid yet because the previous memcpy is still in flight

Setting the sync-memops attribute on x would synchronize all outstanding memory operations on it, but that attribute is not supported on MallocAsync memory, so this could lead to data-validation issues, irrespective of whether transfers go through zcopy operations over ib/cuda_ipc or through pipeline protocols.
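
For context, setting the attribute looks like this on regular cuMemAlloc memory (a minimal sketch; the exact failure mode on stream-ordered allocations is an assumption):

#include <cuda.h>

/* SYNC_MEMOPS forces pending device operations on the range behind ptr to
 * complete before RDMA-style peer accesses. Stream-ordered (MallocAsync)
 * allocations reject this attribute, which is the gap described above. */
static int set_sync_memops(CUdeviceptr ptr)
{
    unsigned int value = 1;
    return cuPointerSetAttribute(&value, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS,
                                 ptr) == CUDA_SUCCESS;
}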

@simonbyrne

Any update on this?

@Akshay-Venkatesh (Contributor)

> Any update on this?

@simonbyrne SYNC_MEMOPS is still not supported with the MallocAsync API. We plan to support such memory once that support becomes available.

@yosefe (Contributor, Author) commented Nov 2, 2022

Replaced by #8623.

@yosefe yosefe closed this Nov 2, 2022
@yosefe yosefe deleted the topic/ucm-cuda-test-install-memory-hooks-for branch May 22, 2023 14:38