HMEM Copy Callbacks #146

Closed
wants to merge 2 commits into from

jdinan (Contributor) commented Aug 2, 2022

Issue #, if available:

None

Description of changes:

This PR adds HMEM copy callbacks that use GDRCopy to perform memory copies. This allows aws-ofi-nccl to support providers that must copy data between device and host buffers, without relying on cudaMemcpy. Because NCCL kernels are running on the GPU, submitting memcpy work to CUDA can (and usually does) result in a deadlock. GDRCopy can perform such copies without that risk. This change required introducing a GDRCopy buffer registration management layer (called hcopy) and refactoring the MR handle in aws-ofi-nccl so that it can hold both the OFI and GDRCopy registration information for a buffer. GDRCopy callbacks are enabled at compile time by configuring with --with-gdrcopy=....
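
For readers unfamiliar with the shape of this refactor, here is a rough sketch of an MR handle that carries both registrations and of copy callbacks that go through a CPU-visible mapping instead of CUDA. It is not the code in this PR: every struct and function name below is hypothetical, the OFI registration is reduced to an opaque pointer, and plain memcpy stands in for the GDRCopy mapping copies (gdr_copy_to_mapping / gdr_copy_from_mapping in a real build) so the example runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one hcopy (GDRCopy) registration. A real build
 * would also keep the GDRCopy pin handle; here we keep only the mapping. */
struct hcopy_reg {
    void     *mapped_base; /* CPU-visible mapping of the device buffer */
    uintptr_t dev_base;    /* device address where the buffer starts   */
    size_t    len;         /* registered length in bytes               */
};

/* Hypothetical MR handle holding both registrations for one buffer. */
struct mr_handle {
    void             *ofi_mr; /* opaque here; the OFI registration     */
    struct hcopy_reg *hcopy;  /* NULL when the buffer is host memory   */
};

/* Translate a device address range into a pointer inside the mapping.
 * Returns NULL if the range is not covered by the registration. */
static void *hcopy_translate(const struct hcopy_reg *reg,
                             uintptr_t dev_addr, size_t size)
{
    assert(reg != NULL);
    if (dev_addr < reg->dev_base || size > reg->len)
        return NULL;
    size_t off = (size_t)(dev_addr - reg->dev_base);
    if (off > reg->len - size)
        return NULL;
    return (char *)reg->mapped_base + off;
}

/* Host-to-device copy callback. memcpy is an assumption standing in for
 * the GDRCopy call; no work is submitted to a CUDA stream. */
int hcopy_to_device(const struct mr_handle *mr, uintptr_t dev_dst,
                    const void *host_src, size_t size)
{
    if (mr == NULL || mr->hcopy == NULL)
        return -1;
    void *dst = hcopy_translate(mr->hcopy, dev_dst, size);
    if (dst == NULL)
        return -1;
    memcpy(dst, host_src, size);
    return 0;
}

/* Device-to-host copy callback, same assumptions as above. */
int hcopy_from_device(const struct mr_handle *mr, void *host_dst,
                      uintptr_t dev_src, size_t size)
{
    if (mr == NULL || mr->hcopy == NULL)
        return -1;
    const void *src = hcopy_translate(mr->hcopy, dev_src, size);
    if (src == NULL)
        return -1;
    memcpy(host_dst, src, size);
    return 0;
}
```

The point of keeping both registrations in one handle is that a copy callback handed a device address can find the matching mapping without a separate lookup keyed on the OFI MR. Real GDRCopy mappings are also aligned to GPU page boundaries, which this sketch ignores.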

Note, this PR is based on top of #145.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: James Dinan <jdinan@nvidia.com>
jdinan (Contributor, Author) commented Aug 31, 2022

Libfabric already has GDRCopy support plumbed in, so this patch is only a temporary workaround for providers that haven't yet hooked up to that support. There should be no need to carry these changes upstream in aws-ofi-nccl. Closing this PR as do-not-merge.

@jdinan jdinan closed this Aug 31, 2022