HMEM Copy Callbacks #146

jdinan · 2022-08-02T22:01:33Z

Issue #, if available:

None

Description of changes:

This PR adds HMEM copy callbacks that use GDRCopy to perform memory copies. This lets aws-ofi-nccl support providers that copy data between device and host buffers while keeping those providers from using cudaMemcpy. Because NCCL kernels are running on the GPU, submitting memcpy work to CUDA can (and usually does) result in a deadlock. GDRCopy can perform such copies without introducing that risk. This change required two things: a new GDRCopy buffer registration management layer (called hcopy), and a refactor of the MR handle in aws-ofi-nccl so that it can hold both the OFI and the GDRCopy registration information for a buffer. GDRCopy callbacks are enabled at compile time by configuring with --with-gdrcopy=....
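As a rough illustration of the refactored MR handle described above, here is a minimal sketch in C. It is not the code from this PR. All `sketch_*` names and struct layouts are hypothetical, and the "device" buffer is simulated with host memory so the example runs without a GPU. In a real build, the `memcpy` calls would presumably be replaced by GDRCopy's `gdr_copy_to_mapping` and `gdr_copy_from_mapping`, and the OFI field would hold the provider's `struct fid_mr *`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the OFI registration (a struct fid_mr * in libfabric). */
typedef struct {
    uint64_t key;
} sketch_ofi_mr_t;

/* Hypothetical stand-in for a GDRCopy registration: a CPU-visible
 * mapping that covers a pinned device buffer. */
typedef struct {
    void  *dev_base;    /* device address where the registered buffer starts */
    void  *mapped_base; /* CPU address of the mapping for that buffer */
    size_t len;         /* length of the registered buffer in bytes */
} sketch_hcopy_reg_t;

/* One handle carrying both registrations, as the PR description outlines. */
typedef struct {
    sketch_ofi_mr_t    ofi;
    sketch_hcopy_reg_t hcopy;
} sketch_mr_handle_t;

/* Translate a device pointer into the CPU mapping.
 * Returns 0 on success, -1 if [dev_ptr, dev_ptr + size) is not
 * fully inside the registered range. */
int sketch_translate(const sketch_hcopy_reg_t *reg, const void *dev_ptr,
                     size_t size, void **out)
{
    uintptr_t base = (uintptr_t)reg->dev_base;
    uintptr_t p = (uintptr_t)dev_ptr;

    if (p < base || size > reg->len || p - base > reg->len - size)
        return -1;
    *out = (char *)reg->mapped_base + (p - base);
    return 0;
}

/* Host-to-device copy callback that never submits work to CUDA. */
int sketch_copy_to_device(const sketch_mr_handle_t *mr, void *dev_dst,
                          const void *host_src, size_t size)
{
    void *dst;

    assert(mr != NULL);
    if (sketch_translate(&mr->hcopy, dev_dst, size, &dst) != 0)
        return -1;
    memcpy(dst, host_src, size); /* simulated; assumed gdr_copy_to_mapping in real code */
    return 0;
}

/* Device-to-host copy callback, the mirror image of the one above. */
int sketch_copy_from_device(const sketch_mr_handle_t *mr, void *host_dst,
                            const void *dev_src, size_t size)
{
    void *src;

    assert(mr != NULL);
    if (sketch_translate(&mr->hcopy, dev_src, size, &src) != 0)
        return -1;
    memcpy(host_dst, src, size); /* simulated; assumed gdr_copy_from_mapping in real code */
    return 0;
}
```

The sketch keeps both registrations in one handle so that a copy callback can find the GDRCopy mapping from the same object the OFI path already carries. It also rejects any copy that falls outside the registered range.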

Note, this PR is based on top of #145.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: James Dinan <jdinan@nvidia.com>

jdinan · 2022-08-31T21:26:53Z

Libfabric already has GDRCopy support plumbed in, so this patch is only a temporary workaround for providers that have not yet hooked into that support. There should be no need to carry these changes upstream in aws-ofi-nccl. Closing this PR as do-not-merge.

Extend fi_read to support additional MR modes

362865e

Signed-off-by: James Dinan <jdinan@nvidia.com>

jdinan mentioned this pull request Aug 2, 2022

HMEM Copy Callbacks and Extended MR Support #140

Closed

HMEM copy overrides using GDRCopy

44b128a

Signed-off-by: James Dinan <jdinan@nvidia.com>

jdinan force-pushed the pr/hcopy branch from 98eee9a to 44b128a Compare August 4, 2022 15:23

AddyLaddy mentioned this pull request Aug 9, 2022

Does GDRcopy support the HPE/Cray "SlingShot" backbone? NVIDIA/gdrcopy#232

Open

jdinan closed this Aug 31, 2022