UCT/GDR_COPY: Fix gdr_copy registration issues #3044
bureddy
commented
Nov 15, 2018
- Align the address to the GPU page size when rcache is turned off.
- Register only the requested memory region instead of the complete allocation (cudaMalloc) region; registering the whole allocation is inefficient when only a small part of a larger allocation is used in the communication.
@Akshay-Venkatesh please review
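The two points above come down to registering the smallest run of whole GPU pages that covers the requested buffer. A minimal sketch of that arithmetic follows, under stated assumptions: the `EX_`/`ex_` names are hypothetical and not part of UCX or gdrcopy, the 64 KiB page size is the value mentioned later in this thread, and rounding `address + length` up to get the end of the range is an assumption, since the diff in this PR does not show how `end` is computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64 KiB GPU page, the value discussed for GPU_PAGE_SIZE. */
#define EX_GPU_PAGE_SIZE ((uintptr_t)1 << 16)
#define EX_GPU_PAGE_MASK (~(EX_GPU_PAGE_SIZE - 1))

/* Hypothetical helper type: half-open range [start, end) of whole GPU pages. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} ex_gpu_range_t;

/* Smallest page-aligned range that covers [address, address + length). */
static inline ex_gpu_range_t ex_gpu_page_range(const void *address, size_t length)
{
    ex_gpu_range_t r;

    assert(length > 0);
    r.start = (uintptr_t)address & EX_GPU_PAGE_MASK;
    r.end   = ((uintptr_t)address + length + EX_GPU_PAGE_SIZE - 1) &
              EX_GPU_PAGE_MASK;
    return r;
}

/* Size obtained when only the length is rounded up to a page multiple. */
static inline size_t ex_len_only_reg_size(size_t length)
{
    return (length + EX_GPU_PAGE_SIZE - 1) & EX_GPU_PAGE_MASK;
}
```

For example, a 0x20-byte buffer at 0x1FFF0 straddles a page boundary: `ex_gpu_page_range` yields 0x10000 to 0x30000 (two pages), while rounding the length alone gives a single page starting at 0x10000, which would not reach the last 0x10 bytes of the buffer.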
is this a bugfix for the MTT issue?
src/uct/cuda/gdr_copy/gdr_copy_md.c
Outdated
@@ -188,10 +187,10 @@ static ucs_status_t uct_gdr_copy_mem_reg(uct_md_h uct_md, void *address, size_t
 return UCS_ERR_NO_MEMORY;
 }

-reg_size = (length + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
-ptr = (void *) ((uintptr_t)address & GPU_PAGE_MASK);
+start = ucs_align_down_pow2((uintptr_t)address, GPU_PAGE_SIZE);
can use ucs_align_down_pow2_ptr, ucs_align_up_pow2_ptr
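To illustrate what pointer-based alignment helpers of this kind do, here is a small sketch. The `ex_` functions are hypothetical stand-ins written for this example; they are not the UCX definitions of `ucs_align_down_pow2_ptr` and `ucs_align_up_pow2_ptr`, and they assume the alignment is a nonzero power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: round a pointer down to a power-of-two boundary. */
static inline void *ex_align_down_pow2_ptr(void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (void *)((uintptr_t)ptr & ~((uintptr_t)alignment - 1));
}

/* Hypothetical stand-in: round a pointer up to a power-of-two boundary. */
static inline void *ex_align_up_pow2_ptr(void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (void *)(((uintptr_t)ptr + alignment - 1) &
                    ~((uintptr_t)alignment - 1));
}
```

Taking and returning `void *` keeps the `uintptr_t` casts inside the helpers, so the calling code can pass `address` directly instead of casting at every call site.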
src/uct/cuda/gdr_copy/gdr_copy_md.c
Outdated
-status = uct_gdr_copy_mem_reg_internal(uct_md, ptr, reg_size, 0, mem_hndl);
+status = uct_gdr_copy_mem_reg_internal(uct_md, (void *)start, (end - start), 0, mem_hndl);
no need ( ) around end-start
Test PASSed.
Test FAILed.
@yosefe I observed the issue with rcache off while debugging another hcoll MTT issue, https://github.com/Mellanox/hcoll/issues/811. The original MTT issue is still pending; we are working on it with NVIDIA (NVIDIA/gdrcopy#44).
Test PASSed.
Test FAILed.
@@ -188,10 +187,11 @@ static ucs_status_t uct_gdr_copy_mem_reg(uct_md_h uct_md, void *address, size_t
 return UCS_ERR_NO_MEMORY;
 }

-reg_size = (length + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
-ptr = (void *) ((uintptr_t)address & GPU_PAGE_MASK);
+start = ucs_align_down_pow2_ptr(address, GPU_PAGE_SIZE);
Generic question: is it safe to use the compile-time GPU_PAGE_SIZE?
This is NVIDIA's gdr_copy constant: https://github.com/NVIDIA/gdrcopy/blob/master/gdrapi.h#L35
It looks safe for current NVIDIA architectures.
I just wanted to make sure that it is the same value (64K) across all architectures.
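One way to make the assumption in this exchange explicit is a compile-time check next to the alignment code. The sketch below is illustrative only: `EX_GPU_PAGE_SIZE` is a hypothetical stand-in for the gdrcopy constant, set to the 64K value mentioned above, and the helper name is invented for this example. A check like this only guards the value seen at build time; it cannot confirm what a given GPU architecture uses at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the compile-time constant discussed above. */
#define EX_GPU_PAGE_SHIFT 16
#define EX_GPU_PAGE_SIZE  ((uintptr_t)1 << EX_GPU_PAGE_SHIFT)

/* Mask-based alignment is only valid for power-of-two page sizes. */
static_assert((EX_GPU_PAGE_SIZE & (EX_GPU_PAGE_SIZE - 1)) == 0,
              "GPU page size must be a power of two");

/* Fail the build if the constant ever differs from the expected 64K. */
static_assert(EX_GPU_PAGE_SIZE == 64 * 1024,
              "alignment code assumes 64K GPU pages");

/* Hypothetical helper: nonzero when ptr sits on a GPU page boundary. */
static inline int ex_is_gpu_page_aligned(const void *ptr)
{
    return ((uintptr_t)ptr & (EX_GPU_PAGE_SIZE - 1)) == 0;
}
```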
@bureddy Changes look good to me.
0d064a3 to bbd0058
Test PASSed.
Test PASSed.