Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/GDR_COPY: Fix gdr_copy registration issues #3044

Merged
merged 3 commits into from
Nov 18, 2018

Conversation

bureddy
Copy link
Contributor

@bureddy bureddy commented Nov 15, 2018

  1. Align address to GPU PAGE SIZE when rcache is turned off.
  2. register only requested memory region instead of complete allocation(cudaMalloc) region because this is inefficient if only small region of bigger allocation is used in the communication,

@bureddy
Copy link
Contributor Author

bureddy commented Nov 15, 2018

@Akshay-Venkatesh please review

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a bugfix for the MTT issue?

@@ -188,10 +187,10 @@ static ucs_status_t uct_gdr_copy_mem_reg(uct_md_h uct_md, void *address, size_t
return UCS_ERR_NO_MEMORY;
}

reg_size = (length + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
ptr = (void *) ((uintptr_t)address & GPU_PAGE_MASK);
start = ucs_align_down_pow2((uintptr_t)address, GPU_PAGE_SIZE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use ucs_align_down_pow2_ptr, ucs_align_up_pow2_ptr


status = uct_gdr_copy_mem_reg_internal(uct_md, ptr, reg_size, 0, mem_hndl);
status = uct_gdr_copy_mem_reg_internal(uct_md, (void *)start, (end - start), 0, mem_hndl);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need ( ) around end-start

@swx-jenkins1
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5611/ for details.

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8247/ for details (Mellanox internal link).

@bureddy
Copy link
Contributor Author

bureddy commented Nov 16, 2018

@yosefe observed issue with rache off while debugging another hcoll MTT issue https://github.com/Mellanox/hcoll/issues/811 . The original MTT issue is still pending, working with NVIDIA (NVIDIA/gdrcopy#44)

@swx-jenkins1
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5612/ for details.

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8248/ for details (Mellanox internal link).

@@ -188,10 +187,11 @@ static ucs_status_t uct_gdr_copy_mem_reg(uct_md_h uct_md, void *address, size_t
return UCS_ERR_NO_MEMORY;
}

reg_size = (length + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
ptr = (void *) ((uintptr_t)address & GPU_PAGE_MASK);
start = ucs_align_down_pow2_ptr(address, GPU_PAGE_SIZE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generic question - is this safe to use compile time GPU_PAGE_SIZE ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shamisp Do you mean this in the context of #if HAVE_CUDA?
If so, GPU_PAGE_SIZE is defined in gdrapi.h and gets included #if HAVE_CUDA. It should be safe, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is NVIDIA's gdr_copy constant https://github.com/NVIDIA/gdrcopy/blob/master/gdrapi.h#L35
Looks like it is safe for current nvidia architectures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to make sure that it is the same value (64K) across all architectures.

@Akshay-Venkatesh
Copy link
Contributor

@bureddy
You mention inefficient because registration time is longer?

Changes look good to me.

@bureddy
Copy link
Contributor Author

bureddy commented Nov 16, 2018

@Akshay-Venkatesh yes

@swx-jenkins1
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5613/ for details.

@mellanox-github
Copy link
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8249/ for details (Mellanox internal link).

@yosefe yosefe merged commit a782840 into openucx:master Nov 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants