UCT/GDR_COPY: Fix gdr_copy registration issues #3044

bureddy · 2018-11-15T23:48:23Z

Align the address to GPU_PAGE_SIZE when rcache is turned off.
Register only the requested memory region instead of the complete allocation (cudaMalloc) region; registering the whole allocation is inefficient when only a small part of a larger allocation is used in communication.
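A minimal sketch of the page-aligned range computation described above, not the code from this PR. The `sketch_*` names are hypothetical stand-ins for the UCX alignment helpers, and the 64 KiB page size is an assumption taken from the value mentioned later in this thread; the real constant comes from gdrapi.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64 KiB GPU page. The real GPU_PAGE_SIZE is defined in gdrapi.h. */
#define SKETCH_GPU_PAGE_SIZE ((uintptr_t)1 << 16)

/* Hypothetical helper: round value down to a power-of-two boundary. */
static uintptr_t sketch_align_down_pow2(uintptr_t value, uintptr_t align)
{
    assert((align != 0) && ((align & (align - 1)) == 0));
    return value & ~(align - 1);
}

/* Hypothetical helper: round value up to a power-of-two boundary. */
static uintptr_t sketch_align_up_pow2(uintptr_t value, uintptr_t align)
{
    return sketch_align_down_pow2(value + align - 1, align);
}

typedef struct {
    uintptr_t start; /* first byte of the page-aligned range */
    uintptr_t end;   /* one past the last byte of the page-aligned range */
} sketch_reg_range_t;

/* Smallest page-aligned range that covers [address, address + length). */
static sketch_reg_range_t sketch_reg_range(uintptr_t address, size_t length)
{
    sketch_reg_range_t range;

    range.start = sketch_align_down_pow2(address, SKETCH_GPU_PAGE_SIZE);
    range.end   = sketch_align_up_pow2(address + length, SKETCH_GPU_PAGE_SIZE);
    return range;
}
```

Under these assumptions, a 0x20-byte buffer at 0x1fff0 crosses a page boundary and yields the range 0x10000 to 0x30000. Rounding the length on its own would give a single page starting at 0x10000, which stops short of the buffer's last bytes.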

bureddy · 2018-11-15T23:48:39Z

yosefe

is this a bugfix for the MTT issue?

yosefe · 2018-11-15T23:54:50Z

src/uct/cuda/gdr_copy/gdr_copy_md.c

@@ -188,10 +187,10 @@ static ucs_status_t uct_gdr_copy_mem_reg(uct_md_h uct_md, void *address, size_t
        return UCS_ERR_NO_MEMORY;
    }

-    reg_size = (length + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
-    ptr = (void *) ((uintptr_t)address & GPU_PAGE_MASK);
+    start = ucs_align_down_pow2((uintptr_t)address, GPU_PAGE_SIZE);


can use ucs_align_down_pow2_ptr, ucs_align_up_pow2_ptr

yosefe · 2018-11-15T23:55:31Z

src/uct/cuda/gdr_copy/gdr_copy_md.c


-    status = uct_gdr_copy_mem_reg_internal(uct_md, ptr, reg_size, 0, mem_hndl);
+    status = uct_gdr_copy_mem_reg_internal(uct_md, (void *)start, (end - start), 0, mem_hndl);


no need ( ) around end-start

swx-jenkins1 · 2018-11-16T00:35:13Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5611/ for details.

mellanox-github · 2018-11-16T02:27:17Z

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8247/ for details (Mellanox internal link).

bureddy · 2018-11-16T05:56:55Z

@yosefe I observed this issue with rcache off while debugging another hcoll MTT issue, https://github.com/Mellanox/hcoll/issues/811. The original MTT issue is still pending; I am working with NVIDIA on it (NVIDIA/gdrcopy#44).

swx-jenkins1 · 2018-11-16T06:31:37Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5612/ for details.

mellanox-github · 2018-11-16T08:24:45Z

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8248/ for details (Mellanox internal link).

shamisp · 2018-11-16T16:09:22Z

src/uct/cuda/gdr_copy/gdr_copy_md.c

@@ -188,10 +187,11 @@ static ucs_status_t uct_gdr_copy_mem_reg(uct_md_h uct_md, void *address, size_t
        return UCS_ERR_NO_MEMORY;
    }

-    reg_size = (length + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;
-    ptr = (void *) ((uintptr_t)address & GPU_PAGE_MASK);
+    start = ucs_align_down_pow2_ptr(address, GPU_PAGE_SIZE);


Generic question: is it safe to use the compile-time GPU_PAGE_SIZE?

@shamisp Do you mean this in the context of #if HAVE_CUDA?
If so, GPU_PAGE_SIZE is defined in gdrapi.h and gets included #if HAVE_CUDA. It should be safe, correct?

This is NVIDIA's gdr_copy constant: https://github.com/NVIDIA/gdrcopy/blob/master/gdrapi.h#L35
It looks safe for current NVIDIA architectures.

I just wanted to make sure that it is the same value (64K) across all architectures.
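One way to guard the assumption discussed above is a compile-time check next to the code that relies on it. The sketch below uses hypothetical `SKETCH_*` macros that mirror how the thread describes the constant (a 64K power of two); it does not include gdrapi.h, and real code would assert on GPU_PAGE_SIZE itself.

```c
#include <assert.h>   /* provides the C11 static_assert macro */
#include <stdint.h>

/* Assumption: local stand-ins for the gdrapi.h constants, 64 KiB page. */
#define SKETCH_GPU_PAGE_SHIFT 16
#define SKETCH_GPU_PAGE_SIZE  ((uint64_t)1 << SKETCH_GPU_PAGE_SHIFT)
#define SKETCH_GPU_PAGE_MASK  (~(SKETCH_GPU_PAGE_SIZE - 1))

/* The mask-based alignment is only valid for a power-of-two page size. */
static_assert((SKETCH_GPU_PAGE_SIZE & (SKETCH_GPU_PAGE_SIZE - 1)) == 0,
              "page size must be a power of two");

/* Fails the build if the page size ever differs from the expected 64K. */
static_assert(SKETCH_GPU_PAGE_SIZE == 64 * 1024,
              "expected a 64K GPU page size");

/* Returns nonzero when addr sits on a GPU page boundary. */
static int sketch_is_page_aligned(uint64_t addr)
{
    return (addr & ~SKETCH_GPU_PAGE_MASK) == 0;
}
```

A check like this only catches a changed constant at build time; it says nothing about the page size the hardware actually uses.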

Akshay-Venkatesh · 2018-11-16T17:56:35Z

@bureddy
Do you call it inefficient because registration time is longer?

Changes look good to me.

bureddy · 2018-11-16T18:09:56Z

@Akshay-Venkatesh yes

swx-jenkins1 · 2018-11-16T18:49:11Z

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/5613/ for details.

mellanox-github · 2018-11-16T21:04:04Z

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/8249/ for details (Mellanox internal link).

bureddy added 2 commits November 16, 2018 00:47

UCT/GDR_COPY: align address to GPU PAGE SIZE when rcache is turned off

57c8744

UCT/GDR_COPY: avoid registering whole allocation addr range

3b7f72e

yosefe reviewed Nov 15, 2018

shamisp reviewed Nov 16, 2018

UCT/GDR_COPY: Fix review commnets

bbd0058

bureddy force-pushed the gdr_copy_fix branch from 0d064a3 to bbd0058 Compare November 16, 2018 18:05

yosefe approved these changes Nov 18, 2018

yosefe merged commit a782840 into openucx:master Nov 18, 2018
