Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/GTEST: Fixed multiple tests for gdr_copy transport #9840

Merged
merged 1 commit into from
Apr 26, 2024

Conversation

iyastreb
Copy link
Contributor

@iyastreb iyastreb commented Apr 23, 2024

What

This is the fix for RM#3873368.
The issue is always reproducible on rock, when building and running UCX with the following modules:

module load hpcx-env
module load hpcx-env/cuda
module load hpcx-env/gdrcopy

With gdr_copy MD configured, memory registration fails in several test suites:

  • 3 tests in test_md
  • 2 tests in uct_test
  • test_uct_loopback_cuda

The root cause was always the same: CUDA memory of arbitrary size was allocated, and then this memory is registered with uct_md_mem_reg without any alignment. However this does not work for gdr_copy transport, because it's required to register memory aligned by GPU_PAGE_SIZE (64k in this case).

I fixed the memory registration in all those places the same way it's done in ucp_mm module: by alignment using ucs_align_ptr_range API

Copy link
Contributor

@tvegas1 tvegas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cosmetic

test/gtest/uct/test_md.cc Outdated Show resolved Hide resolved
src/tools/perf/cuda/cuda_alloc.c Outdated Show resolved Hide resolved
tvegas1
tvegas1 previously approved these changes Apr 24, 2024
@@ -276,7 +280,8 @@ void test_md::dereg_cb(uct_completion_t *comp)
bool test_md::is_gpu_ipc() const
{
return (GetParam().md_name == "cuda_ipc") ||
(GetParam().md_name == "rocm_ipc");
(GetParam().md_name == "rocm_ipc") ||
(GetParam().md_name == "gdr_copy");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it needed? it's not really a gpu_ipc transport

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to skip a failing test: gdr_copy/test_md.rkey_compare/0

[ RUN      ] gdr_copy/test_md.rkey_compare/0 <gdr_copy>
[1713949792.100377] [rock04:65081:0]     gdr_copy_md.c:121  UCX  ERROR gdr_pin_buffer failed. length :65536 ret:22
/labhome/iyastrebov/ws/ucx3/contrib/../test/gtest/uct/test_md.cc:1011: Failure
Error: Input/output error
/labhome/iyastrebov/ws/ucx3/contrib/../test/gtest/common/test.cc:366: Failure
Failed
Got 1 errors and 0 warnings during the test

Unlike other tests, this one fails due to a different reason: it allocated memory with ucs_mmap_alloc, which is not CUDA memory. And then gdr_pin_buffer fails because it's expected to work only with CUDA memory.

These two tests were skipped by "is_gpu_ipc" condition for a similar reason, so I decided just to reuse existing function that is used only in these 2 places to skip the tests. Here is the PR that introduced it: #9401

Maybe we can just rename that function to "is_gpu_md"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update the test instead. E.g. like this:
query md attrs with uct_md_query_v2
skip the test if md_attr->reg_mem_types does not contain host memory

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, as @brminich the right way to to check if the MD can register host memory .
Maybe after this, we could remove is_gpu_ipc condition as well to cleanup the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, agree, will be fixed

Comment on lines 971 to 977
/* Register memory respecting MD reg_alignment */
size_t reg_align = ucs_max(md_attr().reg_alignment, 1);
void *reg_address = mem->address;
size_t reg_length = mem->length;
ucs_align_ptr_range(&reg_address, &reg_length, reg_align);

ucs_status_t status = uct_md_mem_reg(m_md, reg_address, reg_length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use uct_md_mem_reg_v2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

status = uct_md_mem_reg(perf->uct.md, alloc_mem->address,
length, flags, &alloc_mem->memh);
/* Register memory respecting MD reg_alignment */
reg_align = ucs_max(md_attr.reg_alignment, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed? is there an MD that returns 0 here? if yes it should be fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, as I see it's 0s for all MDs, expect gdr_copy:

[     INFO ] >>> knem 0
[     INFO ] >>> cma 0
[     INFO ] >>> posix 0
[     INFO ] >>> sysv 0
[     INFO ] >>> xpmem 0
[     INFO ] >>> cuda_cpy 0
[     INFO ] >>> cuda_ipc 0
[     INFO ] >>> mlx5_0 0
[     INFO ] >>> mlx5_1 0
[     INFO ] >>> mlx5_2 0
[     INFO ] >>> gdr_copy 65536

And the second reason: I looked at the existing handling of this param in production code: https://github.com/openucx/ucx/blob/master/src/ucp/core/ucp_mm.c#L525

        if (context->rcache == NULL) {
            reg_align = ucs_max(context->tl_mds[md_index].attr.reg_alignment, 1);
            ucs_align_ptr_range(&reg_address, &reg_length, reg_align);
        }

Should we set this attribute to 1 everywhere by default?
I quickly checked places where this field is used, should be ok to change it to 1

Copy link
Contributor

@yosefe yosefe Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set this attribute to 1 everywhere by default?

Yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 92 to 94
reg_align = ucs_max(md_attr.reg_alignment, 1);
reg_address = alloc_mem->address;
reg_length = length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we can remove these local helper vars

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reg_align - we can remove it if agree to init reg_alignment to 1 for all MDs
reg_address - we cannot remove it, because we don't want to modify alloc_mem->address
reg_length - this is safe to remove

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -66,6 +66,10 @@ void* test_md::alloc_thread(void *arg)
ucs_status_t test_md::reg_mem(unsigned flags, void *address, size_t length,
uct_mem_h *memh_p)
{
/* Register memory respecting MD reg_alignment */
size_t reg_align = ucs_max(md_attr().reg_alignment, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. same here, need to fix UCT MD if it returns 0
  2. we should make sure the test allocates aligned memory buffer and not just align during registration. Otherwise, we could try to register pages not in the process address space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. ok
  2. hmm, are we absolutely sure about this?
    Looking at production code, it seems we do exactly that, or I misunderstood it?
    Finally we just register the whole page, so it must be within the process address space.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. ok

@@ -276,7 +280,8 @@ void test_md::dereg_cb(uct_completion_t *comp)
bool test_md::is_gpu_ipc() const
{
return (GetParam().md_name == "cuda_ipc") ||
(GetParam().md_name == "rocm_ipc");
(GetParam().md_name == "rocm_ipc") ||
(GetParam().md_name == "gdr_copy");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, as @brminich the right way to to check if the MD can register host memory .
Maybe after this, we could remove is_gpu_ipc condition as well to cleanup the code

Comment on lines 971 to 975
/* Register memory respecting MD reg_alignment */
size_t reg_align = ucs_max(md_attr().reg_alignment, 1);
void *reg_address = mem->address;
size_t reg_length = mem->length;
ucs_align_ptr_range(&reg_address, &reg_length, reg_align);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. code duplication with test_md
  2. cannot register outside the allocated buffer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. will think how to get rid of the code duplication
  2. Let's discuss this in person?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. there is no duplication with test_md anymore
  2. it's ok as we agreed

@iyastreb
Copy link
Contributor Author

/azp run

Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@yosefe yosefe merged commit 176ff68 into openucx:master Apr 26, 2024
140 checks passed
@iyastreb iyastreb deleted the uct/gtest/fix-gdrcopy branch April 29, 2024 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants