UCT/GTEST: Fixed multiple tests for gdr_copy transport - v1.17.x #9853

Merged
merged 1 commit into from
Apr 29, 2024

iyastreb
Contributor

What

This is a double commit of #9840 for the v1.17.x branch.

This is the fix for RM#3873368.
The issue is always reproducible on rock when UCX is built and run with the following modules:

module load hpcx-env
module load hpcx-env/cuda
module load hpcx-env/gdrcopy

With gdr_copy MD configured, memory registration fails in several test suites:

  • 3 tests in test_md
  • 2 tests in uct_test
  • test_uct_loopback_cuda

The root cause was the same in every case: CUDA memory of arbitrary size was allocated and then registered with uct_md_mem_reg without any alignment. This does not work for the gdr_copy transport, which requires the registered memory to be aligned to GPU_PAGE_SIZE (64k in this case).

I fixed the memory registration in all those places the same way it is done in the ucp_mm module: by aligning the address range with the ucs_align_ptr_range API before registering it.
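To illustrate the alignment step described above, here is a minimal stand-alone sketch in C. It is not the UCX code: `sketch_align_ptr_range` and `SKETCH_GPU_PAGE_SIZE` are hypothetical names introduced only for this example. The sketch assumes that aligning a range means rounding the start address down and the end address up to the nearest multiple of the page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64k GPU page size, as stated in the description above. */
#define SKETCH_GPU_PAGE_SIZE ((size_t)65536)

/*
 * Hypothetical stand-in for the alignment step (not the UCX implementation).
 * Widens the range [*address_p, *address_p + *length_p) so that both its
 * start and its end fall on a multiple of `alignment`, which must be a
 * non-zero power of two.
 */
static void sketch_align_ptr_range(void **address_p, size_t *length_p,
                                   size_t alignment)
{
    uintptr_t start = (uintptr_t)*address_p;
    uintptr_t end   = start + *length_p;
    uintptr_t mask  = (uintptr_t)alignment - 1;

    assert((alignment != 0) && ((alignment & (alignment - 1)) == 0));

    start &= ~mask;                /* round the start address down */
    end    = (end + mask) & ~mask; /* round the end address up */

    *address_p = (void*)start;
    *length_p  = (size_t)(end - start);
}
```

In the tests, the widened address and length would then be passed to uct_md_mem_reg in place of the raw allocation address and size. For example, a 100-byte buffer at 0x10001234 becomes a single 64k page starting at 0x10000000.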

@tvegas1 tvegas1 changed the title UCT/GTEST: Fixed multiple tests for gdr_copy transport UCT/GTEST: Fixed multiple tests for gdr_copy transport - v1.17.x Apr 29, 2024
@yosefe yosefe merged commit 3778154 into openucx:v1.17.x Apr 29, 2024
140 checks passed
@iyastreb iyastreb deleted the uct/gtest/fix-gdrcopy-1.17 branch May 3, 2024 09:03
3 participants