UCP/UCS/UCT: Fix memtype_cache region info after merge #7791

Merged

@yosefe yosefe commented Dec 11, 2021

Why

Fix #7575
The whole-alloc logic returned wrong region boundaries, so the registered memory did not contain the actual communication buffer.

How

  • Track merged memtype cache region range by start/end only; drop base_addr/alloc_length
  • Assert that "whole-alloc" really returns a region that contains the original one
  • Add logging
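
The first two points above can be sketched as follows. This is a minimal illustration, not the UCX implementation: `region_t`, `region_merge`, `region_contains` and `whole_alloc_range` are hypothetical names, and the region is reduced to a half-open `[start, end)` range. The idea is that when merged regions keep only their bounds, the "whole allocation" range is derived from those bounds and can be checked against the buffer being registered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified region: only the [start, end) range is stored,
 * with no separate base address or allocation length to keep in sync. */
typedef struct {
    uintptr_t start; /* inclusive */
    uintptr_t end;   /* exclusive */
} region_t;

/* Merge two regions into the smallest range that covers both. */
region_t region_merge(region_t a, region_t b)
{
    region_t merged;

    merged.start = (a.start < b.start) ? a.start : b.start;
    merged.end   = (a.end > b.end) ? a.end : b.end;
    return merged;
}

/* Return nonzero if [address, address + length) lies inside the region.
 * Written with subtractions so that address + length cannot overflow. */
int region_contains(region_t r, uintptr_t address, size_t length)
{
    uintptr_t size = r.end - r.start;

    return (address >= r.start) && (length <= size) &&
           (address - r.start <= size - length);
}

/* Derive the "whole allocation" range from the region bounds and assert
 * that it really covers the buffer the caller wants to register. */
region_t whole_alloc_range(region_t r, uintptr_t address, size_t length)
{
    assert(region_contains(r, address, length));
    return r;
}
```

With a single source of truth for the range, a merge cannot leave a stale base address or length behind, which is the failure mode described under "Why".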

Test status

Currently, the test case fails with ibv_reg_mr: Bad address, instead of Local protection error, but it could be a driver issue. Reproducer:

$ ssh swx-dgx01
$ conda activate ucx
$ env UCX_MEMTYPE_CACHE=y UCX_MAX_RNDV_RAILS=1 UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda \
  python dask_cuda/benchmarks/local_cudf_merge.py -d 0,1,2,3,4,5,6,7 --runs 5 -c 100_000_000 -p ucx \
   --interface ib0 --enable-nvlink --enable-infiniband --enable-rdmacm

@yosefe yosefe added the Bugfix label Dec 11, 2021
@@ -126,6 +127,16 @@ ucs_status_t ucp_mem_rereg_mds(ucp_context_h context, ucp_md_map_t reg_md_map,
ucp_memory_detect_internal(context, address, length, &mem_info);
base_address = mem_info.base_address;
reg_length = mem_info.alloc_length;
end_address = UCS_PTR_BYTE_OFFSET(base_address, reg_length);
Contributor

Where is end_address used once populated?

Contributor Author

it's used for ucs_trace and ucs_assertv

ucs_memtype_cache_update_internal(ucs_memtype_cache_global_instance,
address, size, &mem_info,
address, size, UCS_MEMORY_TYPE_UNKNOWN,
Contributor

@yosefe do you expect any issues in removing a region by marking it as unknown memory type even though it may have a valid memtype? I don't immediately see a problem in update_internal as it seems to consider mem_type for insert operation alone but wanted to double check. Wondering if it's better to detect and remove with valid mem_type.

Contributor Author

IMO setting to UNKNOWN is safer, since removing it completely is essentially marking it as host memory
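
The distinction in this reply can be sketched with a toy cache. Everything here is hypothetical (`toy_cache_t`, `toy_cache_classify`, the `TOY_*` enums), and it assumes the caller policy the reply describes: a lookup miss is treated as host memory, while an entry marked unknown forces a fresh detection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_MEM_HOST, TOY_MEM_CUDA, TOY_MEM_UNKNOWN } toy_mem_type_t;

/* Hypothetical cache entry covering the half-open range [start, end). */
typedef struct {
    uintptr_t      start;
    uintptr_t      end;
    toy_mem_type_t mem_type;
    int            valid; /* 0 means the slot is empty (entry removed) */
} toy_entry_t;

#define TOY_CACHE_SIZE 8

typedef struct {
    toy_entry_t entries[TOY_CACHE_SIZE];
} toy_cache_t;

typedef enum {
    TOY_TREAT_AS_HOST,   /* no entry found: assumed to mean host memory */
    TOY_USE_CACHED_TYPE, /* entry found with a known memory type */
    TOY_RUN_DETECTION    /* entry found but marked unknown */
} toy_action_t;

/* Decide what the caller should do for an address under the assumed policy. */
toy_action_t toy_cache_classify(const toy_cache_t *cache, uintptr_t address)
{
    size_t i;

    for (i = 0; i < TOY_CACHE_SIZE; ++i) {
        const toy_entry_t *e = &cache->entries[i];

        if (e->valid && (address >= e->start) && (address < e->end)) {
            return (e->mem_type == TOY_MEM_UNKNOWN) ? TOY_RUN_DETECTION :
                                                      TOY_USE_CACHED_TYPE;
        }
    }

    return TOY_TREAT_AS_HOST;
}
```

Under this policy, dropping an entry silently reclassifies the range as host memory, whereas marking it unknown keeps the range visible and defers the decision to detection.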

region = ucs_derived_of(pgt_region, ucs_memtype_cache_region_t);
*mem_info = region->mem_info;
mem_info->base_address = (void*)region->super.start;
mem_info->alloc_length = region->super.end - region->super.start;
Contributor

Can region merges not result in alloc_length > actual allocation length? Is it disallowed again because we disallow region merges of contiguous allocations?

Contributor Author

yes, it's because of that (search_end = end - 1;)
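
A small sketch of why `search_end = end - 1` keeps contiguous allocations apart. The names `range_t`, `range_hits_search` and `should_merge` are hypothetical, and the overlap test is an assumption about how an inclusive search window behaves, not a copy of the page table search: a region that starts exactly at `end` falls outside the window and is therefore never picked up for merging.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open range [start, end). */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} range_t;

/* Return nonzero if `r` intersects the inclusive search window
 * [search_start, search_end]. */
int range_hits_search(range_t r, uintptr_t search_start, uintptr_t search_end)
{
    return (r.start <= search_end) && (r.end > search_start);
}

/* Decide whether an existing region would be found (and hence merged)
 * when inserting `incoming`. */
int should_merge(range_t existing, range_t incoming)
{
    /* The last byte of the incoming range, so that a neighbour starting
     * exactly at incoming.end is not matched. */
    uintptr_t search_end = incoming.end - 1;

    return range_hits_search(existing, incoming.start, search_end);
}
```

For an incoming range `[0x1000, 0x2000)`, a neighbour `[0x2000, 0x3000)` is not matched, while an overlapping `[0x1800, 0x3000)` is. Because adjacent allocations stay separate, a merged range never extends into a neighbouring allocation.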

@@ -126,6 +127,16 @@ ucs_status_t ucp_mem_rereg_mds(ucp_context_h context, ucp_md_map_t reg_md_map,
ucp_memory_detect_internal(context, address, length, &mem_info);
base_address = mem_info.base_address;
Contributor

i'd move base_address and reg_length initialization from lines 123-124 to the else branch of this if

struct ucs_memtype_cache_region {
ucs_pgt_region_t super; /**< Base class - page table region */
ucs_list_link_t list; /**< List element */
ucs_memory_type_t mem_type; /**< Memory type, use uint8 for compact size */
Contributor

was the intent to use uint8_t?

Contributor Author

no, since it's replacing ucs_memory_info_t which has ucs_memory_type_t

Contributor

then the comment is misleading

Contributor Author

yeah :)

@pentschev
Contributor

@yosefe as we discussed in our call earlier today, even with this PR I see the error below with driver 495.44/CUDA 11.5:

$ UCX_MEMTYPE_CACHE=y UCX_MAX_RNDV_RAILS=2 UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES="cuda" UCX_CUDA_COPY_MAX_REG_RATIO=1.0 UCX_IB_REG_MT_THRESH=inf python dask_cuda/benchmarks/local_cupy.py -d 0,1,2,3,4,5,6,7 --all-to-all --rmm-pool-size 29GiB --runs 10 -p ucx
...
[dgx13:34831:0:34831] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f54bbebc400)
==== backtrace (tid:  34831) ====
 0  /datasets/pentschev/miniconda3/envs/ucx-tmp2/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f5ea4523bb4]
 1  /datasets/pentschev/miniconda3/envs/ucx-tmp2/lib/libucs.so.0(+0x30dcf) [0x7f5ea4523dcf]
 2  /datasets/pentschev/miniconda3/envs/ucx-tmp2/lib/libucs.so.0(+0x310f4) [0x7f5ea45240f4]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f5f4543a980]
 4  /lib/x86_64-linux-gnu/libc.so.6(+0x18ea93) [0x7f5f44818a93]
 5  /datasets/pentschev/miniconda3/envs/ucx-tmp2/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_short+0x143) [0x7f5e70aaffb3]
 6  /datasets/pentschev/miniconda3/envs/ucx-tmp2/lib/libucp.so.0(+0xa28fb) [0x7f5eaccb48fb]
 7  /datasets/pentschev/miniconda3/envs/ucx-tmp2/lib/libucp.so.0(ucp_tag_send_nbx+0xacc) [0x7f5eaccc676c]
 8  /datasets/pentschev/miniconda3/envs/ucx-tmp2/lib/libucp.so.0(ucp_tag_send_nb+0x4e) [0x7f5eaccc5b7e]
 9  /datasets/pentschev/miniconda3/envs/ucx-tmp2/lib/python3.8/site-packages/ucp/_libs/ucx_api.cpython-38-x86_64-linux-gnu.so(+0x60e62) [0x7f5ea3ff2e62]

I created a reproducer script, but on Prom with driver 460.32.03 it caused the ibv_reg_mr error instead, so it will probably require a newer driver to trigger the error above.

@yosefe
Contributor Author

yosefe commented Jan 23, 2022

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@yosefe
Contributor Author

yosefe commented Jan 25, 2022

@brminich changed to ucs_memory_type_t; using uint8_t does not really improve the size - it's 24 in both cases

$ ./build-devel/src/tools/info/ucx_info -y|grep memory_in
    sizeof(ucs_memory_info_t) = ........... 24    
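
The size observation is what structure padding would predict. The sketch below uses two hypothetical structs, not the real `ucs_memory_info_t` layout, and assumes a typical 64-bit ABI where pointers and `size_t` are 8-byte aligned: the pointer field must start on an 8-byte boundary, so a 1-byte type field is padded out just like a 4-byte enum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SKETCH_MEM_HOST, SKETCH_MEM_CUDA } sketch_mem_type_t;

/* Hypothetical layout with an enum-typed field (usually 4 bytes). */
typedef struct {
    sketch_mem_type_t type;
    void              *base_address;
    size_t            alloc_length;
} info_enum_t;

/* Same layout, but with the type narrowed to one byte. */
typedef struct {
    uint8_t type;
    void    *base_address;
    size_t  alloc_length;
} info_u8_t;

/* Bytes saved by narrowing the type field; expected to be zero on a
 * typical 64-bit ABI because of alignment padding before base_address. */
size_t bytes_saved_by_uint8(void)
{
    return sizeof(info_enum_t) - sizeof(info_u8_t);
}
```

Narrowing the field would only pay off if other small members were packed next to it, which is not the case in this sketch.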

Instead of tracking base_address/alloc_length in each memtype cache
region, use start/end fields to track whole-allocation range.
This makes sure the region info after merge stays correct.
@yosefe yosefe force-pushed the topic/ucp-ucs-uct-fix-memtype-cache-region branch from 70dda1c to 647242a Compare January 25, 2022 12:02
@yosefe yosefe merged commit cda6aae into openucx:master Jan 27, 2022
@yosefe yosefe deleted the topic/ucp-ucs-uct-fix-memtype-cache-region branch January 27, 2022 11:27
Successfully merging this pull request may close these issues.

RDMA_READ errors without UCX_MEMTYPE_CACHE=n