-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCP/UCS/UCT: Fix memtype_cache region info after merge #7791
UCP/UCS/UCT: Fix memtype_cache region info after merge #7791
Conversation
@@ -126,6 +127,16 @@ ucs_status_t ucp_mem_rereg_mds(ucp_context_h context, ucp_md_map_t reg_md_map, | |||
ucp_memory_detect_internal(context, address, length, &mem_info); | |||
base_address = mem_info.base_address; | |||
reg_length = mem_info.alloc_length; | |||
end_address = UCS_PTR_BYTE_OFFSET(base_address, reg_length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is end_address used once populated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's used for ucs_trace and ucs_assertv
ucs_memtype_cache_update_internal(ucs_memtype_cache_global_instance, | ||
address, size, &mem_info, | ||
address, size, UCS_MEMORY_TYPE_UNKNOWN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe do you expect any issues in removing a region by marking it as unknown memory type even though it may have a valid memtype? I don't immediately see a problem in update_internal
as it seems to consider mem_type for insert operation alone but wanted to double check. Wondering if it's better to detect and remove with valid mem_type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO setting to UNKNOWN is safer, since removing it completely is essentially marking it as host memory
region = ucs_derived_of(pgt_region, ucs_memtype_cache_region_t); | ||
*mem_info = region->mem_info; | ||
mem_info->base_address = (void*)region->super.start; | ||
mem_info->alloc_length = region->super.end - region->super.start; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can region merges not result in alloc_length > actual allocation length? Is it disallowed again because we disallow region merges of contiguous allocations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's because of that (search_end = end - 1;
)
@@ -126,6 +127,16 @@ ucs_status_t ucp_mem_rereg_mds(ucp_context_h context, ucp_md_map_t reg_md_map, | |||
ucp_memory_detect_internal(context, address, length, &mem_info); | |||
base_address = mem_info.base_address; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd move base_address
and reg_length
initialization from lines 123-124 to the else
branch of this if
struct ucs_memtype_cache_region { | ||
ucs_pgt_region_t super; /**< Base class - page table region */ | ||
ucs_list_link_t list; /**< List element */ | ||
ucs_memory_type_t mem_type; /**< Memory type, use uint8 for compact size */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was the intent to use uint8_t
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, since it's replacing ucs_memory_info_t which has ucs_memory_type_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then the comment is misleading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah :)
@yosefe as we discussed in our call earlier today, even with this PR I see the error below with driver 495.44/CUDA 11.5:
I created a reproducer script, but on Prom with driver 460.32.03 it caused the ibv_reg_mr error instead, so it will probably require a newer driver to trigger the error above. |
/azp run |
Azure Pipelines successfully started running 3 pipeline(s). |
@brminich changed to ucs_memory_type_t; using uint8_t does not really improve the size - it's 24 in both cases $ ./build-devel/src/tools/info/ucx_info -y|grep memory_in
sizeof(ucs_memory_info_t) = ........... 24 |
Instead of tracking base_address/alloc_length in each memtype cache region, use start/end fields to track whole-allocation range. This makes sure the region info after merge stays correct.
70dda1c
to
647242a
Compare
Why
Fix #7575
The issue was that whole-alloc logic returned wrong region boundaries, so the registered memory did not contain the actual communication buffer.
How
Test status
Currently, the test case fails with ibv_reg_mr: Bad address, instead of Local protection error, but it could be a driver issue. Reproducer: