-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/CUDA_IPC: MD_FLAG_INVALIDATE stub implementation #7522
UCT/CUDA_IPC: MD_FLAG_INVALIDATE stub implementation #7522
Conversation
@pentschev Can you check if this PR prevents cuda-ipc from being dropped with error callbacks usage in ucxpy? |
Can one of the admins verify this patch? |
ok to test |
@yosefe are the errors related here? I see tests ending with terminated status unexpectedly. |
Sorry for the delayed response, but I wanted to make sure I tested performance and all of our tests too. Everything is looking good with this, performance is restored. Thanks so much @Akshay-Venkatesh ! |
/azp run |
Azure Pipelines successfully started running 3 pipeline(s). |
@Akshay-Venkatesh the failures seem related to this PR |
Interestingly, this branch seemed to have worked before, but I now see various issues, for example:
Followed by messages on the Python side:
It seems that the size is being corrupted somewhere. What changed in our environment since is we upgraded from MOFED 4 to MOFED 5, in case that information might make any sense. Some other tests entirely segfault:
|
I assume UCX was rebuilt from source after this upgrade?
|
Yes, several times and it sometimes causes the processes to go into uninterruptible sleep state, requiring a reboot. |
@yosefe This is also a target for ucx-1.12 and I'm not sure if the errors we're seeing are related to the PR. |
I think it's related: This PR fails all GPU tests consistently, while all other PRs run to completion successfully |
I also think they are related, as per details in #7522 (comment) . And to be clear, this is a blocker for UCX-Py in UCX 1.12. |
@pentschev I'm looking at possible issue with rcache alignment but I don't expect the following issue to come from this PR:
Are you applying https://github.com/openucx/ucx/pull/7522.diff on top of master to test md_invalidate support? If not, can you try that? because the errors resemble those we fixed with #7485 |
47c96c8
to
3829297
Compare
@pentschev I've updated the PR again without using rcache. Can you check if you still get cuda_ipc support with this? (hopefully there are no errors as you saw previously) cc @yosefe |
Sorry @Akshay-Venkatesh , you're right. I mixed things up, the |
@Akshay-Venkatesh tests and benchmarks are looking good with the PR, including CUDA IPC performance, so definitely a +1 to get this in for UCX-Py. Thanks so much for all the work here! |
test/gtest/uct/test_md.cc
Outdated
@@ -655,6 +655,10 @@ UCS_TEST_SKIP_COND_P(test_md, invalidate, !check_caps(UCT_MD_FLAG_INVALIDATE)) | |||
ucs_status_t status; | |||
uct_md_mem_dereg_params_t params; | |||
|
|||
if (!strcmp(GetParam().md_name.c_str(), "cuda_ipc")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (GetParam().md_name == "cuda_ipc")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or better: has_cuda_ipc()
@yosefe errors look unrelated to the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls squash without rebase
0f48007
to
8e8203e
Compare
@@ -653,6 +653,10 @@ UCS_TEST_SKIP_COND_P(test_md, invalidate, !check_caps(UCT_MD_FLAG_INVALIDATE)) | |||
ucs_status_t status; | |||
uct_md_mem_dereg_params_t params; | |||
|
|||
if (GetParam().md_name == "cuda_ipc") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about using has_cuda_ipc() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed that has_cuda_ipc is a virtual defined in uct_test class which is not accessible in this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, test_md does not inherit from uct_test
What
Support MD_FLAG_INVALIDATE using rcache to track memory regions