UCT/CUDA: Return unreachable from rkey unpack in case of error - v1.16.x #9717

Merged
merged 1 commit into from
Mar 10, 2024

Conversation

brminich
Contributor

What

Always return unreachable from uct_cuda_ipc_rkey_unpack on error, so that the CUDA IPC key is ignored when it can't be used (e.g. no CUDA device is set for this process).
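
To illustrate the idea, here is a minimal sketch of why an "unreachable" result lets a caller skip one key without failing the whole unpack. Everything in it is hypothetical: the `sketch_` names, the status enum, and the bitmap of usable keys are illustration-only stand-ins, not the actual UCX code, and the caller logic is an assumption about how such a status could be consumed.

```c
#include <stdint.h>

/* Hypothetical status codes; stand-ins for the library's real ones. */
typedef enum {
    SKETCH_OK              = 0,
    SKETCH_ERR_UNREACHABLE = -1
} sketch_status_t;

/* Hypothetical unpack step. has_device models whether a CUDA device
 * is set for the calling process. */
sketch_status_t sketch_rkey_unpack(int has_device)
{
    if (!has_device) {
        /* The key cannot be used here: report unreachable
         * instead of a hard error. */
        return SKETCH_ERR_UNREACHABLE;
    }
    return SKETCH_OK;
}

/* Hypothetical caller: unpacks one key per entry (count must be < 32)
 * and records the usable ones as bits in *usable_map. */
sketch_status_t sketch_unpack_all(const int *has_device, unsigned count,
                                  uint32_t *usable_map)
{
    unsigned i;

    *usable_map = 0;
    for (i = 0; i < count; ++i) {
        sketch_status_t status = sketch_rkey_unpack(has_device[i]);

        if (status == SKETCH_ERR_UNREACHABLE) {
            continue; /* ignore this key, keep going with the others */
        }
        if (status != SKETCH_OK) {
            return status; /* any other failure aborts the whole unpack */
        }
        *usable_map |= (uint32_t)1 << i;
    }
    return SKETCH_OK;
}
```

With inputs `{1, 0, 1}` the second key is skipped while the overall call still succeeds, which is the behavior this change is after for a process with no CUDA device set.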

rakhmets
rakhmets previously approved these changes Feb 28, 2024
tvegas1
tvegas1 previously approved these changes Feb 28, 2024
@brminich brminich changed the title UCT/CUDA: Return unreachable from rkey unpack in case of error UCT/CUDA: Return unreachable from rkey unpack in case of error v1.16.x Feb 29, 2024
@brminich brminich changed the title UCT/CUDA: Return unreachable from rkey unpack in case of error v1.16.x UCT/CUDA: Return unreachable from rkey unpack in case of error - v1.16.x Feb 29, 2024
@rakhmets
Collaborator

rakhmets commented Mar 1, 2024

Do we need the fix in master branch?

@brminich
Contributor Author

brminich commented Mar 1, 2024

Do we need the fix in master branch?

Yes, let's wait for @yosefe's approval.

UCT_CUDA_IPC_GET_DEVICE(this_device);
UCT_CUDA_IPC_DEVICE_GET_COUNT(num_devices);
if ((CUDA_SUCCESS != cuCtxGetDevice(&this_device)) ||
    (CUDA_SUCCESS != cuDeviceGetCount(&num_devices))) {
Contributor

I think we can keep the error message if num_devices query fails, since it should not happen at this point

Contributor Author

fixed in place
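
For readers following the exchange above, a rough sketch of the distinction being requested: a failed current-device query is treated as an expected condition and stays quiet, while a failed device-count query still reports an error, and both paths return unreachable. This is an assumption about the shape of the fix, not the merged code; the `demo_` names, the callback type, and the error counter are all hypothetical and stand in for the real CUDA driver calls and the library's logging.

```c
#include <stdio.h>

/* Hypothetical status codes; stand-ins for the library's real ones. */
typedef enum {
    DEMO_OK              = 0,
    DEMO_ERR_UNREACHABLE = -1
} demo_status_t;

/* Hypothetical query callback: returns 0 on success, nonzero on failure.
 * It stands in for calls such as cuCtxGetDevice or cuDeviceGetCount. */
typedef int (*demo_query_fn)(int *value);

/* Counts emitted error messages so the behavior can be checked. */
int demo_error_count = 0;

demo_status_t demo_check_devices(demo_query_fn get_device,
                                 demo_query_fn get_count)
{
    int this_device = 0;
    int num_devices = 0;

    if (get_device(&this_device) != 0) {
        /* Expected when no device is set for this process: stay quiet. */
        return DEMO_ERR_UNREACHABLE;
    }

    if (get_count(&num_devices) != 0) {
        /* Not expected at this point: keep the error message. */
        fprintf(stderr, "device count query failed\n");
        ++demo_error_count;
        return DEMO_ERR_UNREACHABLE;
    }

    return DEMO_OK;
}

/* Hypothetical query stubs used to exercise both paths. */
int demo_query_ok(int *value)   { *value = 1; return 0; }
int demo_query_fail(int *value) { (void)value; return 1; }
```

Calling `demo_check_devices(demo_query_ok, demo_query_fail)` returns `DEMO_ERR_UNREACHABLE` and increments `demo_error_count`, whereas a failing device query returns the same status without logging.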

@brminich brminich dismissed stale reviews from tvegas1 and rakhmets via 11511cb March 5, 2024 17:23
@brminich brminich force-pushed the uct/cuda_ipc_fix_rkey_unpack branch from 03e8955 to 11511cb Compare March 5, 2024 17:23
@brminich brminich force-pushed the uct/cuda_ipc_fix_rkey_unpack branch from 11511cb to 9cdac44 Compare March 6, 2024 15:23
@yosefe yosefe enabled auto-merge March 8, 2024 11:00
@yosefe yosefe merged commit 10d785d into openucx:v1.16.x Mar 10, 2024
107 checks passed
4 participants