Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "accelerator/cuda: Check for cuda devices" #12157

Merged
merged 1 commit into from
Dec 8, 2023

Conversation

wenduwan
Copy link
Contributor

@wenduwan wenduwan commented Dec 8, 2023

This reverts commit ae98e04.

This is a temporary mitigation for #12156

@wenduwan wenduwan self-assigned this Dec 8, 2023
@edgargabriel
Copy link
Member

can you confirm that you are indeed not segfaulting if you remove these lines? Basically, if the dso cannot load for whatever reason on some processes, its irrelevant whether you segfault in the init function vs. another cuda function call later

@wenduwan
Copy link
Contributor Author

wenduwan commented Dec 8, 2023

Yes. By reverting this change I can reliably pass the test.

wrt segfault, the only trace I got is pointing to cuda, according to dmesg

[79590.726378] cuda00001400006[73833]: segfault at 7f7f60fddb3f ip 00007f7f61fa7407 sp 00007f7f60d23eb0 error 4 in libgcc_s-7-20180712.so.1[7f7f61f99000+15000]
[79590.734804] Code: bb 0c 00 00 00 e9 f2 fe ff ff 40 80 ff 08 75 9d 80 78 01 00 75 97 0f b6 78 02 48 83 c0 02 e9 17 fd ff ff 49 8b 85 98 00 00 00 <80> 38 48 0f 85 67 fe ff ff 48 ba c7 c0 0f 00 00 00 0f 05 48 39 50

@wenduwan wenduwan merged commit 1ab256b into open-mpi:main Dec 8, 2023
10 checks passed
@hppritcha
Copy link
Member

so what are we going to do about #12055 I thought we decided about a month ago that the accelerator framework was designed to use DSOs for the components

@wenduwan
Copy link
Contributor Author

wenduwan commented Dec 8, 2023

@hppritcha This issue is manifested by the DSO build but it does not stop us from doing DSO build.

After reverting this change our CI started passing again(we do DSO build)

So I think you should go ahead with #12055 while we figure out #12156

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants