accelerator framework/cuda: still not entirely fixed #11354
Comments
ACK, there shouldn't be any event create/destroy if cuda init hasn't been called. I fixed a bug in the ob1 path to address this but must have missed this one. It should be easy enough to figure out.
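To make the rule in the comment above concrete, here is a minimal sketch of event create/destroy guarded by an "init happened" flag. Every name in it (`sketch_accel_init`, `sketch_event_create`, and so on) is hypothetical and invented for illustration; none of this is Open MPI's accelerator framework API or the CUDA API, and the real fix may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: these are NOT Open MPI or CUDA types/functions. */
typedef struct {
    bool valid;   /* true only if create actually succeeded */
} sketch_event_t;

static bool sketch_accel_initialized = false;  /* set only when init succeeds */
static int  sketch_live_events = 0;            /* events currently created */

/* Pretend initialization; 'device_available' models whether cuda init worked. */
int sketch_accel_init(bool device_available)
{
    sketch_accel_initialized = device_available;
    return device_available ? 0 : -1;
}

int sketch_event_create(sketch_event_t *ev)
{
    assert(ev != NULL);
    ev->valid = false;
    if (!sketch_accel_initialized) {
        return -1;   /* init never happened: do not touch the device */
    }
    ev->valid = true;
    sketch_live_events++;
    return 0;
}

int sketch_event_destroy(sketch_event_t *ev)
{
    assert(ev != NULL);
    if (!sketch_accel_initialized || !ev->valid) {
        return 0;    /* nothing was created, so there is nothing to destroy */
    }
    ev->valid = false;
    sketch_live_events--;
    return 0;
}

int sketch_live_event_count(void)
{
    return sketch_live_events;
}
```

With this shape, an application that never initializes the device goes through create and destroy without any device call being attempted, which is the behavior the comment asks for.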
If you'd like to fix it, go ahead and assign this to yourself. I've got some other things to fix elsewhere.
I wasn't able to reproduce this when I tried (built 5.0.x head with cuda and an application w/o cuda, and ran using the OFI MTL), so Howard said he'd take a look at reproducing it.
this problem vanishes if I don't use
i'm fine with not using the
Not being able to enable that option seems like a pretty drastic solution for the general community. The other PR has been sitting there waiting to be committed for 3 months now. Can the community find some way to make that happen? There seem to be a bunch of PRs stuck in that situation.
If something breaks when you use
Well, what I believe is happening is that the accelerator framework is being dlclosed before smcuda (part of the btl framework).
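The suspected ordering problem can be modeled with a toy sketch: a component that depends on another must be closed first, so closing in reverse open order is safe while closing the dependency first is not. All names here (`sketch_open`, `sketch_close`, `sketch_close_all`, `sketch_reset`) are hypothetical; this does not reproduce Open MPI's MCA open/close machinery or any actual `dlclose` call, and it only illustrates the hypothesis stated in the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX 8

/* Hypothetical component record: NOT an Open MPI structure. */
typedef struct {
    const char *name;
    bool open;
    int depends_on;   /* index of a component that must outlive this one, or -1 */
} sketch_component_t;

static sketch_component_t sketch_stack[SKETCH_MAX];
static int sketch_count = 0;

/* Forget all components so the model can be reused. */
void sketch_reset(void)
{
    sketch_count = 0;
}

/* Open a component; a dependency must already have been opened. */
int sketch_open(const char *name, int depends_on)
{
    assert(sketch_count < SKETCH_MAX);
    assert(depends_on < sketch_count);
    sketch_stack[sketch_count].name = name;
    sketch_stack[sketch_count].open = true;
    sketch_stack[sketch_count].depends_on = depends_on;
    return sketch_count++;
}

/* Close one component. Returns true if its dependency was still open,
 * i.e. the close could safely call into the dependency. */
bool sketch_close(int idx)
{
    assert(idx >= 0 && idx < sketch_count);
    sketch_component_t *c = &sketch_stack[idx];
    bool dep_alive = (c->depends_on < 0) || sketch_stack[c->depends_on].open;
    c->open = false;
    return dep_alive;
}

/* Close everything in reverse open order; returns the number of unsafe closes. */
int sketch_close_all(void)
{
    int unsafe = 0;
    for (int i = sketch_count - 1; i >= 0; i--) {
        if (sketch_stack[i].open && !sketch_close(i)) {
            unsafe++;
        }
    }
    return unsafe;
}
```

In this model, closing the "accelerator" entry before the "smcuda" entry makes the second close report an unsafe teardown, whereas `sketch_close_all` never does because dependents are always closed first.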
That seems like a legit bug that should be fixed, right?
I think there may be something more complicated going on. It appears that on the system I'm using, loading a "cudatoolkit" module causes the smcuda btl to be initialized even though the accelerator framework's cuda component failed to initialize. The magic in the cudatoolkit module appears to be adding libcudart.so to LD_LIBRARY_PATH. Given this, I think this is a problem specific to cuda support. I'm confused, though, about why the problem only shows up when building with
Uff, this is kind of ugly. The smcuda btl creates a cuda context after the accelerator framework has been initialized, leading to a bunch of problems. Going to open a WIP PR.
Well, now I cannot seem to reproduce these warning messages, either with main or v5.0.x. I tried both with the vendor compiler wrappers and using the gnu compiler directly.
If we can't reproduce this issue (I suspect that old versions of the accelerator framework component were loaded), can we close it?
Actually, I was able to reproduce it again, but it's not simple. It seems I only hit this problem when using the vendor-supplied compiler wrappers to build ompi.
related to open-mpi#11354 Signed-off-by: Howard Pritchard <howardp@lanl.gov>
I'm going to add a fix for the ob1 component and then mark this issue as done. Please backport the fix, @hppritcha.
Actually, I looked into it more, and ob1 shouldn't have the same issue since it occurs in component_fini rather than component_close.
related to open-mpi#11354 Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit f7803dd)
I was checking out the head of the v5.0.x branch with high expectations that it would work well on our nvidia + HPE SS11 (aka libfabric) system. Alas, if my application doesn't use CUDA, yet is linked against an ompi v5.0.x that has all the recent accelerator/cuda changes in place and is configured for CUDA support, things don't work right.
It looks like holes may have been plugged for OB1 (if I set the pml to ob1, I don't see these messages), but apparently that is not the case when using other PMLs.