Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with CUDA accelerator component initialization #11831

Closed
devreal opened this issue Jul 20, 2023 · 9 comments
Closed

Issues with CUDA accelerator component initialization #11831

devreal opened this issue Jul 20, 2023 · 9 comments
Milestone

Comments

@devreal
Copy link
Contributor

devreal commented Jul 20, 2023

We're working with the CUDA accelerator component and tried to rebase my somewhat outdated branch to current main. I believe I found an issue with the way the CUDA component is initialized: Since ae98e04 we call cuInit in accelerator_cuda_init but do not set a context. Then in every call to opal_accelerator_cuda_delayed_init henceforth (until the first call to a CUDA function by the application) we receive a NULL context from cuCtxGetCurrent and return an error (https://github.com/open-mpi/ompi/blob/main/opal/mca/accelerator/cuda/accelerator_cuda_component.c#L146). That prevents all other accelerator-related state in OMPI from properly initializing. On this particular system, at least smcuda (mca_btl_smcuda_accelerator_init) and ob1 (mca_pml_ob1_accelerator_init) do not enable accelerator support because they cannot create a stream, unless the application does call into CUDA before calling MPI_Init (because there will be a CUDA context in that case). Is this what we want?

Interestingly, before ae98e04 we would not return an error from opal_accelerator_cuda_delayed_init (because cuCtxGetCurrent returned an error code) and so the accelerator support would work properly.

I believe the same behavior exists in the 5.x release branch.

@BenWibking
Copy link

I've been trying to use 5.0rc's on a cluster with CUDA devices and reported a bug previously about broken sm+ob1 support. How does CUDA support keep breaking in 5.0rc's? Is there CI for this?

@devreal
Copy link
Contributor Author

devreal commented Jul 23, 2023

@BenWibking Is this the issue you are referring to? #10871

@BenWibking
Copy link

@BenWibking Is this the issue you are referring to? #10871

I was referring to #11399, which is now fixed.

@janjust
Copy link
Contributor

janjust commented Jul 27, 2023

@devreal I thought this was addressed with #11297 I don't think this is the intended behavior. The delayed_init was introduced specifically to avoid this scenario.

@BenWibking No CI atm, at least in external MTT.

@edgargabriel
Copy link
Member

We did discuss this ticket in the last meeting and potential resolutions. The main reason for pr ae98e04 was the change to the coll/cuda component (i.e. coll/cuda being always compiled and wanting to make sure that it disqualifies itself if no GPUs are present). That change to coll/cuda has however not been ported to 5.0. So in theory if we want we can simply revert this pr on 5.0 without any consequences.

@gpaulsen gpaulsen added this to the v5.0.0 milestone Aug 8, 2023
@qkoziol
Copy link
Contributor

qkoziol commented Aug 8, 2023

Possibly the change I made to cuda_hmem_verify_devices() in this libfabric PR would be a helpful direction to explore: ofiwg/libfabric#9170

@devreal
Copy link
Contributor Author

devreal commented Aug 14, 2023

As long as any component requires the device component to be up during its initialization (like ob1) the delayed init of the CUDA component is useless (not sure how that ever worked). One fix would be to delay the comm component's device setup (i.e., streams) until we see the first communication. I guess this was the intent of #11253 but it missed the stream creation.

@lrbison
Copy link
Contributor

lrbison commented Aug 15, 2023

Discussed on developers call today. @bosilca reminded us that this issue should really target milestone 5.0.1, as the delayed init code is in main but not v5.0.x branch. I'm changing the target tags accordingly.

@wenduwan
Copy link
Contributor

ae98e04 has been reverted in #12157.

Note that this change was only on main. v5.0.x does not have this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants