Issues with CUDA accelerator component initialization #11831

devreal · 2023-07-20T20:26:24Z

I'm working with the CUDA accelerator component and tried to rebase my somewhat outdated branch onto current main. I believe I found an issue with the way the CUDA component is initialized: since ae98e04 we call cuInit in accelerator_cuda_init but do not set a context. From then on, every call to opal_accelerator_cuda_delayed_init (until the application makes its first call to a CUDA function) receives a NULL context from cuCtxGetCurrent and returns an error (https://github.com/open-mpi/ompi/blob/main/opal/mca/accelerator/cuda/accelerator_cuda_component.c#L146). That prevents all other accelerator-related state in OMPI from initializing properly. On this particular system, at least smcuda (mca_btl_smcuda_accelerator_init) and ob1 (mca_pml_ob1_accelerator_init) do not enable accelerator support because they cannot create a stream, unless the application calls into CUDA before calling MPI_Init (because there will be a CUDA context in that case). Is this what we want?

Interestingly, before ae98e04 we would not return an error from opal_accelerator_cuda_delayed_init (because cuCtxGetCurrent returned an error code) and so the accelerator support would work properly.

I believe the same behavior exists in the 5.x release branch.
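Editorial sketch: the before/after behavior described in this comment can be modelled without a GPU. The block below is a toy model, not the Open MPI source. Every name in it (model_driver_t, model_ctx_get_current, model_delayed_init, the MODEL_* and DRV_* codes) is hypothetical. It assumes, as the comment states, that the delayed init did not surface an error when the context query itself failed, and that it does return an error when the query succeeds with a NULL context.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for driver result codes and OPAL return codes. */
typedef enum { DRV_SUCCESS = 0, DRV_ERROR_NOT_INITIALIZED = 3 } drv_result_t;
enum { MODEL_SUCCESS = 0, MODEL_ERROR = -1 };

/* Toy model of the per-process driver state. */
typedef struct {
    bool  driver_initialized; /* has the driver init call (cuInit) happened? */
    void *current_ctx;        /* context bound to the caller, or NULL        */
} model_driver_t;

/* Models the context query: it fails if the driver was never initialized,
 * otherwise it succeeds and may hand back a NULL context. */
drv_result_t model_ctx_get_current(const model_driver_t *drv, void **ctx)
{
    if (!drv->driver_initialized) {
        return DRV_ERROR_NOT_INITIALIZED;
    }
    *ctx = drv->current_ctx;
    return DRV_SUCCESS;
}

/* Simplified delayed init, following the description in the issue. */
int model_delayed_init(const model_driver_t *drv)
{
    void *ctx = NULL;

    if (DRV_SUCCESS != model_ctx_get_current(drv, &ctx)) {
        /* Assumption taken from the issue text: this path did not
         * report an error to the caller before ae98e04. */
        return MODEL_SUCCESS;
    }
    if (NULL == ctx) {
        /* Driver initialized, but the application has no context yet. */
        return MODEL_ERROR;
    }
    return MODEL_SUCCESS;
}
```

In this model, a state with driver_initialized = false corresponds to the situation before ae98e04, and driver_initialized = true with a NULL current_ctx corresponds to the situation after it, which is the case that makes the dependent components give up on accelerator support.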

BenWibking · 2023-07-21T22:01:34Z

I've been trying to use 5.0rc's on a cluster with CUDA devices and reported a bug previously about broken sm+ob1 support. How does CUDA support keep breaking in 5.0rc's? Is there CI for this?

devreal · 2023-07-23T09:30:50Z

@BenWibking Is this the issue you are referring to? #10871

BenWibking · 2023-07-23T15:08:21Z

> @BenWibking Is this the issue you are referring to? #10871

I was referring to #11399, which is now fixed.

janjust · 2023-07-27T16:08:39Z

@devreal I thought this was addressed with #11297. I don't think this is the intended behavior. The delayed_init was introduced specifically to avoid this scenario.

@BenWibking No CI atm, at least in external MTT.

edgargabriel · 2023-08-04T15:00:15Z

We discussed this ticket and potential resolutions in the last meeting. The main reason for PR ae98e04 was the change to the coll/cuda component (i.e., coll/cuda always being compiled, and wanting to make sure that it disqualifies itself if no GPUs are present). That change to coll/cuda has, however, not been ported to 5.0. So in theory, if we want, we can simply revert this PR on 5.0 without any consequences.

qkoziol · 2023-08-08T15:42:59Z

Possibly the change I made to cuda_hmem_verify_devices() in this libfabric PR would be a helpful direction to explore: ofiwg/libfabric#9170

devreal · 2023-08-14T20:54:26Z

As long as any component requires the device component to be up during its initialization (like ob1) the delayed init of the CUDA component is useless (not sure how that ever worked). One fix would be to delay the comm component's device setup (i.e., streams) until we see the first communication. I guess this was the intent of #11253 but it missed the stream creation.
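Editorial sketch: the fix suggested in this comment, deferring stream setup until the first communication, could look roughly like the following. This is a GPU-free toy model under stated assumptions, not code from ob1 or smcuda. The names lazy_comm_t, lazy_comm_init, lazy_comm_ensure_stream, and the LAZY_* codes are hypothetical, and the context_available flag stands in for whatever check a real component would perform.

```c
#include <stdbool.h>

/* Hypothetical return codes for this sketch. */
enum { LAZY_OK = 0, LAZY_NO_DEVICE = -1 };

/* Per-component state: only records whether the stream exists yet. */
typedef struct {
    bool stream_created;
} lazy_comm_t;

/* Component init does no device work, so it cannot fail merely because
 * the application has not touched the device yet. */
void lazy_comm_init(lazy_comm_t *comm)
{
    comm->stream_created = false;
}

/* Called at the top of every communication path. The stream is created
 * on the first call that finds a usable context; earlier calls report
 * LAZY_NO_DEVICE and leave the state unchanged so a later call can retry. */
int lazy_comm_ensure_stream(lazy_comm_t *comm, bool context_available)
{
    if (comm->stream_created) {
        return LAZY_OK;
    }
    if (!context_available) {
        return LAZY_NO_DEVICE;
    }
    /* A real component would create its stream here. */
    comm->stream_created = true;
    return LAZY_OK;
}
```

The point of the design is that a failed attempt is not sticky: the component stays eligible for accelerator support and simply retries at the next communication call, instead of disabling itself once during MPI_Init.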

lrbison · 2023-08-15T15:29:21Z

Discussed on the developers call today. @bosilca reminded us that this issue should really target milestone 5.0.1, as the delayed init code is in main but not in the v5.0.x branch. I'm changing the target tags accordingly.

wenduwan · 2024-04-24T21:28:12Z

ae98e04 has been reverted in #12157.

Note that this change was only on main. v5.0.x does not have this issue.

devreal added the Target: main label Jul 20, 2023

jsquyres added the Target: v5.0.x label Jul 25, 2023

gpaulsen added this to the v5.0.0 milestone Aug 8, 2023

lrbison removed the Target: v5.0.x label Aug 15, 2023

jsquyres modified the milestones: v5.0.0, v5.0.1 Oct 30, 2023

rgayatri23 mentioned this issue Dec 12, 2023

trouble running openmpi+pmix in rootless podman-hpc container #12146

Open

janjust modified the milestones: v5.0.1, v5.0.2 Jan 8, 2024

jsquyres modified the milestones: v5.0.2, v5.0.3 Feb 13, 2024

wenduwan closed this as completed Apr 24, 2024
