Newer versions of OpenMPI are unable to locate CUDA support. #12264

tmh97 · 2024-01-22T23:48:03Z

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

The bug exists in 5.0.1; I don't know whether it also exists in earlier or later releases.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

This issue exists with both the source tarball and a git clone; I've tested both.

Please describe the system on which you are running

Two node system

Operating system/version: RHEL 9.2 (Plow)
Computer hardware: x86_64

Details of the problem

I used to be able to get CUDA support with Open MPI by simply passing the --with-cuda=/usr/local/cuda option to configure. Now it seems I also need the --with-cuda-libdir option. Without this additional flag, the build reports no support for NVIDIA devices (CUDA support: no). I believe this will cause problems for users who rebuild Open MPI at a newer version and suddenly find that their CUDA support is gone.
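One way to confirm whether a given build ended up with CUDA support is to inspect the output of ompi_info. The sketch below assumes the parsable output contains a line ending in mpi_built_with_cuda_support:value:true (or :false), as the Open MPI CUDA documentation describes; the helper name cuda_support_from_ompi_info is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `ompi_info --parsable --all` output on stdin and
# prints "yes", "no", or "unknown" depending on the CUDA support line.
cuda_support_from_ompi_info() {
    awk -F: '
        /mpi_built_with_cuda_support:value:/ { found = $NF }
        END {
            if (found == "true") print "yes"
            else if (found == "false") print "no"
            else print "unknown"
        }'
}

# Typical use, assuming ompi_info from the new build is first in PATH:
#   ompi_info --parsable --all | cuda_support_from_ompi_info
```

Running this before and after an upgrade would show whether the rebuild silently lost CUDA support.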

tmh97 · 2024-01-23T16:30:03Z

As @hppritcha pointed out, this is indeed documented: https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html

jsquyres · 2024-01-23T20:35:53Z

@tmh97 Per the Webex today, could you provide a little more info? E.g.:

As you stated above, running ./configure --with-cuda=/usr/local/cuda ... fails to find CUDA support.
Does running ./configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64 ... work?
- I.e., what is the specific libdir that you provide to --with-cuda-libdir that makes this work?

tmh97 · 2024-01-23T21:05:01Z

@jsquyres --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/ worked well for Open MPI 5.0.0 and 5.0.1.

It seems /usr/local/cuda/lib64 is where the CUDA runtime API resides. I believe this is the path we wish to target.

Alternatively, /usr/lib64 also contains CUDA-related files, but I believe these are for the CUDA driver API, which is not what we want (I think).
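To check which directory holds which library, something like the following could be run on the build node. It is only a sketch: the function name list_cuda_libs is hypothetical, and it assumes the driver library is named libcuda.so* and the runtime library libcudart.so*.

```shell
#!/bin/sh
# Hypothetical helper: for each directory given, report whether it contains
# the CUDA driver library (libcuda.so*) and/or the runtime (libcudart.so*).
list_cuda_libs() {
    for dir in "$@"; do
        driver=no
        runtime=no
        for f in "$dir"/libcuda.so*; do
            # An unmatched glob stays literal, so test that the file exists.
            if [ -e "$f" ]; then driver=yes; break; fi
        done
        for f in "$dir"/libcudart.so*; do
            if [ -e "$f" ]; then runtime=yes; break; fi
        done
        echo "$dir: driver=$driver runtime=$runtime"
    done
}

# Paths taken from the discussion above; adjust for the local install.
list_cuda_libs /usr/local/cuda/lib64 /usr/local/cuda/lib64/stubs /usr/lib64
```

The output depends entirely on the local installation, so no expected result is shown here.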

jsquyres · 2024-01-23T21:44:14Z

Do we know that that is correct?

You're saying --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64 works.
But the docs @hppritcha cited state that --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs is correct. I.e., this stubs folder at the end of the libdir is needed.

Given that the docs were specifically written that way, is it correct to assume that there is a reason stubs is the correct way, and not including stubs in the libdir is wrong for some reason?

Alternatively, @edgargabriel stated today on the call that configuring --with-lustre=/blah didn't work to find the Lustre libraries in /blah/lib64.

@edgargabriel Can you confirm that this is correct / what is currently happening on main and v5.0.x?

lrbison · 2024-01-23T21:47:04Z

On Ubuntu 20.04, I need:

# Open MPI 4.1:
./configure --with-cuda
# Open MPI main:
./configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs

Note that I seem to need to specify paths for both cuda and cuda-libdir. Adding a path for libdir alone was not enough.

jsquyres · 2024-01-23T21:52:51Z

Yes, having to specify both --with-cuda and --with-cuda-libdir is expected. I'm asking if the stubs part is really necessary -- the docs were clearly written that way on purpose. And why does not specifying stubs work for @tmh97?

edgargabriel · 2024-01-23T22:13:08Z

I went back to the cluster with the Lustre file system, and I can see clearly in bash_history that for a while I configured Open MPI with the --with-lustre=/opt/lustre/2.12.2 --with-lustre-libdir=/opt/lustre/2.12.2/lib64 arguments. Since I didn't do that in the past, it was probably because it wasn't working without them (which is also what I remembered).

However, as of right now, it looks like I don't need to set the --with-lustre-libdir anymore, it configures correctly again without having to provide that argument.

jsquyres · 2024-01-23T22:21:03Z

Ok, so then this question really is just about --with-cuda -- not the general OAC --with-FOO handling.

Is it incorrect to not specify the stubs folder in the --with-cuda-libdir? (the docs imply that stubs is necessary)
Can config/opal_check_cuda.m4 be updated to automagically handle searching for stubs?

bosilca · 2024-01-23T22:56:38Z

The stubs point to a libcuda.so that allows linking CUDA applications using the driver API (such as OMPI) on platforms without GPUs. This is different from what other libraries require, but there are valid reasons. I'll vote for automatically checking for the stubs in config/opal_check_cuda.m4.
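The kind of automatic search being proposed could look roughly like the sketch below. This is not the logic in config/opal_check_cuda.m4; the function name guess_cuda_libdir and the candidate order are assumptions chosen for illustration, with each stubs directory tried right after its parent library directory.

```shell
#!/bin/sh
# Hypothetical sketch: print the first directory under a CUDA prefix that
# contains libcuda.so, checking the stubs subdirectories as well.
# Returns non-zero if none of the candidates has it.
guess_cuda_libdir() {
    prefix=$1
    for candidate in \
        "$prefix/lib64" \
        "$prefix/lib64/stubs" \
        "$prefix/lib" \
        "$prefix/lib/stubs"
    do
        if [ -e "$candidate/libcuda.so" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example: build the configure arguments from a single prefix.
cuda_prefix=/usr/local/cuda
if libdir=$(guess_cuda_libdir "$cuda_prefix"); then
    echo "./configure --with-cuda=$cuda_prefix --with-cuda-libdir=$libdir"
else
    echo "libcuda.so not found under $cuda_prefix; pass --with-cuda-libdir explicitly"
fi
```

With a search like this, a user would only need to supply the prefix, and the explicit libdir flag would remain available as an override.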

jsquyres · 2024-01-24T14:41:04Z

"I'll vote for automatically checking for the stubs in config/opal_check_cuda.m4."

Cool. Can someone in NVIDIA look into this? Hint, hint. 😄

Finding CUDA libraries without having to specify both --with-cuda and --with-cuda-lib was requested in github issue open-mpi#12264
Signed-off-by: Nick Sarkauskas <nsarkauskas@nvidia.com>

Finding CUDA libraries without having to specify both --with-cuda and --with-cuda-lib was requested in github issue #12264
Signed-off-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
(cherry picked from commit cad3d9a)

janjust · 2024-03-06T16:32:27Z

fixed with #12382

jsquyres added the question label Jan 23, 2024

jsquyres added Target: main Target: v5.0.x labels Jan 23, 2024

jsquyres added this to the v5.0.2 milestone Jan 23, 2024

christgau mentioned this issue Jan 26, 2024

Expand CUDA support and fix documentation to account for all cuda dependent components. #12279

Open

janjust self-assigned this Jan 30, 2024

jsquyres modified the milestones: v5.0.2, v5.0.3 Feb 13, 2024

PhilipDeegan mentioned this issue Feb 14, 2024

Building cuda aware openMPI does not seem to work #12334

Closed

nsarka mentioned this issue Feb 26, 2024

Find libcuda.so automatically if --with-cuda-lib is not passed #12378

Merged

nsarka mentioned this issue Feb 28, 2024

Find libcuda.so automatically if --with-cuda-lib is not passed. #12382

Merged

janjust closed this as completed Mar 6, 2024

BKitor mentioned this issue May 1, 2024

--with-cuda failes to find libcuda.so #12509

Closed
