Newer versions of OpenMPI are unable to locate CUDA support. #12264

Closed
tmh97 opened this issue Jan 22, 2024 · 11 comments

Comments

@tmh97

tmh97 commented Jan 22, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

The bug exists in 5.0.1. I do not know whether it also exists in earlier or later releases.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

The issue occurs with both the source tarball and a git clone; I have tested both.

Please describe the system on which you are running

Two node system

  • Operating system/version: RHEL 9.2 (Plow)
  • Computer hardware: x86_64

Details of the problem

I used to be able to get CUDA support in Open MPI by simply passing --with-cuda=/usr/local/cuda to configure. Now it seems I also need --with-cuda-libdir. Without this additional flag, the build appears to have no support for NVIDIA devices (CUDA support: no). I believe this will cause problems for users who rebuild with a newer Open MPI version and suddenly find that CUDA support is gone.
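One way to confirm whether a given build ended up with CUDA support is to query ompi_info. The sketch below assumes the parsable output contains a line of the form mca:mpi:base:param:mpi_built_with_cuda_support:value:true (or false), as shown in the Open MPI CUDA documentation. The helper name cuda_support_value is hypothetical and only used for illustration.

```shell
# Hypothetical helper: reads `ompi_info --parsable --all` output on stdin
# and prints the value of mpi_built_with_cuda_support (true/false),
# or "unknown" if no such line is present.
cuda_support_value() {
    awk -F: '
        /mpi_built_with_cuda_support:value:/ { print $NF; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}

# Only query ompi_info if it is actually on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --parsable --all | cuda_support_value
fi
```

Running this before and after an upgrade makes it easy to spot a build that silently lost CUDA support.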

@tmh97
Author

tmh97 commented Jan 23, 2024

As @hppritcha pointed out, this is indeed documented at https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html.

@jsquyres
Member

@tmh97 Per the Webex today, could you provide a little more info? E.g.:

  • As you stated above, running ./configure --with-cuda=/usr/local/cuda ... fails to find CUDA support.
  • Does running ./configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64 ... work?
    • I.e., what is the specific libdir that you provide to --with-cuda-libdir that makes this work?

@tmh97
Author

tmh97 commented Jan 23, 2024

@jsquyres --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/ worked for Open MPI 5.0.0 and 5.0.1.

It seems /usr/local/cuda/lib64 is where the CUDA runtime API resides. I believe this is the path we wish to target.

Alternatively, /usr/lib64 also contains CUDA-related files, but I believe these are for the CUDA driver API, which is not what we want (I think).
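To see which library actually lives in which directory, a small check like the following can help. It is a minimal sketch, assuming the driver API library is named libcuda.so and the runtime API library libcudart.so. The function name report_cuda_libs is hypothetical, and the directories passed at the end are just the ones mentioned in this thread, so they may not exist on every system.

```shell
# Hypothetical helper: for each directory given, print which of the
# CUDA driver (libcuda.so) and runtime (libcudart.so) libraries it contains.
report_cuda_libs() {
    for dir in "$@"; do
        found=""
        for lib in libcuda.so libcudart.so; do
            if [ -e "$dir/$lib" ]; then
                found="$found $lib"
            fi
        done
        # Print " none" when neither library was found in this directory.
        printf '%s:%s\n' "$dir" "${found:- none}"
    done
}

# Directories discussed in this thread (assumed paths).
report_cuda_libs /usr/local/cuda/lib64 /usr/local/cuda/lib64/stubs /usr/lib64
```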

@jsquyres
Member

Do we know that that is correct?

  • You're saying --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64 works.
  • But the docs @hppritcha cited state that --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs is correct. I.e., this stubs folder at the end of the libdir is needed.

Given that the docs were specifically written that way, is it correct to assume that there is a reason stubs is the correct way, and not including stubs in the libdir is wrong for some reason?

Separately, @edgargabriel stated today on the call that configuring with --with-lustre=/blah did not find the Lustre libraries in /blah/lib64.

@edgargabriel Can you confirm that this is correct / what is currently happening on main and v5.0.x?

@lrbison
Contributor

lrbison commented Jan 23, 2024

On Ubuntu 20.04, I need:

# Open MPI 4.1:
./configure --with-cuda
# Open MPI main:
./configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs

Note that I seem to need to specify paths for both cuda and cuda-libdir. Adding a path for libdir alone was not enough.

@jsquyres
Member

jsquyres commented Jan 23, 2024

Yes, having to specify both --with-cuda and --with-cuda-libdir is expected. I'm asking whether the stubs part is really necessary; the docs were clearly written that way on purpose. And why does omitting stubs work for @tmh97?

@edgargabriel
Member

edgargabriel commented Jan 23, 2024

I went back to the cluster with the Lustre file system, and bash_history clearly shows that for a while I configured Open MPI with the --with-lustre=/opt/lustre/2.12.2 --with-lustre-libdir=/opt/lustre/2.12.2/lib64 arguments. Since I did not do that in the past, it was probably because configure did not work without them (which is also what I remembered).

However, as of right now, it looks like I no longer need to set --with-lustre-libdir; configure works correctly again without that argument.

@jsquyres
Member

jsquyres commented Jan 23, 2024

Ok, so then this question really is just about --with-cuda -- not the general OAC --with-FOO handling.

  1. Is it incorrect to not specify the stubs folder in the --with-cuda-libdir? (the docs imply that stubs is necessary)
  2. Can config/opal_check_cuda.m4 be updated to automagically handle searching for stubs?

@bosilca
Member

bosilca commented Jan 23, 2024

The stubs point to a libcuda.so that allows linking CUDA applications using the driver API (such as OMPI) on platforms without GPUs. This is different from what other libraries require, but there are valid reasons. I'll vote for automatically checking for the stubs in config/opal_check_cuda.m4.
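Until configure searches for the stubs itself, the lookup can be approximated by hand. The following is a rough shell sketch, not the logic in config/opal_check_cuda.m4: the function name cuda_configure_args and the search order (stubs directories first, then the plain library directories) are assumptions made for illustration.

```shell
# Hypothetical helper: given a CUDA install prefix, print the configure
# flags pointing at the first directory that contains libcuda.so.
# The search order below is an assumption, not Open MPI's actual behavior.
cuda_configure_args() {
    prefix=${1:-/usr/local/cuda}
    for sub in lib64/stubs lib/stubs lib64 lib; do
        if [ -e "$prefix/$sub/libcuda.so" ]; then
            printf '%s\n' "--with-cuda=$prefix --with-cuda-libdir=$prefix/$sub"
            return 0
        fi
    done
    echo "libcuda.so not found under $prefix" >&2
    return 1
}

# Print (rather than run) the resulting configure command.
if args=$(cuda_configure_args /usr/local/cuda); then
    echo "./configure $args"
fi
```

Preferring the stubs directory matches the reasoning above: the stub libcuda.so lets the build link against the driver API even on a machine without a GPU or an installed driver.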

@jsquyres
Member

I'll vote for automatically checking for the stubs in config/opal_check_cuda.m4.

Cool. Can someone in NVIDIA look into this? Hint, hint. 😄

@janjust janjust self-assigned this Jan 30, 2024
@jsquyres jsquyres modified the milestones: v5.0.2, v5.0.3 Feb 13, 2024
nsarka added a commit to nsarka/ompi that referenced this issue Feb 26, 2024
Finding CUDA libraries without having to specify both --with-cuda and
--with-cuda-lib was requested in github issue
open-mpi#12264

Signed-off-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
nsarka added a commit to nsarka/ompi that referenced this issue Feb 27, 2024
Finding CUDA libraries without having to specify both --with-cuda and
--with-cuda-lib was requested in github issue
open-mpi#12264

Signed-off-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
jiaxiyan pushed a commit to jiaxiyan/ompi that referenced this issue Mar 1, 2024
Finding CUDA libraries without having to specify both --with-cuda and
--with-cuda-lib was requested in github issue
open-mpi#12264

Signed-off-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
janjust pushed a commit that referenced this issue Mar 5, 2024
Finding CUDA libraries without having to specify both --with-cuda and
--with-cuda-lib was requested in github issue
#12264

Signed-off-by: Nick Sarkauskas <nsarkauskas@nvidia.com>
(cherry picked from commit cad3d9a)
@janjust
Contributor

janjust commented Mar 6, 2024

Fixed with #12382.
