accelerator framework/cuda: still not entirely fixed #11354

Closed
hppritcha opened this issue Jan 27, 2023 · 16 comments

Comments

@hppritcha
Member

I was checking out the head of the v5.0.x branch with high expectations that it would work well on our NVIDIA + HPE SS11 (aka libfabric) system, but alas: if my application doesn't use CUDA, yet is linked against an ompi v5.0.x that has all the recent accelerator/cuda changes in place and is configured for CUDA support, things don't work right.

Hello, world, I am 1 of 2, (Open MPI v5.0.0rc9, package: Open MPI hpp@ch-fe1 Distribution, ident: 5.0.0rc9, repo rev: v5.0.0rc9-287-g5d87f3e6, Unreleased developer copy, 141)
Hello, world, I am 0 of 2, (Open MPI v5.0.0rc9, package: Open MPI hpp@ch-fe1 Distribution, ident: 5.0.0rc9, repo rev: v5.0.0rc9-287-g5d87f3e6, Unreleased developer copy, 141)
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.

It looks like holes may have been plugged for OB1 (if I set the pml to ob1 I don't see these messages), but apparently that is not the case when using other PMLs.

@wckzhang
Contributor

ACK, there shouldn't be any event create/destroy if cuda init hasn't been called. I fixed a bug in the ob1 path to address this but must have missed another path. It should be easy enough to figure out.
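
The invariant described here (no event create/destroy unless CUDA init actually happened) can be sketched as a simple guard. This is a minimal illustration, not Open MPI's code: every `sketch_*` name is hypothetical, and the fake driver call just returns 709, the value seen in the log above, whenever init never ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from Open MPI or CUDA. */
typedef struct { bool valid; } sketch_event_t;

static bool sketch_cuda_initialized = false; /* set only after a successful init */
static int  sketch_live_events = 0;          /* events currently outstanding */

/* Pretend driver call: returns 709 (the value in the log) if init never ran. */
static int sketch_driver_event_destroy(sketch_event_t *ev)
{
    (void)ev;
    return sketch_cuda_initialized ? 0 : 709;
}

/* Create an event only when CUDA was actually initialized. */
int sketch_event_create(sketch_event_t *ev)
{
    if (!sketch_cuda_initialized) {
        return -1;               /* nothing created, so nothing to destroy later */
    }
    ev->valid = true;
    sketch_live_events++;
    return 0;
}

/* Destroy is a no-op when init never happened or the event was never created. */
int sketch_event_destroy(sketch_event_t *ev)
{
    if (!sketch_cuda_initialized || NULL == ev || !ev->valid) {
        return 0;
    }
    assert(sketch_live_events > 0);
    int rc = sketch_driver_event_destroy(ev);
    if (0 == rc) {
        ev->valid = false;
        sketch_live_events--;
    }
    return rc;
}
```

With the guard in place, a cleanup path that runs in an application that never touched CUDA returns quietly instead of reaching the driver call.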

@hppritcha
Member Author

if you'd like to fix go ahead and assign this to yourself. i've got some other things to fix elsewhere....

@wckzhang wckzhang assigned wckzhang and unassigned wckzhang Jan 27, 2023
@wckzhang
Contributor

I wasn't able to reproduce this when I tried (built 5.0.x head with CUDA, built the application without CUDA, and ran using the OFI MTL), so Howard said he'd take a look at reproducing.

@hppritcha
Member Author

this problem vanishes if I don't use --enable-mca-dso configure option.

@hppritcha
Member Author

i'm fine with not using the --enable-mca-dso option. I think this problem is somewhat related to the problems that PR #10949 was seeking to address.

@rhc54
Contributor

rhc54 commented Jan 31, 2023

Not being able to enable that option seems like a pretty drastic solution for the general community. The other PR has been sitting there waiting to be committed for 3 months now - can the community find some way to make that happen? Seems like there are a bunch of PRs stuck in that situation.

@jsquyres
Member

jsquyres commented Jan 31, 2023

If something breaks when you use --enable-mca-dso, it usually means that there is a genuine bug -- perhaps in ordering of shutdown or somesuch.

@hppritcha
Member Author

well what's happening i believe is that the accelerator framework is being dlclosed prior to smcuda (part of btl framework).
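
To make the ordering concern concrete: if a dependency (here the accelerator framework) is torn down before something that still uses it (here smcuda), late cleanup calls have nothing valid to talk to. One common way to avoid that is to close things in the reverse of the order they were opened. The sketch below illustrates only that idea; it is not Open MPI's MCA close logic, and all `sketch_*` names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX 8

/* Hypothetical registry: names in the order they were opened. */
static const char *sketch_opened[SKETCH_MAX];
static size_t sketch_n_opened = 0;

/* Record of the order in which things were closed, for inspection. */
static const char *sketch_closed[SKETCH_MAX];
static size_t sketch_n_closed = 0;

/* Register a framework/component as opened. Returns -1 if the table is full. */
int sketch_open(const char *name)
{
    if (sketch_n_opened >= SKETCH_MAX) {
        return -1;
    }
    sketch_opened[sketch_n_opened++] = name;
    return 0;
}

/* Close everything in reverse open order, so dependents go first. */
void sketch_close_all(void)
{
    while (sketch_n_opened > 0) {
        const char *name = sketch_opened[--sketch_n_opened];
        assert(sketch_n_closed < SKETCH_MAX);
        sketch_closed[sketch_n_closed++] = name;
    }
}
```

If "accelerator" is opened first and "btl/smcuda" second, `sketch_close_all()` closes "btl/smcuda" first, so its cleanup still has the accelerator available.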

@jsquyres
Member

> well what's happening i believe is that the accelerator framework is being dlclosed prior to smcuda (part of btl framework).

That seems like a legit bug that should be fixed, right?

@hppritcha
Member Author

I think there may be something more complicated going on. It appears that on the system I'm using, loading a "cudatoolkit" module causes the smcuda btl to be initialized even though the accelerator framework cuda component failed to initialize. The magic in the cudatoolkit module appears to be adding libcudart.so to the LD_LIBRARY_PATH.

Given this, I think this is some CUDA-support-specific problem. I'm confused, though, about why the problem only shows up when building with the --enable-mca-dso configure option.

@hppritcha
Member Author

uff. this is kind of ugly.

the smcuda btl creates a cuda context after the accelerator framework has been initialized, leading to a bunch of problems. going to open a wip PR.

@hppritcha
Member Author

hppritcha commented Feb 7, 2023

well now I cannot seem to reproduce these warning messages, either with main or v5.0.x. Tried both with vendor compiler wrappers and using gnu compiler directly.

@wckzhang
Contributor

wckzhang commented Feb 8, 2023

If we can't reproduce this issue (I suspect that old versions of the accelerator framework component were loaded), can we mark this issue as closed?

@hppritcha
Member Author

actually i was able to reproduce again but it's not simple. it seems i only hit this problem when using the vendor-supplied compiler wrappers to build ompi.

hppritcha added a commit to hppritcha/ompi that referenced this issue Feb 24, 2023
related to open-mpi#11354

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha added a commit to hppritcha/ompi that referenced this issue Feb 27, 2023
related to open-mpi#11354

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 2, 2023
related to open-mpi#11354

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@wckzhang
Contributor

wckzhang commented Mar 2, 2023

I'm going to add a fix for the ob1 component and then mark this issue as done. @hppritcha, please backport the fix.

@wckzhang
Contributor

wckzhang commented Mar 2, 2023

Actually I looked into it more and ob1 shouldn't have the same issue since it occurs in the component_fini rather than component_close
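
A rough way to picture the fini-versus-close distinction, under a loudly stated assumption: the sketch assumes a shutdown sequence in which every component's fini hook runs while the accelerator is still loaded, and the close hook runs only after it has been unloaded. That sequence is an assumption made for illustration, not a statement about Open MPI's actual shutdown order, and all `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION for this sketch: the accelerator is unloaded between fini and close. */
static bool sketch_accel_loaded = true;
static int  sketch_unsafe_cleanups = 0;   /* cleanups that ran too late */

/* Cleanup that needs the accelerator; counts as unsafe if it runs after unload. */
void sketch_release_cuda_resources(void)
{
    if (!sketch_accel_loaded) {
        sketch_unsafe_cleanups++;
    }
}

void sketch_noop(void) {}

/* Hypothetical component with an early hook (fini) and a late hook (close). */
typedef struct {
    void (*fini)(void);
    void (*close)(void);
} sketch_component_t;

/* Assumed shutdown order: fini, then accelerator unload, then close. */
void sketch_shutdown(const sketch_component_t *c)
{
    assert(c != NULL && c->fini != NULL && c->close != NULL);
    c->fini();
    sketch_accel_loaded = false;
    c->close();
}
```

Under that assumption, a component that releases its CUDA resources in fini never trips the counter, while one that waits until close does.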

hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 2, 2023
related to open-mpi#11354

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit f7803dd)
boi4 pushed a commit to boi4/ompi that referenced this issue Mar 23, 2023
related to open-mpi#11354

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024
related to open-mpi#11354

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
5 participants