accelerator framework/cuda: still not entirely fixed #11354

hppritcha · 2023-01-27T20:40:53Z

I checked out the head of the v5.0.x branch with high expectations that it would work well on our NVIDIA + HPE SS11 (aka libfabric) system. Alas, if my application doesn't use CUDA, yet is linked against an ompi v5.0.x that has all the recent accelerator/cuda changes in place and is configured for CUDA support, things don't work right.

Hello, world, I am 1 of 2, (Open MPI v5.0.0rc9, package: Open MPI hpp@ch-fe1 Distribution, ident: 5.0.0rc9, repo rev: v5.0.0rc9-287-g5d87f3e6, Unreleased developer copy, 141)
Hello, world, I am 0 of 2, (Open MPI v5.0.0rc9, package: Open MPI hpp@ch-fe1 Distribution, ident: 5.0.0rc9, repo rev: v5.0.0rc9-287-g5d87f3e6, Unreleased developer copy, 141)
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value:   709
Check the cuda.h file for what the return value means.

It looks like holes may have been plugged for OB1 (if I set the pml to ob1, I don't see these messages), but apparently that is not the case when using other PMLs.
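
For readers decoding the output above: as far as I can tell from the CUresult enum in cuda.h, a return value of 709 corresponds to CUDA_ERROR_CONTEXT_IS_DESTROYED. The following is a minimal sketch of a lookup helper, assuming the hand-copied values below match the header of the installed toolkit. The names `cu_code` and `cu_code_name` are hypothetical and are not part of Open MPI or CUDA.

```c
#include <stddef.h>
#include <stdio.h>

/* A few CUDA driver API result codes, copied by hand from the CUresult
 * enum in cuda.h. Assumption: the values match your toolkit's header. */
struct cu_code {
    int value;
    const char *name;
};

static const struct cu_code known_codes[] = {
    {   0, "CUDA_SUCCESS" },
    {   3, "CUDA_ERROR_NOT_INITIALIZED" },
    {   4, "CUDA_ERROR_DEINITIALIZED" },
    { 201, "CUDA_ERROR_INVALID_CONTEXT" },
    { 709, "CUDA_ERROR_CONTEXT_IS_DESTROYED" },
};

/* Return the symbolic name for a code, or "unknown" if it is not listed. */
const char *cu_code_name(int value)
{
    size_t n = sizeof(known_codes) / sizeof(known_codes[0]);
    for (size_t i = 0; i < n; i++) {
        if (known_codes[i].value == value) {
            return known_codes[i].name;
        }
    }
    return "unknown";
}

/* Print a one-line explanation of a return value seen in a log. */
void explain_cu_code(int value)
{
    printf("return value %d: %s\n", value, cu_code_name(value));
}
```

When the CUDA toolkit is available, calling cuGetErrorName from the driver API is likely the more direct route; the table here only avoids a dependency on cuda.h.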

wckzhang · 2023-01-27T21:46:12Z

ACK, there shouldn't be any event create/destroy if CUDA init hasn't been called. I fixed a bug in the ob1 path to address this but must have missed this case. It should be easy enough to figure out.
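
The rule stated above (no event create/destroy unless CUDA init has happened) can be sketched as a guard flag. This is only an illustration under stated assumptions, not the actual Open MPI code: `fake_event_create`, `fake_event_destroy`, `component_state`, and the two component functions are hypothetical stand-ins for the real driver calls and component hooks.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for CUDA driver state; not Open MPI code. */
static bool cuda_initialized = false;  /* true only after a successful init */
static int  live_events = 0;           /* events currently allocated */

static int fake_event_create(void)  { live_events++; return 0; }
static int fake_event_destroy(void) { live_events--; return 0; }

/* Per-component state: remembers whether events were really created. */
struct component_state {
    bool events_created;
};

/* Create events only when CUDA has been initialized. */
int component_open(struct component_state *s)
{
    s->events_created = false;
    if (!cuda_initialized) {
        return 0;  /* the application never used CUDA; nothing to set up */
    }
    if (fake_event_create() != 0) {
        return -1;
    }
    s->events_created = true;
    return 0;
}

/* Destroy events only if this component created them. */
int component_close(struct component_state *s)
{
    if (!s->events_created) {
        return 0;  /* skip the destroy call entirely */
    }
    int rc = fake_event_destroy();
    s->events_created = false;
    return rc;
}
```

Keying the teardown on what the component itself created, rather than on global state alone, means the close path stays safe even if initialization was skipped or failed part-way.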

hppritcha · 2023-01-27T21:51:45Z

If you'd like to fix this, go ahead and assign it to yourself. I've got some other things to fix elsewhere.

wckzhang · 2023-01-30T18:11:44Z

I wasn't able to reproduce this when I tried (I built 5.0.x head with CUDA and an application without CUDA, and ran using the OFI MTL), so Howard said he'd take a look at reproducing it.

hppritcha · 2023-01-31T17:21:41Z

This problem vanishes if I don't use the --enable-mca-dso configure option.

hppritcha · 2023-01-31T20:51:09Z

I'm fine with not using the --enable-mca-dso option. I think this problem is somewhat related to the problems that PR #10949 was seeking to address.

rhc54 · 2023-01-31T21:08:52Z

Not being able to enable that option seems like a pretty drastic solution for the general community. The other PR has been sitting there waiting to be committed for 3 months now - can the community find some way to make that happen? Seems like there are a bunch of PRs stuck in that situation.

jsquyres · 2023-01-31T21:30:27Z

If something breaks when you use --enable-mca-dso, it usually means that there is a genuine bug -- perhaps in ordering of shutdown or somesuch.

hppritcha · 2023-01-31T21:42:44Z

Well, what's happening, I believe, is that the accelerator framework is being dlclosed prior to smcuda (part of the btl framework).
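
To illustrate the ordering concern described above, here is a minimal sketch in which close hooks run in reverse order of opening, so a dependent component is torn down before the thing it depends on. Everything here is hypothetical (`register_close`, `close_all`, and the two mock components) and is not how Open MPI's MCA actually schedules framework close.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMPONENTS 8

typedef void (*close_fn)(void);

static close_fn close_stack[MAX_COMPONENTS];
static size_t   close_count = 0;

/* Record a component's close hook at the moment the component is opened. */
int register_close(close_fn fn)
{
    if (close_count >= MAX_COMPONENTS) {
        return -1;
    }
    close_stack[close_count++] = fn;
    return 0;
}

/* Run close hooks last-opened-first, so dependents close before providers. */
void close_all(void)
{
    while (close_count > 0) {
        close_stack[--close_count]();
    }
}

/* Mock provider (accelerator) and mock dependent (btl). */
static int accelerator_is_open = 0;
static int btl_saw_accelerator_at_close = -1;

static void accelerator_close(void) { accelerator_is_open = 0; }

/* The dependent records whether its provider was still alive at close. */
static void btl_close(void) { btl_saw_accelerator_at_close = accelerator_is_open; }

int open_accelerator(void)
{
    accelerator_is_open = 1;
    return register_close(accelerator_close);
}

int open_btl(void)
{
    assert(accelerator_is_open);  /* the dependent requires the provider */
    return register_close(btl_close);
}
```

After calling `open_accelerator()`, `open_btl()`, and then `close_all()`, the mock btl observes the accelerator as still open during its own close. Reversing the close order would reproduce the kind of use-after-teardown suspected in this comment.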

jsquyres · 2023-01-31T21:51:09Z

> well what's happening i believe is that the accelerator framework is being dlclosed prior to smcuda (part of btl framework).

That seems like a legit bug that should be fixed, right?

hppritcha · 2023-02-01T22:20:12Z

I think there may be something more complicated going on. It appears that on the system I'm using, loading a "cudatoolkit" module causes the smcuda btl to be initialized even though the accelerator framework's cuda component failed to initialize. The magic in the cudatoolkit module appears to be adding libcudart.so to LD_LIBRARY_PATH.

Given this, I think this is some CUDA-support-specific problem. I'm confused, though, about why the problem only shows up when building with the --enable-mca-dso configure option.

hppritcha · 2023-02-02T22:56:49Z

Uff, this is kind of ugly.

The smcuda btl creates a CUDA context after the accelerator framework has been initialized, leading to a bunch of problems. I'm going to open a WIP PR.

hppritcha · 2023-02-07T18:13:18Z

Well, now I cannot seem to reproduce these warning messages, either with main or v5.0.x. I tried both with the vendor compiler wrappers and using the GNU compiler directly.

wckzhang · 2023-02-08T22:35:05Z

If we can't reproduce this issue (I suspect that old versions of the accelerator framework component were loaded), can we mark this issue as closed?

hppritcha · 2023-02-09T15:37:22Z

Actually, I was able to reproduce it again, but it's not simple. It seems I only hit this problem when using the vendor-supplied compiler wrappers to build ompi.

wckzhang · 2023-03-02T21:36:11Z

I'm going to add a fix for the ob1 component and then mark this issue as done. Please backport the fix, @hppritcha.

wckzhang · 2023-03-02T21:43:22Z

Actually, I looked into it more, and ob1 shouldn't have the same issue, since it occurs in component_fini rather than component_close.

hppritcha added Target: main Target: v5.0.x labels Jan 27, 2023

hppritcha self-assigned this Jan 27, 2023

wckzhang assigned wckzhang and unassigned wckzhang Jan 27, 2023

hppritcha added a commit to hppritcha/ompi that referenced this issue Feb 24, 2023

smcuda: fixes when using enable-mca-dso

7b6c570

related to open-mpi#11354 Signed-off-by: Howard Pritchard <howardp@lanl.gov>

hppritcha mentioned this issue Feb 24, 2023

smcuda: fixes when using enable-mca-dso #11443

Merged

hppritcha added a commit to hppritcha/ompi that referenced this issue Feb 27, 2023

smcuda: fixes when using enable-mca-dso

6ad92c2

related to open-mpi#11354 Signed-off-by: Howard Pritchard <howardp@lanl.gov>

hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 2, 2023

smcuda: fixes when using enable-mca-dso

f7803dd

related to open-mpi#11354 Signed-off-by: Howard Pritchard <howardp@lanl.gov>

hppritcha added a commit to hppritcha/ompi that referenced this issue Mar 2, 2023

smcuda: fixes when using enable-mca-dso

4d28266

related to open-mpi#11354 Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit f7803dd)

hppritcha mentioned this issue Mar 2, 2023

v5.0.x: smcuda: fixes when using enable-mca-dso #11461

Merged

awlauria closed this as completed Mar 16, 2023

boi4 pushed a commit to boi4/ompi that referenced this issue Mar 23, 2023

smcuda: fixes when using enable-mca-dso

d77a8ff

related to open-mpi#11354 Signed-off-by: Howard Pritchard <howardp@lanl.gov>

yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024

smcuda: fixes when using enable-mca-dso

a279541

related to open-mpi#11354 Signed-off-by: Howard Pritchard <howardp@lanl.gov>
