Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

smcuda: fixes when using enable-mca-dso #11443

Merged
merged 1 commit into from
Mar 2, 2023

Conversation

hppritcha
Copy link
Member

related to #11354

@hppritcha
Copy link
Member Author

rebasing on top of main to pull in pmix_fence usage patch

@@ -79,6 +83,8 @@ int mca_btl_smcuda_accelerator_init(void)
goto cleanup_and_error;
}

opal_finalize_register_cleanup(mca_btl_smcuda_accelerator_fini);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you when you say this will fix the issue, but do you mind explaining why it needs to happen this way since I don't quite understand. The pml ob1 component does something similar and I'm wondering if it requires the same fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wckzhang does this comment block help answer your questions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we need to do the same thing with ob1 then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't in theory this the case for all components that use the accelerator framework functionality?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wckzhang where would you want to make a change in the ob1 to use the opal cleanup callback mechanism?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edgargabriel if you find your consumer of the accelerator framework is doing something in its component close method that involves calls into the accelerator framework then you are probably right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it back about ob1 because it already does it in a fini function instead of component close

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edgargabriel Yes, though the other components aren't doing any accelerator functionality during finalization. The reason ob1 and smcuda are susceptible to this is because they create event arrays for the purpose of their asynchronous progress engine which need to be cleaned up during finalization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the rcache/grdma component? I think that one does something similar, isn't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No as far as I can tell it doesn't access the accelerator framework in its closing sequence, just during its register path.

@hppritcha
Copy link
Member Author

np i'll put some verbiage in around this call.

related to open-mpi#11354

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@hppritcha hppritcha merged commit 3889b48 into open-mpi:main Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants