-
Notifications
You must be signed in to change notification settings - Fork 859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add accelerator and part frameworks to mca lists #10949
Closed
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this change fixes anything.
mca_base_framework_open_list()
is literally just a for loop around the framework list callingmca_base_framework_open()
on each item in the list. It has the same flag argument, so there's no change there. I don't think this was the important change.I thought it might be a different behavior in the registered cleanup argument, but reading the code for
mca_base_framework_close_list
, it is just callingmca_base_framework_close()
. I think the real problem is not any of the list usage stuff, but that the registered cleanup function was/is calling the finalize function directly rather than callingmca_base_framework_close
, which was a pretty silly mistake on our part in the review. It kind of makes me wonder where @wckzhang found that cleanup example and if we don't have other issues lurking elsewhere.Either way, this patch is wrong. Either the
opal_finalize_register_cleanup
on line 203 needs to be removed or the whole patch could become changing that toopal_finalize_register_cleanup_arg(mca_base_framework_close, opal_accelerator_base_framework)
. The first obviously seems simpler, but one of the two needs to happen.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before the patch here's the dlopening/closing activity related to accelerator framework:
behavior with the patch
a framework has to be closed in opal_finalize when using enable-mca-dso otherwise a reopen of the framework is a no op owing to a check at line 173 of mca_base_framework.c but the component *.so's were unloaded as part of the MPI_Finalize. Without the patch the accelerator framework was not being closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, like I said, the issue is not the use / non-use of the list. The issue is that the code was registering a call to the finalize call directly, rather than component close. If line 203 was
opal_finalize_register_cleanup_arg(mca_base_framework_close, opal_accelerator_base_framework)
, the existing code would have worked just as well.I have no problem with the list change, but in opal, that was not the root cause of the problem. Deregisters are called in reverse order, so I think what's happening is that
accelerator_base_selected_component.accelerator_finalize()
is called and then shortly latermca_base_framework_close(opal_accelerator_base_framework)
is called, which callsaccelerator_base_selected_component.accelerator_finalize()
again and then closes the framework correctly. But we document that finalize will be called once per open, so we're getting lucky that the component doesn't do anything bad in that second call to finalize.Hence, for the patch to be correct, line 203 (
opal_finalize_register_cleanup(accelerator_base_selected_component.accelerator_finalize);
) needs to be removed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep I agree about removing the accelerator finalize call.
@wckzhang do you recall what motivated having the accelerator component finalize handled in a separate function rather than in the close of the component? Is there some cuda behavior that motivated this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just thought that was how we do cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hppritcha I don't understand that.
mca_base_framework_close()
is still the right answer; calling<framework>.finalize()
directly leaves the base framework in a not-clean state.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually the accelerator framework was mostly doing the right thing. As described in section IIIB of 10.1109/CLUSTER.2019.8891002 the complex interlinking of framework dependencies that existed prior to sessions was replaced with a cleanup callback framework.
@hjelmn can provide more details if needed.
What is the not-clean state that the accelerator framework is being left in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code calls:
instead of the correct:
So while the component gets cleaned up, the framework itself does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would
opal_finalize_register_cleanup_arg(mca_base_framework_close, opal_accelerator_base_framework)
be needed? If it is in the list of frameworks (which it should be if you want it always opened) then it will get called automatically when the last call to finalize is made.I plan to take a closer look at this PR and the existing code tomorrow. Will probably have more comments then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hjelmn we've gone around so long that it doesn't matter. But the current PR is wrong and needs to be fixed. Today, on line 212, it calls
opal_finalize_register_cleanup(opal_accelerator_base_selected_component.accelerator_finalize);
. This line just needs to be removed.I was trying to make the point to Howard that adding the accelerator framework to the list of frameworks wasn't strictly necessary to fixing the issue, because I was pushing on how we missed this issue in the original CR. That just overcomplicated things. If the patch is extended to remove line 212, I'm happy to ship this patch. Until then, the patch is wrong, because we're calling the component's
accelerator_finalize()
function twice. Once from the base cleanup function that is invoked by adding the framework to the opal framework list and once explicitly on line 212.