Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4.1.x Thread fixes. #9312

Merged
merged 1 commit into from
Sep 7, 2021
Merged

v4.1.x Thread fixes. #9312

merged 1 commit into from
Sep 7, 2021

Conversation

awlauria
Copy link
Contributor

@awlauria awlauria commented Aug 25, 2021

  • ompi/request: Add a read memory barrier to sync the receive buffer after wait completes.

We found an issue where with using multiple threads, it is possible for the data
to not be in the buffer before MPI_Wait() returns. Testing the buffer later after
MPI_Wait() returned would show the data arrives eventually without the rmb().

We have seen this issue on Power9 intermittently using PAMI, but in theory could
happen with any transport.

Signed-off-by: Austen Lauria awlauria@us.ibm.com
(cherry picked from commit 12192f1)

…on after wait completes.

We found an issue where with using multiple threads, it is possible for the data
to not be in the buffer before MPI_Wait() returns. Testing the buffer later after
MPI_Wait() returned would show the data arrives eventually without the rmb().

We have seen this issue on Power9 intermittently using PAMI, but in theory could
happen with any transport.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit 12192f1)
@awlauria awlauria added this to the v4.1.2 milestone Aug 25, 2021
@ibm-ompi
Copy link

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/5993f493f00f83864718692818bc75dc

@ibm-ompi
Copy link

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/8c4777045b26173501a49ea7ff4f9b15

@ibm-ompi
Copy link

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/77f2bd1854f5da192ec6c2107c8049fe


ompi_hook_base_mpi_init_thread_top(argc, argv, required, provided);

if (NULL != (env = getenv("OMPI_MPI_THREAD_LEVEL"))) {
required = atoi(env);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this documented anywhere? It would seem pretty important to document this behavior -- e.g., in the man page.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't quite understand the commit message. It says "Allow mpi_init_thread to override the MPI_THREAD_LEVEL", but it looks more like the commit allows an env variable to override the thread level set by the app in the call to MPI_INIT_THREAD.

This is new behavior that has not existed in 4.0.x and 4.1.x. There are small (but nonzero) affects to backwards compatibility here, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's making it consistent with ompi_mpi_init, which also uses this ENV variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ompi_mpi_init() does not check the environment variable for the thread level: MPI_Init() does.

We created an env variable "back door" for legacy apps that want to run with MPI threading support but couldn't change their source code to call MPI_Init_thread(). Honestly, it was mainly a way for us to enable MPI_THREAD_MULTIPLE with all the existing OMPI MPI test codes without changing all of them to MPI_Init_thread().

We intentionally did not put in the same backdoor to MPI_Init_thread() because MPI_Init_thread() allows the app to say exactly what it wants (where MPI_Init() does not). Having an environment variable override to something that was specified as a parameter to MPI_Init_thread() seems... sketchy.

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is named "Thread fixes", but contains a fairly significant behavior change.

Indeed, that behavior change was added after the PR was positively reviewed by @bosilca. That is not cool.

The errant commit should be removed from this PR, and the first commit -- which is actually a thread fix -- can probably be merged.

The commit about adding a new feature to MPI_Init_thread() should be debated separately on a different PR. If that commit has already been merged to the v4.0.x branch, it should be reverted.

@awlauria
Copy link
Contributor Author

@jsquyres that was not added after @bosilca reviewed. In fact - he commented on it in the master PR:

Funny that we support OMPI_MPI_THREAD_LEVEL in MPI_Init but not in MPI_Init_thread.

Originally posted by @bosilca in #9302 (comment)

@jsquyres
Copy link
Member

jsquyres commented Aug 30, 2021

Screen Shot 2021-08-30 at 11 00 20 AM

He may have commented on it elsewhere, but on the timeline of this PR, the commit came in afterwards.

Regardless, this is a big behavior change. It's not a fix. And I would argue against this behavior change, because it is specifically discarding what the app stated in the provided argument. We can/should have the debate as to whether this is a good feature to have or not (I'm not personally in favor of it, but I could be overruled). If it is behavior we want, perhaps this should actually be turned into an real MCA var, and moved to ompi_mpi_init() (not handled independently in MPI_Init() and MPI_Init_thread(), not a standalone env var. Additionally, the commit message should be fixed to be correct.

@awlauria
Copy link
Contributor Author

awlauria commented Aug 30, 2021

@jsquyres if you actually clicked on the 'forced-push' link you will see that I fixed a bad cherry-pick of that exact commit - in that I only changed the name of the variable which caused the build to fail (a slight difference from the master cherry-pick)

I did not add this after @bosilca reviewed.

@awlauria
Copy link
Contributor Author

if you don't want this that is fine - but for the record I did not "sneak this in" after it was reviewed, and it matches what went into v4.0.x, v5.0.x and master.

@jsquyres
Copy link
Member

if you don't want this that is fine - but for the record I did not "sneak this in" after it was reviewed, and it matches what went into v4.0.x, v5.0.x and master.

Ok. That is not immediately clear from the github UI. In the future, it would probably be good to either re-request reviews or put a comment in explaining what happened and why a re-review was not necessary.

@awlauria
Copy link
Contributor Author

if you don't want this that is fine - but for the record I did not "sneak this in" after it was reviewed, and it matches what went into v4.0.x, v5.0.x and master.

Ok. That is not immediately clear from the github UI. In the future, it would probably be good to either re-request reviews or put a comment in explaining what happened and why a re-review was not necessary.

it is clear from github - it's literally documented in the link.

I did not want to waste peoples time with re-reviewing a mistake I made in the cherry-pick. If it was a real change, I would have re-requested a review from the appropriate parties.

@jsquyres
Copy link
Member

You have educated me on a point that I didn't know: you can click on a force push link on the github UI and it shows the change from that force push. Thanks.

However, the github UI very much makes it look like a commit was added after the review: there's a line for the commit and then there's another line for the force push. So there's no indication left on the UI that the 2nd commit was there before the review.

Screen Shot 2021-08-30 at 11 39 05 AM

This is why I say that I think it's helpful in a collaborative spirit to add a comment saying "Hey, that force push was just me making a trivial fix. No need to review again." Indeed, sometimes there are a bunch of force pushes, and having to click through a bunch of them to determine what changed is more work for each reviewer vs. the author adding a quick comment. Additionally, we also get emails upon force pushes. Rather than having to click through to chase down everything a PR author did, if the PR author just adds a quick comment summarizing the changes, that can be helpful.

Just my $0.02.

@awlauria
Copy link
Contributor Author

Yeah - redundancy is never a bad thing I suppose. I didn't think such a trivial change needed to be documented/commented - and figured that the link to the code change was sufficient.

if this is such a huge issue I think github has an option to invalidate reviews on a force push. Maybe we should consider turning that on.

@awlauria
Copy link
Contributor Author

I will say that commenting on what a force-push does isn't bound to anything - and people could say anything about it if they think maintainers won't follow the push and really wanted to sneak code in.

So - following and viewing the diffs is still good diligence - even if the person doing the force-push commented on what it did.

@jsquyres
Copy link
Member

I will say that commenting on what a force-push does isn't bound to anything - and people could say anything about it if they think maintainers won't follow the push and really wanted to sneak code in.

So - following and viewing the diffs is still good diligence - even if the person doing the force-push commented on what it did.

Understood. I don't think we disagree. We are a trusting dev community, probably because most of us know each other. The reasons for my "request changes" review (and the corresponding #9331 revert to v4.0.x) are more about the fact that I don't think that we want to make this potentially-backwards-compatibility-affecting change to the release branches.

#9332 is a place to discuss what we want to do with this feature for v5.0.x and forward.

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing the additional commit. All is good now.

@jsquyres jsquyres merged commit 66d34c3 into open-mpi:v4.1.x Sep 7, 2021
@awlauria awlauria deleted the v4.1.x_threads branch March 17, 2022 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants