
Sync to PMIx v2.1.0 #4746

Merged
merged 3 commits into from
Feb 13, 2018

Conversation

karasevb
Member

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
@karasevb karasevb added this to the v3.1.0 milestone Jan 24, 2018
@karasevb karasevb mentioned this pull request Jan 24, 2018
@jladd-mlnx jladd-mlnx requested review from jladd-mlnx and removed request for rhc54 January 24, 2018 13:58
Member

@jladd-mlnx jladd-mlnx left a comment

👍 Good to go.

@jladd-mlnx
Member

👍
@hppritcha @bwbarrett This is ready to go.

@rhc54
Contributor

rhc54 commented Jan 24, 2018

I am receiving consistent reports of "hangs" of PMIx-based programs using PMIx v2.1.0rc2 when direct launched against Slurm 17.11. I would advise not moving forward until that gets resolved as we have no info as to whether the problem is in the Slurm plugin, or in PMIx itself.

Given reassignment of @artpol84 and @karasevb, I'm not sure when the Slurm problem will be investigated. Perhaps someone here can comment?

Member

@jjhursey jjhursey left a comment


This PR looks like a fine sync of PMIx 2.1 to the v3.1 branch

I saw @rhc54 's concern in the comments - so I don't know if we want to delay merging until that gets resolved or not.

@jladd-mlnx
Member

@rhc54 we still maintain the SLURM plugin. We have this under test and do not see issues. In fact, we have problems without it. @artpol84 can you check if this hangs with srun launch with SLURM 17.11, please.

@artpol84
Contributor

@rhc54 can you share those reports with me and forward those who have issues to me?

@artpol84
Contributor

@rhc54 I'd appreciate it if, in the future, you let us know ASAP about any issues related to Slurm/PMIx.

@artpol84
Contributor

artpol84 commented Jan 24, 2018

On our side we run Slurm with PMIx v2.1 on a daily basis:
https://mtt.open-mpi.org/index.php?do_redir=2554
You can see in the list that we cover all PMIx release versions and multiple OMPI versions, and there is no dramatic difference in results between the versions. @karasevb please double-check the failures.
We are using external PMIx, but if this difference is the issue, then OMPI has problems with its internal integration.

@jjhursey I think we should merge it and we will address the issues if any.

@jjhursey
Member

@artpol84 Maybe @rhc54 is thinking about this PMIx user reported issue:

@artpol84
Contributor

@jjhursey according to the description it doesn't seem like a hang, but rather data corruption.
@karasevb please check.

@rhc54
Contributor

rhc54 commented Jan 24, 2018

@artpol84 I have been letting you know about these problems, but haven't been getting any response. In addition to the mailing list, I have emailed you directly about it.

The reports are coming from both HPe and Intel. The behavior is the same in both cases. Note that HPe is not using OMPI, but rather a simple PMIx client test code. Same for Intel.

In the Intel case, Slurm 17.11.2.1 is configured with PMIx v2.1.0rc2. A simple "srun --mpi=pmix_v2" of a PMIx test client that calls PMIx_Init/Finalize hangs until timeout occurs. I don't have a lot of diagnostic output at this time, but have requested more.
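(For context, the client described above amounts to a minimal PMIx program. The sketch below is a hedged reconstruction, not the actual test code used by HPe or Intel; it uses the standard PMIx v2.x client API, must be built against an installed libpmix, and would be launched with something like `srun --mpi=pmix_v2 -n 2 ./pmix_client`.)

```c
/* Minimal PMIx client: Init then Finalize.
 * Sketch only -- requires a PMIx v2.x installation (pmix.h, -lpmix)
 * and a PMIx-enabled launcher such as `srun --mpi=pmix_v2`.
 * A hang in PMIx_Init when direct-launched implicates the
 * client/server handshake path (Slurm plugin or PMIx itself). */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    /* Connect to the local PMIx server (the Slurm PMIx plugin
     * when launched via srun). */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (rc != PMIX_SUCCESS) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }
    printf("Client %s rank %u initialized\n", myproc.nspace, myproc.rank);

    rc = PMIx_Finalize(NULL, 0);
    if (rc != PMIX_SUCCESS) {
        fprintf(stderr, "PMIx_Finalize failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }
    return 0;
}
```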

@artpol84
Contributor

We will resolve this

@rhc54
Contributor

rhc54 commented Jan 24, 2018

Looking into the reports, it appears that the fence may be broken. One possibility that might explain the difference between your tests and what is being reported: are your tests always using the IB "accelerated" path to do communications? If so, I suspect the other code path is having problems.
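(A test that exercises the suspect collective path would add a fence between init and finalize. The fragment below is a hedged sketch using the PMIx v2.x API, intended to slot into a client like the one described above; it is not the actual reproducer and requires an installed libpmix.)

```c
/* Fragment: a blocking full-namespace fence, placed between
 * PMIx_Init and PMIx_Finalize. Passing NULL/0 for the procs
 * array means "all processes in the caller's namespace". */
pmix_info_t info;
bool collect = true;
pmix_status_t rc;

PMIX_INFO_CONSTRUCT(&info);
/* Ask the server to collect and distribute posted data during
 * the fence -- this exercises the data-exchange path as well. */
PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);

rc = PMIx_Fence(NULL, 0, &info, 1);
PMIX_INFO_DESTRUCT(&info);
if (rc != PMIX_SUCCESS) {
    fprintf(stderr, "PMIx_Fence failed: %s\n", PMIx_Error_string(rc));
}
```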

@artpol84
Contributor

We test all of the cases.
I'm taking it up with HPE now. Will update on the result.

@jladd-mlnx
Member

@bwbarrett @hppritcha:
@artpol84 and @karasevb discovered an error in one of the standalone PMIx tests. This was the root cause of the issue reported by HPE. All of our testing is clean. I think we should merge this which will allow for broader testing coverage.

@artpol84
Contributor

Here are some links:

@rhc54
Contributor

rhc54 commented Feb 2, 2018

Official release is now available: https://github.com/pmix/pmix/releases/tag/v2.1.0

Contains a couple of required bug fixes beyond rc2, so I'd recommend updating before committing.

Ralph Castain added 2 commits February 1, 2018 18:48
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@jjhursey
Member

jjhursey commented Feb 2, 2018

Yeah this is good to go. 👍 Thanks!

@hppritcha hppritcha changed the title Sync to PMIx v2.1.0rc2 Sync to PMIx v2.1.0 Feb 8, 2018
@bwbarrett
Member

@karasevb, with PMIx 2.1.0 going GA, is it possible to refresh this patch?

@rhc54
Contributor

rhc54 commented Feb 12, 2018

@bwbarrett I already updated it - should be ready to go

@bwbarrett bwbarrett merged commit b6e825f into open-mpi:v3.1.x Feb 13, 2018