Update PRRTE git submodule to point to Open MPI fork #12449
Conversation
oh one of those destructor failures again. i thought i'd fixed that but maybe there's another. |
git bisect implicates openpmix/prrte@6e9417a |
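The bisect workflow used here can be sketched as follows. This is a scratch-repo illustration, not the actual PRRTE history: the two empty commits stand in for the real "known good" and suspect commits, and with a real reproducer you would let `git bisect run` drive the search.

```shell
set -e
# Scratch repo standing in for the real PRRTE checkout:
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo
G="git -c user.email=dev@example.com -c user.name=dev"
$G commit -q --allow-empty -m "known good"
good=$(git rev-parse HEAD)
$G commit -q --allow-empty -m "suspect change"
bad=$(git rev-parse HEAD)
# Mark the endpoints; with a real test script you would then run:
#   git bisect run ./reproducer.sh
git bisect start "$bad" "$good"
git bisect reset >/dev/null
echo "bisect-demo-ok"
```

With only one commit between the endpoints, `git bisect start` reports the first bad commit immediately; in the real tree it halves the range each step.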
@hppritcha Can you post the actual error here so I can look into it? I'm continuing to clean up group operations - I'll bet, though, that the problem here is because you didn't update the PMIx submodule to sync with the changes in PRRTE. The fix for local group ops required a coordinated change between the two packages. |
Sigh - sorry about that last note, I was logged into a different account! |
yep it looks like advancing openpmix sha to 2f225b0748febe4df3695ec89d16a33ce155394e causes the problem to go away, at least for -np 1. |
You folks will need to decide how you want to handle this for release branches. The group rework is not being brought across to the PMIx/PRRTE release branches, but that also means that some of the group-related problems won't be fixed there. We can discuss when things get to an appropriate point, if you like. |
too bad. looks like we're on to the next failure. |
Well, let me know if it is PMIx related - I'll try to help give you a solid launch point for your PRRTE fork. |
looks like something's changed with PMIx_Group_construct that causes |
Hmmm...that seems odd. Testing with the current head of the PMIx/PRRTE master branches shows:
$ prterun -n 2 ./group_lcl_cid
Client ns prterun-Ralphs-iMac-59184@1 rank 0 pid 59362: Running
Client ns prterun-Ralphs-iMac-59184@1 rank 1 pid 59363: Running
Rank 1 Group construct complete with status PMIX_SUCCESS KEY pmix.grp.id CID 4294967295
Rank 0 Group construct complete with status PMIX_SUCCESS KEY pmix.grp.id CID 4294967295
prterun-Ralphs-iMac-59184@1:1: PMIx_Get LOCAL CID for rank 0 SUCCESS value: PMIX_VALUE: Data type: PMIX_SIZE Value: 1234
prterun-Ralphs-iMac-59184@1:0: PMIx_Get LOCAL CID for rank 0 SUCCESS value: PMIX_VALUE: Data type: PMIX_SIZE Value: 1234
prterun-Ralphs-iMac-59184@1:0: PMIx_Get LOCAL CID for rank 1 SUCCESS value: PMIX_VALUE: Data type: PMIX_SIZE Value: 1235
prterun-Ralphs-iMac-59184@1:1: PMIx_Get LOCAL CID for rank 1 SUCCESS value: PMIX_VALUE: Data type: PMIX_SIZE Value: 1235
prterun-Ralphs-iMac-59184@1:0 COMPLETE
prterun-Ralphs-iMac-59184@1:1 COMPLETE |
I also built OMPI main with the head of PMIx/PRRTE and ran a simple spawn test, which also uses the comm_cid code:
$ mpirun -n 2 ./simple_spawn
[prterun-Ralphs-iMac-80906@1:1 pid 80908] starting up on node Ralphs-iMac.local!
[prterun-Ralphs-iMac-80906@1:0 pid 80907] starting up on node Ralphs-iMac.local!
0 completed MPI_Init
1 completed MPI_Init
Parent [pid 80907] about to spawn!
Parent [pid 80908] about to spawn!
[prterun-Ralphs-iMac-80906@2:0 pid 80909] starting up on node Ralphs-iMac.local!
[prterun-Ralphs-iMac-80906@2:1 pid 80910] starting up on node Ralphs-iMac.local!
[prterun-Ralphs-iMac-80906@2:2 pid 80911] starting up on node Ralphs-iMac.local!
1 completed MPI_Init
Hello from the child 1 of 3 on host Ralphs-iMac.local pid 80910
2 completed MPI_Init
Hello from the child 2 of 3 on host Ralphs-iMac.local pid 80911
0 completed MPI_Init
Hello from the child 0 of 3 on host Ralphs-iMac.local pid 80909
Parent done with spawn
Parent done with spawn
Parent sending message to child
Child 0 received msg: 38
Child 1 disconnected
Child 2 disconnected
Parent disconnected
Parent disconnected
Child 0 disconnected
80911: exiting
80908: exiting
80907: exiting
80910: exiting
Perhaps you could share what you are doing, and what error you see? Is this a case again where we run a gazillion things and then eventually call something that accesses this code? If so, then maybe it isn't this specific code that is the source of the problem. |
Not sure who (if anyone) is monitoring the upstream PRRTE repo, but I posted a fix this morning to a problem cited by AMD last night (openpmix/prrte@f01e2a2) - it affects their web-based map display. We should probably chat sometime about if/how you plan to keep an eye on the upstream repo as there are changes which may impact you. |
@hppritcha You might try adding this patch, simply to help diagnostics:
diff --git a/.github/workflows/ompi_mpi4py.yaml b/.github/workflows/ompi_mpi4py.yaml
index f23f20fb2f..bed157bc26 100644
--- a/.github/workflows/ompi_mpi4py.yaml
+++ b/.github/workflows/ompi_mpi4py.yaml
@@ -65,6 +65,9 @@ jobs:
mca_params="$HOME/.prte/mca-params.conf"
mkdir -p "$(dirname "$mca_params")"
echo rmaps_default_mapping_policy = :oversubscribe >> "$mca_params"
+ mca_params="$HOME/.pmix/mca-params.conf"
+ mkdir -p "$(dirname "$mca_params")"
+ echo gds = hash >> "$mca_params"
- name: Show MPI
run: ompi_info
Unfortunately, disabling PMIx shmem doesn't mean that the shmem component is the cause of the problem. The component would be sensitive to OMPI memory corruption, so it may just be getting stomped on. Not sure how it would tie into |
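The patch above forces the hash GDS component via a PMIx params file. For a one-off diagnostic run, the same setting can also be applied through the `PMIX_MCA_`-prefixed environment variable form; the snippet below shows both, using a temp directory as a stand-in for the `$HOME/.pmix` path the patch writes to.

```shell
set -e
# One-off alternative (PMIx reads MCA parameters from PMIX_MCA_* env vars):
export PMIX_MCA_gds=hash
# mpirun -n 2 ./a.out   # would now select the hash GDS component

# The file-based form the patch writes, using a temp dir instead of $HOME:
confdir="$(mktemp -d)/.pmix"
mkdir -p "$confdir"
echo "gds = hash" >> "$confdir/mca-params.conf"
cat "$confdir/mca-params.conf"
```

The env-var form is handy in CI because it needs no home-directory state and disappears when the job ends.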
thanks Ralph. i'll take a look with the patch. note the problem doesn't seem to have anything to do with spawn/connect-accept, however. |
Yes, we need to figure out how we're going to manage the OMPI fork of PRRTE; part of the reason I created this PR was to force that conversation. We didn't have a Tuesday call this week to discuss it. |
PR 3329 introduced a regression which causes Open MPI's MPI_Intercomm_from_groups method to fail. Related to open-mpi/ompi#12449 Signed-off-by: Howard Pritchard <howardp@lanl.gov>
PR 3329 introduced a regression which causes Open MPI's MPI_Intercomm_from_groups method to fail. Related to open-mpi/ompi#12449 Signed-off-by: Howard Pritchard <howardp@lanl.gov>
PR 3329 introduced a regression which causes Open MPI's MPI_Intercomm_from_groups method to fail. Related to open-mpi/ompi#12449 Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit 213956c)
PR 3329 introduced a regression which causes Open MPI's MPI_Intercomm_from_groups method to fail. Related to open-mpi/ompi#12449 Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit 213956c)
Shift away from the upstream PRRTE (https://github.com/openpmix/prrte) and point to an Open MPI-specific fork for future releases of Open MPI. The direct consequence of this is that the PRRTE bundled in Open MPI will now be an Open MPI-specific fork of PRRTE. Tighter integration between Open MPI and this bundled, Open MPI-specific PRRTE will likely happen over time, but right now, it's functionally identical to the upstream PRRTE. Note: this commit both updates to the current upstream PRRTE master branch tip (5cde35d439) plus one additional Open MPI-specific commit (1d867e8498) that edits the PRRTE VERSION file to include "ompi" in the version string. Signed-off-by: Jeff Squyres <jeff@squyres.com>
2f225b0748febe4df3695ec89d16a33ce155394e Fix local group operations and cleanup to pass all examples Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@jsquyres i'd like to push some fixes to this PR to pass mpi4py. |
@hppritcha Do you want to just resync the master branch with upstream? I believe all the fixes are in it. |
also rebase to pull in some cid fixes for inter communicators Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Force-pushed from e71220a to e3f0ada
@jsquyres i think this is ready when you want to merge. probably want to ping on slack/email about the |
Shift away from the upstream PRRTE (https://github.com/openpmix/prrte) and point to an Open MPI-specific fork for future releases of Open MPI. The direct consequence of this is that the PRRTE bundled in Open MPI will now be an Open MPI-specific fork of PRRTE.
Tighter integration between Open MPI and this bundled, Open MPI-specific PRRTE will likely happen over time, but right now, it's functionally identical to the upstream PRRTE.
Note: this commit both updates to the current upstream PRRTE master branch tip (5cde35d439) plus one additional Open MPI-specific commit (1d867e8498) that edits the PRRTE VERSION file to include "ompi" in the version string.
This commit effectively means that if we pull/cherry-pick/otherwise obtain a future commit git hash X from upstream PRRTE's git repo, it will have git hash Y on the Open MPI PRRTE repo. This may not be a big deal, but it is worth noting.
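The hash divergence described above is inherent to git: a cherry-picked commit always gets a new hash, because the hash covers the parent chain and committer metadata, not just the change itself. A scratch-repo demonstration (the branch names and messages are illustrative only):

```shell
set -e
# Build a tiny repo with a "fork" default branch and an "upstream" branch:
tmp=$(mktemp -d); cd "$tmp"
git init -q repo && cd repo
G="git -c user.email=dev@example.com -c user.name=dev"
$G commit -q --allow-empty -m "base"
$G checkout -q -b upstream
$G commit -q --allow-empty -m "upstream fix"
X=$(git rev-parse HEAD)                 # hash X on "upstream"
$G checkout -q -                        # back to the fork's default branch
$G commit -q --allow-empty -m "fork-only change"
$G cherry-pick --allow-empty "$X" >/dev/null
Y=$(git rev-parse HEAD)                 # same change, hash Y on the fork
test "$X" != "$Y"
echo "same change, different hash"
```

So cross-referencing upstream PRRTE fixes against the Open MPI fork will have to go by commit subject or a `(cherry picked from commit ...)` trailer rather than by hash.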
When checking out this branch or updating `main` after this PR merges, developers must re-sync their submodules. Given that we have not updated the PRRTE submodule URL for the Open MPI `v5.0.x` branch, a `submodule sync` (and possibly a `submodule update`) will be necessary every time you switch between the Open MPI `main` and `v5.0.x` branches.
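The re-sync amounts to the following two commands, run at the top of the Open MPI checkout after switching branches. (They are executed here in an empty scratch repo only so the snippet runs standalone; in a repo with no submodules they are harmless no-ops.)

```shell
set -e
# Scratch repo stand-in for the Open MPI checkout:
tmp=$(mktemp -d); cd "$tmp"
git init -q demo && cd demo
# Re-point the recorded submodule URLs at whatever .gitmodules now says,
# then update the submodule working trees to the recorded commits:
git submodule sync --recursive
git submodule update --init --recursive
echo "submodules re-synced"
```

`sync` only rewrites the locally-configured URLs; `update` is what actually moves each submodule to the commit the superproject records.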