Hanging in mpi spawn application with openmpi 4.0.1 #6612

Closed
iassiour opened this issue Apr 24, 2019 · 2 comments

@iassiour

Below is a scenario in which an MPI spawn program hangs.

The mpirun command starts seven "master" processes on a single host:

$MPI_HOME/bin/mpirun -oversubscribe -H red9906025 -x LD_LIBRARY_PATH -np 7 mpi_master

mpi_master.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strcpy */
#include <unistd.h>   /* gethostname, sleep */
#include <mpi.h>

int main() {
    char slavejobtospawn[500];
    strcpy(slavejobtospawn, "mpi_slave");

    MPI_Comm wcomm_;
    MPI_Info minfo;
    int mpistat, myrank;

    /* host name used in the log messages */
    char localhost[40];
    mpistat = gethostname(localhost, 40);

    MPI_Init(NULL, NULL);

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("master %s started\n", localhost);
    fflush(stdout);

    /* only rank 0 enters this branch and reaches MPI_Comm_spawn */
    if (myrank == 0) {

        printf("master %s tries to spawn\n", localhost);
        fflush(stdout);
        mpistat = MPI_Info_create(&minfo);
        mpistat = MPI_Info_set(minfo, "add-hostfile", "lamhosts_spawn");
        mpistat = MPI_Comm_spawn(slavejobtospawn, MPI_ARGV_NULL,
                                 7, minfo, 0, MPI_COMM_WORLD, &wcomm_,
                                 MPI_ERRCODES_IGNORE);

        printf("master %s spawned success\n", localhost);

    }

    /* keep every master alive for 100 seconds */
    sleep(100);
    mpistat = MPI_Finalize();
    return 0;
}

Only rank 0 calls MPI_Comm_spawn; it requests seven slaves.

mpi_slave.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* gethostname */
#include <mpi.h>

int main() {
    MPI_Comm slave_Comm_;
    int mpistat;

    /* host name used in the log messages */
    char localhost[40];
    mpistat = gethostname(localhost, 40);

    /* == init MPI */
    MPI_Init(NULL, NULL);

    /* intercommunicator to the processes that spawned this slave */
    mpistat = MPI_Comm_get_parent(&slave_Comm_);

    printf("slave %s connected to parent\n", localhost);
    fflush(stdout);

    mpistat = MPI_Finalize();

    printf("slave %s shutting down\n", localhost);
    fflush(stdout);

    return 0;
}

The lamhosts_spawn hostfile looks like this:

red9906025
red9906026
red9906026
red9906027
red9906027
red9906028
red9906028

The application seems to hang with this stack on the master (rank 0):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x0000000000799783 in OPAL_MCA_PMIX3X_PMIx_Connect ()
#2 0x000000000078b504 in pmix3x_connect ()
#3 0x0000000000445ff6 in ompi_dpm_connect_accept ()
#4 0x0000000000462a95 in PMPI_Comm_spawn ()
#5 0x000000000043c6c3 in main ()

And on the spawned slaves:
#0 0x00002aaaabc6f6b3 in *__GI___poll (fds=, nfds=, timeout=0) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00000000006b6336 in poll_dispatch ()
#2 0x00000000006ac23d in opal_libevent2022_event_base_loop ()
#3 0x0000000000661080 in opal_progress ()
#4 0x00000000004423fd in ompi_request_wait_completion ()
#5 0x00000000004446bc in ompi_comm_nextcid ()
#6 0x0000000000446389 in ompi_dpm_connect_accept ()
#7 0x000000000044a53a in ompi_dpm_dyn_init ()
#8 0x000000000045a890 in ompi_mpi_init ()
#9 0x000000000043c77d in PMPI_Init ()
#10 0x000000000043c5dc in main ()

@ggouaillardet
Contributor

MPI_Comm_spawn() is a collective operation and should hence be called by all the ranks of the communicator.

@hppritcha hppritcha self-assigned this Jan 13, 2020
@hppritcha
Member

As noted by @ggouaillardet, the test has an error. Closing.
