
OpenMPI Freezes during multi-threaded communication #10324

Closed · IanSaucy opened this issue Apr 27, 2022 · 24 comments

IanSaucy commented Apr 27, 2022

Background information

I am working on a project where we use multiple pthreads within a single MPI process, allowing each pthread to communicate with MPI. We have noticed that Open MPI sometimes freezes: a single thread pins at 100% CPU utilization and no forward progress is made. We were able to build a minimal example that demonstrates the issue. Unfortunately, it is still non-deterministic and takes some time to trigger.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

We're using Open MPI v4.1.2, built from a release tarball.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure
sudo make all install

We have the following version of GCC:

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04 LTS
  • Computer hardware: AWS m6a.8xlarge
  • Network type: TCP/IP over Ethernet

Details of the problem

We have three machines running on AWS (m6a.8xlarge, 64 cores), all connected via SSH. Our code runs as three MPI processes, one on each machine. Each process spawns a variable number of threads; our test case uses 8. These threads then communicate with their respective threads on the other nodes in a ring-style manner, using message tags so that each thread talks to its counterpart on the other nodes. Threads synchronize their virtual work units using pthread barriers.

Typically within about 5 minutes, but sometimes closer to 30 minutes, the issue arises and the entire program freezes and never makes progress. We have tried running the same test co-located on a single node, but the issue does not seem to arise there; we only see it when running across multiple machines.

This is the specific command to run our program:

mpirun -np 3 --host one,two,three ./comm-test

The typical behavior is that two PIDs are pinned at 100% CPU while the others sit idle. I've attached the backtraces of the two threads that are pinned at 100% CPU and of one that sits idle. The backtraces are identical across all nodes.

We've also verified that Open MPI supports multiple threads via the following:

ubuntu@ip-10-1-1-12:~/secrecy/experiments$ ompi_info | grep -i thread
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
   FT Checkpoint support: no (checkpoint thread: no)

Thanks in advance for any help.

Thread 1 at 100% CPU

#1  0x00007fe6007a6da3 in poll (__timeout=<optimized out>, __nfds=4, __fds=0x5635280c23c0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  poll_dispatch (base=0x563527fe5b00, tv=0x7fe5ee73ccf0) at poll.c:165
#3  0x00007fe60079dfbb in opal_libevent2022_event_base_loop (base=0x563527fe5b00, flags=2) at event.c:1630
#4  0x00007fe60074ea55 in opal_progress () from /usr/local/lib/libopen-pal.so.40
#5  0x00007fe6007554f5 in ompi_sync_wait_mt () from /usr/local/lib/libopen-pal.so.40
#6  0x00007fe601356f89 in ompi_request_default_wait () from /usr/local/lib/libmpi.so.40
#7  0x00007fe60139ede2 in PMPI_Wait () from /usr/local/lib/libmpi.so.40
#8  0x0000563527616e14 in pthread_func (params=0x5635280ea440) at main.c:58
#9  0x00007fe6010ec6db in start_thread (arg=0x7fe5ee73d700) at pthread_create.c:463
#10 0x00007fe600e1561f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 at 100% CPU

#1  futex_wait_simple (private=0, expected=925281, futex_word=0x5635278184c4 <end_work+4>) at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_barrier_wait (barrier=0x5635278184c0 <end_work>) at pthread_barrier_wait.c:184
#3  0x0000563527616f93 in main (argc=1, argv=0x7ffe46b8cce8) at main.c:98

The other threads that are not taking CPU time have the following backtrace:

#1  0x00007fe6007a6da3 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fe5f8000b20) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2  poll_dispatch (base=0x563527fff140, tv=0x7fe5feb2ae60) at poll.c:165
#3  0x00007fe60079dfbb in opal_libevent2022_event_base_loop (base=0x563527fff140, flags=1) at event.c:1630
#4  0x00007fe60075489e in progress_engine () from /usr/local/lib/libopen-pal.so.40
#5  0x00007fe6010ec6db in start_thread (arg=0x7fe5feb2b700) at pthread_create.c:463
#6  0x00007fe600e1561f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Below is our minimal example of the issue:

#include <stdio.h>
#include <pthread.h>
#include "mpi.h"
#include <math.h>
#include <sys/time.h>
#include <stdbool.h>
#include <stdlib.h>

int rank, num_parties;
pthread_barrier_t start_work, end_work;
#define COMM_PER_ROUND 20
#define THREAD_COUNT 1
#define NUM_PARTIES 3
#define DEBUG 1
struct thread_params {
    int thread_id;
};
struct timeval begin, end;
double seconds, micro;
double elapsed;
bool doExit = false;

int get_rank() {
    return rank;
}

int get_succ() {
    return (get_rank() + 1) % NUM_PARTIES;
}

int get_pred() {
    return ((get_rank() + NUM_PARTIES) - 1) % NUM_PARTIES;
}

void *pthread_func(struct thread_params *params) {
    while (1) {
#if DEBUG
        printf("Rank %i, Thread %i starting...\n", rank, params->thread_id);
#endif
        pthread_barrier_wait(&start_work);
        if (doExit) {
#if DEBUG
            printf("Rank %i Thread %i exiting...\n", rank, params->thread_id);
#endif
            free(params);
            return 0;
        }
        // Communicate COMM_PER_ROUND times per round
        for (int i = 0; i < COMM_PER_ROUND; ++i) {
            MPI_Request r1, r2;
            long long incoming_shares[4], outgoing_shares[4];
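            // Note: params->thread_id doubles as the MPI message tag, pairing each
            // thread with its counterpart thread on the neighboring ranks.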
            // receive remote share from successor
            MPI_Irecv(incoming_shares, 4, MPI_LONG_LONG, get_succ(),
                      params->thread_id, MPI_COMM_WORLD, &r2);
            // send s1 to predecessor
            MPI_Isend(outgoing_shares, 4, MPI_LONG_LONG, get_pred(),
                      params->thread_id, MPI_COMM_WORLD, &r1);
            MPI_Wait(&r1, MPI_STATUS_IGNORE);
            MPI_Wait(&r2, MPI_STATUS_IGNORE);
        }
#if DEBUG
        printf("Rank %i, Thread %i completing...\n", rank, params->thread_id);
#endif
        pthread_barrier_wait(&end_work);
    }
}


int main(int argc, char **argv) {
    int obtained = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &obtained);
    if (obtained < MPI_THREAD_MULTIPLE) {
        printf("Unable to obtain MPI Threaded");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_parties);
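    // The barriers have THREAD_COUNT + 1 participants: the worker threads plus the
    // main thread, which also waits at both barriers to frame each timed round.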
    pthread_barrier_init(&start_work, NULL,
                         THREAD_COUNT + 1);
    pthread_barrier_init(&end_work, NULL,
                         THREAD_COUNT + 1);

    // Setup threads
    for (int i = 0; i < THREAD_COUNT; ++i) {
        pthread_t thread;
        struct thread_params *pthread_params = malloc(sizeof(struct thread_params));
        pthread_params->thread_id = i;
        pthread_create(&thread, NULL, (void *) pthread_func, (void *) pthread_params);
        pthread_detach(thread);
    }

    // Number of communication rounds
    const double ROUNDS = pow(2, 2);
    for (long long i = 0; i < ROUNDS; ++i) {
        // Start everyone
        pthread_barrier_wait(&start_work);
        gettimeofday(&begin, 0);
        // wait for everyone to finish
        pthread_barrier_wait(&end_work);
        // stop timer
        gettimeofday(&end, 0);
        seconds = end.tv_sec - begin.tv_sec;
        micro = end.tv_usec - begin.tv_usec;
        elapsed = seconds + micro * 1e-6;
        if (rank == 0) {
            printf("%lli\t%.5f\n", i, elapsed);
        }

    }
    // Have all threads exit
    doExit = true;
    pthread_barrier_wait(&start_work);
#if DEBUG
    printf("Rank %i tearing down MPI\n", rank);
#endif
    MPI_Finalize();
    return 0;
}

Makefile:

CC= gcc -std=c99
CFLAGS= -O3 -Wall
MPI= mpicc
DEP = -pthread
comm_test: main.c
			$(MPI) $(CFLAGS) $(DEP) -o comm-test main.c
IanSaucy (Author)

As an update, I've also tried Open MPI 5.0rc6 with the same results; it freezes in the same spot.

bosilca (Member) commented Apr 28, 2022

I cannot reproduce with main (v2.x-dev-9807-g0d126c6405) compiled in release mode. I increased the number of threads (THREAD_COUNT) up to 16 and ran each test 100 times. Everything went smoothly.

IanSaucy (Author) commented Apr 28, 2022

@bosilca Thanks for taking a look! On my end the test has to run for roughly 20-100K iterations before it freezes. Each run takes a different amount of time to freeze, but it's usually within that range, or about 5-20 minutes of testing. Also, it only happens when running across three nodes that communicate over LAN/WAN, as opposed to using vader, sm, or some other same-node communication method.

Thanks!

bosilca (Member) commented Apr 28, 2022

Let's try something else before we go to such testing lengths. Assuming you are using the OB1 PML with the tcp and self BTLs (this forces all non-local communications over TCP by preventing the sm BTL from being used), you should be able to dump the status of the sending and receiving queues. I suggest that once you hit the deadlock, you attach to your application with gdb and call mca_pml_ob1_dump(comm, 1), where comm is MPI_COMM_WORLD in your example (or maybe ompi_mpi_comm_world.comm). You should do this on all 3 of your processes and post the output (maybe as a gist).

IanSaucy (Author) commented Apr 29, 2022

@bosilca Specifying --mca btl self,tcp has enabled me to trigger the issue when running on a single node, thankfully in a fairly short period of time as well (when running on a large EC2 instance).

As a point of clarification on getting a dump of the queues: does that call need to be made on the main MPI thread, or could it also be called from one of the pthreads?

As of right now I'm getting a SIGSEGV when calling the function on what I presume to be one of the pthreads my program created.

GDB attached to one of the ./comm-test processes

(gdb) call (int) mca_pml_ob1_dump(ompi_mpi_comm_world.comm)

Program received signal SIGSEGV, Segmentation fault.
0x00007f10a32f2f44 in mca_pml_ob1_dump () from /home/ubuntu/openmpi-5.0.0rc6/ompi/.libs/libmpi.so.80
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mca_pml_ob1_dump) will be abandoned.
When the function is done executing, GDB will silently stop.

bosilca (Member) commented Apr 29, 2022

You can call it in any context; gdb will execute it in its own context. I think the segfault is because you are missing the second argument for the call: call (int) mca_pml_ob1_dump(ompi_mpi_comm_world.comm, 1).

IanSaucy (Author) commented Apr 29, 2022

Oops, my bad on the typo. I tried again with the correct function params and I'm getting the same result. I was able to attach ARM DDT to the process in order to inspect the message queues. Although it's less verbose, I did notice that all parties are waiting on receive calls and there is no data stuck in the sending queues.

Something did jump out at me once I started poking around the thread data using DDT. In the example I provided, there is the inner for loop each thread runs, the line for (int i = 0; i < COMM_PER_ROUND; ++i) { to be specific. The counter i in that for loop was not the same across the jth thread forked from each process. This seems odd given that each thread should progress at the same rate as its "pair" on the other processes.

I know my GDB debugging skills are a bit lacking so I'll try to do some reading to better give you some data on what is happening when this deadlock occurs.

Edit:
I added some thread-global sequence numbers on communication to try to home in on what is happening. When the program enters a deadlock, I can see that the sequence numbers are off by 1 on some threads.
(screenshot of the per-thread sequence numbers omitted)
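For context, a minimal sketch of what that kind of instrumentation could look like in the posted loop body (illustration only, reusing the names from the example above; the seq_sent/seq_recv counters are hypothetical and not the exact instrumentation used):

// Hypothetical per-thread sequence counters, kept on the thread's stack so no
// extra synchronization is needed. The send counter is stamped into the first
// slot of the outgoing buffer so paired threads can be compared in a debugger.
long long seq_sent = 0, seq_recv = 0;
for (int i = 0; i < COMM_PER_ROUND; ++i) {
    MPI_Request r1, r2;
    long long incoming_shares[4], outgoing_shares[4];
    outgoing_shares[0] = ++seq_sent;   // stamp the message with this thread's send count
    MPI_Irecv(incoming_shares, 4, MPI_LONG_LONG, get_succ(),
              params->thread_id, MPI_COMM_WORLD, &r2);
    MPI_Isend(outgoing_shares, 4, MPI_LONG_LONG, get_pred(),
              params->thread_id, MPI_COMM_WORLD, &r1);
    MPI_Wait(&r1, MPI_STATUS_IGNORE);
    MPI_Wait(&r2, MPI_STATUS_IGNORE);
    ++seq_recv;   // incoming_shares[0] should match the peer thread's seq_sent
}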

IanSaucy (Author) commented May 2, 2022

I tried mixing up the order of the Irecv vs Isend calls and the corresponding MPI_Wait calls as well. Same deadlock issue, except it changes what is stuck inside the buffers.
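For reference, a minimal sketch of one such reordered variant of the loop body, posting the Isend first and completing both requests with a single MPI_Waitall (illustration only, reusing the names from the posted example; this is not code taken from the thread):

for (int i = 0; i < COMM_PER_ROUND; ++i) {
    MPI_Request reqs[2];
    long long incoming_shares[4], outgoing_shares[4];
    // send to the predecessor first this time; both calls are non-blocking,
    // so nothing completes until the wait below
    MPI_Isend(outgoing_shares, 4, MPI_LONG_LONG, get_pred(),
              params->thread_id, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(incoming_shares, 4, MPI_LONG_LONG, get_succ(),
              params->thread_id, MPI_COMM_WORLD, &reqs[1]);
    // complete both requests in one call
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}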

IanSaucy (Author)

@bosilca Do you have any thoughts on next steps to debug this issue? Could the thread synchronization be contributing? Maybe switching away from barriers?

I can also cause the issue to occur and then give you SSH access to the node I'm testing on if that's helpful.

Thanks

devreal (Contributor) commented May 24, 2022

I am able to reproduce this issue on a single node when running with --mca btl self,tcp and THREAD_COUNT set to 2. I cannot reproduce it with either THREAD_COUNT set to 1 (as in the posted code) or with --mca btl self,sm. This points to a bug in btl/tcp.

devreal (Contributor) commented May 24, 2022

With Open MPI built in debug mode, I get an assertion every couple of runs:

test_mpi_thread_multiple: ../../../../../opal/mca/btl/tcp/btl_tcp_endpoint.c:517: mca_btl_tcp_endpoint_accept: Assertion `btl_endpoint->endpoint_sd_next == -1' failed.
[yoga:298124] *** Process received signal ***
[yoga:298124] Signal: Aborted (6)
[yoga:298124] Signal code:  (-6)
[yoga:298124] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fef966af420]
[yoga:298124] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fef964ec00b]
[yoga:298124] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fef964cb859]
[yoga:298124] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x22729)[0x7fef964cb729]
[yoga:298124] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x33fd6)[0x7fef964dcfd6]
[yoga:298124] [ 5] libopen-pal.so.0(mca_btl_tcp_endpoint_accept+0x61)[0x7fef962d93c5]
[yoga:298124] [ 6] libopen-pal.so.0(mca_btl_tcp_proc_accept+0x2c2)[0x7fef962df850]
[yoga:298124] [ 7] lib/libopen-pal.so.0(+0xd99d7)[0x7fef962d79d7]
[yoga:298124] [ 8] libevent_core-2.1.so.7(+0x2258d)[0x7fef9608058d]
[yoga:298124] [ 9] libevent_core-2.1.so.7(+0x22a9a)[0x7fef96080a9a]
[yoga:298124] [10] libevent_core-2.1.so.7(event_base_loop+0x2db)[0x7fef9608124e]
[yoga:298124] [11] libopen-pal.so.0(+0x3746f)[0x7fef9623546f]
[yoga:298124] [12] libopen-pal.so.0(opal_progress+0xc0)[0x7fef9623555e]
[yoga:298124] [13] libopen-pal.so.0(ompi_sync_wait_mt+0x1ef)[0x7fef96309402]
[yoga:298124] [14] libmpi.so.0(+0xb6987)[0x7fef96794987]
[yoga:298124] [15] libmpi.so.0(ompi_request_default_wait+0x2b)[0x7fef96794bce]
[yoga:298124] [16] libmpi.so.0(MPI_Wait+0x151)[0x7fef9683fc76]
[yoga:298124] [17] ./test_mpi_thread_multiple(+0x1581)[0x560e96348581]
[yoga:298124] [18] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fef966a3609]
[yoga:298124] [19] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fef965c8133]
[yoga:298124] *** End of error message ***

That's with --mca btl self,tcp and

#define COMM_PER_ROUND 10
#define THREAD_COUNT 2

bosilca (Member) commented May 24, 2022

I can't replicate on a setup with a single IP address. This, as well as the assert location reported by @devreal, might indicate that the issue is in the code handling multiple interfaces. To check this out, let's make sure we restrict the BTL TCP flow to a single interface (127.0.0.1 on a single node).

What do you get if you run mpirun -np 3 --mca btl self,tcp --mca btl_tcp_if_include 127.0.0.1 ./thread_multiple ?

IanSaucy (Author)

@bosilca Still seeing the same issue with the updated command as well. I'll need to re-compile OMPI in debug mode to check whether it fails the assert like @devreal found.

Thanks!

bosilca (Member) commented May 24, 2022

I tried to replicate on the same hardware as @devreal, but so far without success. I will talk with him tomorrow to see if he can share a setup where I can dig a little deeper into the reasons behind that abort (the abort signals that we are trying to open the connection between a pair of processes multiple times, which is clearly a bad thing).

devreal (Contributor) commented May 25, 2022

I think @bosilca is right: I have two interfaces (one being the wifi and one being a docker device). If I a) run with --mca btl_tcp_if_include 127.0.0.1/16 (notice the /16 mask, otherwise the value is ignored); or b) disable the docker network interface, the hang disappears. That means this is an issue related to multiple devices in btl/tcp.

bwbarrett modified the milestones: v4.1.4, v4.1.5 (May 25, 2022)
IanSaucy (Author)

Thanks for all the help on this issue, everyone! I'm wondering if you (@devreal) could describe your setup a little more? Using the command below I still see the freeze happening. I re-compiled OMPI with --enable-debug but haven't seen the assertion error yet; the freeze is still there.

mpirun -np 3 --mca btl self,tcp  --mca btl_tcp_if_include 127.0.0.1/16  ./comm-test

Thanks!

devreal (Contributor) commented May 26, 2022

Here is my network config:

$ ifconfig -a
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:34:cb:48:a7  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 145404  bytes 27992926 (27.9 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 145404  bytes 27992926 (27.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlp1s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 216.96.214.169  netmask 255.255.252.0  broadcast 216.96.215.255
        inet6 fe80::2fd2:24f3:f345:17f9  prefixlen 64  scopeid 0x20<link>
        ether e0:94:67:be:67:4d  txqueuelen 1000  (Ethernet)
        RX packets 7908598  bytes 6036182119 (6.0 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4245315  bytes 1664424851 (1.6 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

As I said earlier, excluding the docker0 device (shutting it down or restricting the interface to lo using --mca btl_tcp_if_include 127.0.0.1/16) solves the problem for me. Not sure why there is no conflict between lo and wlp1s0 once docker0 is gone, though.

IanSaucy (Author)

@devreal Interesting, I am not able to get the issue to go away even when restricting the network interface usage as you described. Are you able to provide some more details about the operating system and machine (VM or physical) that you are running the test on? Maybe my environment is introducing some other conflict.

Also, how often did you generally see the assertion error when running the test program? I've run it in my environment ~10 times and haven't seen any assertion errors yet. This is with MPI built using the --enable-debug param.

Thanks!

bosilca (Member) commented May 31, 2022

The assertion only triggers in a scenario with multiple IP network interfaces; as soon as you restrict to a single interface you should never see the assert.

Back to the test: I've run the test hundreds of times, and other members of the team did similar tests, but we were unable to make it deadlock. This might indicate we are chasing the wrong lead and the problem is not with the communication layer. Going back to your original bug description, it was mentioned that the 3rd process was sitting idle. That's not something OMPI does in any MPI call, so maybe we can get some info from there. Can you please post the stack trace of the last process?

bwbarrett modified the milestones: v4.1.5, v4.1.6 (Feb 23, 2023)
bwbarrett removed this from the v4.1.6 milestone (Sep 30, 2023)
bwbarrett added this to the v4.1.7 milestone (Sep 30, 2023)
wenduwan (Contributor)

@IanSaucy Sorry for the 2-year wait. Are you still working on this?

github-actions bot commented May 2, 2024

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions bot added the Stale label (May 2, 2024)
IanSaucy (Author) commented May 3, 2024

@wenduwan Thanks for the response! I am not personally on this project anymore, but I know my colleague @nitinm25 is still working on it. I'll let them pop in with any updates!

github-actions bot

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions bot added the Stale label (May 17, 2024)
github-actions bot

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on May 31, 2024