
IMB-EXT stalls using openmpi 2.1.3 #4976

Closed
nmorey opened this issue Mar 27, 2018 · 27 comments

@nmorey
Contributor

nmorey commented Mar 27, 2018

Running IMB-EXT from Intel(R) MPI Benchmarks 2018 Update 1 on a SLE12-SP3 system.

rdma03:~/hpc-testing/:[0]# ompi_info --version
Open MPI v2.1.3.0.cfd8f3f34e27
rdma03:~/hpc-testing/:[0]# lspci | grep Mell
02:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
rdma04:~/:[0]# lspci | grep Mell
0a:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]

Running with this command stalls:

rdma03:~/hpc-testing/:[0]# mpirun --host 192.168.0.1,192.168.0.2 -np 2  --allow-run-as-root --mca btl openib /usr/lib64/mpi/gcc/openmpi2/tests/IMB/IMB-EXT
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-2 part    
#------------------------------------------------------------
# Date                  : Tue Mar 27 10:42:19 2018
# Machine               : x86_64
# System                : Linux
# Release               : 4.4.120-94.17-default
# Version               : #1 SMP Wed Mar 14 17:23:00 UTC 2018 (cf3a7bb)
# MPI Version           : 3.1
# MPI Thread Environment: 


# Calling sequence was: 

# /usr/lib64/mpi/gcc/openmpi2/tests/IMB/IMB-EXT

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Window
# Unidir_Get
# Unidir_Put
# Bidir_Get
# Bidir_Put
# Accumulate

And then it stalls. Both nodes have an IMB-EXT process spinning at 100% CPU.

On the first host:

(gdb) bt
#0  opal_atomic_unlock (lock=0x7f0fd17454e4 <mca_coll_libnbc_component+708>) at ../../../../opal/include/opal/sys/atomic_impl.h:435
#1  ompi_coll_libnbc_progress () at coll_libnbc_component.c:295
#2  0x00007f0fe2aadd94 in opal_progress () at runtime/opal_progress.c:226
#3  0x00007f0fe360b515 in sync_wait_st (sync=<optimized out>) at ../opal/threads/wait_sync.h:80
#4  ompi_request_default_wait_all (count=2, requests=0x7ffde91a2f20, statuses=0x0) at request/req_wait.c:221
#5  0x00007f0fe3652c7c in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=<optimized out>, rbuf=0x7ffde91a3010, count=4, dtype=0x7f0fe3897a40 <ompi_mpi_long>, op=<optimized out>, comm=<optimized out>, module=0x103c3a0)
    at base/coll_base_allreduce.c:225
#6  0x00007f0fd06d6f2c in ompi_osc_rdma_check_parameters (size=0, disp_unit=1, module=0x103a140) at osc_rdma_component.c:1054
#7  ompi_osc_rdma_component_select (win=0x103a060, base=0x7ffde91a3088, size=0, disp_unit=1, comm=0x1037cd0, info=0x6136a0 <ompi_mpi_info_null>, flavor=1, model=0x7ffde91a3094) at osc_rdma_component.c:1182
#8  0x00007f0fe360ec2c in ompi_win_create (base=base@entry=0x1003fd0, size=size@entry=0, disp_unit=disp_unit@entry=1, comm=comm@entry=0x1037cd0, info=0x6136a0 <ompi_mpi_info_null>, newwin=newwin@entry=0x7ffde91a3418) at win/win.c:236
#9  0x00007f0fe363e9dc in PMPI_Win_create (base=0x1003fd0, size=0, disp_unit=1, info=<optimized out>, comm=0x1037cd0, win=0x7ffde91a3418) at pwin_create.c:79
#10 0x000000000040a2f8 in IMB_window ()
#11 0x0000000000406c34 in IMB_init_buffers_iter ()
#12 0x0000000000402448 in main ()

On the second host:

(gdb) bt
#0  0x00007f08f56f333e in poll_cq (cqe_ver=0, wc=<optimized out>, ne=<optimized out>, ibcq=0x1f80e00) at ../providers/mlx5/cq.c:931
#1  mlx5_poll_cq (ibcq=0x1f80e00, ne=256, wc=<optimized out>) at ../providers/mlx5/cq.c:1221
#2  0x00007f08ef3c4cc7 in ibv_poll_cq (wc=0x7fff9ccbe810, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2055
#3  poll_device (device=device@entry=0x1ecac00, count=count@entry=0) at btl_openib_component.c:3581
#4  0x00007f08ef3c5aad in progress_one_device (device=0x1ecac00) at btl_openib_component.c:3714
#5  btl_openib_component_progress () at btl_openib_component.c:3738
#6  0x00007f08fea16d94 in opal_progress () at runtime/opal_progress.c:226
#7  0x00007f08ff573e55 in ompi_request_wait_completion (req=0x256e300) at ../ompi/request/request.h:392
#8  ompi_request_default_wait (req_ptr=0x7fff9ccc19a8, status=0x7fff9ccc19b0) at request/req_wait.c:41
#9  0x00007f08ff5c21ca in ompi_coll_base_sendrecv_zero (stag=-16, rtag=-16, comm=0x23e7570, source=0, dest=0) at base/coll_base_barrier.c:63
#10 ompi_coll_base_barrier_intra_two_procs (comm=0x23e7570, module=<optimized out>) at base/coll_base_barrier.c:296
#11 0x00007f08ec8f86a7 in component_select (win=0x2383ed0, base=0x7fff9ccc1aa8, size=0, disp_unit=1, comm=0x22bb4e0, info=0x6136a0 <ompi_mpi_info_null>, flavor=1, model=0x7fff9ccc1ab4) at osc_pt2pt_component.c:416
#12 0x00007f08ff577c2c in ompi_win_create (base=base@entry=0x21b7fc0, size=size@entry=0, disp_unit=disp_unit@entry=1, comm=comm@entry=0x22bb4e0, info=0x6136a0 <ompi_mpi_info_null>, newwin=newwin@entry=0x7fff9ccc1e38) at win/win.c:236
#13 0x00007f08ff5a79dc in PMPI_Win_create (base=0x21b7fc0, size=0, disp_unit=1, info=<optimized out>, comm=0x22bb4e0, win=0x7fff9ccc1e38) at pwin_create.c:79
#14 0x000000000040a2f8 in IMB_window ()
#15 0x0000000000406c34 in IMB_init_buffers_iter ()
#16 0x0000000000402448 in main ()
@ggouaillardet
Contributor

IIRC that is a known issue: the two hosts have different hardware and hence end up selecting different osc components. The backtraces above are consistent with that: the first host is inside ompi_osc_rdma_component_select (osc/rdma) while the second is inside the osc/pt2pt component_select, so the two ranks execute different collectives and deadlock.

As a workaround, you can

mpirun --mca osc ^rdma ...

@nmorey
Contributor Author

nmorey commented Mar 27, 2018

I'm redeploying the servers right now. I'll test this ASAP.

$ /usr/lib64/mpi/gcc/openmpi2/bin/ompi_info | grep osc
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v2.1.2)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v2.1.2)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v2.1.2)

As both adapters are InfiniBand, shouldn't both hosts use osc/rdma automatically?

@nmorey
Contributor Author

nmorey commented Mar 27, 2018

Also:

  • the IMB-MPI1 benchmark works fine in this setup
  • both IMB-MPI1 and IMB-EXT work fine with openmpi 1.10.7

@ggouaillardet
Contributor

osc is for one-sided communications (not exercised by IMB-MPI1; see the sketch at the end of this comment), and I do not think there is an osc/rdma component in 1.10.

You can

mpirun --mca osc_base_verbose 10 ...

to see which component is selected.
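
To illustrate what "one-sided" means here, below is a minimal sketch (not the benchmark's actual source) of the pattern IMB-EXT exercises; window creation is where each process selects an osc component, and it is exactly where the backtraces above show both processes stuck.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 0, one = 1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Collective window creation: every process must agree on the osc
     * component backing the window. This is the call that hangs above. */
    MPI_Win_create(&buf, sizeof(buf), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (0 == rank && size > 1) {
        /* One-sided put: rank 0 writes into rank 1's window without rank 1
         * posting any matching receive. */
        MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}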

@nmorey
Contributor Author

nmorey commented Mar 27, 2018

Here's what I get:

[rdma03:14356] mca: base: components_register: registering framework osc components
[rdma03:14356] mca: base: components_register: found loaded component pt2pt
[rdma03:14356] mca: base: components_register: component pt2pt register function successful
[rdma03:14356] mca: base: components_register: found loaded component rdma
[rdma03:14356] mca: base: components_register: component rdma register function successful
[rdma03:14356] mca: base: components_register: found loaded component sm
[rdma03:14356] mca: base: components_register: component sm has no register or open function
[rdma03:14356] mca: base: components_open: opening osc components
[rdma03:14356] mca: base: components_open: found loaded component pt2pt
[rdma03:14356] mca: base: components_open: found loaded component rdma
[rdma03:14356] mca: base: components_open: found loaded component sm
[rdma03:14356] mca: base: components_open: component sm open function successful
[rdma04:12703] mca: base: components_register: registering framework osc components
[rdma04:12703] mca: base: components_register: found loaded component pt2pt
[rdma04:12703] mca: base: components_register: component pt2pt register function successful
[rdma04:12703] mca: base: components_register: found loaded component rdma
[rdma04:12703] mca: base: components_register: component rdma register function successful
[rdma04:12703] mca: base: components_register: found loaded component sm
[rdma04:12703] mca: base: components_register: component sm has no register or open function
[rdma04:12703] mca: base: components_open: opening osc components
[rdma04:12703] mca: base: components_open: found loaded component pt2pt
[rdma04:12703] mca: base: components_open: found loaded component rdma
[rdma04:12703] mca: base: components_open: found loaded component sm
[rdma04:12703] mca: base: components_open: component sm open function successful

I do not think there is a osc/rdma component in 1.10

There seems to be one:

[(master) nmorey@portia:openmpi ((v1.10.7^0) %)]$ ll ompi/mca/osc/
total 120
drwxr-xr-x 2 nmorey users   146 Mar 27 13:47 base
-rw-r--r-- 1 nmorey users  1139 Mar 27 13:47 Makefile.am
-rw-r--r-- 1 nmorey users 92603 Nov 20 16:58 Makefile.in
-rw-r--r-- 1 nmorey users 19791 Mar 27 13:47 osc.h
drwxr-xr-x 2 nmorey users   278 Mar 27 13:47 portals4
drwxr-xr-x 2 nmorey users  4096 Mar 27 13:47 pt2pt
drwxr-xr-x 2 nmorey users    25 Mar 27 13:47 rdma
drwxr-xr-x 2 nmorey users   168 Mar 27 13:47 sm

@ggouaillardet
Contributor

I will double-check that.

What happens when you blacklist the osc/rdma component?

@ggouaillardet
Contributor

Is this the only log you get when the benchmark hangs?

@nmorey
Contributor Author

nmorey commented Mar 27, 2018

Doing this gets it working:

mpirun  --host 192.168.0.1,192.168.0.2 -np 2  --allow-run-as-root --mca btl openib --mca osc ^rdma  /usr/lib64/mpi/gcc/openmpi2/tests/IMB/IMB-EXT

@nmorey
Contributor Author

nmorey commented Mar 27, 2018

Is this the only log you get when the benchmark hangs?

No warning/error. Just the last printf hanging there

@ggouaillardet
Contributor

Can you collect the same traces with IMB-EXT and 1.10?

@nmorey
Contributor Author

nmorey commented Mar 27, 2018

You're right, osc/rdma is not available in 1.10.7 (at least in our build).

Using openmpi 1.10.7:

rdma03:~/:[0]# mpirun --mca osc_base_verbose 10 --host 192.168.0.1,192.168.0.2 -np 2  --allow-run-as-root /usr/lib64/mpi/gcc/openmpi/tests/IMB/IMB-EXT
[rdma04:14655] mca: base: components_register: registering osc components
[rdma04:14655] mca: base: components_register: found loaded component pt2pt
[rdma04:14655] mca: base: components_register: component pt2pt register function successful
[rdma04:14655] mca: base: components_register: found loaded component sm
[rdma04:14655] mca: base: components_register: component sm has no register or open function
[rdma04:14655] mca: base: components_open: opening osc components
[rdma04:14655] mca: base: components_open: found loaded component pt2pt
[rdma04:14655] mca: base: components_open: component pt2pt open function successful
[rdma04:14655] mca: base: components_open: found loaded component sm
[rdma04:14655] mca: base: components_open: component sm open function successful
[rdma03:17554] mca: base: components_register: registering osc components
[rdma03:17554] mca: base: components_register: found loaded component pt2pt
[rdma03:17554] mca: base: components_register: component pt2pt register function successful
[rdma03:17554] mca: base: components_register: found loaded component sm
[rdma03:17554] mca: base: components_register: component sm has no register or open function
[rdma03:17554] mca: base: components_open: opening osc components
[rdma03:17554] mca: base: components_open: found loaded component pt2pt
[rdma03:17554] mca: base: components_open: component pt2pt open function successful
[rdma03:17554] mca: base: components_open: found loaded component sm
[rdma03:17554] mca: base: components_open: component sm open function successful
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'rdma03', but the
default subnet GID prefix was detected on more than one of these
ports.  If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI.  This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-2 part    
#------------------------------------------------------------
# Date                  : Tue Mar 27 14:40:24 2018
# Machine               : x86_64
# System                : Linux
# Release               : 4.4.73-5-default
# Version               : #1 SMP Tue Jul 4 15:33:39 UTC 2017 (b7ce4e4)
# MPI Version           : 3.0
# MPI Thread Environment: 


# Calling sequence was: 

# /usr/lib64/mpi/gcc/openmpi/tests/IMB/IMB-EXT

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# Window
# Unidir_Get
# Unidir_Put
# Bidir_Get
# Bidir_Put
# Accumulate
[rdma03:17554] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma03:17554] pt2pt component destroying window with id 4
[rdma04:14655] pt2pt component destroying window with id 4

[Cut here as it goes on and on]

@ggouaillardet
Copy link
Contributor

Any chance to test the latest master?
I cannot remember whether we fixed that (in which case only a backport would be needed).

@hjelmn any recollection of this issue?

@nmorey
Contributor Author

nmorey commented Mar 27, 2018

I have an openmpi 3.0.0 package available that I can test quickly if that's of any interest.
Anything else will need some more time

@ggouaillardet
Contributor

ggouaillardet commented Mar 27, 2018

That will be enough for now, thanks

@nmorey
Contributor Author

nmorey commented Mar 27, 2018

@ggouaillardet openmpi 3.0.0 behaves exactly like 2.1.3 and stalls

@ggouaillardet
Contributor

It seems this has never been fixed, even on master.

Can you please give the inline patch a try?
It is really a proof of concept at this stage.

diff --git a/ompi/mca/osc/rdma/osc_rdma_component.c b/ompi/mca/osc/rdma/osc_rdma_component.c
index b5c544a..db450ca 100644
--- a/ompi/mca/osc/rdma/osc_rdma_component.c
+++ b/ompi/mca/osc/rdma/osc_rdma_component.c
@@ -767,6 +767,7 @@ static int ompi_osc_rdma_query_btls (ompi_communicator_t *comm, struct mca_btl_b
     int *btl_counts = NULL;
     char **btls_to_use;
     void *tmp;
+    int tmps[3];
 
     btls_to_use = opal_argv_split (ompi_osc_rdma_btl_names, ',');
     if (btls_to_use) {
@@ -793,6 +794,20 @@ static int ompi_osc_rdma_query_btls (ompi_communicator_t *comm, struct mca_btl_b
         *btl = selected_btl;
     }
 
+    tmps[0] = (NULL==selected_btl)?0:1;
+    rc = comm->c_coll->coll_allreduce(tmps, tmps+1, 1, MPI_INT, MPI_MAX, comm, comm->c_coll->coll_allreduce_module);
+    if (OMPI_SUCCESS != rc) {
+        return rc;
+    }
+    tmps[2] = (tmps[0] == tmps[1]) ? 1 : 0;
+    rc = comm->c_coll->coll_allreduce(tmps+2, tmps, 1, MPI_INT, MPI_MIN, comm, comm->c_coll->coll_allreduce_module);
+    if (OMPI_SUCCESS != rc) {
+        return rc;
+    }
+    if (!tmps[0]) {
+        return OMPI_ERR_NOT_AVAILABLE;
+    }
+
     if (NULL != selected_btl) {
         OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_INFO, "selected btl: %s",
                          selected_btl->btl_component->btl_version.mca_component_name);

@hjelmn
Member

hjelmn commented Mar 28, 2018

@ggouaillardet Not a configuration I have or care about. If your patch fixes it let me know. BTW, you can get the same result using a single allreduce:

tmps[0] = (NULL == selected_btl) ? 0 : 1; tmps[1] = -tmps[0];
rc = comm->c_coll->coll_allreduce(MPI_IN_PLACE, tmps, 2, MPI_INT, MPI_MAX, comm, comm->c_coll->coll_allreduce_module);
if (tmps[0] != -tmps[1]) {
    /* results differ */
    return OMPI_ERR_NOT_AVAILABLE;
}
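
In standalone form (plain MPI rather than the Open MPI internals above, with have_btl as a hypothetical stand-in for "this rank found a usable BTL"), the single-allreduce agreement check looks roughly like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, have_btl, tmps[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend only even ranks found a BTL so the disagreement path is hit
     * when running on 2+ ranks. */
    have_btl = (0 == rank % 2) ? 1 : 0;

    /* After the MAX reduction, tmps[0] == max(have_btl) over all ranks and
     * tmps[1] == -min(have_btl), so tmps[0] != -tmps[1] iff ranks disagree. */
    tmps[0] = have_btl;
    tmps[1] = -have_btl;
    MPI_Allreduce(MPI_IN_PLACE, tmps, 2, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    if (tmps[0] != -tmps[1]) {
        /* Every rank reaches the same conclusion, so they can all fall back
         * to another component together instead of deadlocking. */
        if (0 == rank) printf("ranks disagree on BTL availability\n");
    } else {
        if (0 == rank) printf("all ranks agree: have_btl = %d\n", tmps[0]);
    }

    MPI_Finalize();
    return 0;
}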

@hjelmn
Member

hjelmn commented Mar 28, 2018

Though I do find it odd that Connect-IB doesn't select the verbs btl.

This will not be an issue once the uct btl is in place. For reference see #4919. It will probably go in later this week once I have verified it works with IB.

@ggouaillardet
Contributor

@hjelmn thanks for the comment, I will definitely use a single allreduce.

@nmorey
Contributor Author

nmorey commented Mar 28, 2018

@ggouaillardet Had to fix a compile error in your patch (s/com->c_coll->/com->c_coll./g), but it fixes the issue.

@ggouaillardet
Contributor

ggouaillardet commented Mar 28, 2018

Here is a more correct patch.

[EDIT] Use MPI_MIN instead of MPI_MAX, so that any rank's failure (error codes are negative, OMPI_SUCCESS is 0) propagates to every rank and they all take the same branch.

diff --git a/ompi/mca/osc/rdma/osc_rdma_component.c b/ompi/mca/osc/rdma/osc_rdma_component.c
index b145395..069c9dc 100644
--- a/ompi/mca/osc/rdma/osc_rdma_component.c
+++ b/ompi/mca/osc/rdma/osc_rdma_component.c
@@ -372,6 +372,8 @@ static int ompi_osc_rdma_component_query (struct ompi_win_t *win, void **base, s
                                           int flavor)
 {
 
+    int rc;
+
     if (MPI_WIN_FLAVOR_SHARED == flavor) {
         return -1;
     }
@@ -385,15 +387,18 @@ static int ompi_osc_rdma_component_query (struct ompi_win_t *win, void **base, s
     }
 #endif /* OPAL_CUDA_SUPPORT */
 
-    if (OMPI_SUCCESS == ompi_osc_rdma_query_mtls ()) {
+    rc = ompi_osc_rdma_query_mtls ();
+    rc = comm->c_coll->coll_allreduce(MPI_IN_PLACE, &rc, 1, MPI_INT, MPI_MIN, comm, comm->c_coll->coll_allreduce_module);
+    if (OMPI_SUCCESS == rc) {
         return 5; /* this has to be lower that osc pt2pt default priority */
     }
 
-    if (OMPI_SUCCESS != ompi_osc_rdma_query_btls (comm, NULL)) {
+    rc = ompi_osc_rdma_query_btls (comm, NULL);
+    rc = comm->c_coll->coll_allreduce(MPI_IN_PLACE, &rc, 1, MPI_INT,  MPI_MIN, comm, comm->c_coll->coll_allreduce_module);
+    if (OMPI_SUCCESS != rc) {
         return -1;
     }
 
-
     return mca_osc_rdma_component.priority;
 }

Similar porting has to be done for the v2.x series.

I will resume my work next week.

@hjelmn
Member

hjelmn commented Mar 28, 2018

Keep in mind that the patch will hurt performance for RMA. If the two systems can talk over InfiniBand and you want performance, you need to figure out why one of the systems is not getting a valid openib btl module.
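
One way to start digging (a suggestion, not something verified on this setup) is to check the HCA and port state on each host with ibv_devinfo, confirm the openib btl component is built and found on both hosts with ompi_info | grep btl, and rerun with

mpirun --mca btl_base_verbose 100 ...

to see on which host, and why, the openib btl module is not being created.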

@nmorey
Contributor Author

nmorey commented Mar 28, 2018

I will look into that.
But does the patch have an impact on systems that are working as expected?

@hjelmn
Member

hjelmn commented Mar 28, 2018

It shouldn't. In the common case the same BTL will be selected by all processes and we should get OMPI_SUCCESS in rc. I can double-check once we finish service time on our systems.

@jsquyres
Member

@nmorey @hjelmn @ggouaillardet There have been no new updates here for months. Is this issue still happening at the HEAD of master / the release branches?


It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

@github-actions github-actions bot added the Stale label Feb 16, 2024

github-actions bot commented Mar 2, 2024

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

@github-actions github-actions bot closed this as not planned Mar 2, 2024