
Significant degradation in message rates observed on Master. #1831

Closed
jladd-mlnx opened this issue Jun 29, 2016 · 116 comments

@jladd-mlnx (Member) commented Jun 29, 2016

Opening this issue for tracking purposes. Measured with a master nightly build against 1.10.3. Possible fix on master.

@hjelmn or @bosilca please comment.

PML - Yalla
OMPI – 1.10.3
$mpirun -np 2 --map-by node --bind-to core -mca pml yalla -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-v1.10/tests/osu-micro-benchmarks-5.2/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.2
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       4.01        4005006.11
2                       8.24        4121056.15
4                      16.39        4097311.09
8                      32.45        4055766.73
16                     64.16        4010025.24
32                    127.13        3972687.66
64                    237.04        3703703.70
128                   455.11        3555555.62
256                   860.96        3363110.99
512                  1592.23        3109815.42
1024                 2811.50        2745602.68
2048                 4972.38        2427921.16
4096                 5430.79        1325875.29
8192                 5933.54         724309.64
16384                6155.42         375697.10
32768                6328.16         193120.10
65536                6398.15          97627.95
131072               6433.23          49081.64
262144               5161.27          19688.67
524288               5731.10          10931.20
1048576              6046.06           5765.97
2097152              6215.12           2963.60
4194304              6306.30           1503.54


---------------------
PML – Yalla
OMPI – Master
$mpirun -np 2 --map-by node --bind-to core -mca pml yalla -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-master/tests/osu-micro-benchmarks-5.2/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.2
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       1.89        1887305.40
2                       3.80        1898890.08
4                       7.56        1889678.24
8                      15.31        1914346.13
16                     30.41        1900517.95
32                     60.30        1884510.12
64                    119.99        1874796.93
128                   227.47        1777098.03
256                   454.66        1776025.43
512                   870.71        1700598.93
1024                 1599.39        1561900.54
2048                 3228.97        1576645.16
4096                 4453.33        1087237.56
8192                 5822.02         710695.44
16384                6213.84         379262.51
32768                6336.49         193374.24
65536                6403.37          97707.72
131072               6438.18          49119.44
262144               5126.38          19555.59
524288               5708.48          10888.06
1048576              6033.43           5753.92
2097152              6208.48           2960.43
4194304              6303.32           1502.83


-----------------------------
PML - OB1
OMPI – 1.10.3
$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-v1.10/tests/osu-micro-benchmarks-5.2/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.2
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       3.20        3204807.28
2                       6.91        3453858.56
4                      13.82        3453858.78
8                      27.41        3426124.16
16                     54.12        3382663.90
32                    105.73        3304078.53
64                    208.55        3258655.72
128                   402.16        3141875.26
256                   780.19        3047618.99
512                  1324.49        2586903.70
1024                 2392.70        2336619.29
2048                 4147.85        2025316.48
4096                 5411.73        1321222.14
8192                 5900.16         720234.07
16384                6083.99         371337.40
32768                6329.11         193149.24
65536                6427.56          98076.78
131072               6478.69          49428.48
262144               6503.55          24809.09
524288               6517.20          12430.56
1048576              6523.66           6221.44
2097152              6526.58           3112.11
4194304              6528.46           1556.51
----------------------------------
PML – OB1
OMPI - Master
$mpirun -np 2 --map-by node -mca pml ob1 -mca btl_openib_if_include mlx5_0:1 /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-master/tests/osu-micro-benchmarks-5.2/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.2
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       1.64        1636174.22
2                       4.79        2392507.23
4                       9.69        2423259.37
8                      19.08        2384926.46
16                     38.57        2410744.90
32                     75.80        2368681.59
64                    149.17        2330745.92
128                   281.28        2197461.83
256                   539.24        2106415.38
512                  1065.10        2080264.37
1024                 1807.65        1765284.91
2048                 3429.21        1674421.30
4096                 5233.04        1277597.35
8192                 5634.88         687851.71
16384                5303.44         323696.21
32768                6091.79         185906.65
65536                6392.29          97538.57
131072               6459.20          49279.78
262144               6494.59          24774.90
524288               6512.32          12421.26
1048576              6521.32           6219.21
2097152              6525.70           3111.70
4194304              6526.97           1556.15
@hjelmn (Member) commented Jun 29, 2016

@jladd-mlnx You need to confirm this affects 2.0.0 before setting a blocker there.

@hjelmn (Member) commented Jun 29, 2016

One of the biggest differences between 2.0.0 and master is that master always has MPI_THREAD_MULTIPLE support enabled. This will impact performance, and we intend to fix as much of the performance regression as possible before the next release branch is cut from master.
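To make the cost concrete, here is a minimal hypothetical C sketch (not Open MPI's actual code; the names thread_multiple and queue_append are invented): when MPI_THREAD_MULTIPLE support is compiled in unconditionally, every message on the fast path pays locking overhead even in a single-threaded run, while gating the lock behind a runtime flag avoids it.

/* Sketch only: always-on thread safety vs. a runtime guard. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static bool thread_multiple = false;   /* would be set from MPI_Init_thread() */
static long queue_depth = 0;

static void queue_append(long frags)
{
    if (thread_multiple) {             /* pay for the lock only when threads are enabled */
        pthread_mutex_lock(&queue_lock);
        queue_depth += frags;
        pthread_mutex_unlock(&queue_lock);
    } else {
        queue_depth += frags;          /* single-threaded: no locking overhead */
    }
}

int main(void)
{
    for (int i = 0; i < 1000000; ++i)  /* stand-in for a message-rate hot loop */
        queue_append(1);
    printf("queued %ld fragments\n", queue_depth);
    return 0;
}

Compiled with cc -O2 -pthread, flipping thread_multiple to true and timing the loop gives a rough feel for the per-message overhead being discussed.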

@bosilca (Member) commented Jun 29, 2016

We also need to confirm that the performance drop was not already visible before the request rework made it in.

@jladd-mlnx (Member Author) commented Jun 29, 2016

Yalla looks better on 2.x, but still degraded by ~5%. OpenIB still has performance issues, and a bug that prevents the test from completing.

PML - Yalla
OMPI - v2.x

$mpirun -np 2 --map-by node --bind-to core -mca pml yalla -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       3.78        3781191.93
2                       7.58        3788042.56
4                      15.29        3822050.11
8                      30.19        3773720.66
16                     60.38        3773580.52
32                    120.21        3756457.15
64                    235.28        3676253.05
128                   402.59        3145267.57
256                   822.60        3213293.41
512                  1535.95        2999911.89
1024                 2714.29        2650670.98
2048                 4405.85        2151295.22
4096                 5152.63        1257966.21
8192                 5725.70         698938.48
16384                6106.27         372697.15
32768                6250.77         190758.24
65536                6332.56          96627.17
131072               6403.43          48854.28
262144               6402.59          24423.95
524288               6442.80          12288.66
1048576              6460.99           6161.68
2097152              6470.35           3085.30
4194304              6475.19           1543.81

PML - OB1
OMPI - v2.x

joshual@hpchead /hpc/mtr_scrap/users/joshual/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt 
$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       2.04        2042588.59
2                       4.23        2115237.51
4                       8.40        2100731.38
8                      16.82        2101930.46
16                     33.66        2103450.30
32                     66.01        2062909.05
64                    129.70        2026597.13
128                   247.53        1933860.78
256                   473.00        1847674.81
512                   867.73        1694776.11
1024                 1523.01        1487313.31
2048                 2571.69        1255709.54
4096                 3762.62         918608.56
8192                 5847.15         713762.90
[vegas36][[3628,1],1][btl_tcp_endpoint.c:800:mca_btl_tcp_endpoint_complete_connect] connect() to 21.151.70.35 failed: No route to host (113)

@hjelmn (Member) commented Jun 29, 2016

Looks like the issue is that the tcp BTL is being used. Add -mca btl self,vader,openib

@jladd-mlnx (Member Author)

@hjelmn I see that. That's odd; it violates the law of least surprise. Still some degradation.

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl openib,self -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       3.11        3106859.54
2                       6.37        3186949.58
4                      12.81        3203248.06
8                      25.58        3197116.28
16                     50.15        3134119.36
32                     98.67        3083294.13
64                    195.66        3057110.52
128                   377.00        2945293.94
256                   722.70        2823063.32
512                  1253.78        2448783.17
1024                 2273.13        2219848.99
2048                 3797.79        1854390.37
4096                 5168.69        1261887.62
8192                 5896.71         719813.82
16384                6024.78         367723.32
32768                6278.04         191590.67
65536                6406.25          97751.62
131072               6464.88          49323.11
262144               6498.35          24789.24
524288               6514.36          12425.15
1048576              6522.33           6220.18
2097152              6526.24           3111.95
4194304              6528.29           1556.47

@hjelmn (Member) commented Jun 29, 2016

Yeah, very odd that the tcp BTL is active. Is this system similar to the one running Jenkins? I understand that one has a two-port card with one IB port and one Ethernet port. By default both ports will be used for large messages; I'm not sure why it is affecting small ones. I put together a patch to adjust the latency and bandwidth numbers for all the BTLs, which might help.
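For context, a rough sketch of why a slow extra port hurts large-message bandwidth (illustrative only, not Open MPI's actual ob1/bml code, and the bandwidth figures are made up): when more than one BTL is eligible for a peer, large messages can be striped across them roughly in proportion to each BTL's advertised bandwidth, so a TCP port left in the set drags the aggregate down.

/* Sketch only: bandwidth-proportional striping across eligible BTLs. */
#include <stdio.h>

struct btl { const char *name; double bandwidth_mbps; };

int main(void)
{
    /* hypothetical per-BTL bandwidths, not measured values */
    struct btl btls[] = { { "openib", 56000.0 }, { "tcp", 10000.0 } };
    const int n = sizeof(btls) / sizeof(btls[0]);

    double total = 0.0;
    for (int i = 0; i < n; ++i)
        total += btls[i].bandwidth_mbps;

    const size_t msg = 4u * 1024 * 1024;   /* stripe a 4 MiB message */
    for (int i = 0; i < n; ++i) {
        size_t share = (size_t)(msg * (btls[i].bandwidth_mbps / total));
        printf("%-6s carries %zu bytes\n", btls[i].name, share);
    }
    return 0;
}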

@hjelmn (Member) commented Jun 29, 2016

BTW, ~5% is not a blocker. I have seen larger variation due to changes in icache miss rates :-/

@bosilca (Member) commented Jun 29, 2016

Did we lose the exclusivity?

@jladd-mlnx (Member Author)

The law of least surprise is respected on master: the same command line that triggered the TCP BTL loads OpenIB on master. The nightly build is an internal MLNX nightly build. I'll try with master from Git when I get a chance.

@hjelmn (Member) commented Jun 29, 2016

@bosilca The exclusivity check looks ok to me on master. Will check 2.x.

@hjelmn (Member) commented Jun 29, 2016

v2.x looks ok too. We sort the BTLs by decreasing exclusivity, then add the proc to each BTL in that order, and only add a BTL to the send list if there is no send BTL endpoint for the proc yet or the exclusivity is equal.
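A minimal sketch of the selection rule described above (illustrative only, not the actual Open MPI code; the names and exclusivity values are assumptions): after sorting by decreasing exclusivity, a BTL joins the peer's send list only if no send BTL has been chosen yet or its exclusivity equals the chosen one, which should keep tcp off the list whenever openib is present.

/* Sketch only: exclusivity-based send-BTL selection for one peer. */
#include <stdio.h>

struct btl { const char *name; unsigned exclusivity; };

int main(void)
{
    /* assumed values, already sorted by decreasing exclusivity */
    struct btl candidates[] = { { "openib", 1024 }, { "tcp", 100 } };
    const int n = sizeof(candidates) / sizeof(candidates[0]);
    unsigned chosen = 0;
    int have_send_btl = 0;

    for (int i = 0; i < n; ++i) {
        if (!have_send_btl || candidates[i].exclusivity == chosen) {
            printf("adding %s to the send list\n", candidates[i].name);
            have_send_btl = 1;
            chosen = candidates[i].exclusivity;
        } else {
            printf("skipping %s (exclusivity %u < %u)\n",
                   candidates[i].name, candidates[i].exclusivity, chosen);
        }
    }
    return 0;
}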

@bosilca (Member) commented Jun 29, 2016

If preventing the TCP BTL from being used improves the latency, then the exclusivity might not work the way we expect.

@hjelmn (Member) commented Jun 29, 2016

Agreed. It should be preventing the tcp btl from running with that proc. Will have to run through the code to see why this could be happening.

@jladd-mlnx (Member Author)

Tests with the master head from GitHub: BIG degradation with Yalla, significant degradation with OpenIB.

OMPI - Master Head f18d660
PML - Yalla

$mpirun -np 2 --map-by node --bind-to core -mca pml yalla -mca btl openib,self -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/GitHub/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       1.38        1384319.88
2                       2.86        1428626.93
4                       5.70        1425752.08
8                      11.43        1428954.19
16                     22.80        1425162.48
32                     45.35        1417328.02
64                     89.76        1402440.95
128                   175.14        1368296.44
256                   343.19        1340577.75
512                   661.68        1292345.50
1024                 1251.70        1222364.08
2048                 2375.11        1159723.45
4096                 3563.16         869911.51
8192                 4996.50         609924.74
16384                5643.96         344479.77
32768                6020.44         183729.28
65536                6242.64          95255.18
131072               6350.21          48448.25
262144               5124.99          19550.29
524288               5706.46          10884.20
1048576              6031.02           5751.63
2097152              6209.33           2960.84
4194304              6303.17           1502.79

Another observation: OpenIB message rates are quite erratic from run to run. This is just a single trial that came back particularly degraded.
OMPI - master f18d660
PML - OB1 (OpenIB)

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl openib,self -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/GitHub/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       0.87         867614.62
2                       5.15        2576510.24
4                      10.35        2588480.55
8                      20.60        2574982.19
16                     41.59        2599279.31
32                     81.21        2537720.88
64                    159.46        2491492.93
128                   311.66        2434836.70
256                   597.60        2334391.34
512                  1175.44        2295776.36
1024                 2166.56        2115778.70
2048                 3771.07        1841340.88
4096                 5084.93        1241439.16
8192                 5870.64         716631.08
16384                5835.08         356144.87
32768                6263.04         191132.95
65536                6394.87          97578.03
131072               6461.46          49297.04
262144               6495.77          24779.38
524288               6512.28          12421.18
1048576              6521.56           6219.44
2097152              6525.88           3111.78
4194304              6528.19           1556.44

@jladd-mlnx (Member Author)

@hppritcha Please see this Issue for tracking and comments.

@jladd-mlnx (Member Author)

@gpaulsen @nysal Please be aware.

@hppritcha (Member)

I need more info on how to reproduce this problem. Was osu_mbw_mr used to generate the message rates, for example? And were any particular PSM2-related environment variables set?

@matcabral (Contributor)

Hi all,
I got the numbers Ralph is sharing (separate email thread) by building from OMPI's different tags with:
CFLAGS=-O3 ./configure --with-libfabric=no --with-psm2=/usr
Yes, this is running on PSM2, two nodes over a switch, no specific env vars.

mpirun -np 2 -host host-1,host-2 ./osu_mbw_mr
(I did confirm it was running over PSM2.)

@hppritcha (Member)

I collected some osu_mbw_mr numbers on one of the LANL Omni-Path systems and put them here: https://gist.github.com/hppritcha/8da4436a216c9d96bf10ad9d403da03f

It appears there was a significant performance degradation going from the 1.10 release stream to master. It looks like the 2.0.x release stream is the worst; then something was done on master to patch things up a bit, with PRs back to 2.x after the 2.0.x branch was created.

@jsquyres might want to take a look.

@hppritcha (Member)

I got data using the GNI provider. It also shows a significant performance degradation for shorter messages.

https://gist.github.com/hppritcha/200bc7a2d4dfde709245d6ffa7b2b971

Not as bad as for the PSM2 MTL, though.

@rhc54 (Contributor) commented Nov 10, 2016

I don't see the 1.10 data in that gist, but the 2.0.x data is clearly impacted relative to master and v2.x. So it looks like there is something in the code path above the libraries.

@matcabral @hjelmn Can someone take a look and see if something is missing in the CM PML, or related code?

@hjelmn (Member) commented Nov 10, 2016

pml/cm looks up to date to me.

@rhc54 (Contributor) commented Nov 10, 2016

Given that the problem is in master as well, my comment was more to the point that perhaps some change is required in that code path: something that was done to resolve the performance issue in the pml/ob1 path but was not done in the pml/cm path.

@jsquyres (Member)

@rhc54 I don't think @hppritcha plans to get the Cray code to work with v1.10.

@rhc54 @matcabral Any progress on this issue?

@rhc54 (Contributor) commented Nov 18, 2016

Still under investigation. I am removing the blocker tag from it as our folks are engaged in some other things right now, and this shouldn't hold up v2.0.2.

@matcabral I spoke with @hjelmn about this at SC, and he suggested looking at the instruction cache for "misses". We have previously seen cases where slight changes to the code path, even when reducing instruction count, would result in performance degradation due to a sudden spike in cache misses. This could be what is happening here.
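One way to test that hypothesis is to count instruction-cache misses around the messaging hot path. The simplest route is perf stat -e L1-icache-load-misses on the benchmark (where that event is exposed); below is a hedged, Linux-only C sketch using perf_event_open, with a dummy loop standing in for the real code path.

/* Sketch only: count L1 icache read misses around a region of interest. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_L1I |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                 /* stand-in for the hot path */
    for (int i = 0; i < 10000000; ++i)
        x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != (ssize_t)sizeof(misses)) {
        close(fd);
        return 1;
    }
    printf("L1 icache read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}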

rhc54 modified the milestones: v2.1.0, v2.0.2 (Nov 18, 2016)
@bosilca (Member) commented Nov 23, 2016

@thananon is reviewing the CM's code to remove the request lock left over from the transition. The work can be followed on issue #2448.

@jsquyres (Member)

How are we doing on this issue on the v2.x branch these days? Is this still an open issue, or should we close it?

@thananon (Member)

I'm not sure if this is fixed. I'm observing a performance degradation in my test with builtin atomics.

This is the injection rate in msg/s from a multithreaded benchmark, compiled with gcc 6.3.0.

no-builtin-atomics      default      change
109711                  18962        -82%
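For readers unfamiliar with the option: "builtin atomics" means building Open MPI's atomic operations on the compiler's __atomic builtins instead of its hand-written assembly (the --enable-builtin-atomics configure switch). The hedged sketch below shows the two flavors of the same fetch-and-add, assuming gcc or clang on x86-64; it only illustrates where the generated code can differ and is not Open MPI's implementation.

/* Sketch only: compiler-builtin vs. hand-rolled atomic fetch-and-add. */
#include <stdio.h>
#include <stdint.h>

static int32_t counter = 0;

/* compiler builtin: sequentially consistent fetch-and-add */
static int32_t fetch_add_builtin(int32_t v)
{
    return __atomic_fetch_add(&counter, v, __ATOMIC_SEQ_CST);
}

/* hand-rolled x86-64 equivalent using LOCK XADD */
static int32_t fetch_add_asm(int32_t v)
{
    int32_t old = v;
    __asm__ __volatile__("lock; xaddl %0,%1"
                         : "+r"(old), "+m"(counter)
                         :
                         : "memory");
    return old;
}

int main(void)
{
    fetch_add_builtin(1);
    fetch_add_asm(1);
    printf("counter = %d\n", counter);   /* expect 2 */
    return 0;
}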

@jsquyres (Member)

Just to be clear: that was on v2.x, right?

Your table implies that we should switch the default to no-builtin-atomics...?

If we're seeing an 82% performance degradation, this feels like a blocker for v2.1.0. @hppritcha?

@thananon (Member)

@jsquyres

  • The test was performed on master.
  • Yes, I think no-builtin-atomics should be the default, but that is based only on UTK's machine. I don't know about the behavior on other systems.

@jsquyres (Member)

Could you repeat the test on v2.x, please?

@hppritcha (Member)

@thananon what BTL are you using when you see the 82 per cent degradation?

@thananon (Member)

@jsquyres I will re-run the test on 2.x as soon as I can.
@hppritcha I'm using the openib BTL. However, the retest result is inconsistent with the original one. I will redo the whole test and report back as soon as possible.

@jsquyres (Member)

Per the OMPI webex this morning, we're deferring this to v2.1.1.

jsquyres modified the milestones: v2.1.1, v2.1.0 (Feb 28, 2017)
hppritcha modified the milestones: v2.1.2, v2.1.1 (Apr 24, 2017)
@hppritcha (Member)

Has this actually been fixed in 2.1.1? Otherwise I'd like to move this to a future milestone.

@matcabral (Contributor)

The problems I found were addressed in #3748. Thanks.

@hppritcha (Member)

Per the last comment from @matcabral, closing this issue.
