
Significant degradation in message rates observed on Master. #1831

Closed
jladd-mlnx opened this issue Jun 29, 2016 · 116 comments

@jladd-mlnx (Member) commented Jun 29, 2016

Opening this issue for tracking purposes. Measured with a master nightly build against 1.10.3. Possible fix on master.

@hjelmn or @bosilca please comment.

PML - Yalla
OMPI – 1.10.3
$mpirun -np 2 --map-by node --bind-to core -mca pml yalla -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-v1.10/tests/osu-micro-benchmarks-5.2/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.2
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       4.01        4005006.11
2                       8.24        4121056.15
4                      16.39        4097311.09
8                      32.45        4055766.73
16                     64.16        4010025.24
32                    127.13        3972687.66
64                    237.04        3703703.70
128                   455.11        3555555.62
256                   860.96        3363110.99
512                  1592.23        3109815.42
1024                 2811.50        2745602.68
2048                 4972.38        2427921.16
4096                 5430.79        1325875.29
8192                 5933.54         724309.64
16384                6155.42         375697.10
32768                6328.16         193120.10
65536                6398.15          97627.95
131072               6433.23          49081.64
262144               5161.27          19688.67
524288               5731.10          10931.20
1048576              6046.06           5765.97
2097152              6215.12           2963.60
4194304              6306.30           1503.54


---------------------
PML – Yalla
OMPI – Master
$mpirun -np 2 --map-by node --bind-to core -mca pml yalla -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-master/tests/osu-micro-benchmarks-5.2/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.2
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       1.89        1887305.40
2                       3.80        1898890.08
4                       7.56        1889678.24
8                      15.31        1914346.13
16                     30.41        1900517.95
32                     60.30        1884510.12
64                    119.99        1874796.93
128                   227.47        1777098.03
256                   454.66        1776025.43
512                   870.71        1700598.93
1024                 1599.39        1561900.54
2048                 3228.97        1576645.16
4096                 4453.33        1087237.56
8192                 5822.02         710695.44
16384                6213.84         379262.51
32768                6336.49         193374.24
65536                6403.37          97707.72
131072               6438.18          49119.44
262144               5126.38          19555.59
524288               5708.48          10888.06
1048576              6033.43           5753.92
2097152              6208.48           2960.43
4194304              6303.32           1502.83


-----------------------------
PML - OB1
OMPI – 1.10.3
$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-v1.10/tests/osu-micro-benchmarks-5.2/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.2
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       3.20        3204807.28
2                       6.91        3453858.56
4                      13.82        3453858.78
8                      27.41        3426124.16
16                     54.12        3382663.90
32                    105.73        3304078.53
64                    208.55        3258655.72
128                   402.16        3141875.26
256                   780.19        3047618.99
512                  1324.49        2586903.70
1024                 2392.70        2336619.29
2048                 4147.85        2025316.48
4096                 5411.73        1321222.14
8192                 5900.16         720234.07
16384                6083.99         371337.40
32768                6329.11         193149.24
65536                6427.56          98076.78
131072               6478.69          49428.48
262144               6503.55          24809.09
524288               6517.20          12430.56
1048576              6523.66           6221.44
2097152              6526.58           3112.11
4194304              6528.46           1556.51
----------------------------------
PML – OB1
OMPI - Master
$mpirun -np 2 --map-by node -mca pml ob1 -mca btl_openib_if_include mlx5_0:1 /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-master/tests/osu-micro-benchmarks-5.2/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.2
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       1.64        1636174.22
2                       4.79        2392507.23
4                       9.69        2423259.37
8                      19.08        2384926.46
16                     38.57        2410744.90
32                     75.80        2368681.59
64                    149.17        2330745.92
128                   281.28        2197461.83
256                   539.24        2106415.38
512                  1065.10        2080264.37
1024                 1807.65        1765284.91
2048                 3429.21        1674421.30
4096                 5233.04        1277597.35
8192                 5634.88         687851.71
16384                5303.44         323696.21
32768                6091.79         185906.65
65536                6392.29          97538.57
131072               6459.20          49279.78
262144               6494.59          24774.90
524288               6512.32          12421.26
1048576              6521.32           6219.21
2097152              6525.70           3111.70
4194304              6526.97           1556.15
@hjelmn (Member) commented Jun 29, 2016

@jladd-mlnx You need to confirm this affects 2.0.0 before setting a blocker there.

@hjelmn (Member) commented Jun 29, 2016

One of the biggest differences between 2.0.0 and master is that master always has MPI_THREAD_MULTIPLE support enabled. This will impact performance, and we intend to fix as much of the performance regression as possible before the next release branch is cut from master.
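To make the cost concrete, here is a minimal hypothetical C sketch (not Open MPI's actual code; the names thread_multiple and queue_append are invented): when MPI_THREAD_MULTIPLE support is compiled in unconditionally, every message on the fast path pays locking overhead even in a single-threaded run, while gating the lock behind a runtime flag avoids it.

/* Sketch only: always-on thread safety vs. a runtime guard. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static bool thread_multiple = false;   /* would be set from MPI_Init_thread() */
static long queue_depth = 0;

static void queue_append(long frags)
{
    if (thread_multiple) {             /* pay for the lock only when threads are enabled */
        pthread_mutex_lock(&queue_lock);
        queue_depth += frags;
        pthread_mutex_unlock(&queue_lock);
    } else {
        queue_depth += frags;          /* single-threaded: no locking overhead */
    }
}

int main(void)
{
    for (int i = 0; i < 1000000; ++i)  /* stand-in for a message-rate hot loop */
        queue_append(1);
    printf("queued %ld fragments\n", queue_depth);
    return 0;
}

Compiled with cc -O2 -pthread, flipping thread_multiple to true and timing the loop gives a rough feel for the per-message overhead being discussed.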

@bosilca (Member) commented Jun 29, 2016

We also need to confirm that the performance drop was not already visible before the request rework made it in.

@jladd-mlnx (Member Author) commented Jun 29, 2016

Yalla looks better on 2.x, but still degraded by ~5%. OpenIB still has performance issues, and a bug that prevents the test from completing.

PML - Yalla
OMPI - v2.x

$mpirun -np 2 --map-by node --bind-to core -mca pml yalla -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       3.78        3781191.93
2                       7.58        3788042.56
4                      15.29        3822050.11
8                      30.19        3773720.66
16                     60.38        3773580.52
32                    120.21        3756457.15
64                    235.28        3676253.05
128                   402.59        3145267.57
256                   822.60        3213293.41
512                  1535.95        2999911.89
1024                 2714.29        2650670.98
2048                 4405.85        2151295.22
4096                 5152.63        1257966.21
8192                 5725.70         698938.48
16384                6106.27         372697.15
32768                6250.77         190758.24
65536                6332.56          96627.17
131072               6403.43          48854.28
262144               6402.59          24423.95
524288               6442.80          12288.66
1048576              6460.99           6161.68
2097152              6470.35           3085.30
4194304              6475.19           1543.81

PML - OB1
OMPI - v2.x

joshual@hpchead /hpc/mtr_scrap/users/joshual/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt 
$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       2.04        2042588.59
2                       4.23        2115237.51
4                       8.40        2100731.38
8                      16.82        2101930.46
16                     33.66        2103450.30
32                     66.01        2062909.05
64                    129.70        2026597.13
128                   247.53        1933860.78
256                   473.00        1847674.81
512                   867.73        1694776.11
1024                 1523.01        1487313.31
2048                 2571.69        1255709.54
4096                 3762.62         918608.56
8192                 5847.15         713762.90
[vegas36][[3628,1],1][btl_tcp_endpoint.c:800:mca_btl_tcp_endpoint_complete_connect] connect() to 21.151.70.35 failed: No route to host (113)

@hjelmn (Member) commented Jun 29, 2016

Looks like the issue is that the tcp BTL is being used. Add -mca btl self,vader,openib

@jladd-mlnx (Member Author)

@hjelmn I see that. That's odd; it violates the law of least surprise. Still some degradation.

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl openib,self -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       3.11        3106859.54
2                       6.37        3186949.58
4                      12.81        3203248.06
8                      25.58        3197116.28
16                     50.15        3134119.36
32                     98.67        3083294.13
64                    195.66        3057110.52
128                   377.00        2945293.94
256                   722.70        2823063.32
512                  1253.78        2448783.17
1024                 2273.13        2219848.99
2048                 3797.79        1854390.37
4096                 5168.69        1261887.62
8192                 5896.71         719813.82
16384                6024.78         367723.32
32768                6278.04         191590.67
65536                6406.25          97751.62
131072               6464.88          49323.11
262144               6498.35          24789.24
524288               6514.36          12425.15
1048576              6522.33           6220.18
2097152              6526.24           3111.95
4194304              6528.29           1556.47

@hjelmn (Member) commented Jun 29, 2016

Yeah, very odd that the tcp BTL is active. Is this system similar to the one running Jenkins? I understand that one has a two-port card with one IB port and one Ethernet port. By default both ports will be used for large messages; I'm not sure why it is affecting small ones. I put together a patch to adjust the latency and bandwidth numbers for all the BTLs, which might help.
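For context, a rough sketch of why a slow extra port hurts large-message bandwidth (illustrative only, not Open MPI's actual ob1/bml code, and the bandwidth figures are made up): when more than one BTL is eligible for a peer, large messages can be striped across them roughly in proportion to each BTL's advertised bandwidth, so a TCP port left in the set drags the aggregate down.

/* Sketch only: bandwidth-proportional striping across eligible BTLs. */
#include <stdio.h>

struct btl { const char *name; double bandwidth_mbps; };

int main(void)
{
    /* hypothetical per-BTL bandwidths, not measured values */
    struct btl btls[] = { { "openib", 56000.0 }, { "tcp", 10000.0 } };
    const int n = sizeof(btls) / sizeof(btls[0]);

    double total = 0.0;
    for (int i = 0; i < n; ++i)
        total += btls[i].bandwidth_mbps;

    const size_t msg = 4u * 1024 * 1024;   /* stripe a 4 MiB message */
    for (int i = 0; i < n; ++i) {
        size_t share = (size_t)(msg * (btls[i].bandwidth_mbps / total));
        printf("%-6s carries %zu bytes\n", btls[i].name, share);
    }
    return 0;
}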

@hjelmn (Member) commented Jun 29, 2016

BTW, ~5% is not a blocker. I have seen larger variation due to changes in icache miss rates :-/

@bosilca (Member) commented Jun 29, 2016

Did we lose the exclusivity?

@jladd-mlnx (Member Author)

The law of least surprise is respected on master: the same command line that triggered the TCP BTL loads OpenIB on master. The nightly build is an internal MLNX nightly build. I'll try with master from Git when I get a chance.

@hjelmn (Member) commented Jun 29, 2016

@bosilca The exclusivity check looks ok to me on master. Will check 2.x.

@hjelmn (Member) commented Jun 29, 2016

v2.x looks ok too. We sort the BTLs by decreasing exclusivity, then add the proc to each BTL in that order, and only add a BTL to the send list if there is no send BTL endpoint for the proc yet or the exclusivity is equal.
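A minimal sketch of the selection rule described above (illustrative only, not the actual Open MPI code; the names and exclusivity values are assumptions): after sorting by decreasing exclusivity, a BTL joins the peer's send list only if no send BTL has been chosen yet or its exclusivity equals the chosen one, which should keep tcp off the list whenever openib is present.

/* Sketch only: exclusivity-based send-BTL selection for one peer. */
#include <stdio.h>

struct btl { const char *name; unsigned exclusivity; };

int main(void)
{
    /* assumed values, already sorted by decreasing exclusivity */
    struct btl candidates[] = { { "openib", 1024 }, { "tcp", 100 } };
    const int n = sizeof(candidates) / sizeof(candidates[0]);
    unsigned chosen = 0;
    int have_send_btl = 0;

    for (int i = 0; i < n; ++i) {
        if (!have_send_btl || candidates[i].exclusivity == chosen) {
            printf("adding %s to the send list\n", candidates[i].name);
            have_send_btl = 1;
            chosen = candidates[i].exclusivity;
        } else {
            printf("skipping %s (exclusivity %u < %u)\n",
                   candidates[i].name, candidates[i].exclusivity, chosen);
        }
    }
    return 0;
}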

@bosilca (Member) commented Jun 29, 2016

If preventing the TCP BTL from being used improves the latency, then the exclusivity might not work the way we expect.

@hjelmn (Member) commented Jun 29, 2016

Agreed. It should be preventing the tcp btl from running with that proc. Will have to run through the code to see why this could be happening.

@jladd-mlnx (Member Author)

Tests with the master head from GitHub: BIG degradation with Yalla, significant degradation with OpenIB.

OMPI - Master Head f18d660
PML - Yalla

$mpirun -np 2 --map-by node --bind-to core -mca pml yalla -mca btl openib,self -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/GitHub/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       1.38        1384319.88
2                       2.86        1428626.93
4                       5.70        1425752.08
8                      11.43        1428954.19
16                     22.80        1425162.48
32                     45.35        1417328.02
64                     89.76        1402440.95
128                   175.14        1368296.44
256                   343.19        1340577.75
512                   661.68        1292345.50
1024                 1251.70        1222364.08
2048                 2375.11        1159723.45
4096                 3563.16         869911.51
8192                 4996.50         609924.74
16384                5643.96         344479.77
32768                6020.44         183729.28
65536                6242.64          95255.18
131072               6350.21          48448.25
262144               5124.99          19550.29
524288               5706.46          10884.20
1048576              6031.02           5751.63
2097152              6209.33           2960.84
4194304              6303.17           1502.79

Another observation: OpenIB message rates are quite erratic from run to run. This is just a single trial that came back particularly degraded.
OMPI - master f18d660
PML - OB1 (OpenIB)

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl openib,self -x MXM_RDMA_PORTS=mlx5_0:1  -mca btl_openib_if_include mlx5_0:1 /hpc/mtr_scrap/users/joshual/GitHub/ompi-release/osu-micro-benchmarks-5.3/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test v5.3
# [ pairs: 1 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
1                       0.87         867614.62
2                       5.15        2576510.24
4                      10.35        2588480.55
8                      20.60        2574982.19
16                     41.59        2599279.31
32                     81.21        2537720.88
64                    159.46        2491492.93
128                   311.66        2434836.70
256                   597.60        2334391.34
512                  1175.44        2295776.36
1024                 2166.56        2115778.70
2048                 3771.07        1841340.88
4096                 5084.93        1241439.16
8192                 5870.64         716631.08
16384                5835.08         356144.87
32768                6263.04         191132.95
65536                6394.87          97578.03
131072               6461.46          49297.04
262144               6495.77          24779.38
524288               6512.28          12421.18
1048576              6521.56           6219.44
2097152              6525.88           3111.78
4194304              6528.19           1556.44

@jladd-mlnx (Member Author)

@hppritcha Please see this Issue for tracking and comments.

@jladd-mlnx (Member Author)

@gpaulsen @nysal Please be aware.

@hppritcha (Member)

I need more info on how to reproduce this problem. Was osu_mbw_mr used to generate the message rates, for example? And were any particular PSM2-related environment variables set?

@matcabral (Contributor)

Hi all,
I got the numbers Ralph is sharing (separate email thread) by building from OMPI's different tags with:
CFLAGS=-O3 ./configure --with-libfabric=no --with-psm2=/usr
Yes, this is running on PSM2, two nodes over a switch, no specific env vars.

mpirun -np 2 -host host-1,host-2 ./osu_mbw_mr
(I did confirm it was running over PSM2.)

@hppritcha (Member)

I collected some osu_mbw_mr numbers on one of the LANL Omni-Path systems and put them here: https://gist.github.com/hppritcha/8da4436a216c9d96bf10ad9d403da03f

It appears there was a significant performance degradation going from the 1.10 release stream to master. It looks like the 2.0.x release stream is the worst; then something was done on master to patch things up a bit, with PRs back to 2.x after the 2.0.x branch was created.

@jsquyres might want to take a look.

@hppritcha (Member)

I got data using the GNI provider. It also shows a significant performance degradation for shorter messages.

https://gist.github.com/hppritcha/200bc7a2d4dfde709245d6ffa7b2b971

Not as bad as for the PSM2 MTL, though.

@rhc54 (Contributor) commented Nov 10, 2016

I don't see the 1.10 data in that gist, but the 2.0.x data is clearly impacted relative to master and v2.x. So it looks like there is something in the code path above the libraries.

@matcabral @hjelmn Can someone take a look and see if something is missing in the CM PML, or related code?

@hjelmn (Member) commented Nov 10, 2016

pml/cm looks up to date to me.

@rhc54 (Contributor) commented Nov 10, 2016

Given that the problem is in master as well, my comment was more to the point that perhaps some change is required in that code path: something that was done to resolve the performance issue in the pml/ob1 path but was not done in the pml/cm path.

@jsquyres (Member)

@rhc54 I don't think @hppritcha plans to get the Cray code to work with v1.10.

@rhc54 @matcabral Any progress on this issue?

@rhc54 (Contributor) commented Nov 18, 2016

Still under investigation. I am removing the blocker tag from it as our folks are engaged in some other things right now, and this shouldn't hold up v2.0.2.

@matcabral I spoke with @hjelmn about this at SC, and he suggested looking at the instruction cache for "misses". We have previously seen cases where slight changes to the code path, even when reducing instruction count, would result in performance degradation due to a sudden spike in cache misses. This could be what is happening here.
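One way to test that hypothesis is to count instruction-cache misses around the messaging hot path. The simplest route is perf stat -e L1-icache-load-misses on the benchmark (where that event is exposed); below is a hedged, Linux-only C sketch using perf_event_open, with a dummy loop standing in for the real code path.

/* Sketch only: count L1 icache read misses around a region of interest. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_L1I |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                 /* stand-in for the hot path */
    for (int i = 0; i < 10000000; ++i)
        x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != (ssize_t)sizeof(misses)) {
        close(fd);
        return 1;
    }
    printf("L1 icache read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}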

rhc54 modified the milestones: v2.1.0, v2.0.2 (Nov 18, 2016)
@bosilca (Member) commented Nov 23, 2016

@thananon is reviewing the CM's code to remove the request lock left over from the transition. The work can be followed on issue #2448.

@jsquyres (Member)

How are we doing on this issue on the v2.x branch these days? Is this still an open issue, or should we close it?

@thananon (Member)

I'm not sure if this is fixed. I'm observing a performance degradation in my test with builtin atomics.

This is the injection rate in msg/s from a multithreaded benchmark, compiled with gcc 6.3.0.

no-builtin-atomics      default      change
109711                  18962        -82%
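For readers unfamiliar with the option: "builtin atomics" means building Open MPI's atomic operations on the compiler's __atomic builtins instead of its hand-written assembly (the --enable-builtin-atomics configure switch). The hedged sketch below shows the two flavors of the same fetch-and-add, assuming gcc or clang on x86-64; it only illustrates where the generated code can differ and is not Open MPI's implementation.

/* Sketch only: compiler-builtin vs. hand-rolled atomic fetch-and-add. */
#include <stdio.h>
#include <stdint.h>

static int32_t counter = 0;

/* compiler builtin: sequentially consistent fetch-and-add */
static int32_t fetch_add_builtin(int32_t v)
{
    return __atomic_fetch_add(&counter, v, __ATOMIC_SEQ_CST);
}

/* hand-rolled x86-64 equivalent using LOCK XADD */
static int32_t fetch_add_asm(int32_t v)
{
    int32_t old = v;
    __asm__ __volatile__("lock; xaddl %0,%1"
                         : "+r"(old), "+m"(counter)
                         :
                         : "memory");
    return old;
}

int main(void)
{
    fetch_add_builtin(1);
    fetch_add_asm(1);
    printf("counter = %d\n", counter);   /* expect 2 */
    return 0;
}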

@jsquyres (Member)

Just to be clear: that was on v2.x, right?

Your table implies that we should switch the default to no-builtin-atomics...?

If we're seeing an 82% performance degradation, this feels like a blocker for v2.1.0. @hppritcha?

@thananon (Member)

@jsquyres

  • The test was performed on master.
  • Yes, I think no-builtin-atomics should be the default, but that is based only on UTK's machine. I don't know about the behavior on other systems.

@jsquyres (Member)

Could you repeat the test on v2.x, please?

@hppritcha (Member)

@thananon what BTL are you using when you see the 82 per cent degradation?

@thananon (Member)

@jsquyres I will re-run the test on 2.x as soon as I can.
@hppritcha I'm using the openib BTL. However, the retest result is inconsistent with the original one. I will redo the whole test and report back as soon as possible.

@jsquyres (Member)

Per the OMPI webex this morning, we're deferring this to v2.1.1.

jsquyres modified the milestones: v2.1.1, v2.1.0 (Feb 28, 2017)
hppritcha modified the milestones: v2.1.2, v2.1.1 (Apr 24, 2017)
@hppritcha (Member)

Has this actually been fixed in 2.1.1? Otherwise I'd like to move this to a future milestone.

@matcabral (Contributor)

The problems I found were addressed in #3748. Thanks.

@hppritcha (Member)

Per the last comment from @matcabral, closing this issue.
