
Performance regression: 2.0.x branch degradation over 1.10 for MTLs #2644

Closed
matcabral opened this issue Dec 28, 2016 · 19 comments

@matcabral
Contributor

matcabral commented Dec 28, 2016

This is a placeholder to track the performance regressions seen on the 2.0.x branch relative to 1.10 that impact MTLs (tested with OFI and PSM2). The degradation mostly affects latency at small message sizes, with some impact on bandwidth.

Building with:

./configure CFLAGS=-O3 --prefix=<install path> --with-libfabric=no
--with-psm2=/usr --disable-oshmem --with-devel-headers --disable-debug
--disable-mem-profile --disable-mem-debug

The tests below assume the same system setup, changing only OMPI 1.10 for 2.0.x.
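For reference, the inter-node latency runs could be launched roughly as follows; this is a sketch, not the reporter's exact command line. The hostnames and the benchmark path are placeholders, and the MCA options shown force the cm PML with the PSM2 MTL (real Open MPI MCA parameters, but their use here is an assumption).

```shell
# Hypothetical launch: 2 ranks on different nodes, PSM2 MTL via the cm PML.
# node01/node02 and the osu_latency path are placeholders.
mpirun -np 2 --host node01,node02 \
    --mca pml cm --mca mtl psm2 \
    ./osu_latency
```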

Two ranks on different nodes running osu_latency over PSM2 (message size in bytes, latency change vs. 1.10; negative = regression):
1 -10%
2 -10%
4 -12%
8 -12%
16 -10%
32 -11%
64 -8%
128 -10%
256 -12%
512 -10%
1024 -12%
2048 -5%
4096 -18%
8192 0%
16384 -6%
32768 -4%
65536 -5%
131072 -3%
262144 -2%
524288 8%
1048576 0%
2097152 0%
4194304 18%

Two ranks on the same node running osu_latency over PSM2 (message size in bytes, latency change vs. 1.10; negative = regression):
1 -19%
2 -16%
4 -16%
8 -16%
16 -19%
32 -4%
64 -6%
128 -6%
256 -4%
512 -6%
1024 -14%
2048 -31%
4096 -1%
8192 -8%
16384 -24%
32768 -18%
65536 -12%
131072 -6%
262144 -8%
524288 -5%

Two ranks on different nodes running osu_bw over PSM2 (message size in bytes, bandwidth change vs. 1.10; negative = regression):
1 -7%
2 -9%
4 -7%
8 -7%
16 -5%
32 -4%
64 -6%
128 -2%
256 -5%
512 -7%
1024 -7%
2048 -5%
4096 -2%
8192 -2%
16384 -1%
32768 2%
65536 0%
131072 1%
262144 0%
524288 0%
1048576 0%
2097152 0%
4194304 0%
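The percent figures in the tables above can be reproduced from raw per-size benchmark averages. A minimal sketch, assuming the delta is computed as (1.10 value − 2.0.x value) / 2.0.x value, so a negative number means 2.0.x regressed; the latency values below are made up for illustration:

```python
def latency_delta(baseline_us, candidate_us):
    """Percent change of the candidate run vs. the 1.10 baseline.

    Negative means the candidate (e.g. 2.0.x) has higher latency,
    i.e. a regression.
    """
    return round((baseline_us - candidate_us) / candidate_us * 100)

# Hypothetical per-size latency averages in microseconds (illustrative only).
baseline_110 = {8: 1.05, 1024: 1.76}
candidate_20x = {8: 1.19, 1024: 2.00}

for size in sorted(baseline_110):
    print(size, f"{latency_delta(baseline_110[size], candidate_20x[size])}%")
```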

@matcabral
Contributor Author

matcabral commented Dec 28, 2016

Some of the tests done:

  • Latency comparison against 2.x shows a similar impact to 2.0.x; the same holds for master.

  • Bandwidth comparison against 2.x shows a similar impact to 2.0.x. However, master does not show such a negative impact.

  • Confirmed no impact on BTLs (openib; vader actually runs better).

  • Confirmed that opal/timer: add code to check if rdtsc is core invariant #2611 does not solve the problem.

  • Branches and branch misses: osu_latency, 2 ranks (comparing 1.10 against 2.0.x).
    Running PSM2 over the wire, 2.0.x shows fewer branches but a slightly higher branch-miss rate (0.09% vs. 0.13%).
    However, running PSM2 over its shm, the numbers are the same (with a more significant degradation than over the wire).
    Testing the vader BTL, 2.0.x has 25% fewer branches and more misses (0.39% vs. 0.57%), but better performance (by 10% to 20%).

  • Cache misses and minor faults: osu_latency, 2 ranks on the same node (comparing 1.10 against 2.0.x).
    PSM2 MTL (over PSM's internal shm): 2.0.x shows worse numbers: minor faults up >20%, cache misses up >30%.
    Vader BTL: 2.0.x shows significantly worse numbers: minor faults up >29%, cache misses up >70%, but better benchmark performance.
    Therefore vader still runs better despite the higher number of misses and faults.
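The branch and cache statistics above were presumably gathered with something like perf stat. A sketch, assuming that measurement method: the event names are standard perf hardware/software events, but the exact launch line (wrapping each rank in perf) is an assumption, not the reporter's command.

```shell
# Hypothetical measurement: wrap each rank of a 2-rank osu_latency run
# in perf stat to collect branch, cache-miss, and minor-fault counts.
mpirun -np 2 --mca pml cm --mca mtl psm2 \
    perf stat -e branches,branch-misses,cache-misses,minor-faults \
    ./osu_latency
```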

@ggouaillardet
Contributor

@matcabral I noted you configure with
--with-psm2=/usr
We recommend against that (i.e., neither /usr nor /usr/local).
What if you configure with just
--with-psm2
If that does not work, then this is a bug and I will try to fix it.

@jsquyres jsquyres added this to the v2.0.2 milestone Jan 3, 2017
@jsquyres
Member

jsquyres commented Jan 3, 2017

@hjelmn @bosilca Thoughts?

@hppritcha hppritcha modified the milestones: v2.0.3, v2.0.2 Jan 3, 2017
@bosilca
Member

bosilca commented Jan 3, 2017

I would assume this is dependent on #2448

@jsquyres
Member

jsquyres commented Jan 4, 2017

@rhc54 @matcabral (cross reference) #2448 has been merged; it would be good to see how it affected the performance of the PSM / OFI MTLs.

@hppritcha
Member

I'm not seeing any noticeable difference in performance with the commits from #2448 applied to 2.0.x when using PSM2.

@bosilca
Member

bosilca commented Jan 4, 2017 via email

@matcabral
Contributor Author

(Catching up after vacation.)
@ggouaillardet --with-psm2 works fine.
@bosilca as @hppritcha mentions, I also tested applying #2448 on 2.0.x; it makes no change to the latency degradation. Just so we're on the same page, I'm not building multithreaded.

@hjelmn
Member

hjelmn commented Jan 26, 2017

May be fixed by #2840.

@matcabral
Contributor Author

Yes indeed! The first round of testing shows performance back at the same level as 1.10.x. Will run more tests tomorrow. 😄

@matcabral
Contributor Author

So, confirmed that with #2840 there is still a degradation of around 3-4% in latency for smaller messages. Also applying the patch from #2448, the degradation goes down to 1%. Given that I have not run enough repetitions to get a reasonable average, this 1% could just be noise. In summary, #2840 and #2448 together fix the problem.

@bosilca bosilca closed this as completed Jan 28, 2017
@matcabral
Contributor Author

Question (sorry if I'm missing something): I see this issue closed for milestone v2.0.3, but PR #2448 was not merged into the 2.0.x branch. Comments?
Thanks.

@matcabral
Contributor Author

I'm reopening this until #2448 gets into 2.0.x.

@matcabral matcabral reopened this Feb 9, 2017
@jsquyres
Member

jsquyres commented Feb 9, 2017

@matcabral Feel free to make PRs to merge this into v2.0.x and v2.x. (Frankly, I thought this was already there...?)

@matcabral
Contributor Author

Hey @jsquyres, I confirmed the changes in the cm PML from PR #2448 are not in v2.0.x or v2.x. Wouldn't it be cleaner to directly merge that PR than to create a new one?

@jsquyres
Member

jsquyres commented Feb 9, 2017

@matcabral Not sure what you mean:

What I'm hearing you say is that there has been no v2.0.x or v2.x PR for the code that was in #2448 (i.e., the PR to master). Is that correct?

@matcabral
Contributor Author

Yes, to all 😄.
I noted in a few comments above that I got performance equal to 1.10 by applying #2840 and #2448 on v2.0.x, but #2448 never made it here.

@matcabral
Contributor Author

OK, I now think I get your suggestion. I'll open PRs for #2448 against v2.x and v2.0.x.

@jsquyres
Member

jsquyres commented Feb 9, 2017

@matcabral Cool, thanks.
