Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression in 1.10.3 #2591

Closed
jsquyres opened this issue Dec 16, 2016 · 8 comments
Closed

Performance regression in 1.10.3 #2591

jsquyres opened this issue Dec 16, 2016 · 8 comments

Comments

@jsquyres
Copy link
Member

@hjelmn is working on identifying a performance regression that was introduced in 1.10.3. He doesn't have any further info yet beyond the fact that one of their apps runs a bunch slower with 1.10.3 vs 1.10.2.

He thinks this will also affect v2.0.x and v2.x.

This is potentially a blocker for v1.10.5 and v2.0.2.

@hjelmn Says he'll have more information shortly.

@hjelmn
Copy link
Member

hjelmn commented Dec 16, 2016

Bisect shows for 1.10 the regression starts with 9f2a6da. This commit forces the use of clock_gettime () over the rtdtsc instruction. IMHO this is a really bad idea. This forces us to call clock_gettime () on every single call to opal_progress ().

@bosilca We do not need a monotonic timer for opal_progress (), correct? If not we should force opal_progress to always use the asm timer and not clock_gettime ().

@hjelmn
Copy link
Member

hjelmn commented Dec 16, 2016

@rhc54, @jladd-mlnx. I would think this would affect message rates as well. The pre-1.10.3 behavior can be forced using --mca timer_require_monotonic 0. Might be worth rerunning some of the performance regressions with this option set.

@bosilca
Copy link
Member

bosilca commented Dec 16, 2016

@hjelmn I think we can survive in opal_progress without a monotonic timer, in which case we might run the libevent progress at an erratic rate. However, this will force us to expose 2 timers, one that is monotonic and one that might not be, and we do not have right now the infrastructure.

@hjelmn
Copy link
Member

hjelmn commented Dec 16, 2016

@bosilca. Yeah. Not sure how to handle that.

For now I can do something for newer Intel processors. There is a cpuid bit to indicate whether the cpu has a core invariant tsc. I am adding a function to check that bit. If it is set it will use the rtdtsc instruction as a monotonic timer.

@bosilca
Copy link
Member

bosilca commented Dec 16, 2016

👍

@ggouaillardet
Copy link
Contributor

Iirc, we moved to clock_gettime() because MPI_Wtime() could go back in time if a (thread of a) task migrates within a node.
@hjelmn do you plan to test the cpuid bit at configure time or at run time ?
Iirc once again, we take several shortcuts in the timer, and it does not behave as a standard component.
Having two timers (one internal, one external and always monotonic) could be the most efficient and "portable" way

@hjelmn
Copy link
Member

hjelmn commented Dec 17, 2016

@ggouaillardet I agree that we should have two timers. Working on that change over the weekend. The tsc core-invariance test is a run-time test (at timer component open) and gets us back the performance on Haswell and Broadwell (and probably others). See #2596.

@rhc54
Copy link
Contributor

rhc54 commented Dec 19, 2016

not sure why this didn't close

@rhc54 rhc54 closed this as completed Dec 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants