Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustc: Don't inline in CGUs at -O0 #45075

Merged
merged 1 commit into from
Oct 9, 2017
Merged

Conversation

alexcrichton
Copy link
Member

This commit tweaks the behavior of inlining functions into multiple codegen
units when rustc is compiling in debug mode. Today rustc will unconditionally
treat #[inline] functions by translating them into all codegen units that
they're needed within, marking the linkage as internal. This commit changes
the behavior so that in debug mode (compiling at -O0) rustc will instead only
translate #[inline] functions into one codegen unit, forcing all other
codegen units to reference this one copy.

The goal here is to improve debug compile times by reducing the amount of
translation that happens on behalf of multiple codegen units. It was discovered
in #44941 that increasing the number of codegen units had the adverse side
effect of increasing the overal work done by the compiler, and the suspicion
here was that the compiler was inlining, translating, and codegen'ing more
functions with more codegen units (for example String would be basically
inlined into all codegen units if used). The strategy in this commit should
reduce the cost of #[inline] functions to being equivalent to one codegen
unit, which is only translating and codegen'ing inline functions once.

Collected data shows that this does indeed improve the situation from before
as the overall cpu-clock time increases at a much slower rate and when pinned to
one core rustc does not consume significantly more wall clock time than with one
codegen unit.

One caveat of this commit is that the symbol names for inlined functions that
are only translated once needed some slight tweaking. These inline functions
could be translated into multiple crates and we need to make sure the symbols
don't collideA so the crate name/disambiguator is mixed in to the symbol name
hash in these situations.

@rust-highfive
Copy link
Collaborator

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton
Copy link
Member Author

r? @michaelwoerister

@@ -280,75 +280,74 @@ fn place_root_translation_items<'a, 'tcx, I>(tcx: TyCtxt<'a, 'tcx, 'tcx>,
let mut internalization_candidates = FxHashSet();

for trans_item in trans_items {
let is_root = trans_item.instantiation_mode(tcx) == InstantiationMode::GloballyShared;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file's diff is probably best viewed with no whitespace (very few changes here, just indentation)

@alexcrichton
Copy link
Member Author

Oh I should also mention that I only changed the compiler's default behavior in O0 mode. My thinking was that if we don't have any way to inline into all codegen units then we basically kill performance for optimized codegen unit builds, so I didn't want to tamper with anyone relying on that. Once we have ThinLTO, however, that I believe is the avenue by which we'd achieve inlining, so I think we could turn this behavior on by default.

@michaelwoerister
Copy link
Member

@bors r+

Thanks, @alexcrichton ! I'm excited about this :) Apart from reducing plain multi-cgu overhead it could have quite the positive effect on incremental compilation with debuginfo enabled.

@alexcrichton it would be great if you could do some runtime performance benchmarks to see the effect of this.

One thing that I'm not sure about is whether it is a good idea to also apply this to drop-glue and shims. With the changes in this PR they will all end up in one big codegen unit (the "fallback codegen-unit) and that might not be what we want. Also characteristic_def_id_of_trans_item() should probably be made smarter before extending this to non--O0 scenarios. The implementation still assumes that it never has to deal with anything other than regular functions.

@bors
Copy link
Contributor

bors commented Oct 7, 2017

📌 Commit cb7f7f0 has been approved by michaelwoerister

@alexcrichton
Copy link
Member Author

@michaelwoerister when you say performance benchmarks, you mean of activating this in release mode? I sort of assumed that the benchmarks could only be worse than this which is where we don't inline generics across codegen units but we inline #[inline] functions across all cgus. I could see just how bad it fares though :)

@mersinvald
Copy link
Contributor

mersinvald commented Oct 7, 2017

@alexcrichton I've tested this PR on the same project as before:

1:  66.25 65.56 64.43 (65.41)
2:  63.62 63.66 65.21 (64.16)
4:  60.76 60.81 60.74 (60.77)
8:  60.88 60.96 61.31 (61.05)
16: 61.31 62.62 62.27 (62.06)
32: 63.65 64.10 64.37 (64.04)

Interesting that now 2 CGUs performing only as goog as 32 (with inlining it was the fastest option), and 4 CGUs seems the fastest and most stable option.

Also now builds are faster with any CGU number then with inlining. (approx. -10 secs at least).

@alexcrichton
Copy link
Member Author

@mersinvald fascinating! I'm particularly curious about the huge dip using one codegen unit. Before you mentioned that one CGU was 80.41s to compile and now you're seeing 65.41s. That's quite a big improvement, and definitely shouldn't be affected by this PR! (this PR should have the same performance in one-CGU mode before and after).

Do you know what might be causing that discrepancy? I'm still surprised at the drastic difference between 4 and 32 cgus!

@mersinvald
Copy link
Contributor

@alexcrichton I'll recheck now with one-commit-before-this-pr rustc, our code base changed a bit, I am sorry for confusion in advance :)

@mersinvald
Copy link
Contributor

@alexcrichton rechecked.

Before PR:
4:  62.87 63.40 63.77 (63.34)
32: 69.74 69.47 69.55 (69.53)

After PR:
1:  66.25 65.56 64.43 (65.41)
2:  63.62 63.66 65.21 (64.16)
4:  60.76 60.81 60.74 (60.77)
8:  60.88 60.96 61.31 (61.05)
16: 61.31 62.62 62.27 (62.06)
32: 63.65 64.10 64.37 (64.04)

So, disabling inlines clearly has a good effect, dip between 4 and 32 GCUs is lower.

-2 secs with 4 CGUs anyway)

@bors
Copy link
Contributor

bors commented Oct 8, 2017

☔ The latest upstream changes (presumably #44841) made this pull request unmergeable. Please resolve the merge conflicts.

This commit tweaks the behavior of inlining functions into multiple codegen
units when rustc is compiling in debug mode. Today rustc will unconditionally
treat `#[inline]` functions by translating them into all codegen units that
they're needed within, marking the linkage as `internal`. This commit changes
the behavior so that in debug mode (compiling at `-O0`) rustc will instead only
translate `#[inline]` functions into *one* codegen unit, forcing all other
codegen units to reference this one copy.

The goal here is to improve debug compile times by reducing the amount of
translation that happens on behalf of multiple codegen units. It was discovered
in rust-lang#44941 that increasing the number of codegen units had the adverse side
effect of increasing the overal work done by the compiler, and the suspicion
here was that the compiler was inlining, translating, and codegen'ing more
functions with more codegen units (for example `String` would be basically
inlined into all codegen units if used). The strategy in this commit should
reduce the cost of `#[inline]` functions to being equivalent to one codegen
unit, which is only translating and codegen'ing inline functions once.

Collected [data] shows that this does indeed improve the situation from [before]
as the overall cpu-clock time increases at a much slower rate and when pinned to
one core rustc does not consume significantly more wall clock time than with one
codegen unit.

One caveat of this commit is that the symbol names for inlined functions that
are only translated once needed some slight tweaking. These inline functions
could be translated into multiple crates and we need to make sure the symbols
don't collideA so the crate name/disambiguator is mixed in to the symbol name
hash in these situations.

[data]: rust-lang#44941 (comment)
[before]: rust-lang#44941 (comment)
@alexcrichton
Copy link
Member Author

@mersinvald hm ok very interesting! Thanks so much again for taking the time to collect this data :)

Something still feels not quite right though about the 4 -> 32 codegen unit transition. Do you know if there's one crate in particular that slows down by 4 seconds? Or does everything in general just slow down a little bit?

@bors: r=michaelwoerister

@bors
Copy link
Contributor

bors commented Oct 8, 2017

📌 Commit 4b2bdf7 has been approved by michaelwoerister

@mersinvald
Copy link
Contributor

mersinvald commented Oct 8, 2017

@alexcrichton if there were some cargo option to display timings for every built crate, I could measure it. Is there something like that?

@michaelwoerister
Copy link
Member

@mersinvald Do you maybe have a mechanical hard drive instead of an SSD in the computer you are testing this on? Having more CGUs will lead to more I/O and we are probably testing mostly on machine with at least some kind of SSD.

@michaelwoerister
Copy link
Member

@bors p=1 (this is kind of important)

@alexcrichton
Copy link
Member Author

@mersinvald ah unfotunately there's not anything easy to see how long crates are compiling for, but you may know of some that take longer than others perhaps?

@bors
Copy link
Contributor

bors commented Oct 9, 2017

⌛ Testing commit 4b2bdf7 with merge 199e837fe292873170dc1cf679ca6345a9df41cf...

@bors
Copy link
Contributor

bors commented Oct 9, 2017

💔 Test failed - status-travis

@michaelwoerister
Copy link
Member

@michaelwoerister when you say performance benchmarks, you mean of activating this in release mode?

No, I was actually interested in the runtime performance difference in debug mode. In particular: is there any difference at all? Do we do any inlining with -O0? What about #[inline(always)]?

@michaelwoerister
Copy link
Member

@bors retry

@bors
Copy link
Contributor

bors commented Oct 9, 2017

⌛ Testing commit 4b2bdf7 with merge 72d6501...

bors added a commit that referenced this pull request Oct 9, 2017
rustc: Don't inline in CGUs at -O0

This commit tweaks the behavior of inlining functions into multiple codegen
units when rustc is compiling in debug mode. Today rustc will unconditionally
treat `#[inline]` functions by translating them into all codegen units that
they're needed within, marking the linkage as `internal`. This commit changes
the behavior so that in debug mode (compiling at `-O0`) rustc will instead only
translate `#[inline]` functions into *one* codegen unit, forcing all other
codegen units to reference this one copy.

The goal here is to improve debug compile times by reducing the amount of
translation that happens on behalf of multiple codegen units. It was discovered
in #44941 that increasing the number of codegen units had the adverse side
effect of increasing the overal work done by the compiler, and the suspicion
here was that the compiler was inlining, translating, and codegen'ing more
functions with more codegen units (for example `String` would be basically
inlined into all codegen units if used). The strategy in this commit should
reduce the cost of `#[inline]` functions to being equivalent to one codegen
unit, which is only translating and codegen'ing inline functions once.

Collected [data] shows that this does indeed improve the situation from [before]
as the overall cpu-clock time increases at a much slower rate and when pinned to
one core rustc does not consume significantly more wall clock time than with one
codegen unit.

One caveat of this commit is that the symbol names for inlined functions that
are only translated once needed some slight tweaking. These inline functions
could be translated into multiple crates and we need to make sure the symbols
don't collideA so the crate name/disambiguator is mixed in to the symbol name
hash in these situations.

[data]: #44941 (comment)
[before]: #44941 (comment)
@carols10cents carols10cents added the S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. label Oct 9, 2017
@aidanhs
Copy link
Member

aidanhs commented Oct 9, 2017

[01:24:43] Dist std stage2 (x86_64-unknown-linux-gnu -> x86_64-unknown-netbsd)
[01:26:25] Dist analysis
[01:26:25] image_src: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage1-std/x86_64-unknown-netbsd/release/deps/save-analysis", dst: "/checkout/obj/build/tmp/dist/rust-analysis-nightly-x86_64-unknown-netbsd-image/lib/rustlib/x86_64-unknown-netbsd/analysis"
[01:26:26] Dist src
[01:26:35] Dist extended stage2 (x86_64-unknown-netbsd)
[01:26:35] Dist cargo stage2 (x86_64-unknown-netbsd)
[01:26:35]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[01:26:35]                                  Dload  Upload   Total   Spent    Left  Speed
[01:26:35] 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   282    0   282    0     0    847      0 --:--:-- --:--:-- --:--:--   849
[01:26:35] thread 'main' panicked at 'downloaded openssl sha256 different
[01:26:35] expected: 6b3977c61f2aedf0f96367dcfb5c6e578cf37e7b8d913b4ecb6643c3cb88d8c0
[01:26:35] found:    8552e3169b5a6071edfab25bf7dacb3bbd40ae5325bd472690047b90c1cd0589
[01:26:35] ', /checkout/src/bootstrap/native.rs:380:16
[01:26:35] note: Run with `RUST_BACKTRACE=1` for a backtrace.
[01:26:35] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap dist --host x86_64-unknown-netbsd --target x86_64-unknown-netbsd
[01:26:35] Build completed unsuccessfully in 1:24:19

Possibly #40474?

@bors
Copy link
Contributor

bors commented Oct 9, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: michaelwoerister
Pushing 72d6501 to master...

@bors bors merged commit 4b2bdf7 into rust-lang:master Oct 9, 2017
@bluss bluss added the relnotes Marks issues that should be documented in the release notes of the next release. label Oct 9, 2017
@alexcrichton alexcrichton deleted the inline-less branch October 10, 2017 18:46
kennytm added a commit to kennytm/rust that referenced this pull request Oct 13, 2017
…nload-error, r=Mark-Simulacrum

rustbuild: Make openssl download more reliable.

1. Add `-f` flag to curl, so when the server returns 403 or 500 it will fail immediately.
2. Moved the checksum part into the retry loop, assuming checksum failure is due to broken download that can be fixed by downloading again.

This PR is created responding to two recent spurious failures in rust-lang#45075 (comment) and rust-lang#45030 (comment).

r? @Mark-Simulacrum , cc @aidanhs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes Marks issues that should be documented in the release notes of the next release. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants