
Consider disabling compression for rlibs and bytecode files #37086

Closed
nnethercote opened this issue Oct 11, 2016 · 29 comments

@nnethercote
Contributor

One of the hottest functions in rustc is tdefl_compress, which is called from deflate_bytes. It's used in two places: crate metadata in rlibs, and LLVM bytecode files.

If we simply turned off compression in these two places we would get sizeable speed-ups. The following numbers are for a proof-of-concept patch, doing debug builds with a stage1 compiler.

```
futures-rs-test  4.632s vs  4.588s --> 1.009x faster (variance: 1.013x, 1.012x)
helloworld       0.249s vs  0.250s --> 0.997x faster (variance: 1.014x, 1.015x)
html5ever-2016-  7.967s vs  7.791s --> 1.023x faster (variance: 1.004x, 1.016x)
hyper.0.5.0      5.424s vs  5.177s --> 1.048x faster (variance: 1.004x, 1.006x)
inflate-0.1.0    5.013s vs  4.945s --> 1.014x faster (variance: 1.009x, 1.017x)
issue-32062-equ  0.367s vs  0.364s --> 1.008x faster (variance: 1.013x, 1.017x)
issue-32278-big  1.812s vs  1.810s --> 1.001x faster (variance: 1.007x, 1.008x)
jld-day15-parse  1.638s vs  1.606s --> 1.020x faster (variance: 1.001x, 1.012x)
piston-image-0. 12.522s vs 12.236s --> 1.023x faster (variance: 1.029x, 1.004x)
regex.0.1.30     2.684s vs  2.511s --> 1.069x faster (variance: 1.018x, 1.013x)
rust-encoding-0  2.232s vs  2.134s --> 1.046x faster (variance: 1.008x, 1.010x)
syntex-0.42.2   34.353s vs 33.205s --> 1.035x faster (variance: 1.011x, 1.013x)
syntex-0.42.2-i 18.848s vs 17.033s --> 1.107x faster (variance: 1.004x, 1.035x)
```

regex and syntex-incr are the biggest wins.

The obvious downside is that the relevant files get larger. So we need to decide whether we are happy with this trade-off: accepting larger files in exchange for faster compilation.

@nnethercote
Contributor Author

CC @eddyb, @Mark-Simulacrum

@arielb1
Contributor

arielb1 commented Oct 11, 2016

Maybe we should switch to a faster compression algorithm - e.g. lz4?

@eddyb
Member

eddyb commented Oct 11, 2016

Something that could be considered is an algorithm that is seekable, so that we could also decompress on-demand (as we use absolute/relative positioning in the metadata blob).
If we can go back to compressing the metadata blobs in rlibs without noticeable slowdown we could get rid of the leb128 encoding and just write integers in little-endian as they would be in memory.
Then again, it might not be worth it.
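
For illustration, a minimal sketch of the fixed-width alternative described here, assuming metadata integers were simply written little-endian (names are illustrative, not rustc's actual encoder):

```rust
// With general-purpose compression handling the size reduction, integers
// could be written as plain little-endian instead of leb128. Fixed-width
// reads and writes are branch-free, unlike leb128's per-byte loop.
fn write_u32_le(out: &mut Vec<u8>, value: u32) {
    out.extend_from_slice(&value.to_le_bytes());
}

fn read_u32_le(data: &[u8], pos: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&data[pos..pos + 4]);
    u32::from_le_bytes(buf)
}
```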

@nagisa
Member

nagisa commented Oct 11, 2016

we would get sizeable speed-ups

How much larger do the rlibs become, though? 30%?

@michaelwoerister
Member

michaelwoerister commented Oct 11, 2016

For reference:

Alright so given the numbers #6954, I don't think LZ4 is worth it, and neither does @brson.

from #6902 (comment)

To be honest, I don't think that the speed gains listed above warrant turning compression off. How hard would it be to compress things on a background thread? Compression could be done in parallel with codegen, maybe?
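
A minimal sketch of that idea, assuming the compression entry point can be handed a buffer and run on a worker thread (`compress` is a stand-in, not rustc's actual function):

```rust
use std::thread;

// Start compression on a worker thread so it overlaps with codegen; join
// only when the compressed bytes are actually needed at link time.
fn compress_in_background(
    data: Vec<u8>,
    compress: fn(&[u8]) -> Vec<u8>,
) -> thread::JoinHandle<Vec<u8>> {
    thread::spawn(move || compress(&data))
}

// Usage sketch:
//   let handle = compress_in_background(bytecode, deflate_bytes);
//   ... run codegen ...
//   let compressed = handle.join().unwrap();
```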

@eddyb
Member

eddyb commented Oct 11, 2016

rlibs are already uncompressed except bitcode. Metadata in dylibs is compressed.

I think we turned it off in rlibs not because of the one-time compression cost, but because of decompression in each user crate (compression also means no mmap is possible).

@alexcrichton
Member

It's used in two places: crate metadata in rlibs, and LLVM bytecode files.

As @eddyb mentioned, rlib metadata isn't compressed, as it's intended to be mmap'd. So that makes me curious: where is this compression called from? A standard compilation should not look at the bytecode (e.g. it shouldn't need to decompress it); it's only there for LTO builds. Or that's at least the state of the world as of a few years ago when I implemented LTO...

@nnethercote are you sure that this function is mostly being called from decompression of metadata and/or bytecode? I could imagine this showing up when compressing bytecode, but not during a normal compile...

@eddyb
Member

eddyb commented Oct 11, 2016

tdefl_compress is compression, i.e. "deflate" ("inflate" being decompression; the "de" of "deflate" might be a bit confusing).

@alexcrichton
Member

Oh ok I think I misread.

Then yes, I think that this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs). We can likely stomach larger rlibs (the size rarely comes up), but to truly fix this we in theory want to disable bytecode-in-rlib entirely. It's only used for LTO, which is almost never used, so we should arguably require a flag to opt in to LTO-able rlibs, which Cargo could pass by default if need be.

@nnethercote
Contributor Author

Then yes I think that this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs).

As the first comment says, compression occurs in two places. More specifically, here:

  • write_metadata, in src/librustc_trans/base.rs
  • link_rlib, in src/librustc_trans/back/link.rs

Both of them are significant, though how significant varies by benchmark. E.g. the former affects regex more, the latter affects syntex-incr more.

we could get rid of the leb128 encoding

leb128 encoding is also hot. E.g. see #37083 where a tiny improvement to read_unsigned_leb128 had a small but noticeable effect.
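
For context, a minimal sketch of what unsigned leb128 decoding involves (illustrative, not rustc's actual implementation):

```rust
// Unsigned leb128: each byte carries 7 payload bits, and the high bit says
// whether another byte follows. This per-byte shift-and-branch loop is why
// the encoding shows up in profiles when every metadata integer uses it.
fn read_unsigned_leb128(data: &[u8], mut pos: usize) -> (u64, usize) {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let byte = data[pos];
        pos += 1;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return (result, pos);
        }
        shift += 7;
    }
}
```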

@nnethercote
Contributor Author

Any suggestions on how to move forward here? The speed-ups are large, but the disk space trade-off is such that it will be easy for inertia to win here, and I'd like to avoid that. I can take file measurements if someone suggests good examples to measure.

@alexcrichton
Member

I suppose my personal ideal world would look a little like:

  • The compiler works with rlibs that have either compressed or uncompressed bytecode, some metadata says which.
  • A flag is added to the compiler to compress the bytecode (or basically "generate the smallest, yet fastest to read rlib")
  • The compiler's build system passes this flag on release builders

That way the Rust installation stays the same size, yet everyone gets the benefit of not having to compress bytecode. Cargo build directories in general don't matter as much for size as the Rust installation itself, I've found.

@michaelwoerister
Member

I'd prefer to have a future-proof concept before doing anything here. Unless I'm overlooking something, there are only two crates where this makes a difference of more than a second, and that's for debug builds of big crates. In a release build, I would suspect that the difference is well under 1% of total build time. So I don't see an urgent need for action here. (Sorry, @nnethercote, I don't want to put a damper on your enthusiasm. It's great that you are looking into this, I just don't want to needlessly rush things.)

Some questions that I'd like to have answered before proceeding:

  • Do we really want to make LTO opt-in? How would that interact with the possibility of machine-code-less rlibs that have been talked about from time to time? I'm worried about introducing another stable command-line option just to deprecate it again a few months later.
  • Is there a compression algorithm that works well with small, individually addressable items in a stream? Out of the box, LZW variants don't provide that, afaik.
  • Do we need a general-purpose algorithm at all? Using leb128 for all numbers probably provides pretty OK compression (but mixing leb128 with general-purpose compression, like we do now, is probably a bad idea).
  • How much of the compressed data is crate metadata and how much is LLVM bitcode?
  • Could we disable compression implicitly, e.g. for debug builds, incremental compilation, and release builds that do not specify -Os or -Oz?

@oyvindln
Contributor

A relatively easy change that might help would be to simply reduce the compression level. There are some flags that can be changed, see here, or here. Maybe there is a better trade-off that could be found without making massive changes. Another thing worth noting is that the deflate function, as it's written now, makes miniz allocate memory for the compressor each time it's called (I haven't counted the exact size, but it may be close to 1 MB), which may not be ideal if it's called a number of times in short succession.
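
A sketch of the reuse idea, shown with flate2's low-level `Compress` type (an illustration; the compiler called the C miniz API directly at the time):

```rust
use flate2::{Compress, FlushCompress, Status};

// Keep one deflate state alive and reset it between inputs, rather than
// re-allocating the compressor's large internal buffers on every call.
fn compress_reusing(compressor: &mut Compress, input: &[u8]) -> Vec<u8> {
    compressor.reset();
    // compress_vec writes into the vector's spare capacity, so reserve
    // generously (deflate can slightly expand incompressible data).
    let mut output = Vec::with_capacity(input.len() + input.len() / 2 + 64);
    match compressor
        .compress_vec(input, &mut output, FlushCompress::Finish)
        .expect("deflate failed")
    {
        Status::StreamEnd => output,
        _ => panic!("output buffer too small"),
    }
}
```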

@joshtriplett
Member

I don't think we want to make LTO opt-in; Rust should support it by default, without needing to rebuild the world with it.

@jld
Contributor

jld commented Oct 15, 2016

Not sure how much it helps, but deflate can be parallelized.
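
A sketch of how that could look, splitting the input into chunks that are compressed independently on worker threads (`compress` is a stand-in for the codec; a real format would also record per-chunk lengths so readers can find them):

```rust
use std::thread;

// pigz-style parallelism: compress fixed-size chunks on separate threads
// and collect the results in order.
fn parallel_compress(
    data: &[u8],
    chunk_size: usize,
    compress: fn(&[u8]) -> Vec<u8>,
) -> Vec<Vec<u8>> {
    thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || compress(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```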

@briansmith
Contributor

It's only used for LTO, which is almost never used

I disagree that LTO is almost never used. But also, in the long run, it should be used more. So I'd rather have all libraries support LTO by default.

Rust should support it by default, without needing to rebuild the world with it.

I personally am OK with rebuilding the world to switch from non-LTO to LTO builds, as long as it doesn't require every library to opt into anything manually.

Perhaps, when building an application, Cargo could inspect the configuration, and if any configuration enables LTO then it could build every library with LTO support. Otherwise, if the application doesn't enable LTO, neither would its libraries. I don't know if this would need to inspect just the application's Cargo.toml or recursively inspect all the Cargo.tomls, but this seems like just a detail.
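
For concreteness, the configuration being inspected would be Cargo's existing profile setting:

```toml
# In the application's Cargo.toml: enabling LTO in a profile is the signal
# Cargo could use to build every dependency with LTO support.
[profile.release]
lto = true
```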

@lu-zero
Contributor

lu-zero commented Oct 18, 2016

Just to point out the obvious: using something other than deflate might help more, and since you are already encoding the data anyway, it might be worth considering doing it in one pass with lz4, lzo, or even zstd, and sparing a step.

How hard would it be to plug in any of those?

@arielb1
Contributor

arielb1 commented Oct 18, 2016

zstd is supposedly a faster, equal-compression-ratio version of zlib. Someone should check that.
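
For anyone wanting to check, a minimal harness along these lines would do, using the `zstd` crate (an assumption; it is not part of rustc) and comparing against deflate on the same inputs:

```rust
use std::time::Instant;

// Compress one blob at a given level and report ratio and wall time.
fn bench_zstd(data: &[u8], level: i32) {
    let start = Instant::now();
    let compressed = zstd::encode_all(data, level).expect("zstd failed");
    println!(
        "zstd level {}: {} -> {} bytes ({:.2}x) in {:?}",
        level,
        data.len(),
        compressed.len(),
        data.len() as f64 / compressed.len() as f64,
        start.elapsed()
    );
}
```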

@nnethercote
Contributor Author

As the first comment says, compression occurs in two places. More specifically, here:

  • write_metadata, in src/librustc_trans/base.rs
  • link_rlib, in src/librustc_trans/back/link.rs

Both of them are significant, though how significant varies by benchmark. E.g. the former affects regex more, the latter affects syntex-incr more.

@eddyb worked out that write_metadata is compressing metadata unnecessarily for rlibs! #37267 addresses this. With that change, write_metadata still compresses for dylibs and "proc macro" crates, but AIUI they are much rarer. So that's the first part of this issue addressed.

That still leaves the bytecode compression, which is the larger part of the potential speed-up. I tried disabling bytecode compression and the size of the rlibs for syntex increased from 54,508,120 bytes to 75,265,964 bytes, a 1.38x increase, i.e. quite a bit. So unconditionally disabling it probably isn't feasible.

eddyb added a commit to eddyb/rust that referenced this issue Oct 19, 2016

Optimize `write_metadata`.

`write_metadata` currently generates metadata unnecessarily in some
cases, and also compresses it unnecessarily in some cases. This commit
fixes that. It speeds up three of the rustc-benchmarks by 1-4%.

r? @eddyb, who deserves much of the credit because he (a) identified the problem from the profile data I provided in rust-lang#37086, and (b) explained to me how to fix it. Thank you, @eddyb!
@alexcrichton
Member

I did a small bit of analysis of compression algorithms. I extracted all the bytecode from the standard distribution rlibs (i.e. everything we're shipping), decompressed it, and then recompressed it with a bunch of algorithms, keeping track of how long everything took as well as the compression ratios.

The raw data is here where each section is the statistics for one particular piece of bytecode. The final entry is the summation of everything previous. I tested:

  • xz compression levels 0-9
  • deflate compression levels (fast, default, best)
  • brotli compression levels 0-9
  • zstd compression levels 0-9

Basically what I think this tells me is that zstd is blazingly fast and gets better compression than deflate (what we're using today) at lower levels. Otherwise xz is super slow (surprise surprise) and brotli also isn't faring too well on this data set.

Now, that being said, this seems like a tiny portion of compiles. The deflate times for any particular bytecode look like they're at most in the handfuls of milliseconds. If we really want speed, though, zstd seems the way to go.

@nnethercote
Contributor Author

Thank you for the analysis, @alexcrichton. What parameters define the "fast", "default", and "best" modes for deflate?

Now that being said, this seems like it's a tiny portion of compiles.

For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...

@oyvindln
Contributor

Based on the tests done by @alexcrichton (maybe you could put the test up in a GitHub repo or something?), it seems that lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while deciding on, and possibly implementing, a change to a different compression algorithm.

@michaelwoerister
Member

For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...

Is that still true now that metadata is no longer compressed for rlibs?

@michaelwoerister
Member

I think a good first step towards improvement here would be to allow for LLVM bitcode to be either compressed or not. The way we store bitcode already contains a small header that tells us about the format, so this would be easy to implement in a backwards-compatible way.

With that implemented we can just forgo compression in some scenarios (like debug builds or incr. comp.) in a transparent way.

Adding support for zstd would be nice too.
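
A sketch of what such a header could look like; the magic bytes and layout here are invented for illustration, not rustc's actual on-disk format:

```rust
// A one-byte format tag after a magic prefix lets readers handle both
// compressed and uncompressed bitcode transparently, and stays extensible
// if another algorithm (e.g. zstd) is added later.
const BITCODE_MAGIC: &[u8; 4] = b"RBCD";

#[derive(Clone, Copy, Debug)]
enum BitcodeCompression {
    None = 0,
    Deflate = 1,
}

fn write_bitcode_header(out: &mut Vec<u8>, compression: BitcodeCompression) {
    out.extend_from_slice(BITCODE_MAGIC);
    out.push(compression as u8);
}

fn read_bitcode_compression(data: &[u8]) -> Option<BitcodeCompression> {
    if data.len() < 5 || data[..4] != BITCODE_MAGIC[..] {
        return None; // unknown/legacy format
    }
    match data[4] {
        0 => Some(BitcodeCompression::None),
        1 => Some(BitcodeCompression::Deflate),
        _ => None,
    }
}
```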

@alexcrichton
Member

@nnethercote

The fast/default/best correspond to what's in flate2.
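
For reference, a minimal sketch of those levels through flate2's encoder API (shown with the current API, which differs from the 2016-era enum):

```rust
use flate2::{write::DeflateEncoder, Compression};
use std::io::Write;

// Compression::fast() is level 1, Compression::best() is level 9, and the
// default sits in between at level 6: the trade-off discussed here.
fn deflate_bytes(data: &[u8], level: Compression) -> std::io::Result<Vec<u8>> {
    let mut encoder = DeflateEncoder::new(Vec::new(), level);
    encoder.write_all(data)?;
    encoder.finish()
}
```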

@oyvindln

It's pretty janky, but the script I used is here.

@michaelwoerister

Agreed we should support uncompressed bytecode! I'd also be fine with it behind a flag that we disable by default and only enable for our own releases.

@briansmith
Contributor

That way the Rust installation stays the same size, yet everyone gets the benefit of not having to compress bytecode. Cargo build directories in general don't matter as much for size as the Rust installation itself, I've found.

All said and done, my Visual Studio + Windows SDK setup is ~5GB. I wouldn't blink at the Rust installation doubling in size if it meant compilation was even slightly (a few percent) faster and/or if the toolchain was easier to maintain (i.e. if it didn't need to support any compression at all).

@nnethercote
Contributor Author

lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while deciding on, and possibly implementing, a change to a different compression algorithm.

This is exactly what I've done in #37298. Thank you for the suggestion.

Manishearth added a commit to Manishearth/rust that referenced this issue Oct 22, 2016

Use a faster `deflate` setting

In rust-lang#37086 we have considered various ideas for reducing the cost of LLVM bytecode compression. This PR implements the simplest of these: use a faster `deflate` setting. It's very simple and reduces the compression time by almost half while increasing the size of the resulting rlibs by only about 2%.

I looked at using zstd, which might be able to halve the compression time again. But integrating zstd is beyond my Rust FFI integration abilities at the moment -- it consists of a few dozen C files, has a non-trivial build system, etc. I decided it was worth getting a big chunk of the possible improvement with minimum effort.

The following table shows the before and after percentages of instructions executed during compression while doing debug builds of some of the rustc-benchmarks with a stage1 compiler.
```
html5ever-2016-08-25      1.4% -> 0.7%
hyper.0.5.0               3.8% -> 2.4%
inflate-0.1.0             1.0% -> 0.5%
piston-image-0.10.3       2.9% -> 1.8%
regex.0.1.30              3.4% -> 2.1%
rust-encoding-0.3.0       4.8% -> 2.9%
syntex-0.42.2             2.9% -> 1.8%
syntex-0.42.2-incr-clean 14.2% -> 8.9%
```
The omitted ones spend 0% of their time in compression.

And here are actual timings:
```
futures-rs-test  4.110s vs  4.102s --> 1.002x faster (variance: 1.017x, 1.004x)
helloworld       0.223s vs  0.226s --> 0.986x faster (variance: 1.012x, 1.022x)
html5ever-2016-  4.218s vs  4.186s --> 1.008x faster (variance: 1.008x, 1.010x)
hyper.0.5.0      4.746s vs  4.661s --> 1.018x faster (variance: 1.002x, 1.016x)
inflate-0.1.0    4.194s vs  4.143s --> 1.012x faster (variance: 1.007x, 1.006x)
issue-32062-equ  0.317s vs  0.316s --> 1.001x faster (variance: 1.013x, 1.005x)
issue-32278-big  1.811s vs  1.825s --> 0.992x faster (variance: 1.014x, 1.006x)
jld-day15-parse  1.412s vs  1.412s --> 1.001x faster (variance: 1.019x, 1.008x)
piston-image-0. 11.058s vs 10.977s --> 1.007x faster (variance: 1.008x, 1.039x)
reddit-stress    2.331s vs  2.342s --> 0.995x faster (variance: 1.019x, 1.006x)
regex.0.1.30     2.294s vs  2.276s --> 1.008x faster (variance: 1.007x, 1.007x)
rust-encoding-0  1.963s vs  1.924s --> 1.020x faster (variance: 1.009x, 1.006x)
syntex-0.42.2   29.667s vs 29.391s --> 1.009x faster (variance: 1.002x, 1.023x)
syntex-0.42.2-i 15.257s vs 14.148s --> 1.078x faster (variance: 1.018x, 1.008x)
```

r? @alexcrichton
bors added a commit that referenced this issue Oct 22, 2016
Use a faster `deflate` setting

@nnethercote
Contributor Author

#37267 removed some useless compression, and #37298 tweaked the deflate settings to make compression faster. Compression still shows up moderately high in some profiles but I think that's enough progress to close this issue.
