
Consider disabling compression for rlibs and bytecode files #37086

Closed
nnethercote opened this issue Oct 11, 2016 · 29 comments

@nnethercote
Contributor

One of the hottest functions in rustc is tdefl_compress, which is called from deflate_bytes. It's used in two places: crate metadata in rlibs, and LLVM bytecode files.

If we simply turned off compression in these two places we would get sizeable speed-ups. The following numbers are for a proof-of-concept patch, doing debug builds with a stage1 compiler.

```
futures-rs-test  4.632s vs  4.588s --> 1.009x faster (variance: 1.013x, 1.012x)
helloworld       0.249s vs  0.250s --> 0.997x faster (variance: 1.014x, 1.015x)
html5ever-2016-  7.967s vs  7.791s --> 1.023x faster (variance: 1.004x, 1.016x)
hyper.0.5.0      5.424s vs  5.177s --> 1.048x faster (variance: 1.004x, 1.006x)
inflate-0.1.0    5.013s vs  4.945s --> 1.014x faster (variance: 1.009x, 1.017x)
issue-32062-equ  0.367s vs  0.364s --> 1.008x faster (variance: 1.013x, 1.017x)
issue-32278-big  1.812s vs  1.810s --> 1.001x faster (variance: 1.007x, 1.008x)
jld-day15-parse  1.638s vs  1.606s --> 1.020x faster (variance: 1.001x, 1.012x)
piston-image-0. 12.522s vs 12.236s --> 1.023x faster (variance: 1.029x, 1.004x)
regex.0.1.30     2.684s vs  2.511s --> 1.069x faster (variance: 1.018x, 1.013x)
rust-encoding-0  2.232s vs  2.134s --> 1.046x faster (variance: 1.008x, 1.010x)
syntex-0.42.2   34.353s vs 33.205s --> 1.035x faster (variance: 1.011x, 1.013x)
syntex-0.42.2-i 18.848s vs 17.033s --> 1.107x faster (variance: 1.004x, 1.035x)
```

regex and syntex-incr are the biggest wins.

The obvious downside is that the relevant files get larger. So we need to decide whether we are happy with this trade-off: accepting larger files in exchange for faster compilation.

@nnethercote
Contributor Author

CC @eddyb, @Mark-Simulacrum

@arielb1
Contributor

arielb1 commented Oct 11, 2016

Maybe we should switch to a faster compression algorithm - e.g. lz4?

@eddyb
Member

eddyb commented Oct 11, 2016

Something that could be considered is an algorithm that is seekable, so that we could also decompress on-demand (as we use absolute/relative positioning in the metadata blob).
If we can go back to compressing the metadata blobs in rlibs without noticeable slowdown we could get rid of the leb128 encoding and just write integers in little-endian as they would be in memory.
Then again, it might not be worth it.
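
For illustration, a minimal sketch of the fixed-width alternative described here, assuming metadata integers were simply written little-endian (names are illustrative, not rustc's actual encoder):

```rust
// With general-purpose compression handling the size reduction, integers
// could be written as plain little-endian instead of leb128. Fixed-width
// reads and writes are branch-free, unlike leb128's per-byte loop.
fn write_u32_le(out: &mut Vec<u8>, value: u32) {
    out.extend_from_slice(&value.to_le_bytes());
}

fn read_u32_le(data: &[u8], pos: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&data[pos..pos + 4]);
    u32::from_le_bytes(buf)
}
```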

@nagisa
Member

nagisa commented Oct 11, 2016

we would get sizeable speed-ups

How much larger do the rlibs become, though? 30%?

@michaelwoerister
Member

michaelwoerister commented Oct 11, 2016

For reference:

Alright so given the numbers #6954, I don't think LZ4 is worth it, and neither does @brson.

from #6902 (comment)

To be honest, I don't think that the speed gains listed above warrant turning compression off. How hard would it be to compress things on a background thread? Compression could be done in parallel with codegen, maybe?
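
A minimal sketch of that idea, assuming the compression entry point can be handed a buffer and run on a worker thread (`compress` is a stand-in, not rustc's actual function):

```rust
use std::thread;

// Start compression on a worker thread so it overlaps with codegen; join
// only when the compressed bytes are actually needed at link time.
fn compress_in_background(
    data: Vec<u8>,
    compress: fn(&[u8]) -> Vec<u8>,
) -> thread::JoinHandle<Vec<u8>> {
    thread::spawn(move || compress(&data))
}

// Usage sketch:
//   let handle = compress_in_background(bytecode, deflate_bytes);
//   ... run codegen ...
//   let compressed = handle.join().unwrap();
```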

@eddyb
Member

eddyb commented Oct 11, 2016

rlibs are already uncompressed except bitcode. Metadata in dylibs is compressed.

I think we turned it off in rlibs not because of the one-time compression cost, but because of decompression in each user crate (compression also means no mmap is possible).

@alexcrichton
Member

It's used in two places: crate metadata in rlibs, and LLVM bytecode files.

As @eddyb mentioned, rlib metadata isn't compressed, as it's intended to be mmap'd. So that makes me curious: where is this compression called from? A standard compilation should not look at the bytecode (e.g. it shouldn't need to decompress it); it's only there for LTO builds. Or that's at least the state of the world as of a few years ago when I implemented LTO...

@nnethercote are you sure that this function is mostly being called from decompression of metadata and/or bytecode? I could imagine this showing up when compressing bytecode, but not during a normal compile...

@eddyb
Member

eddyb commented Oct 11, 2016

tdefl_compress is compression, i.e. "deflate" ("inflate" being decompression; the "de" of "deflate" might be a bit confusing).

@alexcrichton
Member

Oh ok I think I misread.

Then yes, I think that this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs). We can likely stomach larger rlibs (the size rarely comes up), but to truly fix this we in theory want to disable bytecode-in-rlib entirely. It's only used for LTO, which is almost never used, so we should arguably require a flag to opt in to LTO-able rlibs, which Cargo could pass by default if need be.

@nnethercote
Contributor Author

Then yes I think that this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs).

As the first comment says, compression occurs in two places. More specifically, here:

  • write_metadata, in src/librustc_trans/base.rs
  • link_rlib, in src/librustc_trans/back/link.rs

Both of them are significant, though how significant varies by benchmark. E.g. the former affects regex more, the latter affects syntex-incr more.

we could get rid of the leb128 encoding

leb128 encoding is also hot. E.g. see #37083 where a tiny improvement to read_unsigned_leb128 had a small but noticeable effect.
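
For context, a minimal sketch of what unsigned leb128 decoding involves (illustrative, not rustc's actual implementation):

```rust
// Unsigned leb128: each byte carries 7 payload bits, and the high bit says
// whether another byte follows. This per-byte shift-and-branch loop is why
// the encoding shows up in profiles when every metadata integer uses it.
fn read_unsigned_leb128(data: &[u8], mut pos: usize) -> (u64, usize) {
    let mut result = 0u64;
    let mut shift = 0;
    loop {
        let byte = data[pos];
        pos += 1;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return (result, pos);
        }
        shift += 7;
    }
}
```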

@nnethercote
Contributor Author

Any suggestions on how to move forward here? The speed-ups are large, but the disk space trade-off is such that it will be easy for inertia to win here, and I'd like to avoid that. I can take file measurements if someone suggests good examples to measure.

@alexcrichton
Member

I suppose my personal ideal world would look a little like:

  • The compiler works with rlibs that have either compressed or uncompressed bytecode, some metadata says which.
  • A flag is added to the compiler to compress the bytecode (or basically "generate the smallest, yet fastest to read rlib")
  • The compiler's build system passes this flag on release builders

That way the Rust installation stays the same size, yet everyone gets the benefit of not having to compress bytecode. Cargo build directories in general don't matter as much for size as the Rust installation itself, I've found.

@michaelwoerister
Member

I'd prefer to have a future-proof concept before doing anything here. Unless I'm overlooking something, there are only two crates where this makes a difference of more than a second, and that's for debug builds of big crates. In a release build, I would suspect that the difference is well under 1% of total build time. So I don't see an urgent need for action here. (Sorry, @nnethercote, I don't want to put a damper on your enthusiasm. It's great that you are looking into this, I just don't want to needlessly rush things.)

Some questions that I'd like to have answered before proceeding:

  • Do we really want to make LTO opt-in? How would that interact with the possibility of machine-code-less rlibs that have been talked about from time to time? I'm worried about introducing another stable command-line option just to deprecate it again a few months later.
  • Is there a compression algorithm that works well with small, individually addressable items in a stream? Out of the box, LZW variants don't provide that, afaik.
  • Do we need a general-purpose algorithm at all? Using leb128 for all numbers probably provides pretty OK compression (but mixing leb128 with general-purpose compression, like we do now, is probably a bad idea).
  • How much of the compressed data is crate metadata and how much is LLVM bitcode?
  • Could we disable compression implicitly, e.g. for debug builds, incremental compilation, and release builds that do not specify -Os or -Oz?

@oyvindln
Contributor

A relatively easy change that might help would be to simply reduce the compression level. There are some flags that can be changed, see here, or here. Maybe there is a better trade-off that could be found without making massive changes. Another thing worth noting is that the deflate function, as it's written now, makes miniz allocate memory for the compressor each time it's called (I haven't counted the exact size, but it may be close to 1 MB), which may not be ideal if it's called a number of times in short succession.
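
A sketch of the reuse idea, shown with flate2's low-level `Compress` type (an illustration; the compiler called the C miniz API directly at the time):

```rust
use flate2::{Compress, FlushCompress, Status};

// Keep one deflate state alive and reset it between inputs, rather than
// re-allocating the compressor's large internal buffers on every call.
fn compress_reusing(compressor: &mut Compress, input: &[u8]) -> Vec<u8> {
    compressor.reset();
    // compress_vec writes into the vector's spare capacity, so reserve
    // generously (deflate can slightly expand incompressible data).
    let mut output = Vec::with_capacity(input.len() + input.len() / 2 + 64);
    match compressor
        .compress_vec(input, &mut output, FlushCompress::Finish)
        .expect("deflate failed")
    {
        Status::StreamEnd => output,
        _ => panic!("output buffer too small"),
    }
}
```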

@joshtriplett
Member

I don't think we want to make LTO opt-in; Rust should support it by default, without needing to rebuild the world with it.

@jld
Contributor

jld commented Oct 15, 2016

Not sure how much it helps, but deflate can be parallelized.
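
A sketch of how that could look, splitting the input into chunks that are compressed independently on worker threads (`compress` is a stand-in for the codec; a real format would also record per-chunk lengths so readers can find them):

```rust
use std::thread;

// pigz-style parallelism: compress fixed-size chunks on separate threads
// and collect the results in order.
fn parallel_compress(
    data: &[u8],
    chunk_size: usize,
    compress: fn(&[u8]) -> Vec<u8>,
) -> Vec<Vec<u8>> {
    thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || compress(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```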

@briansmith
Contributor

It's only used for LTO, which is almost never used

I disagree that LTO is almost never used. But also, in the long run, it should be used more. So I'd rather have all libraries support LTO by default.

Rust should support it by default, without needing to rebuild the world with it.

I personally am OK with rebuilding the world to switch from non-LTO to LTO builds, as long as it doesn't require every library to opt into anything manually.

Perhaps, when building an application, Cargo could inspect the configuration, and if any configuration enables LTO then it could build every library with LTO support. Otherwise, if the application doesn't enable LTO, neither would its libraries. I don't know if this would need to inspect just the application's Cargo.toml or recursively inspect all the Cargo.tomls, but this seems like just a detail.
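
For concreteness, the configuration being inspected would be Cargo's existing profile setting:

```toml
# In the application's Cargo.toml: enabling LTO in a profile is the signal
# Cargo could use to build every dependency with LTO support.
[profile.release]
lto = true
```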

@lu-zero
Contributor

lu-zero commented Oct 18, 2016

Just to point out the obvious: using something other than deflate might help more, and since you are already encoding the data anyway, it might be worth considering doing it in one pass with lz4, lzo, or even zstd, and sparing a step.

How hard would it be to plug in any of those?

@arielb1
Contributor

arielb1 commented Oct 18, 2016

zstd is supposedly a faster, equal-compression-ratio version of zlib. Someone should check that.
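
For anyone wanting to check, a minimal harness along these lines would do, using the `zstd` crate (an assumption; it is not part of rustc) and comparing against deflate on the same inputs:

```rust
use std::time::Instant;

// Compress one blob at a given level and report ratio and wall time.
fn bench_zstd(data: &[u8], level: i32) {
    let start = Instant::now();
    let compressed = zstd::encode_all(data, level).expect("zstd failed");
    println!(
        "zstd level {}: {} -> {} bytes ({:.2}x) in {:?}",
        level,
        data.len(),
        compressed.len(),
        data.len() as f64 / compressed.len() as f64,
        start.elapsed()
    );
}
```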

@nnethercote
Contributor Author

As the first comment says, compression occurs in two places. More specifically, here:

  • write_metadata, in src/librustc_trans/base.rs
  • link_rlib, in src/librustc_trans/back/link.rs

Both of them are significant, though how significant varies by benchmark. E.g. the former affects regex more, the latter affects syntex-incr more.

@eddyb worked out that write_metadata is compressing metadata unnecessarily for rlibs! #37267 addresses this. With that change, write_metadata still compresses for dylibs and "proc macro" crates, but AIUI they are much rarer. So that's the first part of this issue addressed.

That still leaves the bytecode compression, which is the larger part of the potential speed-up. I tried disabling bytecode compression and the size of the rlibs for syntex increased from 54,508,120 bytes to 75,265,964 bytes, a 1.38x increase, i.e. quite a bit. So unconditionally disabling it probably isn't feasible.

eddyb added a commit to eddyb/rust that referenced this issue Oct 19, 2016

Optimize `write_metadata`.

`write_metadata` currently generates metadata unnecessarily in some
cases, and also compresses it unnecessarily in some cases. This commit
fixes that. It speeds up three of the rustc-benchmarks by 1-4%.

r? @eddyb, who deserves much of the credit because he (a) identified the problem from the profile data I provided in rust-lang#37086, and (b) explained to me how to fix it. Thank you, @eddyb!
@alexcrichton
Member

I did a small bit of analysis of compression algorithms. I extracted all the bytecode from the standard distribution rlibs (i.e. everything we're shipping), decompressed it, and then recompressed it with a bunch of algorithms, keeping track of how long everything took as well as the compression ratios.

The raw data is here where each section is the statistics for one particular piece of bytecode. The final entry is the summation of everything previous. I tested:

  • xz compression levels 0-9
  • deflate compression levels (fast, default, best)
  • brotli compression levels 0-9
  • zstd compression levels 0-9

Basically what I think this tells me is that zstd is blazingly fast and gets better compression than deflate (what we're using today) at lower levels. Otherwise xz is super slow (surprise surprise) and brotli also isn't faring too well on this data set.

Now, that being said, this seems like a tiny portion of compiles. The deflate times for any particular bytecode look like they're at most in the handfuls of milliseconds. If we really want speed, though, zstd seems the way to go.

@nnethercote
Contributor Author

Thank you for the analysis, @alexcrichton. What parameters define the "fast", "default", and "best" modes for deflate?

Now that being said, this seems like it's a tiny portion of compiles.

For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...

@oyvindln
Contributor

Based on the tests done by @alexcrichton (maybe you could put the test up in a GitHub repo or something?), it seems that lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while deciding on, and possibly implementing, a change to a different compression algorithm.

@michaelwoerister
Member

For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...

Is that still true now that metadata is no longer compressed for rlibs?

@michaelwoerister
Member

I think a good first step towards improvement here would be to allow for LLVM bitcode to be either compressed or not. The way we store bitcode already contains a small header that tells us about the format, so this would be easy to implement in a backwards-compatible way.

With that implemented we can just forgo compression in some scenarios (like debug builds or incr. comp.) in a transparent way.

Adding support for zstd would be nice too.
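
A sketch of what such a header could look like; the magic bytes and layout here are invented for illustration, not rustc's actual on-disk format:

```rust
// A one-byte format tag after a magic prefix lets readers handle both
// compressed and uncompressed bitcode transparently, and stays extensible
// if another algorithm (e.g. zstd) is added later.
const BITCODE_MAGIC: &[u8; 4] = b"RBCD";

#[derive(Clone, Copy, Debug)]
enum BitcodeCompression {
    None = 0,
    Deflate = 1,
}

fn write_bitcode_header(out: &mut Vec<u8>, compression: BitcodeCompression) {
    out.extend_from_slice(BITCODE_MAGIC);
    out.push(compression as u8);
}

fn read_bitcode_compression(data: &[u8]) -> Option<BitcodeCompression> {
    if data.len() < 5 || data[..4] != BITCODE_MAGIC[..] {
        return None; // unknown/legacy format
    }
    match data[4] {
        0 => Some(BitcodeCompression::None),
        1 => Some(BitcodeCompression::Deflate),
        _ => None,
    }
}
```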

@alexcrichton
Member

@nnethercote

The fast/default/best correspond to what's in flate2.
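
For reference, a minimal sketch of those levels through flate2's encoder API (shown with the current API, which differs from the 2016-era enum):

```rust
use flate2::{write::DeflateEncoder, Compression};
use std::io::Write;

// Compression::fast() is level 1, Compression::best() is level 9, and the
// default sits in between at level 6: the trade-off discussed here.
fn deflate_bytes(data: &[u8], level: Compression) -> std::io::Result<Vec<u8>> {
    let mut encoder = DeflateEncoder::new(Vec::new(), level);
    encoder.write_all(data)?;
    encoder.finish()
}
```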

@oyvindln

It's pretty janky, but the script I used is here.

@michaelwoerister

Agreed we should support uncompressed bytecode! I'd also be fine with it behind a flag that we disable by default and only enable for our own releases.

@briansmith
Contributor

That way the Rust installation stays the same size, yet everyone gets the benefit of not having to compress bytecode. Cargo build directories in general don't matter as much for size as the Rust installation itself, I've found.

All said and done, my Visual Studio + Windows SDK setup is ~5GB. I wouldn't blink at the Rust installation doubling in size if it meant compilation was even slightly (a few percent) faster and/or if the toolchain was easier to maintain (i.e. if it didn't need to support any compression at all).

@nnethercote
Contributor Author

lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while deciding on, and possibly implementing, a change to a different compression algorithm.

This is exactly what I've done in #37298. Thank you for the suggestion.

Manishearth added a commit to Manishearth/rust that referenced this issue Oct 22, 2016

Use a faster `deflate` setting

In rust-lang#37086 we have considered various ideas for reducing the cost of LLVM bytecode compression. This PR implements the simplest of these: use a faster `deflate` setting. It's very simple and reduces the compression time by almost half while increasing the size of the resulting rlibs by only about 2%.

I looked at using zstd, which might be able to halve the compression time again. But integrating zstd is beyond my Rust FFI integration abilities at the moment -- it consists of a few dozen C files, has a non-trivial build system, etc. I decided it was worth getting a big chunk of the possible improvement with minimum effort.

The following table shows the before and after percentages of instructions executed during compression while doing debug builds of some of the rustc-benchmarks with a stage1 compiler.
```
html5ever-2016-08-25      1.4% -> 0.7%
hyper.0.5.0               3.8% -> 2.4%
inflate-0.1.0             1.0% -> 0.5%
piston-image-0.10.3       2.9% -> 1.8%
regex.0.1.30              3.4% -> 2.1%
rust-encoding-0.3.0       4.8% -> 2.9%
syntex-0.42.2             2.9% -> 1.8%
syntex-0.42.2-incr-clean 14.2% -> 8.9%
```
The omitted ones spend 0% of their time in compression.

And here are actual timings:
```
futures-rs-test  4.110s vs  4.102s --> 1.002x faster (variance: 1.017x, 1.004x)
helloworld       0.223s vs  0.226s --> 0.986x faster (variance: 1.012x, 1.022x)
html5ever-2016-  4.218s vs  4.186s --> 1.008x faster (variance: 1.008x, 1.010x)
hyper.0.5.0      4.746s vs  4.661s --> 1.018x faster (variance: 1.002x, 1.016x)
inflate-0.1.0    4.194s vs  4.143s --> 1.012x faster (variance: 1.007x, 1.006x)
issue-32062-equ  0.317s vs  0.316s --> 1.001x faster (variance: 1.013x, 1.005x)
issue-32278-big  1.811s vs  1.825s --> 0.992x faster (variance: 1.014x, 1.006x)
jld-day15-parse  1.412s vs  1.412s --> 1.001x faster (variance: 1.019x, 1.008x)
piston-image-0. 11.058s vs 10.977s --> 1.007x faster (variance: 1.008x, 1.039x)
reddit-stress    2.331s vs  2.342s --> 0.995x faster (variance: 1.019x, 1.006x)
regex.0.1.30     2.294s vs  2.276s --> 1.008x faster (variance: 1.007x, 1.007x)
rust-encoding-0  1.963s vs  1.924s --> 1.020x faster (variance: 1.009x, 1.006x)
syntex-0.42.2   29.667s vs 29.391s --> 1.009x faster (variance: 1.002x, 1.023x)
syntex-0.42.2-i 15.257s vs 14.148s --> 1.078x faster (variance: 1.018x, 1.008x)
```

r? @alexcrichton
bors added a commit that referenced this issue Oct 22, 2016
Use a faster `deflate` setting

@nnethercote
Contributor Author

#37267 removed some useless compression, and #37298 tweaked the deflate settings to make compression faster. Compression still shows up moderately high in some profiles but I think that's enough progress to close this issue.
