
Fix read_to_end to not grow an exact size buffer #89165

Merged 1 commit into rust-lang:master on Oct 4, 2021

Conversation

@jkugelman (Contributor) commented Sep 22, 2021

If you know how much data to expect and use Vec::with_capacity to pre-allocate a buffer of that capacity, Read::read_to_end will still double the buffer's capacity. It needs some spare space to perform a read, even though that read ends up returning 0.

It's a bummer to carefully pre-allocate 1GB to read a 1GB file into memory and end up using 2GB.

This fixes that behavior by special-casing a full buffer and reading into a small "probe" buffer instead. If that read returns 0, it confirms that the buffer was exactly the right size. If it doesn't, the probe buffer's contents are appended to the main buffer and the read loop continues.

Fixing this allows several workarounds in the standard library to be removed:

  • Take no longer needs to override Read::read_to_end.
  • The reservation_size callback that allowed Take to inhibit the previous over-allocation behavior isn't needed.
  • fs::read doesn't need to reserve an extra byte in initial_buffer_size.

Curiously, there was a unit test that specifically checked that Read::read_to_end does over-allocate. I removed that test, too.
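The approach described above can be sketched roughly as follows. This is an illustration only, not the actual std patch: the function name and the zero-initializing resize are simplifications, and the 32-byte probe size is taken from the strace output shown later in this thread.

```rust
use std::io::{self, Read};

// Sketch of the probe-read idea: when the buffer is exactly full,
// read into a small stack buffer to check for EOF instead of
// speculatively doubling the Vec's capacity.
fn read_to_end_probed<R: Read>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
    let start_len = buf.len();
    loop {
        if buf.len() == buf.capacity() {
            // Buffer is full: probe for EOF without growing the Vec.
            let mut probe = [0u8; 32];
            match reader.read(&mut probe)? {
                // Probe returned 0: the buffer was a perfect fit.
                0 => return Ok(buf.len() - start_len),
                // More data after all: append it and keep looping
                // (this is where the Vec finally grows, if needed).
                n => buf.extend_from_slice(&probe[..n]),
            }
        } else {
            // Spare capacity available: read directly into it.
            // (The real implementation reads into uninitialized
            // spare capacity; zero-filling keeps the sketch simple.)
            let cap = buf.capacity();
            let old_len = buf.len();
            buf.resize(cap, 0);
            let n = reader.read(&mut buf[old_len..])?;
            buf.truncate(old_len + n);
            if n == 0 {
                return Ok(buf.len() - start_len);
            }
        }
    }
}
```

With an exact-fit buffer, the Vec never reallocates: the data reads fill it, and the final probe read returns 0.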

@rust-highfive (Collaborator)

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Mark-Simulacrum (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 22, 2021
@joshtriplett (Member)

@bors try @rust-timer queue

I don't expect this to turn up any surprises, but that's why they're called "surprises".

@rust-timer (Collaborator)

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Sep 22, 2021
@bors (Contributor) commented Sep 22, 2021

⌛ Trying commit 9b9c24e with merge 4b920a40932f74a7159435b06d96cb50212514ff...

@the8472 (Member) commented Sep 22, 2021

This will cause the free functions fs::read and fs::read_to_string to do more syscalls since they now have to probe for EOF even though they already know the file size.

An alternative would be to update the documentation to recommend passing some extra capacity.
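The extra-capacity recommendation could look something like this sketch. The helper name is hypothetical; the one byte of slack mirrors the extra-byte reservation that fs::read's initial_buffer_size performed before this PR, letting the final read() observe EOF while the buffer still has spare capacity.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

// Hypothetical helper (not a std API): pre-size the buffer from the
// file's metadata, plus one byte of slack so read_to_end's final
// read() hits EOF before the buffer is ever "exactly full".
fn read_with_slack(path: &Path) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    // Fall back to 0 if metadata is unavailable (e.g. a pipe).
    let size = file.metadata().map(|m| m.len() as usize).unwrap_or(0);
    let mut buf = Vec::with_capacity(size + 1);
    file.read_to_end(&mut buf)?;
    Ok(buf)
}
```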

@bors (Contributor) commented Sep 22, 2021

☀️ Try build successful - checks-actions
Build commit: 4b920a40932f74a7159435b06d96cb50212514ff

@rust-timer (Collaborator)

Queued 4b920a40932f74a7159435b06d96cb50212514ff with parent ce45663, future comparison URL.

@jkugelman (Contributor, author) commented Sep 22, 2021

This will cause the free functions fs::read and fs::read_to_string to do more syscalls since they now have to probe for EOF even though they already know the file size.

It shouldn't cause an extra syscall for them, since they pass an exact-fit buffer. With or without this change, there will be n syscalls that return data plus a final one that returns Ok(0).

When the buffer is not an exact fit, there might be one extra read. I made sure to do the probe read only a single time, when the buffer is still at its starting capacity. And it only happens if the buffer started with extra capacity -- an empty Vec::new() with 0 capacity won't require an extra read.

@rust-timer (Collaborator)

Finished benchmarking commit (4b920a40932f74a7159435b06d96cb50212514ff): comparison url.

Summary: This change led to moderate relevant mixed results 🤷 in compiler performance.

  • Moderate improvement in instruction counts (up to -0.8% on full builds of webrender)
  • Small regression in instruction counts (up to 1.2% on incr-patched: println builds of coercions)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Sep 22, 2021
@joshtriplett (Member)

I would normally think this was noise, but it looks like a lot of small regressions about the same size. Is that a coincidence, or is this causing an issue?

Might be worth running the two toolchains with strace and seeing if the syscalls look noticeably different.

@jkugelman (Contributor, author) commented Sep 22, 2021

Might be worth running the two toolchains with strace and seeing if the syscalls look noticeably different.

Will do!

What's a good way to do that? I know strace well but this is my first contribution to this project and I'm not very familiar with testing it. Are you suggesting stracing on my computer or is there a CI tool I should be using? To test this PR I've been running ./x.py test library/std. I could strace that command -- or is there a smarter way to do it?

Also, is there some way I can re-test the regressions in the report? I don't really understand what the benchmark did or how to read the results.

@joshtriplett (Member) commented Sep 22, 2021

@jkugelman I'd suggest using https://rustc-dev-guide.rust-lang.org/compiler-debugging.html#downloading-artifacts-from-rusts-ci to install the exact before-and-after builds from the try above, the same ones used in the comparison. Then use both builds to do both a large and a small rustc compile, under strace, and try to compare the output side-by-side to see if there's an interesting difference in syscalls or their parameters. If there is a difference, you could also look at the proportion of time those extra syscalls take, and compare it to the proportion of the regressions.

@jkugelman (Contributor, author) commented Sep 22, 2021

@joshtriplett I wrote a script to count the number of read syscalls and the number of bytes read. I ran rustc +$hash on the hello-world program that cargo new spits out. The results aren't completely deterministic, so I ran it 10 times with the old toolchain and 10 with the new. The new version is slightly better on both measures on every run.

old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4405, bytes_read=30448537 (0)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (0)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (1)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (1)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (2)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (2)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (3)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (3)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (4)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (4)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (5)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (5)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (6)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (6)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (7)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (7)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4406, bytes_read=30448537 (8)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (8)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4405, bytes_read=30448537 (9)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (9)
TOTAL:
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=44044, bytes_read=304485370 (total)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=43943, bytes_read=304350310 (total)

Is this what you had in mind? If so I can go throw a big hunk of Rust code at it and see what happens.

@joshtriplett (Member)

I'm genuinely surprised that it's non-deterministic.

I'm primarily interested in seeing the actual differences in the straces. Can you upload an strace from each toolchain for comparison?

(Also, you may want to invoke the toolchain directly rather than via the rustup wrapper, if possible, to minimize noise.)

@jkugelman (Contributor, author) commented Sep 22, 2021

@joshtriplett

$ cat main.rs
fn main() {
    println!("Hello, world!");
}
$ strace -o old-ce45663e14dac3f0f58be698cc530bc2e6e21682.full.strace -f -T ~/.rustup/toolchains/ce45663e14dac3f0f58be698cc530bc2e6e21682/bin/rustc main.rs
$ strace -o new-4b920a40932f74a7159435b06d96cb50212514ff.full.strace -f -T ~/.rustup/toolchains/4b920a40932f74a7159435b06d96cb50212514ff/bin/rustc main.rs

old-ce45663e14dac3f0f58be698cc530bc2e6e21682.full.strace.gz
new-4b920a40932f74a7159435b06d96cb50212514ff.full.strace.gz

@jkugelman (Contributor, author) commented Sep 22, 2021

I found it really difficult to visually diff them. I spent about half an hour trying to clean them up and hide all the spurious differences so I could find the "real" diffs. The (apparent) non-determinism was a headache so I went with counting overall reads instead.

I don't know if you know an easy trick to compare the files. If not, I've got a good supply of elbow grease. I can go do more thorough analysis. I'm sure you don't want to spend hours on my PR. (And I really appreciate your help so far!)

@Mark-Simulacrum (Member)

Hm, so locally I seem to get entirely reproducible results:

$ sudo perf stat -r 30 -e syscalls:sys_enter_read ~/.rustup/toolchains/ce45663e14dac3f0f58be698cc530bc2e6e21682/bin/rustc t.rs
 Performance counter stats for '/home/mark/.rustup/toolchains/ce45663e14dac3f0f58be698cc530bc2e6e21682/bin/rustc t.rs' (30 runs):
              4370      syscalls:sys_enter_read
          0.109497 +- 0.000566 seconds time elapsed  ( +-  0.52% )
$ sudo perf stat -r 30 -e syscalls:sys_enter_read ~/.rustup/toolchains/4b920a40932f74a7159435b06d96cb50212514ff/bin/rustc t.rs
 Performance counter stats for '/home/mark/.rustup/toolchains/4b920a40932f74a7159435b06d96cb50212514ff/bin/rustc t.rs' (30 runs):
              4360      syscalls:sys_enter_read
          0.108526 +- 0.000544 seconds time elapsed  ( +-  0.50% )

This seems like it's both deterministic (I ran the above command several times; same result) and gives us the (expected) slight decrease in syscalls.

A good number of the regressions seem to be in -doc builds, and in some of those it looks like query counts went down (presumably since this PR deletes a function from std?). That creates some amount of noise in e.g. cachegrind diffs, as we're just doing different/less work. My inclination is to not consider these as blockers for this PR. However, I would like to get some sense why you're seeing nondeterminism with the syscalls -- what system are you running on? Can you try with the above perf commands and maybe see if those are OK?

@jkugelman (Contributor, author)

The perf runs on my system are consistent like yours and I get the exact same results: 4,370 vs. 4,360.

I fixed my strace script -- it was counting "exited with 0" and "SIGCHLD" lines. Now it's only counting reads and gives 100% consistent results, same as perf:

  • 4,370 vs. 4,360 reads
  • 30,434,827 vs. 30,421,321 bytes read

Script attached for transparency: count-reads.gz
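Since the actual script is only available as that attachment, here is a hypothetical reconstruction of its counting logic in Rust. It assumes a single-process strace log (no PID prefix from -f) and relies on strace printing the return value after " = ":

```rust
// Hypothetical reconstruction (not the attached count-reads script):
// tally read() syscalls and bytes returned in an strace log, skipping
// "+++ exited with 0 +++" and "--- SIGCHLD ---" lines, which the
// original buggy version of the script counted by mistake.
fn count_reads(strace_log: &str) -> (u64, u64) {
    let mut reads = 0;
    let mut bytes = 0;
    for line in strace_log.lines() {
        // Only actual read() calls, not exit-status or signal lines.
        if !line.trim_start().starts_with("read(") {
            continue;
        }
        // strace prints the return value after " = "; a trailing
        // "<0.000123>" duration (from -T) is split off, and failed
        // reads like "= -1 EAGAIN" simply fail to parse as u64.
        if let Some(ret) = line.rsplit(" = ").next() {
            let ret = ret.split_whitespace().next().unwrap_or("");
            if let Ok(n) = ret.parse::<u64>() {
                reads += 1;
                bytes += n;
            }
        }
    }
    (reads, bytes)
}
```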

@Mark-Simulacrum (Member)

Ah, great, OK. Sounds like we're on the same page then.

I'm still not sure exactly where the regressions in the perf run are coming from. Cycle count regressions are limited to -doc benchmarks entirely, which is a little weird, but may point to the regressions actually being caused by the change to the standard library (e.g., driven by different hashmap layout or something).

I am inclined to move ahead -- @joshtriplett, do you have any remaining concerns with this PR? I think the regressions are fairly minimal and not readily diagnosable (cachegrind doesn't point to anything obviously responsible, for example).

@Mark-Simulacrum Mark-Simulacrum removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 1, 2021
@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 1, 2021
@joshtriplett (Member)

I reviewed both straces carefully, filtering out spurious differences, and I don't see any issues either. This seems good to go ahead with.

(I do think there are some interesting things to learn from the straces, for the purposes of future performance improvements. Thanks for collecting those!)

@bors r+

@bors (Contributor) commented Oct 4, 2021

📌 Commit 9b9c24e has been approved by joshtriplett

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 4, 2021
@joshtriplett (Member)

It looks like almost all of the changes in read calls are in the linker process, and caused by changes in ELF binaries.

The only meaningful behavior change is this:

Old:

read(5, "fn main() {\n    println!(\"Hello,"..., 46) = 45
read(5, "", 1)                    = 0

New:

read(5, "fn main() {\n    println!(\"Hello,"..., 45) = 45
read(5, "", 32)                   = 0

So the first call no longer reads an extra byte, and then the second call tries to read 32 bytes rather than 1. This matches the description of the "probe buffer" approach.

I think there'd be some value in trying to eliminate even the "probe buffer" call, but in the meantime, this seems like an improvement.

@bors (Contributor) commented Oct 4, 2021

⌛ Testing commit 9b9c24e with merge d25de31...

@bors (Contributor) commented Oct 4, 2021

☀️ Test successful - checks-actions
Approved by: joshtriplett
Pushing d25de31 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Oct 4, 2021
@bors bors merged commit d25de31 into rust-lang:master Oct 4, 2021
@rustbot rustbot added this to the 1.57.0 milestone Oct 4, 2021
@rust-timer (Collaborator)

Finished benchmarking commit (d25de31): comparison url.

Summary: This change led to small relevant regressions 😿 in compiler performance.

  • Small regression in instruction counts (up to 1.4% on incr-patched: add static arr item builds of coercions)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

@jkugelman (Contributor, author)

I think there'd be some value in trying to eliminate even the "probe buffer" call, but in the meantime, this seems like an improvement.

I created a follow-up issue: #89516.
