
Fix read_to_end to not grow an exact size buffer #89165

Merged 1 commit into rust-lang:master on Oct 4, 2021

Conversation

@jkugelman (Contributor) commented Sep 22, 2021

If you know how much data to expect and use Vec::with_capacity to pre-allocate a buffer of that capacity, Read::read_to_end will still double the buffer's capacity. It needs some spare space to perform a read, even though that read ends up returning 0.

It's a bummer to carefully pre-allocate 1GB to read a 1GB file into memory and end up using 2GB.

This fixes that behavior by special-casing a full buffer and reading into a small "probe" buffer instead. If that read returns 0, it confirms that the buffer was exactly the right size. If it doesn't, the probe buffer's contents are appended to the main buffer and the read loop continues.

Fixing this allows several workarounds in the standard library to be removed:

  • Take no longer needs to override Read::read_to_end.
  • The reservation_size callback that allowed Take to inhibit the previous over-allocation behavior isn't needed.
  • fs::read doesn't need to reserve an extra byte in initial_buffer_size.

Curiously, there was a unit test that specifically checked that Read::read_to_end does over-allocate. I removed that test, too.
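The approach described above can be sketched roughly as follows. This is an illustration only, not the actual std patch: the function name and the zero-initializing resize are simplifications, and the 32-byte probe size is taken from the strace output shown later in this thread.

```rust
use std::io::{self, Read};

// Sketch of the probe-read idea: when the buffer is exactly full,
// read into a small stack buffer to check for EOF instead of
// speculatively doubling the Vec's capacity.
fn read_to_end_probed<R: Read>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
    let start_len = buf.len();
    loop {
        if buf.len() == buf.capacity() {
            // Buffer is full: probe for EOF without growing the Vec.
            let mut probe = [0u8; 32];
            match reader.read(&mut probe)? {
                // Probe returned 0: the buffer was a perfect fit.
                0 => return Ok(buf.len() - start_len),
                // More data after all: append it and keep looping
                // (this is where the Vec finally grows, if needed).
                n => buf.extend_from_slice(&probe[..n]),
            }
        } else {
            // Spare capacity available: read directly into it.
            // (The real implementation reads into uninitialized
            // spare capacity; zero-filling keeps the sketch simple.)
            let cap = buf.capacity();
            let old_len = buf.len();
            buf.resize(cap, 0);
            let n = reader.read(&mut buf[old_len..])?;
            buf.truncate(old_len + n);
            if n == 0 {
                return Ok(buf.len() - start_len);
            }
        }
    }
}
```

With an exact-fit buffer, the Vec never reallocates: the data reads fill it, and the final probe read returns 0.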

@rust-highfive (Collaborator)

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Mark-Simulacrum (or someone else) soon.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 22, 2021
@joshtriplett (Member)

@bors try @rust-timer queue

I don't expect this to turn up any surprises, but that's why they're called "surprises".

@rust-timer (Collaborator)

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Sep 22, 2021
@bors (Contributor) commented Sep 22, 2021

⌛ Trying commit 9b9c24e with merge 4b920a40932f74a7159435b06d96cb50212514ff...

@the8472 (Member) commented Sep 22, 2021

This will cause the free functions fs::read and fs::read_to_string to do more syscalls since they now have to probe for EOF even though they already know the file size.

An alternative would be to update the documentation to recommend passing some extra capacity.
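The extra-capacity recommendation could look something like this sketch. The helper name is hypothetical; the one byte of slack mirrors the extra-byte reservation that fs::read's initial_buffer_size performed before this PR, letting the final read() observe EOF while the buffer still has spare capacity.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

// Hypothetical helper (not a std API): pre-size the buffer from the
// file's metadata, plus one byte of slack so read_to_end's final
// read() hits EOF before the buffer is ever "exactly full".
fn read_with_slack(path: &Path) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    // Fall back to 0 if metadata is unavailable (e.g. a pipe).
    let size = file.metadata().map(|m| m.len() as usize).unwrap_or(0);
    let mut buf = Vec::with_capacity(size + 1);
    file.read_to_end(&mut buf)?;
    Ok(buf)
}
```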

@bors (Contributor) commented Sep 22, 2021

☀️ Try build successful - checks-actions
Build commit: 4b920a40932f74a7159435b06d96cb50212514ff

@rust-timer (Collaborator)

Queued 4b920a40932f74a7159435b06d96cb50212514ff with parent ce45663, future comparison URL.

@jkugelman (Contributor, author) commented Sep 22, 2021

This will cause the free functions fs::read and fs::read_to_string to do more syscalls since they now have to probe for EOF even though they already know the file size.

It shouldn't cause an extra syscall for them, since they pass an exact-fit buffer. With or without this change, there will be n syscalls that return data plus a final one that returns Ok(0).

When the buffer is not an exact fit, there might be one extra read. I made sure to do the probe read only a single time, when the buffer is still at its starting capacity. And it only happens if the buffer started with extra capacity -- an empty Vec::new() with 0 capacity won't require an extra read.

@rust-timer (Collaborator)

Finished benchmarking commit (4b920a40932f74a7159435b06d96cb50212514ff): comparison url.

Summary: This change led to moderate relevant mixed results 🤷 in compiler performance.

  • Moderate improvement in instruction counts (up to -0.8% on full builds of webrender)
  • Small regression in instruction counts (up to 1.2% on incr-patched: println builds of coercions)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Sep 22, 2021
@joshtriplett (Member)

I would normally think this was noise, but it looks like a lot of small regressions about the same size. Is that a coincidence, or is this causing an issue?

Might be worth running the two toolchains with strace and seeing if the syscalls look noticeably different.

@jkugelman (Contributor, author) commented Sep 22, 2021

Might be worth running the two toolchains with strace and seeing if the syscalls look noticeably different.

Will do!

What's a good way to do that? I know strace well but this is my first contribution to this project and I'm not very familiar with testing it. Are you suggesting stracing on my computer or is there a CI tool I should be using? To test this PR I've been running ./x.py test library/std. I could strace that command -- or is there a smarter way to do it?

Also, is there some way I can re-test the regressions in the report? I don't really understand what the benchmark did or how to read the results.

@joshtriplett (Member) commented Sep 22, 2021

@jkugelman I'd suggest using https://rustc-dev-guide.rust-lang.org/compiler-debugging.html#downloading-artifacts-from-rusts-ci to install the exact before-and-after builds from the try above, the same ones used in the comparison. Then use both builds to do both a large and a small rustc compile, under strace, and try to compare the output side-by-side to see if there's an interesting difference in syscalls or their parameters. If there is a difference, you could also look at the proportion of time those extra syscalls take, and compare it to the proportion of the regressions.

@jkugelman (Contributor, author) commented Sep 22, 2021

@joshtriplett I wrote a script to count the number of read syscalls and the number of bytes read. I ran rustc +$hash on the hello-world program that cargo new spits out. The results aren't completely deterministic, so I ran it 10 times with the old toolchain and 10 with the new. The new version is slightly better on both measures on every run.

old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4405, bytes_read=30448537 (0)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (0)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (1)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (1)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (2)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (2)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (3)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (3)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (4)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (4)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (5)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (5)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (6)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (6)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (7)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (7)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4406, bytes_read=30448537 (8)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (8)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4405, bytes_read=30448537 (9)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (9)
TOTAL:
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=44044, bytes_read=304485370 (total)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=43943, bytes_read=304350310 (total)

Is this what you had in mind? If so I can go throw a big hunk of Rust code at it and see what happens.

@joshtriplett (Member)

I'm genuinely surprised that it's non-deterministic.

I'm primarily interested in seeing the actual differences in the straces. Can you upload an strace from each toolchain for comparison?

(Also, you may want to invoke the toolchain directly rather than via the rustup wrapper, if possible, to minimize noise.)

@jkugelman (Contributor, author) commented Sep 22, 2021

@joshtriplett

$ cat main.rs
fn main() {
    println!("Hello, world!");
}
$ strace -o old-ce45663e14dac3f0f58be698cc530bc2e6e21682.full.strace -f -T ~/.rustup/toolchains/ce45663e14dac3f0f58be698cc530bc2e6e21682/bin/rustc main.rs
$ strace -o new-4b920a40932f74a7159435b06d96cb50212514ff.full.strace -f -T ~/.rustup/toolchains/4b920a40932f74a7159435b06d96cb50212514ff/bin/rustc main.rs

old-ce45663e14dac3f0f58be698cc530bc2e6e21682.full.strace.gz
new-4b920a40932f74a7159435b06d96cb50212514ff.full.strace.gz

@jkugelman (Contributor, author) commented Sep 22, 2021

I found it really difficult to visually diff them. I spent about half an hour trying to clean them up and hide all the spurious differences so I could find the "real" diffs. The (apparent) non-determinism was a headache so I went with counting overall reads instead.

I don't know if you know an easy trick to compare the files. If not, I've got a good supply of elbow grease. I can go do more thorough analysis. I'm sure you don't want to spend hours on my PR. (And I really appreciate your help so far!)

@Mark-Simulacrum (Member)

Hm, so locally I seem to get entirely reproducible results:

$ sudo perf stat -r 30 -e syscalls:sys_enter_read ~/.rustup/toolchains/ce45663e14dac3f0f58be698cc530bc2e6e21682/bin/rustc t.rs
 Performance counter stats for '/home/mark/.rustup/toolchains/ce45663e14dac3f0f58be698cc530bc2e6e21682/bin/rustc t.rs' (30 runs):
              4370      syscalls:sys_enter_read
          0.109497 +- 0.000566 seconds time elapsed  ( +-  0.52% )
$ sudo perf stat -r 30 -e syscalls:sys_enter_read ~/.rustup/toolchains/4b920a40932f74a7159435b06d96cb50212514ff/bin/rustc t.rs
 Performance counter stats for '/home/mark/.rustup/toolchains/4b920a40932f74a7159435b06d96cb50212514ff/bin/rustc t.rs' (30 runs):
              4360      syscalls:sys_enter_read
          0.108526 +- 0.000544 seconds time elapsed  ( +-  0.50% )

This seems like it's both deterministic (I ran the above command several times; same result) and gives us the (expected) slight decrease in syscalls.

A good number of the regressions seem to be in -doc builds, and in some of those it looks like query counts went down (presumably since this PR deletes a function from std?). That creates some amount of noise in e.g. cachegrind diffs, as we're just doing different/less work. My inclination is to not consider these as blockers for this PR. However, I would like to get some sense why you're seeing nondeterminism with the syscalls -- what system are you running on? Can you try with the above perf commands and maybe see if those are OK?

@jkugelman (Contributor, author)

The perf runs on my system are consistent like yours and I get the exact same results: 4,370 vs. 4,360.

I fixed my strace script -- it was counting "exited with 0" and "SIGCHLD" lines. Now it's only counting reads and gives 100% consistent results, same as perf:

  • 4,370 vs. 4,360 reads
  • 30,434,827 vs. 30,421,321 bytes read

Script attached for transparency: count-reads.gz
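Since the actual script is only available as that attachment, here is a hypothetical reconstruction of its counting logic in Rust. It assumes a single-process strace log (no PID prefix from -f) and relies on strace printing the return value after " = ":

```rust
// Hypothetical reconstruction (not the attached count-reads script):
// tally read() syscalls and bytes returned in an strace log, skipping
// "+++ exited with 0 +++" and "--- SIGCHLD ---" lines, which the
// original buggy version of the script counted by mistake.
fn count_reads(strace_log: &str) -> (u64, u64) {
    let mut reads = 0;
    let mut bytes = 0;
    for line in strace_log.lines() {
        // Only actual read() calls, not exit-status or signal lines.
        if !line.trim_start().starts_with("read(") {
            continue;
        }
        // strace prints the return value after " = "; a trailing
        // "<0.000123>" duration (from -T) is split off, and failed
        // reads like "= -1 EAGAIN" simply fail to parse as u64.
        if let Some(ret) = line.rsplit(" = ").next() {
            let ret = ret.split_whitespace().next().unwrap_or("");
            if let Ok(n) = ret.parse::<u64>() {
                reads += 1;
                bytes += n;
            }
        }
    }
    (reads, bytes)
}
```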

@Mark-Simulacrum (Member)

Ah, great, OK. Sounds like we're on the same page then.

I'm still not sure exactly where the regressions in the perf run are coming from. Cycle count regressions are limited to -doc benchmarks entirely, which is a little weird, but may point to the regressions actually being caused by the change to the standard library (e.g., driven by different hashmap layout or something).

I am inclined to move ahead -- @joshtriplett, do you have any remaining concerns with this PR? I think the regressions are fairly minimal and not readily diagnosable (cachegrind doesn't point to anything obviously responsible, for example).

@Mark-Simulacrum Mark-Simulacrum removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 1, 2021
@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 1, 2021
@joshtriplett (Member)

I reviewed both straces carefully, filtering out spurious differences, and I don't see any issues either. This seems good to go ahead with.

(I do think there are some interesting things to learn from the straces, for the purposes of future performance improvements. Thanks for collecting those!)

@bors r+

@bors (Contributor) commented Oct 4, 2021

📌 Commit 9b9c24e has been approved by joshtriplett

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 4, 2021
@joshtriplett (Member)

It looks like almost all of the changes in read calls are in the linker process, and caused by changes in ELF binaries.

The only meaningful behavior change is this:

Old:

read(5, "fn main() {\n    println!(\"Hello,"..., 46) = 45
read(5, "", 1)                    = 0

New:

read(5, "fn main() {\n    println!(\"Hello,"..., 45) = 45
read(5, "", 32)                   = 0

So the first call no longer reads an extra byte, and then the second call tries to read 32 bytes rather than 1. This matches the description of the "probe buffer" approach.

I think there'd be some value in trying to eliminate even the "probe buffer" call, but in the meantime, this seems like an improvement.

@bors (Contributor) commented Oct 4, 2021

⌛ Testing commit 9b9c24e with merge d25de31...

@bors (Contributor) commented Oct 4, 2021

☀️ Test successful - checks-actions
Approved by: joshtriplett
Pushing d25de31 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Oct 4, 2021
@bors bors merged commit d25de31 into rust-lang:master Oct 4, 2021
@rustbot rustbot added this to the 1.57.0 milestone Oct 4, 2021
@rust-timer (Collaborator)

Finished benchmarking commit (d25de31): comparison url.

Summary: This change led to small relevant regressions 😿 in compiler performance.

  • Small regression in instruction counts (up to 1.4% on incr-patched: add static arr item builds of coercions)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

@jkugelman (Contributor, author)

I think there'd be some value in trying to eliminate even the "probe buffer" call, but in the meantime, this seems like an improvement.

I created a follow-up issue: #89516.
