
Improve SliceExt::binary_search performance #45333

Merged: 1 commit into rust-lang:master on Nov 11, 2017
Conversation

@alkis (Contributor) commented Oct 16, 2017

Improve the performance of binary_search by reducing the number of unpredictable conditional branches in the loop. In addition, improve the benchmarks to measure performance on sorted arrays that fit in the L1, L2, and L3 caches, with and without duplicates.

Before:

```
test slice::binary_search_l1                               ... bench:          48 ns/iter (+/- 1)
test slice::binary_search_l2                               ... bench:          63 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups                     ... bench:          36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         153 ns/iter (+/- 6)
```

After:

```
test slice::binary_search_l1                               ... bench:          15 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          23 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         100 ns/iter (+/- 17)
test slice::binary_search_l1_with_dups                     ... bench:          15 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          23 ns/iter (+/- 0)
test slice::binary_search_l3_with_dups                     ... bench:          98 ns/iter (+/- 14)
```

@rust-highfive (Collaborator):

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

```
@@ -20,6 +20,6 @@ extern crate test;
mod any;
mod hash;
mod iter;
mod mem;
```
Member:

Why is this removed?

Contributor Author (@alkis):

It doesn't exist.

Member:

@alexcrichton How was this able to get past CI? Do we not run benches as part of libcore's tests?

Member:

I don't think we run benches in CI. The file was moved in #44943.


```rust
#[bench]
fn binary_search(b: &mut Bencher) {
    let mut v = Vec::new();
```
Member:

(0..999).collect::<Vec<_>>() will be more efficient and somewhat easier to read (at least to me).

Contributor Author (@alkis):

Done.

```rust
    }
    let mut i = 0;
    b.iter(move || {
        i += 1299827;
```
Member:

Could we get a comment here as to what this number is intended to mean?

Contributor Author (@alkis) commented Oct 16, 2017:

It's a large prime (large compared to 999) used to form a poor man's LCG. Maybe I should use librand instead?

Member:

This seems fine to me -- and better than random numbers when dealing with benchmarks -- but I would like a comment.

Contributor Author (@alkis):

I made it into a proper LCG and linked to where I got the constants.
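For readers following along, here is a minimal sketch of the benchmark shape under discussion. The function name, the 999 modulus, and the LCG constants (one commonly published 32-bit set, from Numerical Recipes) are illustrative assumptions, not necessarily the exact values the PR landed with:

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

#[bench]
fn binary_search_l1(b: &mut Bencher) {
    // A sorted array small enough to stay resident in the L1 cache.
    let v: Vec<usize> = (0..999).collect();
    let mut lcg: usize = 1;
    b.iter(|| {
        // Linear congruential generator: cheap, deterministic, evenly
        // spread keys, so the benchmark measures the search, not the RNG.
        lcg = lcg.wrapping_mul(1664525).wrapping_add(1013904223);
        black_box(v.binary_search(&(lcg % 999)))
    });
}
```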

@kennytm added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Oct 17, 2017
@bluss (Member) commented Oct 17, 2017

I like it; trading an unpredictable branch for a predictable one (the bound check) is neat.

It doesn't exit when it finds the first equal element -- so does this PR change the result for some inputs?

@alkis (Contributor Author) commented Oct 17, 2017

The biggest benefit comes from replacing all the unpredictable branches inside the loop with conditional moves. This is otherwise known as "branchless binary search"; it has been shown to be faster than traditional binary search and as fast as linear search for small inputs. There is a recent paper that covers different layouts for comparison-based searching: https://arxiv.org/abs/1509.05053. It covers a lot more than branchless binary search, but it is nevertheless a good read.

Answering your question: the PR should not change the results for any input. I added some extra test cases to increase confidence.
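To make the loop shape concrete, here is a sketch of a branchless binary search written with safe indexing; the libcore version differs in detail (it uses get_unchecked after establishing the bounds invariant), so treat this as an illustration of the technique rather than the PR's exact code:

```rust
use std::cmp::Ordering::{self, Equal, Greater, Less};

// Branchless binary search: the loop always runs about log2(len)
// iterations, and the only data-dependent decision is which value `base`
// takes, which compiles to a conditional move instead of a branch.
fn binary_search_by<T, F>(s: &[T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&T) -> Ordering,
{
    let mut size = s.len();
    if size == 0 {
        return Err(0);
    }
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // `base` either stays put or advances to `mid`; no early exit
        // when an Equal element is seen.
        let cmp = f(&s[mid]);
        base = if cmp == Greater { base } else { mid };
        size -= half;
    }
    // One final comparison decides between a hit and the insertion point.
    let cmp = f(&s[base]);
    if cmp == Equal {
        Ok(base)
    } else {
        Err(base + (cmp == Less) as usize)
    }
}
```

Usage matches the std signature: binary_search_by(&[1, 2, 4], |e| e.cmp(&3)) returns Err(2), the index where 3 would be inserted.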

@arthurprs (Contributor) commented Oct 17, 2017

Cool! Although I think this needs to be tested further with other payloads, like (u64, u64), where the comparison isn't compiled into a single cmov*. My impression is that it's slower when that's not the case, as it's forced to go through all the branches and then the conditional move.

@alkis (Contributor Author) commented Oct 17, 2017

@arthurprs I don't think anything will change if (u64, u64) is used (or even strings, for that matter). The cmov is used to select the new base after the comparison is done and its result is placed in the flags register: the new base is either base or mid. If the comparison operator has branches of its own that are not predictable, there is not much we can do about it at this level; it has to be done by code external to binary_search_by.

@arthurprs (Contributor):
Good point. I also ran a few benches just to confirm and it checks out.

@bluss (Member) commented Oct 17, 2017

@alkis there are no test cases that cover this — inputs with duplicate elements. I've had time to check this on my machine now, and the new algorithm does produce different results for some inputs:

The example (playground link) is searching for 0 in a vector of zeros.

  1. It's important to walk into breaking changes, however small, with open eyes
  2. Now we know that it exists — can we find out if it matters? We've had bugs filed for breakage for less.

(Less important, but still interesting: the old algorithm finishes as quickly as possible on this input, while for the new algorithm it is a worst case.)
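A minimal sketch of the divergence being reported; the exact index each implementation returns is an implementation detail, so the assertions below only pin down what both must satisfy:

```rust
fn main() {
    // Searching for 0 in a vector of zeros: every element matches.
    let v = vec![0u8; 13];
    let r = v.binary_search(&0);
    // The old loop stops at the first probed midpoint that compares Equal;
    // the new loop always runs about log2(len) steps and can settle on a
    // different matching index. Both satisfy the documented contract.
    let i = r.unwrap();
    assert_eq!(v[i], 0);
}
```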

@alkis (Contributor Author) commented Oct 17, 2017

@bluss: thanks for testing this. I don't think these changes are material, and they are very unlikely to result in breakage. Let me explain why:

  • old: returns the first match it happens to probe. Which match is implementation-defined, and it is not a well-defined element: it is neither the lower bound nor the element before the upper bound. It is hard to make a case for using such a match with any reasonable expectations.
  • new: if a match exists it returns the lower bound. This is a stronger guarantee than the old implementation. It is still an implementation detail, but it has the risk of falling into Hyrum's Law in the future.

In addition, the two implementations will behave differently when the list is not sorted under the partial order defined by the predicate. I also think that this is not important.

On performance: it is true that the new algorithm always does log2(size) steps, while the old one does at most log2(size) steps. This is both good and bad: bad because it can be slower in contrived situations, and good because its performance is predictable.

I can add a few more benchmarks if you think they add value:

  • string slices as keys
  • with dups
  • with different sizes: the array fits in the L1, L2, or L3 cache

About possible breakages, if you have suggestions on how to investigate them before merging I would be glad to take a look.

@alkis (Contributor Author) commented Oct 17, 2017

I improved the benchmarks a bit. Please take another look.

@Mark-Simulacrum (Member):

I think we should definitely get a crater run on this to at least note any test failures that would be caused by landing this PR. Is it possible to keep the old behavior without removing the performance gains this PR makes? I'm somewhat hesitant to change the behavior of this, and I'm not sure I entirely agree with all of your points about this being unlikely to hurt anyone in practice.

I agree that the new behavior on equal data is perhaps more elegant, but the old behavior (as I understand) is stable, if defined only by the algorithm used. Since it's stable, changing it now would lead me to believe that someone could depend on it -- even if they shouldn't be -- and we should be very hesitant to change it. Perhaps a survey of Rust code on GH that uses binary search could help here -- how much, if any, of it will change behavior given stretches of equal data? If we determine that probably none, then I'd be more inclined to make this change. If at least a couple different crates, then I'm pretty much against this.

With regards to unsorted data, I don't think there's any problem in changing behavior there -- a binary search on unsorted data isn't going to be reliable, at all, and this is something we explicitly document.

So, to summarize: I think that we should be very hesitant to land this without more data that this doesn't break people in practice (crater run and GH survey). It's "silent" breakage in that code continues to compile but changes in behavior, which is nearly the worst kind of breakage we can introduce.

r? @BurntSushi

cc @rust-lang/libs -- how do we feel about the potential breakage here? Author's thoughts and the breakage are outlined in #45333 (comment), mine are above.

@sfackler (Member):

I'd be interested in seeing what a crater run turned up, but I'm not particularly worried about this change in behavior. We've never documented which of multiple equal values is arrived at, and it seems like you're in a bit of a weird place if you have that kind of data anyway.

@retep998 (Member):

Honestly, if it was going to pick a value from among equals, then I'd expect it to pick the lower bound. Because the current implementation picks effectively at random (even if it is deterministic), there's absolutely no way I'd be able to rely on which value it picked. The lower bound is still deterministic, so it won't break any code that wants deterministic behavior, but I have a really hard time imagining what sort of code could manage to rely on which specific element the current algorithm picks at random. Of course we should still do a crater run to be sure, but if we can't find any legitimate cases, then we should absolutely make this change.

@arthurprs (Contributor) commented Oct 18, 2017

I found myself wanting the lower bound of the equal subsequence recently. On the other hand I'd argue against guaranteeing this sort of behavior.

Also, this discussion resembles the stable/unstable sort thing.

@kennytm (Member) commented Oct 18, 2017

The documentation of binary_search never guaranteed which index will be returned when there's an equal range. The new behavior is compatible with the existing definition.

Looks up a series of four elements. The first is found, with a uniquely determined position; the second and third are not found; the fourth could match any position in [1, 4].

```rust
let s = [0, 1, 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55];

assert_eq!(s.binary_search(&13),  Ok(9));
assert_eq!(s.binary_search(&4),   Err(7));
assert_eq!(s.binary_search(&100), Err(13));
let r = s.binary_search(&1);
assert!(match r { Ok(1...4) => true, _ => false, });
```

@alkis (Contributor Author) commented Oct 18, 2017

@arthurprs I think not having lower_bound and upper_bound is a hole in the std library. We can definitely add those. I can send a separate PR if there is consensus. (A sketch of what they would compute follows below.)
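For context, here is a minimal sketch of what lower_bound and upper_bound conventionally compute on a sorted slice. These are hypothetical free functions, not std API:

```rust
// First index i such that s[i] >= x; s.len() if no such element exists.
fn lower_bound<T: Ord>(s: &[T], x: &T) -> usize {
    let (mut lo, mut hi) = (0, s.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if s[mid] < *x { lo = mid + 1 } else { hi = mid }
    }
    lo
}

// First index i such that s[i] > x; s.len() if no such element exists.
fn upper_bound<T: Ord>(s: &[T], x: &T) -> usize {
    let (mut lo, mut hi) = (0, s.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if s[mid] <= *x { lo = mid + 1 } else { hi = mid }
    }
    lo
}
```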

@bluss (Member) commented Oct 18, 2017

ping @frankmcsherry, you might be interested in this

@frankmcsherry (Contributor) commented Oct 18, 2017

This appeals to me greatly. I can't recall if I complained to @bluss and that is why I got the ping, but (i) better performance is great, and (ii) being able to get to the lower bound is important for lots of applications. While this PR doesn't commit to that, it does mean that I could in principle start using binary_search which I couldn't do before (I mean, I could, but then I'd have to do geometric search backwards, and .. =/).

My understanding is that this could be slower if the comparison is very expensive, much like quicksort can be slower than mergesort when the comparison is expensive. Benchmarking on large, randomly permuted String slices could tease that out (if each comparison is now an access to a random location in GBs of data). This would be more visible if there were an early match (e.g. "find a string starting with 'q'").

Also, if I understand the linked article, there is the potential downside that most architectures will not prefetch through a computed address, as produced by cmov. In the article they recover the performance with explicit prefetching (see Figure 8), but it seems plausible that there could be more memory stalls with the conditional move approach than with a branching approach. No opinion on which is better / worse, though.

Edit: Also, my understanding is that a lot of the "branchless" benefits go away if your comparison function has a branch in it, e.g. for String.

@Mark-Simulacrum Rust has changed behavior a few times (for me wrt sort) and as the docs don't commit to any specific semantics other than "a" match, I'd be optimistic that this could land. I'm pro "check and see what breaks and try hard to avoid any", having just complained about that to Niko, but I'm also pro "actually commit to some semantics" and the first element is what I typically find needed.

@kennytm The counter-point that I made recently (even though I'd love to have this land asap), is that no matter what the docs say if the change breaks code it breaks code. Not knowing anything about this stuff, a crater run seems like a great thing to do to grok whether people unknowingly over-bound to semantics that weren't documented. If they did, breaking their stuff and saying "sorry, but" still hurts (them, and the perception of stability).

@frankmcsherry (Contributor):

```rust
if cmp == Equal { Ok(base) } else { Err(base + (cmp == Less) as usize) }
```

I don't know either way, but is (cmp == Less) as usize idiomatic? Is the compiler unable to turn a match statement into the right assembly? I didn't even realize that Rust commits to this being 0 or 1.
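Answering the side question: Rust does guarantee that a bool casts to 0 (false) or 1 (true), and an equivalent match compiles to the same thing. A hypothetical side-by-side (finish is an illustrative name, not from the PR):

```rust
use std::cmp::Ordering::{self, Equal, Less};

// `(cmp == Less) as usize` and the match below express the same offset;
// `bool as usize` is guaranteed to be 0 or 1.
fn finish(cmp: Ordering, base: usize) -> Result<usize, usize> {
    if cmp == Equal {
        Ok(base)
    } else {
        Err(base + match cmp { Less => 1, _ => 0 })
    }
}
```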

@frankmcsherry (Contributor):

In talking out the issue of intended semantics, what are the use cases where a collection may have multiple matching keys and returning "an arbitrary element" is acceptable? I'm familiar with "collection has distinct keys" in which things are unambiguous, and "collection has multiple keys, return the range please". I admit to never having seen an application that calls for what Rust currently offers; if someone could clue me in I would appreciate it!

@bluss (Member) commented Oct 18, 2017

I think this implementation is really neat but just using it in a new set of methods (for lower and upper bounds), leaving binary search unchanged sounds like the best solution. It makes the faster implementation available, it makes the useful lower bound algorithm available, and it avoids doing more steps than needed in binary_search. (Even if it looks like this extra work often has “negative cost”, we know there are cases where it has a cost.)

@alkis (Contributor Author) commented Oct 18, 2017

@bluss it would be extremely interesting to see cases that consistently regress only because the old implementation bails out early. Also, I don't think having a slow binary_search with a note saying "if you want performance use lower_bound" is a good choice for the standard library.

@alkis (Contributor Author) commented Oct 18, 2017

@frankmcsherry

  1. I don't see how this can be slower if the comparison is very expensive, unless you are counting on getting lucky and landing on the correct element before doing the full log2(N) comparisons. @arthurprs benchmarked a tuple (which presumably is both more expensive and has an extra branch in the comparison) and found it performing the same as with this implementation.
  2. The "branchless" benefits might go away if the comparison has branches in it. Maybe the CPU can detect dependent branches in this case and elide the cost of the second one; we can only know with a benchmark.
  3. (cmp == Less) as usize: not sure if this is idiomatic or not. I would be happy to switch to the idiomatic version if it has the same performance and is as succinct.
  4. For sets with multiple matches, the usual operation you want is equal_range. Perhaps we need to add that as well, on top of lower_bound and upper_bound (see the sketch below).
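A sketch of that composition, reusing the hypothetical lower_bound and upper_bound from earlier in the thread; every index in the returned half-open range holds an element equal to x:

```rust
use std::ops::Range;

// equal_range as the interval between the two bounds; empty if x is absent.
fn equal_range<T: Ord>(s: &[T], x: &T) -> Range<usize> {
    lower_bound(s, x)..upper_bound(s, x)
}
```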

@arthurprs (Contributor) commented Oct 18, 2017

It's clear to me that lower_bound and upper_bound are desirable, whether or not binary_search is deprecated. I feel that the name (self-documenting/discoverable) is reason enough not to deprecate it, though.

@alkis If you have "duplicates" and/or an expensive cmp function the new code might be slower.

@frankmcsherry (Contributor) commented Oct 18, 2017

@alkis

  1. It's not necessarily about getting very lucky. If you are doing a search by a key, rather than by element, you can find a key very early in the search and then spend the rest of your time finding the first element. The example from above was sorted strings and searching using just the first character to find a string starting with 'q'.

Edit: btw, I much prefer your code to what exists at the moment, which I don't use because it doesn't do what I need. I'm just trying to call out the things folks should worry about and be sure they are ok with.

@bluss (Member) commented Oct 18, 2017

@alkis To be brief, you can (at least I could) reproduce the slower case using let v = (0..size).map(|i| i / 32).collect::<Vec<_>>(); for the sorted sequence in the benchmark in the PR.

@alkis (Contributor Author) commented Oct 18, 2017

@frankmcsherry and @bluss see updated benchmarks:

Before:

```
test slice::binary_search_l1                               ... bench:          48 ns/iter (+/- 1)
test slice::binary_search_l2                               ... bench:          63 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         152 ns/iter (+/- 12)
test slice::binary_search_l1_with_dups                     ... bench:          36 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          64 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         153 ns/iter (+/- 6)
```

After:

```
test slice::binary_search_l1                               ... bench:          17 ns/iter (+/- 1)
test slice::binary_search_l2                               ... bench:          24 ns/iter (+/- 2)
test slice::binary_search_l3                               ... bench:         139 ns/iter (+/- 27)
test slice::binary_search_l1_with_dups                     ... bench:          17 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups                     ... bench:          25 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         137 ns/iter (+/- 21)
```

I don't see the regression.

Furthermore, let's step back a bit. Do we expect no regressions at all? I do not think that is realistic. If we accept that there are going to be some regressions, we have to use Amdahl's Law to assess the tradeoff. The performance increase on arrays that fit in the L1 or L2 cache is about 2x, which is not trivial. So unless you think the regressing cases represent the majority of uses, I find it unwise to block this PR.

Think of it in reverse: if the code in this PR were already the current code, would we approve a change back to today's implementation on the grounds that it is faster in the contrived cases you mention?

@alkis (Contributor Author) commented Oct 18, 2017

FWIW: I think the biggest risk of this change is that unit tests will break if they depend on the element/position returned by the current implementation of binary_search.

@arthurprs (Contributor) commented Nov 11, 2017

@alkis I'd like to point out that with this change using a binary search in the BTreeMap nodes could be a win. Maybe you want to take a stab at that.

@ishitatsuyuki (Contributor) commented Nov 11, 2017

How? Binary trees are not contiguous arrays, and the optimization techniques should be different.

EDIT: I see your point. A B-tree uses partially contiguous arrays.

@arthurprs (Contributor):
They're B-trees, not binary trees.

@alkis (Contributor Author) commented Nov 11, 2017

@ishitatsuyuki perhaps @arthurprs is talking about the array in each BTree node itself. Not sure what is happening there, but it might be using linear search instead of binary search.

```rust
}

#[test]
// When this test changes, a crater run is highly advisable.
```
Member:

The test is named pretty obviously, but please leave a comment stating that this is testing implementation-specific behavior of what to do in the case of equivalent elements; and that it is OK to break this but (as you've already mentioned) this should be accompanied with a crater run.

Contributor Author (@alkis):

Done.
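A sketch of the kind of test being requested; the name and the exact assertions are illustrative assumptions:

```rust
// This test pins down implementation-specific behavior: the documentation
// does not guarantee which of several equal elements binary_search returns.
// Breaking it is acceptable, but when this test changes a crater run is
// highly advisable.
#[test]
fn test_binary_search_implementation_detail() {
    let b = [1, 1, 1, 1, 1, 1, 1, 1];
    let i = b.binary_search(&1).unwrap();
    // Whatever index comes back, it must point at a matching element.
    assert_eq!(b[i], 1);
}
```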

@bluss (Member) commented Nov 11, 2017

@bors r+

Thanks a lot @alkis for this improvement!

@bors (Contributor) commented Nov 11, 2017

📌 Commit 2ca111b has been approved by bluss

@bors (Contributor) commented Nov 11, 2017

⌛ Testing commit 2ca111b with merge 24bb4d1...

bors added a commit that referenced this pull request Nov 11, 2017
Improve SliceExt::binary_search performance
@alkis (Contributor Author) commented Nov 11, 2017

Thanks for the thorough reviews!

@bors (Contributor) commented Nov 11, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: bluss
Pushing 24bb4d1 to master...

@bors merged commit 2ca111b into rust-lang:master Nov 11, 2017
alkis added a commit to alkis/superslice-rs that referenced this pull request Nov 30, 2017
@EFanZh mentioned this pull request Jun 27, 2020
Amanieu added a commit to Amanieu/rust that referenced this pull request Jul 26, 2024
This restores the original binary search implementation from rust-lang#45333
which has the nice property of having a loop count that only depends on
the size of the slice. This, along with explicit conditional moves
from rust-lang#128250, means that the entire binary search loop can be perfectly
predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is
known at compile-time. This results in a very compact code sequence of
3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
bors added a commit to rust-lang-ci/rust that referenced this pull request Jul 30, 2024
Rewrite binary search implementation

This PR builds on top of rust-lang#128250, which should be merged first.

This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
Fixes rust-lang#115271
Labels
relnotes: Marks issues that should be documented in the release notes of the next release.
S-waiting-on-author: This is awaiting some action (such as code changes or more information) from the author.