slice::sort_by and slice::unstable_sort_by have bad performance when targeting wasm #27

fitzgen · 2018-01-17T21:01:25Z

...but not when targeting native code.

I am very confused.

My naive Quicksort outperforms them both by a couple orders of magnitude(!!!!!) regardless which browser it is executed in (which tells me this is a Rust/LLVM issue, not wasm engine issue). This Quicksort is very naive: it is not pattern defeating, nor optimistically using insertion sort for small and mostly sorted ranges, or any of the other things that the built-in sort is doing.

https://github.com/fitzgen/source-map-mappings/blob/master/src/sort.rs

In fact, just replacing the custom swap routine with core's slice::swap causes a huge slow down. Again, only with wasm, not native code. That makes no sense to me.

Why is this particular performance gap occurring?
In the more general case, how do we debug and profile these cases where perf(wasm) != perf(native) and seemingly irrelevant changes have large effects on performance?

(I guess the first thing to do in this particular case is diff the disassembly and see how the emitted code differs? What other techniques do we have?)

The text was updated successfully, but these errors were encountered:

mgattozzi · 2018-01-17T21:48:19Z

Maybe @stjepang could shed some possible light as to why? He did implement a faster sorting algorithm in rust-lang/rust#38192.

Diggsey · 2018-01-17T23:12:33Z

If it's affecting swap too, then it sounds like a codegen issue - could it be failing to inline functions called with references? Alternatively, JIT compilation could be confusing the benchmarking you are doing.

fitzgen · 2018-01-17T23:27:59Z

I tried to avoid JIT issues by doing 5 warm up iterations, followed by 100 iterations that actually get collected. There is a yield to the event loop in between each iteration. Each iteration takes ~500ms, so it should be enough time and opportunity for the optimized wasm to get swapped in.

pepyakin · 2018-01-18T07:44:51Z

Difference in generated wast:

One notable difference: in slice::swap version there is a lot of memcpy/memmoves, in the same time in custom swap version they seem to be unrolled to loads and stores.

pepyakin · 2018-01-18T12:12:11Z

Second thing that I've noticed, in slice::swap there are more bounds checks than in custom swap.

Although, I think it is more probable that these memcpys are responsible for the slowdown.

ghost · 2018-01-18T20:41:32Z

Ok, I'm still very new to wasm (must admit - running Rust in the browser is fun! :)). In fact, this is the first time ever I've compiled something to wasm. Here's my setup.

hello.rs

use std::mem;
use std::cmp::{self, Ordering};

#[inline(always)]
fn swap<T>(slice: &mut [T], x: usize, y: usize) {
    debug_assert!(x < slice.len(), "(x = {}) < (slice.len() = {})", x, slice.len());
    debug_assert!(y < slice.len(), "(y = {}) < (slice.len() = {})", y, slice.len());

    if x == y {
        return;
    }

    let (x, y) = (cmp::min(x, y), cmp::max(x, y));
    let (low, high) = slice.split_at_mut(y);

    debug_assert!(x < low.len());
    debug_assert!(0 < high.len());

    unsafe {
        mem::swap(low.get_unchecked_mut(x), high.get_unchecked_mut(0));
    }
}

/// Partition the `slice[p..r]` about some pivot element in that range, and
/// return the index of the pivot.
#[inline(always)]
fn partition<T: Ord>(slice: &mut [T], p: usize, r: usize) -> usize {
    let pivot = (p + r) / 2;
    swap(slice, pivot, r);

    let mut i = (p as isize) - 1;

    for j in p..r {
        if let Ordering::Greater = unsafe {
            debug_assert!(j < slice.len());
            debug_assert!(r < slice.len());
            Ord::cmp(slice.get_unchecked(j), slice.get_unchecked(r))
        } {
            continue;
        }

        i += 1;
        swap(slice, i as usize, j);
    }

    swap(slice, (i + 1) as usize, r);
    return (i + 1) as usize;
}

/// Recursive quick sort implementation with all the extra parameters that we
/// want to hide from callers to give them better ergonomics.
fn do_quick_sort<T: Ord>(slice: &mut [T], p: usize, r: usize) {
    if p < r {
        let q = partition(slice, p, r);
        do_quick_sort(slice, p, q.saturating_sub(1));
        do_quick_sort(slice, q + 1, r);
    }
}

/// Do a quick sort on the given slice.
pub fn quick_sort<T: Ord>(slice: &mut [T]) {
    if slice.is_empty() {
        return;
    }

    let len = slice.len();
    do_quick_sort(slice, 0, len - 1);
}

#[no_mangle]
pub extern fn hello() {
    let mut v = (0u64..1_000_000).map(|x| x * x * x * 18913515181).collect::<Vec<_>>();
    v.sort(); // 200 ms
    // v.sort_unstable(); // 70 ms
    // quick_sort(&mut v); // 150 ms
}

fn main() {}

hello.html

<html>
  <body>
    <script>
      fetch("hello.wasm")
        .then(response => response.arrayBuffer())
        .then(bytes => WebAssembly.instantiate(bytes, {}))
        .then(results => results.instance)
        .then(mod => {
          var start = new Date();
          mod.exports.hello();
          console.log(new Date() - start);
        });
    </script>
  </body>
</html>

Compile: rustc --target wasm32-unknown-unknown hello.rs -O
Run: firefox hello.html (and look at the console)

Results for sorting 1M random u64s:

sort: 200 ms
sort_unstable: 70 ms
quick_sort: 150 ms

And here's the same benchmark with compilation to native:

hello.rs

use std::mem;
use std::cmp::{self, Ordering};

#[inline(always)]
fn swap<T>(slice: &mut [T], x: usize, y: usize) {
    debug_assert!(x < slice.len(), "(x = {}) < (slice.len() = {})", x, slice.len());
    debug_assert!(y < slice.len(), "(y = {}) < (slice.len() = {})", y, slice.len());

    if x == y {
        return;
    }

    let (x, y) = (cmp::min(x, y), cmp::max(x, y));
    let (low, high) = slice.split_at_mut(y);

    debug_assert!(x < low.len());
    debug_assert!(0 < high.len());

    unsafe {
        mem::swap(low.get_unchecked_mut(x), high.get_unchecked_mut(0));
    }
}

/// Partition the `slice[p..r]` about some pivot element in that range, and
/// return the index of the pivot.
#[inline(always)]
fn partition<T: Ord>(slice: &mut [T], p: usize, r: usize) -> usize {
    let pivot = (p + r) / 2;
    swap(slice, pivot, r);

    let mut i = (p as isize) - 1;

    for j in p..r {
        if let Ordering::Greater = unsafe {
            debug_assert!(j < slice.len());
            debug_assert!(r < slice.len());
            Ord::cmp(slice.get_unchecked(j), slice.get_unchecked(r))
        } {
            continue;
        }

        i += 1;
        swap(slice, i as usize, j);
    }

    swap(slice, (i + 1) as usize, r);
    return (i + 1) as usize;
}

/// Recursive quick sort implementation with all the extra parameters that we
/// want to hide from callers to give them better ergonomics.
fn do_quick_sort<T: Ord>(slice: &mut [T], p: usize, r: usize) {
    if p < r {
        let q = partition(slice, p, r);
        do_quick_sort(slice, p, q.saturating_sub(1));
        do_quick_sort(slice, q + 1, r);
    }
}

/// Do a quick sort on the given slice.
pub fn quick_sort<T: Ord>(slice: &mut [T]) {
    if slice.is_empty() {
        return;
    }

    let len = slice.len();
    do_quick_sort(slice, 0, len - 1);
}

fn main() {
    let mut v = (0u64..1_000_000).map(|x| x * x * x * 18913515181).collect::<Vec<_>>();
    v.sort(); // 70 ms
    // v.sort_unstable(); // 38 ms
    // quick_sort(&mut v); // 100 ms
}

Compile: rustc hello.rs -O
Run: time ./hello

Results:

sort: 70 ms
sort_unstable: 38 ms
quick_sort: 100 ms

These numbers look completely reasonable. Am I doing something wrong? :)

fitzgen · 2018-01-19T00:22:57Z

Those numbers do look reasonable, however when I replace calls to my quick_sort with .sort_by or .sort_unstable_by I don't get reasonable numbers like that.

Perhaps it is related to the size of the data being sorted and the comparator functions that are sorting them? https://github.com/fitzgen/source-map-mappings/blob/master/src/comparators.rs#L49-L91

The differences in codegen that @pepyakin found are also interesting, and seem promising for understanding better what's going on here.

pepyakin · 2018-01-19T20:06:49Z

I had a brief look at the problem today again: I took hello.rs by @stjepang, converted it to no_std and there were no memcpys with slice::swap 👻

fitzgen · 2018-02-22T19:33:29Z

I just retested on the latest nightly, and it seems the problem has gone away. sort_unstable_by is now much faster than my naive quick sort 🎉

mgattozzi · 2018-02-22T20:40:43Z

Nice! I wouldn't be surprised if lld support helped it out.

pepyakin · 2018-02-22T20:42:46Z

Isn't LLD not landed yet?

mgattozzi · 2018-02-22T21:29:42Z

Woops thought it was for some reason.

fitzgen · 2018-02-22T21:31:53Z

LLVM6 was merged, and it might have to do with that.

/me shrugs

mgattozzi · 2018-02-22T21:55:47Z

The mysteries of compilers and how no one is truly sure why things get fixed, they just do.

fitzgen closed this as completed Feb 22, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slice::sort_by and slice::unstable_sort_by have bad performance when targeting wasm #27

slice::sort_by and slice::unstable_sort_by have bad performance when targeting wasm #27

fitzgen commented Jan 17, 2018

mgattozzi commented Jan 17, 2018

Diggsey commented Jan 17, 2018

fitzgen commented Jan 17, 2018

pepyakin commented Jan 18, 2018

pepyakin commented Jan 18, 2018 •

edited

Loading

ghost commented Jan 18, 2018 •

edited by ghost

Loading

fitzgen commented Jan 19, 2018

pepyakin commented Jan 19, 2018

fitzgen commented Feb 22, 2018

mgattozzi commented Feb 22, 2018

pepyakin commented Feb 22, 2018

mgattozzi commented Feb 22, 2018

fitzgen commented Feb 22, 2018

mgattozzi commented Feb 22, 2018

slice::sort_by and slice::unstable_sort_by have bad performance when targeting wasm #27

slice::sort_by and slice::unstable_sort_by have bad performance when targeting wasm #27

Comments

fitzgen commented Jan 17, 2018

mgattozzi commented Jan 17, 2018

Diggsey commented Jan 17, 2018

fitzgen commented Jan 17, 2018

pepyakin commented Jan 18, 2018

pepyakin commented Jan 18, 2018 • edited Loading

ghost commented Jan 18, 2018 • edited by ghost Loading

fitzgen commented Jan 19, 2018

pepyakin commented Jan 19, 2018

fitzgen commented Feb 22, 2018

mgattozzi commented Feb 22, 2018

pepyakin commented Feb 22, 2018

mgattozzi commented Feb 22, 2018

fitzgen commented Feb 22, 2018

mgattozzi commented Feb 22, 2018

pepyakin commented Jan 18, 2018 •

edited

Loading

ghost commented Jan 18, 2018 •

edited by ghost

Loading