Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust insufficiently optimizes loop { match { } } state machines #80630

Open
tuxmark5 opened this issue Jan 2, 2021 · 20 comments
Open

Rust insufficiently optimizes loop { match { } } state machines #80630

tuxmark5 opened this issue Jan 2, 2021 · 20 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@tuxmark5
Copy link

tuxmark5 commented Jan 2, 2021

This is how a typical state machine might look in Rust:

#[inline(never)]
fn print(s: &str) {
    println!("{}", s);
}

#[inline(never)]
fn exec(mut s: i32) {
    loop {
        match s {
            0 => { print("0"); s = 2; },
            1 => { print("1"); s = 3; },
            2 => { print("2"); s = 1; },
            _ => return
        }
    }
}

fn main() {
    exec(0);
}

One would expect that the first jump from function start into one of the states (0, 1, 2 or _) would be compiled as a jumptable or a series of cmp/conditional jump instructions and once the execution is in one of the states the remaining jumps would be deterministic (non-conditional) jumps.

And this is how equivalent code in gcc is optimized.

However Rust/LLVM at the end of each state performs needless comparisons to determine to which label to jump to next, even though the jump target can be computed statically. You can inspect the generated assembly in Rust Playground.

This is probably more of a LLVM issue rather than Rust directly, but it's important to get this fixed/optimized, because now it's impossible to efficiently write such state machines in Rust (because there is no goto)

Meta

Tested with all versions available in Rust playground:

  • 1.49.0
  • 1.50.0-beta.2 (2020-12-31 25b3db3)
  • 1.51.0-nightly (2020-12-30 e226704)
@tuxmark5 tuxmark5 added the C-bug Category: This is a bug. label Jan 2, 2021
@the8472
Copy link
Member

the8472 commented Jan 2, 2021

One would expect that the first jump from function start into one of the states (0, 1, 2 or _) would be compiled as a jumptable or a series of cmp/conditional jump instructions

Atleast that part happens once you add more states.

@nagisa nagisa added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Jan 2, 2021
@nagisa
Copy link
Member

nagisa commented Jan 2, 2021

Didn't look too deeply into it but the IR fed to LLVM around the loop-match construct is not that different between rustc and clang. However with clang it becomes a switchtable at llvm-ir level, and we rely on the codegen backend to do so with rust https://godbolt.org/z/9Erv94

FWIW from what I can tell the additional comparisons can be blamed on s = <new> state setting and the phi that results from it.

@jyn514 jyn514 added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jan 3, 2021
@oli-obk oli-obk added the A-mir-opt Area: MIR optimizations label Jan 4, 2021
@oli-obk
Copy link
Contributor

oli-obk commented Jan 4, 2021

I thought we already had an optimization for this

    bb1: {
        switchInt(_1) -> [0_i32: bb3, 1_i32: bb5, 2_i32: bb7, otherwise: bb2]; // scope 0 at src/lib.rs:10:13: 10:14
    }
    bb4: {
        StorageDead(_3);                 // scope 0 at src/lib.rs:10:29: 10:30
        StorageDead(_4);                 // scope 0 at src/lib.rs:10:30: 10:31
        StorageDead(_2);                 // scope 0 at src/lib.rs:10:30: 10:31
        _1 = const 2_i32;                // scope 0 at src/lib.rs:10:32: 10:37
        goto -> bb1;                     // scope 0 at src/lib.rs:9:9: 14:10
    }

just looking locally at this, the goto -> bb1 should become goto -> bb7. Basically if a block assigns a constant to a local and then jumps to a block which does not touch that local and ends up conditionally branching on the local, then we can copy all statements from the branching block to the current block and make it an unconditional jump to the right target.

@simonvandel
Copy link
Contributor

AFAICT this is optimized with #80475

@nikic
Copy link
Contributor

nikic commented Jan 4, 2021

LLVM can optimize this with the -C llvm-args=-jump-threading-across-loop-headers option. By default LLVM does not perform jump threading across loop headers, because it cannot reliably determine profitability for this case right now. Doing this naively results in natural loops becoming irreducible, which will pessimize all following loop optimizations.

@camelid
Copy link
Member

camelid commented Mar 20, 2021

AFAICT this is optimized with #80475

That PR is merged now. Did that fix this issue?

@tuxmark5
Copy link
Author

Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.

@oli-obk
Copy link
Contributor

oli-obk commented Mar 22, 2021

Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.

The playground does not actually run mir optimizations of level 3 or above, so you would need to run this locally with nightly and use -Zmir-opt-level=3

Doing this naively results in natural loops becoming irreducible, which will pessimize all following loop optimizations.

This is worrysome. Our optimization as written won't do anything fancy, so it will make loops more complex. The perf results on that optimization are mostly in the noise (ignoring the CTFE stress tests), though without a dedicated compile-time and runtime test, it's hard to tell. Considering that the compiler bootstrap did not regress, I don't think the opt messes with anything too badly.

@camelid
Copy link
Member

camelid commented Mar 23, 2021

Testing it with (2021-03-20 61edfd5) on the playground shows no improvement.

The playground does not actually run mir optimizations of level 3 or above, so you would need to run this locally with nightly and use -Zmir-opt-level=3

Or you can use Godbolt: https://rust.godbolt.org/z/doK6W9jPT

(I don't what the Assembly is supposed to look like so I can't tell if this is fixed or not.)

@tuxmark5
Copy link
Author

tuxmark5 commented Mar 23, 2021

Unfortunately it still doesn't seem fixed. Assembly from godbolt link:

example::exec:  << The start of exec() function in rust
        sub     rsp, 8
        mov     dword ptr [rsp + 4], edi
.LBB2_1: << The block that implements the 'switch' 
        mov     eax, dword ptr [rsp + 4]
        mov     dword ptr [rsp], eax
        test    eax, eax
        je      .LBB2_3
        jmp     .LBB2_9
.LBB2_9: 
        mov     eax, dword ptr [rsp]
        sub     eax, 1
        je      .LBB2_5
        jmp     .LBB2_10
.LBB2_10:
        mov     eax, dword ptr [rsp]
        sub     eax, 2
        je      .LBB2_7
        jmp     .LBB2_2
.LBB2_2:
        pop     rax
        ret
.LBB2_3: << case 0
        lea     rdi, [rip + .L__unnamed_2]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 2 << explicit store to stack: s = 2
        jmp     .LBB2_1 << BAD: jumps back to the 'switch', but s value is known
.LBB2_5: << case 1
        lea     rdi, [rip + .L__unnamed_3]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 3 << explicit store to stack: s = 3
        jmp     .LBB2_1 << BAD: jumps back to the 'switch' , but s value is known
.LBB2_7: << case 2
        lea     rdi, [rip + .L__unnamed_4]
        mov     esi, 1
        call    qword ptr [rip + example::print@GOTPCREL]
        mov     dword ptr [rsp + 4], 1 << explicit store to stack: s = 1
        jmp     .LBB2_1 << BAD: jumps back to the 'switch' , but s value is known

I've annotated it a bit with << to make it easier to understand. Basically all the match arms eventually jump back to the 'switch' block, where s is compared/tested.

@oli-obk
Copy link
Contributor

oli-obk commented Mar 25, 2021

Maybe LLVM "de-optimizes" our MIR again. Not sure, we'll have to look at the MIR before LLVM, at LLVM-IR without opts and at LLVM-IR with opts to be sure. I'm not sure how to coax godbolt into doing MIR dumps though.

@the8472
Copy link
Member

the8472 commented Mar 25, 2021

Add --emit=mir

@tuxmark5
Copy link
Author

The same can be seen in MIR too:

bb0: {
    switchInt(_1) -> [0_i32: bb2, 1_i32: bb4, 2_i32: bb6, otherwise: bb1]; // scope 0 at ./example.rs:10:13: 10:14
}

bb1: {
    return;                          // scope 0 at ./example.rs:16:2: 16:2
}

bb2: {
    _4 = const "0";                  // scope 0 at ./example.rs:10:26: 10:29
    _3 = _4;                         // scope 0 at ./example.rs:10:26: 10:29
    _2 = print(move _3) -> bb3;      // scope 0 at ./example.rs:10:20: 10:30
}

bb3: {
    _1 = const 2_i32;                // scope 0 at ./example.rs:10:32: 10:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

bb4: {
    _7 = const "1";                  // scope 0 at ./example.rs:11:26: 11:29
    _6 = _7;                         // scope 0 at ./example.rs:11:26: 11:29
    _5 = print(move _6) -> bb5;      // scope 0 at ./example.rs:11:20: 11:30
}

bb5: {
    _1 = const 3_i32;                // scope 0 at ./example.rs:11:32: 11:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

bb6: {
    _10 = const "2";                 // scope 0 at ./example.rs:12:26: 12:29
    _9 = _10;                        // scope 0 at ./example.rs:12:26: 12:29
    _8 = print(move _9) -> bb7;      // scope 0 at ./example.rs:12:20: 12:30
}

bb7: {
    _1 = const 1_i32;                // scope 0 at ./example.rs:12:32: 12:37
    goto -> bb0;                     // scope 0 at ./example.rs:9:9: 14:10
}

After seeing the MIR I thought maybe the issue was the extra "suffix" block after each print operation, so i switcher around assignment and print (instead of doing print("0"); s = 2; in match arms I tried s = 2; print("0");). This yielded simpler MIR, but still the same issue was present:

bb0: {
    switchInt(_1) -> [0_i32: bb2, 1_i32: bb3, 2_i32: bb4, otherwise: bb1]; // scope 0 at ./example.rs:10:13: 10:14
}

bb1: {
    return;                          // scope 0 at ./example.rs:16:2: 16:2
}

bb2: {
    _1 = const 2_i32;                // scope 0 at ./example.rs:10:20: 10:25
    _4 = const "0";                  // scope 0 at ./example.rs:10:33: 10:36
    _3 = _4;                         // scope 0 at ./example.rs:10:33: 10:36
    _2 = print(move _3) -> bb0;      // scope 0 at ./example.rs:10:27: 10:37
}

bb3: {
    _1 = const 3_i32;                // scope 0 at ./example.rs:11:20: 11:25
    _7 = const "1";                  // scope 0 at ./example.rs:11:33: 11:36
    _6 = _7;                         // scope 0 at ./example.rs:11:33: 11:36
    _5 = print(move _6) -> bb0;      // scope 0 at ./example.rs:11:27: 11:37
}

bb4: {
    _1 = const 1_i32;                // scope 0 at ./example.rs:12:20: 12:25
    _10 = const "2";                 // scope 0 at ./example.rs:12:33: 12:36
    _9 = _10;                        // scope 0 at ./example.rs:12:33: 12:36
    _8 = print(move _9) -> bb0;      // scope 0 at ./example.rs:12:27: 12:37
}

@oli-obk
Copy link
Contributor

oli-obk commented Mar 25, 2021

I just looked at the optimization itself. I do not think the issue is related to the print. In fact the order as written is the only one the optimization targets right now. So I'm wondering if the MIR that the optimization sees has more statements that get filtered out later. In order to debug this I believe we need some new mir-opt tests and potentially some tracing statements in the optimization

@cynecx
Copy link
Contributor

cynecx commented May 5, 2021

A new optimization pass is currently in review upstream: https://reviews.llvm.org/D99205

@nikic
Copy link
Contributor

nikic commented Aug 28, 2021

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

@tuxmark5
Copy link
Author

tuxmark5 commented Aug 28, 2021

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

Optimization breaks if you mark exec as pub.

EDIT: It only seems to work, when LLVM can infer the initial value/state of the state machine, which doesn't happen often in practice when dealing with state machines. Marking exec as pub means that it can be called from other compilation units, hence there is no way to know the initial value of the s.

@amehsan
Copy link

amehsan commented Jan 11, 2022

With new DFA jump threading pass: https://rust.godbolt.org/z/v1a4c689E

Optimization breaks if you mark exec as pub.

EDIT: It only seems to work, when LLVM can infer the initial value/state of the state machine, which doesn't happen often in practice when dealing with state machines. Marking exec as pub means that it can be called from other compilation units, hence there is no way to know the initial value of the s.

I have not looked into this test case (just saw this issue today), but the limitation that you have mentioned here is not hard to fix. We will look into it and get back to you

last5bits pushed a commit to llvm/llvm-project that referenced this issue May 26, 2022
Responding to a feature request from the Rust community:

rust-lang/rust#80630

    void foo(X) {
      for (...)
	switch (X)
	  case A
	    X = B
	  case B
	    X = C
    }

Even though the initial switch value is non-constant, the switch
statement can still be threaded: the initial value will hit the switch
statement but the rest of the state changes will proceed by jumping
unconditionally.

The early predictability check is relaxed to allow unpredictable values
anywhere, but later, after the paths through the switch statement have
been enumerated, no non-constant state values are allowed along the
paths. Any state value not along a path will be an initial switch value,
which can be safely ignored.

Differential Revision: https://reviews.llvm.org/D124394
mem-frob pushed a commit to draperlaboratory/hope-llvm-project that referenced this issue Oct 7, 2022
Responding to a feature request from the Rust community:

rust-lang/rust#80630

    void foo(X) {
      for (...)
	switch (X)
	  case A
	    X = B
	  case B
	    X = C
    }

Even though the initial switch value is non-constant, the switch
statement can still be threaded: the initial value will hit the switch
statement but the rest of the state changes will proceed by jumping
unconditionally.

The early predictability check is relaxed to allow unpredictable values
anywhere, but later, after the paths through the switch statement have
been enumerated, no non-constant state values are allowed along the
paths. Any state value not along a path will be an initial switch value,
which can be safely ignored.

Differential Revision: https://reviews.llvm.org/D124394
@Noratrieb Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023
@bryanpkc
Copy link

bryanpkc commented May 5, 2023

@tuxmark5 Does D124394 solve this issue for you?

@tuxmark5
Copy link
Author

tuxmark5 commented May 5, 2023

@bryanpkc This looks like exactly what I need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests