Suboptimal lowering of `vget_high_XXX` on AArch64 #58323

Maratyszcza · 2022-10-12T18:13:52Z

Clang/LLVM lowers a = vget_high_XXX(b) NEON intrinsics to EXT Va.16B, Vb.16B, Vb.16B, #8 instead of the DUP Va.1D, Vb.D[1] suggested by ARM NEON intrinsics guide. This lowering has two drawbacks:

On AArch64 cores which split NEON instructions into 64-bit microoperations, EXT Va.16B, Vb.16B, Vb.16B, #8 generates two 64-bit microoperations instead of one.
On all AArch64 cores EXT Va.16B, Vb.16B, Vb.16B, #8 leaves the upper part of the destination register initialized (but unused), thereby preventing its power-gating.

The text was updated successfully, but these errors were encountered:

llvmbot · 2022-10-12T18:33:09Z

@llvm/issue-subscribers-backend-aarch64

rubdos · 2024-08-22T14:59:43Z

I'd like to see whether I can implement this, but my experience with LLVM is zero (apart from compiling and some packaging).

Cc. @kbeyls, @davemgreen, @Tarinn

@kbeyls said on IRC that sometimes EXT might be faster than DUP, but didn't remember when or why.

I came up with this test case in vcombine.ll, which would cover our case in curve25519-dalek.

define <2 x i32> @vget_high32(ptr %A) nounwind {
; CHECK-LABEL: vget_high32
; CHECK-NOT: vst
; CHECK-NOT: vmov
; CHECK-LE-NOT: vld1.64 {d16, d17}, [r0]
; CHECK: vldr  d16, [r0, #8]
; CHECK: dup
	%tmp1 = load <4 x i32>, ptr %A
        %tmp2 = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
        ret <2 x i32> %tmp2
}

davemgreen · 2024-08-22T16:49:03Z

Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good.

There are tablegen patterns for "extract_subvector", that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and a probably unused/dead:

llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.td

Line 9463 in a625435

def : Pat<(v8i8 (extract_subvector (v16i8 FPR128:$Rn), (i64 1))),

There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed?

rubdos · 2024-08-23T12:06:16Z

Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good.

Yes, that's what's meant here, as far as I understand.

I added armv7 with neon as a separate tab too, but I don't observe any neon instructions inserted for that target. https://godbolt.org/z/qP6bs4sq6
I'm unfamiliar with llc, so I'm probably hitting the wrong buttons.

There are tablegen patterns for "extract_subvector", that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and a probably unused/dead:

llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.td

Line 9463 in a625435
def : Pat<(v8i8 (extract_subvector (v16i8 FPR128:$Rn), (i64 1))),

There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed?

I'll have to study that file a bit more to understand what exactly needs changing, will continue on Monday! Thanks for the pointers!

davemgreen · 2024-08-26T20:56:59Z

Yeah - Arm has a different register layout to AArch64, where the Q0 register is made up of D0 and D1 so the extract should be as cheap as just using the second register. (In that case it also looks like it is splitting the load into two register).

github-actions bot added the new issue label Oct 12, 2022

EugeneZelenko added backend:AArch64 and removed new issue labels Oct 12, 2022

rubdos mentioned this issue May 22, 2024

NEON backend for aarch64 dalek-cryptography/curve25519-dalek#457

Open

5 tasks

EugeneZelenko added the missed-optimization label Aug 22, 2024

davemgreen assigned rubdos Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suboptimal lowering of `vget_high_XXX` on AArch64 #58323

Suboptimal lowering of `vget_high_XXX` on AArch64 #58323

Maratyszcza commented Oct 12, 2022

llvmbot commented Oct 12, 2022

rubdos commented Aug 22, 2024

davemgreen commented Aug 22, 2024

rubdos commented Aug 23, 2024

davemgreen commented Aug 26, 2024

Suboptimal lowering of vget_high_XXX on AArch64 #58323

Suboptimal lowering of vget_high_XXX on AArch64 #58323

Comments

Maratyszcza commented Oct 12, 2022

llvmbot commented Oct 12, 2022

rubdos commented Aug 22, 2024

davemgreen commented Aug 22, 2024

rubdos commented Aug 23, 2024

davemgreen commented Aug 26, 2024

Suboptimal lowering of `vget_high_XXX` on AArch64 #58323

Suboptimal lowering of `vget_high_XXX` on AArch64 #58323