Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suboptimal lowering of vget_high_XXX on AArch64 #58323

Open
Maratyszcza opened this issue Oct 12, 2022 · 5 comments
Open

Suboptimal lowering of vget_high_XXX on AArch64 #58323

Maratyszcza opened this issue Oct 12, 2022 · 5 comments

Comments

@Maratyszcza
Copy link

Clang/LLVM lowers a = vget_high_XXX(b) NEON intrinsics to EXT Va.16B, Vb.16B, Vb.16B, #8 instead of the DUP Va.1D, Vb.D[1] suggested by ARM NEON intrinsics guide. This lowering has two drawbacks:

  • On AArch64 cores which split NEON instructions into 64-bit microoperations, EXT Va.16B, Vb.16B, Vb.16B, #8 generates two 64-bit microoperations instead of one.
  • On all AArch64 cores EXT Va.16B, Vb.16B, Vb.16B, #8 leaves the upper part of the destination register initialized (but unused), thereby preventing its power-gating.
@llvmbot
Copy link
Collaborator

llvmbot commented Oct 12, 2022

@llvm/issue-subscribers-backend-aarch64

@rubdos
Copy link

rubdos commented Aug 22, 2024

I'd like to see whether I can implement this, but my experience with LLVM is zero (apart from compiling and some packaging).

Cc. @kbeyls, @davemgreen, @Tarinn

@kbeyls said on IRC that sometimes EXT might be faster than DUP, but didn't remember when or why.

I came up with this test case in vcombine.ll, which would cover our case in curve25519-dalek.

define <2 x i32> @vget_high32(ptr %A) nounwind {
; CHECK-LABEL: vget_high32
; CHECK-NOT: vst
; CHECK-NOT: vmov
; CHECK-LE-NOT: vld1.64 {d16, d17}, [r0]
; CHECK: vldr  d16, [r0, #8]
; CHECK: dup
	%tmp1 = load <4 x i32>, ptr %A
        %tmp2 = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
        ret <2 x i32> %tmp2
}

@davemgreen
Copy link
Collaborator

Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good.

There are tablegen patterns for "extract_subvector", that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and a probably unused/dead:

def : Pat<(v8i8 (extract_subvector (v16i8 FPR128:$Rn), (i64 1))),

There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed?

@rubdos
Copy link

rubdos commented Aug 23, 2024

Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good.

Yes, that's what's meant here, as far as I understand.

I added armv7 with neon as a separate tab too, but I don't observe any neon instructions inserted for that target. https://godbolt.org/z/qP6bs4sq6
I'm unfamiliar with llc, so I'm probably hitting the wrong buttons.

There are tablegen patterns for "extract_subvector", that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and a probably unused/dead:

llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.td

Line 9463 in a625435
def : Pat<(v8i8 (extract_subvector (v16i8 FPR128:$Rn), (i64 1))),

There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed?

I'll have to study that file a bit more to understand what exactly needs changing, will continue on Monday! Thanks for the pointers!

@davemgreen
Copy link
Collaborator

Yeah - Arm has a different register layout to AArch64, where the Q0 register is made up of D0 and D1 so the extract should be as cheap as just using the second register. (In that case it also looks like it is splitting the load into two register).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants