
Hardware acceleration #47

Open
TailsFanLOL opened this issue Apr 30, 2024 · 8 comments

Comments

TailsFanLOL commented Apr 30, 2024

Hey! Can this use hardware instructions for conversion? Intel CPUs support hardware conversion since 2013, and the new 12-th gen also has support for arithmetic (I think?). Other architectures had that a while ago.

This might be possible without compiler support using embedded C code, but wouldn't that be out of scope for this?

TailsFanLOL (Author) commented

Nvm, 12th gen had this for a brief moment, added by accident, but it got removed in later revisions. That probably means it is going to show up in other upcoming Core series CPUs; it was already a thing for Sapphire Rapids Xeons.

I also opened an upstream issue.

swdee commented Aug 30, 2024

In my use case we have float16 tensor outputs from an NPU on the RK3588 (an Arm processor). Arm does have NEON SIMD instructions to hardware-accelerate the conversion from fp16 to fp32, but we can't make use of those extensions from Go, as the compiler does not support SIMD instructions.

Via CGO you can interface with the Arm Compute Library to make use of these instructions; however, for our use case, which involves converting 856,800 bytes from uint16->fp32 per video frame, this is much slower than sticking with pure Go in this library.

However, better performance is still attainable by using a precalculated lookup table for the uint16->fp32 conversion.

On the RK3588 we get a 35% performance improvement.

BenchmarkF16toF32NormalConversion-8 150 7872802 ns/op 1720348 B/op 1 allocs/op
BenchmarkF16toF32LookupConversion-8 218 5123550 ns/op 1720342 B/op 1 allocs/op

And on a Threadripper workstation we get a 69% improvement.

BenchmarkF16toF32NormalConversion-20 1302 916041 ns/op 1720322 B/op 1 allocs/op
BenchmarkF16toF32LookupConversion-20 3919 275437 ns/op 1720335 B/op 1 allocs/op

To create such a lookup table we simply precompute it in our application with:

import "github.com/x448/float16"

var f16LookupTable [65536]float32

func init() {
	// precompute float16 lookup table for faster conversion to float32
	for i := range f16LookupTable {
		f16 := float16.Frombits(uint16(i))
		f16LookupTable[i] = f16.Float32()
	}
}

Then we convert our output buffer from uint16 to fp32 with:

func convertBufferToFloat32(float16Buf []uint16) []float32 {
	float32Buf := make([]float32, len(float16Buf))

	for i, val := range float16Buf {
		float32Buf[i] = f16LookupTable[val]
	}

	return float32Buf
}

x448 (Owner) commented Sep 9, 2024

@swdee Thanks for the suggestion! I will try this and see how it goes.

swdee commented Sep 22, 2024

@x448 We have a CGO version, worked on together with @TailsFanLOL and discussed here.

TailsFanLOL (Author) commented

Golang declined the request to add this type to the language itself.

@swdee, perhaps we should improve what you have slapped together and merge it here, adding stuff from this project's wishlist while we are at it. My current concern with your fork is that -march=native -mtune=native only works properly with GCC on GNU/Linux. On other libcs (Alpine, Android), often only -mtune works, so the hardware isn't utilized to its full potential, and in my experience it does nothing at all when containerized. Perhaps one could use inline assembly for each architecture, but I don't know assembly. I might or might not be able to look into it on the weekends.
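One way to sidestep the -march=native portability problem in Go is to not rely on compiler flags at all and instead pick an implementation per architecture with build constraints, keeping the pure-Go path as the fallback. A hypothetical file layout (a sketch only; these names are illustrative, not code from the fork):

```go
// --- convert_arm64.go (built only on arm64 with cgo enabled) ---
//go:build arm64 && cgo

package float16

// convertBuffer dispatches to the NEON-backed CGO implementation.
func convertBuffer(dst []float32, src []uint16) { /* cgo NEON path */ }

// --- convert_fallback.go (built everywhere else) ---
//go:build !arm64 || !cgo

package float16

// convertBuffer is the portable pure-Go implementation.
func convertBuffer(dst []float32, src []uint16) { /* pure-Go path */ }
```

With this layout the toolchain selects the right file at build time, so containerized or musl-based builds simply get the portable path instead of silently losing the tuning flags.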

swdee commented Oct 3, 2024

I actually did an assembly version using NEON instructions for ARM. Unfortunately you can't use Go's assembler, as its instruction set does not support these SIMD instructions, so I wrote the assembly inline in C via CGO.

#include <stdint.h>
#include <stddef.h>

void float16_to_float32_buffer_asm(const uint16_t* input, float* output, size_t count) {
    size_t i = 0;

    // Process in chunks of 8 float16 values
    for (; i + 8 <= count; i += 8) {
        __asm__ volatile (
            "ld1 {v0.8h}, [%[input]], #16\n"  // Load 8 float16 values into NEON register v0
            "fcvtl2 v1.4s, v0.8h\n"           // Convert the upper 4 float16 values to float32 in v1
            "fcvtl v2.4s, v0.4h\n"            // Convert the lower 4 float16 values to float32 in v2
            "st1 {v1.4s}, [%[output]], #16\n" // Store the upper 4 float32 values into output buffer
            "st1 {v2.4s}, [%[output]], #16\n" // Store the lower 4 float32 values into output buffer
            : [input] "+r" (input), [output] "+r" (output)
            :
            : "v0", "v1", "v2", "memory"
        );
    }

    // Handle remaining elements (if count is not a multiple of 8)
    for (; i < count; i++) {
        __asm__ volatile (
            "ld1 {v0.h}[0], [%[input]]\n"     // Load one float16 value into v0
            "fcvt s1, h0\n"                   // Convert float16 to float32
            "str s1, [%[output]]\n"           // Store the converted float32 value to output
            : [input] "+r" (input), [output] "+r" (output)
            :
            : "v0", "v1", "memory"
        );

        input++;   // Increment the input pointer to next float16 value
        output++;  // Increment the output pointer to the next float32 value
    }
}

func F16toF32BufferASM(float16Buf []uint16, float32Buf []float32) {
	C.float16_to_float32_buffer_asm(
		(*C.uint16_t)(unsafe.Pointer(&float16Buf[0])), // Pointer to the input buffer
		(*C.float)(unsafe.Pointer(&float32Buf[0])),    // Pointer to the output buffer
		C.size_t(len(float16Buf)),                     // Number of elements to convert
	)
}

func BenchmarkF16toF32CGOASMConversion(b *testing.B) {
	// Load the buffer outside the loop to avoid reloading it during each iteration.
	float16Buf := loadBuffer()

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		float32Buf := make([]float32, len(float16Buf))

		// Use the NEON-accelerated function
		float16.F16toF32BufferASM(float16Buf, float32Buf)
	}
}

On the RK3588 the benchmark output of this is:

$ go test -bench=^BenchmarkF16toF32 -run=^$
goos: linux
goarch: arm64
pkg: github.com/x448/float16
BenchmarkF16toF32LookupConversion-8                  192           5471118 ns/op
BenchmarkF16toF32NormalConversion-8                  134           9456042 ns/op
BenchmarkF16toF32CGOSingleConversion-8                 8         138150327 ns/op
BenchmarkF16toF32CGOVectorConversion-8                48          22830134 ns/op
BenchmarkF16toF32CGOBufferSingleConversion-8         828           1748324 ns/op
BenchmarkF16toF32CGOBufferVectorConversion-8         778           1466371 ns/op
BenchmarkF16toF32CGOASMConversion-8                  973           1425002 ns/op
PASS
ok      github.com/x448/float16 13.991s

The speed is slightly better than the C version we talked about in the golang issue. In my own project I stuck with the C version, purely because it's easier to deal with than the ASM version.

swdee commented Oct 3, 2024

Here is one for x86 using AVX2.


#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>
#include <math.h>  // For INFINITY, NAN, ldexpf

void float16_to_float32_buffer_avx2(const uint16_t* input, float* output, size_t count) {
    size_t i = 0;

    // AVX2 processes 8 float32 values at a time
    for (; i + 8 <= count; i += 8) {
        // Load 8 x uint16_t values into a 128-bit register
        __m128i in = _mm_loadu_si128((const __m128i*)(input + i));

        // Convert 8 float16 values to float32 (F16C intrinsic)
        __m256 result = _mm256_cvtph_ps(in);

        // Store the resulting 8 float32 values
        _mm256_storeu_ps(output + i, result);
    }

    // Handle remaining elements if count is not a multiple of 8
    for (; i < count; i++) {
        uint16_t f16 = input[i];
        float sign = (f16 & 0x8000) ? -1.0f : 1.0f;
        uint32_t exp = (f16 & 0x7C00) >> 10;
        uint32_t mant = f16 & 0x03FF;

        if (exp == 0) {
            // Zero or subnormal: value = mant * 2^-24 (yields +/-0.0f when mant == 0)
            output[i] = sign * ldexpf((float)mant, -24);
        } else if (exp == 0x1F) {
            // Infinity when the mantissa is zero, NaN otherwise
            output[i] = (mant != 0) ? NAN : sign * INFINITY;
        } else {
            // Normal: (1 + mant/1024) * 2^(exp-15) == (mant + 1024) * 2^(exp-25);
            // ldexpf avoids the undefined negative shift in (1 << (exp - 15))
            output[i] = sign * ldexpf((float)(mant + 1024), (int)exp - 25);
        }
    }
}

// F16toF32BufferAVX2 converts float16 to float32 using SIMD on x86 platforms
func F16toF32BufferAVX2(float16Buf []uint16, float32Buf []float32) {
	C.float16_to_float32_buffer_avx2(
		(*C.uint16_t)(unsafe.Pointer(&float16Buf[0])), // Pointer to the input buffer
		(*C.float)(unsafe.Pointer(&float32Buf[0])),    // Pointer to the output buffer
		C.size_t(len(float16Buf)),                     // Number of elements to convert
	)
}


func BenchmarkF16toF32AVX2Conversion(b *testing.B) {
	// Load the buffer outside the loop to avoid reloading it during each iteration.
	float16Buf := loadBuffer()

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		float32Buf := make([]float32, len(float16Buf))

		// Use the AVX2-accelerated function
		float16.F16toF32BufferAVX2(float16Buf, float32Buf)
	}
}

With results on my workstation:

$ go test -bench=^BenchmarkF16toF32 -run=^$
goos: linux
goarch: amd64
pkg: github.com/x448/float16
cpu: AMD Ryzen Threadripper PRO 5975WX 32-Cores     
BenchmarkF16toF32LookupConversion-20                        3188            358243 ns/op
BenchmarkF16toF32NormalConversion-20                        1249            920966 ns/op
BenchmarkF16toF32CGOSingleConversion-20                       69          17451686 ns/op
BenchmarkF16toF32CGOVectorConversion-20                      565           2107527 ns/op
BenchmarkF16toF32CGOBufferSingleConversion-20               3459            289367 ns/op
BenchmarkF16toF32CGOBufferVectorConversion-20               5122            227647 ns/op
BenchmarkF16toF32AVX2Conversion-20                          6366            182427 ns/op
PASS
ok      github.com/x448/float16 8.530s

@TailsFanLOL
Copy link
Author

TailsFanLOL commented Oct 3, 2024

I actually did an Assembly version using NEON instructions for ARM.

That's great. Unfortunately NEON isn't fully IEEE compatible, as it handles subnormals as equal to zero (with some other minor differences).
As far as I am aware, Armv8's FPU offers IEEE-compatible instructions. It isn't as fast, though.
