
proposal: spec: float16 #67127

Closed
TailsFanLOL opened this issue May 1, 2024 · 19 comments
Labels
LanguageChange Suggested changes to the Go language LanguageChangeReview Discussed by language change review committee Proposal Proposal-FinalCommentPeriod
Comments

@TailsFanLOL

          We aren't going to add this type to the language without widespread hardware support for it.

Originally posted by @ianlancetaylor in #32022 (comment)

Relevant x86 instructions for using float16 as a storage-only format (the F16C conversion extension) were added to AMD and Intel CPUs around 2013. As for arithmetic and complex32 support, it was added in the Sapphire Rapids Xeon series, and it was also briefly present in Alder Lake by accident, where it could be enabled via certain BIOSes before being removed in later revisions (which probably means it is coming to upcoming Core CPUs).
As for other architectures, it has been added in certain ARM, RISC-V, and Power ISA CPUs.

Can you guys please give us at least a limited, storage-only implementation of float16? It is quite useful for games and AI. Several third-party packages already provide it, and it is used in popular formats like CBOR, which this could speed up.

@seankhliao
Member

What would storage only mean in terms of the language?

@TailsFanLOL
Author

It would mean not much in the math package, and no arithmetic operations; only conversions to and from the other float types, so values can be stored on a drive or transferred over the network. We should probably do a fuller implementation eventually if possible, though that would not be the top priority.

@Jorropo
Member

Jorropo commented May 1, 2024

@TailsFanLOL you want to reopen #32022 to create two new builtin primitive types?

type float16 float16
// complex32 is made of two [float16] for real and imaginary part.
type complex32 complex32

But only have float16 ←→ float32/float64 conversions and no operation support?

@TailsFanLOL
Author

TailsFanLOL commented May 1, 2024

Well, that pretty much sums it up. Again, arithmetic would be nice, but I would like a basic conversion-only implementation first to get things done.

EDIT: there's also bfloat16, which has a different format and is intended only for neural-network computation, and is also supported on some Xeons, but I guess we won't really need that yet.

@Jorropo
Member

Jorropo commented May 1, 2024

Which float16 are we talking about?

#32022 (comment) gives the specification of bfloat16, which is often used on GPUs and in machine learning (which makes sense given their earlier comments).
However, it is different from IEEE 754's half-precision format.

Edit: just saw the edit in #67127 (comment).

@Jorropo
Member

Jorropo commented May 1, 2024

So here is a more complete picture, AFAICT:


Add two new builtin types:

// float16 is the set of all IEEE754 half precision numbers.
type float16 float16
// complex32 is made of two [float16] for real and imaginary part respectively.
type complex32 complex32

These types would not support any operations except conversion back and forth to the float32 and float64 kinds.

Add new functions to the math package:

func Float16bits(f float16) uint16
func Float16frombits(f uint16) float16

This would allow using them for compact data interchange, as intended by IEEE 754.

@randall77
Contributor

Why not just bits/frombits functions that go directly from float32 to uint16?

@Jorropo
Member

Jorropo commented May 1, 2024

A liberal interpretation of:

An implementation may combine multiple floating-point operations into a single fused operation, possibly across statements, and produce a result that differs from the value obtained by executing and rounding the instructions individually.

would allow us to implement float16 operations using float32, converting back and forth only when storing to memory.
So I don't see a valid reason to add a float16 type but not allow doing math with it.

I don't yet see value in adding float16 or bfloat16, as their use case for a smaller wire representation is niche (and could be served by a third-party package or by doing #67127 (comment)).
They are useful for optimizing memory throughput in computation-heavy routines, but since the Go compiler lacks an autovectorizer and SIMD intrinsics, you usually hit instruction-dispatch bottlenecks first, so I would like to see an example of Go code that gets a significant performance improvement from using float16 over float32.
SIMD can be done in assembly to reach memory-throughput bottlenecks, but then do you need language support?

@TailsFanLOL
Author

TailsFanLOL commented May 1, 2024

> They are useful for optimizing memory throughput in computation-heavy routines, but since the go compiler lacks an autovectorizer or SIMD intrinsics, you usually hit instruction dispatch bottlenecks so I would like some example of go code that see significant performance improvement by using float16 over float32.

I am in the process of making an ASCII/CP437 art generator, and it uses a score from 0 to 255 to determine how well a character fits a chunk of the image. It doesn't need much accuracy, but it often has values like "82.4" and "82.235", where you need a float to determine which one fits the image chunk better. It runs a lot of goroutines at once (user selectable), so using half as many bytes for the inputs/outputs and the internal variables would significantly reduce the amount of memory needed, and less memory needed means the RAM bus goes vroom-vroom. This is also useful in video games for things like mob health. Oh, and remember what I said about the machine learning stuff?

Oh, and as for "the Go compiler lacks...", well, there's gccgo. Though it will probably receive this feature something like 5 years after it is added.

> SIMD can be done in assembly and reach memory throughput bottlenecks, but then do you need language support ?

Not everybody can be expected to understand assembly. This is supposed to be an easy language to learn (as statically typed languages go).

EDIT: never mind, I kinda forgot most desktop CPUs can't do float16 arithmetic yet, so this would run at pretty much the same speed. Here are some better examples:

  1. You have an online game, and you need to transfer health data, durability, etc. from the server to the client to hundreds of players.
  2. You have a weather server that has to provide temperature/humidity data to thousands of clients.
  3. Any app that stores calculations.
  4. Graphical APIs like Vulkan use it internally for intermediate results between the CPU and GPU. The RAM bus goes vroom-vroom.

Pretty much the same use cases as I discussed earlier. This could benefit existing libraries and implementations (the CBOR library, the OpenMeteo API, etc.) that currently rely on software-only encoding of float16.

EDIT0: never mind, just ignore everything I said; I was sleepy and missed the entire point of the message above.

@TailsFanLOL
Author

About the "why should we allow doing math" argument: the thing is, most programmers may assume that if float32 is faster than float64, then float16 must be even faster, on all hardware. Though by the time everyone gets the update, fp16 arithmetic might be widespread anyway, so we could just make it a compiler warning that can be disabled.

@seankhliao seankhliao changed the title float16/complex32 followup proposal: float16 May 2, 2024
@gopherbot gopherbot added this to the Proposal milestone May 2, 2024
@seankhliao seankhliao changed the title proposal: float16 proposal: spec: float16 Jun 15, 2024
@seankhliao seankhliao added the LanguageChange Suggested changes to the Go language label Jun 15, 2024
@ianlancetaylor ianlancetaylor added the LanguageChangeReview Discussed by language change review committee label Aug 6, 2024
@ianlancetaylor
Contributor

As this proposal is for a storage-only format, this can be implemented as a separate package. That package can provide conversions between the new 16-bit floating-point type and float32. The new value could be stored as a uint16. That might be the place to start before adding it to the language.

In the language I don't think we could get away with having a storage-only type. If we have float16, it needs to support all the operations that float32 supports. Anything else would be very strange.
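A minimal sketch of what such a separate package could look like, assuming a Float16 type that stores the raw bits in a uint16 as suggested above (all names are hypothetical; this is not an existing package). The interesting direction is float32 → binary16, which must round (here round-to-nearest-even, the IEEE 754 default):

```go
package main

import (
	"fmt"
	"math"
)

// Float16 is a storage-only half-precision value, held as its raw bits.
type Float16 uint16

// FromFloat32 rounds f to the nearest IEEE 754 binary16 value
// (ties to even) and returns its bit pattern.
func FromFloat32(f float32) Float16 {
	b := math.Float32bits(f)
	sign := uint16(b>>16) & 0x8000
	exp := int32(b>>23) & 0xFF
	frac := b & 0x7FFFFF

	switch {
	case exp == 0xFF: // Inf or NaN
		if frac != 0 {
			return Float16(sign | 0x7E00) // quiet NaN
		}
		return Float16(sign | 0x7C00)
	case exp >= 113: // candidate binary16 normal
		h := uint32(exp-112)<<10 | frac>>13
		round := frac & 0x1FFF // the 13 dropped mantissa bits
		if round > 0x1000 || (round == 0x1000 && h&1 == 1) {
			h++ // a carry may ripple into the exponent; that is correct
		}
		if h >= 0x7C00 {
			return Float16(sign | 0x7C00) // overflowed past 65504 → Inf
		}
		return Float16(sign | uint16(h))
	case exp >= 102: // binary16 subnormal range (or rounds into it)
		shift := uint32(126 - exp) // 14..24
		m := frac | 0x800000       // restore the implicit leading 1
		h := m >> shift
		rem := m & (1<<shift - 1)
		half := uint32(1) << (shift - 1)
		if rem > half || (rem == half && h&1 == 1) {
			h++ // may carry into the smallest normal, which is fine
		}
		return Float16(sign | uint16(h))
	default: // too small: rounds to signed zero
		return Float16(sign)
	}
}

func main() {
	fmt.Printf("%#x %#x %#x\n",
		uint16(FromFloat32(1.0)),   // 0x3c00
		uint16(FromFloat32(-1.5)),  // 0xbe00
		uint16(FromFloat32(65504))) // 0x7bff, largest finite binary16
}
```

The decode direction (bits → float32) is lossless and much simpler, so a storage-only package really only has to get this rounding right once.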

@swdee

swdee commented Aug 30, 2024

Our use case is handling float16 tensor outputs from the NPU on the RK3588 processor. We simply convert the output buffer from CGO to uint16, then use the https://github.com/x448/float16 package to convert to float32 for handling within Go.

We did attempt to perform the conversion via CGO using the ARM Compute Library, which has NEON SIMD instructions to accelerate the conversion, but this was slower than sticking with the pure-Go library above.

However, we do achieve a 35% performance increase (on the RK3588) by precalculating a uint16→float32 lookup table to convert the buffer. On a Threadripper workstation this method gives us a 69% increase.

Further details on what we are doing here x448/float16#47 (comment)
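For reference, the lookup-table technique described above can be sketched in pure Go roughly as follows (names are hypothetical; the table trades 256 KiB of memory for a single indexed load per value):

```go
package main

import (
	"fmt"
	"math"
)

// f16Table maps every possible binary16 bit pattern to its float32
// value: 65536 entries × 4 bytes = 256 KiB, built once at startup.
var f16Table [1 << 16]float32

func init() {
	for i := range f16Table {
		f16Table[i] = decodeF16(uint16(i))
	}
}

// decodeF16 is a straightforward software binary16 → float32 decode,
// used only to populate the table.
func decodeF16(h uint16) float32 {
	sign := uint32(h>>15) & 1
	exp := uint32(h>>10) & 0x1F
	frac := uint32(h) & 0x3FF
	var bits uint32
	switch {
	case exp == 0x1F: // Inf or NaN
		bits = sign<<31 | 0xFF<<23 | frac<<13
	case exp != 0: // normal: rebias exponent from 15 to 127
		bits = sign<<31 | (exp+112)<<23 | frac<<13
	case frac != 0: // subnormal: renormalize
		e := uint32(113)
		for frac&0x400 == 0 {
			frac <<= 1
			e--
		}
		bits = sign<<31 | e<<23 | (frac&0x3FF)<<13
	default: // signed zero
		bits = sign << 31
	}
	return math.Float32frombits(bits)
}

// convertBuffer converts a whole float16 buffer with one table load
// per element, the pattern benchmarked in this thread.
func convertBuffer(dst []float32, src []uint16) {
	for i, h := range src {
		dst[i] = f16Table[h]
	}
}

func main() {
	src := []uint16{0x3C00, 0x4000, 0xBE00} // 1, 2, -1.5
	dst := make([]float32, len(src))
	convertBuffer(dst, src)
	fmt.Println(dst) // [1 2 -1.5]
}
```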

@ianlancetaylor
Contributor

Based on the above discussion this is a likely decline. Leaving open for four weeks for final comments.

@TailsFanLOL
Author

I tried a similar method to the one above on Haswell and RK3399, and the cgo implementation is faster than the lookup table. Will post the code soon; I am not home right now.

@TailsFanLOL
Author

Sorry for the wait, I had forgotten. I lost the original numbers and the program, so I made a quick-and-dirty replacement.
On Haswell with gccgo:

Originals: 2.7182817 3.1415927 1.618034 1.4142135 1.6487212 1.7724539 1.2720196 0.6931472
f32tof16 took 8.487µs
f16tof32 took 1.634µs
Converted: 2.71875 3.140625 1.6181641 1.4140625 1.6484375 1.7724609 1.2724609 0.6933594
Making a lookup table using it (also demonstrates the speed of looping through 65535 values)...
Generation of little bobby tables took 10.544385ms
Originals: 2.7182817 3.1415927 1.618034 1.4142135 1.6487212 1.7724539 1.2720196 0.6931472
float32 >> float16 in software took 1.51µs
Lookup of float16 >> float32 took 240ns
Converted: 2.71875 3.140625 1.6181641 1.4140625 1.6484375 1.7724609 1.2724609 0.6933594

I don't know how I got cgo to be quicker; it's probably the data conversion between the two. I will try other platforms tomorrow.

@TailsFanLOL
Author

If one divides the time it took to build the table, that's about 1.287µs per 8 values, which is faster; perhaps the measurement is inaccurate. This needs a better benchmark.

@swdee

swdee commented Sep 22, 2024

> Sorry for the wait, I have forgotten. I lost the original numbers and the program and made a quick and dirty replacement. On Haswell with gccgo:

Thanks for providing your code. I have taken it and applied it to our use case of converting the tensor outputs from float16 to float32 and benchmarked it against the x448/float16 Go code and Lookup table versions. That code is here.

Benchmark data as follows:

$ go test -bench=^BenchmarkF16toF32 -run=^$
goos: linux
goarch: amd64
pkg: github.com/x448/float16
cpu: AMD Ryzen Threadripper PRO 5975WX 32-Cores     
BenchmarkF16toF32LookupConversion-20                4107            299812 ns/op
BenchmarkF16toF32NormalConversion-20                1286            946051 ns/op
BenchmarkF16toF32CGOSingleConversion-20               64          17829553 ns/op
BenchmarkF16toF32CGOVectorConversion-20              574           2174532 ns/op
PASS
ok      github.com/x448/float16 6.029s

The CGOSingleConversion converts f16→f32 one by one in a loop, to match the logic in the Normal and Lookup conversion versions. In my own app I did try loop-unrolled versions of the Normal and Lookup code, but found no performance advantage.

The CGOVectorConversion uses your code to process the f16->f32 in batches of 8.

Unfortunately, the CGO version is considerably slower, which reflects the attempts we made using the Arm Compute Library's NEON instructions.

@swdee

swdee commented Sep 22, 2024

@TailsFanLOL An update, as I realised I made a bad comparison in the above code: each chunk of our f16 buffer involved a CGO call in a loop, instead of converting the entire buffer in C with a single CGO call. Updated benchmark code here.

Benchmark data as follows:

$ go test -bench=^BenchmarkF16toF32 -run=^$
goos: linux
goarch: amd64
pkg: github.com/x448/float16
cpu: AMD Ryzen Threadripper PRO 5975WX 32-Cores     
BenchmarkF16toF32LookupConversion-20                        4244            272754 ns/op
BenchmarkF16toF32NormalConversion-20                        1231            907933 ns/op
BenchmarkF16toF32CGOSingleConversion-20                       70          16636862 ns/op
BenchmarkF16toF32CGOVectorConversion-20                      577           2027000 ns/op
BenchmarkF16toF32CGOBufferSingleConversion-20               4364            230207 ns/op
BenchmarkF16toF32CGOBufferVectorConversion-20               5376            228132 ns/op
PASS
ok      github.com/x448/float16 9.042s

As can be seen in the CGOBufferSingleConversion benchmark, the single conversion, which converts the buffer in a loop, runs at a similar speed to the Lookup version.

The CGOBufferVectorConversion above does show some performance improvement from batch processing; however, across multiple runs of the benchmark the results vary quite a lot, and in some cases performance is worse, e.g.:

BenchmarkF16toF32LookupConversion-20                        4116            274900 ns/op
BenchmarkF16toF32NormalConversion-20                        1311            905580 ns/op
BenchmarkF16toF32CGOSingleConversion-20                       67          18997411 ns/op
BenchmarkF16toF32CGOVectorConversion-20                      560           2131009 ns/op
BenchmarkF16toF32CGOBufferSingleConversion-20               5115            228845 ns/op
BenchmarkF16toF32CGOBufferVectorConversion-20               4610            221454 ns/op

Results running on the ARM RK3588 processor show a 3x performance improvement of the CGO buffer versions over the Lookup version:

goos: linux
goarch: arm64
pkg: github.com/x448/float16
BenchmarkF16toF32LookupConversion-8                  229           4688887 ns/op
BenchmarkF16toF32NormalConversion-8                  124           8296945 ns/op
BenchmarkF16toF32CGOSingleConversion-8                 8         138560388 ns/op
BenchmarkF16toF32CGOVectorConversion-8                68          19858146 ns/op
BenchmarkF16toF32CGOBufferSingleConversion-8         742           1492488 ns/op
BenchmarkF16toF32CGOBufferVectorConversion-8         771           1642574 ns/op
PASS
ok      github.com/x448/float16 10.324s

@ianlancetaylor
Contributor

No change in consensus.

@ianlancetaylor ianlancetaylor closed this as not planned Won't fix, can't repro, duplicate, stale Oct 2, 2024
7 participants