NaNs when training with attn_bias (f32) #684

Open
zen-d opened this issue Mar 8, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@zen-d

zen-d commented Mar 8, 2023

❓ Questions and Help

Hi, I pass attn_bias to xformers.ops.memory_efficient_attention, but get the following error:

NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(831, 43, 32, 8) (torch.float32)
     key         : shape=(831, 43, 32, 8) (torch.float32)
     value       : shape=(831, 43, 32, 8) (torch.float32)
     attn_bias   : <class 'torch.Tensor'>
     p           : 0.0
`flshattF` is not supported because:
    dtype=torch.float32 (supported: {torch.bfloat16, torch.float16})
    attn_bias type is <class 'torch.Tensor'>
`tritonflashattF` is not supported because:
    dtype=torch.float32 (supported: {torch.bfloat16, torch.float16})
    attn_bias type is <class 'torch.Tensor'>
`cutlassF` is not supported because:
    attn_bias.shape[-1] % 8 != 0
`smallkF` is not supported because:
    bias with non-zero stride not supported

In my case, attn_bias is indispensable, and it is hard to always satisfy attn_bias.shape[-1] % 8 == 0. How can I benefit from this repo? Thanks.
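
For context, a minimal sketch of the kind of call that hits this error; the sizes and tensor names below are illustrative (not my actual model), only the float32 dtype and the last attn_bias dimension not being a multiple of 8 matter:

import torch
import xformers.ops as xops

# Illustrative sizes in xformers' [batch, seq_len, n_heads, head_dim] layout;
# seq_len=43 makes attn_bias.shape[-1] % 8 != 0, and float32 rules out the
# f16/bf16-only kernels, so no backend accepts the inputs.
B, M, H, K = 2, 43, 4, 8
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float32)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float32)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float32)
attn_bias = torch.randn(B, H, M, M, device="cuda", dtype=torch.float32)

out = xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)  # raises NotImplementedError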

@zen-d zen-d changed the title from "support for attn_bias of arbitrry format" to "support for attn_bias of arbitrary format" Mar 8, 2023
@danthe3rd
Contributor

Hi,
Thanks for opening this issue. That's something we can work on (see #683).
What type of bias do you need? Is it a learnable bias?

@zen-d
Author

zen-d commented Mar 8, 2023

@danthe3rd Thanks a lot for your prompt reply! #683 is highly related. In that thread I noticed you may be working on it (#683 (comment)).
First, may I know when support for an attn_bias of type torch.Tensor with attn_bias.shape[-1] % 8 != 0 is scheduled? Is it planned for the near future?
Second, if you could also support a learnable attn_bias, it would be even more attractive.

@danthe3rd
Contributor

The bias is currently learnable :) We just need to add the padding support. Hopefully we can get that out next week.

@zen-d
Author

zen-d commented Mar 8, 2023

Wow, fantastic! Looking forward to seeing the padding support soon so the shape constraint can be relaxed.

@danthe3rd danthe3rd added the bug (Something isn't working) and ongoing labels Mar 8, 2023
@danthe3rd
Contributor

It's merged in b6be33a

@zen-d
Author

zen-d commented Mar 13, 2023

@danthe3rd Thanks! Looks good, but I don't have free GPUs at the moment. I will try out the new feature ASAP.

@zen-d
Author

zen-d commented Mar 14, 2023

@danthe3rd By following these hints to do the padding and slicing, I'm able to run the model now. The memory burden is significantly reduced. Thanks for your awesome work! I will continue to monitor the training process and the final accuracy.

HINT: To use an attn_bias with a sequence length that is not a multiple of 8,
you need to ensure memory is aligned by slicing a bigger tensor.
Example: use attn_bias = torch.zeros([1, 1, 5, 8])[:,:,:,:5] instead of torch.zeros([1, 1, 5, 5])
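
A small sketch of how I applied this hint in practice (the helper name and shapes here are just for illustration):

import torch

def make_aligned_attn_bias(B, H, M, device="cuda", dtype=torch.float32):
    # Allocate the last dimension padded up to a multiple of 8, then slice back
    # to the real sequence length; the slice keeps the aligned underlying storage,
    # which is what the cutlassF kernel checks for.
    M_pad = (M + 7) // 8 * 8
    bias = torch.zeros(B, H, M, M_pad, device=device, dtype=dtype)
    return bias[:, :, :, :M]

attn_bias = make_aligned_attn_bias(B=2, H=4, M=43)  # shape [2, 4, 43, 43], backed by width-48 storage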

@zen-d zen-d closed this as completed Mar 14, 2023
@zen-d zen-d reopened this Mar 14, 2023
@zen-d
Author

zen-d commented Mar 14, 2023

Unfortunately, the training diverges partway through (the loss becomes NaN), which did not happen with the original attention implementation. Could you share some insights on that? Thanks.

@danthe3rd
Contributor

danthe3rd commented Mar 14, 2023

I don't have a specific idea for this, but you can pinpoint more precisely where the NaN is coming from with anomaly detection:

torch.autograd.set_detect_anomaly(mode=True, check_nan=True)
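
For example, a minimal sketch of wiring this into a training step (model and batch are placeholders for illustration; check_nan needs a fairly recent PyTorch):

import torch

# Scope anomaly detection to one training step; with check_nan=True the backward
# pass raises as soon as a NaN gradient appears, with a traceback pointing at the
# forward op that produced it.
with torch.autograd.detect_anomaly(check_nan=True):
    loss = model(batch).mean()  # hypothetical model/batch, for illustration only
    loss.backward()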

@zen-d
Author

zen-d commented Mar 15, 2023

Thanks for the suggestion. The only difference in this controlled experiment is the attention implementation, but I am not sure of the specific cause yet. I will dig deeper into the issue. :)

@danthe3rd
Contributor

Also - it looks like this is running in f32? If not, you might want to try training with f32 to see if it's related to numerical precision.

@zen-d
Author

zen-d commented Mar 15, 2023

Yes, to be safe, I am training with FP32 numerical precision for now. (In my experience, AMP training seems more prone to NaNs for Transformer-based models.)

@danthe3rd danthe3rd removed the ongoing label Mar 30, 2023
@danthe3rd danthe3rd changed the title from "support for attn_bias of arbitrary format" to "NaNs when training with attn_bias (f32)" Mar 30, 2023
@Shannen3206

Yes, to be safe, I am training with FP32 numerical precision for now. (In my experience, AMP training seems more prone to NaNs for Transformer-based models.)

I met the same issue, and I found that using fp16 solves this problem.

@Shannen3206

@danthe3rd By following these hints to do the padding and slicing, I'm able to run the model now. The memory burden is significantly reduced. Thanks for your awesome work! I will continue to monitor the training process and the final accuracy.

HINT: To use an attn_bias with a sequence length that is not a multiple of 8,
you need to ensure memory is aligned by slicing a bigger tensor.
Example: use attn_bias = torch.zeros([1, 1, 5, 8])[:,:,:,:5] instead of torch.zeros([1, 1, 5, 5])

Hi,
I found that using this method can make inference slower; see #853.
Do you have any suggestions for a better way?
