
[0.0.18, memory_efficient_attention with attn_bias]Getting NANs with arbitrary attn_bias mask with xformers==0.0.18 #722

Closed
toothacher17 opened this issue Apr 8, 2023 · 12 comments

toothacher17 commented Apr 8, 2023

🐛 Bug

I am trying to use xformers to replace my native PyTorch MHA implementation, which looks something like this:

import torch
import torch.nn.functional as F

def native_attention(query, key, value, attn_bias=None, p=0.0):
    # Scaled dot-product attention with an optional additive bias and dropout
    scale = 1 / query.shape[-1] ** 0.5
    query = query * scale
    attn = query @ key.transpose(-2, -1)
    if attn_bias is not None:
        attn = attn + attn_bias
    attn = attn.softmax(-1)
    attn = F.dropout(attn, p)
    return attn @ value

After switching to xformers, I am using xops.memory_efficient_attention(q, k, v, attn_bias).

This works fine when I am using a lower triangular mask, either by passing in a LowerTriangularMask() or by passing a torch.Tensor of the same shape that I build myself.
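
For reference, a minimal sketch of the two working setups (the shapes and tensors here are just illustrative):

import torch
import xformers.ops as xops

B, M, H, K = 1, 1024, 2, 64
dtype, device = torch.float16, "cuda"
q, k, v = [torch.randn([B, M, H, K], dtype=dtype, device=device) for _ in range(3)]

# Built-in causal mask: no NaNs
out1 = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# Hand-built additive bias of shape [B, H, M, M] with -inf above the diagonal: also no NaNs
causal = torch.triu(torch.ones(M, M, device=device), diagonal=1).bool()
bias = torch.zeros([B, H, M, M], dtype=dtype, device=device).masked_fill(causal, float("-inf"))
out2 = xops.memory_efficient_attention(q, k, v, attn_bias=bias)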

However, when I switch to an arbitrary mask (suppose that in the pretraining stage you enable the reset_position_ids and reset_attention_mask flags, so a new start can occur inside one sequence), I get NaNs during evaluation (no-grad forward) and during training (with grad). Based on the log, the program is using the CUTLASS op.

Based on my observations, xformers saves 10-15% of GPU memory and improves overall TFLOPs by 10-15%, so I really want to use it to replace my native PyTorch implementation. Could you help with this issue?

Environment

Some dependencies:
triton==2.0
xformers==0.0.18
pytorch==2.0


toothacher17 commented Apr 8, 2023

So my question is: is an arbitrary attention mask supported by xformers 0.0.18 yet?
I tried to follow several issue threads from the past, and based on the docs my understanding is that the answer is YES. However, I could not make it work.

@danthe3rd
Contributor

Hi @toothacher17
This should be supported by xFormers, but the behavior you report is definitely a bug. Unless a line in your mask is entirely masked out, it shouldn't give you NaNs.
Do you have an independent minimal repro example so I can try it?
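
For context, a line that is entirely masked out is expected to give NaNs even in the native implementation, since softmax over a row of all -inf values is 0/0. A quick illustration:

import torch

fully_masked_row = torch.full((4,), float("-inf"))
print(torch.softmax(fully_masked_row, dim=-1))  # tensor([nan, nan, nan, nan])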

@danthe3rd
Contributor

I actually managed to repro it with this script:

import math
import torch
import xformers.ops.fmha as fmha

B, M, H, K = 1, 1024, 2, 64
dtype = torch.float16
device = "cuda"

q, k, v = [torch.randn([B, M, H, K], dtype=dtype, device=device) for _ in range(3)]
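# Additive bias: for the first 256 queries, mask out the first 256 keys with -inf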
mask = torch.zeros([B, H, M, M], dtype=dtype, device=device)
mask[:, :, :256, :256] = -math.inf
out = fmha.memory_efficient_attention(q, k, v, attn_bias=mask)
print(out.sum())

It looks like it happens when the first 128 tokens of a sentence are entirely masked out.

Regarding your issue specifically: since you want to handle sequences of varying length, I recommend you use this mask; it will also save compute.

@danthe3rd danthe3rd self-assigned this Apr 11, 2023
@danthe3rd danthe3rd added the bug and ongoing labels Apr 11, 2023
@danthe3rd danthe3rd pinned this issue Apr 11, 2023
@toothacher17
Author

Hi @danthe3rd,

Thanks a lot for the reply.

Yes, I took another look at BlockDiagonalCausalMask in the flash_attention and Megatron-LM repos a few days ago. It basically meets my need for pretraining with reset-position-ids and reset-attention-mask, since resetting positions turns the lower-triangular mask into a block-diagonal one. To use this mask, like Megatron-LM, I'll need to change the shapes: merge all sentences in a batch into a single sequence, calculate the cumulative sequence lengths, and pass them to the ops.
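
Something like the following is what I have in mind (a rough sketch; the sequence lengths and shapes are just an example):

import torch
import xformers.ops as xops
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

H, K = 2, 64
dtype, device = torch.float16, "cuda"

# Three "sentences" of a batch, packed back to back into one sequence of 1024 tokens
seqlens = [256, 512, 256]
total = sum(seqlens)
q, k, v = [torch.randn([1, total, H, K], dtype=dtype, device=device) for _ in range(3)]

# Each sentence attends causally to itself only; cross-sentence attention is masked out
attn_bias = BlockDiagonalCausalMask.from_seqlens(seqlens)
out = xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)
print(out.shape)  # torch.Size([1, 1024, 2, 64])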

@toothacher17
Author

Btw, I tested with the CUTLASS-based operator, as it is the only one that supports a custom attention bias, and the NaNs are generated by the CUTLASS operator...

@danthe3rd
Contributor

To use this mask, like Megatron-LM, I'll need to change the shapes: merge all sentences in a batch into a single sequence, calculate the cumulative sequence lengths, and pass them to the ops.

It shouldn't be that involved, as normally everything operates at the token level except for the attention. It might be a bit more involved if you are using some specific positional embedding, though.

Regarding the bug, I believe I understand where it comes from and should have a fix coming soon.

@toothacher17
Author

Thanks for getting back to me. I am using something like RoPE, but that is applied before the attention, so as long as the cumulative sequence lengths for q/k/v are calculated correctly, the block-diagonal causal mask should be fine.

How soon do you think you can release the fix? If it is really coming soon, I'll wait for it before changing my code over to the block-diagonal causal mask.

@danthe3rd
Contributor

How soon do you think you can release the fix? If it is really coming soon, I'll wait for it before changing my code over to the block-diagonal causal mask.

Hopefully this week or next week for the xFormers development version (e.g. we might not necessarily push a new version tag yet).

@danthe3rd danthe3rd added this to the v0.0.19 milestone Apr 14, 2023
@danthe3rd
Contributor

It should be fixed as of 540fcbf and will be included in the next release (0.0.19). In the meantime, you can also use a development build >=0.0.19.dev516.

@danthe3rd danthe3rd changed the title Getting NANs with arbitrary attn_bias mask with xformers==0.0.18 [0.0.18, memory_efficient_attention with attn_bias]Getting NANs with arbitrary attn_bias mask with xformers==0.0.18 Apr 14, 2023
@toothacher17
Author

Thanks, @danthe3rd. We managed to switch to flash_attn_unpadded_func and got the correct, expected loss. Maybe after 0.0.19 is released we will switch back to xformers and see if the CUTLASS op is faster than the flash_attn implementation.

Thanks for the quick fix! Cheers!

@danthe3rd danthe3rd unpinned this issue May 11, 2023

nofreewill42 commented Mar 23, 2024

If my understanding is correct, an arbitrary mask (e.g. an off-diagonal, rectangular attention mask) is memory-efficient, but the grey area is still computed, so compute efficiency is lacking?
I'm nervously awaiting this feature :P because I want efficient cross-attention from transcript text to audio, like in this image:
[image: off-diagonal rectangular cross-attention mask between transcript text and audio]

EDIT:
Have I just found the solution to my specific problem with BlockDiagonalGappyKeysMask? :O

https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.fmha.attn_bias.BlockDiagonalGappyKeysMask


@danthe3rd
Contributor

@nofreewill42 I believe the mask you are looking for is this one: it also supports training, whereas the one you found only supports inference.
