[RTX 30xx/Sm86] memory_efficient_attention backward not supported for K>64 (f32 & f16?) #517
Hi, thanks for reporting. Can you also add the output of
Sure. Note that with PyTorch 1.13+, functorch is part of PyTorch itself, so it isn't a separate install; see
https://pytorch.org/blog/PyTorch-1.13-release/. The output above is with _is_functorch_available now set. I can do a patch to set it automatically for PyTorch >= 1.13. Rerunning with it set, the tests still don't pass though. I'll retest again after the latest commits with the CUTLASS update.
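For reference, the proposed version gate could look something like this (a minimal sketch; `functorch_available` is a hypothetical name, not xformers' actual flag):

```python
import torch

def functorch_available() -> bool:
    # functorch ships inside PyTorch from 1.13 onward, so no import probe
    # is needed there; older versions require the separate package.
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])
    if (major, minor) >= (1, 13):
        return True
    try:
        import functorch  # noqa: F401  # separate install on <1.13
        return True
    except ImportError:
        return False
```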
No improvement with the latest git.
Thanks for the update. We haven't tested xFormers with PyTorch 1.13 (we're still working on 1.12.1), so that might be the issue? Note: functorch shouldn't matter for this test.
I was able to run some more tests. I can confirm the tests pass on PyTorch 1.13 (and today's 1.14 nightly) with CUDA 11.7. Might it be due to the GPU (RTX 3060) with compute capability 8.6?
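For anyone checking their own card, PyTorch can report the compute capability directly:

```python
import torch

# RTX 30xx parts report (8, 6) (sm86); A100-class parts report (8, 0) (sm80).
print(torch.cuda.get_device_capability(0))
```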
I can confirm this test (and many of the others, as mentioned) also fails on my 3090. The environment is the default given in the README.md, with the latest build installed from conda. The issue also appears when built from source.
I suspect this might be due to the shared-memory amount: it's 160 KB on sm80 vs 100 KB on sm86.
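For context, a rough map of the opt-in per-block shared-memory limits (figures rounded from the CUDA programming guide):

```python
import torch

# Approximate max opt-in shared memory per thread block, by compute capability.
MAX_SMEM_PER_BLOCK_KB = {
    (7, 5): 64,   # Turing
    (8, 0): 163,  # A100 (sm80)
    (8, 6): 99,   # RTX 30xx (sm86)
}

cc = torch.cuda.get_device_capability(0)
print(f"sm{cc[0]}{cc[1]}: ~{MAX_SMEM_PER_BLOCK_KB.get(cc, '?')} KB per block")
```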
I've attached the full log, plus a full log with CUDA_LAUNCH_BLOCKING=1, to this comment. Running the full test suite normally a second time, the number of errors changed yet again, to 1171 failed. I'm not sure if this is expected behavior, so I've added two additional run logs without CUDA_LAUNCH_BLOCKING here.
The RuntimeError: Expected is_sm80 to be true, but got false. and torch.cuda.OutOfMemoryError failures are also included in the log above. Slight variations in GPU usage from other processes, or differences in allocation/fragmentation, can mean the OutOfMemoryError sometimes doesn't occur, which is probably why the error counts change between runs.
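(For what it's worth, CUDA_LAUNCH_BLOCKING only changes where asynchronous kernel errors surface, and it must be set before CUDA initializes:)

```python
import os

# Set before the first CUDA call: launches become synchronous, so a failing
# kernel raises at the offending line instead of at a later sync point.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  # import (and CUDA init) only after setting the flag
```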
If I compile using
the tests don't pass either. An interesting pattern I noticed: it appears to be the float32 variants that are predominantly failing, while the float16 and bfloat16 ones usually pass. Running the tests with -v makes the pattern fairly clear; for instance, here is the first group of 8 float16 and 8 bfloat16 tests passing, compared to the first 8 equivalent float32 tests that fail. (I've removed the skips.)
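One way to isolate the dtype pattern is to filter the run (a sketch; the test path and the `-k` string assume the dtype appears in the parametrized test ids, which may differ between versions):

```python
import pytest

# Run only the float32-parametrized attention tests, verbosely.
pytest.main(["tests/test_mem_eff_attention.py", "-v", "-k", "float32"])
```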
Here are the results for my 3060 GPU (laptop variant, 6 GB): 1538 failures. The additional ones are due to more frequently exceeding VRAM capacity.
Thanks a lot for the logs, that really helps! From the results, it looks like:

[*] I don't count the OOM errors, as they are unrelated to the kernels.

For CUTLASS, it looks like the 3060 GPUs don't have enough shared memory to run the backward. I'll try to address that.
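As a rough intuition only (the tile shape and buffer count below are illustrative assumptions, not the real CUTLASS configuration), the shared-memory footprint grows linearly with the head dimension K, so a ~100 KB budget runs out around K=64 while ~160 KB still has headroom at K=128:

```python
# Back-of-envelope shared-memory estimate for one attention-backward tile:
# 64-row tiles and 4 fp32 staging buffers are made-up, illustrative numbers.
def smem_kb(head_dim: int, rows: int = 64, buffers: int = 4, elem_bytes: int = 4) -> float:
    return rows * head_dim * elem_bytes * buffers / 1024

for K in (32, 64, 128):
    print(f"K={K:3d}: ~{smem_kb(K):.0f} KB  (sm86 ~100 KB, sm80 ~160 KB)")
```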
If you have some time, could you check whether the issue is solved with #526?
Here is the test run with the sm86 update; no VRAM OOMs (it looks like they are skipped now). There are now 40 failures, run with CUDA_LAUNCH_BLOCKING=1:

40 failed, 14455 passed, 18553 skipped, 408 warnings in 286.51s (0:04:46)

The previous run was:

1538 failed, 13761 passed, 17749 skipped, 578 warnings in 486.09s (0:08:06)

So 694 more passed and 804 more skipped, a reduction of 1498 failures.
Thanks a lot! This is really helpful! I updated #526 and it should address the remaining tests that were failing.
All passed or skipped now:

14455 passed, 18593 skipped, 408 warnings in 193.37s (0:03:13)
Awesome! I'll merge the PR then. I'll leave this open, as we would still like to support K>64 for the backward on those GPUs (but at a lower priority, though).
Same here.
Awesome! (reopening as we still need to support K>64) |
A 3080 Ti hits the same problem. Has there been any update?
Hi,
👍 Looks good now, thanks! I use xformers to train DreamBooth on a 12 GB GPU and it works.
🐛 Bug
Numerous tests in test_mem_eff_attention.py are failing due to assertion errors; here is the first one:
Environment
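(Environment details for reports like this can be collected with PyTorch's built-in helper, equivalent to running python -m torch.utils.collect_env from the shell:)

```python
from torch.utils import collect_env

# Prints PyTorch/CUDA/driver/GPU details in the format used by bug reports.
collect_env.main()
```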