RuntimeError: CUDA error: invalid argument when using xformers #1946
This might be an upstream bug in xformers: facebookresearch/xformers#563
Related issue: #1829
@davidpfahler in the meantime, using this helper to enable xformers instead of the built-in enable_xformers_memory_efficient_attention method should work: https://github.com/cloneofsimo/lora/blob/master/lora_diffusion/xformers_utils.py#L42
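The linked helper builds its own attention setup; as a simpler hedged alternative (a sketch of the fallback idea, not the linked code, and `try_enable_xformers` is a name introduced here for illustration), one can wrap the built-in call in a try/except so that unsupported GPUs fall back to default attention instead of crashing:

```python
def try_enable_xformers(pipe) -> bool:
    """Attempt to enable xformers memory-efficient attention on a
    diffusers-style pipeline; fall back to the default attention
    implementation if the call fails (e.g. the "CUDA error: invalid
    argument" reported in this issue)."""
    try:
        pipe.enable_xformers_memory_efficient_attention()
        return True
    except Exception as err:
        print(f"xformers unavailable, using default attention: {err}")
        return False
```

Called at pipeline setup time, this lets the same script run on both supported and unsupported architectures.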
cc @patil-suraj
Could be an issue with
Thanks for the tip. I had the same issue. I solved it by installing the xformers pre-release package as @patil-suraj said and updating the PyTorch version to 1.13.1+cu117.
@patil-suraj this is arch specific. What arch are you testing on? It's possible they've fixed it, but the bugs are still open: facebookresearch/xformers#517 (I'll check the latest xformers in a bit, but I already have a fix for myself.)
So far, I've only tried it on A100 and T4 |
Those are the two where it definitely works :). The arch I know has issues is SM8x except SM80 (so mostly 30xx and 40xx cards). (Although it looks like there's a bit more action in the xformers repo, so this might actually get fixed upstream at some point now.)
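To make the arch split concrete (my own summary of the comment above, not an official compatibility list; on a live system the capability tuple would come from `torch.cuda.get_device_capability()`, assuming PyTorch is installed):

```python
def arch_known_problematic(major: int, minor: int) -> bool:
    """SM8x except SM80 is the range reported problematic in this thread:
    SM80 = A100 and SM75 = T4 reportedly work, while SM86 (RTX 30xx) and
    SM89 (RTX 40xx) hit the invalid-argument / silent-failure bugs."""
    return major == 8 and minor != 0
```

For example, an RTX 3060 reports capability (8, 6), which this check flags, while an A100's (8, 0) passes.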
This worked on my A6000. PyTorch 1.13.1 is a must: I installed xformers 436 manually for 1.12.1 and still got that error. Edit: it may not error out anymore; it's just a silent failure now.
While I'm no longer getting an error, it looks like the model doesn't learn anymore: the images generated after training are the same as before it. However, I've found an older version of xformers which works just fine: facebookresearch/xformers@0bad001. This seems to be the last commit that works for me, as far as I can tell from a few tests using later commits. Here's my environment and installation process.

GPU: 3060

Installation:

```shell
cd examples/dreambooth
pip install \
  -r requirements.txt \
  git+https://github.com/huggingface/diffusers.git@7c82a16fc14840429566aec40eb9e65aa57005fd \
  torch==1.13.1 \
  bitsandbytes==0.35.1 \
  triton==2.0.0.dev20221202 \
  scikit-learn \
  datasets \
  ninja
pip install git+https://github.com/facebookresearch/xformers.git@0bad001ddd56c080524d37c84ff58d9cd030ebfd
```

If nvcc is not on PATH:

```shell
PATH="$PATH:/opt/cuda/bin" pip install git+https://github.com/facebookresearch/xformers.git@0bad001ddd56c080524d37c84ff58d9cd030ebfd
```

Some details about versions:
Full package version list
Edit: seems to work with both torch 1.12.1 and 1.13.1; updated the version information.
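A small sketch for sanity-checking that the pinned versions above actually ended up installed before launching training (`check_pins` is a helper name introduced here; the example pins are this commenter's working set, not an official compatibility matrix):

```python
from importlib.metadata import version, PackageNotFoundError

def check_pins(pins: dict) -> dict:
    """Compare installed package versions against expected pins.
    Returns {package: (expected, installed)} for every mismatch;
    an empty dict means the environment matches."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package missing entirely
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Example pins taken from the install commands above.
PINS = {"torch": "1.13.1", "bitsandbytes": "0.35.1"}
```

Running `check_pins(PINS)` right before `accelerate launch` catches the case where a later `pip install` silently upgraded torch and broke the xformers build.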
I too have been running into issues with xFormers on an A10G (AWS g5 instance) for training textual inversion, not DreamBooth (though the same issues would likely apply). The environment is containerized (only showing essential lines below).

Somewhere between 100 and 300 steps into training, the loss goes to NaN. I know the issue is xFormers because training runs fine without it. No C++ errors, just silent failure. Installations I've tried (PyTorch 1.13.1 and CUDA 11.6/7 for all):

The weird thing is
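One way to surface this silent failure early is a NaN guard on the per-step loss (a generic sketch, not part of the textual-inversion script; in a real training loop the float would come from `loss.item()`):

```python
import math

def guard_loss(step: int, loss: float) -> None:
    """Raise as soon as the loss becomes NaN or inf, instead of letting
    training continue with a model that has stopped learning."""
    if math.isnan(loss) or math.isinf(loss):
        raise RuntimeError(
            f"loss became {loss} at step {step}; "
            "try disabling xformers or pinning a known-good version"
        )
```

With this in the loop, the failure shows up at the step where the loss first diverges (100-300 here) rather than after a full, wasted training run.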
Would be nice to report this in the xformers repo as well.
I don't know if it fixes it, but there has been a new release for xformers yesterday: https://github.com/facebookresearch/xformers/releases/tag/v0.0.16
I tried this and the 0.17 pre-release. I'll report in xformers, but I believe I found a related issue there already.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Before the bot makes this issue disappear: is it resolved? I'm still currently using an older version of xformers.
Yes.
Describe the bug
When trying to run train_dreambooth.py with --enable_xformers_memory_efficient_attention the process exits with this error:
Reproduction
```shell
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 \
  --instance_data_dir=./inputs \
  --output_dir=./outputs \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400 \
  --enable_xformers_memory_efficient_attention
```
Logs
No response
System Info
diffusers version: 0.12.0.dev0