RuntimeError: CUDA error: invalid argument when using xformers #1946

Closed
vmajor opened this issue Jan 7, 2023 · 19 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

vmajor commented Jan 7, 2023

Describe the bug

When trying to run train_dreambooth.py with --enable_xformers_memory_efficient_attention the process exits with this error:

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Steps:   0%|                                                                                                                          | 0/400 [00:07<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/*****/anaconda3/envs/sd-gpu/bin/accelerate:8 in <module>                                  │
│                                                                                                  │
│   5 from accelerate.commands.accelerate_cli import main                                          │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/commands/accelerate_c │
│ li.py:45 in main                                                                                 │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/commands/launch.py:11 │
│ 04 in launch_command                                                                             │
│                                                                                                  │
│   1101 │   elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA  │
│   1102 │   │   sagemaker_launcher(defaults, args)                                                │
│   1103 │   else:                                                                                 │
│ ❱ 1104 │   │   simple_launcher(args)                                                             │
│   1105                                                                                           │
│   1106                                                                                           │
│   1107 def main():                                                                               │
│                                                                                                  │
│ /home/*****/anaconda3/envs/sd-gpu/lib/python3.10/site-packages/accelerate/commands/launch.py:56 │
│ 7 in simple_launcher                                                                             │
│                                                                                                  │
│    564 │   process = subprocess.Popen(cmd, env=current_env)                                      │
│    565 │   process.wait()                                                                        │
│    566 │   if process.returncode != 0:                                                           │
│ ❱  567 │   │   raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)       │
│    568                                                                                           │
│    569                                                                                           │
│    570 def multi_gpu_launcher(args):                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

Reproduction

accelerate launch train_dreambooth.py --pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4 --instance_data_dir=./inputs --output_dir=./outputs --instance_prompt="a photo of sks dog" --resolution=512 --train_batch_size=1 --gradient_accumulation_steps=1 --learning_rate=5e-6 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=400 --enable_xformers_memory_efficient_attention
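
The error message above suggests setting CUDA_LAUNCH_BLOCKING=1 for a more accurate stack trace, e.g. by prefixing the same command:

CUDA_LAUNCH_BLOCKING=1 accelerate launch train_dreambooth.py <same arguments as above>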

Logs

No response

System Info

  • diffusers version: 0.12.0.dev0
  • Platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • Python version: 3.10.8
  • PyTorch version (GPU?): 1.13.0 (True)
  • Huggingface_hub version: 0.11.1
  • Transformers version: 0.15.0
  • Accelerate version: not installed
  • xFormers version: 0.0.15.dev395+git.7e05e2c
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: single GPU
vmajor added the bug label Jan 7, 2023
@davidpfahler

This might be an upstream bug in xformers: facebookresearch/xformers#563

@hafriedlander

Related issue #1829

@hafriedlander

@davidpfahler in the meantime, using this helper to enable xformers instead of the built-in enable_xformers_memory_efficient_attention method should work:

https://github.com/cloneofsimo/lora/blob/master/lora_diffusion/xformers_utils.py#L42
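
For context, the idea behind that helper is to probe whether xformers' memory-efficient attention actually runs on the current GPU before switching it on. A minimal sketch of that check (the function name and tensor sizes are illustrative, not the exact code from the linked file):

import torch


def xformers_attention_works() -> bool:
    # Run a tiny memory_efficient_attention call; on unsupported arches this is
    # where the "CUDA error: invalid argument" tends to surface.
    try:
        import xformers.ops

        q = torch.randn(1, 2, 40, device="cuda", dtype=torch.float16)
        xformers.ops.memory_efficient_attention(q, q, q)
        return True
    except Exception:
        return False


# Only flip the built-in switch if the probe succeeds, e.g.:
# if xformers_attention_works():
#     unet.enable_xformers_memory_efficient_attention()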

@patrickvonplaten

cc @patil-suraj

@patil-suraj

Could be an issue with the xformers version. I have been using the xformers pre-release and it seems to be working without any issues: https://pypi.org/project/xformers/#history

@TsykunovDmitriy

Thanks for the tip. I had the same issue. I solved it by installing this xformers pre-release package as @patil-suraj said and updating pytorch version to 1.13.1+cu117.
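
For reference, one way to get those versions (a sketch; exact pins and index URLs may differ for your setup):

pip install --pre xformers
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117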

@hafriedlander

@patil-suraj this is arch-specific. What arch are you testing on? It's possible they've fixed it, but the bugs are still open:

facebookresearch/xformers#517
facebookresearch/xformers#628

(I'll check latest xformers in a bit, but I already have a fix for myself.)

@patil-suraj

So far, I've only tried it on A100 and T4.

@hafriedlander

Those are the two where it definitely works :). The arch I know has issues is SM8x except SM80 (so 30xx and 40xx cards, mostly).

(Although it looks like there's a bit more action in the xformers repo, so this might actually get fixed upstream at some point now.)
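
If you're not sure which SM your card is, PyTorch can tell you (a small sketch):

import torch

# Prints e.g. "NVIDIA GeForce RTX 3060 SM86" or "NVIDIA A100-SXM4-40GB SM80".
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"SM{major}{minor}")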

USBhost commented Jan 18, 2023

> Thanks for the tip. I had the same issue. I solved it by installing this xformers pre-release package as @patil-suraj said and updating pytorch version to 1.13.1+cu117.

This worked on my A6000. PyTorch 1.13.1 is a must, as I installed xformers 436 manually for 1.12.1 and still got that error.

Edit: it may not error out anymore; it's just a silent failure now.

gleb-akhmerov commented Jan 19, 2023

> Thanks for the tip. I had the same issue. I solved it by installing this xformers pre-release package as @patil-suraj said and updating pytorch version to 1.13.1+cu117.

While I'm no longer getting an error, it looks like the model doesn't learn anymore. The images generated after training are the same as those generated before it.

However, I've found an older version of xformers which works just fine: facebookresearch/xformers@0bad001. This seems to be the last commit that works for me, as far as I can tell from a few tests using later commits.

Here's my environment and installation process.

GPU: 3060
CUDA version: 11.8
Python version: 3.10
OS: Arch Linux

Installation:

cd examples/dreambooth
pip install \
    -r requirements.txt \
    git+https://github.com/huggingface/diffusers.git@7c82a16fc14840429566aec40eb9e65aa57005fd \
    torch==1.13.1 \
    bitsandbytes==0.35.1 \
    triton==2.0.0.dev20221202 \
    scikit-learn \
    datasets \
    ninja
pip install git+https://github.com/facebookresearch/xformers.git@0bad001ddd56c080524d37c84ff58d9cd030ebfd

If nvcc is not on $PATH (like on Arch Linux), you can change the last line and specify the path to cuda like this:

PATH="$PATH:/opt/cuda/bin" pip install git+https://github.com/facebookresearch/xformers.git@0bad001ddd56c080524d37c84ff58d9cd030ebfd

Some details about versions:

  • ninja is installed to build xformers faster
  • bitsandbytes must be 0.35 because of this. Also, training with 0.35.4 makes the model generate blue noise for me, while 0.35.1 works fine.

Full package version list:
absl-py                  1.4.0
accelerate               0.15.0
aiohttp                  3.8.3
aiosignal                1.3.1
async-timeout            4.0.2
attrs                    22.2.0
bitsandbytes             0.35.1
cachetools               5.2.1
certifi                  2022.12.7
charset-normalizer       2.1.1
cmake                    3.25.0
datasets                 2.8.0
diffusers                0.12.0.dev0
dill                     0.3.6
exceptiongroup           1.1.0
filelock                 3.9.0
frozenlist               1.3.3
fsspec                   2022.11.0
ftfy                     6.1.1
google-auth              2.16.0
google-auth-oauthlib     0.4.6
grpcio                   1.51.1
huggingface-hub          0.11.1
idna                     3.4
importlib-metadata       6.0.0
iniconfig                2.0.0
Jinja2                   3.1.2
joblib                   1.2.0
Markdown                 3.4.1
MarkupSafe               2.1.2
modelcards               0.1.6
multidict                6.0.4
multiprocess             0.70.14
mypy-extensions          0.4.3
ninja                    1.11.1
numpy                    1.24.1
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
oauthlib                 3.2.2
packaging                23.0
pandas                   1.5.3
Pillow                   9.4.0
pip                      22.3.1
pluggy                   1.0.0
protobuf                 3.20.3
psutil                   5.9.4
pyarrow                  10.0.1
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pyre-extensions          0.0.23
python-dateutil          2.8.2
pytz                     2022.7.1
PyYAML                   6.0
regex                    2022.10.31
requests                 2.28.2
requests-oauthlib        1.3.1
responses                0.18.0
rsa                      4.9
scikit-learn             1.2.0
scipy                    1.10.0
setuptools               65.5.0
six                      1.16.0
tensorboard              2.11.2
tensorboard-data-server  0.6.1
tensorboard-plugin-wit   1.8.1
threadpoolctl            3.1.0
tokenizers               0.13.2
tomli                    2.0.1
torch                    1.13.1
torchvision              0.14.1
tqdm                     4.64.1
transformers             4.25.1
triton                   2.0.0.dev20221202
typing_extensions        4.4.0
typing-inspect           0.8.0
urllib3                  1.26.14
wcwidth                  0.2.6
Werkzeug                 2.2.2
wheel                    0.38.4
xformers                 0.0.15.dev0+0bad001.d20230119
xxhash                   3.2.0
yarl                     1.8.2
zipp                     3.11.0

Edit: it seems to work with both torch 1.12.1 and 1.13.1; I've updated the version information above.

EandrewJones commented Feb 1, 2023

I too have been running into issues with xFormers on an A10G (AWS g5 instance) for training textual inversion, not DreamBooth (though the same issues would likely apply). The environment is containerized (only the essential lines are shown below):

FROM nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04

# CUDA xformers build args
ENV TORCH_CUDA_ARCH_LIST="8.0;8.6"

#
# Deep learning training and inference dependencies
#
RUN pip install -qq -U git+https://github.com/EandrewJones/diffusers
RUN pip install -q -U --pre triton
RUN pip install -q \
    ninja \
    torch==1.13.1 \
    torchvision==0.14.1 \
    accelerate==0.12.0 \
    # mlflow==2.1.1 \
    transformers \
    datasets \
    ftfy \
    pathlib
RUN pip install --upgrade \
    scipy
# RUN conda install -y xformers=0.0.16.dev430+git.bac8718 xformers/label/dev
RUN pip install -v -U git+https://github.com/facebookresearch/xformers.git@0bad001ddd56c080524d37c84ff58d9cd030ebfd
# RUN pip install -v xformers==0.0.17.dev435

Rest of file...

Somewhere between 100 and 300 steps into training, the loss goes to NaN. I know the issue is xFormers because training runs fine without it. No C++ errors, just a silent failure.

Installations I've tried (PyTorch 1.13.1 and CUDA 11.6/11.7 for all):

  • Every pip release > 0.0.13 (including the one @patil-suraj mentioned above)
  • Conda install from dev and main (using a different base image, pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime)
  • Compiling from scratch on both base images (ran into the issues mentioned by the OP)

The weird thing is that python -m xformers.info always indicates success:

memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.flshattF:               available
memory_efficient_attention.flshattB:               available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        available
memory_efficient_attention.tritonflashattB:        available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
is_functorch_available:                            False
pytorch.version:                                   1.13.1
pytorch.cuda:                                      available
gpu.compute_capability:                            8.6
gpu.name:                                          NVIDIA A10G
build.info:                                        available
build.cuda_version:                                1106
build.python_version:                              3.10.9
build.torch_version:                               1.13.1
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE:                     None
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   None
source.privacy:                                    open source
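
Since the failure mode is silent (no exception, the loss just turns into NaN), a minimal guard inside the training loop can catch it early; this is only a sketch, and loss and step are whatever names your loop already uses:

import torch

# After computing the loss each step:
if not torch.isfinite(loss).all():
    raise RuntimeError(f"Loss became non-finite at step {step}: {loss.item()}")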

@patil-suraj

It would be nice to report this in the xformers issue tracker.

@patil-suraj

I don't know if it fixes it, but there was a new xformers release yesterday: https://github.com/facebookresearch/xformers/releases/tag/v0.0.16
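
If anyone wants to try it, upgrading should just be (pin shown for that release):

pip install -U xformers==0.0.16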

EandrewJones commented Feb 1, 2023 via email

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label Feb 25, 2023

USBhost commented Feb 25, 2023

Before the bot makes this issue disappear: is it resolved? I'm still using an older version of xformers.

EandrewJones commented Feb 25, 2023 via email

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
