
Dreambooth inference result is not correct when training with xformer #1954

Closed
mingqizhang opened this issue Jan 9, 2023 · 16 comments

@mingqizhang

I trained Dreambooth with xformers, but the result is not correct (it is correct without xformers). The model is initialized from runwayml/stable-diffusion-v1-5, but the inference result is the same as the base runwayml/stable-diffusion-v1-5. Maybe there is an error when saving the model with xformers? I also added pipe.enable_xformers_memory_efficient_attention() at inference, but it did not help.
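
For context, a minimal sketch of the inference setup described above (the output path here is a placeholder for the Dreambooth output directory):

import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned Dreambooth output (placeholder path).
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/dreambooth-output", torch_dtype=torch.float16
).to("cuda")

# Enable xformers attention at inference as well, as mentioned above.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a photo of sks dog", num_inference_steps=25).images[0]
image.save("sks_dog.png")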

@patrickvonplaten

Hey @mingqizhang,

It's quite difficult to act on this issue, as there is no code to reproduce what is incorrect when using xformers. Could you add a reproducible code snippet or create a Google Colab?

cc @patil-suraj as well

@patil-suraj

Same comment as Patrick's: it would be nice if you could post something so we could reproduce.

@TsykunovDmitriy

TsykunovDmitriy commented Jan 16, 2023

I have the same problem. I printed out the gradients, and they are indeed NaN when --enable_xformers_memory_efficient_attention is set (a minimal version of this check is sketched below). I think this is related to issue facebookresearch/xformers#631.

My env:
GPU: 3090
Pytorch: 1.13.1+cu117
xformers: 0.0.16rc425 (from pypi)
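
For reference, a minimal sketch of the gradient check mentioned above (names like unet are illustrative of the training script's variables):

import torch

def check_grads(model: torch.nn.Module) -> None:
    # After loss.backward(), flag any parameter whose gradient is NaN/Inf.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in {name}")

# Usage inside the training loop:
#   loss.backward()
#   check_grads(unet)  # unet is the model being fine-tuned
#   optimizer.step()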

@patil-suraj

From the linked issue, it seems like they are using bfloat16 for training; I'm not sure bfloat16 works well with Stable Diffusion in PyTorch. I'm using xformers==0.0.16rc396 and it's working well.
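
(Given how much the behavior depends on the exact build, a quick way to confirm which xformers version is actually active in an environment:)

python -c "import xformers; print(xformers.__version__)"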

@arpowers

@patrickvonplaten @patil-suraj this problem is occurring when using the latest xformers installed via the new method:
pip install --pre -U xformers

Aside from that, the issue will also occur when running the vanilla Dreambooth example from diffusers.

@patrickvonplaten

I think we would need a fully reproducible code snippet here. Also, as @patil-suraj said, bfloat16 is not the standard precision for Dreambooth.
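
As an aside, a quick check for whether a given GPU supports bfloat16 at all (a minimal sketch using the standard torch API):

import torch

# Ampere cards (e.g., 3090, A10) support bf16; many older GPUs do not.
print(torch.cuda.get_device_name(0))
print(torch.cuda.is_bf16_supported())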

@dai-ichiro

I have the same problem.

My env:
Ubuntu 20.04 on WSL2 (Windows 11)
RTX 3080
Python 3.8.10
Pytorch: 1.13.1+cu116

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install git+https://github.com/huggingface/diffusers.git
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate==0.15.0 scipy==1.10.0 datasets==2.8.0 ftfy==6.1.1 tensorboard==2.11.2
pip install xformers==0.0.16rc425
pip install triton==2.0.0.dev20221202
accelerate config
------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:all
------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16

train

export MODEL_NAME="stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=100 \
  --enable_xformers_memory_efficient_attention

inference

from diffusers import StableDiffusionPipeline
import torch

seed = 10000
prompt = "a photo of sks dog"

# original
model_id = "stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(seed)
image = pipe(
    prompt=prompt,
    num_inference_steps=25,
    generator=generator,
    num_images_per_prompt=1).images[0]
image.save(f"{model_id}.png")

# finetune
model_id = "dreambooth_dog"
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(seed)
image = pipe(
    prompt=prompt,
    num_inference_steps=25,
    generator=generator,
    num_images_per_prompt=1).images[0]
image.save(f"{model_id}.png")

The two results are exactly the same.
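
A quick way to confirm that the fine-tuned checkpoint really is identical to the base model is to diff the UNet weights directly; a minimal sketch using the paths from the script above:

import torch
from diffusers import UNet2DConditionModel

# Load the UNet from the base model and from the Dreambooth output dir.
base = UNet2DConditionModel.from_pretrained("stable-diffusion-v1-4", subfolder="unet")
tuned = UNet2DConditionModel.from_pretrained("dreambooth_dog", subfolder="unet")

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
changed = [k for k in base_sd if not torch.equal(base_sd[k], tuned_sd[k])]

# If training updated anything at all, many tensors should differ.
print(f"{len(changed)} of {len(base_sd)} UNet tensors differ")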

@patrickvonplaten

cc @patil-suraj could you take a look here?

@patil-suraj

I couldn't reproduce the issue; it's working fine for me. I tried the same command as you, with the same dependency versions (accelerate, xformers, triton), torch==1.13.1, and diffusers main, and I got these results after running both the fine-tuned and the sd-1.4 model.

[image: sks-dog comparison grid]

The first row is sd-1.4 and the second is the fine-tuned model. The model has learned and is working well.

Also, in your example you are using max_train_steps=100, which I feel is too low. The issue could be due to too few steps or too small an LR.

@dai-ichiro

Thank you for your quick response.

I changed max_train_steps and the LR, but saw no change.
I think there is something wrong with my environment or settings.
The OS?
The CUDA driver?

I will try different settings.
Sorry for bothering you.

@patil-suraj

No worries at all. Maybe try it in Colab or set up a fresh env; that would help rule out an environment issue.

@mingqizhang

Maybe you can refer to #1829: SD 1.5 does not work well on 3090/A10, and some layers in the UNet cannot backpropagate with xformers.
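
A minimal standalone repro of that backward-pass problem (shapes are illustrative and roughly match an SD 1.x attention layer; this only checks whether memory_efficient_attention supports a backward pass on the given GPU and dtype):

import torch
import xformers.ops as xops

# Toy query/key/value tensors in fp16 on the GPU.
q = torch.randn(2, 4096, 40, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn(2, 4096, 40, device="cuda", dtype=torch.float16, requires_grad=True)
v = torch.randn(2, 4096, 40, device="cuda", dtype=torch.float16, requires_grad=True)

out = xops.memory_efficient_attention(q, k, v)
out.sum().backward()

# If the backward kernel is broken or unsupported on this GPU/version,
# the backward call raises or the gradients come out non-finite.
print(torch.isfinite(q.grad).all().item())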

@dai-ichiro

I installed xformers==0.0.17.dev441 via pip, and my problem was solved.

Thanks.

@rajbala

rajbala commented Feb 16, 2023

@patrickvonplaten @patil-suraj this problem is occurring when using the latest xformers installed via the new method: pip install --pre -U xformers

Aside from that, the issue will occur when running the vanilla Dreambooth example from diffusers.

Did you find a solution to the issue with even the vanilla Dreambooth example?

@SilentD1

SilentD1 commented Feb 22, 2023

And how do you change the xformers version for the stable-diffusion-webui specifically? When I install it using that command, it seems to install into Python, but the version on the stable-diffusion loading screen keeps saying 0.0.16rc425 is installed.
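
A hedged guess at the cause: the webui runs inside its own virtual environment, so a plain pip install can target the wrong interpreter. Installing with the webui's venv Python should work, e.g. (paths are illustrative):

# Linux, from the stable-diffusion-webui directory:
venv/bin/python -m pip install -U --pre xformers
# Windows:
venv\Scripts\python.exe -m pip install -U --pre xformers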

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
