
[Bug/Future] bitsandbytes higher than 0.35.0 breaks training on 8bit adamW (Windows) #523

Closed
Panchovix opened this issue May 20, 2023 · 15 comments

Panchovix commented May 20, 2023

Hi there! When updating bitsandbytes to any version higher than 0.35.0, all trainings get a loss value of NaN.

I've tested the cu116, cu117, cu118, and cu121 binaries (with torch+cu116, +cu117, +cu118, and +cu121, respectively), and the issue happens on all of them.

I know this is more of a future concern, but if the binaries ever have to be updated, they may hit this issue.

The bitsandbytes wheels were obtained from https://github.com/jllllll/bitsandbytes-windows-webui
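
When comparing behaviour across these builds, a quick way to confirm which bitsandbytes version is actually active in a given environment (a trivial sketch; it only assumes the package is installed) is:

```python
# Print the installed bitsandbytes version; useful when switching between
# wheels to confirm which build is actually being imported.
from importlib.metadata import version

print(version("bitsandbytes"))
```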

@enranime

This! I thought something else had broken, so it seems like this is the cause. Thanks!

sdbds (Contributor) commented Jul 28, 2023

Did you use fp16?

@Panchovix (Author)

> Did you use fp16?

I think I was using the fp16 setting in the accelerate config.

sdbds (Contributor) commented Jul 29, 2023

> I think I was using the fp16 setting in the accelerate config.

fp16 has some problems in bnb above 0.35.

@Panchovix (Author)

> fp16 has some problems in bnb above 0.35.

Do you think it would work correctly with bf16? I can try it in a few hours.

sdbds (Contributor) commented Jul 29, 2023

> Do you think it would work correctly with bf16? I can try it in a few hours.

Yeah, you can try bf16. I think the fp16 problem comes from the SDXL VAE. Did you train on SD1.5 or SDXL?
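
For reference, mixed precision can be switched from fp16 to bf16 either by re-running `accelerate config`, or directly in code. A minimal sketch (assuming a reasonably recent accelerate release and bf16-capable hardware; not a configuration taken from this thread):

```python
# Minimal sketch: selecting bf16 mixed precision with Hugging Face accelerate.
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")  # accepts "no", "fp16", "bf16"

# The training script's objects would then be wrapped as usual, e.g.:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```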

@Panchovix (Author)

> Yeah, you can try bf16. I think the fp16 problem comes from the SDXL VAE. Did you train on SD1.5 or SDXL?

The issue happened on both SD1.5 and SDXL (the training loss goes to NaN).

@Panchovix (Author)

Just tested with bf16 on SD1.5 and the loss still goes to NaN/1.0+.

[image: training loss screenshot]

bitsandbytes 0.41.0

sdbds (Contributor) commented Jul 31, 2023

> Just tested with bf16 on SD1.5 and the loss still goes to NaN/1.0+. (bitsandbytes 0.41.0)

bnb above 0.35 gives the LR more weight; you can use a lower LR with it, such as 2e-6 to 2e-5.
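
As an illustration of that suggestion, here is a minimal sketch of constructing the bitsandbytes 8-bit AdamW optimizer with a reduced learning rate (the model and exact values below are placeholders, not settings confirmed in this thread; an NVIDIA GPU is still required once the optimizer actually steps):

```python
# Minimal sketch: bitsandbytes 8-bit AdamW with a lower learning rate.
# The model and hyperparameters are placeholders for illustration only.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(128, 128)  # stand-in for the real UNet / text encoder

optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=2e-5,              # lower than the LRs that worked well on 0.35.0
    betas=(0.9, 0.999),
    weight_decay=1e-2,
)
```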

2kpr commented Jul 31, 2023

I've known about an issue with bitsandbytes since 0.35.0. At the time nobody knew what caused it; people just knew to keep bitsandbytes at 0.35.0. With the advent of SDXL I started training again and thought that issue might have been fixed, until I trained SD v1.5 and SDXL over the last two days and got bad results: apparent overtraining, blotchiness, NaNs, etc.

So I went back and tried to track down what might have changed in bitsandbytes from 0.35.0 onward, and it turns out the likely culprit is a mistakenly indented section of code in the 8-bit dynamic map code in bitsandbytes!

That also explains why, for months, people have been seeing this issue only when using the 8-bit optimizers such as AdamW8bit with bitsandbytes > 0.35.0, and not when using plain AdamW, for instance.

Here are some links related to the issue:
ShivamShrirao/diffusers#178
bitsandbytes-foundation/bitsandbytes#152

In the second link, ArrowM tracked the issue down to this commit of bitsandbytes:

> This is a total shot in the dark, but I wonder if bitsandbytes/functional.py:218-223 was accidentally indented in bitsandbytes-foundation/bitsandbytes@2f2063b

And RossM noticed it in the actual commit here: bitsandbytes-foundation/bitsandbytes@2f2063b#r109622091

> The indent here looks clearly wrong, this repeats the additional_items for each value of exponent bits while it should apply to only the last value.
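
To make the suspected bug concrete, here is a small, hypothetical sketch of the kind of mis-indentation described above (the names and numbers are invented; this is not the actual create_dynamic_map code). Moving one block inside the loop repeats the "additional items" for every exponent value instead of appending them once:

```python
# Hypothetical illustration of the suspected indentation bug (NOT the real
# bitsandbytes source): the "additional items" block should extend the
# quantization map once, but indented inside the loop it runs every iteration.

def per_exponent_values(exponent: int) -> list:
    """Stand-in for the per-exponent quantile values."""
    return [10.0 ** -exponent * f for f in (0.25, 0.5, 0.75)]

def build_map(buggy: bool) -> list:
    extra = [0.9, 0.95]              # "additional items", meant to be added once
    data = []
    for e in range(3):               # loop over exponent values
        data += per_exponent_values(e)
        if buggy:
            data += extra            # mis-indented: repeated on every iteration
    if not buggy:
        data += extra                # correct placement: appended only once
    return data

print(len(build_map(buggy=False)))   # 11 entries
print(len(build_map(buggy=True)))    # 15 entries, with the extras duplicated
```

If the real 8-bit dynamic map is distorted in a similar way, the optimizer state would be quantized against the wrong values, which would be consistent with the degraded training reported above.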

sdbds (Contributor) commented Jul 31, 2023

Thank you for reporting. So TimDettmers/bitsandbytes hasn't fixed it yet?

I found a PR fixing it, but it was closed: bitsandbytes-foundation/bitsandbytes#262
I don't know if it is a bug or a new feature...

2kpr commented Jul 31, 2023

> Thank you for reporting. So TimDettmers/bitsandbytes hasn't fixed it yet?
>
> I found a PR fixing it, but it was closed: bitsandbytes-foundation/bitsandbytes#262. I don't know if it is a bug or a new feature...

It has not been fixed, no. It's been in the code since 0.35.0, and like I mentioned, we knew back at the end of 2022 that 'something' had happened to bitsandbytes after 0.35.0; we just didn't know what.

It seems like it shouldn't have been indented and is in fact a bug, as RossM pointed out:

> The indent here looks clearly wrong, this repeats the additional_items for each value of exponent bits while it should apply to only the last value.

It was probably an accident. Tim Dettmers didn't close that PR or comment on it, so I'm not sure why ArrowM decided to close it without any comment either.

There are currently 28 open PRs on bitsandbytes, and 2 weeks ago 7 were merged, but the last time any were merged was May 7th, so it makes sense that Tim is very busy and hasn't had time to read through all of the issues (318) and PRs (28) to even see that someone mentioned the indentation issue present since 0.35.0.

@Panchovix (Author)

Thanks for the info! Sadly, I tried fixing the indentation but my loss still goes to 1.0/NaN.

[image: training loss screenshot]

So I'm not sure if there's something more going on besides this.

@sdbds I'm using:

learning_rate = 0.0005 (5e-4)
text_encoder_lr = 5e-5
unet_lr = 0.0002125 (2.125e-4)

on bitsandbytes 0.35.0, and as far as I know, at least for SD1.5 training, if you set unet_lr then learning_rate doesn't get used. These values give me good results for LoRA training at 1024x1024 on models based on SD1.5.

Are these values too high for bitsandbytes 0.36.0+?

victorchall commented Aug 31, 2023

I've tested BNB 38.1 vs 41.1 using AdamW8bit, and it seems repaired now in 41.1. Here are the results of fine-tuning Stable Diffusion 1.5 (unfreezing both the unet and the text encoder) on some characters from Final Fantasy.

[image: comparison grid of inference results across prompts and BNB versions]

The top row of labels is the prompt used for inference for that given column.

The top row of outputs is SD1.5.

The middle two rows are fine-tuned on a few thousand images using standard supervised fine-tuning (detailed human-written labels), with either BNB 38.1 or BNB 41.1 as marked.

The bottom row is just a reference for each character chosen from the training data since I'm sure plenty of people looking at this are not familiar with these specific subjects.

I've trained this data, or variations of it, with the same software probably dozens or hundreds of times using known working settings for learning rate, batch size, epochs, etc., as it is my personal "reference" set for debugging training issues.

My conclusions: most obviously, BNB 38.1 turned "morgan freeman" into a caucasian female with a sort of video-game-render aesthetic (an aesthetic which is present in all the training data). Emma Watson is also forgotten and turned into the "video game render" aesthetic. No data for Morgan Freeman or Emma Watson is present in the fine-tuning data, and I would typically not expect such a significant loss of prior knowledge. More generally, with BNB 41.1 the quality is higher and there is either no prior loss or significantly less.

BNB 41.1 AdamW8bit behaves as I would normally expect, and like the reference Torch AdamW.

@Panchovix (Author)

I finally managed to get some time to train, and I can confirm that the latest BNB fixes it. Thanks everyone, and @victorchall!
