Sana activations explode / clamping issue #10336
Comments
Update: switching to the […]
I went through the same rabbit hole, see #10241 for details.
Good job finding the exact spot.
This problem is due to the fact that we add value clamping during training with mixed precision (here), so the model never saw values outside the range (-65504, 65504). When you run FP32 or BF16 inference with the FP16-trained model, the self-attention output will not be clamped (refer to here), and that's why it won't give you the desired results. We provide the FP32 model only for reference, in case someone needs it for fine-tuning or something similar. If this is confusing, should we just remove the FP32 version of the safetensors in our FP16-trained models? Cc: @vladmandic @Nerogar
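A minimal sketch of the dtype-conditional clamp being described, assuming the general structure of the Sana attention processor (names and control flow are paraphrased, not the exact diffusers source):

```python
import torch

FP16_MAX = torch.finfo(torch.float16).max  # 65504.0

def finish_attention_output(hidden_states: torch.Tensor) -> torch.Tensor:
    # The clamp only runs on the fp16 path, so it acts as a saturation that
    # the FP16-trained model implicitly relied on during training.
    if hidden_states.dtype == torch.float16:
        hidden_states = hidden_states.clip(-FP16_MAX, FP16_MAX)
    # fp32 / bf16 inference skips the clamp and propagates values around 1e6.
    return hidden_states
```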
I don't think this is caused by the precision; at least, I don't have proof of it. If you have any insight, please let me know. I'm curious about it. @Nerogar
To be honest, I don't really see the point in having the fp16 weights at all. If I load […] To me it looks like those weights are just broken and there is no point in using them.
We set BF16 as the default checkpoint, and the original fp16 models will serve as a reference in case someone needs to compare.
Describe the bug
I'm using the pretrained weights from Efficient-Large-Model/Sana_1600M_1024px_diffusers. I don't know if this is an issue with these weights, or if the implementation is broken. Things I've observed so far:

The attention output here is very different between the fp16 and fp32 versions.

The hidden_states are in the +/-5*10^5 range here (sometimes even higher, I've seen values as high as 1.3*10^6). Using fp16 calculations, they become inf, which is clamped down to (-65504, 65504) (or about 6*10^4, more than an order of magnitude less). Using fp32 calculations, this clamping is not done, which means the output of that attention block is also different.

Enabling this clamping even for fp32 calculations fixes the issue, but this seems like a hack (see the sketch after the example list below). That clamping operation looks like a safeguard, not like an essential part of the attention calculations. Adding print(f"hidden_states: {hidden_states}") just before and after the clamping operation shows the issue pretty well; you can see it in the examples below.

Here are some examples (all using the same prompt/seed/cfg/sampler/etc.):
fp16 weights (with clamping)
fp32 weights (without clamping)
fp32 weights (with clamping)
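For reference, a small standalone snippet (illustrative magnitudes only, not values from an actual run) showing why the fp16 and fp32 paths end up more than an order of magnitude apart, and what clamping "even for fp32" amounts to:

```python
import torch

FP16_MAX = torch.finfo(torch.float16).max  # 65504.0

# Magnitudes similar to the reported hidden_states (illustrative only).
x = torch.tensor([5.0e5, -1.3e6, 123.0])

print(x.to(torch.float16))                             # [inf, -inf, 123.] before any clamp
print(x.to(torch.float16).clamp(-FP16_MAX, FP16_MAX))  # saturates to +/-65504
print(x.clamp(-FP16_MAX, FP16_MAX))                     # same clamp applied on the fp32 path
```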
(tagging @lawrence-cj as the original author)
Reproduction
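A minimal sketch of the kind of comparison described above, assuming the public diffusers SanaPipeline API (the model ID comes from the report; the prompt, seed, and dtype handling are illustrative, not the reporter's exact script):

```python
import torch
from diffusers import SanaPipeline

model_id = "Efficient-Large-Model/Sana_1600M_1024px_diffusers"
prompt = "a photo of an astronaut riding a horse"  # illustrative prompt

# fp16 compute: the attention output gets clamped to +/-65504.
pipe = SanaPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image_fp16 = pipe(prompt, generator=torch.Generator("cuda").manual_seed(0)).images[0]

# Same weights with fp32 compute: the clamp is skipped and the output degrades.
pipe = SanaPipeline.from_pretrained(model_id, torch_dtype=torch.float32).to("cuda")
image_fp32 = pipe(prompt, generator=torch.Generator("cuda").manual_seed(0)).images[0]
```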
Logs
No response
System Info
Who can help?
No response